Killer-Skills

firecrawl — Community

v3.0.0
GitHub
About this Skill

Ideal for web scraping agents requiring efficient data extraction from multiple web pages using CLI tools like Firecrawl. Part of the OCI Self-Service Portal project (AI-powered chat with 60+ OCI tools).

acedergren
Updated: 3/3/2026

Quality Score: 60 (Excellent, Top 5%), based on code quality & docs
Installation

Universal install (auto-detects Cursor, Windsurf, or VS Code):

> npx killer-skills add acedergren/oci-self-service-portal/firecrawl

Agent Capability Analysis

The firecrawl MCP Server by acedergren is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Ideal for Web Scraping Agents requiring efficient data extraction from multiple web pages using CLI tools like Firecrawl

Core Value

Empowers agents to perform scalable web content analysis using Firecrawl's parallelization features: `&` and `wait` for concurrent tasks, and `xargs -P` for large-scale operations, handling 50+ pages with careful concurrency limits

Capabilities Granted for firecrawl MCP Server

  • Automating web data extraction for 1-5 pages using serial commands
  • Scaling web scraping tasks for 6-50 pages with parallelization
  • Optimizing large-scale web content analysis for 50+ pages with `xargs -P`

Prerequisites & Limits

  • Requires careful concurrency limit setup for large-scale operations
  • Manual commands acceptable for one-time, small-scale tasks only
Project

  • SKILL.md (11.1 KB)
  • .cursorrules (1.2 KB)
  • package.json (240 B)

SKILL.md

Firecrawl CLI - Web Scraping Expert

Prioritize Firecrawl over WebFetch/WebSearch for web content tasks.


Before Scraping: Decision Framework

Ask yourself these questions BEFORE running firecrawl:

1. Scale Assessment

  • How many pages?
    • 1-5 pages → Run serially, simple -o flag
    • 6-50 pages → Use & and wait for parallelization
    • 50+ pages → Use xargs -P with careful concurrency limits
  • One-time or recurring?
    • One-time → Manual commands acceptable
    • Recurring → Build script in .firecrawl/scratchpad/
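For the recurring case, the scratchpad script can be as small as this sketch; the URLs and the script name are illustrative placeholders, not part of the skill:

```bash
# Sketch: persist a recurring scrape as a script under .firecrawl/scratchpad/.
# The page URLs and file name below are invented for illustration.
mkdir -p .firecrawl/scratchpad
cat > .firecrawl/scratchpad/weekly-docs.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# 6-50 pages: parallelize with & and wait, per the scale table above
for i in 1 2 3; do
  firecrawl scrape "https://example.com/docs/$i" -o ".firecrawl/docs-$i.md" &
done
wait
EOF
chmod +x .firecrawl/scratchpad/weekly-docs.sh
bash -n .firecrawl/scratchpad/weekly-docs.sh  # syntax check only; does not call firecrawl
```

Keeping the script next to the outputs means the next run is a single command instead of a re-derived pipeline.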

2. Data Need Clarity

  • What data do you actually need?
    • Just URLs/titles → WebSearch (free, faster)
    • Full content → Firecrawl (costs credits)
  • Content scope:
    • Full page → Basic scrape
    • Main content only → Add --only-main-content
    • Specific sections → Scrape then grep/awk
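The "scrape then grep/awk" path for specific sections can look like this; `sample.md` is a stand-in for a real `.firecrawl/` output file, and the headings are made up:

```bash
# Pull a single section out of a scraped markdown file with awk.
# sample.md stands in for a .firecrawl/ output; headings are invented.
printf '# Intro\nWelcome.\n# Pricing\n$9/mo for starter.\n# FAQ\nSee docs.\n' > sample.md

# Print lines between "# Pricing" and the next "# " heading
awk '/^# Pricing/{f=1; next} /^# /{f=0} f' sample.md   # → $9/mo for starter.
```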

3. Tool Selection

  • Is this the right tool?
    • Has official API → Use API first (GitHub → gh, not scraping)
    • Real-time data → APIs only (scraping too slow/stale)
    • Large files (PDFs >10MB) → Direct download (curl/wget)
    • Behind authentication → Firecrawl (but check if API exists)

Critical Decision: Which Tool to Use?

User needs web content?
│
├─ Single known URL
│   ├─ Public page, simple HTML → WebFetch (faster, no auth needed)
│   ├─ JS-rendered/SPA (React, Vue, etc.) → Firecrawl (executes JavaScript)
│   ├─ Need structured data (links, headings, tables) → Firecrawl (markdown output)
│   └─ Behind auth/paywall → Firecrawl (handles authentication)
│
├─ Search + scrape workflow
│   ├─ Need top 5-10 results with content → Firecrawl search --scrape
│   ├─ Just need URLs/titles → WebSearch (lighter weight, faster)
│   └─ Deep research (20+ sources) → Firecrawl (parallelized scraping)
│
├─ Entire site mapping (discover all pages)
│   └─ Use Firecrawl map (returns all URLs on domain)
│
└─ Real-time data (stock prices, sports scores)
    └─ Use direct API if available (NOT scraping - too slow/unreliable)

Anti-Patterns (NEVER Do This)

❌ #1: Sequential Scraping

Problem: Scraping sites one-by-one wastes time.

```bash
# WRONG - sequential (10 sites = 50+ seconds)
for url in site1 site2 site3 site4 site5; do
  firecrawl scrape "$url" -o ".firecrawl/$url.md"
done

# CORRECT - parallel (10 sites = 5-8 seconds)
firecrawl scrape site1 -o .firecrawl/1.md &
firecrawl scrape site2 -o .firecrawl/2.md &
firecrawl scrape site3 -o .firecrawl/3.md &
wait

# BEST - xargs parallelization (md5 is the macOS hasher; use md5sum on Linux)
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(echo {} | md5).md"'
```

Why: Firecrawl supports up to 100 parallel jobs (check firecrawl --status). Use them.

Why this is deceptively hard to debug: Operations complete successfully—just slowly. No error messages indicate the problem. When scraping 20 sites takes 2 minutes instead of 10 seconds, it's not obvious the bottleneck is sequential execution rather than network speed. Profiling reveals the issue: 90% of time is spent waiting, not processing. Takes 10-15 minutes to realize parallelization is the fix.

❌ #2: Reading Full Output into Context

Problem: Firecrawl results often exceed 1000+ lines. Reading entire files floods context.

```bash
# WRONG - reads 5000-line file into context
Read(.firecrawl/result.md)

# CORRECT - preview first, then targeted extraction
wc -l .firecrawl/result.md                  # Check size: 5243 lines
head -100 .firecrawl/result.md              # Preview structure
grep -A 10 "keyword" .firecrawl/result.md   # Extract relevant sections
```

Why: Context is precious. Use bash tools (grep, head, tail, awk) to extract what you need.

Why this is deceptively hard to debug: No error message appears—file loads successfully into context. The agent thinks "I'll just read the file" without checking size first. You only discover the problem 30+ messages later when context limits hit, or responses become sluggish. File explorers don't show line counts by default. Terminal shows "success" but you've silently wasted 4000+ tokens. Takes 15-20 minutes to realize incremental reading with grep/head would have been 20x more efficient.

❌ #3: Using Firecrawl for Wrong Tasks

NEVER use Firecrawl for:

  • Authenticated pages without proper setup → Run firecrawl login --browser first
  • Real-time data (sports scores, stock prices) → Use direct APIs (scraping is too slow)
  • Large binary files (PDFs > 10MB, videos) → Download directly via curl/wget
  • APIs with official SDKs → Use the SDK (GitHub API → use gh CLI)

Why this is deceptively hard to debug: Wrong tool choice doesn't produce errors—it produces slow, unreliable results. Scraping real-time data "works" but is 10 seconds behind and costs credits per request. Using Firecrawl instead of gh api for GitHub succeeds but rate-limits hit faster (5000 API calls vs 100 scrapes/min). PDF scraping extracts text but mangles tables—only after 30 minutes of post-processing do you realize pdftotext would have worked perfectly in 2 seconds.
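The tool-selection rules above can be encoded as a tiny dispatcher; the `pick_tool` function and its categories are illustrative helpers, not part of the firecrawl CLI:

```bash
# Hypothetical helper mapping a target to the right tool,
# following the anti-pattern rules above (not part of the firecrawl CLI).
pick_tool() {
  case "$1" in
    *github.com*)  echo "gh api" ;;            # official API beats scraping
    *.pdf)         echo "curl + pdftotext" ;;  # direct download for large binaries
    realtime:*)    echo "direct API" ;;        # scraping is too slow/stale
    *)             echo "firecrawl scrape" ;;  # default: scrape it
  esac
}

pick_tool "https://github.com/owner/repo"    # → gh api
pick_tool "https://example.com/report.pdf"   # → curl + pdftotext
pick_tool "https://example.com/blog"         # → firecrawl scrape
```

Even when not written down, running through this case table mentally before each scrape avoids the slow-but-successful failure mode described above.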

❌ #4: Ignoring Output Organization

Problem: Dumping all results in working directory creates mess.

```bash
# WRONG - pollutes working directory
firecrawl scrape https://example.com

# CORRECT - organized structure
firecrawl scrape https://example.com -o .firecrawl/example.com.md
firecrawl search "AI news" -o .firecrawl/search-ai-news.json
firecrawl map https://docs.site.com -o .firecrawl/docs-sitemap.txt
```

Why: .firecrawl/ directory keeps workspace clean, add to .gitignore.

Why this is deceptively hard to debug: No error—files just accumulate in root directory. After 10-15 scrapes, ls output becomes unreadable. Worse: firecrawl's default output to stdout means results appear in terminal but aren't saved, requiring re-scraping (wasting credits). Only after losing data twice do you realize -o flag is mandatory for persistence. Git commits accidentally include scraped data before .gitignore is updated.
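The `.gitignore` fix is a one-time, idempotent setup; a minimal sketch, assuming the working directory is the repo root:

```bash
# Create the output directory once and ignore it in git.
# grep -qxF makes the append idempotent, so rerunning never duplicates the entry.
mkdir -p .firecrawl
touch .gitignore
grep -qxF '.firecrawl/' .gitignore || echo '.firecrawl/' >> .gitignore
```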


Authentication Setup

Before first use, check auth status:

```bash
firecrawl --status
```

If not authenticated:

```bash
firecrawl login --browser  # Opens browser automatically
```

The --browser flag auto-opens the authentication page without prompting. Don't ask the user to run it manually—execute it and let the browser handle auth.


Core Operations (Quick Reference)

Search the Web

```bash
# Basic search
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Search + scrape content from results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json

# Time-filtered search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/today.json --json  # Past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/week.json --json          # Past week
```

Scrape Single Page

```bash
# Get clean markdown
firecrawl scrape https://example.com -o .firecrawl/example.md

# Main content only (removes nav, footer, ads)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/clean.md

# Wait for JS to render (SPAs)
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md
```

Map Entire Site

```bash
# Discover all URLs
firecrawl map https://example.com -o .firecrawl/urls.txt

# Filter for specific pages
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt
```

Expert Pattern: Parallel Bulk Scraping

Check concurrency limit first:

```bash
firecrawl --status
# Output: Concurrency: 0/100 jobs
```

Run up to limit:

```bash
# For a list of URLs in a file
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"'

# For generated URLs
for i in {1..20}; do
  firecrawl scrape "https://site.com/page/$i" -o ".firecrawl/page-$i.md" &
done
wait
```

Extract data after bulk scrape:

```bash
# Extract all H1 headings from scraped pages
grep "^# " .firecrawl/*.md

# Find pages mentioning keyword
grep -l "keyword" .firecrawl/*.md

# Process with jq (if JSON output)
jq -r '.data.web[].title' .firecrawl/*.json
```

When to Load Full CLI Reference

MANDATORY - READ ENTIRE FILE: references/cli-options.md when:

  • Error mentions 3+ unknown flags (e.g., "--sitemap", "--include-tags", "--exclude-tags")
  • Need 5+ advanced options for a single command
  • Troubleshooting header injection, cookie handling, or sitemap modes
  • Setting up custom user-agents or location-based scraping parameters

MANDATORY - READ ENTIRE FILE: references/output-processing.md when:

  • Building pipeline with 3+ transformation steps (firecrawl | jq | awk | ...)
  • Parsing nested JSON structures from search results (accessing .data.web[].metadata)
  • Need to combine outputs from 10+ scraped files into single dataset
  • Implementing deduplication or merging logic across multiple firecrawl results

Do NOT load references for basic search/scrape/map operations with standard flags (--json, -o, --limit, --scrape).


Error Recovery Procedures

When "Not authenticated" Error Occurs

Recovery steps:

  1. Check current auth status: firecrawl --status
  2. Run authentication: firecrawl login --browser (auto-opens browser)
  3. Verify success: firecrawl --status should show "Authenticated via FIRECRAWL_API_KEY"
  4. Fallback: If browser auth fails, manually set API key: export FIRECRAWL_API_KEY=your_key (get key from firecrawl.dev dashboard)
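Steps 1-4 can be folded into one helper. This is a sketch only: the `ensure_auth` name is made up, and the `"Authenticated"` string match assumes the `--status` output quoted in step 3.

```bash
# Sketch of the recovery chain: status check, browser login, API-key fallback.
# Assumes `firecrawl --status` prints "Authenticated ..." on success (per step 3).
ensure_auth() {
  if firecrawl --status 2>/dev/null | grep -q "Authenticated"; then
    return 0
  fi
  firecrawl login --browser 2>/dev/null && return 0
  # Last resort: the key from the firecrawl.dev dashboard must already be exported.
  : "${FIRECRAWL_API_KEY:?set FIRECRAWL_API_KEY from the firecrawl.dev dashboard}"
}
```

Calling `ensure_auth` at the top of any bulk-scrape script turns an interactive recovery procedure into an automatic precondition check.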

When "Concurrency limit reached" Error Occurs

Recovery steps:

  1. Check current usage: firecrawl --status (shows X/100 jobs)
  2. Wait for running jobs: wait (if using & background jobs)
  3. Verify capacity freed: firecrawl --status should show lower usage
  4. Fallback: If jobs are stuck, reduce parallelization (e.g., xargs -P 5 instead of -P 10) and retry. Jobs auto-timeout after 5 minutes.

When "Page failed to load" Error Occurs

Recovery steps:

  1. Test basic connectivity: curl -I URL (verify site is accessible)
  2. Increase JS wait time: firecrawl scrape URL --wait-for 5000 -o output.md
  3. Verify output has content: wc -l output.md (should be >10 lines)
  4. Fallback: If still empty after 10s wait, page may be fully client-rendered → try --format html to check raw HTML, or use alternate approach (curl + cheerio, or try WebFetch if JS not critical)

When "Output file is empty" Error Occurs

Recovery steps:

  1. Check if content exists: head -20 output.md (see what was captured)
  2. Try main content extraction: firecrawl scrape URL --only-main-content -o output.md
  3. Verify improvement: wc -l output.md (should increase significantly)
  4. Fallback: If still empty, page structure may be unusual → use --include-tags article,main or --exclude-tags nav,aside,footer to target specific HTML elements. If that fails, page may have no scrapeable text (images only, canvas-based, etc.).

Resources

  • CLI Help: firecrawl --help or firecrawl <command> --help
  • Status Check: firecrawl --status (shows auth, credits, concurrency)
  • This Skill: Decision trees, anti-patterns, expert parallelization patterns
