# Firecrawl CLI - Web Scraping Expert
Prioritize Firecrawl over WebFetch/WebSearch for web content tasks.
## Before Scraping: Decision Framework

Ask yourself these questions BEFORE running `firecrawl`:
**1. Scale Assessment**
- How many pages? (see the strategy sketch after this framework)
  - 1-5 pages → Run serially, simple `-o` flag
  - 6-50 pages → Use `&` and `wait` for parallelization
  - 50+ pages → Use `xargs -P` with careful concurrency limits
- One-time or recurring?
  - One-time → Manual commands acceptable
  - Recurring → Build a script in `.firecrawl/scratchpad/`
**2. Data Need Clarity**
- What data do you actually need?
  - Just URLs/titles → WebSearch (free, faster)
  - Full content → Firecrawl (costs credits)
- Content scope:
  - Full page → Basic scrape
  - Main content only → Add `--only-main-content`
  - Specific sections → Scrape, then grep/awk
**3. Tool Selection**
- Is this the right tool?
  - Has official API → Use the API first (GitHub → `gh`, not scraping)
  - Real-time data → APIs only (scraping too slow/stale)
  - Large files (PDFs >10MB) → Direct download (curl/wget)
  - Behind authentication → Firecrawl (but check if an API exists)
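The scale tiers above map directly onto three bash patterns. A minimal sketch, assuming the URLs have been collected into a `urls.txt` file (hypothetical) and that your plan allows at least 10 concurrent jobs:

```bash
# Pick a scraping strategy based on how many URLs are in urls.txt (hypothetical file)
count=$(wc -l < urls.txt)

if [ "$count" -le 5 ]; then
  # Small batch: run serially, one output file per URL
  while read -r url; do
    firecrawl scrape "$url" -o ".firecrawl/$(basename "$url").md"
  done < urls.txt
elif [ "$count" -le 50 ]; then
  # Medium batch: background jobs plus wait
  while read -r url; do
    firecrawl scrape "$url" -o ".firecrawl/$(basename "$url").md" &
  done < urls.txt
  wait
else
  # Large batch: xargs with a conservative concurrency limit
  xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"' < urls.txt
fi
```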
## Critical Decision: Which Tool to Use?
```
User needs web content?
│
├─ Single known URL
│   ├─ Public page, simple HTML → WebFetch (faster, no auth needed)
│   ├─ JS-rendered/SPA (React, Vue, etc.) → Firecrawl (executes JavaScript)
│   ├─ Need structured data (links, headings, tables) → Firecrawl (markdown output)
│   └─ Behind auth/paywall → Firecrawl (handles authentication)
│
├─ Search + scrape workflow
│   ├─ Need top 5-10 results with content → Firecrawl search --scrape
│   ├─ Just need URLs/titles → WebSearch (lighter weight, faster)
│   └─ Deep research (20+ sources) → Firecrawl (parallelized scraping)
│
├─ Entire site mapping (discover all pages)
│   └─ Use Firecrawl map (returns all URLs on domain)
│
└─ Real-time data (stock prices, sports scores)
    └─ Use direct API if available (NOT scraping - too slow/unreliable)
```
## Anti-Patterns (NEVER Do This)

### ❌ #1: Sequential Scraping
Problem: Scraping sites one-by-one wastes time.
```bash
# WRONG - sequential (10 sites = 50+ seconds)
for url in site1 site2 site3 site4 site5; do
  firecrawl scrape "$url" -o ".firecrawl/$url.md"
done

# CORRECT - parallel (10 sites = 5-8 seconds)
firecrawl scrape site1 -o .firecrawl/1.md &
firecrawl scrape site2 -o .firecrawl/2.md &
firecrawl scrape site3 -o .firecrawl/3.md &
wait

# BEST - xargs parallelization
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(echo {} | md5).md"'
```
Why: Firecrawl supports up to 100 parallel jobs (check `firecrawl --status`). Use them.
Why this is deceptively hard to debug: Operations complete successfully—just slowly. No error messages indicate the problem. When scraping 20 sites takes 2 minutes instead of 10 seconds, it's not obvious the bottleneck is sequential execution rather than network speed. Profiling reveals the issue: 90% of time is spent waiting, not processing. Takes 10-15 minutes to realize parallelization is the fix.
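If a batch feels slow, a quick wall-clock comparison exposes the bottleneck; a rough sketch with placeholder URLs:

```bash
# Sequential: total time is roughly the sum of the individual scrapes
time for u in https://example.com/a https://example.com/b https://example.com/c; do
  firecrawl scrape "$u" -o ".firecrawl/$(basename "$u").md"
done

# Parallel: total time is roughly the slowest single scrape
time {
  for u in https://example.com/a https://example.com/b https://example.com/c; do
    firecrawl scrape "$u" -o ".firecrawl/$(basename "$u").md" &
  done
  wait
}
```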
### ❌ #2: Reading Full Output into Context

Problem: Firecrawl results often exceed 1,000 lines. Reading entire files floods context.
```bash
# WRONG - reads 5000-line file into context
Read(.firecrawl/result.md)

# CORRECT - preview first, then targeted extraction
wc -l .firecrawl/result.md                  # Check size: 5243 lines
head -100 .firecrawl/result.md              # Preview structure
grep -A 10 "keyword" .firecrawl/result.md   # Extract relevant sections
```
Why: Context is precious. Use bash tools (grep, head, tail, awk) to extract what you need.
Why this is deceptively hard to debug: No error message appears—file loads successfully into context. The agent thinks "I'll just read the file" without checking size first. You only discover the problem 30+ messages later when context limits hit, or responses become sluggish. File explorers don't show line counts by default. Terminal shows "success" but you've silently wasted 4000+ tokens. Takes 15-20 minutes to realize incremental reading with grep/head would have been 20x more efficient.
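For targeted extraction, awk can pull a single section out of a large scrape; a sketch assuming the scraped markdown uses `##` headings and that a heading named "Installation" exists (hypothetical):

```bash
# Print only the "## Installation" section, stopping at the next "## " heading
awk '/^## Installation/{found=1} found && /^## / && !/^## Installation/{exit} found' .firecrawl/result.md

# Or check how often a keyword appears before deciding to read anything
grep -c "keyword" .firecrawl/result.md
```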
### ❌ #3: Using Firecrawl for Wrong Tasks
NEVER use Firecrawl for:
- Authenticated pages without proper setup → Run `firecrawl login --browser` first
- Real-time data (sports scores, stock prices) → Use direct APIs (scraping is too slow)
- Large binary files (PDFs >10MB, videos) → Download directly via curl/wget
- APIs with official SDKs → Use the SDK (GitHub API → use the `gh` CLI)
Why this is deceptively hard to debug: Wrong tool choice doesn't produce errors—it produces slow, unreliable results. Scraping real-time data "works" but is 10 seconds behind and costs credits per request. Using Firecrawl instead of `gh api` for GitHub succeeds, but you hit rate limits faster (5,000 API calls vs. 100 scrapes/min). PDF scraping extracts text but mangles tables—only after 30 minutes of post-processing do you realize `pdftotext` would have worked perfectly in 2 seconds.
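For the PDF case, downloading directly and extracting text locally is usually the faster path; a sketch assuming `pdftotext` (poppler-utils) is installed and using a placeholder URL:

```bash
# Download the PDF directly instead of scraping it
curl -L -o .firecrawl/report.pdf "https://example.com/report.pdf"

# Extract plain text locally (preserves column layout reasonably well)
pdftotext -layout .firecrawl/report.pdf .firecrawl/report.txt
head -50 .firecrawl/report.txt
```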
### ❌ #4: Ignoring Output Organization

Problem: Dumping all results in the working directory creates a mess.
```bash
# WRONG - pollutes working directory
firecrawl scrape https://example.com

# CORRECT - organized structure
firecrawl scrape https://example.com -o .firecrawl/example.com.md
firecrawl search "AI news" -o .firecrawl/search-ai-news.json
firecrawl map https://docs.site.com -o .firecrawl/docs-sitemap.txt
```
Why: The `.firecrawl/` directory keeps the workspace clean; add it to `.gitignore`.
Why this is deceptively hard to debug: No error—files just accumulate in the root directory. After 10-15 scrapes, `ls` output becomes unreadable. Worse: firecrawl's default output to stdout means results appear in the terminal but aren't saved, requiring re-scraping (wasting credits). Only after losing data twice do you realize the `-o` flag is mandatory for persistence. Git commits accidentally include scraped data before `.gitignore` is updated.
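A small one-time setup keeps scraped output out of version control; a sketch (adjust to your repo layout):

```bash
# Create the output directory and ignore it in git (idempotent)
mkdir -p .firecrawl
grep -qxF '.firecrawl/' .gitignore 2>/dev/null || echo '.firecrawl/' >> .gitignore
```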
## Authentication Setup
Before first use, check auth status:
```bash
firecrawl --status
```
If not authenticated:
```bash
firecrawl login --browser   # Opens browser automatically
```
The `--browser` flag auto-opens the authentication page without prompting. Don't ask the user to run it manually; execute the command and let the browser handle auth.
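In scripts, the status check can gate the login so the browser flow only triggers when needed; a sketch that assumes `firecrawl --status` exits non-zero when unauthenticated (worth verifying on your version):

```bash
# Authenticate only if the status check fails (exit-code behavior is an assumption)
firecrawl --status || firecrawl login --browser
```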
## Core Operations (Quick Reference)

### Search the Web
```bash
# Basic search
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Search + scrape content from results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json

# Time-filtered search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/today.json --json   # Past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/week.json --json           # Past week
```
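The `--json` output can be post-processed with jq; a sketch assuming the `.data.web[]` structure referenced later in this skill, with `title` and `url` fields per result (an assumption worth checking against your output):

```bash
# List result titles and URLs from a saved search (field names are assumed)
jq -r '.data.web[] | "\(.title)\t\(.url)"' .firecrawl/search-query.json

# Feed just the URLs into a follow-up bulk scrape
jq -r '.data.web[].url' .firecrawl/search-query.json > .firecrawl/search-urls.txt
```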
### Scrape Single Page
```bash
# Get clean markdown
firecrawl scrape https://example.com -o .firecrawl/example.md

# Main content only (removes nav, footer, ads)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/clean.md

# Wait for JS to render (SPAs)
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md
```
### Map Entire Site
```bash
# Discover all URLs
firecrawl map https://example.com -o .firecrawl/urls.txt

# Filter for specific pages
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt
```
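map pairs naturally with the bulk-scrape pattern below: discover URLs first, then fan the scrapes out in parallel; a sketch with an illustrative concurrency limit:

```bash
# Discover blog URLs, then scrape them in parallel
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt
xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"' < .firecrawl/blog-urls.txt
```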
## Expert Pattern: Parallel Bulk Scraping
Check concurrency limit first:
```bash
firecrawl --status
# Output: Concurrency: 0/100 jobs
```
Run up to the limit:
```bash
# For list of URLs in file
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"'

# For generated URLs
for i in {1..20}; do
  firecrawl scrape "https://site.com/page/$i" -o ".firecrawl/page-$i.md" &
done
wait
```
Extract data after bulk scrape:
```bash
# Extract all H1 headings from scraped pages
grep "^# " .firecrawl/*.md

# Find pages mentioning keyword
grep -l "keyword" .firecrawl/*.md

# Process with jq (if JSON output)
jq -r '.data.web[].title' .firecrawl/*.json
```
## When to Load Full CLI Reference

**MANDATORY - READ ENTIRE FILE**: `references/cli-options.md` when:
- Error mentions 3+ unknown flags (e.g., "--sitemap", "--include-tags", "--exclude-tags")
- Need 5+ advanced options for a single command
- Troubleshooting header injection, cookie handling, or sitemap modes
- Setting up custom user-agents or location-based scraping parameters
**MANDATORY - READ ENTIRE FILE**: `references/output-processing.md` when:
- Building pipeline with 3+ transformation steps (firecrawl | jq | awk | ...)
- Parsing nested JSON structures from search results (accessing .data.web[].metadata)
- Need to combine outputs from 10+ scraped files into single dataset
- Implementing deduplication or merging logic across multiple firecrawl results
Do NOT load references for basic search/scrape/map operations with standard flags (`--json`, `-o`, `--limit`, `--scrape`).
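For simple cases that don't justify loading the reference, a merge-and-dedupe pass can be done inline; a sketch assuming the `.data.web[]` search structure and deduplication by a `url` field (assumed):

```bash
# Merge all saved search results into one dataset, dropping duplicate URLs
jq -s '[.[].data.web[]] | unique_by(.url)' .firecrawl/search-*.json > .firecrawl/combined.json
jq 'length' .firecrawl/combined.json   # how many unique results survived
```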
## Error Recovery Procedures

### When "Not authenticated" Error Occurs
Recovery steps:
- Check current auth status: `firecrawl --status`
- Run authentication: `firecrawl login --browser` (auto-opens browser)
- Verify success: `firecrawl --status` should show "Authenticated via FIRECRAWL_API_KEY"
- Fallback: If browser auth fails, manually set the API key: `export FIRECRAWL_API_KEY=your_key` (get the key from the firecrawl.dev dashboard)
When "Concurrency limit reached" Error Occurs
Recovery steps:
- Check current usage: `firecrawl --status` (shows X/100 jobs)
- Wait for running jobs: `wait` (if using `&` background jobs)
- Verify capacity freed: `firecrawl --status` should show lower usage
- Fallback: If jobs are stuck, reduce parallelization (e.g., `xargs -P 5` instead of `-P 10`) and retry. Jobs auto-timeout after 5 minutes.
When "Page failed to load" Error Occurs
Recovery steps:
- Test basic connectivity: `curl -I URL` (verify the site is accessible)
- Increase JS wait time: `firecrawl scrape URL --wait-for 5000 -o output.md`
- Verify the output has content: `wc -l output.md` (should be >10 lines)
- Fallback: If still empty after a 10s wait, the page may be fully client-rendered → try `--format html` to check the raw HTML, or use an alternate approach (curl + cheerio, or WebFetch if JS is not critical)
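A retry loop that steps up the JS wait time is often enough before switching tools; a sketch with an illustrative URL and thresholds:

```bash
# Retry a stubborn page with progressively longer --wait-for values
for ms in 2000 5000 10000; do
  firecrawl scrape "https://spa-app.com" --wait-for "$ms" -o .firecrawl/spa.md
  if [ "$(wc -l < .firecrawl/spa.md)" -gt 10 ]; then
    echo "Got content with --wait-for $ms"
    break
  fi
done
```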
When "Output file is empty" Error Occurs
Recovery steps:
- Check if content exists: `head -20 output.md` (see what was captured)
- Try main content extraction: `firecrawl scrape URL --only-main-content -o output.md`
- Verify improvement: `wc -l output.md` (should increase significantly)
- Fallback: If still empty, the page structure may be unusual → use `--include-tags article,main` or `--exclude-tags nav,aside,footer` to target specific HTML elements. If that fails, the page may have no scrapeable text (images only, canvas-based, etc.).
## Resources
- CLI Help: `firecrawl --help` or `firecrawl <command> --help`
- Status Check: `firecrawl --status` (shows auth, credits, concurrency)
- This Skill: Decision trees, anti-patterns, expert parallelization patterns