To use Deep Scraper in OpenClaw, run `clawhub install deep-scraper` — no external API key required. Deep Scraper recursively follows links from a starting URL, scraping nested pages down to a configurable depth to build comprehensive datasets. It is the right tool when you need to extract data from an entire section of a website, not just a single page, without setting up a Firecrawl or similar API account.
What Makes Deep Scraper Different From Other ClawHub Extraction Skills
Most web scraping tools in ClawHub require an external API key and a third-party service account. Deep Scraper is the exception: it runs directly inside OpenClaw using its built-in HTTP capabilities. There is no Firecrawl account to set up, no Tavily key to configure, no monthly plan to evaluate. You install it and use it immediately — which makes it the fastest path to multi-page extraction for new OpenClaw users.
The defining characteristic of Deep Scraper is its recursive link-following behavior. Given a starting URL, it does not just scrape that page — it identifies all links on the page, follows them, scrapes those pages, and continues recursively to the configured depth. At depth 1, you get the starting page plus all directly linked pages. At depth 2, you get those pages plus all the pages they link to. This recursive behavior makes Deep Scraper the natural choice for extracting entire sections of a site: all articles under /blog, all products under /catalog, all documentation pages under /docs/getting-started.
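To see why depth matters so much, consider a rough ceiling on how many pages each level can add. The Python sketch below assumes, purely for illustration, an average of 20 internal links per page; real sites vary widely:

```
# Rough ceiling on how many pages a recursive crawl can reach at each depth,
# assuming (purely for illustration) that every page exposes ~20 internal links.
links_per_page = 20

for depth in range(4):
    # depth 0 = the starting page only; each level multiplies the frontier
    ceiling = sum(links_per_page ** level for level in range(depth + 1))
    print(f"depth {depth}: up to {ceiling} pages")

# depth 0: up to 1 pages
# depth 1: up to 21 pages
# depth 2: up to 421 pages
# depth 3: up to 8421 pages
```

Even at depth 2 the ceiling is already in the hundreds, which is why the page limits discussed later matter.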
The comparison with firecrawl-skills is important for choosing the right tool. Firecrawl Skills offers more sophisticated extraction features — AI-powered structured data extraction, site mapping, and finer control over what gets extracted from each page. Deep Scraper is more straightforward: it follows links and returns page content. For complex data extraction pipelines where you need structured fields pulled from unstructured pages, firecrawl-skills is worth the additional setup. For comprehensive content dumps from a known site section, Deep Scraper's no-configuration approach is the faster and simpler choice.
Integration method
Deep Scraper is a self-contained ClawHub skill that runs inside OpenClaw without relying on an external third-party API. Once installed via ClawHub, it uses OpenClaw's built-in HTTP capabilities to follow links from a starting URL, recursively scraping linked pages to the configured depth. No external account, no API key, and no paid service tier are required for basic use. This makes it the lowest-friction deep extraction tool in ClawHub — install it and start scraping immediately.
Prerequisites
- An OpenClaw account with ClawHub access
- ClawHub CLI installed and working — verify with `clawhub --version`
- No external API account required — Deep Scraper works immediately after install
- Basic familiarity with OpenClaw chat prompts
- A target website you have permission to scrape (always check terms of service and robots.txt)
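If you prefer to check that last prerequisite programmatically rather than by eye, Python's standard library ships a robots.txt parser. A minimal sketch; the wildcard user agent is a stand-in, so substitute whatever identifier your crawler actually sends:

```
from urllib import robotparser

# Ask a site's robots.txt whether a given path may be fetched.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# "*" matches any user agent; substitute the identifier your crawler sends.
print(rp.can_fetch("*", "https://example.com/blog/"))
```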
Step-by-step guide
Install the deep-scraper ClawHub Skill
Installing Deep Scraper is the simplest ClawHub skill installation in the Search & Scraping category because there is no API key to configure afterward. One command and it is ready to use. Open your terminal and run:

```
clawhub install deep-scraper
```

The ClawHub registry downloads the skill package and registers it with your OpenClaw installation. The process typically takes 15-30 seconds. Verify the installation:

```
clawhub list | grep deep-scraper
```

You should see `deep-scraper` with a version number and 'active' status. Because Deep Scraper does not require an external API key, you can start using it immediately after installation — there is no Step 2 for credentials. This is what makes it the fastest extraction skill to get started with in ClawHub.

One note on what 'no API key required' means in practice: Deep Scraper uses OpenClaw's HTTP client directly, which means it makes requests from your OpenClaw instance's IP address. If you are running OpenClaw on a server or cloud instance, the scraping requests originate from that server. If you are running OpenClaw locally, they originate from your local machine's IP. This is different from Firecrawl or Tavily, which route requests through their own infrastructure.
```
# Install Deep Scraper
clawhub install deep-scraper

# Verify installation
clawhub list | grep deep-scraper

# View skill details and available options
clawhub info deep-scraper
```

Pro tip: Run `clawhub info deep-scraper` after installation to see the full list of options and default settings — in particular the default crawl depth, max pages, and timeout values before you run your first large scrape.
Expected result: deep-scraper appears in `clawhub list` as active, and `clawhub info deep-scraper` shows skill details. No additional configuration needed.
Configure Optional Settings (Depth, Page Limit, Timeout)
While Deep Scraper requires no API key, it does have configurable settings that control how it behaves during recursive crawls. Understanding and setting these before your first crawl prevents runaway scraping jobs that consume too many resources or time. The three most important settings are:

**default_depth** — how many link levels to follow from the starting URL. Default is typically 1 (start page + directly linked pages). Setting this to 2 or 3 dramatically increases the number of pages scraped.

**max_pages** — a hard cap on the total number of pages scraped in one job. This is your safety net against accidentally crawling an enormous site. Set it to a sensible limit for your use case (e.g., 50 for exploration, 200 for larger extraction jobs).

**request_timeout_ms** — how long to wait for each page to respond before skipping it. For slow or international sites, increasing this prevents incomplete extractions.

You can set these globally in OpenClaw config, or override them inline in your prompts ('scrape 3 levels deep with a maximum of 100 pages').

```
openclaw config set skills.deep-scraper.default_depth 2
openclaw config set skills.deep-scraper.max_pages 100
openclaw config set skills.deep-scraper.request_timeout_ms 10000
```

For your first few scrapes, keep max_pages at 20-30 to get a feel for how much content deep scraping generates before scaling up.
```
# Set default crawl depth (2 = start page + 2 levels of linked pages)
openclaw config set skills.deep-scraper.default_depth 2

# Set max pages per job (safety cap)
openclaw config set skills.deep-scraper.max_pages 100

# Set page request timeout in milliseconds
openclaw config set skills.deep-scraper.request_timeout_ms 10000

# Add delay between requests to be a polite scraper
openclaw config set skills.deep-scraper.request_delay_ms 500
```

Pro tip: Adding a `request_delay_ms` of 500-1000ms between requests is considered good scraping etiquette — it reduces load on the target server and significantly lowers the chance of your IP being rate-limited or blocked.
Expected result: OpenClaw config reflects your updated deep-scraper settings, which will apply to all future deep scrape jobs unless overridden in individual prompts.
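These settings also bound how long a single job can run. A pessimistic ceiling, computed in Python from the values set above (every page hitting the full timeout is unlikely, but it gives a useful upper bound):

```
# Pessimistic ceiling on job duration: every page hits the full timeout
# and then waits out the politeness delay. Values match the commands above.
max_pages = 100
request_timeout_ms = 10_000
request_delay_ms = 500

worst_case_min = max_pages * (request_timeout_ms + request_delay_ms) / 1000 / 60
print(f"Worst case: ~{worst_case_min:.1f} minutes")  # ~17.5 minutes
```

If that ceiling is longer than you are willing to wait, lower max_pages rather than the timeout.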
Run Your First Deep Scrape in OpenClaw Chat
Open OpenClaw chat and give Deep Scraper a starting URL with a description of what content to extract. The skill follows links recursively and returns aggregated content from all scraped pages. Start with a focused test on a small, known section of a site to verify the skill is working correctly before running larger jobs. Good first prompt:

```
Deep scrape https://example.com/blog/page/1 up to 2 levels deep with a max of 20 pages. Extract the title and first paragraph of each article you find.
```

Deep Scraper will:
1. Fetch the starting URL
2. Extract content as specified
3. Find all internal links on the page
4. Follow those links and repeat
5. Stop when it reaches the depth limit or max pages cap
6. Return aggregated results in OpenClaw chat

For content-heavy sites, deep scrapes on 50+ pages can take several minutes. OpenClaw will show progress updates as pages are scraped so you know the job is running. If the scrape returns very little content or seems to skip most pages, the target site may be blocking automated requests. See the Troubleshooting section for how to handle this.
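Conceptually, the six steps above form a simple crawl loop. The Python sketch below illustrates that loop using only the standard library; it is not Deep Scraper's actual implementation, just a minimal picture of a depth- and page-capped crawl restricted to the starting domain:

```
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in raw HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

def deep_scrape(start_url, depth=1, max_pages=20):
    seen, results = set(), {}
    queue = deque([(start_url, 0)])                 # (url, levels below the start page)
    while queue and len(results) < max_pages:       # step 5: max pages cap
        url, level = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        results[url] = html                          # step 2: keep the page content
        if level < depth:                            # step 5: depth limit
            collector = LinkCollector()
            collector.feed(html)                     # step 3: find links on the page
            for href in collector.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == urlparse(start_url).netloc:
                    queue.append((absolute, level + 1))   # step 4: follow internal links
    return results                                   # step 6: aggregated results
```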
Deep scrape https://en.wikipedia.org/wiki/Artificial_intelligence following links 1 level deep. Collect the first two paragraphs from each linked article page you find. Limit to 15 pages total.
Paste this in OpenClaw chat
Pro tip: Wikipedia is a great site for testing deep-scraper because it has reliable link structure, no bot blocking, and well-formatted content. Use it to verify your skill is working before running scrapes on harder targets.
Expected result: OpenClaw returns scraped content from multiple linked pages starting from your seed URL, aggregated into a single response with URLs and extracted text.
Advanced Usage: Scoped Deep Scraping and Domain Restriction
The most common problem with naive deep scraping is link bleed — following external links to other domains or navigating far outside the section you intended to scrape. Deep Scraper supports scoping instructions that keep the crawl within bounds.

**Path prefix scoping:** Tell Deep Scraper to only follow links that stay within a specific URL path. This is the single most effective way to contain a deep scrape:

```
Deep scrape https://docs.example.com/api starting from the API overview page. Only follow links that stay within https://docs.example.com/api/. Go 3 levels deep.
```

**Domain restriction:** Limit the crawl to a single domain (no external links followed):

```
Deep scrape https://example.com/products. Stay within the example.com domain only. Do not follow links to external sites, payment processors, or social media.
```

**Content-type targeting:** Instruct Deep Scraper to extract specific content types rather than all page text:

```
Deep scrape https://example.com/blog 2 levels deep. For each page, extract only: the article headline, author name, and article body text. Skip navigation menus, sidebars, and footers.
```

For large-scale deep scraping use cases in production environments — building training datasets, knowledge base pipelines, or competitive intelligence feeds — the RapidDev team offers deep-scraper configuration templates and prompt libraries at rapiddev.ai that handle common edge cases like pagination, JavaScript-rendered content, and incremental scraping to avoid re-fetching already-scraped pages.

Always review a site's robots.txt (`https://example.com/robots.txt`) and terms of service before running large deep scrapes. Deep Scraper checks robots.txt by default but verifying manually is good practice.
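Mechanically, the path-prefix and domain restrictions above reduce to a simple check on each discovered link before it is followed. A rough Python sketch of that logic (illustrative only; Deep Scraper applies the equivalent filtering based on your prompt wording):

```
from urllib.parse import urljoin, urlparse

def in_scope(link, base_url, path_prefix=None, same_domain_only=True):
    """Decide whether a discovered link should be followed."""
    absolute = urljoin(base_url, link)        # resolve relative links like "../intro"
    target, base = urlparse(absolute), urlparse(base_url)
    if same_domain_only and target.netloc != base.netloc:
        return False                          # domain restriction
    if path_prefix and not target.path.startswith(path_prefix):
        return False                          # path prefix scoping
    return True

# Contain a crawl to the /docs/tutorials/ section of one site:
base = "https://example.com/docs/tutorials/"
print(in_scope("/docs/tutorials/setup", base, path_prefix="/docs/tutorials/"))        # True
print(in_scope("https://twitter.com/example", base, path_prefix="/docs/tutorials/"))  # False
```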
Deep scrape all pages under https://example.com/docs/tutorials. Only follow links that stay within /docs/tutorials/. Go 2 levels deep, max 50 pages. For each page extract the title, section headings, and full text content.
Paste this in OpenClaw chat
```
# Check if a site allows scraping before starting
curl https://target-site.com/robots.txt

# Reload skill config after changes
openclaw reload

# View full deep-scraper config options
clawhub info deep-scraper --verbose
```

Pro tip: Add explicit exclusions in your prompt to avoid common link traps: 'Do not follow links to /login, /signup, /cart, or any external domains'. This prevents deep-scraper from wasting page slots on authentication flows and external sites.
Expected result: Deep scrapes stay within the specified URL prefix, returning content only from the intended section of the target site without bleeding into unrelated pages.
Common use cases
Complete Blog Section Export
Extract all articles from a blog or news section by starting at the index page and following links to individual posts. Deep Scraper recursively follows article links and returns the full text content of each post — useful for building knowledge bases, training datasets, or content archives.
Scrape all articles linked from https://example.com/blog using deep scraping. Go 2 levels deep. For each article, extract the title, publication date, author, and full article text. Return as a structured list.
Copy this prompt to try it in OpenClaw
Documentation Section Deep Extraction
Extract all pages from a specific documentation section by following internal links from a starting docs page. Useful for building offline documentation, creating embeddings for a knowledge base, or analyzing the structure and content of a competitor's developer docs.
Deep scrape all pages under https://docs.example.com/guides starting from the guides index. Follow internal links 2 levels deep. Return the full text content and heading structure of each page, along with its URL.
Copy this prompt to try it in OpenClaw
Forum Thread Collection
Collect all posts and replies in a forum thread by following pagination links and nested reply links. Deep Scraper can follow the structure of threaded discussions and aggregate all content without manually navigating page by page.
Scrape the full discussion at https://forum.example.com/topic/12345 including all reply pages. Follow pagination links up to 5 pages deep. Extract each post author, timestamp, and content.
Copy this prompt to try it in OpenClaw
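Pagination following is a narrower version of the same link-following loop: instead of following every link, you look only for the link that points to the next page. A minimal Python sketch, assuming the forum marks that link with rel="next" (many sites do, but not all):

```
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class NextLinkFinder(HTMLParser):
    """Remembers the first <a rel="next"> link seen on a page."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and self.next_url is None and "next" in (a.get("rel") or ""):
            self.next_url = a.get("href")

def collect_thread_pages(start_url, max_pages=5):
    """Fetch a thread page, then keep following its 'next page' link."""
    pages, url = [], start_url
    while url and len(pages) < max_pages:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        pages.append((url, html))
        finder = NextLinkFinder()
        finder.feed(html)
        url = urljoin(url, finder.next_url) if finder.next_url else None
    return pages
```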
Troubleshooting
Deep scrape returns mostly empty content or only gets 1-2 pages despite setting higher depth
Cause: The target site uses JavaScript-rendered content that Deep Scraper's HTTP client cannot execute, or the site's navigation links are dynamically generated rather than present in the HTML source.
Solution: Deep Scraper works best on server-rendered sites where link structure exists in the HTML. For JavaScript-heavy single-page apps, switch to firecrawl-skills which has JavaScript rendering support. You can check if this is the issue by viewing the page source in your browser (Ctrl+U or right-click → View Page Source) — if most links are absent from the raw HTML, the site is JavaScript-rendered.
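One way to automate that view-source check is to count how many anchor tags appear in the HTML the server actually returns. A small Python sketch (the URL is a placeholder for the page you are debugging):

```
from html.parser import HTMLParser
from urllib.request import urlopen

class AnchorCounter(HTMLParser):
    """Counts <a> tags present in the server-rendered HTML."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1

url = "https://example.com/blog"   # placeholder: the page you are debugging
html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
counter = AnchorCounter()
counter.feed(html)
print(f"{counter.count} links in the raw HTML")
# A handful of links on a page that visibly shows dozens usually means the
# navigation is rendered by JavaScript, which Deep Scraper cannot execute.
```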
`clawhub install deep-scraper` fails with a version conflict error
Cause: Another installed skill has a dependency that conflicts with deep-scraper's requirements, or the local registry index is stale.
Solution: Run `clawhub update` to refresh the registry index and resolve any pending dependency updates, then retry the install. If conflicts persist, run `clawhub list --conflicts` to identify which skills are conflicting.
```
clawhub update
clawhub list --conflicts
clawhub install deep-scraper
```

Deep scrape is blocked or returns HTTP 403/429 errors from the target site
Cause: The target site is detecting and blocking automated requests based on request frequency or missing browser-like headers.
Solution: Increase the request delay to make scraping less aggressive: `openclaw config set skills.deep-scraper.request_delay_ms 2000`. Also verify the site's robots.txt allows scraping. If the site aggressively blocks bots, consider using firecrawl-skills instead, which routes requests through Firecrawl's infrastructure with better anti-detection handling.
```
# Increase delay between requests
openclaw config set skills.deep-scraper.request_delay_ms 2000

# Check if site allows scraping
curl https://target-site.com/robots.txt
```

OpenClaw chat hangs or times out during a deep scrape job
Cause: The scrape job is too large (too many pages or very slow target site) and exceeds OpenClaw's response timeout.
Solution: Reduce the scope of your scrape — lower the depth limit and add a smaller max_pages cap. For very large extraction jobs, break them into multiple smaller scrapes starting from different sub-sections of the site. Run `openclaw config set skills.deep-scraper.max_pages 30` to enforce a more conservative page limit.
```
# Reduce max pages to prevent timeouts
openclaw config set skills.deep-scraper.max_pages 30
openclaw config set skills.deep-scraper.default_depth 1
```

Best practices
- Always set a max_pages limit before running a deep scrape on an unfamiliar site — without a cap, a link-rich site can balloon into thousands of pages and run for many minutes.
- Add path prefix scoping to your prompts ('only follow links within /docs/api/') to prevent link bleed into unrelated site sections, external domains, or authentication pages.
- Add a `request_delay_ms` of at least 500ms between requests — this is respectful to the target server and significantly reduces the chance of your IP being rate-limited or blocked.
- Check robots.txt manually at `https://example.com/robots.txt` before large scrapes — Deep Scraper respects robots.txt by default, but reviewing it first helps you understand which site sections are off-limits.
- Test on a small section first — run a depth-1 scrape with max 10 pages before committing to a full deep extraction, to verify content quality and that the target site is not JavaScript-rendered.
- Use Deep Scraper for server-rendered sites and firecrawl-skills for JavaScript-heavy SPAs — choosing the right tool for the target site saves significant time and produces better results.
- Exclude common link traps in your prompts — specify that the scraper should not follow links to /login, /signup, /cart, or external domains to keep the scrape focused and efficient.
- For incremental scraping (refreshing data from a site you have scraped before), keep a record of already-scraped URLs and instruct Deep Scraper to skip them — this avoids re-fetching unchanged content.
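One simple way to keep that record is a local JSON file of already-scraped URLs, loaded before each run and updated afterward. A minimal Python sketch; the file name and format here are just one possible convention, not something Deep Scraper manages for you:

```
import json
from pathlib import Path

SEEN_FILE = Path("scraped_urls.json")   # hypothetical local record, one per project

def load_seen():
    """URLs scraped in previous runs."""
    return set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()

def save_seen(seen):
    """Persist the updated record after a run."""
    SEEN_FILE.write_text(json.dumps(sorted(seen), indent=2))

seen = load_seen()
candidates = ["https://example.com/docs/a", "https://example.com/docs/b"]
new_urls = [u for u in candidates if u not in seen]
# ... scrape only new_urls (for example, list them explicitly in your prompt,
# or tell Deep Scraper to skip everything already in the record) ...
seen.update(new_urls)
save_seen(seen)
```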
Alternatives
Firecrawl Skills requires a paid Firecrawl API key but offers AI-powered structured extraction, JavaScript rendering, and better anti-detection — use it for complex sites or structured data pipelines where Deep Scraper's simpler approach falls short.
Firecrawl Search is optimized for search queries against the web rather than systematic extraction from a specific site — use it when you do not know which URLs to scrape and need discovery first.
Web Search Plus aggregates search results from multiple engines for broad research coverage — use it when you need to discover information across many sites rather than extract everything from one site.
Tavily Web Search returns fast, raw web search results for discovery tasks — the right choice when you need to find pages rather than systematically extract content from pages you have already identified.
Frequently asked questions
How do I install Deep Scraper in OpenClaw?
Run `clawhub install deep-scraper` in your terminal. No API key or external account is required — Deep Scraper works immediately after installation. Verify it is active with `clawhub list | grep deep-scraper`, then start using it in OpenClaw chat with a URL and depth instruction.
Does Deep Scraper require an API key?
No — Deep Scraper is a self-contained ClawHub skill that uses OpenClaw's built-in HTTP capabilities. There is no external service account, no API key, and no paid plan required. This makes it the fastest web extraction skill to get started with in ClawHub. The trade-off is that it lacks some advanced features of API-backed scrapers like AI-powered structured extraction and JavaScript rendering.
What is the difference between Deep Scraper and Firecrawl Skills in OpenClaw?
Deep Scraper is simpler and requires no API key — it recursively follows links and extracts page text. Firecrawl Skills requires a Firecrawl API key but offers AI-powered structured data extraction, JavaScript rendering, site mapping, and finer control over extraction. Use Deep Scraper for straightforward content dumps from server-rendered sites; use firecrawl-skills for structured data pipelines or JavaScript-heavy sites.
How do I control how many pages Deep Scraper crawls?
Set a max_pages limit with `openclaw config set skills.deep-scraper.max_pages 50` to cap the total pages per job. You can also specify limits inline in your prompt: 'scrape up to 30 pages'. Always set a page cap before scraping unfamiliar sites — without one, a link-rich site can generate far more pages than expected.
Deep Scraper is not finding all the pages I expect — why?
The most common cause is that the target site uses JavaScript to render its navigation links, which Deep Scraper's HTTP client cannot execute. Check by viewing the page source in your browser (Ctrl+U) — if most links are absent from the raw HTML, the site is JavaScript-rendered and firecrawl-skills (which supports JavaScript rendering) is a better choice.
Can Deep Scraper help me build a knowledge base from a website?
Yes — Deep Scraper is well-suited for extracting raw text content from documentation sites, blogs, and wikis that you want to embed or index for a knowledge base. Use it to extract text from each page, then pass the results to an embedding model. For complex knowledge base pipelines, the RapidDev team offers OpenClaw configuration templates combining deep-scraper with embedding workflows — visit rapiddev.ai for details.