How to Build a Web Scraping API with Lovable

What you'll build

  • Scrape job submission API endpoint (Edge Function) that accepts a URL and CSS selectors
  • Job queue table with priority, retry count, and status lifecycle (pending → running → done/failed)
  • pg_cron worker that polls the queue and dispatches scrape Edge Function calls
  • HTML parsing in Deno using cheerio-compatible CSS selector extraction
  • Structured JSONB result storage with extracted fields per configured selector
  • Firecrawl API fallback for JavaScript-heavy and anti-bot protected pages
  • Dashboard for submitting scrape jobs, viewing results, and monitoring queue health
Advanced · 14 min read · 3–4 hours · Lovable Pro or higher · April 2026 · RapidDev Engineering Team
TL;DR

Build a web scraping API in Lovable using Supabase Edge Functions with fetch and cheerio for HTML parsing. Features a scrape jobs queue with pg_cron scheduling, structured JSONB result storage, and Firecrawl as a fallback for anti-bot protected sites. Manage and monitor all scraping jobs from a dashboard.

What you're building

A web scraping API has three parts: job intake (accepting scrape requests), job processing (fetching and parsing pages), and result storage. The intake is an Edge Function that validates the URL, stores the job in a scrape_jobs queue table, and returns the job ID immediately. Processing happens asynchronously via a pg_cron job that polls for pending work every minute and calls the scrape-worker Edge Function for each job.

The scrape-worker function fetches the HTML using the standard Deno fetch() API, then uses CSS selectors from the job's selector_config JSONB to extract specific content. For example, a selector config like { 'title': 'h1', 'price': '.price-tag', 'description': '#description' } extracts the page title, price element, and description. The results are stored as JSONB in scrape_results.
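
To make the selector mechanics concrete, here is a minimal sketch of that extraction step as a standalone helper (the helper name and file path are illustrative; the scrape-worker function in step 2 inlines the same loop):

extract-data.ts (illustrative sketch)
import { load } from 'https://esm.sh/cheerio@1.0.0-rc.12'

export function extractData(html: string, config: Record<string, string>): Record<string, string> {
  const $ = load(html)
  const out: Record<string, string> = {}
  for (const [field, selector] of Object.entries(config)) {
    // Keep only the first match per selector so each field maps to a single value
    out[field] = $(selector).first().text().trim()
  }
  return out
}

// e.g. extractData(html, { title: 'h1', price: '.price-tag', description: '#description' })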

Many sites use JavaScript rendering and anti-bot measures that a simple fetch() cannot bypass. For these, the Firecrawl API (via an Edge Function call) returns pre-rendered markdown or cleaned HTML. The scraper detects common bot blocks (403 responses, CAPTCHA pages) and automatically retries via Firecrawl before marking a job as failed.

Final result

A working web scraping API with a job queue, HTML parsing, structured result extraction, Firecrawl fallback, and a management dashboard.

Tech stack

  • Lovable: Scraping dashboard frontend
  • Supabase Edge Functions: Scrape worker and job API (Deno)
  • Supabase: Database, pg_cron job queue
  • Firecrawl: Fallback scraper for JS-heavy sites
  • shadcn/ui: DataTable, Badge, Cards, Tabs
  • Recharts: Queue throughput and error rate charts

Prerequisites

  • Lovable Pro account for multiple Edge Functions
  • Supabase project with the pg_cron extension enabled (pg_cron is available on all plans)
  • Firecrawl API key from firecrawl.dev (free tier: 500 credits)
  • Supabase service role key and Firecrawl API key saved to Cloud tab → Secrets

Build steps

1

Create the job queue schema

Prompt Lovable to set up the scrape jobs and results tables. The queue design is the foundation — job status lifecycle and priority ordering determine scraping throughput.

prompt.txt
Build a web scraping API. Create these Supabase tables:

- scrape_jobs: id, user_id, url (text), selector_config (jsonb, e.g. { fieldName: 'css-selector' }), priority (int default 5, 1=highest 10=lowest), status (pending|running|done|failed|retrying), attempt_count (int default 0), max_attempts (int default 3), use_firecrawl (bool default false), error_message (text), created_at, started_at, completed_at

- scrape_results: id, job_id (FK scrape_jobs UNIQUE), extracted_data (jsonb), raw_html_url (text, Supabase Storage path), page_title (text), scraped_at, response_status_code (int), response_time_ms (int)

RLS:
- scrape_jobs: user_id = auth.uid() for all operations
- scrape_results: accessible via job_id FK to user's jobs (check via EXISTS subquery)

Create indexes:
- CREATE INDEX idx_scrape_jobs_queue ON scrape_jobs(status, priority, created_at) WHERE status = 'pending'
- CREATE INDEX idx_scrape_jobs_user ON scrape_jobs(user_id, created_at DESC)

Enable the pg_cron extension (Database → Extensions in the Supabase dashboard) if it is not already enabled on your project.

Pro tip: Ask Lovable to create a scrape_job_templates table where users can save common URL patterns and selector configs (e.g. 'E-commerce Product Page' template with selectors for title, price, availability). Templates speed up new job creation.

Expected result: Both tables are created with the queue index. The app loads with a job submission form and a queue DataTable.
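
Optionally, ask Lovable for a small shared type file so the dashboard and Edge Functions agree on the schema. A sketch of what it could contain (the file path is illustrative; field names mirror the prompt above):

src/types/scraping.ts (illustrative sketch)
export type ScrapeJobStatus = 'pending' | 'running' | 'done' | 'failed' | 'retrying'

export interface ScrapeJob {
  id: string
  user_id: string
  url: string
  selector_config: Record<string, string> // e.g. { price: '.price-tag' }
  priority: number // 1 = highest, 10 = lowest
  status: ScrapeJobStatus
  attempt_count: number
  max_attempts: number
  use_firecrawl: boolean
  error_message: string | null
  created_at: string
  started_at: string | null
  completed_at: string | null
}

export interface ScrapeResult {
  id: string
  job_id: string
  extracted_data: Record<string, string>
  raw_html_url: string | null
  page_title: string | null
  scraped_at: string
  response_status_code: number | null
  response_time_ms: number | null
}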

2

Build the scrape worker Edge Function

Create the Edge Function that does the actual scraping. It fetches the page, applies CSS selectors, stores results, and handles Firecrawl fallback.

supabase/functions/scrape-worker/index.ts
// supabase/functions/scrape-worker/index.ts
import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'
import { load } from 'https://esm.sh/cheerio@1.0.0-rc.12'

const corsHeaders = { 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Headers': 'authorization, apikey, content-type' }

serve(async (req: Request) => {
  if (req.method === 'OPTIONS') return new Response('ok', { headers: corsHeaders })

  const supabase = createClient(Deno.env.get('SUPABASE_URL') ?? '', Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? '')
  const { jobId } = await req.json()

  const { data: job } = await supabase.from('scrape_jobs').select('*').eq('id', jobId).single()
  if (!job) return new Response(JSON.stringify({ error: 'Job not found' }), { status: 404, headers: corsHeaders })

  // Mark the job as running and bump the attempt counter before doing any network work
  await supabase.from('scrape_jobs').update({ status: 'running', started_at: new Date().toISOString(), attempt_count: job.attempt_count + 1 }).eq('id', jobId)

  const start = Date.now()
  try {
    let html = ''
    let statusCode = 200

    if (job.use_firecrawl) {
      html = await scrapeWithFirecrawl(job.url)
    } else {
      const res = await fetch(job.url, {
        headers: { 'User-Agent': 'Mozilla/5.0 (compatible; scraper/1.0)' },
        signal: AbortSignal.timeout(15000),
      })
      statusCode = res.status
      if (res.status === 403 || res.status === 429) {
        // Likely bot-blocked or rate-limited: retry the same URL through Firecrawl
        html = await scrapeWithFirecrawl(job.url)
      } else {
        html = await res.text()
      }
    }

    // Apply each configured CSS selector and keep the first match's trimmed text
    const $ = load(html)
    const extractedData: Record<string, string> = {}
    for (const [field, selector] of Object.entries(job.selector_config as Record<string, string>)) {
      extractedData[field] = $(selector).first().text().trim()
    }

    await supabase.from('scrape_results').upsert({
      job_id: jobId,
      extracted_data: extractedData,
      page_title: $('title').text().trim(),
      scraped_at: new Date().toISOString(),
      response_status_code: statusCode,
      response_time_ms: Date.now() - start,
    }, { onConflict: 'job_id' })

    await supabase.from('scrape_jobs').update({ status: 'done', completed_at: new Date().toISOString() }).eq('id', jobId)
    return new Response(JSON.stringify({ success: true, fields: Object.keys(extractedData).length }), { headers: corsHeaders })
  } catch (err) {
    const msg = err instanceof Error ? err.message : 'Scrape failed'
    // Requeue as 'pending' so the next pg_cron tick retries it, or fail permanently at max_attempts
    const nextStatus = job.attempt_count + 1 >= job.max_attempts ? 'failed' : 'pending'
    await supabase.from('scrape_jobs').update({ status: nextStatus, error_message: msg, completed_at: nextStatus === 'failed' ? new Date().toISOString() : null }).eq('id', jobId)
    return new Response(JSON.stringify({ error: msg }), { status: 500, headers: corsHeaders })
  }
})

async function scrapeWithFirecrawl(url: string): Promise<string> {
  const res = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${Deno.env.get('FIRECRAWL_API_KEY')}` },
    body: JSON.stringify({ url, formats: ['html'] }),
  })
  const data = await res.json()
  if (!res.ok) throw new Error(data.error ?? 'Firecrawl failed')
  return data.data?.html ?? data.data?.markdown ?? ''
}

Expected result: The scrape-worker Edge Function deploys. Calling it manually with a jobId fetches the URL, applies selectors, and stores results in scrape_results.
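
Before wiring up pg_cron in the next step, you can dispatch a single job by hand to confirm the worker behaves. A hedged sketch of a one-off test script (project URL and job id are placeholders):

test-worker.ts (illustrative sketch)
// Run with: deno run --allow-net --allow-env test-worker.ts
const res = await fetch('https://YOUR_PROJECT.supabase.co/functions/v1/scrape-worker', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${Deno.env.get('SUPABASE_SERVICE_ROLE_KEY')}`,
  },
  // Replace with the id of a row you inserted into scrape_jobs
  body: JSON.stringify({ jobId: 'REPLACE_WITH_JOB_ID' }),
})
console.log(res.status, await res.json())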

3

Set up the pg_cron queue processor

Create the pg_cron job that polls for pending scrape jobs and dispatches them to the scrape-worker Edge Function. This makes the queue fully automated.

prompt.txt
Set up the automated queue processor:

1. Create an Edge Function at supabase/functions/process-queue/index.ts that:
   - Queries scrape_jobs WHERE status = 'pending' ORDER BY priority ASC, created_at ASC LIMIT 5
   - For each job: updates status to 'running' (to claim it), then calls the scrape-worker Edge Function via fetch
   - Uses Promise.allSettled to run up to 5 jobs concurrently
   - Returns a JSON summary: { dispatched: number, jobIds: string[] }

2. Register the pg_cron schedule in the Supabase SQL editor:
SELECT cron.schedule(
  'process-scrape-queue',
  '* * * * *',
  $$
  SELECT net.http_post(
    url:='https://YOUR_PROJECT.supabase.co/functions/v1/process-queue',
    headers:=json_build_object('Authorization', 'Bearer YOUR_SERVICE_ROLE_KEY')::jsonb
  ) AS request_id;
  $$
);

3. Add a job intake Edge Function at supabase/functions/submit-scrape-job/index.ts that:
   - Accepts POST with { url, selectorConfig, priority?, useFirecrawl? }
   - Validates the URL is a valid HTTP/HTTPS URL
   - Inserts into scrape_jobs and returns the new job ID
   - Can be called without auth (add an API key check using the same pattern as the api-backend guide) or with Supabase Auth

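Item 3 of this prompt, the intake endpoint, stays small. A minimal sketch of the Supabase Auth variant, assuming the scrape_jobs columns from step 1 (the API-key variant mentioned in the prompt would replace the user lookup):

supabase/functions/submit-scrape-job/index.ts (illustrative sketch)
import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

serve(async (req: Request) => {
  const { url, selectorConfig, priority, useFirecrawl } = await req.json()

  // Only accept plain http(s) URLs
  let parsed: URL
  try {
    parsed = new URL(url)
  } catch {
    return new Response(JSON.stringify({ error: 'Invalid URL' }), { status: 400 })
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return new Response(JSON.stringify({ error: 'Only HTTP/HTTPS URLs are allowed' }), { status: 400 })
  }

  // Forward the caller's JWT so the insert passes the user_id = auth.uid() RLS policy
  const supabase = createClient(Deno.env.get('SUPABASE_URL') ?? '', Deno.env.get('SUPABASE_ANON_KEY') ?? '', {
    global: { headers: { Authorization: req.headers.get('Authorization') ?? '' } },
  })
  const token = (req.headers.get('Authorization') ?? '').replace('Bearer ', '')
  const { data: { user } } = await supabase.auth.getUser(token)
  if (!user) return new Response(JSON.stringify({ error: 'Unauthorized' }), { status: 401 })

  const { data, error } = await supabase
    .from('scrape_jobs')
    .insert({ user_id: user.id, url, selector_config: selectorConfig, priority: priority ?? 5, use_firecrawl: useFirecrawl ?? false })
    .select('id')
    .single()
  if (error) return new Response(JSON.stringify({ error: error.message }), { status: 500 })

  return new Response(JSON.stringify({ jobId: data.id }), { headers: { 'Content-Type': 'application/json' } })
})
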
Pro tip: Add a concurrency control: before dispatching jobs, check how many are currently status='running'. If already at 5 running, skip this pg_cron tick. This prevents queue pile-up if jobs take longer than 1 minute.

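A minimal sketch of the dispatcher itself, including the concurrency guard from the tip above (it uses the simple select-then-update claim the prompt asks for; see the atomicity pitfall later for the stricter claim_scrape_jobs approach):

supabase/functions/process-queue/index.ts (illustrative sketch)
import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

serve(async (_req: Request) => {
  const supabase = createClient(Deno.env.get('SUPABASE_URL') ?? '', Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? '')
  const jsonHeaders = { 'Content-Type': 'application/json' }

  // Concurrency guard: skip this tick if the worker pool is already full
  const { count: running } = await supabase
    .from('scrape_jobs')
    .select('id', { count: 'exact', head: true })
    .eq('status', 'running')
  if ((running ?? 0) >= 5) {
    return new Response(JSON.stringify({ dispatched: 0, jobIds: [] }), { headers: jsonHeaders })
  }

  // Next batch: highest priority (lowest number) first, oldest first within a priority
  const { data: jobs } = await supabase
    .from('scrape_jobs')
    .select('id')
    .eq('status', 'pending')
    .order('priority', { ascending: true })
    .order('created_at', { ascending: true })
    .limit(5)
  if (!jobs || jobs.length === 0) {
    return new Response(JSON.stringify({ dispatched: 0, jobIds: [] }), { headers: jsonHeaders })
  }

  // Claim the batch, then dispatch each job to the scrape-worker function concurrently
  const jobIds = jobs.map((j) => j.id)
  await supabase.from('scrape_jobs').update({ status: 'running' }).in('id', jobIds)
  await Promise.allSettled(
    jobIds.map((jobId) =>
      fetch(`${Deno.env.get('SUPABASE_URL')}/functions/v1/scrape-worker`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${Deno.env.get('SUPABASE_SERVICE_ROLE_KEY')}` },
        body: JSON.stringify({ jobId }),
      })
    )
  )

  return new Response(JSON.stringify({ dispatched: jobIds.length, jobIds }), { headers: jsonHeaders })
})
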
Expected result: The pg_cron job runs every minute. Submitting a job via the intake Edge Function shows it as 'pending' in the dashboard, then transitions to 'running' and 'done' within 1-2 minutes.

4

Build the scraping dashboard

Create the management dashboard where users can submit new scrape jobs, see queue status, inspect results, and monitor error rates.

prompt.txt
Build the scraping dashboard at src/pages/ScrapingDashboard.tsx:

1. Summary Cards at top: Jobs Today, Queue Depth (pending count), Success Rate %, Average Response Time

2. Job submission form (Card or Sheet):
   - URL Input (required, validated as URL format)
   - Selector Config builder: a key-value list where users add field name (e.g. 'price') + CSS selector (e.g. '.price-tag'). Show Add Row Button and remove buttons per row.
   - Priority Slider (1-10)
   - Use Firecrawl Checkbox with label 'Use Firecrawl for JavaScript-heavy sites (uses credits)'
   - Submit Button: calls the submit-scrape-job Edge Function

3. Job queue DataTable with columns: created_at (relative), URL (truncated with Tooltip for full URL), Priority Badge, Status Badge (pending=gray, running=blue with spinner, done=green, failed=red, retrying=yellow), attempt_count, Actions menu (View Results, Retry, Delete)

4. Results Sheet (opens when clicking View Results):
   - URL and scraped_at
   - Extracted data as a key-value table (field name → extracted text)
   - Response status code Badge and response time
   - Error message if failed

5. Recharts BarChart below the table: jobs per hour over the last 24 hours, colored by status (done=green, failed=red stacked)

Expected result: The dashboard shows queue stats. Submitting a job adds it to the DataTable. After 1-2 minutes, the status changes to 'done' and clicking 'View Results' shows the extracted data.

Complete code

src/components/scraping/SelectorBuilder.tsx
import { Button } from '@/components/ui/button'
import { Input } from '@/components/ui/input'
import { Trash2, Plus } from 'lucide-react'

export interface SelectorRow {
  field: string
  selector: string
}

interface Props {
  value: SelectorRow[]
  onChange: (rows: SelectorRow[]) => void
}

// Controlled key-value editor: each row maps a result field name to the CSS selector that fills it
export function SelectorBuilder({ value, onChange }: Props) {
  function addRow() {
    onChange([...value, { field: '', selector: '' }])
  }

  function updateRow(index: number, key: keyof SelectorRow, newValue: string) {
    const updated = value.map((row, i) => (i === index ? { ...row, [key]: newValue } : row))
    onChange(updated)
  }

  function removeRow(index: number) {
    onChange(value.filter((_, i) => i !== index))
  }

  return (
    <div className="space-y-2">
      {value.length > 0 && (
        <div className="grid grid-cols-[1fr_1fr_auto] gap-2 text-sm font-medium text-muted-foreground px-1">
          <span>Field Name</span>
          <span>CSS Selector</span>
          <span />
        </div>
      )}
      {value.map((row, i) => (
        <div key={i} className="grid grid-cols-[1fr_1fr_auto] gap-2 items-center">
          <Input
            placeholder="price"
            value={row.field}
            onChange={(e) => updateRow(i, 'field', e.target.value)}
          />
          <Input
            placeholder=".product-price"
            value={row.selector}
            onChange={(e) => updateRow(i, 'selector', e.target.value)}
            className="font-mono text-sm"
          />
          <Button variant="ghost" size="icon" onClick={() => removeRow(i)} className="text-destructive hover:text-destructive">
            <Trash2 className="h-4 w-4" />
          </Button>
        </div>
      ))}
      <Button variant="outline" size="sm" onClick={addRow} className="w-full">
        <Plus className="mr-2 h-4 w-4" />
        Add Field
      </Button>
    </div>
  )
}
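
A short sketch of how the submission form can consume this component: convert the row list into the { fieldName: selector } object and call the intake function (the hook name and import paths are illustrative; supabase.functions.invoke comes from the Supabase JS client):

src/hooks/useSubmitScrapeJob.ts (illustrative sketch)
import { useState } from 'react'
import { supabase } from '@/integrations/supabase/client' // Lovable's generated client (path may differ)
import type { SelectorRow } from '@/components/scraping/SelectorBuilder'

// Render <SelectorBuilder value={rows} onChange={setRows} /> in the form and call submit() on save
export function useSubmitScrapeJob() {
  const [rows, setRows] = useState<SelectorRow[]>([{ field: 'title', selector: 'h1' }])

  async function submit(url: string, priority: number, useFirecrawl: boolean) {
    // Drop incomplete rows, then collapse the list into the selector_config shape
    const selectorConfig = Object.fromEntries(
      rows.filter((r) => r.field && r.selector).map((r) => [r.field, r.selector])
    )
    const { data, error } = await supabase.functions.invoke('submit-scrape-job', {
      body: { url, selectorConfig, priority, useFirecrawl },
    })
    if (error) throw error
    return data.jobId as string
  }

  return { rows, setRows, submit }
}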

Customization ideas

Scheduled recurring scrapes

Add a schedule column to scrape_jobs (cron string like '0 9 * * *'). A daily pg_cron job checks for scheduled recurring jobs and creates new pending job entries at the right time. This turns the scraper into a monitoring tool that automatically checks a page for changes at regular intervals.

Change detection and alerts

Compare new scrape results against the previous result for the same URL. If the extracted data changes (e.g. a price dropped), send an email via Resend or a Slack message via webhook. Store previous_result in scrape_jobs and diff the extracted_data JSONB on each successful scrape.
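
A hedged sketch of the field-level diff step (the email or Slack call itself is left out):

diff-extracted-data.ts (illustrative sketch)
type Diff = { field: string; before: string | undefined; after: string | undefined }

export function diffExtractedData(previous: Record<string, string>, current: Record<string, string>): Diff[] {
  // Compare every field present in either payload; unchanged fields are skipped
  const fields = new Set([...Object.keys(previous), ...Object.keys(current)])
  const changes: Diff[] = []
  for (const field of fields) {
    if (previous[field] !== current[field]) {
      changes.push({ field, before: previous[field], after: current[field] })
    }
  }
  return changes
}

// A non-empty result (e.g. the price field changed) is the trigger for the alert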

Sitemap crawler mode

Add a crawl_jobs table for site-wide crawling. Accept a starting URL and crawl depth. The worker fetches the starting page, extracts all internal links using cheerio, and submits them as individual scrape_jobs. Limit depth to prevent infinite crawls. Show crawl progress as a tree visualization.
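
A minimal sketch of the link-extraction step, assuming cheerio as in the worker; depth limiting and deduplication against already-queued URLs stay in the crawler logic:

extract-links.ts (illustrative sketch)
import { load } from 'https://esm.sh/cheerio@1.0.0-rc.12'

export function extractInternalLinks(html: string, pageUrl: string): string[] {
  const $ = load(html)
  const origin = new URL(pageUrl).origin
  const links = new Set<string>()
  $('a[href]').each((_i, el) => {
    try {
      const absolute = new URL($(el).attr('href') ?? '', pageUrl) // resolve relative hrefs against the page URL
      absolute.hash = '' // ignore fragment-only differences
      if (absolute.origin === origin) links.add(absolute.toString())
    } catch {
      // skip malformed hrefs
    }
  })
  return [...links]
}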

Result export and webhooks

Add an export button that downloads all results for a URL pattern as CSV or JSON. Also add a webhook delivery option: when a scrape job completes successfully, POST the result to a user-configured URL. Store webhook_url per job or per scrape template.

Common pitfalls

Pitfall: Scraping sites without checking robots.txt

How to avoid: Before scraping, fetch and parse the target site's /robots.txt. Check if the requested path is allowed for your User-Agent. Add a robots_check column to scrape_jobs and set it to false if robots.txt disallows the URL. Show a warning in the dashboard.
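
A deliberately naive sketch of such a check, honouring only Disallow rules in the "User-agent: *" group; a production worker should use a full robots.txt parser:

robots-check.ts (illustrative sketch)
export async function isAllowedByRobots(targetUrl: string): Promise<boolean> {
  const url = new URL(targetUrl)
  const res = await fetch(`${url.origin}/robots.txt`, { signal: AbortSignal.timeout(5000) })
  if (!res.ok) return true // no readable robots.txt: treat as allowed

  const lines = (await res.text()).split('\n').map((l) => l.trim())
  let inWildcardGroup = false
  for (const line of lines) {
    const lower = line.toLowerCase()
    if (lower.startsWith('user-agent:')) {
      inWildcardGroup = lower.slice('user-agent:'.length).trim() === '*'
    } else if (inWildcardGroup && lower.startsWith('disallow:')) {
      const path = line.slice('disallow:'.length).trim()
      if (path && url.pathname.startsWith(path)) return false
    }
  }
  return true
}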

Pitfall: Using cheerio selectors that break when the page layout changes

How to avoid: Prefer semantic selectors: tag + attribute combinations like [itemprop='price'], structured data selectors, or data attribute selectors like [data-testid='price']. These change less frequently than generated CSS class names. Document why each selector was chosen.

Pitfall: Not setting a timeout on fetch() calls

How to avoid: Always use AbortSignal.timeout() with fetch: signal: AbortSignal.timeout(15000) for a 15-second timeout. Catch the timeout error specifically and log it as a distinct error type so you know how often pages are hanging.
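
A small sketch of a fetch wrapper that classifies timeouts separately, assuming Deno's AbortSignal.timeout() aborts with a DOMException named 'TimeoutError':

fetch-with-timeout.ts (illustrative sketch)
export async function fetchWithTimeout(url: string, ms = 15000): Promise<Response> {
  try {
    return await fetch(url, {
      headers: { 'User-Agent': 'Mozilla/5.0 (compatible; scraper/1.0)' },
      signal: AbortSignal.timeout(ms),
    })
  } catch (err) {
    // Re-throw timeouts with a distinct, greppable prefix so they show up as their own error type
    if (err instanceof DOMException && err.name === 'TimeoutError') {
      throw new Error(`TIMEOUT: ${url} did not respond within ${ms}ms`)
    }
    throw err
  }
}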

Pitfall: Claiming jobs by status update without atomicity

How to avoid: Claim jobs in a single atomic statement: an UPDATE ... RETURNING whose WHERE clause selects the next pending job IDs via a subquery with ORDER BY priority, created_at LIMIT 5 FOR UPDATE SKIP LOCKED, setting them to 'running' in one step (plain UPDATE does not support LIMIT directly). This prevents two workers from claiming the same job. Ask Lovable to wrap it in a SECURITY DEFINER function and call it with supabase.rpc('claim_scrape_jobs', { batch_size: 5 }).

Best practices

  • Treat web scraping as a privilege, not a right. Always check robots.txt, use reasonable request rates (no more than 1 request per second per domain), and add a descriptive User-Agent so site operators know who is accessing their site.
  • Store the raw HTML in Supabase Storage alongside the extracted JSONB data. This lets you re-run selector extraction against historical HTML without fetching the page again — essential when you realize your CSS selector was wrong.
  • Use Firecrawl or a similar managed scraping service as a fallback, not a primary scraper. Direct fetch is cheaper and faster. Fall back to Firecrawl only on 403, 429, or when the extracted data is clearly empty (which may indicate a JavaScript-rendered page).
  • Add a domain-level rate limiter. Before dispatching a job, check how many other jobs for the same domain are currently running. Limit to 1-2 concurrent requests per domain to avoid being blocked (see the sketch after this list).
  • Never store scraped data that includes personal information (names, emails, phone numbers) without a clear legal basis. Design your selector configs to extract only publicly available, non-PII structured data like prices, product names, and availability.
  • Test CSS selectors on a static snapshot before deploying. Save a local copy of the target page HTML and run your cheerio selectors against it to verify they return the expected data.
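
Referenced from the rate-limiter point above, a hedged sketch of a pre-dispatch check (the ilike match on the URL column is deliberately coarse; a dedicated domain column would be cleaner):

domain-limit.ts (illustrative sketch)
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

export async function canDispatchForDomain(jobUrl: string, maxPerDomain = 2): Promise<boolean> {
  const supabase = createClient(Deno.env.get('SUPABASE_URL') ?? '', Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? '')
  const domain = new URL(jobUrl).hostname

  // Count currently running jobs whose URL points at the same hostname (coarse substring match)
  const { count } = await supabase
    .from('scrape_jobs')
    .select('id', { count: 'exact', head: true })
    .eq('status', 'running')
    .ilike('url', `%://${domain}%`)

  return (count ?? 0) < maxPerDomain
}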

AI prompts to try

Copy these prompts to build this project faster.

ChatGPT Prompt

I'm building a web scraper that uses cheerio for CSS selector extraction. I have a selector_config JSON like { title: 'h1', price: '.price', availability: '[data-stock]' }. Help me write a TypeScript function extractData(html: string, config: Record<string, string>): Record<string, string> using cheerio that applies each selector and returns the trimmed text content. Also add a metadata extraction step that always extracts page title, meta description, and canonical URL regardless of the config.

Lovable Prompt

Add a scrape result comparison feature to the dashboard. For jobs that have been run multiple times against the same URL, add a 'Compare Runs' Button in the results Sheet. Show a diff view with two columns: previous result and current result. Highlight fields that changed in yellow, new fields in green, and removed fields in red. Store the history of results in a scrape_result_history table with job_id, extracted_data, and scraped_at.

Build Prompt

In Supabase, create a function claim_scrape_jobs(batch_size int) that atomically claims pending jobs using UPDATE ... RETURNING to prevent duplicate processing. The function should: 1) SELECT pending jobs ordered by priority ASC, created_at ASC LIMIT batch_size FOR UPDATE SKIP LOCKED, 2) UPDATE their status to 'running' and started_at to now(), 3) RETURN the claimed job rows. Use SKIP LOCKED to allow concurrent worker calls without blocking. Show the complete PL/pgSQL function.

Frequently asked questions

Is web scraping legal?

Web scraping occupies a legally complex space. Scraping publicly accessible data is generally permitted, but scraping data behind authentication, ignoring robots.txt, circumventing technical measures, or scraping personal data in EU jurisdictions (GDPR) may create legal liability. Always check a site's Terms of Service before scraping. Use scraped data only as permitted. This guide is for building the infrastructure — responsibility for compliance rests with the operator.

When should I use Firecrawl vs direct fetch?

Use direct fetch (Deno's built-in fetch()) first — it's free and faster. Fall back to Firecrawl when: the site returns 403 or 429, the extracted data is empty (suggesting JavaScript rendering), or the page content is loaded by client-side JavaScript that isn't present in the initial HTML. Firecrawl uses headless browsers and handles anti-bot measures, but uses paid credits. Keep Firecrawl as a fallback, not the default.

Can I scrape sites that require login?

Yes, but it requires additional complexity. You'd need to store session cookies or auth tokens in Supabase Vault (using the same credential vault pattern as the integration hub guide) and include them in Edge Function requests. This crosses into territory where Terms of Service violations are more likely. Ensure you have explicit permission from the site operator before scraping authenticated content.

What happens if the target site changes its HTML structure?

Your CSS selectors return empty strings silently. Add a validation step: after extracting data, check if required fields are empty. If a required field is empty, mark the job as 'failed' with an error message like 'Selector .price-tag returned empty result — the page structure may have changed.' This alerts you to re-check and update your selectors.

How many scrape jobs can I run per day on the free plan?

Supabase Edge Functions on the Free plan allow 500,000 invocations per month and each invocation can run for up to 150 seconds. For scraping, assume each job takes 5-15 seconds average. At 5 concurrent jobs per minute via pg_cron, you can process roughly 7,200 jobs per day. The binding constraint is usually the target site's rate limits, not Supabase's. Firecrawl's free tier adds a limit of 500 credits per month.

Can I use this to monitor competitor prices?

Technically yes, but with caveats. Price monitoring via scraping is common but operates in a legal gray area depending on the site's Terms of Service and jurisdiction. Many e-commerce sites explicitly prohibit automated price checking in their ToS. If you need price data reliably, consider official data partners or price intelligence APIs that provide structured product data through legitimate licensing agreements.

How do I extract data from PDFs instead of HTML pages?

PDF extraction requires a different approach — cheerio cannot parse PDFs. For PDFs, the Edge Function should detect the content type from the response headers (application/pdf), upload the raw PDF to Supabase Storage, and then call a PDF parsing service API. Ask Lovable to add a pdf_extraction_mode flag to scrape_jobs that changes the worker logic to use a PDF parsing API instead of cheerio.
