Puppeteer Data Extractor — Puppeteer Extractor

    Extracts data from websites using headless Chrome and the Puppeteer library, running server-side Node.js code that you provide. This data collector is an alternative to mapiok/web-data extractor that gives you finer control over the process. It supports both recursive extraction and a fixed list of URLs, and it can log in to websites.

    1 credit per request
    ~30s
    9 runs

    Features

    • Headless Browser
    • JSON/CSV Export
    • API Access
    • Scalable Automation

    What This Tool Does

    Puppeteer Data Extractor extracts data from websites using headless Chrome and the Puppeteer library, running server-side Node.js code that you provide. It is an alternative to mapiok/web-data extractor that gives you finer control over the process. It supports both recursive extraction and a fixed list of URLs, and it can log in to websites.

    Use it to extract structured data from any website: provide a URL and a custom extraction script, and the tool returns the data you need in JSON format.
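
    As a rough illustration of what such a script might look like, the sketch below collects the title, visible text, and links from a page. It assumes the script receives a context object holding a Puppeteer `page` and a `request` with the current URL; that context shape is an assumption, so check the tool's input documentation for the exact signature. `page.title()`, `page.evaluate()`, and `page.$$eval()` are standard Puppeteer methods.

```javascript
// Sketch of an extraction script. The `{ page, request }` context shape is an
// assumption; `page` is expected to behave like a Puppeteer Page object.
async function pageFunction({ page, request }) {
  // Page title as reported by the browser.
  const title = await page.title();

  // Visible text of the document body, read inside the page context.
  const text = await page.evaluate(() => document.body.innerText);

  // URLs of all anchors that carry an href attribute.
  const links = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => a.href)
  );

  return { url: request.url, title, text, links };
}
```

    The returned object uses the same field names as the common output fields, so each page would produce one flat JSON record.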

    Use Cases

    • Data Extraction
    • Developer Tools

    Data Fields

    Output fields depend on your extraction script. Common patterns include:

    Field   Type    Description
    url     string  URL that was scraped
    title   string  Page title
    html    string  Raw HTML content (if requested)
    text    string  Extracted plain text
    links   array   Links found on the page
    data    object  Custom fields extracted by your script
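
    If you want to sanity-check records against the common fields listed above, a small helper along these lines may be useful. It is only a sketch: the names `COMMON_FIELDS` and `checkRecord` are hypothetical, and it treats every field as optional because the actual output depends on your script.

```javascript
// Expected types for the common fields listed above.
const COMMON_FIELDS = {
  url: 'string',
  title: 'string',
  html: 'string',
  text: 'string',
  links: 'array',
  data: 'object',
};

// Returns a list of problems; an empty list means every field that is
// present has the expected type. Missing fields are not reported.
function checkRecord(record) {
  const problems = [];
  for (const [field, expected] of Object.entries(COMMON_FIELDS)) {
    if (!(field in record)) continue;
    const value = record[field];
    const actual = Array.isArray(value)
      ? 'array'
      : value === null
        ? 'null'
        : typeof value;
    if (actual !== expected) {
      problems.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}
```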

    Example Request

    {
      "startUrls": "https://example.com",
      "pageFunction": "async function pageFunction(context) { /* your extraction script */ }",
      "proxyConfiguration": {
        "useApifyProxy": true
      }
    }
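
    Because the request is JSON, the extraction script has to travel as text. The sketch below shows one way to build such a request body in Node.js. It assumes that `pageFunction` is meant to carry the script's source code as a string; the helper name `buildRequest` and the sample script are hypothetical. `Function.prototype.toString()` returns the function's source text.

```javascript
// Builds a request body shaped like the example above.
// Assumption: `pageFunction` is sent as the source text of the script.
function buildRequest(startUrl, scriptFn) {
  return {
    startUrls: startUrl,
    pageFunction: scriptFn.toString(),
    proxyConfiguration: { useApifyProxy: true },
  };
}

// Hypothetical script used only for illustration; it is never called here.
async function pageFunction({ page, request }) {
  return { url: request.url, title: await page.title() };
}

const body = JSON.stringify(
  buildRequest('https://example.com', pageFunction),
  null,
  2
);
console.log(body);
```

    Writing the script as a real function and serializing it keeps it lintable and testable, rather than maintaining it by hand inside a JSON string.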
    

    Example Response

    {
     "url": "https://example.com",
     "title": "Example Domain",
     "text": "This domain is for use in illustrative examples...",
     "links": ["https://www.iana.org/domains/reserved"]
    }
    

    Limits and Tips

    • JavaScript-heavy pages require a browser-based data extractor (such as this Puppeteer-based tool). For static HTML, jsdom is faster.
    • Processing time depends on page load speed and script complexity — typically 10–60 seconds per page.
    • Results are cached for up to 15 minutes. Re-running the same URL may return cached data.
    • Respect robots.txt and the target site's terms of service when extracting.
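
    The caching tip above can be turned into a simple client-side check before re-running a URL. This is a minimal sketch that assumes you record the time of your last run yourself; `CACHE_WINDOW_MS` and `mayBeCached` are hypothetical names, and the 15-minute window is taken from the tip rather than from any API guarantee.

```javascript
// "Up to 15 minutes", per the caching tip above.
const CACHE_WINDOW_MS = 15 * 60 * 1000;

// True if a repeat request for the same URL could still be served from cache.
// `lastRunAt` and `now` are Date objects; `now` defaults to the current time.
function mayBeCached(lastRunAt, now = new Date()) {
  const elapsed = now.getTime() - lastRunAt.getTime();
  return elapsed >= 0 && elapsed < CACHE_WINDOW_MS;
}
```

    If the check returns true and you need fresh data, waiting until the window has passed is the conservative option.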
