Puppeteer Data Extractor
Extracts data from websites using headless Chrome and the Puppeteer browser automation library, running server-side Node.js code that you provide. This data collector is an alternative to mapiok/web-data extractor that gives you finer control over the process. It supports both recursive extraction and a fixed list of URLs, and it can log in to websites.
What This Tool Does
Use it to extract structured data from any website: provide a URL and a custom extraction script, and the tool returns the data you need in JSON format.
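As a rough illustration of what such a script can look like, the sketch below assumes the tool calls your script with a `context` object that exposes a Puppeteer `page` and the current `request`. This document does not spell out the exact signature, so treat `context`, `page`, and `request` as assumptions; the Puppeteer calls themselves (`page.title()`, `page.$$eval()`, `page.evaluate()`) are standard.

```javascript
// Hypothetical page function: `context.page` is assumed to be a Puppeteer Page
// and `context.request.url` the URL currently being processed.
async function pageFunction(context) {
  const { page, request } = context;

  // Page title as reported by the browser.
  const title = await page.title();

  // Collect link targets; the callback runs inside the page, where
  // `a.href` is already resolved to an absolute URL.
  const links = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => a.href)
  );

  // Visible text of the document body, also evaluated inside the page.
  const text = await page.evaluate(() => document.body.innerText);

  // Whatever is returned here becomes the JSON record for this page.
  return { url: request.url, title, text, links };
}
```

Returning a plain object keeps the result serializable; anything that cannot be expressed as JSON (DOM nodes, functions) should be reduced to strings or numbers inside the page first.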
Use Cases
- Data Extraction
- Developer Tools
Data Fields
Output fields depend on your extraction script. Common patterns include:
| Field | Type | Description |
|---|---|---|
| url | string | URL that was scraped |
| title | string | Page title |
| html | string | Raw HTML content (if requested) |
| text | string | Extracted plain text |
| links | array | Links found on the page |
| data | object | Custom fields extracted by your script |
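Because the output shape is defined by your own script, it can help to sanity-check records before using them downstream. The sketch below is one possible check against the types in the table; `FIELD_TYPES` and `findTypeMismatches` are illustrative names, not part of the tool.

```javascript
// Expected types for the commonly used fields listed in the table.
const FIELD_TYPES = {
  url: 'string',
  title: 'string',
  html: 'string',
  text: 'string',
  links: 'array',
  data: 'object',
};

// Returns the names of fields that are present but have an unexpected type.
// Absent fields are ignored, since the output depends on the script.
function findTypeMismatches(record) {
  return Object.entries(FIELD_TYPES)
    .filter(([field, expected]) => {
      if (!(field in record)) return false;
      const value = record[field];
      if (expected === 'array') return !Array.isArray(value);
      if (expected === 'object') {
        // Plain objects only: reject null and arrays.
        return value === null || typeof value !== 'object' || Array.isArray(value);
      }
      return typeof value !== expected;
    })
    .map(([field]) => field);
}

const sample = { url: 'https://example.com', title: 'Example Domain', links: 'oops' };
console.log(findTypeMismatches(sample)); // → [ 'links' ]
```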
Example Request
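{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { return { url: context.request.url, title: await context.page.title() }; }",
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}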
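A request like this can also be assembled from code. The sketch below rests on two assumptions that this page does not confirm: that `startUrls` accepts a list of `{ url }` objects, and that `pageFunction` carries the script's source as a string. `buildInput` is a hypothetical helper; check the tool's input schema before relying on it.

```javascript
// Hypothetical helper: builds a request body from a list of URLs and a
// page function. Assumes `startUrls` takes `{ url }` objects and that
// `pageFunction` is sent as source text.
function buildInput(urls, pageFunction) {
  return {
    startUrls: urls.map((url) => ({ url })),
    // Function.prototype.toString() yields the function's source code.
    pageFunction: pageFunction.toString(),
    proxyConfiguration: { useApifyProxy: true },
  };
}

// Minimal script: returns the URL and the page title.
async function pageFunction(context) {
  return { url: context.request.url, title: await context.page.title() };
}

const input = buildInput(['https://example.com'], pageFunction);
console.log(JSON.stringify(input, null, 2));
```

Writing the script as a real function and serializing it keeps it lintable and testable, instead of maintaining code inside a JSON string by hand.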
Example Response
{
"url": "https://example.com",
"title": "Example Domain",
"text": "This domain is for use in illustrative examples...",
"links": ["https://www.iana.org/domains/reserved"]
}
Limits and Tips
- JavaScript-heavy pages require a browser-based data extractor such as this one, which drives headless Chrome through Puppeteer. For static HTML, jsdom is faster.
- Processing time depends on page load speed and script complexity, typically 10–60 seconds per page.
- Results are cached for up to 15 minutes. Re-running the same URL may return cached data.
- Respect robots.txt and the target site's terms of service when extracting.
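Given the 10–60 second range per page, callers may want a client-side time limit so a slow page does not block a whole batch. The sketch below is a generic `Promise.race` timeout wrapper; `withTimeout` is an illustrative name and `runExtractor` in the usage comment is a hypothetical stand-in for however you invoke the tool.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared in either case.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch (hypothetical `runExtractor`); 90 s leaves headroom over
// the 60 s upper end of the typical range:
// const result = await withTimeout(runExtractor(input), 90000);
```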