Apify Puppeteer Scraper
How it works
Puppeteer Scraper is the most powerful scraper tool in our arsenal (aside from developing your own actors). It uses the Puppeteer library to programmatically control a headless Chrome browser and can make it do almost anything. If using the Web Scraper does not cut it, Puppeteer Scraper is what you need.
Puppeteer is a Node.js library, so knowledge of Node.js and its paradigms is expected when working with the Puppeteer Scraper.
If you need either a faster or a simpler tool, see Cheerio Scraper for speed or Web Scraper for simplicity.
Input
Input is provided via the pre-configured UI. See the tooltips for more info on the available options.
Page function
Page function is a single JavaScript function that enables the user to control the Scraper's operation,
manipulate the visited pages and extract data as needed. It is invoked with a context object
containing the following properties:
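For orientation, here is a minimal sketch of such a pageFunction, destructuring the context properties described in the tables below (the property names come from this document; the extraction logic itself is illustrative):

```javascript
// A minimal pageFunction sketch. The context properties used here are
// the ones documented below; the extraction itself is illustrative.
async function pageFunction(context) {
    const { request, page, log } = context;

    log.info(`Scraping ${request.url}`);

    // Use the full Puppeteer Page API to work with the live page.
    const title = await page.title();

    // Returned objects are saved to the default dataset (see Output below).
    return { url: request.url, title };
}
```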
context
The following tables describe the context object in more detail.
Data structures
| Argument | Type | Description |
|----------|------|-------------|
| `input` | `Object` | Input as it was received from the UI. Each `pageFunction` invocation gets a fresh copy, and you cannot modify the input by changing the values in this object. |
| `env` | `Object` | A map of all the relevant environment variables that you may want to use. See the `Apify.getEnv()` function for a preview of the structure and full documentation. |
| `customData` | `Object` | Since the input UI is fixed, it does not support adding other fields that may be needed for specific use cases. If you need to pass arbitrary data to the scraper, use the Custom data input field and its contents will be available under the `customData` context key. |
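As an illustration, a pageFunction might combine all three (a sketch; the `maxPrice` field in `customData` is a hypothetical example, not part of the scraper):

```javascript
async function pageFunction(context) {
    const { input, env, customData, log } = context;

    // input mirrors the UI configuration; each invocation gets a fresh, read-only copy.
    log.info(`Run configured with ${input.startUrls.length} start URL(s).`);

    // env exposes environment details such as the current actor run ID.
    log.info(`Actor run ID: ${env.actorRunId}`);

    // customData carries arbitrary JSON from the Custom data input field.
    // maxPrice is a hypothetical field used purely for illustration.
    const maxPrice = customData.maxPrice || Infinity;
    log.info(`Ignoring results priced above ${maxPrice}`);

    return null; // nothing to save for this page in this sketch
}
```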
Functions
The context object provides several helper functions that make scraping and saving data easier
and more streamlined. All of the functions are async, so make sure to use await with their invocations.
| Function | Arguments | Description |
|----------|-----------|-------------|
| `setValue` | `(key: string, data: Object, options: Object)` | To save data to the default key-value store, you can use the `setValue` function. See the full documentation: `Apify.setValue()` function. |
| `getValue` | `(key: string)` | To read data from the default key-value store, you can use the `getValue` function. See the full documentation: `Apify.getValue()` function. |
| `saveSnapshot` | | A helper function that saves a snapshot of the current page's HTML and its screenshot into the default key-value store. Each snapshot overwrites the previous one, and invocations are throttled to at most one per 2 seconds to prevent abuse, so make sure you don't call it for every single request. You can find the screenshot under the `SNAPSHOT-SCREENSHOT` key and the HTML under the `SNAPSHOT-HTML` key. |
| `skipLinks` | | With each invocation of the `pageFunction`, the scraper attempts to extract new URLs from the page using the Link selector and Pseudo URLs provided in the input UI. If you want to prevent this behavior in certain cases, call the `skipLinks` function and no URLs will be added to the queue for the given page. |
| `enqueueRequest` | `(request: Request\|Object, options: Object)` | To enqueue a specific URL manually, instead of relying on the combination of a Link selector and a Pseudo URL, use the `enqueueRequest` function. It accepts a plain object as an argument that needs to have the structure of a `Request` object. But frankly, you just need a URL: `{ url: 'https://www.example.com' }` |
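A sketch combining these helpers in a single pageFunction (the URL and store keys are illustrative):

```javascript
async function pageFunction(context) {
    const { request, page, log } = context;

    // Persist a value under a custom key in the default key-value store...
    await context.setValue('LAST-VISITED', {
        url: request.url,
        at: new Date().toISOString(),
    });

    // ...and read it back, possibly in a later pageFunction invocation.
    const lastVisited = await context.getValue('LAST-VISITED');
    log.info(`Last visited: ${lastVisited && lastVisited.url}`);

    // Save a throttled HTML + screenshot snapshot of the current page.
    await context.saveSnapshot();

    // Prevent automatic link extraction for this page...
    await context.skipLinks();

    // ...and enqueue one specific URL manually instead.
    await context.enqueueRequest({ url: 'https://www.example.com' });

    return { url: request.url, title: await page.title() };
}
```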
Class instances and namespaces
The following are either class instances or namespaces, which is just a way of saying objects with functions on them.
Page
Reference to the Puppeteer Page object, which enables you to use the full power of Puppeteer in your Page functions.
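For example, standard Puppeteer Page calls work directly on context.page (the selector below is hypothetical):

```javascript
async function pageFunction(context) {
    const { page } = context;

    // Ordinary Puppeteer Page APIs are available on context.page.
    // '.product-title' is a hypothetical selector used for illustration.
    await page.waitForSelector('.product-title');
    const title = await page.$eval('.product-title', (el) => el.textContent.trim());

    return { title };
}
```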
Request
Apify uses a request object to represent metadata about the currently crawled page, such as its URL or the number of retries. See the Request class for a preview of the structure and full documentation.
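A common pattern is to branch on the request's metadata (a sketch; the 'DETAIL' label in userData is a hypothetical convention, not something the scraper sets for you):

```javascript
async function pageFunction(context) {
    const { request, log } = context;

    // Metadata about the currently crawled page.
    log.info(`URL: ${request.url}, retries so far: ${request.retryCount}`);

    // userData travels along with the request through the queue.
    // 'DETAIL' is a hypothetical label used for illustration.
    if (request.userData.label === 'DETAIL') {
        // ... extract detail-page data here
    }

    return null;
}
```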
Response
The response object is produced by Puppeteer. Currently, we only pass the HTTP status code
and the response headers to the context.
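Assuming the `{ status, headers }` shape the paragraph above describes, a sketch:

```javascript
async function pageFunction(context) {
    const { response, log } = context;

    // Only the status code and headers are exposed on context.response.
    log.info(`Status: ${response.status}`);
    log.info(`Content-Type: ${response.headers['content-type']}`);

    if (response.status >= 400) {
        return null; // the page did not load successfully; skip extraction
    }

    return { status: response.status };
}
```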
PuppeteerPool
A reference to the running instance of the PuppeteerPool class. See the Apify SDK docs for more information.
AutoscaledPool
A reference to the running instance of the AutoscaledPool class. See the Apify SDK docs for more information.
Global Store
globalStore represents an instance of a very simple in-memory store that is not scoped to the individual pageFunction invocation. This enables you to easily share global data such as API responses, tokens and other objects between invocations. Since the stored data need to cross from the browser to the Node.js process, it cannot be just any kind of data; only JSON-stringifiable objects can be stored. You cannot store DOM objects, functions, circular objects and so on. globalStore in Puppeteer Scraper is just a Map.
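Since it behaves like a plain Map, usage is synchronous (a sketch; the 'apiToken' key and '#token' selector are hypothetical):

```javascript
async function pageFunction(context) {
    const { globalStore, page } = context;

    // globalStore persists across pageFunction invocations within one run.
    // 'apiToken' is a hypothetical key used for illustration.
    if (!globalStore.has('apiToken')) {
        // Only JSON-stringifiable values are safe to store.
        const token = await page.$eval('#token', (el) => el.value);
        globalStore.set('apiToken', token);
    }

    return { tokenLength: globalStore.get('apiToken').length };
}
```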
Log
log is a reference to Apify.utils.log. You can use any of the logging methods such as log.info or log.exception. log.debug is special, because you can toggle the visibility of those messages in the scraper's log with the Debug log input option.
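A brief sketch of the logging levels:

```javascript
async function pageFunction(context) {
    const { log } = context;

    log.info('Always visible in the run log.');

    // Only shown when the Debug log input option is enabled.
    log.debug('Extra detail for troubleshooting.');

    try {
        throw new Error('Something broke');
    } catch (err) {
        // log.exception records the error together with a message.
        log.exception(err, 'Extraction failed');
    }

    return null;
}
```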
Apify
A reference to the full power of the Apify SDK. See the docs for more information and all the available functions and classes.
Caution: Since we're making the full SDK available, and Puppeteer Scraper
runs using the SDK, some edge case manipulations may lead to inconsistencies.
Use Apify with caution and avoid making global changes unless you know what you're doing.
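For instance, you might write selected results to a named dataset instead of the default one (a sketch assuming the Apify SDK's Apify.openDataset(); the dataset name is illustrative):

```javascript
async function pageFunction(context) {
    const { Apify, request, page } = context;

    // Opens (or creates) a named dataset separate from the default output.
    // 'my-named-results' is an illustrative name.
    const dataset = await Apify.openDataset('my-named-results');
    await dataset.pushData({ url: request.url, title: await page.title() });

    return null;
}
```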
Output
Output is a dataset containing extracted data for each scraped page. To save data into
the dataset, return an Object or an Object[] from the pageFunction.
Dataset
For each of the scraped URLs, the dataset contains an object with results and some metadata.
If you were scraping the HTML <title> of Apify and returning an object with it from the pageFunction, the stored record would combine your data with the scraper's metadata.
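A sketch of both (the metadata field names and values below follow typical Apify scraper output and are illustrative, not exact):

```javascript
async function pageFunction(context) {
    return {
        title: 'Web Scraping, Data Extraction and Automation - Apify',
    };
}
```

The corresponding dataset record would then look something like this:

```json
{
  "title": "Web Scraping, Data Extraction and Automation - Apify",
  "#error": false,
  "#debug": {
    "requestId": "fvwscO2UJLdr10B",
    "url": "https://apify.com",
    "loadedUrl": "https://apify.com/",
    "method": "GET",
    "retryCount": 0,
    "errorMessages": null
  }
}
```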
You can remove the metadata (and items containing only metadata) from the output by selecting the Clean items option when downloading the dataset.
The result will look like this:
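Continuing the illustrative sketch, only the user data remain:

```json
{
  "title": "Web Scraping, Data Extraction and Automation - Apify"
}
```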