Copy of https://github.com/apifytech/actor-crawler-puppeteer
Apify Crawler Puppeteer
How it works
Crawler Puppeteer is the most powerful crawler tool in our arsenal (aside from developing your own actors). It uses the Puppeteer library to programmatically control a headless Chrome browser, which lets it do almost anything. If the plain Crawler does not cut it, Crawler Puppeteer is what you need.
The downside is that Puppeteer is a Node.js library, so knowledge of Node.js and its paradigms is expected when working with Crawler Puppeteer.
If you need a more performant or a simpler tool, see crawler-cheerio for unmatched performance, or crawler for a plain old JavaScript tool.
Input
Input is provided via the pre-configured UI. See the tooltips for more info on the available options.
Page function
Page function is a single JavaScript function that enables the user to control the Crawler's operation, manipulate the crawled pages and extract data as needed. It is invoked with a `context` object containing the following properties:
```javascript
const context = {
    // USEFUL DATA
    input, // Unaltered original input as parsed from the UI
    env, // Contains information about the run such as actorId or runId
    customData, // Value of the 'Custom data' Crawler option.
    request, // Apify.Request object.
    response, // Response object holding the status code and headers.

    // EXPOSED FUNCTIONS
    saveSnapshot, // Saves a screenshot and full HTML of the current page to the key value store.
    skipLinks, // Prevents enqueueing more links via Pseudo URLs on the current page.
    skipOutput, // Prevents saving the return value of the pageFunction to the default dataset.
    enqueuePage, // Adds a page to the request queue.
    jQuery, // A reference to the jQuery $ function (if injectJQuery was used).

    // EXPOSED OBJECTS
    globalStore, // Represents an in memory store that can be used to share data across pageFunction invocations.
    requestList, // Reference to the run's default Apify.RequestList.
    requestQueue, // Reference to the run's default Apify.RequestQueue.
    dataset, // Reference to the run's default Apify.Dataset.
    keyValueStore, // Reference to the run's default Apify.KeyValueStore.
    log, // Reference to Apify.utils.log
    underscoreJs, // A reference to the Underscore _ object (if injectUnderscore was used).
};
```
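For illustration, a minimal page function might look like the sketch below. It assumes the function receives the `context` object as its single argument, that the `injectJQuery` option is enabled, and that the `h1` selector matches something useful on the crawled pages:

```javascript
// A minimal page function sketch. The 'h1' selector and the returned fields
// are illustrative only; adapt them to the pages you actually crawl.
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery; // available only when the injectJQuery option is enabled

    log.info(`Processing ${request.url}`);

    // The returned object is saved to the default dataset (unless skipOutput is called).
    return {
        url: request.url,
        heading: $('h1').text(),
    };
}
```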
The following tables describe the `context` object in more detail.
Data structures:
| Argument | Type | Description |
|----------|------|-------------|
| `input` | `string` | Raw input as it was received from the UI, represented as a string for immutability. You can `JSON.parse()` it to get the values of individual configuration options. |
| `env` | `Object` | A map of all the relevant environment variables that you may want to use. See the `Apify.getEnv()` function for a preview of the structure and full documentation. |
| `customData` | `Object` | Since the input UI is fixed, it does not support adding of other fields that may be needed for specific use cases. If you need to pass arbitrary data to the crawler, use the Custom data input field and its contents will be available under the `customData` context key. |
| `request` | `Request` | Apify uses a request object to represent metadata about the currently crawled page, such as its URL or the number of retries. See the `Request` class for a preview of the structure and full documentation. |
| `response` | `{status: number, headers: Object}` | The HTTP response object is produced by Puppeteer. Currently, we only pass the HTTP status code and the response headers to the `context`. |
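As a quick sketch of how these data structures might be used together in a page function (assuming the input contains a `startUrls` array as in the default UI, and that a hypothetical `maxDepth` field was passed via the Custom data option):

```javascript
async function pageFunction(context) {
    const { request, customData } = context;

    // The raw input arrives as a string, so parse it to inspect the run's configuration.
    const input = JSON.parse(context.input);

    return {
        url: request.url,
        retryCount: request.retryCount,
        // 'startUrls' is assumed to be part of the actor's input.
        startUrlCount: Array.isArray(input.startUrls) ? input.startUrls.length : 0,
        // 'maxDepth' is a hypothetical field passed via the Custom data input option.
        maxDepth: customData ? customData.maxDepth : null,
    };
}
```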
Functions:
| Argument | Type | Description |
|----------|------|-------------|
| `saveSnapshot` | `Function` | A helper function that enables saving a snapshot of the current page's HTML and its screenshot into the default key value store. Each snapshot overwrites the previous one, and invocations are throttled to at most one per 2 seconds to prevent abuse, so make sure you don't call it for every single request. You can find the screenshot under the `SNAPSHOT-SCREENSHOT` key and the HTML under the `SNAPSHOT-HTML` key. |
| `skipLinks` | `Function` | With each invocation of the `pageFunction`, the crawler attempts to extract new URLs from the page using the Link selector and Pseudo URLs provided in the input UI. If you want to prevent this behavior in certain cases, call the `skipLinks` function and no URLs will be added to the queue for the given page. |
| `skipOutput` | `Function` | Since each return value of the `pageFunction` is saved to the default dataset, this provides a way of overriding that functionality. Just call `skipOutput` and the result of the current invocation will not be saved to the dataset. |
| `enqueuePage` | `Function` | To enqueue a specific URL manually, instead of relying on a combination of a Link selector and a Pseudo URL, use the `enqueuePage` function. It accepts a plain object as argument that needs to have the structure to construct a `Request` object. But frankly, you just need a URL: `{ url: 'https://www.example.com' }` |
| `jQuery` | `Function` | To make DOM manipulation within the page easier, you may choose the `injectJQuery` option in the UI and all the crawled pages will have an instance of the jQuery library available. However, since we do not want to modify the page in any way, we don't inject it into the global `$` object as you may be used to, but instead make it available in `context`. Feel free to `const $ = context.jQuery` to get the familiar notation. |

The sketch below shows several of these functions used together.
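This is only a sketch: the selectors are illustrative, `injectJQuery` is assumed to be enabled, and the helpers are treated as asynchronous (awaiting them is the safe option either way):

```javascript
async function pageFunction(context) {
    const { request, jQuery, saveSnapshot, skipLinks, skipOutput, enqueuePage } = context;
    const $ = jQuery;

    // Manually enqueue a URL found on the page (illustrative selector).
    // .prop('href') resolves the link to an absolute URL.
    const nextUrl = $('a.next-page').prop('href');
    if (nextUrl) await enqueuePage({ url: nextUrl });

    // Prevent automatic enqueueing via Link selector and Pseudo URLs for this page.
    await skipLinks();

    // Keep a snapshot of pages that needed a retry, for later debugging.
    if (request.retryCount > 0) await saveSnapshot();

    const title = $('title').text();
    if (!title) {
        // Nothing useful found, so don't save an empty record to the dataset.
        await skipOutput();
        return;
    }

    return { url: request.url, title };
}
```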
Class instances:
Global Store
`globalStore` represents an instance of a very simple in-memory store that is not scoped to the individual `pageFunction` invocation. This enables you to easily share global data such as API responses, tokens and other values. Since the stored data need to cross from the browser to the Node.js process, they cannot be arbitrary values, but always need to be JSON stringifiable. Therefore, you cannot store DOM objects, live class instances, functions etc. Only a JSON representation of the passed object will be stored, with all the relevant limitations.
| Method | Return Type | Description |
|--------|-------------|-------------|
| `get(key:string)` | `Promise<Object>` | Retrieves a JSON serializable value from the global store using the provided key. |
| `set(key:string, value:Object)` | `Promise` | Saves a JSON serializable value to the global store using the provided key. |
| `size()` | `Promise<number>` | Returns the current number of values in the global store. |
| `list()` | `Promise<Array>` | Returns all the keys currently stored in the global store. |
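A sketch of sharing a value across page function invocations via `globalStore` (the `itemCount` key is hypothetical):

```javascript
async function pageFunction(context) {
    const { globalStore, request } = context;

    // Read a shared counter; defaults to zero on the first invocation.
    const itemCount = (await globalStore.get('itemCount')) || 0;

    // ... extract data from the page ...

    // Store the updated counter so later invocations can see it.
    await globalStore.set('itemCount', itemCount + 1);

    return { url: request.url, itemCount };
}
```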
Output
Output is a dataset containing extracted data for each scraped page.
Dataset
For each of the scraped URLs, the dataset contains an object with results and some metadata. If you were scraping the HTML `<title>` of IANA, it would look like this:
```json
{
    "title": "Internet Assigned Numbers Authority",
    "#error": false,
    "#debug": {
        "url": "https://www.iana.org/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "requestId": "e2Hd517QWfF4tVh"
    }
}
```
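A page function that returns just the title could produce a record like the one above; the `#error` and `#debug` fields are added by the crawler, not by your code. A sketch, assuming `injectJQuery` is enabled:

```javascript
async function pageFunction(context) {
    const $ = context.jQuery;
    // Only the returned fields come from your code; the "#error" and "#debug"
    // metadata are appended to the dataset record automatically.
    return {
        title: $('title').text(),
    };
}
```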
The metadata are prefixed with a `#`. Soon you will be able to exclude the metadata from the results by providing an API flag.