Cheerio Scraper
Apify Cheerio Scraper
How it works
Cheerio Scraper is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages and then parsing and inspecting the HTML using the Cheerio library. It's blazing fast.
Cheerio is a server-side version of the popular jQuery library. It does not run in the browser; instead, it constructs a DOM from an HTML string and provides an API for working with that DOM.
Cheerio Scraper is ideal for scraping websites that do not rely on client-side JavaScript to serve their content. It can be as much as 20 times faster than using a full browser solution such as Puppeteer.
Input
Input is provided via the pre-configured UI. See the tooltips for more info on the available options.
Page function
The page function is a single JavaScript function that lets you control the scraper's operation, manipulate the visited pages, and extract data as needed. It is invoked with a `context` object containing the following properties:
    const context = {
        // USEFUL DATA
        input, // Unaltered original input as parsed from the UI.
        env, // Contains information about the run, such as actorId or runId.
        customData, // Value of the 'Custom data' scraper option.
        html, // Raw HTML of the loaded page.

        // EXPOSED OBJECTS
        request, // Apify.Request object.
        response, // Response object holding the status code and headers.
        autoscaledPool, // Reference to the Apify.AutoscaledPool instance managing concurrency.
        globalStore, // An in-memory store that can be used to share data across pageFunction invocations.
        log, // Reference to Apify.utils.log.
        Apify, // Reference to the full power of the Apify SDK.

        // EXPOSED FUNCTIONS
        $, // Reference to Cheerio.
        setValue, // Reference to the Apify.setValue() function.
        getValue, // Reference to the Apify.getValue() function.
        saveSnapshot, // Saves the full HTML of the current page to the key-value store.
        skipLinks, // Prevents enqueueing more links via Pseudo URLs on the current page.
        enqueueRequest, // Adds a page to the request queue.
    }
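As a rough illustration of how these properties come together, here is a minimal page function sketch. It assumes the page has a `<title>` element; the returned field names (`url`, `title`) are arbitrary choices for this example, not something the scraper requires.

```javascript
// A minimal page function sketch. The scraper invokes it with the
// context object described above.
async function pageFunction(context) {
    const { $, request, log } = context;

    // $ is the Cheerio function, so jQuery-style selectors work here.
    const title = $('title').text().trim();

    log.info(`Scraped ${request.url}`);

    // Field names below are illustrative; return whatever shape you need.
    return {
        url: request.url,
        title,
    };
}
```

Only the properties the function actually needs are destructured from `context`; the rest can be ignored.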
context
The following tables describe the `context` object in more detail.
Data structures
| Argument | Type | Description |
| --- | --- | --- |
| input | Object | Input as it was received from the UI. Each pageFunction invocation gets a fresh copy, so you cannot modify the input by changing the values in this object. |
| env | Object | A map of all the relevant environment variables that you may want to use. See the `Apify.getEnv()` function for a preview of the structure and full documentation. |
| customData | Object | Since the input UI is fixed, it cannot accommodate extra fields that a specific use case may need. If you need to pass arbitrary data to the scraper, use the Custom data input field; its contents will be available under the customData context key. |
| html | string | The raw, unaltered HTML string as received from the target website. This is useful in cases where Cheerio is unable to parse the HTML. The HTML returned from Cheerio might also be different, with invalid tags removed, so use this property for debugging the differences. |
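The sketch below shows one way `customData` and `html` might be used together. The `keyword` field is a hypothetical value you would place in the Custom data input yourself, and the returned field names are made up for illustration.

```javascript
// Sketch: using customData and the raw html string.
// `keyword` is a hypothetical field supplied through the Custom data input.
async function pageFunction(context) {
    const { customData, html, $, request } = context;

    // customData may be empty, so fall back to a default keyword.
    const keyword = (customData && customData.keyword) || 'apify';

    // html is the unaltered response body; $.html() serializes Cheerio's parsed DOM.
    const parsedHtml = $.html();

    return {
        url: request.url,
        keywordFound: html.toLowerCase().includes(keyword.toLowerCase()),
        rawLength: html.length,
        // May differ from rawLength, since the parsed HTML can differ from the raw one.
        parsedLength: parsedHtml.length,
    };
}
```

Comparing the two lengths is only a quick signal that Cheerio changed something; to see what changed, diff the two strings.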
Functions
The `context` object provides several helper functions that make scraping and saving data easier and more streamlined. Apart from `$`, the functions are `async`, so make sure to use `await` with their invocations.
| Function | Arguments | Description |
| --- | --- | --- |
| $ | (selector, [context], [root]) | Reference to the Cheerio function, which enables you to work with the page's HTML just as `jQuery` would. |
| setValue | (key: string, data: Object, options: Object) | To save data to the default key-value store, you can use the setValue function. See the full documentation of the `Apify.setValue()` function. |
| getValue | (key: string) | To read data from the default key-value store, you can use the getValue function. See the full documentation of the `Apify.getValue()` function. |
| saveSnapshot | | A helper function that saves a snapshot of the current page's HTML, as parsed by Cheerio, into the default key-value store. Each snapshot overwrites the previous one, and calls are throttled if made more than once in 2 seconds to prevent abuse, so make sure you don't call it for every single request. You can find the HTML under the SNAPSHOT-HTML key. |
| skipLinks | | With each invocation of the pageFunction, the scraper attempts to extract new URLs from the page using the Link selector and Pseudo URLs provided in the input UI. If you want to prevent this behavior in certain cases, call the skipLinks function and no URLs will be added to the queue for the given page. |
| enqueueRequest | (request: Request\|Object, options: Object) | To enqueue a specific URL manually, instead of automatically through a combination of a Link selector and a Pseudo URL, use the enqueueRequest function. It accepts a plain object with the structure needed to construct a Request object. But frankly, you just need a URL: `{ url: 'https://www.example.com' }` |
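A sketch of how the helper functions above might be combined to steer the crawl by hand. The URLs, the `/archive` condition, and the `PAGE-COUNT` key are invented for this example.

```javascript
// Sketch: controlling the crawl manually with the helper functions.
// The URLs and the 'PAGE-COUNT' key are made up for illustration.
async function pageFunction(context) {
    const { request, enqueueRequest, skipLinks, getValue, setValue } = context;

    // Keep a running count of processed pages in the default key-value store.
    // A missing key yields a falsy value, so start from 0 in that case.
    const previousCount = (await getValue('PAGE-COUNT')) || 0;
    await setValue('PAGE-COUNT', previousCount + 1);

    if (request.url.endsWith('/archive')) {
        // Stop the Link selector and Pseudo URLs from enqueueing anything from this page...
        await skipLinks();
        // ...and add only the one URL we care about.
        await enqueueRequest({ url: 'https://www.example.com/archive/latest' });
    }

    return { url: request.url, pagesSoFar: previousCount + 1 };
}
```

Pages may be processed concurrently, so a read-then-write counter like this can undercount. It is shown only to illustrate the `getValue` and `setValue` calls.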
Class instances and namespaces
The following are either class instances or namespaces, which is just a way of saying objects with functions on them.
Request
Apify uses a `request` object to represent metadata about the currently crawled page, such as its URL or the number of retries. See the `Request` class for a preview of the structure and full documentation.
Response
The `response` object is produced by the HTTP call. Currently, we only pass the HTTP status code and the response headers to the `context`.
    {
        status: Number,
        headers: Object,
    }
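A short sketch of reading the two `response` fields inside a page function. It assumes header names arrive lower-cased, as Node.js delivers incoming HTTP headers; the returned field names are illustrative.

```javascript
// Sketch: inspecting the response before extracting data.
async function pageFunction(context) {
    const { request, response, log } = context;

    // Assumption: header names are lower-cased, as Node.js does for incoming headers.
    const contentType = response.headers['content-type'] || '';

    if (response.status !== 200) {
        log.warning(`Unexpected status ${response.status} for ${request.url}`);
    }

    return {
        url: request.url,
        status: response.status,
        isHtml: contentType.includes('text/html'),
    };
}
```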
AutoscaledPool
A reference to the running instance of the `AutoscaledPool` class. See the Apify SDK docs for more information.
Global Store
`globalStore` represents an instance of a very simple in-memory store that is not scoped to the individual `pageFunction` invocation. This enables you to easily share global data such as API responses, tokens, and other values. Since the stored data needs to cross from the browser to the Node.js process, it cannot be arbitrary data, only JSON-stringifiable objects. You cannot store DOM objects, functions, circular objects, and so on. `globalStore` in Cheerio Scraper is just a `Map`.
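Since `globalStore` is a `Map`, sharing a value between invocations could look roughly like the sketch below. `fetchToken()` is a hypothetical stand-in for whatever produces the shared value, and the `token` key is an arbitrary name.

```javascript
// Sketch: sharing a value across pageFunction invocations via globalStore.
// fetchToken() is a hypothetical helper, not part of the scraper or the SDK.
async function fetchToken() {
    return { value: 'demo-token', createdAt: Date.now() };
}

async function pageFunction(context) {
    const { globalStore, request } = context;

    // globalStore is a Map, so the usual has/get/set methods apply.
    if (!globalStore.has('token')) {
        // Store plain, JSON-stringifiable data only.
        globalStore.set('token', await fetchToken());
    }

    const token = globalStore.get('token');
    return { url: request.url, token: token.value };
}
```

With concurrent pages, two invocations may both find the key missing and both call the helper; the later `set` simply overwrites the earlier one.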
Log
`log` is a reference to `Apify.utils.log`. You can use any of the logging methods, such as `log.info` or `log.exception`. `log.debug` is special, because you can toggle the visibility of its messages in the scraper's log with the Debug log input option.
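A sketch of the logging methods in use. The `.price` selector and the returned fields are hypothetical, and the example assumes `log.exception` accepts the error first and a message second.

```javascript
// Sketch: logging from a page function. The '.price' selector is hypothetical.
async function pageFunction(context) {
    const { log, request, $ } = context;

    log.info(`Processing ${request.url}`);
    // Shown only when the Debug log input option is enabled.
    log.debug('Request details', { retryCount: request.retryCount });

    try {
        const price = parseFloat($('.price').text());
        if (Number.isNaN(price)) throw new Error('Price not found');
        return { url: request.url, price };
    } catch (err) {
        // Assumption: log.exception takes the error first, then a message.
        log.exception(err, `Could not extract price from ${request.url}`);
        return { url: request.url, price: null };
    }
}
```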
Apify
A reference to the full power of the Apify SDK. See the docs for more information and all the available functions and classes.
Caution: Since we're making the full SDK available, and Cheerio Scraper runs using the SDK, some edge case manipulations may lead to inconsistencies. Use `Apify` with caution and avoid making global changes unless you know what you're doing.
Output
Output is a dataset containing extracted data for each scraped page. To save data into the dataset, return an `Object` or an `Object[]` from the `pageFunction`.
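As a sketch of the `Object[]` case, the page function below returns one object per `<h2>` element on the page. The selector and the field names are arbitrary choices for this example.

```javascript
// Sketch: returning an array of objects from a page function.
async function pageFunction(context) {
    const { $, request } = context;

    // Cheerio's .map() passes (index, element); .get() converts the
    // result into a plain JavaScript array.
    const items = $('h2')
        .map((i, el) => ({
            url: request.url,
            position: i,
            heading: $(el).text().trim(),
        }))
        .get();

    return items;
}
```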
Dataset
For each of the scraped URLs, the dataset contains an object with results and some metadata. If you were scraping the HTML `<title>` of Apify and returning the following object from the `pageFunction`:
    return {
        title: "Web Scraping, Data Extraction and Automation - Apify"
    }
it would look like this:
    {
        "title": "Web Scraping, Data Extraction and Automation - Apify",
        "#error": false,
        "#debug": {
            "requestId": "fvwscO2UJLdr10B",
            "url": "https://apify.com",
            "loadedUrl": "https://apify.com/",
            "method": "GET",
            "retryCount": 0,
            "errorMessages": null,
            "statusCode": 200
        }
    }
You can remove the metadata (and results containing only metadata) from the results by selecting the Clean items option when downloading the dataset.
The result will look like this:
    {
        "title": "Web Scraping, Data Extraction and Automation - Apify"
    }
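If you already downloaded the full items and want a similar result locally, a rough approximation is sketched below. It is not how the platform implements the Clean items option; it simply assumes that metadata fields are exactly those whose key starts with `#`, as in the example above. `cleanItems` is a hypothetical helper name.

```javascript
// Sketch: a local approximation of "clean" output.
// Assumption: metadata fields are the ones whose key starts with '#'.
function cleanItems(items) {
    return items
        // Drop every '#'-prefixed field from each item.
        .map((item) => Object.fromEntries(
            Object.entries(item).filter(([key]) => !key.startsWith('#')),
        ))
        // Drop items that contained nothing but metadata.
        .filter((item) => Object.keys(item).length > 0);
}

const example = cleanItems([
    { title: 'Web Scraping, Data Extraction and Automation - Apify', '#error': false },
    { '#error': true, '#debug': { statusCode: 500 } },
]);
console.log(JSON.stringify(example, null, 2));
```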
Created in Apr 2019
Modified 6 years ago