
Crawlee + Cheerio

A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.

Language

javascript

Tools

nodejs

crawlee

cheerio

Use cases

Starter

Web scraping

src/main.js

import { setTimeout } from 'node:timers/promises';

// Crawlee - web scraping and browser automation library (read more at https://crawlee.dev)
import { CheerioCrawler, Dataset } from '@crawlee/cheerio';
// Apify SDK - toolkit for building Apify Actors (read more at https://docs.apify.com/sdk/js/)
import { Actor } from 'apify';

// The init() call configures the Actor to work correctly with the Apify-provided
// environment - mainly the storage infrastructure. Every Actor must perform an init() call.
await Actor.init();

// Handle graceful abort - the Actor is being stopped by the user or the platform
Actor.on('aborting', async () => {
    // Persist any state, do any cleanup you need, and terminate the Actor with an explicit
    // `await Actor.exit()` as soon as possible. This helps ensure the Actor makes a best
    // effort to honor any limits on the cost of a single run set by the user.
    // Wait 1 second to allow Crawlee/SDK useState and other state persistence operations
    // to complete. This is a temporary workaround until the SDK implements proper state
    // persistence in the aborting event.
    await setTimeout(1000);
    await Actor.exit();
});

// The structure of the input is defined in input_schema.json
const { startUrls = ['https://apify.com'], maxRequestsPerCrawl = 100 } = (await Actor.getInput()) ?? {};

// Proxy configuration to rotate IP addresses and prevent blocking (https://docs.apify.com/platform/proxy)
// The `checkAccess` flag verifies that the proxy credentials are valid, but the check can
// take a few hundred milliseconds. Disable it for short runs if you are sure your proxy
// configuration is correct.
const proxyConfiguration = await Actor.createProxyConfiguration({ checkAccess: true });

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    async requestHandler({ enqueueLinks, request, $, log }) {
        log.info('enqueueing new URLs');
        await enqueueLinks();

        // Extract the title from the page.
        const title = $('title').text();
        log.info(title, { url: request.loadedUrl });

        // Save the URL and title to the Dataset - a table-like storage.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit() call.
await Actor.exit();

JavaScript Crawlee & CheerioCrawler Actor Template

This template was built with Crawlee and scrapes data from a website using the Cheerio HTML parser wrapped in CheerioCrawler.

Quick Start

Once you've installed the dependencies, start the Actor:

apify run

Once your Actor is ready, you can push it to the Apify Console:

apify login # first, you need to log in if you haven't already done so
apify push

Project Structure

.actor/
├── actor.json # Actor config: name, version, env vars, runtime settings
├── dataset_schema.json # Structure and representation of data produced by the Actor
├── input_schema.json # Input validation & Console form definition
└── output_schema.json # Specifies where an Actor stores its output
src/
└── main.js # Actor entry point and orchestrator
storage/ # Local storage (mirrors Cloud during development)
├── datasets/ # Output items (JSON objects)
├── key_value_stores/ # Files, config, INPUT
└── request_queues/ # Pending crawl requests
Dockerfile # Container image definition

For more information, see the Actor definition documentation.
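
The `.actor/actor.json` file ties these pieces together. A minimal sketch following the Apify Actor specification might look like this (the `name` and `version` values are illustrative, not taken from this template):

```json
{
    "actorSpecification": 1,
    "name": "my-cheerio-crawler",
    "version": "0.0",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```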

How it works

The script crawls a website with CheerioCrawler, parses each page's HTML with Cheerio, and stores the page titles in a dataset.

  • The crawler starts with the URLs provided in the startUrls input field defined by the input schema. The number of scraped pages is capped by the maxRequestsPerCrawl field from the input schema.
  • For each URL, the crawler runs requestHandler to extract data from the page with the Cheerio library and saves the title and URL of each page to the dataset. It also logs each result as it is saved.
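
The two input fields read in src/main.js could be declared in `.actor/input_schema.json` roughly like this. This is a sketch following the Apify input schema format; the titles, descriptions, and prefill values are illustrative, not copied from this template:

```json
{
    "title": "Input schema for the Cheerio crawler",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs where the crawl begins.",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://apify.com" }]
        },
        "maxRequestsPerCrawl": {
            "title": "Max requests per crawl",
            "type": "integer",
            "description": "Upper bound on the number of pages the crawler opens.",
            "default": 100
        }
    }
}
```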

What's included

  • Apify SDK - toolkit for building Actors
  • Crawlee - web scraping and browser automation library
  • Input schema - define and easily validate a schema for your Actor's input
  • Dataset - store structured data where each object stored has the same attributes
  • Cheerio - a fast, flexible & elegant library for parsing and manipulating HTML and XML
  • Proxy configuration - rotate IP addresses to prevent blocking

Resources

Creating Actors with templates

Already have a solution in mind?

Sign up for a free Apify account and deploy your code to the platform in just a few minutes! If you want a head start without coding it yourself, browse our Store of existing solutions.