Crawlee + Cheerio
A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.
src/main.js
// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/)
import { Actor } from 'apify';
// Crawlee - web scraping and browser automation library (Read more at https://crawlee.dev)
import { CheerioCrawler, Dataset } from 'crawlee';

// The init() call configures the Actor for its environment. It's recommended to start every Actor with an init()
await Actor.init();

// Structure of input is defined in input_schema.json
const { startUrls = ['https://apify.com'], maxRequestsPerCrawl = 100 } = (await Actor.getInput()) ?? {};

// Proxy configuration to rotate IP addresses and prevent blocking (https://docs.apify.com/platform/proxy)
const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    async requestHandler({ enqueueLinks, request, $, log }) {
        log.info('enqueueing new URLs');
        await enqueueLinks();

        // Extract title from the page.
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });

        // Save url and title to Dataset - a table-like storage.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit()
await Actor.exit();
JavaScript Crawlee & CheerioCrawler Actor Template
This template example was built with Crawlee to scrape data from a website using Cheerio wrapped into CheerioCrawler.
Quick Start
Once you've installed the dependencies, start the Actor:
apify run
Once your Actor is ready, you can push it to the Apify Console:
apify login # first, you need to log in if you haven't already done so
apify push
Project Structure
.actor/
├── actor.json # Actor config: name, version, env vars, runtime settings
├── dataset_schema.json # Structure and representation of data produced by an Actor
├── input_schema.json # Input validation & Console form definition
└── output_schema.json # Specifies where an Actor stores its output
src/
└── main.js # Actor entry point and orchestrator
storage/ # Local storage (mirrors Cloud during development)
├── datasets/ # Output items (JSON objects)
├── key_value_stores/ # Files, config, INPUT
└── request_queues/ # Pending crawl requests
Dockerfile # Container image definition
For more information, see the Actor definition documentation.
How it works
This code is a JavaScript script that uses Cheerio to scrape data from a website. It then stores the website titles in a dataset.
- The crawler starts with the URLs provided in the `startUrls` input field, which is defined by the input schema. The number of scraped pages is limited by the `maxRequestsPerCrawl` field from the input schema.
- The crawler runs `requestHandler` for each URL to extract data from the page with the Cheerio library and save the title and URL of each page to the dataset. It also logs each result as it is saved.
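The two steps above can be tried in isolation, without starting a crawl. The sketch below is only an illustration: `resolveInput`, `buildRecord` and `fakeDollar` are hypothetical names that do not exist in the template or in Crawlee. It assumes the same defaults as `src/main.js` and uses a tiny stand-in for Cheerio's `$` so that it runs with plain Node.js.

```javascript
// Hypothetical helper (not part of the template): applies the same
// destructuring defaults that src/main.js uses when reading the Actor input.
function resolveInput(input) {
    // `input ?? {}` covers the case where no input was provided at all.
    const { startUrls = ['https://apify.com'], maxRequestsPerCrawl = 100 } = input ?? {};
    return { startUrls, maxRequestsPerCrawl };
}

// Hypothetical helper (not part of the template): builds the record that the
// request handler saves for one page. `$` only has to support $('title').text().
function buildRecord(request, $) {
    return { url: request.loadedUrl, title: $('title').text() };
}

// Stand-in for Cheerio's `$`, for illustration only. It knows one selector
// and returns a fixed title instead of parsing real HTML.
const fakeDollar = (selector) => ({
    text: () => (selector === 'title' ? 'Example Domain' : ''),
});

// No input: both defaults apply.
console.log(resolveInput(null));

// Partial input: only the missing field falls back to its default.
console.log(resolveInput({ maxRequestsPerCrawl: 5 }));

// One page's worth of output, shaped like the object passed to Dataset.pushData().
console.log(buildRecord({ loadedUrl: 'https://example.com/' }, fakeDollar));
```

In the real template, Crawlee supplies `request` and `$` to `requestHandler`, and the record goes to `Dataset.pushData()` rather than the console.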
What's included
- Apify SDK - toolkit for building Actors
- Crawlee - web scraping and browser automation library
- Input schema - define and easily validate a schema for your Actor's input
- Dataset - store structured data where each object stored has the same attributes
- Cheerio - a fast, flexible & elegant library for parsing and manipulating HTML and XML
- Proxy configuration - rotate IP addresses to prevent blocking
Resources
- Quick Start guide for building your first Actor
- Video tutorial on building a scraper using CheerioCrawler
- Written tutorial on building a scraper using CheerioCrawler
- Web scraping with Cheerio in 2023
- How to scrape a dynamic page using Cheerio
- Integration with Zapier, Make, Google Drive and others
- Video guide on getting data using Apify API
Creating Actors with templates
One‑Page HTML Scraper with Cheerio
Scrape a single page at a provided URL with Axios and extract data from the page's HTML with Cheerio.
Crawlee + Puppeteer + Chrome
Example of a Puppeteer and headless Chrome web scraper. Headless browsers render JavaScript and are harder to block, but they're slower than plain HTTP.
Crawlee + Playwright + Chrome
Web scraper example with Crawlee, Playwright and headless Chrome. Playwright is more modern, user-friendly and harder to block than Puppeteer.
Crawlee + Playwright + Camoufox
Web scraper example with Crawlee, Playwright and Camoufox. Camoufox is a custom stealthy fork of Firefox. Try this template if you're facing anti-scraping challenges.
Bootstrap CheerioCrawler
Skeleton project that helps you quickly bootstrap `CheerioCrawler` in JavaScript. It's best for developers who already know Apify SDK and Crawlee.
Cypress
Example of running Cypress tests and saving their results on the Apify platform. JSON results are saved to Dataset, videos to Key-value store.