website-content-crawler
acceptable_seahorse/website-content-crawler
Start URLs

startUrls (array, optional)

One or more URLs of pages where the crawler will start. Note that the Actor will only crawl sub-pages of these URLs. For example, for the start URL https://example.com/blog, it will crawl pages like https://example.com/blog/article-1, but will skip https://example.com/docs/something-else.
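
For example, a minimal run of this Actor through the Apify Python client could look like the sketch below. The API token is a placeholder, and the object format used for startUrls is an assumption based on the common Apify request-list convention:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_API_TOKEN>")  # placeholder token

run = client.actor("acceptable_seahorse/website-content-crawler").call(
    run_input={
        # Assumed request-list format; the crawl stays within /blog/.
        "startUrls": [{"url": "https://example.com/blog"}],
    }
)

# Each crawled page becomes one item in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"))
```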

Crawler type

crawlerType (enum, optional)

Select the crawling engine:

  • Headless web browser (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive, because they require more computing resources and are slower. It is recommended to use at least 8 GB of RAM.
  • Stealthy web browser - Headless web browser with antiblocking measures enabled. Try this if you encounter bot protection while scraping. For best performance, use with Apify proxy servers.
  • Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.

Value options:

"playwright:chrome": string"playwright:firefox": string"cheerio": string"jsdom": string

Default value of this property is "playwright:chrome"
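
For example, to select the raw HTTP client, the run input might look like this sketch (startUrls format assumed as above):

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    # "cheerio" is the raw HTTP client: faster and cheaper, but with
    # no JavaScript rendering and no browser-level anti-blocking.
    "crawlerType": "cheerio",
}
```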

Max crawling depth

maxCrawlDepth (integer, optional)

The maximum depth to which the crawler recursively follows links from the start URLs. The start URLs have depth 0, pages linked directly from them have depth 1, and so on.

This setting is useful to prevent accidental runaway crawls. Setting it to 0 makes the Actor crawl only the start URLs.

Default value of this property is 20
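
To illustrate, setting the depth to 0 restricts the run to the start URLs only (startUrls format assumed as above):

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    # Depth 0: crawl only the start URLs; depth 1 would add the pages
    # they link to directly, and so on.
    "maxCrawlDepth": 0,
}
```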

Max pages

maxCrawlPages (integer, optional)

The maximum number of pages to crawl. This count includes the start URLs, pagination pages, pages with no content, etc. The crawler finishes automatically after reaching this number. This setting is useful to prevent accidental runaway crawls.

Default value of this property is 9999999

Max results

maxResults (integer, optional)

The maximum number of results to store. The crawler finishes automatically after reaching this number. This setting is useful to prevent accidental runaway crawls. If both maxCrawlPages and maxResults are defined, the crawler finishes when the first limit is reached. Note that the crawler skips pages whose canonical URL belongs to a page that has already been crawled, so it might crawl more pages than it stores results.

Default value of this property is 9999999
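
The two limits can be combined in one run input; the values below are illustrative, and the run stops at whichever limit is reached first:

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    "maxCrawlPages": 500,  # counts every page visited, incl. pagination
    "maxResults": 200,     # counts items stored in the dataset
}
```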

Max concurrency

maxConcurrency (integer, optional)

The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.

Default value of this property is 200

Initial concurrency

initialConcurrency (integer, optional)

The initial number of web browsers or HTTP clients running in parallel. The system then scales the concurrency up and down based on the actual performance and memory limit. If the value is set to 0, the Actor uses the default settings for the specific crawler type.

Note that if you set this value too high, the Actor will run out of memory and crash. If you set it too low, the crawl will start slowly before the system scales the concurrency up.

Default value of this property is 0
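
For example, a conservative concurrency setup might look like this sketch (values are illustrative):

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    "initialConcurrency": 5,  # start with 5 parallel browsers/clients
    "maxConcurrency": 20,     # let autoscaling grow up to 20
}
```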

Proxy configuration

proxyConfiguration (object, optional)

Enables loading the websites from IP addresses in specific geographies and helps circumvent blocking.
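
A sketch using the standard Apify proxy configuration object; the RESIDENTIAL group and country code are illustrative and require access to that proxy group:

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],  # illustrative group
        "apifyProxyCountry": "US",            # illustrative geography
    },
}
```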

Text extractor

textExtractor (enum, optional)

Select the text parser:

  • HTML to text (default) - Extracts the full page text after removing HTML tags and side text from navigation, header, and footer. The best parser to ensure all text is extracted.
  • Unfluff - A stricter, article-focused parser. Extracts only the main article text. Doesn't work well on more complex pages like API references, and can be too slow on very long pages.
  • Extractus - Extracts only the main article text. Includes HTML tags.
  • Mozilla Readability - Extracts the main contents of the webpage. Uses similar logic to the Firefox Reader View.

You can examine the output of all parsers by enabling debug mode.

Value options:

"crawleeHtmlToText": string"unfluff": string"extractus": string"readableText": string"none": string

Default value of this property is "crawleeHtmlToText"
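
For example, to switch to the Mozilla Readability extractor (see also readableTextCharThreshold below):

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    # "readableText" selects the Mozilla Readability extractor.
    "textExtractor": "readableText",
}
```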

Aggressive pruning

aggressivePrune (boolean, optional)

When enabled, the crawler prunes content lines that are very similar to ones already crawled. This is useful to strip repeating content from the scraped data, such as menus, headers, and footers. In improbable edge cases, it might remove relevant content from some pages.

Default value of this property is false

Remove HTML elements

removeElementsCssSelector (string, optional)

A CSS selector matching HTML elements that will be removed from the DOM before it is converted to text, Markdown, or saved as HTML. This is useful to skip irrelevant page content.

By default, the Actor removes headers, menus, and footers. You can disable the removal by setting this to a non-existent CSS selector such as dummy.

Default value of this property is "header, nav, footer"
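
As a sketch, the default selector can be extended with site-specific noise; the extra selectors below are purely illustrative:

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    # Defaults plus hypothetical site-specific elements to strip.
    "removeElementsCssSelector": "header, nav, footer, .cookie-banner, #sidebar",
}
```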

Wait for dynamic content (secs)

dynamicContentWaitSecs (integer, optional)

Adds a sleep before the page contents are processed, to allow dynamic loading to settle. It defaults to 10 seconds and resolves after 2 seconds once the page content stops changing. This option is ignored with the cheerio crawler type. The crawler always waits for the window load event; this sleep adds extra time after it.

Default value of this property is 10
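
For a heavily client-side-rendered site, the wait can be raised; the value below is illustrative:

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    # Wait up to 30 s for dynamic content to settle (resolves earlier
    # once the content stops changing; ignored with "cheerio").
    "dynamicContentWaitSecs": 30,
}
```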

Download document files

saveFiles (boolean, optional)

If enabled, the crawler downloads document files linked from the web pages and stores them in the key-value store. Metadata about the files is stored in the output dataset, in the same way as for normal pages.

Only files whose URL ends with one of the following file extensions are stored: PDF, DOC, DOCX, XLS, XLSX, and CSV.

Default value of this property is false

Save HTML

saveHtml (boolean, optional)

If enabled, the crawler stores the full HTML of all pages found under the html field in the output dataset. This is useful for debugging, but it reduces performance and increases storage costs.

Default value of this property is false

Save screenshots (headless browser only)

saveScreenshots (boolean, optional)

If enabled, the crawler stores a screenshot of each article page in the default key-value store. The link to the screenshot is stored under the screenshotUrl field in the output dataset. This is useful for debugging, but it reduces performance and increases storage costs.

Note that this feature only works with headless browser crawler type!

Default value of this property is false

Save Markdown

saveMarkdown (boolean, optional)

If enabled, the crawler stores Markdown of all pages found, under the markdown field in the output dataset.

Default value of this property is false
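
Taken together, the storage-related flags above can be enabled in a single run input, at the cost of slower runs and higher storage usage; a sketch:

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    "crawlerType": "playwright:chrome",  # screenshots need a headless browser
    "saveFiles": True,        # store linked PDF/DOC/DOCX/XLS/XLSX/CSV files
    "saveHtml": True,         # full HTML under the "html" field
    "saveScreenshots": True,  # link stored under "screenshotUrl"
    "saveMarkdown": True,     # Markdown under the "markdown" field
}
```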

Readable text extractor character threshold

readableTextCharThreshold (integer, optional)

Applies only to the Mozilla Readability ("readableText") text extractor. The minimum number of characters an article must have for the extractor to return a result.

Default value of this property is 100

Debug mode (stores output of all types of extractors)

debugMode (boolean, optional)

If enabled, the Actor stores the output of all types of extractors, including the ones that are not used by default, and it stores the HTML in the key-value store with a link. All this data is stored under the debug field.

Default value of this property is false
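
For example, to compare extractors on a single site, debug mode can be combined with the Readability threshold (300 is an illustrative value):

```python
run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],
    "textExtractor": "readableText",
    "readableTextCharThreshold": 300,  # min characters for a Readability result
    "debugMode": True,  # store all extractors' output under "debug"
}
```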
