website-content-crawler
Deprecated
Pricing: Pay per usage
Last modified: 2 years ago
startUrls
array (optional)
One or more URLs of pages where the crawler will start. Besides the start URLs themselves, the Actor will only crawl their sub-pages. For example, for the start URL https://www.example.com/blog, it will crawl pages like https://example.com/blog/article-1, but will skip https://example.com/docs/something-else.
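For example, a minimal input could look like the following sketch, where example.com stands in for a real site and the { "url": ... } wrapper follows the usual Apify request format:
    {
      "startUrls": [
        { "url": "https://www.example.com/blog" }
      ]
    }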
crawlerType
Enum (optional)
Select the crawling engine:
"playwright:chrome": string"playwright:firefox": string"cheerio": string"jsdom": string
Default value of this property is "playwright:chrome"
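As an illustrative sketch, a fully static site could be crawled with the raw-HTTP engine; note that "cheerio" does not execute JavaScript, so dynamic pages need one of the Playwright engines:
    {
      "startUrls": [{ "url": "https://www.example.com/docs" }],
      "crawlerType": "cheerio"
    }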
maxCrawlDepth
integer (optional)
The maximum depth of links from the start URLs that the crawler will recursively follow. The start URLs have depth 0, pages linked directly from the start URLs have depth 1, and so on.
This setting is useful to prevent an accidental crawler runaway. If you set it to 0, the Actor will only crawl the start URLs.
Default value of this property is 20
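For instance, to process only the pages listed in startUrls and ignore everything they link to, you could set the depth to 0 (a minimal sketch):
    {
      "startUrls": [{ "url": "https://www.example.com/changelog" }],
      "maxCrawlDepth": 0
    }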
maxCrawlPages
integer (optional)
The maximum number of pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent an accidental crawler runaway.
Default value of this property is 9999999
maxResults
integer (optional)
The maximum number of results to store. The crawler will automatically finish after reaching this number. This setting is useful to prevent an accidental crawler runaway. If both maxCrawlPages and maxResults are defined, the crawler will finish when the first of the two limits is reached. Note that the crawler skips pages whose canonical URL points to a page that has already been crawled, so it might crawl more pages than there are results.
Default value of this property is 9999999
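An illustrative input combining both limits; this run stops after 50 crawled pages or 30 stored results, whichever is reached first:
    {
      "startUrls": [{ "url": "https://www.example.com/blog" }],
      "maxCrawlPages": 50,
      "maxResults": 30
    }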
maxConcurrency
integer (optional)
The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.
Default value of this property is 200
initialConcurrency
integer (optional)
The initial number of web browsers or HTTP clients running in parallel. The system then scales the concurrency up and down based on actual performance and the memory limit. If the value is set to 0, the Actor uses the default settings for the specific crawler type.
Note that if you set this value too high, the Actor will run out of memory and crash. If you set it too low, the crawl will be slow at the start before the system scales the concurrency up.
Default value of this property is 0
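For example, to be gentle with a small website, you might cap the parallelism while letting autoscaling pick the starting point (the values are illustrative, not recommendations):
    {
      "startUrls": [{ "url": "https://www.example.com" }],
      "maxConcurrency": 10,
      "initialConcurrency": 0
    }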
proxyConfiguration
object (optional)
Enables loading the websites from IP addresses in specific geographies and helps circumvent blocking.
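A sketch of a typical setup, assuming the standard Apify proxy configuration object; the useApifyProxy and apifyProxyCountryCode fields and the country shown are placeholders that depend on your account:
    {
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyCountryCode": "US"
      }
    }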
textExtractor
Enum (optional)
Select the text parser:
You can examine the output of all parsers by enabling debug mode.
Possible values: "crawleeHtmlToText", "unfluff", "extractus", "readableText", "none"
Default value of this property is "crawleeHtmlToText"
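For example, to prefer article-style extraction over the default HTML-to-text conversion (a sketch; it is worth comparing the extractors via debug mode first):
    {
      "startUrls": [{ "url": "https://www.example.com/blog" }],
      "textExtractor": "readableText"
    }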
aggressivePrune
boolean (optional)
When enabled, the crawler will prune content lines that are very similar to ones it has already crawled. This is useful to strip repeating content from the scraped data, such as menus, headers, and footers. In improbable edge cases, it might remove relevant content from some pages.
Default value of this property is false
removeElementsCssSelector
string (optional)
A CSS selector matching HTML elements that will be removed from the DOM before converting it to text, Markdown, or saving it as HTML. This is useful to skip irrelevant page content.
By default, the Actor removes headers, menus, and footers. You can disable the removal by setting this to a non-existent CSS selector like dummy.
Default value of this property is "header, nav, footer"
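For example, to additionally drop sidebars and a cookie banner (the extra selectors are illustrative and depend on the target site's markup):
    {
      "removeElementsCssSelector": "header, nav, footer, aside, .cookie-banner"
    }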
dynamicContentWaitSecs
integer (optional)
Adds a sleep before the page contents are processed, to allow dynamic loading to settle. Defaults to 10 seconds, and resolves after 2 seconds once the page content stops changing. This option is ignored with the cheerio crawler type. The crawler always waits for the window load event; this sleep adds additional time after it.
Default value of this property is 10
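For a heavily client-side-rendered site, you might raise the wait along these lines (the values are illustrative):
    {
      "crawlerType": "playwright:firefox",
      "dynamicContentWaitSecs": 20
    }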
saveFiles
boolean (optional)
If enabled, the crawler downloads document files linked from the web pages and stores them in the key-value store. The metadata about the files is stored in the output dataset, similarly to normal pages.
Only files whose URL ends with the following file extensions are stored: PDF, DOC, DOCX, XLS, XLSX, and CSV.
Default value of this property is false
saveHtml
boolean (optional)
If enabled, the crawler stores the full HTML of all pages found, under the html field in the output dataset. This is useful for debugging, but reduces performance and increases storage costs.
Default value of this property is false
saveScreenshots
boolean (optional)
If enabled, the crawler stores a screenshot of each article page in the default key-value store. The link to the screenshot is stored under the screenshotUrl field in the output dataset. This is useful for debugging, but reduces performance and increases storage costs.
Note that this feature only works with a headless browser crawler type!
Default value of this property is false
saveMarkdown
boolean (optional)
If enabled, the crawler stores the Markdown of all pages found, under the markdown field in the output dataset.
Default value of this property is false
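The storage flags above can be combined; for debugging a handful of pages, a sketch like this enables all of them (which increases storage costs):
    {
      "startUrls": [{ "url": "https://www.example.com/pricing" }],
      "maxCrawlPages": 5,
      "saveHtml": true,
      "saveMarkdown": true,
      "saveScreenshots": true,
      "saveFiles": true
    }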
readableTextCharThreshold
integer (optional)
Applies only to the "Readable text" extractor. The minimum number of characters an article must have in order to return a result.
Default value of this property is 100
debugMode
boolean (optional)
If enabled, the Actor will store the output of all types of extractors, including the ones not used by default, and it will store the HTML to the key-value store with a link. All this data is stored under the debug field.
Default value of this property is false
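Putting it all together, a complete input exercising most of the options above might look like this sketch; every value is illustrative and should be adjusted to the target site:
    {
      "startUrls": [{ "url": "https://www.example.com/blog" }],
      "crawlerType": "playwright:chrome",
      "maxCrawlDepth": 5,
      "maxCrawlPages": 100,
      "maxResults": 100,
      "maxConcurrency": 20,
      "proxyConfiguration": { "useApifyProxy": true },
      "textExtractor": "crawleeHtmlToText",
      "removeElementsCssSelector": "header, nav, footer",
      "saveMarkdown": true,
      "debugMode": false
    }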