website-content-crawler

Deprecated

Developed by Matěj Šesták

Maintained by Community

0.0 (0)

Pricing: Pay per usage

1 Total users
2 Monthly users

Last modified: 2 years ago

You can access website-content-crawler programmatically from your own applications by using the Apify API. To use the Apify API, you’ll need an Apify account and your API token, which you can find under Integrations settings in Apify Console.
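For example, here is a minimal sketch of a synchronous run using Python and the requests library. The endpoint, the token query parameter, and the input fields are taken from the OpenAPI definition below; the placeholder token, the example start URL, and the chosen option values are illustrative only.

import requests

# Endpoint from the OpenAPI definition below: runs the Actor, waits for it
# to finish, and returns the dataset items in the response body.
API_URL = (
    "https://api.apify.com/v2/acts/"
    "acceptable_seahorse~website-content-crawler/run-sync-get-dataset-items"
)

# Input object following the Actor's input schema (see components/schemas/inputSchema).
run_input = {
    "startUrls": [{"url": "https://www.example.com/blog"}],
    "crawlerType": "cheerio",
    "maxCrawlPages": 50,
}

response = requests.post(
    API_URL,
    params={"token": "<YOUR_APIFY_TOKEN>"},  # token from Apify Console > Integrations
    json=run_input,
    timeout=600,
)
response.raise_for_status()

# Each returned item corresponds to one crawled page.
for item in response.json():
    print(item)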

{
"openapi": "3.0.1",
"info": {
"version": "0.0",
"x-build-id": "WeRRT1J6N6RrE1G9r"
},
"servers": [
{
"url": "https://api.apify.com/v2"
}
],
"paths": {
"/acts/acceptable_seahorse~website-content-crawler/run-sync-get-dataset-items": {
"post": {
"operationId": "run-sync-get-dataset-items-acceptable_seahorse-website-content-crawler",
"x-openai-isConsequential": false,
"summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
"tags": [
"Run Actor"
],
"requestBody": {
"required": true,
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/inputSchema"
}
}
}
},
"parameters": [
{
"name": "token",
"in": "query",
"required": true,
"schema": {
"type": "string"
},
"description": "Enter your Apify token here"
}
],
"responses": {
"200": {
"description": "OK"
}
}
}
},
"/acts/acceptable_seahorse~website-content-crawler/runs": {
"post": {
"operationId": "runs-sync-acceptable_seahorse-website-content-crawler",
"x-openai-isConsequential": false,
"summary": "Executes an Actor and returns information about the initiated run in response.",
"tags": [
"Run Actor"
],
"requestBody": {
"required": true,
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/inputSchema"
}
}
}
},
"parameters": [
{
"name": "token",
"in": "query",
"required": true,
"schema": {
"type": "string"
},
"description": "Enter your Apify token here"
}
],
"responses": {
"200": {
"description": "OK",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/runsResponseSchema"
}
}
}
}
}
}
},
"/acts/acceptable_seahorse~website-content-crawler/run-sync": {
"post": {
"operationId": "run-sync-acceptable_seahorse-website-content-crawler",
"x-openai-isConsequential": false,
"summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
"tags": [
"Run Actor"
],
"requestBody": {
"required": true,
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/inputSchema"
}
}
}
},
"parameters": [
{
"name": "token",
"in": "query",
"required": true,
"schema": {
"type": "string"
},
"description": "Enter your Apify token here"
}
],
"responses": {
"200": {
"description": "OK"
}
}
}
}
},
"components": {
"schemas": {
"inputSchema": {
"type": "object",
"properties": {
"startUrls": {
"title": "Start URLs",
"type": "array",
"description": "One or more URLs of pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.",
"items": {
"type": "object",
"required": [
"url"
],
"properties": {
"url": {
"type": "string",
"title": "URL of a web page",
"format": "uri"
}
}
}
},
"crawlerType": {
"title": "Crawler type",
"enum": [
"playwright:chrome",
"playwright:firefox",
"cheerio",
"jsdom"
],
"type": "string",
"description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Stealthy web browser** - Headless web browser with antiblocking measures enabled. Try this if you encounter bot protection while scraping. For best performance, use with Apify proxy servers. \n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.",
"default": "playwright:chrome"
},
"maxCrawlDepth": {
"title": "Max crawling depth",
"minimum": 0,
"type": "integer",
"description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have depth 0, the pages linked directly from the start URLs have depth 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the actor will only crawl start URLs.",
"default": 20
},
"maxCrawlPages": {
"title": "Max pages",
"minimum": 0,
"type": "integer",
"description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.",
"default": 9999999
},
"maxResults": {
"title": "Max results",
"minimum": 0,
"type": "integer",
"description": "The maximum number of results to store. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway. If both `maxCrawlPages` and `maxResults` are defined, then the crawler will finish when the first limit is reached. Note that the crawler skips pages with a canonical URL of a page that has already been crawled, so it might crawl more pages than there are results.",
"default": 9999999
},
"maxConcurrency": {
"title": "Max concurrency",
"minimum": 1,
"maximum": 999,
"type": "integer",
"description": "The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.",
"default": 200
},
"initialConcurrency": {
"title": "Initial concurrency",
"minimum": 0,
"maximum": 999,
"type": "integer",
"description": "The initial number of web browsers or HTTP clients running in parallel. The system then scales the concurrency up and down based on the actual performance and memory limit. If the value is set to 0, the Actor uses the default settings for the specific crawler type.\n\nNote that if you set this value too high, the Actor will run out of memory and crash. If too low, it will be slow at start before it scales the concurrency up.",
"default": 0
},
"proxyConfiguration": {
"title": "Proxy configuration",
"type": "object",
"description": "Enables loading the websites from IP addresses in specific geographies and to circumvent blocking."
},
"textExtractor": {
"title": "Text extractor",
"enum": [
"crawleeHtmlToText",
"unfluff",
"extractus",
"readableText",
"none"
],
"type": "string",
"description": "Select the text parser:\n- **HTML to text** (default) - Extracts full page text after removing HTML tags and side text from navigation, header and footer. Best parser to ensure all text is extracted\n- **Unfluff** - More strict article focused parser. Extracts only main article text. Doesn't work well on more complex pages like API reference. Can be too slow on very long pages. \n- **Extractus** - Extracts only main article text. Includes HTML tags.\n\nYou can examine output of all parsers by enabling debug mode.\n- **Mozilla Readability** - Extracts the main contents of the webpage. Utilizes similar logic as the Firefox Reader View.",
"default": "crawleeHtmlToText"
},
"aggressivePrune": {
"title": "Aggressive pruning",
"type": "boolean",
"description": "When enabled, the crawler will prune content lines that are very similar to the ones already crawled. This is useful strip repeating content in the scraped data like menus, headers, footers, etc. In inprobable edge cases, it might remove relevant content from some pages.",
"default": false
},
"removeElementsCssSelector": {
"title": "Remove HTML elements",
"type": "string",
"description": "A CSS selector matching HTML elements that will be removed from the DOM, before converting it to text, Markdown, or saving as HTML. This is useful to skip irrelevant page content.\n\nBy default, the Actor removes headers, menus, or footers. You can disable the removal by setting it to some non-existent CSS selector like `dummy`.",
"default": "header, nav, footer"
},
"dynamicContentWaitSecs": {
"title": "Wait for dynamic content (secs)",
"type": "integer",
"description": "Adds a sleep before the page contents are processed, to allow for dynamic loading to settle. Defaults to 10s, and will resolve after 2s when the page content stops changing. This option is ignored with cheerio crawler type. We always wait for the window load event, this sleep adds additional time after it.",
"default": 10
},
"saveFiles": {
"title": "Download document files",
"type": "boolean",
"description": "If enabled, the crawler downloads and stores document files linked from the web pages to the key-value store. The metadata about the files is stored in the output dataset, similarly as for normal pages.\n\nOnly files whose URL ends with the following file extensions are stored: PDF, DOC, DOCX, XLS, XLSX, and CSV.",
"default": false
},
"saveHtml": {
"title": "Save HTML",
"type": "boolean",
"description": "If enabled, the crawler stores full HTML of all pages found, under the `html` field in the output dataset. This is useful for debugging, but reduces performance and increases storage costs.",
"default": false
},
"saveScreenshots": {
"title": "Save screenshots (headless browser only)",
"type": "boolean",
"description": "If enabled, the crawler stores a screenshot for each article page to the default key-value store. The link to the screenshot is stored under the `screenshotUrl` field in the output dataset. It is useful for debugging, but reduces performance and increases storage costs.\n\nNote that this feature only works with headless browser crawler type!",
"default": false
},
"saveMarkdown": {
"title": "Save Markdown",
"type": "boolean",
"description": "If enabled, the crawler stores Markdown of all pages found, under the `markdown` field in the output dataset.",
"default": false
},
"readableTextCharThreshold": {
"title": "Readable text extractor character threshold",
"type": "integer",
"description": "Applies only to \"Readable text\" text extractor. The number of characters an article must have in order to return a result.",
"default": 100
},
"debugMode": {
"title": "Debug mode (stores output of all types of extractors)",
"type": "boolean",
"description": "If enabled, the actor will store the output of all types of extractors, including the ones that are not used by default and it will store HTML to Key-value Store with a link. All this data is stored under the `debug` field.",
"default": false
}
}
},
"runsResponseSchema": {
"type": "object",
"properties": {
"data": {
"type": "object",
"properties": {
"id": {
"type": "string"
},
"actId": {
"type": "string"
},
"userId": {
"type": "string"
},
"startedAt": {
"type": "string",
"format": "date-time",
"example": "2025-01-08T00:00:00.000Z"
},
"finishedAt": {
"type": "string",
"format": "date-time",
"example": "2025-01-08T00:00:00.000Z"
},
"status": {
"type": "string",
"example": "READY"
},
"meta": {
"type": "object",
"properties": {
"origin": {
"type": "string",
"example": "API"
},
"userAgent": {
"type": "string"
}
}
},
"stats": {
"type": "object",
"properties": {
"inputBodyLen": {
"type": "integer",
"example": 2000
},
"rebootCount": {
"type": "integer",
"example": 0
},
"restartCount": {
"type": "integer",
"example": 0
},
"resurrectCount": {
"type": "integer",
"example": 0
},
"computeUnits": {
"type": "integer",
"example": 0
}
}
},
"options": {
"type": "object",
"properties": {
"build": {
"type": "string",
"example": "latest"
},
"timeoutSecs": {
"type": "integer",
"example": 300
},
"memoryMbytes": {
"type": "integer",
"example": 1024
},
"diskMbytes": {
"type": "integer",
"example": 2048
}
}
},
"buildId": {
"type": "string"
},
"defaultKeyValueStoreId": {
"type": "string"
},
"defaultDatasetId": {
"type": "string"
},
"defaultRequestQueueId": {
"type": "string"
},
"buildNumber": {
"type": "string",
"example": "1.0.0"
},
"containerUrl": {
"type": "string"
},
"usage": {
"type": "object",
"properties": {
"ACTOR_COMPUTE_UNITS": {
"type": "integer",
"example": 0
},
"DATASET_READS": {
"type": "integer",
"example": 0
},
"DATASET_WRITES": {
"type": "integer",
"example": 0
},
"KEY_VALUE_STORE_READS": {
"type": "integer",
"example": 0
},
"KEY_VALUE_STORE_WRITES": {
"type": "integer",
"example": 1
},
"KEY_VALUE_STORE_LISTS": {
"type": "integer",
"example": 0
},
"REQUEST_QUEUE_READS": {
"type": "integer",
"example": 0
},
"REQUEST_QUEUE_WRITES": {
"type": "integer",
"example": 0
},
"DATA_TRANSFER_INTERNAL_GBYTES": {
"type": "integer",
"example": 0
},
"DATA_TRANSFER_EXTERNAL_GBYTES": {
"type": "integer",
"example": 0
},
"PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
"type": "integer",
"example": 0
},
"PROXY_SERPS": {
"type": "integer",
"example": 0
}
}
},
"usageTotalUsd": {
"type": "number",
"example": 0.00005
},
"usageUsd": {
"type": "object",
"properties": {
"ACTOR_COMPUTE_UNITS": {
"type": "integer",
"example": 0
},
"DATASET_READS": {
"type": "integer",
"example": 0
},
"DATASET_WRITES": {
"type": "integer",
"example": 0
},
"KEY_VALUE_STORE_READS": {
"type": "integer",
"example": 0
},
"KEY_VALUE_STORE_WRITES": {
"type": "number",
"example": 0.00005
},
"KEY_VALUE_STORE_LISTS": {
"type": "integer",
"example": 0
},
"REQUEST_QUEUE_READS": {
"type": "integer",
"example": 0
},
"REQUEST_QUEUE_WRITES": {
"type": "integer",
"example": 0
},
"DATA_TRANSFER_INTERNAL_GBYTES": {
"type": "integer",
"example": 0
},
"DATA_TRANSFER_EXTERNAL_GBYTES": {
"type": "integer",
"example": 0
},
"PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
"type": "integer",
"example": 0
},
"PROXY_SERPS": {
"type": "integer",
"example": 0
}
}
}
}
}
}
}
}
}
}
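As an alternative to calling the endpoints above directly, here is a minimal sketch using the official Apify API client for Python (the apify-client package, an extra dependency not mentioned on this page). The Actor ID and input fields come from the definition above; the token placeholder and the example input values are illustrative only.

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start the Actor and wait for it to finish; this corresponds to the /runs
# endpoint above, followed by waiting for the run to complete.
run = client.actor("acceptable_seahorse/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://www.example.com/blog"}],
        "crawlerType": "playwright:chrome",
        "maxCrawlDepth": 5,
        "saveMarkdown": True,
    }
)

# The run object follows the runs response schema above; its default dataset
# holds one item per crawled page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)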

website-content-crawler OpenAPI definition

OpenAPI is a standard for designing and describing RESTful APIs, allowing developers to define API structure, endpoints, and data formats in a machine-readable way. It simplifies API development, integration, and documentation.

OpenAPI is effective when used with AI agents and GPTs because it standardizes how these systems interact with various APIs, enabling reliable integrations and efficient communication.

By defining machine-readable API specifications, OpenAPI allows AI models like GPTs to understand and use varied data sources, improving accuracy. This accelerates development, reduces errors, and provides context-aware responses, making OpenAPI a core component for AI applications.

You can download the OpenAPI definitions for website-content-crawler from the options below:

If you’d like to learn more about how OpenAPI powers GPTs, read our blog post.

You can also check out our other API clients: