Website Content Crawler
Website Content Crawler is an Apify Actor that can perform a deep crawl of one or more websites to extract their content, such as documentation, knowledge bases, help articles, blog posts, or any other text content.
The actor was specifically designed to extract data for feeding, fine-tuning, or training large language models (LLMs) such as GPT-4, ChatGPT, or LLaMA, as well as other AI models. It automatically removes headers, footers, menus, ads, and other noise from web pages in order to return only the text content that can be directly fed to the models.
The actor has a simple input configuration so that it can be easily integrated into customer-facing products, where customers can enter just the URL of the website they want to have indexed by LLMs. The actor scales gracefully and can be used for small sites as well as sites with millions of pages. You can retrieve the results via API in formats such as JSON or CSV, which can be fed directly to your LLM, vector database, or straight into ChatGPT.
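For example, you can run the Actor and read the results from Python with the apify-client package (the same client installed in the LangChain example below). This is only a minimal sketch; the API token placeholder and the printed fields are illustrative:

```python
from apify_client import ApifyClient

# Illustrative token placeholder - use your own Apify API token.
client = ApifyClient("<YOUR_APIFY_API_TOKEN>")

# Start Website Content Crawler and wait for the run to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]}
)

# Iterate over the crawled pages stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], (item.get("text") or "")[:80])
```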
How does it work?
Website Content Crawler only needs one or more start URLs, typically the top-level URL of the documentation site, blog, or knowledge base that you want to scrape. The actor crawls the start URLs, finds links to other pages, and recursively crawls those pages too, as long as their URL is under the start URL.
For example, if you enter the start URL https://example.com/blog/, the actor will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else.
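In other words, the default scope is essentially a URL prefix check. The following is a simplified illustration only; the real crawler applies additional normalization (for example, treating `www.` and non-`www.` domains as equal, per the changelog below):

```python
# Simplified illustration of the scoping rule described above.
def is_in_scope(url: str, start_url: str) -> bool:
    return url.startswith(start_url)

start = "https://example.com/blog/"
assert is_in_scope("https://example.com/blog/article-1", start)
assert is_in_scope("https://example.com/blog/section/article-2", start)
assert not is_in_scope("https://example.com/docs/something-else", start)
```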
The actor also extracts important metadata about the content, such as the author, language, publishing date, etc. It can also save the full HTML and screenshots of the pages, which is useful for debugging.
Website Content Crawler can be further configured for optimal performance. For example, you can select the crawler type:
- Headless web browser (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower.
- Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.
- Raw HTTP client with JS execution (JSDOM) (experimental) - A compromise between a browser and raw HTTP crawlers. Good performance and should work on almost all websites including those with dynamic content. However, it is still experimental and might sometimes crash so we don't recommend it in production settings yet.
You can also set additional input parameters, such as the maximum number of pages, maximum crawling depth, maximum concurrency, proxy configuration, timeout, etc., to control the behavior and performance of the actor.
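As a rough illustration, such settings can be combined in the Actor input. The option names below are taken from this page and its changelog, but the values are made up for the example; check the Actor's input schema for the authoritative list:

```python
# Illustrative run_input sketch; values are placeholders, not recommendations.
run_input = {
    "startUrls": [{"url": "https://example.com/blog/"}],
    "crawlerType": "jsdom",        # e.g. the experimental JSDOM crawler
    "maxCrawlPages": 1000,         # stop after crawling this many pages
    "maxResults": 500,             # or after this many results, whichever comes first
    "dynamicContentWaitSecs": 10,  # max wait for dynamically rendered content
}
```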
Designed for generative AI and LLMs
The results of Website Content Crawler can help you feed, fine-tune, or train your large language models (LLMs), or provide context for ChatGPT prompts. In return, the model will answer questions based on your or your customers' websites and content.
Custom chatbots for customer support
Chatbots personalized on customer data such as documentation or knowledge bases are the next big thing for customer support and success teams. Let your customers simply type in the URL of their documentation or help center, and in minutes, your chatbot will have full knowledge about their product with zero integration costs.
Generate personalized content based on customer’s copy
ChatGPT and LLMs can write articles for you, but they won’t sound like you wrote them. Feed all your old blogs into your model to make it sound like you. Or train the model on your customers’ blogs and have it write in their tone of voice. Or help their technical writers with making first drafts of new documentation pages.
Summarization, translation, proofreading at scale
Got some old docs or blogs that need to be improved? Use Website Content Crawler to scrape the content, feed it to the ChatGPT API, and ask it to summarize, proofread, translate, or change the style of the content.
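As a minimal sketch of that workflow, the snippet below assumes the `openai` Python package (v1+) and uses an illustrative model name; `page_text` stands in for the `text` field of one crawled page:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(page_text: str) -> str:
    # Ask the model to summarize and proofread one crawled page.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "Summarize and proofread the following page."},
            {"role": "user", "content": page_text},
        ],
    )
    return response.choices[0].message.content
```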
Example
This example shows how to scrape all pages from the Apify documentation at https://docs.apify.com/:
Input
See full input with description.
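As an illustrative minimum (the full input offers many more options), the crawl can be started with just the documentation site's top-level URL, in the same format used by the LangChain example below:

```python
# Minimal illustrative input for crawling the Apify documentation.
run_input = {
    "startUrls": [{"url": "https://docs.apify.com/"}]
}
```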
Output
This is how one crawled page (https://docs.apify.com/academy/web-scraping-for-beginners) looks in a browser:
And here is how the crawling result looks in JSON format (note that other formats like CSV or Excel are also supported).
The main page content can be found in the `text` field, and it only contains the valuable content, without menus and other noise (the long `text` and `markdown` fields are abbreviated here):
1{ 2 "url": "https://docs.apify.com/academy/web-scraping-for-beginners", 3 "crawl": { 4 "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners", 5 "loadedTime": "2023-04-05T16:26:51.030Z", 6 "referrerUrl": "https://docs.apify.com/academy", 7 "depth": 0 8 }, 9 "metadata": { 10 "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners", 11 "title": "Web scraping for beginners | Apify Documentation", 12 "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.", 13 "author": null, 14 "keywords": null, 15 "languageCode": "en" 16 }, 17 "screenshotUrl": null, 18 "text": "Skip to main content\nOn this page\nWeb scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying this tutorial instead.\nThis course is made by Apify, the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the Apify platform course, where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\nWhy learn scraper development?\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\nCourse Summary\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. 
All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\nThis is what you'll learn in the Web scraping for beginners course:\nWeb scraping for beginners\nBasics of data extraction\nBasics of crawling\nBest practices\nRequirements\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. A seemingly insignificant thing like using [] instead of () can make a lot of difference.\nIf you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a JavaScript course and learning about CSS Selectors.\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\nIdeally, you should have at least a moderate understanding of the following concepts:\nJavaScript + Node.js\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and async...await), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\nasync...await (YouTube)\nJavaScript loops (MDN)\nModularity in Node.js\nGeneral web development\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be assumed (unless we're showing something out of the ordinary).\nHTML\nHTTP protocol\nDevTools\njQuery or Cheerio\nWe'll be using the Cheerio package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\nNext up\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. So let's get to it!\nIf you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the Basics of crawling section.\nWhy learn scraper development?\nCourse Summary\nRequirements\nJavaScript + Node.js\nGeneral web development\njQuery or Cheerio\nNext up", 19 "html": null, 20 "markdown": " Web scraping for beginners | Apify Documentation \n\n[Skip to main content](#docusaurus_skipToContent_fallback)\n\nOn this page\n\n# Web scraping for beginners\n\n**Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.**\n\n* * *\n\nWelcome to **Web scraping for beginners**, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. 
If you're looking for a quick start, we recommend trying [this tutorial](https://blog.apify.com/web-scraping-javascript-nodejs/) instead.\n\nThis course is made by [Apify](https://apify.com), the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\n\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the [Apify platform course](/academy/apify-platform), where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\n\n## Why learn scraper development?[](#why-learn \"Direct link to Why learn scraper development?\")\n\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\n\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\n\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\n\n## Course Summary[](#summary \"Direct link to Course Summary\")\n\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\n\nThis is what you'll learn in the **Web scraping for beginners** course:\n\n* [Web scraping for beginners](/academy/web-scraping-for-beginners)\n * [Basics of data extraction](/academy/web-scraping-for-beginners/data-collection)\n * [Basics of crawling](/academy/web-scraping-for-beginners/crawling)\n * [Best practices](/academy/web-scraping-for-beginners/best-practices)\n\n## Requirements[](#requirements \"Direct link to Requirements\")\n\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. 
A seemingly insignificant thing like using `[]` instead of `()` can make a lot of difference.\n\n> If you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a [JavaScript course](https://www.codecademy.com/learn/introduction-to-javascript) and learning about [CSS Selectors](https://www.w3schools.com/css/css_selectors.asp).\n\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\n\nIdeally, you should have at least a moderate understanding of the following concepts:\n\n### JavaScript + Node.js[](#javascript-and-node \"Direct link to JavaScript + Node.js\")\n\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\n\n* [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)\n* [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)\n* [Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)\n\n### General web development[](#general-web-development \"Direct link to General web development\")\n\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).\n\n* [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)\n* [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)\n* [DevTools](/academy/web-scraping-for-beginners/data-collection/browser-devtools)\n\n### jQuery or Cheerio[](#jquery-or-cheerio \"Direct link to jQuery or Cheerio\")\n\nWe'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\n\n## Next up[](#next \"Direct link to Next up\")\n\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. So [let's get to it!](/academy/web-scraping-for-beginners/introduction)\n\n> If you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the [Basics of crawling](/academy/web-scraping-for-beginners/crawling) section.\n\n* [Why learn scraper development?](#why-learn)\n* [Course Summary](#summary)\n* [Requirements](#requirements)\n * [JavaScript + Node.js](#javascript-and-node)\n * [General web development](#general-web-development)\n * [jQuery or Cheerio](#jquery-or-cheerio)\n* [Next up](#next)" 21}
LangChain integration
LangChain is the most popular framework for developing applications powered by language models. It provides an integration for Apify, so that you can feed Actor results directly to LangChain’s vector databases, enabling you to easily create ChatGPT-like query interfaces to websites with documentation, knowledge base, blog, etc.
Example
First, install LangChain with common LLMs and Apify API client for Python:
```
pip install langchain[llms] apify-client
```
And then create a ChatGPT-powered answering machine:
```python
from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper
import os

# Set up your Apify API token and OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()

# Run the Website Content Crawler on a website, wait for it to finish, and save
# its results into a LangChain document loader:
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

# Initialize the vector database with the text documents:
index = VectorstoreIndexCreator().from_loaders([loader])

# Finally, query the vector database:
query = "What is Apify?"
result = index.query_with_sources(query)
print(result["answer"])
print(result["sources"])
```
The query produces an answer like this:
Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.
https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples
For details and Jupyter notebook, see Apify integration for LangChain.
How much does Website Content Crawler cost?
You pay only for the Apify platform usage required by the Actor to crawl the websites and extract the content. The exact price depends on the crawler type and settings, website complexity, network speed, and random circumstances.
The main cost driver of Website Content Crawler is Apify compute units (CUs), where 1 CU corresponds to an actor with 1 GB of memory running for 1 hour. At the baseline price of $0.25 per CU, our tests show that the actor usage costs approximately:
- $0.5 - $5 per 1,000 web pages with a headless browser, depending on the website
- $0.2 per 1,000 web pages with raw HTTP crawler
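For example, a 10,000-page crawl would cost roughly 10 × $0.2 = $2 with the raw HTTP crawler, or anywhere from $5 to $50 with a headless browser, depending on the website.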
Note that the Apify Free plan gives you $5 free credits every month and access to Apify Proxy, which is sufficient for testing and low-volume use cases.
Troubleshooting
- If the extracted text doesn't contain the expected page content, try selecting a different Crawler type. Generally, a headless browser will extract more text because it loads dynamic page content and is less likely to be blocked.
- If the extracted text contains more than the expected page content (e.g. navigation or footer text), try selecting a different Text extractor, or use the Remove HTML elements setting to skip unwanted parts of the page, as sketched below.
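For example, extra page chrome can be stripped before text extraction. The sketch below assumes the `removeElementsCssSelector` input field mentioned in the changelog; the selector list itself is purely illustrative:

```python
# Illustrative only: remove additional elements before text extraction.
run_input = {
    "startUrls": [{"url": "https://example.com/"}],
    "removeElementsCssSelector": "nav, header, footer, .cookie-banner",
}
```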
Known limitations and development roadmap
Website Content Crawler is under active development. Here are some things that we're currently working on:
- Support for other file types such as PDF, TXT, DOCX, PPTX, or MD
- Support for language selection
- Integration with ChatGPT Retrieval Plugin to automatically update the vector database
- Other crawler types and automatic selection
- Support for website authentication
If you’re interested in these or other features, please get in touch at ai@apify.com or submit an issue.
Is it legal to scrape content?
Web scraping is generally legal if you scrape publicly available non-personal data. What you do with the data is another question. Documentation, help articles, or blogs are typically protected by copyright, so you can't republish the content without permission. However, to scrape your own or your customers’ documentation or blogs, you can easily get their consent, which can be given simply by accepting your terms of service. If you want to learn more about the legality of web scraping, read our detailed blog post.
Changelog
0.3.6 (2023-05-04)
- Input:
  - Made the `initialConcurrency` option visible in the input editor.
  - Added the `aggressivePruning` option. With this option set to `true`, the crawler will try to deduplicate the scraped content. This can be useful when the crawler is scraping a website with a lot of duplicate content (header menus, footers, etc.).
- Behavior:
  - The actor now stays alive and restarts the crawl on certain known errors (Playwright Assertion Error).
0.3.4 (2023-05-04)
- Input:
  - Added a new hidden option `initialConcurrency`. This option sets the initial number of web browsers or HTTP clients running in parallel during the actor run. Increasing this number can speed up the crawling process. Bear in mind this option is hidden and can be changed only by editing the actor input using the JSON editor.
0.3.3 (2023-04-28)
- Input:
  - Added a new option `maxResults` to limit the total number of results. If used together with `maxCrawlPages`, the crawler will stop when either of the limits is reached.
0.3.1 (2023-04-24)
- Input:
  - Added an option to download linked document files from the page - `saveFiles`. This is useful for downloading PDF, DOCX, XLSX, and similar files from the crawled pages. The files are saved to the default key-value store of the run, and links to the files are added to the dataset.
  - Added a new crawler - Stealthy web browser - that uses a Firefox browser with a stealthy profile. It is useful for crawling websites that block scraping.
0.0.13 (2023-04-18)
- Input:
  - Added a new `textExtractor` option, `readableText`. It is generally very accurate and has a good ratio of coverage to noise. It extracts only the main article body (similar to `unfluff`) but can work for more complex pages.
  - Added the `readableTextCharThreshold` option. It only applies to the `readableText` extractor and allows fine-tuning which part of the text should be focused on. That only matters for very complex pages where it is not obvious what should be extracted.
- Output:
  - Added a simplified output view, `Overview`, that has only `url` and `text` for a quick output check.
- Behavior:
  - Domains starting with `www.` are now considered equal to ones without it. This means that the start URL `https://apify.com` can enqueue `https://www.apify.com` and vice versa.
0.0.10 (2023-04-05)
- Input:
  - Added a new `crawlerType` option, `jsdom`, for processing with JSDOM. It allows client-side script processing, trying to mimic browser behavior in Node.js but with much better performance. This is still experimental and may crash on some particular pages.
  - Added the `dynamicContentWaitSecs` option (defaults to 10s), which is the maximum waiting time for dynamically rendered content.
- Output (BREAKING CHANGE):
  - Renamed `crawl.date` to `crawl.loadedTime`.
  - Moved `crawl.screenshotUrl` to the top-level object.
  - The `markdown` field was made visible.
  - Renamed `metadata.language` to `metadata.languageCode`.
  - Removed `metadata.createdAt` (for now).
  - Added `metadata.keywords`.
- Behavior:
  - Added waiting for dynamically rendered content (supported in the Headless browser and JSDOM crawlers). The crawler checks every half a second for content changes. When there are no changes for 2 seconds, the crawler proceeds to extraction.
0.0.7 (2023-03-30)
- Input:
  - BREAKING CHANGE: Added the `textExtractor` input option to choose how strictly to parse the content. Swapped the previous `unfluff` default for `CrawleeHtmlToText`, which in general will extract more text. We chose to output more text rather than less by default.
  - Added `removeElementsCssSelector`, which allows passing extra CSS selectors to further strip down the HTML before it is converted to text. This can help with fine-tuning. By default, the actor removes the page navigation bar, header, and footer.
- Output:
  - Added markdown to the output if the `saveMarkdown` option is chosen.
  - All extractor outputs plus the HTML as a link can be obtained if `debugMode` is set.
  - Added `pageType` to the output (only as `debug` for now); it will be fine-tuned in the future.
- Behavior:
  - Added deduplication by `canonicalUrl`. E.g., if different URLs point to the same canonical URL, they are skipped.
  - Skip pages that redirect outside the original start URLs' domain.
  - Only run a single text extractor unless in debug mode. This improves performance.