website-content-crawler

acceptable_seahorse/website-content-crawler

This Actor is under maintenance and may be unreliable.
Website Content Crawler is an Apify Actor that can perform a deep crawl of one or more websites to extract their content, such as documentation, knowledge bases, help articles, blog posts, or any other text content.

The Actor was specifically designed to extract data for feeding, fine-tuning, or training large language models (LLMs) such as GPT-4, ChatGPT, or LLaMA, and other AI models. It automatically removes headers, footers, menus, ads, and other noise from web pages in order to return only the text content that can be fed directly to the models.

The Actor has a simple input configuration so that it can be easily integrated into customer-facing products, where customers enter just the URL of the website they want to have indexed by LLMs. The Actor scales gracefully and works for small sites as well as sites with millions of pages. You can retrieve the results via API in formats such as JSON or CSV, which can be fed directly to your LLM, vector database, or ChatGPT.

How does it work?

Website Content Crawler only needs one or more start URLs, typically the top-level URL of the documentation site, blog, or knowledge base that you want to scrape. The actor crawls the start URLs, finds links to other pages, and recursively crawls those pages too, as long as their URL is under the start URL.

For example, if you enter the start URL https://example.com/blog/, the actor will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else.
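The scoping rule above can be sketched as a simple same-host path-prefix check. This is only an illustration of the behavior described, not the Actor's actual implementation:

```python
from urllib.parse import urlparse

def is_under_start_url(start_url: str, candidate: str) -> bool:
    """Illustrative check: a candidate URL is in scope only if it is on
    the same host and its path falls under the start URL's path."""
    start, cand = urlparse(start_url), urlparse(candidate)
    return (cand.scheme in ("http", "https")
            and cand.netloc == start.netloc
            and cand.path.startswith(start.path))

# The examples from the text above:
print(is_under_start_url("https://example.com/blog/",
                         "https://example.com/blog/article-1"))        # True
print(is_under_start_url("https://example.com/blog/",
                         "https://example.com/docs/something-else"))   # False
```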

The Actor also extracts important metadata about the content, such as the author, language, and publishing date. It can also save the full HTML and screenshots of the pages, which is useful for debugging.

Website Content Crawler can be further configured for optimal performance. For example, you can select the crawler type:

  • Headless web browser (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower.
  • Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.
  • Raw HTTP client with JS execution (JSDOM) (experimental) - A compromise between the browser and raw HTTP crawlers. It offers good performance and should work on almost all websites, including those with dynamic content. However, it is still experimental and might sometimes crash, so we don't recommend it for production use yet.

You can also set additional input parameters such as a maximum number of pages, maximum crawling depth, maximum concurrency, proxy configuration, timeout, etc. to control the behavior and performance of the actor.
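For illustration, an input configuration combining several of these options might look like the fragment below. Field names such as `startUrls`, `crawlerType`, `maxCrawlPages`, `maxResults`, and `saveMarkdown` appear elsewhere on this page; treat the `proxyConfiguration` shape and the specific values as an example, not a recommendation:

```json
{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "crawlerType": "jsdom",
  "maxCrawlPages": 100,
  "maxResults": 50,
  "saveMarkdown": true,
  "proxyConfiguration": { "useApifyProxy": true }
}
```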

Designed for generative AI and LLMs

The results of Website Content Crawler can help you feed, fine-tune, or train your large language models (LLMs) or provide context for ChatGPT prompts. In return, the model will answer questions based on your or your customers' websites and content.

Custom chatbots for customer support

Chatbots personalized on customer data such as documentation or knowledge bases are the next big thing for customer support and success teams. Let your customers simply type in the URL of their documentation or help center, and in minutes, your chatbot will have full knowledge about their product with zero integration costs.

Generate personalized content based on customer’s copy

ChatGPT and LLMs can write articles for you, but they won’t sound like you wrote them. Feed all your old blogs into your model to make it sound like you. Or train the model on your customers’ blogs and have it write in their tone of voice. Or help their technical writers with making first drafts of new documentation pages.

Summarization, translation, proofreading at scale

Got some old docs or blogs that need improving? Use Website Content Crawler to scrape the content, feed it to the ChatGPT API, and ask it to summarize, proofread, translate, or change the style of the content.
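In practice, the crawled `text` field is usually split into chunks that fit the model's context window before being sent for summarization. The chunking helper below is an illustrative sketch (the actual API call is left out):

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split crawled text into roughly max_chars-sized chunks,
    preferring to break on line boundaries."""
    chunks, current = [], ""
    for para in text.split("\n"):
        if len(current) + len(para) + 1 > max_chars and current:
            chunks.append(current)
            current = ""
        current += para + "\n"
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be sent to the ChatGPT API with a prompt like
# "Summarize the following documentation page: ..."
```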

Example

This example shows how to scrape all pages from the Apify documentation at https://docs.apify.com/:

Input

input-screenshot.png

See full input with description.

Output

This is how one crawled page (https://docs.apify.com/academy/web-scraping-for-beginners) looks in a browser:

page-screenshot.png

And here is how the crawling result looks in JSON format (note that other formats like CSV or Excel are also supported). The main page content can be found in the text field, and it only contains the valuable content, without menus and other noise:

{
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "crawl": {
        "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "loadedTime": "2023-04-05T16:26:51.030Z",
        "referrerUrl": "https://docs.apify.com/academy",
        "depth": 0
    },
    "metadata": {
        "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "title": "Web scraping for beginners | Apify Documentation",
        "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
        "author": null,
        "keywords": null,
        "languageCode": "en"
    },
    "screenshotUrl": null,
    "text": "Skip to main content\nOn this page\nWeb scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying this tutorial instead.\nThis course is made by Apify, the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the Apify platform course, where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\nWhy learn scraper development?​\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. 
You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\nCourse Summary​\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\nThis is what you'll learn in the Web scraping for beginners course:\nWeb scraping for beginners\nBasics of data extraction\nBasics of crawling\nBest practices\nRequirements​\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. A seemingly insignificant thing like using [] instead of () can make a lot of difference.\nIf you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a JavaScript course and learning about CSS Selectors.\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\nIdeally, you should have at least a moderate understanding of the following concepts:\nJavaScript + Node.js​\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. 
If you are not yet comfortable with asynchronous programming (with promises and async...await), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\nasync...await (YouTube)\nJavaScript loops (MDN)\nModularity in Node.js\nGeneral web development​\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be assumed (unless we're showing something out of the ordinary).\nHTML\nHTTP protocol\nDevTools\njQuery or Cheerio​\nWe'll be using the Cheerio package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\nNext up​\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. So let's get to it!\nIf you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the Basics of crawling section.\nWhy learn scraper development?\nCourse Summary\nRequirements\nJavaScript + Node.js\nGeneral web development\njQuery or Cheerio\nNext up",
    "html": null,
    "markdown": "  Web scraping for beginners | Apify Documentation       \n\n[Skip to main content](#docusaurus_skipToContent_fallback)\n\nOn this page\n\n# Web scraping for beginners\n\n**Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.**\n\n* * *\n\nWelcome to **Web scraping for beginners**, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying [this tutorial](https://blog.apify.com/web-scraping-javascript-nodejs/) instead.\n\nThis course is made by [Apify](https://apify.com), the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\n\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the [Apify platform course](/academy/apify-platform), where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\n\n## Why learn scraper development?[​](#why-learn \"Direct link to Why learn scraper development?\")\n\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\n\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. 
The possibilities are endless once you know how scraping really works.\n\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\n\n## Course Summary[​](#summary \"Direct link to Course Summary\")\n\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\n\nThis is what you'll learn in the **Web scraping for beginners** course:\n\n*   [Web scraping for beginners](/academy/web-scraping-for-beginners)\n    *   [Basics of data extraction](/academy/web-scraping-for-beginners/data-collection)\n    *   [Basics of crawling](/academy/web-scraping-for-beginners/crawling)\n    *   [Best practices](/academy/web-scraping-for-beginners/best-practices)\n\n## Requirements[​](#requirements \"Direct link to Requirements\")\n\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. 
A seemingly insignificant thing like using `[]` instead of `()` can make a lot of difference.\n\n> If you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a [JavaScript course](https://www.codecademy.com/learn/introduction-to-javascript) and learning about [CSS Selectors](https://www.w3schools.com/css/css_selectors.asp).\n\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\n\nIdeally, you should have at least a moderate understanding of the following concepts:\n\n### JavaScript + Node.js[​](#javascript-and-node \"Direct link to JavaScript + Node.js\")\n\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\n\n*   [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)\n*   [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)\n*   [Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)\n\n### General web development[​](#general-web-development \"Direct link to General web development\")\n\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. 
This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).\n\n*   [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)\n*   [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)\n*   [DevTools](/academy/web-scraping-for-beginners/data-collection/browser-devtools)\n\n### jQuery or Cheerio[​](#jquery-or-cheerio \"Direct link to jQuery or Cheerio\")\n\nWe'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\n\n## Next up[​](#next \"Direct link to Next up\")\n\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. So [let's get to it!](/academy/web-scraping-for-beginners/introduction)\n\n> If you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the [Basics of crawling](/academy/web-scraping-for-beginners/crawling) section.\n\n*   [Why learn scraper development?](#why-learn)\n*   [Course Summary](#summary)\n*   [Requirements](#requirements)\n    *   [JavaScript + Node.js](#javascript-and-node)\n    *   [General web development](#general-web-development)\n    *   [jQuery or Cheerio](#jquery-or-cheerio)\n*   [Next up](#next)"
}

LangChain integration

LangChain is the most popular framework for developing applications powered by language models. It provides an integration for Apify, so that you can feed Actor results directly to LangChain’s vector databases, enabling you to easily create ChatGPT-like query interfaces to websites with documentation, knowledge base, blog, etc.

Example

First, install LangChain with its common LLM dependencies, together with the Apify API client for Python:

pip install langchain[llms] apify-client

And then create a ChatGPT-powered answering machine:

from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper
import os

# Set up your Apify API token and OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()

# Run the Website Content Crawler on a website, wait for it to finish, and save
# its results into a LangChain document loader:
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

# Initialize the vector database with the text documents:
index = VectorstoreIndexCreator().from_loaders([loader])

# Finally, query the vector database:
query = "What is Apify?"
result = index.query_with_sources(query)
print(result["answer"])
print(result["sources"])

The query produces an answer like this:

Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.

https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples

For details and Jupyter notebook, see Apify integration for LangChain.

How much does Website Content Crawler cost?

You pay only for the Apify platform usage required by the Actor to crawl the websites and extract the content. The exact price depends on the crawler type and settings, website complexity, network speed, and random circumstances.

The main cost driver of Website Content Crawler is Actor compute units (CUs), where 1 CU corresponds to an Actor with 1 GB of memory running for 1 hour. At the baseline price of $0.25 per CU, our tests show that Actor usage costs approximately:

  • $0.5–$5 per 1,000 web pages with a headless browser, depending on the website
  • $0.2 per 1,000 web pages with the raw HTTP crawler
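As a rough back-of-the-envelope estimate using the per-page figures above (actual consumption varies with website complexity and settings):

```python
PRICE_PER_CU = 0.25  # USD per compute unit (1 GB of memory running for 1 hour)

def estimated_cost_usd(pages: int, usd_per_1000_pages: float) -> float:
    """Scale the measured per-1,000-page cost to a given page count."""
    return pages / 1000 * usd_per_1000_pages

# Crawling 10,000 pages at the cheap end of the headless-browser range:
print(estimated_cost_usd(10_000, 0.5))  # 5.0
# The same crawl with the raw HTTP crawler:
print(estimated_cost_usd(10_000, 0.2))  # 2.0
```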

Note that the Apify Free plan gives you $5 free credits every month and access to Apify Proxy, which is sufficient for testing and low-volume use cases.

Troubleshooting

  • If the extracted text doesn’t contain the expected page content, try selecting another Crawler type. Generally, a headless browser will extract more text, as it loads dynamic page content and is less likely to be blocked.
  • If the extracted text contains more than the expected page content (e.g. navigation or footer), try selecting another Text extractor, or use the Remove HTML elements setting to skip unwanted parts of the page.
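For instance, unwanted page parts can be stripped with a CSS selector via the `removeElementsCssSelector` option (mentioned in the changelog below). The URL and selector values here are purely illustrative:

```json
{
  "startUrls": [{ "url": "https://example.com/docs/" }],
  "removeElementsCssSelector": "nav, header, footer, .cookie-banner"
}
```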

Known limitations and development roadmap

Website Content Crawler is under active development. Here are some things that we're currently working on:

  • Support other files such as PDF, TXT, DOCX, PPTX, or MD
  • Support for language selection
  • Integration with ChatGPT Retrieval Plugin to automatically update the vector database
  • Other crawler types and automatic selection
  • Support for website authentication

If you’re interested in these or other features, please get in touch at ai@apify.com or submit an issue.

Is it legal to scrape these websites?

Web scraping is generally legal if you scrape publicly available non-personal data. What you do with the data is another question. Documentation, help articles, and blog posts are typically protected by copyright, so you can't republish the content without permission. However, to scrape your own or your customers’ documentation or blogs, you can easily get their consent, which can be given simply by accepting your terms of service. If you want to learn more about the legality of web scraping, read our detailed blog post.

Changelog

0.3.6 (2023-05-04)

  • Input:
    • Made the initialConcurrency option visible in the input editor.
    • Added aggressivePruning option. With this option set to true, the crawler will try to deduplicate the scraped content. This can be useful when the crawler is scraping a website with a lot of duplicate content (header menus, footers, etc.)
  • Behavior:
    • The actor now stays alive and restarts the crawl on certain known errors (Playwright Assertion Error).

0.3.4 (2023-05-04)

  • Input:
    • Added a new hidden option initialConcurrency. This option sets the initial number of web browsers or HTTP clients running in parallel during the actor run. Increasing this number can speed up the crawling process. Bear in mind this option is hidden and can be changed only by editing the actor input using the JSON editor.

0.3.3 (2023-04-28)

  • Input:
    • Added a new option maxResults to limit the total number of results. If used with maxCrawlPages, the crawler will stop when either of the limits is reached.

0.3.1 (2023-04-24)

  • Input:
    • Added an option to download linked document files from the page - saveFiles. This is useful for downloading PDF, DOCX, XLSX, and other files from the crawled pages. The files are saved to the default key-value store of the run, and links to the files are added to the dataset.
    • Added a new crawler - Stealthy web browser - that uses a Firefox browser with a stealthy profile. It is useful for crawling websites that block scraping.

0.0.13 (2023-04-18)

  • Input:
    • Added new textExtractor option readableText. It is generally very accurate and has a good ratio of coverage to noise. It extracts only the main article body (similar to unfluff) but can work for more complex pages.
    • Added readableTextCharThreshold option. This only applies to readableText extractor. It allows fine-tuning which part of the text should be focused on. That only matters for very complex pages where it is not obvious what should be extracted.
  • Output:
    • Added simplified output view Overview that has only url and text for quick output check
  • Behavior:
    • Domains starting with www. are now considered equal to ones without it. This means that the start URL https://apify.com can enqueue https://www.apify.com and vice versa.
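The `www.` equivalence described above can be sketched as a simple host normalization (an illustration, not the Actor's code):

```python
from urllib.parse import urlparse

def normalized_host(url: str) -> str:
    """Treat 'www.example.com' and 'example.com' as the same domain."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

print(normalized_host("https://apify.com"))      # apify.com
print(normalized_host("https://www.apify.com"))  # apify.com
```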

0.0.10 (2023-04-05)

  • Input:
    • Added new crawlerType option jsdom for processing with JSDOM. It allows client-side script processing, trying to mimic the browser behavior in Node.js but with much better performance. This is still experimental and may crash on some particular pages.
    • Added dynamicContentWaitSecs option (defaults to 10s), which is the maximum waiting time for dynamic waiting.
  • Output (BREAKING CHANGE):
    • Renamed crawl.date to crawl.loadedTime
    • Moved crawl.screenshotUrl to top-level object
    • The markdown field was made visible
    • Renamed metadata.language to metadata.languageCode
    • Removed metadata.createdAt (for now)
    • Added metadata.keywords
  • Behavior:
    • Added waiting for dynamically rendered content (supported in Headless browser and JSDOM crawlers). The crawler checks every half a second for content changes. When there are no changes for 2 seconds, the crawler proceeds to extraction.
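The waiting strategy above (poll every half second, proceed after 2 seconds without changes, capped by `dynamicContentWaitSecs`) can be sketched generically. The `get_content` callback below is hypothetical; in the Actor it would snapshot the rendered page:

```python
import time

def wait_for_stable_content(get_content, poll_secs=0.5,
                            stable_secs=2.0, max_wait_secs=10.0) -> str:
    """Poll get_content() until it stops changing for stable_secs,
    or until max_wait_secs elapses; return the last snapshot."""
    deadline = time.monotonic() + max_wait_secs
    last = get_content()
    stable_since = time.monotonic()
    while time.monotonic() < deadline:
        if time.monotonic() - stable_since >= stable_secs:
            break  # content unchanged long enough; proceed to extraction
        time.sleep(poll_secs)
        current = get_content()
        if current != last:
            last, stable_since = current, time.monotonic()
    return last
```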

0.0.7 (2023-03-30)

  • Input:
    • BREAKING CHANGE: Added textExtractor input option to choose how strictly to parse the content. Swapped the previous unfluff for CrawleeHtmlToText as default which in general will extract more text. We chose to output more text rather than less by default.
    • Added removeElementsCssSelector which allows passing extra CSS selectors to further strip down the HTML before it is converted to text. This can help fine-tuning. By default, the actor removes the page navigation bar, header, and footer.
  • Output:
    • Added markdown to output if saveMarkdown option is chosen
    • All extractor outputs + HTML as a link can be obtained if debugMode is set.
    • Added pageType to the output (only as debug for now), it will be fine-tuned in the future.
  • Behavior:
    • Added deduplication by canonicalUrl: if several different URLs point to the same canonical URL, only the first one is kept and the rest are skipped
    • Skip pages that redirect outside the original start URLs domain.
    • Only run a single text extractor unless in debug mode. This improves performance.
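Deduplication by canonical URL amounts to keeping a set of seen canonical URLs and skipping repeats; a minimal sketch over result items shaped like the JSON output above:

```python
def dedupe_by_canonical(results: list[dict]) -> list[dict]:
    """Keep only the first result per canonicalUrl; results without
    a canonical URL fall back to their own url as the key."""
    seen, unique = set(), []
    for item in results:
        key = item.get("metadata", {}).get("canonicalUrl") or item["url"]
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```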
Developer
Maintained by Community

Actor Metrics

  • 2 monthly users

  • 1 star

  • >99% runs succeeded

  • Created in May 2023

  • Modified 2 years ago
