Act Crawler
Apify act compatible with the Apify crawler - same input ⟹ same output.
WARNING: This is an early version that may contain bugs and may not be fully compatible with the crawler product.
WARNING 2: It is also unstable and every version may contain breaking changes.
Usage
There are two ways to use this act:

- Pass the crawler configuration as the input of this act. In this case the input looks like:

  ```json
  {
      "startUrls": [{ "key": "", "value": "https://news.ycombinator.com" }],
      "maxParallelRequests": 10,
      "pageFunction": "function() { return context.jQuery('title').text(); }",
      "injectJQuery": true,
      "clickableElementsSelector": "a"
  }
  ```

- Pass the ID of your own crawler and the act fetches the configuration from that crawler. You can override any attribute you want in the act input:

  ```json
  {
      "crawlerId": "snoftq230dkcxm7w0",
      "clickableElementsSelector": "a"
  }
  ```

This act persists its state in the key-value store during the run and finally stores the results in files RESULTS-1.json, RESULTS-2.json, RESULTS-3.json, … .
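As an illustration, a results file can be downloaded from the run's key-value store through the Apify API. This is a minimal sketch, not part of the act itself; the store ID and the exact record key are placeholders you need to take from your run details:

```javascript
// Minimal sketch: download one results file from the run's key-value store.
// STORE_ID and the record key ("RESULTS-1.json") are assumptions - use the
// values from your own act run.
const https = require('https');

const storeId = 'YOUR_KEY_VALUE_STORE_ID';
const url = `https://api.apify.com/v2/key-value-stores/${storeId}/records/RESULTS-1.json`;

https.get(url, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => {
        // Assuming the file holds a JSON array of page results.
        const results = JSON.parse(body);
        console.log(`Downloaded ${results.length} page results`);
    });
});
```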
Input attributes
Crawler compatible attributes
The act supports the following crawler configuration attributes (for documentation see https://www.apify.com/docs/crawler#home):
| Attribute | Type | Default | Required | Description |
|---|---|---|---|---|
| startUrls | [{key: String, value: String}] | [] | yes | |
| pseudoUrls | [{key: String, value: String}] | | | |
| clickableElementsSelector | String | | | Currently supports only links (`a` elements) |
| pageFunction | Function | | yes | |
| interceptRequest | Function | | | |
| injectJQuery | Boolean | | | |
| injectUnderscore | Boolean | | | |
| maxPageRetryCount | Number | 3 | | |
| maxParallelRequests | Number | 1 | | |
| maxCrawledPagesPerSlave | Number | 50 | | |
| pageLoadTimeout | Number | 30s | | |
| customData | Any | | | |
| maxCrawledPages | Number | | | |
| maxOutputPages | Number | | | |
| considerUrlFragment | Boolean | false | | |
| maxCrawlDepth | Number | | | |
| maxInfiniteScrollHeight | Number | | | |
| cookies | [Object] | | | Currently used for all requests |
| pageFunctionTimeout | Number | 60000 | | |
| disableWebSecurity | Boolean | false | | |
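For illustration, a pageFunction might look like the sketch below. It follows the form used in the Usage example above (a global `context` object, jQuery available when injectJQuery is true); the exact set of context properties is an assumption based on the crawler documentation linked above, so verify it against your crawler version:

```javascript
// Illustrative pageFunction sketch - not a definitive reference.
function pageFunction() {
    const $ = context.jQuery;          // available when injectJQuery is true
    return {
        url: context.request.url,      // URL of the page being crawled
        title: $('title').text(),      // page title scraped with jQuery
        linkCount: $('a').length       // number of links found on the page
    };
}
```

When used in the act input, the function must be serialized into a string, as shown in the Usage section.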
Additional attributes
| Attribute | Type | Default | Required | Description |
|---|---|---|---|---|
| maxPagesPerFile | Number | 1000 | yes | Number of output pages saved into one results file. |
| browserInstanceCount | Number | 10 | yes | Number of browser instances to be used in the pool. |
| crawlerId | String | | | ID of a crawler to fetch the configuration from. |
| urlList | String | | | URL of a file containing URLs to be enqueued as startUrls. The file must either contain one URL per line or the urlListRegExp configuration attribute must be provided. |
| urlListRegExp | String | | | RegExp used to extract an array of URLs from the urlList file. It is applied against the file contents as `contentOfFile.match(new RegExp(urlListRegExp, 'g'))` and must return an array of URL strings. |
| userAgent | String | | | User agent to be used in the browser. |
| customProxies | [String] | | | Array of proxies to be used for browsing. |
| dumpio | Boolean | true | | If true, the Chrome console log is piped into the act run log. |
| saveSimplifiedResults | Boolean | false | | If true, a simplified version of the results is also output. |
| fullStackTrace | Boolean | false | | If true, request.errorInfo and the act log contain the full stack trace of each error. |
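For example, urlList and urlListRegExp might be combined as in the sketch below; the file URL, the pattern, and the other values are only illustrative placeholders, not recommended settings:

```json
{
    "urlList": "https://example.com/url-list.txt",
    "urlListRegExp": "https?://[^\\s\"]+",
    "pageFunction": "function() { return context.jQuery('title').text(); }",
    "injectJQuery": true,
    "maxPagesPerFile": 500
}
```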
Local usage
To run the act locally you must have Node.js installed:

- Clone this repository:

  ```
  git clone https://github.com/apifytech/act-crawler.git
  ```

- Install dependencies:

  ```
  npm install
  ```

- Configure the input in `/kv-store-dev/INPUT`
- Run it:

  ```
  npm run local
  ```
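As a starting point, the local INPUT file can contain the same JSON as the act input shown in the Usage section, for example (values are placeholders):

```json
{
    "startUrls": [{ "key": "", "value": "https://news.ycombinator.com" }],
    "pageFunction": "function() { return context.jQuery('title').text(); }",
    "injectJQuery": true,
    "clickableElementsSelector": "a",
    "maxParallelRequests": 5
}
```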



