# Act Crawler
Apify act compatible with the Apify Crawler product: same input ⟹ same output.

WARNING: This is an early version; it may contain bugs and may not be fully compatible with the Crawler product.

WARNING 2: It is also unstable and every version may contain breaking changes.
## Usage
There are two ways to use this act:
- Pass the crawler configuration as the input of this act (a fuller `pageFunction` sketch follows this list). In this case the input looks like:

  ```json
  {
      "startUrls": [{ "key": "", "value": "https://news.ycombinator.com" }],
      "maxParallelRequests": 10,
      "pageFunction": "function() { return context.jQuery('title').text(); }",
      "injectJQuery": true,
      "clickableElementsSelector": "a"
  }
  ```
- Pass the ID of your own crawler, and the act fetches the configuration from that crawler. You can override any attribute you want in the act input:

  ```json
  {
      "crawlerId": "snoftq230dkcxm7w0",
      "clickableElementsSelector": "a"
  }
  ```
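Because the input is JSON, the `pageFunction` is passed as a string. Written out as plain JavaScript, a slightly richer page function might look like the following sketch; it assumes `injectJQuery: true` so that `context.jQuery` is available, per the crawler documentation linked below:

```javascript
// A minimal pageFunction sketch, assuming injectJQuery: true so that
// context.jQuery is available (see https://www.apify.com/docs/crawler#home).
function pageFunction(context) {
    var $ = context.jQuery;
    return {
        url: context.request.url,             // URL of the crawled page
        title: $('title').text(),             // page title, as in the example above
        firstHeading: $('h1').first().text(),
    };
}
```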
The act persists its state in the key-value store during the run and finally stores the results in files `RESULTS-1.json`, `RESULTS-2.json`, `RESULTS-3.json`, …
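Once a run finishes, a results file can be fetched from that key-value store. This sketch uses the `apify-client` NPM package; the store ID and token are placeholders you need to fill in:

```javascript
// Sketch: fetch the first results file from the act's key-value store.
// 'MY_STORE_ID' and 'MY_APIFY_TOKEN' are placeholders.
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

client.keyValueStore('MY_STORE_ID')
    .getRecord('RESULTS-1.json')
    .then((record) => {
        console.log(record.value); // the stored results
    });
```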
## Input attributes

### Crawler compatible attributes

The act supports the following crawler configuration attributes (for documentation see https://www.apify.com/docs/crawler#home):
| Attribute | Type | Default | Required | Description |
|---|---|---|---|---|
| `startUrls` | `[{key: String, value: String}]` | `[]` | yes | |
| `pseudoUrls` | `[{key: String, value: String}]` | | | |
| `clickableElementsSelector` | `String` | | | Currently supports only links (`a` elements) |
| `pageFunction` | `Function` | | yes | |
| `interceptRequest` | `Function` | | | |
| `injectJQuery` | `Boolean` | | | |
| `injectUnderscore` | `Boolean` | | | |
| `maxPageRetryCount` | `Number` | `3` | | |
| `maxParallelRequests` | `Number` | `1` | | |
| `maxCrawledPagesPerSlave` | `Number` | `50` | | |
| `pageLoadTimeout` | `Number` | 30s | | |
| `customData` | `Any` | | | |
| `maxCrawledPages` | `Number` | | | |
| `maxOutputPages` | `Number` | | | |
| `considerUrlFragment` | `Boolean` | `false` | | |
| `maxCrawlDepth` | `Number` | | | |
| `maxInfiniteScrollHeight` | `Number` | | | |
| `cookies` | `[Object]` | | | Currently used for all requests |
| `pageFunctionTimeout` | `Number` | `60000` | | |
| `disableWebSecurity` | `Boolean` | `false` | | |
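Like `pageFunction`, the `interceptRequest` function is passed as a string in the input. Per the crawler documentation, it receives the context and a newly found request; returning the request enqueues it, while returning `null` skips it. A hedged sketch under those assumptions:

```javascript
// Sketch of an interceptRequest function. Assumes the documented crawler
// behavior: return the request to enqueue it, return null to skip it.
function interceptRequest(context, newRequest) {
    // Hypothetical rule: only follow links within news.ycombinator.com.
    if (!/news\.ycombinator\.com/.test(newRequest.url)) {
        return null;
    }
    return newRequest;
}
```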
### Additional attributes
| Attribute | Type | Default | Required | Description |
|---|---|---|---|---|
| `maxPagesPerFile` | `Number` | `1000` | yes | Number of output pages saved into one results file. |
| `browserInstanceCount` | `Number` | `10` | yes | Number of browser instances to be used in the pool. |
| `crawlerId` | `String` | | | ID of a crawler to fetch configuration from. |
| `urlList` | `String` | | | URL of a file containing URLs to be enqueued as `startUrls`. The file must either contain one URL per line, or the `urlListRegExp` attribute must be provided. |
| `urlListRegExp` | `String` | | | RegExp used to extract an array of URLs from the `urlList` file. It is applied against the file content as `contentOfFile.match(new RegExp(urlListRegExp, 'g'))` and must return an array of URL strings. For example `(http…` |
| `userAgent` | `String` | | | User agent to be used in the browser. |
| `customProxies` | `[String]` | | | Array of proxies to be used for browsing. |
| `dumpio` | `Boolean` | `true` | | If `true`, the Chrome console log will be piped into the act run log. |
| `saveSimplifiedResults` | `Boolean` | `false` | | If `true`, a simplified version of the results will also be output. |
| `fullStackTrace` | `Boolean` | `false` | | If `true`, `request.errorInfo` and the act log will contain the full stack trace of each error. |
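The `urlListRegExp` description above quotes the exact call the act applies to the file. The following sketch runs that call against a made-up file content with a hypothetical pattern:

```javascript
// Sketch: how urlListRegExp is applied to the urlList file content.
// Both the file content and the pattern below are hypothetical.
const contentOfFile = [
    'first entry  https://example.com/a',
    'second entry https://example.com/b',
].join('\n');

const urlListRegExp = 'https?:\\/\\/[^\\s]+'; // hypothetical pattern

// The call quoted in the table above:
const urls = contentOfFile.match(new RegExp(urlListRegExp, 'g'));
console.log(urls); // ['https://example.com/a', 'https://example.com/b']
```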
## Local usage

To run the act locally you must have Node.js installed:

- Clone this repository: `git clone https://github.com/apifytech/act-crawler.git`
- Install dependencies: `npm install`
- Configure the input in `/kv-store-dev/INPUT` (a sample input is sketched below)
- Run it: `npm run local`
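A minimal local `INPUT` might look like the following sketch; it reuses the first usage example above, and all values are only illustrative:

```json
{
    "startUrls": [{ "key": "", "value": "https://news.ycombinator.com" }],
    "pageFunction": "function() { return context.jQuery('title').text(); }",
    "injectJQuery": true,
    "clickableElementsSelector": "a",
    "maxParallelRequests": 10
}
```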