Act Crawler

Apify act compatible with Apify crawler - same input ⟹ same output.

WARNING: This is an early version; it may contain bugs and may not be fully compatible with the Crawler product.

WARNING 2: It's also unstable, and every version may contain breaking changes.

Usage

There are two ways to use this act:

  • Pass the crawler configuration as the input of this act. In this case the input looks like:

    ```json
    {
      "startUrls": [{ "key": "", "value": "https://news.ycombinator.com" }],
      "maxParallelRequests": 10,
      "pageFunction": "function() { return context.jQuery('title').text(); }",
      "injectJQuery": true,
      "clickableElementsSelector": "a"
    }
    ```
  • Pass the ID of your own crawler and the act fetches the configuration from that crawler. You can override any attribute you want in the act input:

    ```json
    {
      "crawlerId": "snoftq230dkcxm7w0",
      "clickableElementsSelector": "a"
    }
    ```
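As a runnable sketch of what a `pageFunction` might look like before being serialized into the input: the crawler provides a `context` object (whether it arrives as a global or an argument is a crawler implementation detail; the argument form is used here so the sketch runs standalone), and the mock `context` below is purely an assumption for illustration.

```javascript
// Sketch of a pageFunction such as the one passed in the "pageFunction"
// input attribute. context.jQuery is available when "injectJQuery" is true.
const pageFunction = function (context) {
    const $ = context.jQuery;
    return {
        title: $('title').text(),
        url: context.request.url,
    };
};

// Hypothetical mock context, only to show the shape of the returned result;
// the real crawler injects its own context object.
const mockContext = {
    jQuery: () => ({ text: () => 'Hacker News' }),
    request: { url: 'https://news.ycombinator.com' },
};

console.log(pageFunction(mockContext));
```

The returned object is what ends up in the results files for that page.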

This act persists its state in the key-value store during the run and finally stores the results in files RESULTS-1.json, RESULTS-2.json, RESULTS-3.json, etc.
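If the results are split by the `maxPagesPerFile` input attribute (default 1000), the n-th crawled page lands in file number ⌈n / maxPagesPerFile⌉. A small sketch, assuming that splitting rule (the helper name is hypothetical, not part of the act's API):

```javascript
// Hypothetical helper: which RESULTS-*.json file the n-th crawled page
// (1-based) ends up in, given the maxPagesPerFile input attribute.
const resultsFileName = (pageIndex, maxPagesPerFile = 1000) =>
    `RESULTS-${Math.ceil(pageIndex / maxPagesPerFile)}.json`;

console.log(resultsFileName(1));    // RESULTS-1.json
console.log(resultsFileName(1000)); // RESULTS-1.json
console.log(resultsFileName(1001)); // RESULTS-2.json
```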

Input attributes

Crawler compatible attributes

The act supports the following crawler configuration attributes (for documentation see https://www.apify.com/docs/crawler#home):

| Attribute | Type | Default | Required | Description |
|---|---|---|---|---|
| startUrls | [{key: String, value: String}] | [] | yes | |
| pseudoUrls | [{key: String, value: String}] | | | |
| clickableElementsSelector | String | | | Currently supports only links (a elements) |
| pageFunction | Function | | yes | |
| interceptRequest | Function | | | |
| injectJQuery | Boolean | | | |
| injectUnderscore | Boolean | | | |
| maxPageRetryCount | Number | 3 | | |
| maxParallelRequests | Number | 1 | | |
| maxCrawledPagesPerSlave | Number | 50 | | |
| pageLoadTimeout | Number | 30s | | |
| customData | Any | | | |
| maxCrawledPages | Number | | | |
| maxOutputPages | Number | | | |
| considerUrlFragment | Boolean | false | | |
| maxCrawlDepth | Number | | | |
| maxInfiniteScrollHeight | Number | | | |
| cookies | [Object] | | | Currently used for all requests |
| pageFunctionTimeout | Number | 60000 | | |
| disableWebSecurity | Boolean | false | | |

Additional attributes

| Attribute | Type | Default | Required | Description |
|---|---|---|---|---|
| maxPagesPerFile | Number | 1000 | yes | Number of output pages saved into one results file. |
| browserInstanceCount | Number | 10 | yes | Number of browser instances to be used in the pool. |
| crawlerId | String | | | ID of a crawler to fetch configuration from. |
| urlList | String | | | Url of a file containing urls to be enqueued as startUrls. The file must either contain one url per line, or the urlListRegExp attribute must be provided. |
| urlListRegExp | String | | | RegExp to match an array of urls from the urlList file. It is applied against the file content and must return an array of url strings: `contentOfFile.match(new RegExp(urlListRegExp, 'g'));` For example `(http` |
| userAgent | String | | | User agent to be used in the browser. |
| customProxies | [String] | | | Array of proxies to be used for browsing. |
| dumpio | Boolean | true | | If true then the Chrome console log will be piped into the act run log. |
| saveSimplifiedResults | Boolean | false | | If true then a simplified version of the results will also be output. |
| fullStackTrace | Boolean | false | | If true then request.errorInfo and the act log will contain the full stack trace of each error. |
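To illustrate how `urlListRegExp` is applied, the act matches the pattern globally against the file content and enqueues the resulting strings. The sample file content and pattern below are assumptions for illustration; only the `match()` call mirrors the act's documented behavior:

```javascript
// Assumed content of a urlList file (illustrative only):
const contentOfFile = 'see https://example.com/a and https://example.com/b\n';

// A hypothetical urlListRegExp value matching http(s) urls:
const urlListRegExp = 'https?:\\/\\/[^\\s]+';

// This mirrors how the act extracts urls from the urlList file:
const urls = contentOfFile.match(new RegExp(urlListRegExp, 'g'));
console.log(urls); // [ 'https://example.com/a', 'https://example.com/b' ]
```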

Local usage

To run the act locally you must have Node.js installed:

  • Clone this repository: `git clone https://github.com/apifytech/act-crawler.git`
  • Install dependencies: `npm install`
  • Configure the input in `/kv-store-dev/INPUT`
  • Run it: `npm run local`
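As a hedged example of what the `/kv-store-dev/INPUT` file could contain for a local run (the values are illustrative, taken from the input example above):

```json
{
  "startUrls": [{ "key": "", "value": "https://news.ycombinator.com" }],
  "maxParallelRequests": 10,
  "pageFunction": "function() { return context.jQuery('title').text(); }",
  "injectJQuery": true
}
```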