# Publication test 1

- Pricing: Pay per usage
- Rating: 0.0 (0 reviews)
- Developer: Jan Novotny (Maintained by Community)
- Actor stats: 0 bookmarked, 3 total users, 2 monthly active users
- Last modified: 8 days ago
## .actor/actor.json

```json
{
    "$schema": "https://apify.com/schemas/v1/actor.ide.json",
    "actorSpecification": 1,
    "name": "my-actor-15",
    "title": "Project Cheerio Crawler Javascript",
    "description": "Crawlee and Cheerio project in javascript.",
    "version": "0.0",
    "meta": {
        "templateId": "js-crawlee-cheerio",
        "generatedBy": "<FILL-IN-MODEL>"
    },
    "input": "./input_schema.json",
    "output": "./output_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    },
    "dockerfile": "../Dockerfile"
}
```

## .actor/dataset_schema.json

```json
{
    "$schema": "https://apify.com/schemas/v1/dataset.ide.json",
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": ["title", "url"]
            },
            "display": {
                "component": "table",
                "properties": {
                    "title": { "label": "Title", "format": "text" },
                    "url": { "label": "URL", "format": "link" }
                }
            }
        }
    }
}
```

## .actor/input_schema.json

```json
{
    "$schema": "https://apify.com/schemas/v1/input.ide.json",
    "title": "CheerioCrawler Template",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with.",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://crawlee.dev" }]
        },
        "maxRequestsPerCrawl": {
            "title": "Max Requests per Crawl",
            "type": "integer",
            "description": "Maximum number of requests that can be made by this crawler.",
            "default": 100
        }
    }
}
```

## .actor/output_schema.json

```json
{
    "$schema": "https://apify.com/schemas/v1/output.ide.json",
    "actorOutputSchemaVersion": 1,
    "title": "Output schema",
    "properties": {
        "results": {
            "type": "string",
            "title": "Results",
            "template": "{{links.apiDefaultDatasetUrl}}/items"
        }
    }
}
```

## .dockerignore

```
# configurations
.idea
.vscode
.zed

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
node_modules

# git folder
.git
```

## .editorconfig

```
root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf
quote_type = single
```

## .gitignore

```
# This file tells Git which files shouldn't be added to source control
.DS_Store
.idea
.vscode
.zed
dist
node_modules
apify_storage
storage
```

## .prettierrc

```json
{
    "printWidth": 120,
    "tabWidth": 4,
    "singleQuote": true
}
```

# Apify Actors Development Guide
Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.
## What are Apify Actors?

- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
- Actors are programs packaged as Docker images that run in isolated containers.

## Core Concepts

- Accept well-defined JSON input
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours or even indefinitely
- Persist state and can be restarted
## Do

- accept well-defined JSON input and produce structured JSON output
- use Apify SDK (`apify`) for code running ON Apify platform
- validate input early with proper error handling and fail gracefully
- use CheerioCrawler for static HTML content (10x faster than browsers)
- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
- use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
- implement retry strategies with exponential backoff for failed requests
- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
- set sensible defaults in `.actor/input_schema.json` for all optional fields
- set up output schema in `.actor/output_schema.json`
- clean and validate data before pushing to dataset
- use semantic CSS selectors and fallback strategies for missing elements
- respect robots.txt, ToS, and implement rate limiting with delays
- check which tools (cheerio/playwright/crawlee) are installed before applying guidance
- use `apify/log` package for logging (censors sensitive data)
- implement readiness probe handler for standby Actors
- handle the `aborting` event to gracefully shut down when Actor is stopped

## Don't

- do not rely on `Dataset.getInfo()` for final counts on Cloud platform
- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
- do not hard code values that should be in input schema or environment variables
- do not skip input validation or error handling
- do not overload servers - use appropriate concurrency and delays
- do not scrape prohibited content or ignore Terms of Service
- do not store personal/sensitive data unless explicitly permitted
- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead
- do not assume that local storage is persistent or automatically synced to Apify Console - when running locally with `apify run`, the `storage/` directory is local-only and is NOT pushed to the Cloud
- do not disable standby mode (`usesStandbyMode: false`) without explicit permission
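The retry-with-exponential-backoff advice can be sketched as a small standalone helper. This is a hypothetical `withRetries` wrapper, not an Apify SDK API - Crawlee's built-in `maxRequestRetries` already covers per-request retries, so a helper like this is only for one-off calls outside a crawler, and the delay values are illustrative:

```javascript
import { setTimeout as sleep } from 'node:timers/promises';

// Hypothetical retry helper with exponential backoff; the base delay and
// retry count are example values, not Apify or Crawlee defaults.
async function withRetries(fn, { maxRetries = 3, baseDelayMs = 200 } = {}) {
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
            if (attempt < maxRetries) {
                // 200 ms, 400 ms, 800 ms, ... doubling after every failed attempt
                await sleep(baseDelayMs * 2 ** attempt);
            }
        }
    }
    throw lastError;
}
```

For retries of crawler requests themselves, prefer the crawler's own options over hand-rolled loops.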
## Logging

- **ALWAYS use the `apify/log` package for logging** - This package contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs

### Available Log Levels in `apify/log`

The Apify log package provides the following methods for logging:

- `log.debug()` - Debug level logs (detailed diagnostic information)
- `log.info()` - Info level logs (general informational messages)
- `log.warning()` - Warning level logs (warning messages for potentially problematic situations)
- `log.warningOnce()` - Warning level logs (same warning message logged only once)
- `log.error()` - Error level logs (error messages for failures)
- `log.exception()` - Exception level logs (for exceptions with stack traces)
- `log.perf()` - Performance level logs (performance metrics and timing information)
- `log.deprecated()` - Deprecation level logs (warnings about deprecated code)
- `log.softFail()` - Soft failure logs (non-critical failures that don't stop execution, e.g., input validation errors, skipped items)
- `log.internal()` - Internal level logs (internal/system messages)

**Best practices:**

- Use `log.debug()` for detailed operation-level diagnostics (inside functions)
- Use `log.info()` for general informational messages (API requests, successful operations)
- Use `log.warning()` for potentially problematic situations (validation failures, unexpected states)
- Use `log.error()` for actual errors and failures
- Use `log.exception()` for caught exceptions with stack traces
## Graceful Abort Handling

Handle the `aborting` event to terminate the Actor quickly when stopped by user or platform, minimizing costs especially for PPU/PPE+U billing.

```javascript
import { Actor } from 'apify';
import { setTimeout } from 'node:timers/promises';

Actor.on('aborting', async () => {
    // Persist any state, do any cleanup you need, and terminate the Actor explicitly
    // with `await Actor.exit()` as soon as possible. This helps the Actor make a best
    // effort to honor any cost limits the user set on a single run.
    // Wait 1 second to allow Crawlee/SDK useState and other state persistence operations
    // to complete. This is a temporary workaround until the SDK implements proper state
    // persistence in the aborting event.
    await setTimeout(1000);
    await Actor.exit();
});
```
## Standby Mode

- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Actor Standby mode keeps the Actor ready in the background, waiting for incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running the logic once to process everything in batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it.
- **ALWAYS implement a readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at the GET / endpoint to ensure proper Actor lifecycle management.

You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.
### Readiness Probe Implementation Example

```javascript
// Apify standby readiness probe at root path (assumes an Express-style `app` instance)
app.get('/', (req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    if (req.headers['x-apify-container-server-readiness-probe']) {
        res.end('Readiness probe OK\n');
    } else {
        res.end('Actor is ready\n');
    }
});
```

Key points:

- Detect the `x-apify-container-server-readiness-probe` header in incoming requests
- Respond with HTTP 200 status code for both readiness probe and normal requests
- This enables proper Actor lifecycle management in standby mode
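The example above relies on an Express-style `app`. For reference, a self-contained sketch of the same probe handling using only Node's built-in `http` module - the `ACTOR_STANDBY_PORT` environment variable and the 3000 fallback are assumptions here, so verify the port handling against the Apify Standby docs:

```javascript
import http from 'node:http';

// Standby server sketch using only Node's http module (no framework assumed).
// Assumption: the platform supplies the port via ACTOR_STANDBY_PORT; 3000 is a local fallback.
const port = Number(process.env.ACTOR_STANDBY_PORT ?? 3000);

const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    if (req.headers['x-apify-container-server-readiness-probe']) {
        res.end('Readiness probe OK\n');
    } else {
        res.end('Actor is ready\n');
    }
});

server.listen(port);
```

The handler logic is identical to the Express version; only the server wiring differs.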
## Commands

```bash
# Local development
apify run      # Run Actor locally

# Authentication & deployment
apify login    # Authenticate account
apify push     # Deploy to Apify platform

# Help
apify help     # List all commands
```
## Safety and Permissions

Allowed without prompt:

- read files with `Actor.getValue()`
- push data with `Actor.pushData()`
- set values with `Actor.setValue()`
- enqueue requests to RequestQueue
- run locally with `apify run`

Ask first:

- npm/pip package installations
- `apify push` (deployment to cloud)
- proxy configuration changes (requires paid plan)
- Dockerfile changes affecting builds
- deleting datasets or key-value stores
## Project Structure

```
.actor/
├── actor.json            # Actor config: name, version, env vars, runtime settings
├── input_schema.json     # Input validation & Console form definition
└── output_schema.json    # Specifies where an Actor stores its output
src/
└── main.js               # Actor entry point and orchestrator
storage/                  # Local-only storage for development (NOT synced to Cloud)
├── datasets/             # Output items (JSON objects)
├── key_value_stores/     # Files, config, INPUT
└── request_queues/       # Pending crawl requests
Dockerfile                # Container image definition
AGENTS.md                 # AI agent instructions (this file)
```
## Local vs Cloud Storage

When running locally with `apify run`, the Apify SDK emulates Cloud storage APIs using the local `storage/` directory. This local storage behaves differently from Cloud storage:

- **Local storage is NOT persistent** - The `storage/` directory is meant for local development and testing only. Data stored there (datasets, key-value stores, request queues) exists only on your local disk.
- **Local storage is NOT automatically pushed to Apify Console** - Running `apify run` does not upload any storage data to the Apify platform. The data stays local.
- **Each local run may overwrite previous data** - The local `storage/` directory is reused between runs, but this is local-only behavior, not Cloud persistence.
- **Cloud storage only works when running on the Apify platform** - After deploying with `apify push` and running the Actor in the Cloud, storage calls (`Actor.pushData()`, `Actor.setValue()`, etc.) interact with real Apify Cloud storage, which is then visible in the Apify Console.
- **To verify Actor output, deploy and run in the Cloud** - Do not rely on local `storage/` contents as proof that data will appear in the Apify Console. Always test by deploying (`apify push`) and running the Actor on the platform.
## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
    "title": "<INPUT-SCHEMA-TITLE>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        /* define input fields here */
    },
    "required": []
}
```

### Example

```json
{
    "title": "E-commerce Product Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from (category pages or product pages)",
            "editor": "requestListSources",
            "default": [{ "url": "https://example.com/category" }],
            "prefill": [{ "url": "https://example.com/category" }]
        },
        "followVariants": {
            "title": "Follow Product Variants",
            "type": "boolean",
            "description": "Whether to scrape product variants (different colors, sizes)",
            "default": true
        },
        "maxRequestsPerCrawl": {
            "title": "Max Requests per Crawl",
            "type": "integer",
            "description": "Maximum number of pages to scrape (0 = unlimited)",
            "default": 1000,
            "minimum": 0
        },
        "proxyConfiguration": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for anti-bot protection",
            "editor": "proxy",
            "default": { "useApifyProxy": false }
        },
        "locale": {
            "title": "Locale",
            "type": "string",
            "description": "Language/country code for localized content",
            "default": "cs",
            "enum": ["cs", "en", "de", "sk"],
            "enumTitles": ["Czech", "English", "German", "Slovak"]
        }
    },
    "required": ["startUrls"]
}
```
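Even with a schema in place, runs started via the API can receive malformed input, so validate early in code as well. A minimal sketch for the example schema above - the function name, checks, and error messages are illustrative, not part of the Apify SDK:

```javascript
// Hypothetical validator for the example input schema above.
// In a real Actor you would call this on the object returned by `Actor.getInput()`.
function validateInput(input) {
    if (input === null || typeof input !== 'object') {
        throw new Error('Input must be a JSON object.');
    }
    const { startUrls, maxRequestsPerCrawl = 1000 } = input;
    if (!Array.isArray(startUrls) || startUrls.length === 0) {
        throw new Error('"startUrls" must be a non-empty array.');
    }
    for (const { url } of startUrls) {
        // The URL constructor throws a TypeError on malformed URLs, failing fast.
        new URL(url);
    }
    if (!Number.isInteger(maxRequestsPerCrawl) || maxRequestsPerCrawl < 0) {
        throw new Error('"maxRequestsPerCrawl" must be a non-negative integer.');
    }
    // Return input with defaults applied, ready for the crawler.
    return { ...input, maxRequestsPerCrawl };
}
```

Failing here, before any requests are enqueued, gives the user a clear error instead of a half-finished run.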
## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "<OUTPUT-SCHEMA-TITLE>",
    "properties": {
        /* define your outputs here */
    }
}
```

### Example

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema of the files scraper",
    "properties": {
        "files": {
            "type": "string",
            "title": "Files",
            "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
        },
        "dataset": {
            "type": "string",
            "title": "Dataset",
            "template": "{{links.apiDefaultDatasetUrl}}/items"
        }
    }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a webserver running inside the run in format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as returned from the `GET Run` API endpoint
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
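To make the substitution concrete: a template like `{{links.apiDefaultDatasetUrl}}/items` is resolved by the platform against the run's data. The platform performs this rendering itself; the helper below is only an illustrative sketch of the lookup, and the dataset ID is a made-up example:

```javascript
// Illustrative renderer: replaces {{dotted.path}} placeholders with values
// looked up in a context object, mimicking how the platform fills templates.
function renderTemplate(template, context) {
    return template.replace(/\{\{([\w.]+)\}\}/g, (match, path) =>
        path.split('.').reduce((obj, key) => obj?.[key], context) ?? '',
    );
}

// Hypothetical run context; 'abc123' is a placeholder dataset ID.
const context = {
    links: {
        apiDefaultDatasetUrl: 'https://api.apify.com/v2/datasets/abc123',
    },
};
```

So `renderTemplate('{{links.apiDefaultDatasetUrl}}/items', context)` yields the full items-endpoint URL of the run's default dataset.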
## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

### Example

Consider an example Actor that calls `Actor.pushData()` to store data into a dataset:

```javascript
import { Actor } from 'apify';

// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});

// Exit successfully
await Actor.exit();
```

To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "book-library-scraper",
    "title": "Book Library Scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": [
                    "pictureUrl",
                    "linkUrl",
                    "textField",
                    "booleanField",
                    "arrayField",
                    "objectField",
                    "dateField",
                    "numericField"
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    "pictureUrl": { "label": "Image", "format": "image" },
                    "linkUrl": { "label": "Link", "format": "link" },
                    "textField": { "label": "Text", "format": "text" },
                    "booleanField": { "label": "Boolean", "format": "boolean" },
                    "arrayField": { "label": "Array", "format": "array" },
                    "objectField": { "label": "Object", "format": "object" },
                    "dateField": { "label": "Date", "format": "date" },
                    "numericField": { "label": "Number", "format": "number" }
                }
            }
        }
    }
}
```
### Structure

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "<VIEW_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "transformation": {
                "fields": ["string (required)"],
                "unwind": ["string (optional)"],
                "flatten": ["string (optional)"],
                "omit": ["string (optional)"],
                "limit": "integer (optional)",
                "desc": "boolean (optional)"
            },
            "display": {
                "component": "table (required)",
                "properties": {
                    "<FIELD_NAME>": {
                        "label": "string (optional)",
                        "format": "text|number|date|link|boolean|image|array|object (optional)"
                    }
                }
            }
        }
    }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of the dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JsonSchema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI views description

**DatasetView Properties:**

- `title` (string, required) - Visible in UI Output tab and API
- `description` (string, optional) - Only available in API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into parent object
- `flatten` (string[], optional) - Transforms nested object into flat structure
- `omit` (string[], optional) - Removes specified fields from output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
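As a concrete (hypothetical) illustration of the transformation options above, a view that deconstructs a nested `items` array, removes an internal field, and caps the output at the 50 newest records could look like this - all field names here are invented for the example:

```json
{
    "views": {
        "latestItems": {
            "title": "Latest items",
            "transformation": {
                "fields": ["title", "url", "price"],
                "unwind": ["items"],
                "omit": ["internalId"],
                "limit": 50,
                "desc": true
            },
            "display": {
                "component": "table",
                "properties": {
                    "title": { "label": "Title", "format": "text" },
                    "url": { "label": "URL", "format": "link" },
                    "price": { "label": "Price", "format": "number" }
                }
            }
        }
    }
}
```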
## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Example

Consider an example Actor that calls `Actor.setValue()` to save records into the key-value store:

```javascript
import { Actor } from 'apify';

// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code (`imageID` and `imageBuffer` are assumed to be defined elsewhere)
 */
await Actor.setValue('document-1', 'my text data', { contentType: 'text/plain' });

await Actor.setValue(`image-${imageID}`, imageBuffer, { contentType: 'image/jpeg' });

// Exit successfully
await Actor.exit();
```

To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "data-collector",
    "title": "Data Collector",
    "version": "1.0.0",
    "storages": {
        "keyValueStore": "./key_value_store_schema.json"
    }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "Key-Value Store Schema",
    "collections": {
        "documents": {
            "title": "Documents",
            "description": "Text documents stored by the Actor",
            "keyPrefix": "document-"
        },
        "images": {
            "title": "Images",
            "description": "Images stored by the Actor",
            "keyPrefix": "image-",
            "contentTypes": ["image/jpeg"]
        }
    }
}
```
### Structure

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "string (required)",
    "description": "string (optional)",
    "collections": {
        "<COLLECTION_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "key": "string (conditional - use key OR keyPrefix)",
            "keyPrefix": "string (conditional - use key OR keyPrefix)",
            "contentTypes": ["string (optional)"],
            "jsonSchema": "object (optional)"
        }
    }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of the key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and each value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional) - Single specific key for this collection
- `keyPrefix` (string, conditional) - Prefix for keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation

Either `key` or `keyPrefix` must be specified for each collection, but not both.
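This exactly-one-of constraint can be checked mechanically before deploying. A hypothetical sanity check (not part of the Apify tooling) for a parsed schema object:

```javascript
// Illustrative check that each collection defines exactly one of `key` / `keyPrefix`.
// Throws with the names of any offending collections.
function checkCollections(schema) {
    const invalid = Object.entries(schema.collections ?? {})
        // Both present or both absent means the XOR constraint is violated.
        .filter(([, collection]) => ('key' in collection) === ('keyPrefix' in collection))
        .map(([name]) => name);
    if (invalid.length > 0) {
        throw new Error(`Collections must use exactly one of "key"/"keyPrefix": ${invalid.join(', ')}`);
    }
}
```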
## Apify MCP Tools

If the MCP server is configured, use these tools for documentation:

- `search-apify-docs` - Search documentation
- `fetch-apify-docs` - Get full doc pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification
3Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.4
5## What are Apify Actors?6
7- Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.8- Actors are programs packaged as Docker images that run in isolated containers9
10## Core Concepts11
12- Accept well-defined JSON input13- Perform isolated tasks (web scraping, automation, data processing)14- Produce structured JSON output to datasets and/or store data in key-value stores15- Can run from seconds to hours or even indefinitely16- Persist state and can be restarted17
18## Do19
20- accept well-defined JSON input and produce structured JSON output21- use Apify SDK (`apify`) for code running ON Apify platform22- validate input early with proper error handling and fail gracefully23- use CheerioCrawler for static HTML content (10x faster than browsers)24- use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content25- use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls26- implement retry strategies with exponential backoff for failed requests27- use proper concurrency settings (HTTP: 10-50, Browser: 1-5)28- set sensible defaults in `.actor/input_schema.json` for all optional fields29- set up output schema in `.actor/output_schema.json`30- clean and validate data before pushing to dataset31- use semantic CSS selectors and fallback strategies for missing elements32- respect robots.txt, ToS, and implement rate limiting with delays33- check which tools (cheerio/playwright/crawlee) are installed before applying guidance34- use `apify/log` package for logging (censors sensitive data)35- implement readiness probe handler for standby Actors36- handle the `aborting` event to gracefully shut down when Actor is stopped37
38## Don't39
40- do not rely on `Dataset.getInfo()` for final counts on Cloud platform41- do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)42- do not hard code values that should be in input schema or environment variables43- do not skip input validation or error handling44- do not overload servers - use appropriate concurrency and delays45- do not scrape prohibited content or ignore Terms of Service46- do not store personal/sensitive data unless explicitly permitted47- do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)48- do not use `additionalHttpHeaders` - use `preNavigationHooks` instead49- do not assume that local storage is persistent or automatically synced to Apify Console - when running locally with `apify run`, the `storage/` directory is local-only and is NOT pushed to the Cloud50- do not disable standby mode (`usesStandbyMode: false`) without explicit permission51
52## Logging53
54- **ALWAYS use the `apify/log` package for logging** - This package contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs55
56### Available Log Levels in `apify/log`57
58The Apify log package provides the following methods for logging:59
60- `log.debug()` - Debug level logs (detailed diagnostic information)61- `log.info()` - Info level logs (general informational messages)62- `log.warning()` - Warning level logs (warning messages for potentially problematic situations)63- `log.warningOnce()` - Warning level logs (same warning message logged only once)64- `log.error()` - Error level logs (error messages for failures)65- `log.exception()` - Exception level logs (for exceptions with stack traces)66- `log.perf()` - Performance level logs (performance metrics and timing information)67- `log.deprecated()` - Deprecation level logs (warnings about deprecated code)68- `log.softFail()` - Soft failure logs (non-critical failures that don't stop execution, e.g., input validation errors, skipped items)69- `log.internal()` - Internal level logs (internal/system messages)70
71**Best practices:**72
73- Use `log.debug()` for detailed operation-level diagnostics (inside functions)74- Use `log.info()` for general informational messages (API requests, successful operations)75- Use `log.warning()` for potentially problematic situations (validation failures, unexpected states)76- Use `log.error()` for actual errors and failures77- Use `log.exception()` for caught exceptions with stack traces78
79## Graceful Abort Handling80
81Handle the `aborting` event to terminate the Actor quickly when stopped by user or platform, minimizing costs especially for PPU/PPE+U billing.82
83```javascript84import { setTimeout } from 'node:timers/promises';85
86Actor.on('aborting', async () => {87 // Persist any state, do any cleanup you need, and terminate the Actor using `await Actor.exit()` explicitly as soon as possible88 // This will help ensure that the Actor is doing best effort to honor any potential limits on costs of a single run set by the user89 // Wait 1 second to allow Crawlee/SDK useState and other state persistence operations to complete90 // This is a temporary workaround until SDK implements proper state persistence in the aborting event91 await setTimeout(1000);92 await Actor.exit();93});94```95
96## Standby Mode97
98- **NEVER disable standby mode (`usesStandbyMode: false`) in `.actor/actor.json` without explicit permission** - Actor Standby mode solves this problem by letting you have the Actor ready in the background, waiting for the incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server instead of running the logic once to process everything in batch. Always keep `usesStandbyMode: true` unless there is a specific documented reason to disable it99- **ALWAYS implement readiness probe handler for standby Actors** - Handle the `x-apify-container-server-readiness-probe` header at GET / endpoint to ensure proper Actor lifecycle management100
101You can recognize a standby Actor by checking the `usesStandbyMode` property in `.actor/actor.json`. Only implement the readiness probe if this property is set to `true`.102
103### Readiness Probe Implementation Example104
105```javascript106// Apify standby readiness probe at root path107app.get('/', (req, res) => {108 res.writeHead(200, { 'Content-Type': 'text/plain' });109 if (req.headers['x-apify-container-server-readiness-probe']) {110 res.end('Readiness probe OK\n');111 } else {112 res.end('Actor is ready\n');113 }114});115```116
117Key points:118
119- Detect the `x-apify-container-server-readiness-probe` header in incoming requests120- Respond with HTTP 200 status code for both readiness probe and normal requests121- This enables proper Actor lifecycle management in standby mode122
123## Commands124
125```bash126# Local development127apify run # Run Actor locally128
129# Authentication & deployment130apify login # Authenticate account131apify push # Deploy to Apify platform132
133# Help134apify help # List all commands135```136
137## Safety and Permissions138
139Allowed without prompt:140
141- read files with `Actor.getValue()`142- push data with `Actor.pushData()`143- set values with `Actor.setValue()`144- enqueue requests to RequestQueue145- run locally with `apify run`146
147Ask first:148
149- npm/pip package installations150- apify push (deployment to cloud)151- proxy configuration changes (requires paid plan)152- Dockerfile changes affecting builds153- deleting datasets or key-value stores154
155## Project Structure156
157.actor/158├── actor.json # Actor config: name, version, env vars, runtime settings159├── input_schema.json # Input validation & Console form definition160└── output_schema.json # Specifies where an Actor stores its output161src/162└── main.js # Actor entry point and orchestrator163storage/ # Local-only storage for development (NOT synced to Cloud)164├── datasets/ # Output items (JSON objects)165├── key_value_stores/ # Files, config, INPUT166└── request_queues/ # Pending crawl requests167Dockerfile # Container image definition168AGENTS.md # AI agent instructions (this file)169
## Local vs Cloud Storage

When running locally with `apify run`, the Apify SDK emulates Cloud storage APIs using the local `storage/` directory. This local storage behaves differently from Cloud storage:

- **Local storage is NOT persistent** - The `storage/` directory is meant for local development and testing only. Data stored there (datasets, key-value stores, request queues) exists only on your local disk.
- **Local storage is NOT automatically pushed to Apify Console** - Running `apify run` does not upload any storage data to the Apify platform. The data stays local.
- **Each local run may overwrite previous data** - The local `storage/` directory is reused between runs, but this is local-only behavior, not Cloud persistence.
- **Cloud storage only works when running on the Apify platform** - After deploying with `apify push` and running the Actor in the Cloud, storage calls (`Actor.pushData()`, `Actor.setValue()`, etc.) interact with real Apify Cloud storage, which is then visible in the Apify Console.
- **To verify Actor output, deploy and run in the Cloud** - Do not rely on local `storage/` contents as proof that data will appear in the Apify Console. Always test by deploying (`apify push`) and running the Actor on the platform.
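As a rough sketch of what this emulation looks like on disk, the snippet below recreates the assumed local layout (`storage/key_value_stores/default/INPUT.json`) in a temporary directory and reads the input back the way `Actor.getInput()` effectively does during `apify run`. The exact paths are an assumption here; verify them against your SDK version:

```javascript
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';

// Recreate the assumed local storage layout in a temp directory
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'apify-local-'));
const storeDir = path.join(root, 'storage', 'key_value_stores', 'default');
fs.mkdirSync(storeDir, { recursive: true });

// During `apify run`, the Actor input is kept as the INPUT record
// of the default key-value store on the local disk
const input = { startUrls: [{ url: 'https://crawlee.dev' }], maxRequestsPerCrawl: 100 };
fs.writeFileSync(path.join(storeDir, 'INPUT.json'), JSON.stringify(input, null, 2));

// Reading it back is roughly what the SDK does on Actor.getInput()
const loaded = JSON.parse(fs.readFileSync(path.join(storeDir, 'INPUT.json'), 'utf8'));
```

Nothing under this `storage/` tree leaves your machine; only runs on the platform write to Cloud storage.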
## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
    "title": "<INPUT-SCHEMA-TITLE>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        /* define input fields here */
    },
    "required": []
}
```

### Example

```json
{
    "title": "E-commerce Product Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from (category pages or product pages)",
            "editor": "requestListSources",
            "default": [{ "url": "https://example.com/category" }],
            "prefill": [{ "url": "https://example.com/category" }]
        },
        "followVariants": {
            "title": "Follow Product Variants",
            "type": "boolean",
            "description": "Whether to scrape product variants (different colors, sizes)",
            "default": true
        },
        "maxRequestsPerCrawl": {
            "title": "Max Requests per Crawl",
            "type": "integer",
            "description": "Maximum number of pages to scrape (0 = unlimited)",
            "default": 1000,
            "minimum": 0
        },
        "proxyConfiguration": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for anti-bot protection",
            "editor": "proxy",
            "default": { "useApifyProxy": false }
        },
        "locale": {
            "title": "Locale",
            "type": "string",
            "description": "Language/country code for localized content",
            "default": "cs",
            "enum": ["cs", "en", "de", "sk"],
            "enumTitles": ["Czech", "English", "German", "Slovak"]
        }
    },
    "required": ["startUrls"]
}
```
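To illustrate how `default` values behave, here is a hypothetical `applyDefaults` helper that fills in missing fields the way the platform does when validating input. This is a sketch only; the real validation runs server-side:

```javascript
// Fill in missing input fields from the schema's `default` values.
// User-supplied values always win over defaults.
function applyDefaults(schema, input) {
    const out = { ...input };
    for (const [name, prop] of Object.entries(schema.properties)) {
        if (out[name] === undefined && prop.default !== undefined) {
            out[name] = prop.default;
        }
    }
    return out;
}

// A trimmed-down version of the example schema above
const schema = {
    properties: {
        startUrls: { type: 'array' },
        followVariants: { type: 'boolean', default: true },
        maxRequestsPerCrawl: { type: 'integer', default: 1000 },
    },
};

const input = applyDefaults(schema, {
    startUrls: [{ url: 'https://example.com/category' }],
});
```

Note the difference from `prefill`: a prefill only pre-populates the Console form and is never injected into the input at run time.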
## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "<OUTPUT-SCHEMA-TITLE>",
    "properties": {
        /* define your outputs here */
    }
}
```

### Example

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema of the files scraper",
    "properties": {
        "files": {
            "type": "string",
            "title": "Files",
            "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
        },
        "dataset": {
            "type": "string",
            "title": "Dataset",
            "template": "{{links.apiDefaultDatasetUrl}}/items"
        }
    }
}
```
### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a webserver running inside the run in format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as the `GET Run` API endpoint returns
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
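A minimal sketch of how such a template could resolve for a concrete run. The `renderTemplate` helper and the hard-coded `links` values are illustrative, not an Apify API:

```javascript
// Resolve {{dotted.path}} placeholders against a variables object
function renderTemplate(template, vars) {
    return template.replace(/\{\{([\w.]+)\}\}/g, (_, keyPath) =>
        keyPath.split('.').reduce((obj, key) => obj?.[key], vars));
}

// Hand-built stand-in for the variables the platform provides
const vars = {
    links: { apiDefaultDatasetUrl: 'https://api.apify.com/v2/datasets/abc123' },
};

const url = renderTemplate('{{links.apiDefaultDatasetUrl}}/items', vars);
```

With a real run, the same template would point at that run's default dataset items endpoint.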
## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

### Example

Consider an example Actor that calls `Actor.pushData()` to store data into the dataset:

```javascript
import { Actor } from 'apify';

// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.pushData({
    numericField: 10,
    pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
    linkUrl: 'https://google.com',
    textField: 'Google',
    booleanField: true,
    dateField: new Date(),
    arrayField: ['#hello', '#world'],
    objectField: {},
});

// Exit successfully
await Actor.exit();
```
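Note that dataset items are stored as JSON, so values such as `Date` objects are serialized on the way in. A quick sketch of that effect:

```javascript
// JSON round-trip approximates what storing a dataset item does to types:
// a Date becomes an ISO-8601 string, plain JSON values pass through unchanged.
const item = {
    numericField: 10,
    dateField: new Date('2024-01-15T12:00:00.000Z'),
    arrayField: ['#hello', '#world'],
};

const stored = JSON.parse(JSON.stringify(item));
```

This is why the view below can still render `dateField` with `"format": "date"` even though the stored value is a string.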
To set up the Actor's output tab UI, reference a dataset schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "book-library-scraper",
    "title": "Book Library Scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```

Then create the dataset schema in `.actor/dataset_schema.json`:

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": [
                    "pictureUrl",
                    "linkUrl",
                    "textField",
                    "booleanField",
                    "arrayField",
                    "objectField",
                    "dateField",
                    "numericField"
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    "pictureUrl": {
                        "label": "Image",
                        "format": "image"
                    },
                    "linkUrl": {
                        "label": "Link",
                        "format": "link"
                    },
                    "textField": {
                        "label": "Text",
                        "format": "text"
                    },
                    "booleanField": {
                        "label": "Boolean",
                        "format": "boolean"
                    },
                    "arrayField": {
                        "label": "Array",
                        "format": "array"
                    },
                    "objectField": {
                        "label": "Object",
                        "format": "object"
                    },
                    "dateField": {
                        "label": "Date",
                        "format": "date"
                    },
                    "numericField": {
                        "label": "Number",
                        "format": "number"
                    }
                }
            }
        }
    }
}
```
### Structure

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "<VIEW_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "transformation": {
                "fields": ["string (required)"],
                "unwind": ["string (optional)"],
                "flatten": ["string (optional)"],
                "omit": ["string (optional)"],
                "limit": "integer (optional)",
                "desc": "boolean (optional)"
            },
            "display": {
                "component": "table (required)",
                "properties": {
                    "<FIELD_NAME>": {
                        "label": "string (optional)",
                        "format": "text|number|date|link|boolean|image|array|object (optional)"
                    }
                }
            }
        }
    }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of the dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI views description

**DatasetView Properties:**

- `title` (string, required) - Visible in the UI Output tab and in the API
- `description` (string, optional) - Only available in the API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from the Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in the output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into the parent object
- `flatten` (string[], optional) - Transforms a nested object into a flat structure
- `omit` (string[], optional) - Removes specified fields from the output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
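To make the transformation semantics concrete, here is a hypothetical `applyTransformation` helper approximating field selection, `omit`, and `limit` in plain JavaScript. The real transformation is applied by the Dataset API; this only illustrates the intent:

```javascript
// Approximate a view transformation: pick `fields` in order,
// drop anything in `omit`, then cap the result at `limit` items.
function applyTransformation(items, { fields, omit = [], limit }) {
    let out = items.map((item) => {
        const picked = {};
        for (const f of fields) {
            if (f in item && !omit.includes(f)) picked[f] = item[f];
        }
        return picked;
    });
    if (limit !== undefined) out = out.slice(0, limit);
    return out;
}

const items = [
    { textField: 'Google', numericField: 10, internalField: 'x' },
    { textField: 'Bing', numericField: 20, internalField: 'y' },
];

const view = applyTransformation(items, {
    fields: ['textField', 'numericField'],
    limit: 1,
});
```

Fields not listed in `fields` (like `internalField` here) never reach the Output tab, which is a simple way to keep debug data out of the presented view.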
## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Example

Consider an example Actor that calls `Actor.setValue()` to save records into the key-value store:

```javascript
import { Actor } from 'apify';

// Initialize the JavaScript SDK
await Actor.init();

/**
 * Actor code
 */
await Actor.setValue('document-1', 'my text data', { contentType: 'text/plain' });

// `imageID` and `imageBuffer` come from the Actor's own logic
await Actor.setValue(`image-${imageID}`, imageBuffer, { contentType: 'image/jpeg' });

// Exit successfully
await Actor.exit();
```
To configure the key-value store schema, reference a schema file in `.actor/actor.json`:

```json
{
    "actorSpecification": 1,
    "name": "data-collector",
    "title": "Data Collector",
    "version": "1.0.0",
    "storages": {
        "keyValueStore": "./key_value_store_schema.json"
    }
}
```

Then create the key-value store schema in `.actor/key_value_store_schema.json`:

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "Key-Value Store Schema",
    "collections": {
        "documents": {
            "title": "Documents",
            "description": "Text documents stored by the Actor",
            "keyPrefix": "document-"
        },
        "images": {
            "title": "Images",
            "description": "Images stored by the Actor",
            "keyPrefix": "image-",
            "contentTypes": ["image/jpeg"]
        }
    }
}
```
### Structure

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "string (required)",
    "description": "string (optional)",
    "collections": {
        "<COLLECTION_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "key": "string (conditional - use key OR keyPrefix)",
            "keyPrefix": "string (conditional - use key OR keyPrefix)",
            "contentTypes": ["string (optional)"],
            "jsonSchema": "object (optional)"
        }
    }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of the key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and each value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional) - Single specific key for this collection
- `keyPrefix` (string, conditional) - Prefix for keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 format for `application/json` content type validation

Either `key` or `keyPrefix` must be specified for each collection, but not both.
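The grouping rule can be sketched as a small helper. `collectionForKey` below is illustrative only; the Console performs this grouping itself:

```javascript
// Find which collection a store key belongs to, honoring the rule that
// a collection declares either an exact `key` or a `keyPrefix`.
function collectionForKey(collections, key) {
    for (const [name, col] of Object.entries(collections)) {
        if (col.key !== undefined && key === col.key) return name;
        if (col.keyPrefix !== undefined && key.startsWith(col.keyPrefix)) return name;
    }
    return null; // key matches no collection
}

// Mirrors the example schema above
const collections = {
    documents: { keyPrefix: 'document-' },
    images: { keyPrefix: 'image-' },
};
```

Keys such as `document-1` and `image-42` land in their respective collections; anything else (e.g. `OUTPUT`) stays ungrouped.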
## Apify MCP Tools

If the Apify MCP server is configured, use these tools for documentation:

- `search-apify-docs` - Search the documentation
- `fetch-apify-docs` - Fetch full documentation pages

Otherwise, reference: `@https://mcp.apify.com/`
## Resources
- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification

## Dockerfile

```dockerfile
# Specify the base Docker image. You can read more about
# the available images at https://docs.apify.com/sdk/js/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node:22

# Check preinstalled packages
RUN npm ls @crawlee/core apify puppeteer playwright

# Copy just package.json and package-lock.json
# to speed up the build using Docker layer cache.
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version \
    && rm -r ~/.npm

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY . ./

# Run the image.
CMD ["node", "src/main.js"]
```

## eslint.config.js

```javascript
import prettier from 'eslint-config-prettier';

import apify from '@apify/eslint-config/js.js';

// eslint-disable-next-line import/no-default-export
export default [{ ignores: ['**/dist'] }, ...apify, prettier];
```

## package.json

```json
{
    "name": "crawlee-cheerio-javascript",
    "version": "0.0.1",
    "type": "module",
    "description": "This is a boilerplate of an Apify Actor.",
    "engines": {
        "node": ">=18.0.0"
    },
    "dependencies": {
        "apify": "^3.5.2",
        "@crawlee/cheerio": "^3.15.3"
    },
    "devDependencies": {
        "@apify/eslint-config": "^1.0.0",
        "eslint": "^9.29.0",
        "eslint-config-prettier": "^10.1.5",
        "prettier": "^3.5.3"
    },
    "scripts": {
        "start": "node src/main.js",
        "format": "prettier --write .",
        "format:check": "prettier --check .",
        "lint": "eslint",
        "lint:fix": "eslint --fix",
        "test": "echo \"Error: oops, the Actor has no tests yet, sad!\" && exit 1"
    },
    "author": "It's not you it's me",
    "license": "ISC"
}
```

## src/main.js

```javascript
import { setTimeout } from 'node:timers/promises';

// Crawlee - web scraping and browser automation library (Read more at https://crawlee.dev)
import { CheerioCrawler, Dataset } from '@crawlee/cheerio';
// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/)
import { Actor } from 'apify';

// The init() call configures the Actor to correctly work with the Apify-provided environment -
// mainly the storage infrastructure. Every Actor must perform an init() call.
await Actor.init();

// Handle graceful abort - the Actor is being stopped by the user or the platform
Actor.on('aborting', async () => {
    // Persist any state, do any cleanup you need, and terminate the Actor using `await Actor.exit()`
    // explicitly as soon as possible. This helps ensure the Actor makes a best effort to honor
    // any limits on the cost of a single run set by the user.
    // Wait 1 second to allow Crawlee/SDK useState and other state persistence operations to complete.
    // This is a temporary workaround until the SDK implements proper state persistence in the aborting event.
    await setTimeout(1000);
    await Actor.exit();
});

// Structure of input is defined in input_schema.json
const { startUrls = ['https://apify.com'], maxRequestsPerCrawl = 100 } = (await Actor.getInput()) ?? {};

// Proxy configuration to rotate IP addresses and prevent blocking (https://docs.apify.com/platform/proxy)
// The `checkAccess` flag ensures the proxy credentials are valid, but the check can take a few hundred milliseconds.
// Disable it for short runs if you are sure your proxy configuration is correct.
const proxyConfiguration = await Actor.createProxyConfiguration({ checkAccess: true });

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    async requestHandler({ enqueueLinks, request, $, log }) {
        log.info('enqueueing new URLs');
        await enqueueLinks();

        // Extract title from the page.
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });

        // Save url and title to Dataset - a table-like storage.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit().
await Actor.exit();
```