Page Analyzer

apify/page-analyzer

Analyzes pages and searches for provided query strings

This Apify actor analyzes a web page at a given URL. It extracts HTML and JavaScript variables from the main response, and HTML/JSON data from XHR requests. It then analyzes the loaded data:

  1. It analyzes the initial HTML (the HTML loaded directly from the response):
  • Looks for Schema.org data and, if it finds any, saves it to the output as the schemaOrgData variable.
  • Looks for JSON-LD script tags and parses the JSON it finds; if it finds any, it outputs it as the jsonLDData variable.
  • Looks for meta and title tags and outputs their content as the metaData variable.
  2. It loads all XHR requests, discards requests that do not contain HTML or JSON, and parses the HTML and JSON into objects.
  3. When all XHR requests have finished, it loads the HTML from the rendered page and repeats step 1, because JavaScript might have changed the HTML of the website.
  4. It loads all window variables, discards common global variables (console, innerHeight, navigator, ...), cleans the output (removes all functions and circular paths), and outputs it as the allWindowProperties variable.
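
The cleaning of window variables described in the last step can be illustrated with a minimal sketch. This is not the actor's actual code: the helper names cleanProperties and collectWindowProperties, the short COMMON_GLOBALS list, and the stand-in fakeWindow object are all assumptions made for illustration.

```javascript
// Hypothetical helper, not the actor's actual code: returns a copy of `value`
// with functions and circular references left out.
function cleanProperties(value, ancestors = new Set()) {
    if (typeof value === 'function') return undefined;
    if (value === null || typeof value !== 'object') return value;
    if (ancestors.has(value)) return undefined; // circular path, drop it

    ancestors.add(value);
    let result;
    if (Array.isArray(value)) {
        // Dropped items (functions, circular references) are removed from arrays.
        result = value
            .map((item) => cleanProperties(item, ancestors))
            .filter((item) => item !== undefined);
    } else {
        result = {};
        for (const [key, child] of Object.entries(value)) {
            const cleaned = cleanProperties(child, ancestors);
            if (cleaned !== undefined) result[key] = cleaned;
        }
    }
    ancestors.delete(value);
    return result;
}

// Assumed subset of common globals to discard; a real list would be longer.
const COMMON_GLOBALS = new Set(['console', 'innerHeight', 'navigator']);

function collectWindowProperties(windowLike) {
    const output = {};
    for (const key of Object.keys(windowLike)) {
        if (COMMON_GLOBALS.has(key)) continue;
        const cleaned = cleanProperties(windowLike[key]);
        if (cleaned !== undefined) output[key] = cleaned;
    }
    return output;
}

// Usage with a stand-in object instead of a real browser window.
const fakeWindow = {
    innerHeight: 800,
    appState: { user: 'anna', onClick() {} },
};
fakeWindow.appState.self = fakeWindow.appState; // circular path

console.log(JSON.stringify(collectWindowProperties(fakeWindow)));
// prints {"appState":{"user":"anna"}}
```

Tracking only the current chain of ancestors (rather than every object ever seen) means an object that legitimately appears in two places is kept in both; only true cycles are cut.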

When the analysis is finished, the actor checks the INPUT parameters for strings to search for. If there are any, it attempts to find them in all the content it collected.
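
As a rough illustration of such a search, the sketch below walks a parsed object and records the path from the root to every string that contains at least one search string. The function name findStrings, the path syntax, and the case-insensitive matching are assumptions, not the actor's documented behavior. It expects data without circular references.

```javascript
// Hypothetical helper: returns the path and value of every string in `data`
// that contains at least one of the search strings.
// Matching is case-insensitive here, which is an assumption.
function findStrings(data, searchFor) {
    const needles = searchFor
        .filter((s) => typeof s === 'string' && s.length > 0)
        .map((s) => s.toLowerCase());
    const matches = [];

    const visit = (value, currentPath) => {
        if (typeof value === 'string') {
            const haystack = value.toLowerCase();
            if (needles.some((needle) => haystack.includes(needle))) {
                matches.push({ path: currentPath, value });
            }
        } else if (Array.isArray(value)) {
            value.forEach((item, index) => visit(item, `${currentPath}[${index}]`));
        } else if (value !== null && typeof value === 'object') {
            for (const [key, child] of Object.entries(value)) {
                visit(child, currentPath ? `${currentPath}.${key}` : key);
            }
        }
    };

    // An empty search list skips the search entirely.
    if (needles.length > 0) visit(data, '');
    return matches;
}

const found = findStrings(
    { company: { pages: ['Home', 'About us'] }, title: 'Example' },
    ['About us'],
);
console.log(found);
// prints [ { path: 'company.pages[1]', value: 'About us' } ]
```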

The actor ends when all output is parsed and searched. If the connection to the URL fails or any part of the actor crashes, the actor ends with an error in the output and log.

Input to the actor is provided from the INPUT file. If the actor is run through Apify, INPUT comes from the key-value store. To start the actor locally, call

npm run start-local

and provide the input as a file in the kv-store-dev directory.

INPUT

{
    // URL of the website to analyze
    "url": "http://example.com",
    // array of strings to look for on the website; if empty, the search is skipped during analysis
    "searchFor": ["About us"]
}
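
A minimal sketch of how such an input could be checked before use is shown below. The helper normalizeInput and its rules (http/https only, non-string entries dropped) are assumptions rather than the actor's own validation. It also assumes the INPUT is plain JSON, since JSON.parse rejects the explanatory comments shown above.

```javascript
// Hypothetical validation helper; the actor's own checks may differ.
function normalizeInput(input) {
    if (!input || typeof input.url !== 'string') {
        throw new Error('INPUT must contain a "url" string');
    }
    const url = new URL(input.url); // throws a TypeError on a malformed URL
    if (url.protocol !== 'http:' && url.protocol !== 'https:') {
        throw new Error('Only http and https URLs are supported in this sketch');
    }
    // A missing or empty searchFor means the search step is skipped.
    const searchFor = Array.isArray(input.searchFor)
        ? input.searchFor.filter((s) => typeof s === 'string' && s.length > 0)
        : [];
    return { url: url.href, searchFor };
}

const input = normalizeInput(
    JSON.parse('{"url": "http://example.com", "searchFor": ["About us"]}'),
);
console.log(input);
// prints { url: 'http://example.com/', searchFor: [ 'About us' ] }
```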

During the run, the actor saves its output into the OUTPUT file, which is stored in the key-value store if the actor is run through Apify, or in the kv-store-dev folder if the actor is run locally.

OUTPUT

{
  // Initial response headers
  "initialResponse": {
    "url": "https://www.flywire.com/",
    "headers": {...}
  },
  // True if window variables were parsed after XHR requests finished
  "windowPropertiesParsed": true,
  // True if meta tags were parsed from initial response
  "metaDataParsed": true,
  // True if Schema.org was loaded and parsed from initial response
  "schemaOrgDataParsed": true,
  // True if JSON-LD was loaded and parsed from initial response
  "jsonLDDataParsed": true,
  // True if HTML was loaded and parsed from initial response
  "htmlParsed": true,
  // True if HTML was loaded and parsed after XHR requests finished
  "htmlFullyParsed": true,
  // True if XHR requests were all parsed
  "xhrRequestsParsed": true,
  // Filtered window properties by search strings
  "windowProperties": {},
  // Object containing cleaned up window object properties
  "allWindowProperties": {...},
  // Array of properties which contain searched strings (at least one) with path to variable from root
  "windowPropertiesFound": [],
  // Schema.org data filtered by search strings.
  "schemaOrgData": {},
  // Array of schema org properties which contain searched strings (at least one) with path to variable from root
  "schemaOrgDataFound": [],
  // Complete output of found schema.org data
  "allSchemaOrgData": [],
  // Complete output of all found meta tags
  "metaData": {
    "viewport": "width=device-width, initial-scale=1",
    "og:title": "International Payments Solution",
    ...
  },
  // List of meta tags matching the searched strings
  "metaDataFound": [],
  // JSON-LD Data filtered by search strings.
  "jsonLDData": {},
  // Array of JSON-LD data properties which contain searched strings (at least one) with path to variable from root
  "jsonLDDataFound": [],
  // Complete output of found JSON-LD
  "allJsonLDData": [],
  // Array of selectors to HTML elements that contain the searched values
  "htmlFound": [],
  // Array of parsed XHR requests with content type of JSON or HTML
  "xhrRequests": [
    {
      "url": "https://www.flywire.com/destinations",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {...},
      "responseBody": {
        // Valid provides information whether JSON was parsed successfully
        "valid": true/false,
        // Data contains the parsed JSON
        "data": [...],
      }
    },
    {
      "url": "https://www.flywire.com/asdasd",
      "method": "GET",
      "responseStatus": 200,
      "responseHeaders": {...},
      // For HTML requests responseBody contains HTML as string
      "responseBody": "<html>...."
    },
  ],
  // same list as above, but filtered by search strings
  "xhrRequestsFound": [...],
  // contains error if actor failed outside of page function
  "error": null,
  // contains error if actor failed in page.evaluate
  "pageError": null,
  "outputFinished": true,

  // timestamps for debugging
  "analysisStarted": "2018-02-09T12:34:49.938Z",
  "scrappingStarted": "2018-02-09T12:34:50.050Z",
  "pageNavigated": "2018-02-09T12:34:53.495Z",
  "windowPropertiesSearched": "2018-02-09T12:34:53.810Z",
  "metadataSearched": "2018-02-09T12:34:51.624Z",
  "schemaOrgSearched": "2018-02-09T12:34:51.627Z",
  "jsonLDSearched": "2018-02-09T12:34:51.625Z",
  "htmlSearched": "2018-02-09T12:34:53.746Z",
  "xhrRequestsSearched": "2018-02-09T12:34:53.517Z",
  "analysisEnded": "2018-02-09T12:34:53.810Z",
}
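
When consuming the OUTPUT, it can help to condense it into a short summary. The sketch below is one possible way to do that, assuming the OUTPUT has already been loaded as a plain object with the keys shown above. The helper summarizeOutput and the sample values (including the selector in htmlFound) are hypothetical.

```javascript
// Hypothetical helper: reports whether the run failed, which "...Found"
// arrays contain matches, and how long the analysis took.
function summarizeOutput(output) {
    const sectionsWithMatches = Object.keys(output).filter(
        (key) => key.endsWith('Found') && Array.isArray(output[key]) && output[key].length > 0,
    );
    const durationMs = Date.parse(output.analysisEnded) - Date.parse(output.analysisStarted);
    return {
        failed: Boolean(output.error || output.pageError),
        sectionsWithMatches,
        durationMs,
    };
}

// Trimmed-down sample; the htmlFound selector is made up for illustration.
const sampleOutput = {
    error: null,
    pageError: null,
    htmlFound: ['body > h1'],
    metaDataFound: [],
    analysisStarted: '2018-02-09T12:34:49.938Z',
    analysisEnded: '2018-02-09T12:34:53.810Z',
};

console.log(summarizeOutput(sampleOutput));
// prints { failed: false, sectionsWithMatches: [ 'htmlFound' ], durationMs: 3872 }
```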