🦜️🔗 LangChain

An example of how to use LangChain.js with Apify to crawl web data, vectorize it, and prompt an OpenAI model.

Language

javascript

Tools

nodejs

src/main.js

import { Actor } from 'apify';
import { ApifyDatasetLoader } from 'langchain/document_loaders/web/apify_dataset';
import { Document } from 'langchain/document';
import { HNSWLib } from 'langchain/vectorstores/hnswlib';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { RetrievalQAChain } from 'langchain/chains';
import { OpenAI } from 'langchain/llms/openai';
import { rm } from 'node:fs/promises';

// This is an ESM project, so relative imports must include the file extension.
// Read more about this here: https://nodejs.org/docs/latest-v18.x/api/esm.html#mandatory-file-extensions
import { retrieveVectorIndex, cacheVectorIndex } from './vector_index_cache.js';

await Actor.init();

// Complete these two steps before running this template:
// 1. If you are running the template locally, authenticate to the Apify platform by calling `apify login` in your terminal. Without this, you won't be able to run the required Website Content Crawler Actor to gather the data.
// 2. Configure the OPENAI_API_KEY environment variable (https://docs.apify.com/cli/docs/vars#set-up-environment-variables-in-apify-console) with the OpenAI API key you obtain at https://platform.openai.com/account/api-keys.
const { OPENAI_API_KEY, APIFY_TOKEN } = process.env;

// You can configure the Actor input in the Apify UI when running on the Apify platform, or by editing storage/key_value_stores/default/INPUT.json when running locally.
const {
    startUrls = [{ url: 'https://wikipedia.com' }],
    maxCrawlPages = 3,
    forceRecrawl = false, // Enforce a re-crawl of website content and re-creation of the vector index.
    query = 'What is Wikipedia?',
    openAIApiKey = OPENAI_API_KEY, // Falls back to the OPENAI_API_KEY environment variable when the value is not present in the input.
} = await Actor.getInput() || {};

// Local directory where the vector index will be stored.
const VECTOR_INDEX_PATH = './vector_index';

if (!openAIApiKey) throw new Error('Please configure the OPENAI_API_KEY as environment variable or enter it into the input!');
if (!APIFY_TOKEN) throw new Error('Please configure the APIFY_TOKEN environment variable! Call `apify login` in your terminal to authenticate.');

// Now we want to create a vector index from the crawled documents.
// The following object is the input for the https://apify.com/apify/website-content-crawler Actor, which crawls the website to gather the data.
const websiteContentCrawlerInput = { startUrls, maxCrawlPages };

// This variable will contain the vector index that we use to retrieve the most relevant documents for a given query.
let vectorStore;

// First, we check whether the vector index is already cached. If not, we run the Website Content Crawler to get the documents.
// Setting forceRecrawl=true enforces a re-scrape of the website content and re-creation of the vector index.
console.log('Fetching cached vector index from key-value store...');
const reinitializeIndex = forceRecrawl || !(await retrieveVectorIndex(websiteContentCrawlerInput));
if (reinitializeIndex) {
    // Run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.
    console.log(forceRecrawl ? 'Re-crawl was forced.' : 'Vector index was not found.');
    console.log('Running apify/website-content-crawler to gather the data...');
    const loader = await ApifyDatasetLoader.fromActorCall(
        'apify/website-content-crawler',
        websiteContentCrawlerInput,
        {
            // Map each dataset item to a LangChain document; pages without text become empty documents.
            datasetMappingFunction: (item) => new Document({
                pageContent: (item.text || ''),
                metadata: { source: item.url },
            }),
            clientOptions: { token: APIFY_TOKEN },
        }
    );

    // Initialize the vector index from the crawled documents.
    console.log('Feeding vector index with crawling results...');
    const docs = await loader.load();
    vectorStore = await HNSWLib.fromDocuments(
        docs,
        new OpenAIEmbeddings({ openAIApiKey })
    );

    // Save the vector index to disk, then cache it in the key-value store so that the next run can skip this phase.
    console.log('Saving vector index to the disk...');
    await vectorStore.save(VECTOR_INDEX_PATH);
    await cacheVectorIndex(websiteContentCrawlerInput, VECTOR_INDEX_PATH);
}

// Load the vector index from the disk if it was not initialized above.
if (!vectorStore) {
    console.log('Initializing the vector store...');
    vectorStore = await HNSWLib.load(
        VECTOR_INDEX_PATH,
        new OpenAIEmbeddings({ openAIApiKey })
    );
}

// Next, create the retrieval chain and ask the query.
console.log('Asking model a question...');
const model = new OpenAI({ openAIApiKey });
const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever(), {
    returnSourceDocuments: true,
});
const res = await chain.call({ query });

console.log(`\n${res.text}\n`);

// Remove the vector index directory, as it is cached in the key-value store for the next run.
await rm(VECTOR_INDEX_PATH, { recursive: true });

await Actor.setValue('OUTPUT', res);
await Actor.exit();
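
The listing above passes the crawler input to retrieveVectorIndex and cacheVectorIndex, but the contents of src/vector_index_cache.js are not shown on this page. One plausible way to key such a cache is to hash the crawler input, so that the same website and page limit always map to the same key-value store record. The following is only a sketch under that assumption, not the template's actual implementation; the helper names stableStringify and vectorIndexCacheKey and the vector-index- prefix are hypothetical.

```javascript
import { createHash } from 'node:crypto';

// Hypothetical helper: serialize a value with object keys sorted,
// so that property order does not change the result.
function stableStringify(value) {
    if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
    if (value && typeof value === 'object') {
        const entries = Object.keys(value)
            .sort()
            .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
        return `{${entries.join(',')}}`;
    }
    return JSON.stringify(value);
}

// Hypothetical helper: derive a short, deterministic cache key from the crawler input.
function vectorIndexCacheKey(crawlerInput) {
    const digest = createHash('sha256').update(stableStringify(crawlerInput)).digest('hex');
    // Keep only a prefix of the hash so the key stays short.
    return `vector-index-${digest.slice(0, 32)}`;
}

const key = vectorIndexCacheKey({
    startUrls: [{ url: 'https://wikipedia.com' }],
    maxCrawlPages: 3,
});
console.log(key);
```

With a key like this, changing startUrls or maxCrawlPages produces a different key and therefore a cache miss, which matches the behavior described for forceRecrawl and the cached index.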

LangChain.js template

LangChain is a framework for developing applications powered by language models.

This example template illustrates how to use LangChain.js with Apify to crawl web data, vectorize it, and prompt an OpenAI model. All of this happens within a single Apify Actor in about a hundred lines of code.

Included features

  • Apify SDK - a toolkit for building Actors
  • Input schema - define and easily validate a schema for your Actor's input
  • LangChain.js - a framework for developing applications powered by language models
  • OpenAI - a powerful language model

How it works

The code contains the following steps:

  1. Crawls the given website using the Website Content Crawler Actor.
  2. Vectorizes the crawled data using the OpenAI API.
  3. Caches the vector index in the key-value store so that when you run the Actor for the same website again, the cached index is used to speed up the run.
  4. Feeds the data to the OpenAI model using LangChain.js and asks the given query.
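
Steps 2 and 4 rely on similarity search: each document is turned into a numeric vector, and the documents whose vectors are closest to the query's vector are handed to the model as context. The toy sketch below illustrates that idea in plain JavaScript. It is not what HNSWLib does internally: it scans every vector by brute force, whereas HNSWLib builds an approximate nearest-neighbor index. The three-dimensional vectors, the example.com URLs, and the function names cosineSimilarity and topK are made up for illustration; real embeddings come from the OpenAI API and have many more dimensions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // A zero vector has no direction, so treat it as not similar to anything.
    if (normA === 0 || normB === 0) return 0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k entries most similar to the query vector, best match first.
function topK(queryVector, entries, k = 2) {
    return entries
        .map((entry) => ({ ...entry, score: cosineSimilarity(queryVector, entry.vector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, k);
}

// Made-up vectors standing in for real document embeddings.
const entries = [
    { source: 'https://example.com/a', vector: [1, 0, 0] },
    { source: 'https://example.com/b', vector: [0.9, 0.1, 0] },
    { source: 'https://example.com/c', vector: [0, 0, 1] },
];

// Made-up vector standing in for the embedded query.
const matches = topK([1, 0.05, 0], entries, 2);
console.log(matches.map((match) => match.source));
```

In the template, this lookup is what vectorStore.asRetriever() provides to the RetrievalQAChain, which then passes the retrieved documents to the model together with the query.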

Before you start

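To run this template, both locally and on the Apify platform, you need to:

  1. Authenticate to the Apify platform by calling apify login in your terminal if you run the template locally. Without this, the template cannot run the required Website Content Crawler Actor to gather the data.
  2. Configure the OPENAI_API_KEY environment variable with your OpenAI API key, which you can obtain at https://platform.openai.com/account/api-keys, or enter the key in the Actor input.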

Production use

This serves purely as an example of the whole pipeline.

For production use, we recommend that you:

  • Separate crawling, data vectorization, and prompting into separate Actors. This way, you can run them independently and scale them separately.
  • Replace the local vector store with Pinecone or a similar database. See the LangChain.js docs for more information.

Resources


[langchain content crawler](https://www.youtube.com/watch?v=8uvHH-ocSes)
