ScrapEmbeddingNextjsDoc
1.0.0
This project scrapes the Next.js documentation with Playwright. Cheerio then transforms and cleans the data and adds wrappers that help an AI model make sense of it. Finally, the result is saved as separate files in the data/nextjs folder.
npm run scrap
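As a rough illustration of the cleaning, wrapping, and saving steps described above, here is a minimal sketch. It does not use Playwright or Cheerio and is not the project's actual code: the `DocSection` shape, the `Title:` / `Content:` wrapper format, and the file naming scheme under data/nextjs are all assumptions made for the example.

```typescript
// Hypothetical shape of one section after extraction (the real script uses Cheerio for this).
interface DocSection {
  heading: string;
  body: string;
}

// Collapse runs of whitespace so the saved text is compact.
function cleanText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Wrap a section in labels (assumed format) so a model can tell the title from the content.
function wrapSection(section: DocSection): string {
  return `Title: ${cleanText(section.heading)}\nContent: ${cleanText(section.body)}`;
}

// Map a documentation URL to a file under data/nextjs (the naming scheme is an assumption).
function outputPath(url: string): string {
  const path = url.replace(/^https?:\/\/[^\/]+/, "").replace(/[?#].*$/, "");
  const slug = path.split("/").filter((part) => part.length > 0).join("_");
  return `data/nextjs/${slug || "index"}.txt`;
}
```

Keeping these helpers as pure functions makes the transformation easy to check without launching a browser.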
To get statistics on the scraped data, run this command:
npm run scrapstat
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (text text, n_tokens integer, file_path text, embeddings vector(1536));
CREATE INDEX ON documents USING ivfflat (embeddings vector_cosine_ops);
CREATE TABLE IF NOT EXISTS openai_ft_data (
  id SERIAL PRIMARY KEY,
  query TEXT NOT NULL,
  answer TEXT NOT NULL,
  suggested_answer TEXT,
  user_feedback BOOLEAN
);
CREATE TABLE IF NOT EXISTS usage (
  id SERIAL PRIMARY KEY,
  ip_address TEXT NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
npm run embedding
This command performs the following actions:
The tiktoken library is used to transform text into tokens. We use it to count the tokens in each text, so that the text can be split into chunks small enough to embed with OpenAI.
⏳ Link to tiktoken on npm / Link to tiktoken on GitHub
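The token-based splitting described above could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: `splitByTokens` and `wordCounter` are hypothetical names, splitting on sentence boundaries is one possible strategy, and the token counter is injected so the example runs without tiktoken. In the real script the counter would be backed by tiktoken.

```typescript
// Counts tokens in a string. Injected so this sketch has no dependencies;
// in practice it would wrap a tiktoken encoder.
type TokenCounter = (text: string) => number;

// Split text into chunks whose token count stays within maxTokens.
// Sentences are kept whole; a single sentence over the budget becomes its own chunk.
function splitByTokens(text: string, maxTokens: number, countTokens: TokenCounter): string[] {
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && countTokens(candidate) > maxTokens) {
      // Adding this sentence would exceed the budget: close the current chunk.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Stand-in counter (assumption): one token per whitespace-separated word.
const wordCounter: TokenCounter = (text) =>
  text.split(/\s+/).filter((word) => word.length > 0).length;
```

For example, `splitByTokens(text, 500, wordCounter)` returns an array of chunks that each stay within 500 "tokens" under the stand-in counter, except for any single sentence that is longer than the budget on its own.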
To check statistics on the tokens that would be sent before calling saveToDatabase, uncomment the displayTokenLengthStats function. In that case, remember to comment out the saveToDatabase function.
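To give an idea of what such a check can report, here is a minimal sketch of a statistics helper. It is not the project's displayTokenLengthStats: the `tokenLengthStats` name, the reported fields, and the `limit` parameter are assumptions chosen for illustration.

```typescript
// Summary of token counts across all chunks (fields are assumptions for this sketch).
interface TokenLengthStats {
  count: number;     // number of chunks
  min: number;       // smallest chunk, in tokens
  max: number;       // largest chunk, in tokens
  mean: number;      // average chunk size, in tokens
  overLimit: number; // chunks exceeding the given token limit
}

// Compute summary statistics for a list of per-chunk token counts.
function tokenLengthStats(tokenCounts: number[], limit: number): TokenLengthStats {
  if (tokenCounts.length === 0) {
    return { count: 0, min: 0, max: 0, mean: 0, overLimit: 0 };
  }
  let min = Infinity;
  let max = -Infinity;
  let total = 0;
  let overLimit = 0;
  for (const n of tokenCounts) {
    if (n < min) min = n;
    if (n > max) max = n;
    if (n > limit) overLimit += 1;
    total += n;
  }
  return { count: tokenCounts.length, min, max, mean: total / tokenCounts.length, overLimit };
}
```

A non-zero `overLimit` before saving is a sign that some chunks still need to be split further.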