ScrapEmbeddingNextjsDoc

Nextjs Doc Scraper

Scrape data from the Next.js docs:

This scrapes the Next.js documentation with Playwright. Cheerio then cleans and transforms the data and adds wrappers that help an AI make sense of it. Finally, the results are saved as separate files in the data/nextjs folder.

npm run scrap
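
As a rough illustration of the last step (one file per page in data/nextjs), the sketch below maps a docs URL to an output path. The helper name docUrlToFilePath, the "__" separator and the .txt extension are assumptions made for this example; the project's actual naming scheme is not shown here.

```typescript
// Hypothetical helper: map a docs page URL to a file path under data/nextjs.
// The "__" separator and ".txt" extension are assumptions for illustration.
function docUrlToFilePath(pageUrl: string, outDir: string = "data/nextjs"): string {
  const { pathname } = new URL(pageUrl);
  // Keep only non-empty path segments, e.g. ["docs", "app", "api-reference", "cli"].
  const segments = pathname.split("/").filter((s) => s.length > 0);
  // Drop the leading "docs" segment so file names stay short.
  const withoutDocs = segments[0] === "docs" ? segments.slice(1) : segments;
  // The docs root has no remaining segments, so it falls back to "index".
  const slug = withoutDocs.length > 0 ? withoutDocs.join("__") : "index";
  return `${outDir}/${slug}.txt`;
}

console.log(docUrlToFilePath("https://nextjs.org/docs/app/api-reference/cli"));
// prints "data/nextjs/app__api-reference__cli.txt"
```

Flattening the path into a single file name keeps every page in one folder, which matches the "separate files in data/nextjs" description.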

Link to Playwright

Link to npm Cheerio

Scrape stats:

To get statistics on the scraped data, run this command:

  npm run scrapstat

Create a database to store the embedding data:

On Neon.tech, create a database (Neon is used because it is compatible with vector data) and create a collection to store the data.
Add the connection string to DATABASE_URL in .env. Be sure to fill in the user name and replace ******* with the password.
Create the tables with the SQL commands in database.sql:

-- Caution: this drops everything in the public schema, including existing tables.
DROP SCHEMA public CASCADE;

CREATE SCHEMA public;

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (text text, n_tokens integer, file_path text, embeddings vector(1536));

CREATE INDEX ON documents USING ivfflat (embeddings vector_cosine_ops);

CREATE TABLE IF NOT EXISTS openai_ft_data (
  id SERIAL PRIMARY KEY,
  query TEXT NOT NULL,
  answer TEXT NOT NULL,
  suggested_answer TEXT,
  user_feedback BOOLEAN
);

CREATE TABLE IF NOT EXISTS usage (
  id SERIAL PRIMARY KEY,
  ip_address TEXT NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

Link to Neon
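
Before running the SQL, it can help to confirm that DATABASE_URL was filled in completely. The sketch below is one possible check and is not part of the project: checkDatabaseUrl, tryParseUrl and the example host name are hypothetical, and the code relies only on the standard URL parser.

```typescript
// Parse a string as a URL, returning null instead of throwing on bad input.
function tryParseUrl(value: string): URL | null {
  try {
    return new URL(value);
  } catch {
    return null;
  }
}

// Hypothetical check for a Postgres connection string of the form
// postgresql://user:password@host/dbname.
// Returns a list of problems; an empty list means the string looks complete.
function checkDatabaseUrl(databaseUrl: string): string[] {
  const parsed = tryParseUrl(databaseUrl);
  if (parsed === null) return ["not a valid URL"];

  const problems: string[] = [];
  if (parsed.protocol !== "postgres:" && parsed.protocol !== "postgresql:") {
    problems.push("protocol should be postgres:// or postgresql://");
  }
  if (parsed.username === "") {
    problems.push("user name is missing");
  }
  // A password made only of asterisks is the unreplaced ******* placeholder.
  if (parsed.password === "" || /^\*+$/.test(parsed.password)) {
    problems.push("password is missing or still the ******* placeholder");
  }
  // pathname is "/" (or empty) when no database name was given.
  if (parsed.pathname.length <= 1) {
    problems.push("database name is missing");
  }
  return problems;
}

// Example host name is made up for illustration.
console.log(checkDatabaseUrl("postgresql://alice:*******@ep-example.neon.tech/neondb"));
```

Returning a list of problems rather than throwing on the first one makes it easier to fix the .env entry in a single pass.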

OpenAI key:

Add your OpenAI key to .env so that the API can be used to embed the data.

Link to OpenAI
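
To show where the key ends up, here is a minimal sketch of a request to the OpenAI embeddings endpoint with the text-embedding-3-small model. The function names buildEmbeddingRequest and extractEmbeddings are hypothetical, and how the project reads the key from .env is not shown here, so the key is simply passed in as a parameter.

```typescript
interface EmbeddingRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build (but do not send) a request for the OpenAI embeddings endpoint.
function buildEmbeddingRequest(apiKey: string, inputs: string[]): EmbeddingRequest {
  if (apiKey.trim() === "") {
    throw new Error("OpenAI key is empty; check the .env file");
  }
  return {
    url: "https://api.openai.com/v1/embeddings",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model: "text-embedding-3-small", input: inputs }),
    },
  };
}

// The response lists one entry per input; sort by index to restore input order.
function extractEmbeddings(response: { data: { index: number; embedding: number[] }[] }): number[][] {
  return [...response.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
}
```

The returned url and init can be passed to fetch, and the parsed JSON response to extractEmbeddings. Keeping the request construction separate from the network call makes it possible to inspect the payload without spending API credits.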

Embed the data:

 npm run embedding

This command performs the following actions:

Create an array of objects with the texts and file names, and save it to a JSON file (texts.json)
Tokenize all texts with tiktoken to get the token count, and save it to a JSON file (textsTokens.json)
Split the texts into chunks of at most 1500 tokens. When a text has to be split, it is split at its subtitles (h2 tags). Save the result to a JSON file (textsTokensSplited.json)
Embed all split texts with text-embedding-3-small from OpenAI, and save the result to a JSON file (textsTokensSplitedEmbedding.json)
Save the embedding data to the database
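
The last action, saving to the database, could be sketched as follows against the documents table defined in database.sql. This is an illustration rather than the project's saveToDatabase code: EmbeddedChunk, toVectorLiteral and buildInsert are hypothetical names, and the { text, values } shape assumes a driver that accepts parameterized queries in that form (node-postgres does).

```typescript
interface EmbeddedChunk {
  text: string;
  nTokens: number;
  filePath: string;
  embedding: number[];
}

// pgvector accepts vectors written as text literals such as "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

// Build a parameterized INSERT for one chunk.
// dimensions defaults to 1536 to match the vector(1536) column.
function buildInsert(
  chunk: EmbeddedChunk,
  dimensions: number = 1536
): { text: string; values: (string | number)[] } {
  if (chunk.embedding.length !== dimensions) {
    throw new Error(`expected ${dimensions} dimensions, got ${chunk.embedding.length}`);
  }
  return {
    text:
      "INSERT INTO documents (text, n_tokens, file_path, embeddings) " +
      "VALUES ($1, $2, $3, $4::vector)",
    values: [chunk.text, chunk.nTokens, chunk.filePath, toVectorLiteral(chunk.embedding)],
  };
}
```

Checking the vector length before inserting surfaces a model or column mismatch early, instead of as a database error part-way through a long run.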

The tiktoken library is used to transform text into tokens. It is used here to count the tokens in each text, which determines whether the text must be split before it can be embedded with OpenAI.
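
The splitting rule described above (at most 1500 tokens, cutting at h2 subtitles) might look roughly like this. It is a sketch under assumptions: the text is assumed to still contain literal <h2> tags, splitByH2 is a hypothetical name, and the token counter is injected so that a tiktoken-based counter can be plugged in. The countWords stand-in below counts whitespace-separated words and is only there to keep the example self-contained.

```typescript
type TokenCounter = (text: string) => number;

// Stand-in for a real tokenizer: one "token" per whitespace-separated word.
const countWords: TokenCounter = (text) =>
  text.split(/\s+/).filter((w) => w.length > 0).length;

// Split text at <h2> boundaries so that each chunk stays within maxTokens.
// A single section that is larger than maxTokens is kept whole, not cut further.
function splitByH2(text: string, maxTokens: number, countTokens: TokenCounter): string[] {
  if (countTokens(text) <= maxTokens) return [text];

  // Zero-width split: every piece after the first starts with its own <h2> tag.
  const sections = text.split(/(?=<h2[\s>])/i).filter((s) => s.trim().length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const section of sections) {
    const candidate = current + section;
    if (current !== "" && countTokens(candidate) > maxTokens) {
      // Adding this section would exceed the budget: close the current chunk.
      chunks.push(current);
      current = section;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "intro one two <h2>A</h2> three four five <h2>B</h2> six seven";
console.log(splitByH2(sample, 6, countWords).length); // prints 3
```

Packing consecutive sections greedily keeps related subtitles together where the budget allows, which tends to produce fewer and more coherent chunks than cutting at every h2.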

⏳ Link to npm tiktoken / Link to the tiktoken GitHub

You can uncomment the displayTokenLengthStats function to check the token statistics before saveToDatabase runs. In that case, don't forget to comment out the saveToDatabase function.
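
The body of displayTokenLengthStats is not shown here. As an idea of what such a check could report, the sketch below summarizes a list of per-chunk token counts. The name summarizeTokenLengths, the TokenStats fields and the default limit of 1500 (taken from the split size above) are assumptions for this example.

```typescript
interface TokenStats {
  count: number;
  min: number;
  max: number;
  mean: number;
  overLimit: number; // how many chunks exceed the limit
}

// Summarize per-chunk token counts before sending them for embedding.
function summarizeTokenLengths(lengths: number[], limit: number = 1500): TokenStats {
  if (lengths.length === 0) {
    return { count: 0, min: 0, max: 0, mean: 0, overLimit: 0 };
  }
  let min = Infinity;
  let max = -Infinity;
  let total = 0;
  let overLimit = 0;
  for (const n of lengths) {
    if (n < min) min = n;
    if (n > max) max = n;
    total += n;
    if (n > limit) overLimit += 1;
  }
  return { count: lengths.length, min, max, mean: total / lengths.length, overLimit };
}

console.log(summarizeTokenLengths([100, 200, 1800]));
```

A non-zero overLimit value would point to sections that stayed above the split size, for example a page with a very long passage between two subtitles.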

Additional Information

Version 1.0.0
Type Other source code
Update Time 2025-05-30
size 19.46KB
From Github
