A scalable web crawler. Here is a list of the features of this crawler:
By saving the representations into a vector database, you can retrieve similar pages according to how close two vectors are. This is critical for a browser to retrieve the most relevant results.
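To make "how close two vectors are" concrete, here is a minimal sketch of similarity-based retrieval in plain Python. It assumes cosine similarity as the closeness measure, which is one common choice; the metric this project configures in Milvus is not stated here. The URLs, the toy 3-dimensional vectors, and the helper names `cosine_similarity` and `top_k` are illustrative and not part of the project.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, pages, k=2):
    """Rank stored pages by similarity to the query vector, best first."""
    scored = [(url, cosine_similarity(query, vec)) for url, vec in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumption); bert-base-uncased vectors have 768 dimensions.
pages = {
    "https://example.com/cats": [0.9, 0.1, 0.0],
    "https://example.com/dogs": [0.8, 0.3, 0.1],
    "https://example.com/taxes": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], pages, k=2)
for url, score in results:
    print(f"{score:.3f}  {url}")
```

A vector database performs the same ranking, but uses an index so that it does not have to compare the query against every stored vector.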
Run the crawler from the terminal:
$ python cli_crawl.py --help
options:
-h, --help show this help message and exit
-u INITIAL_URLS [INITIAL_URLS ...], --initial-urls INITIAL_URLS [INITIAL_URLS ...]
-lm LANGUAGE_MODEL, --language-model LANGUAGE_MODEL
-m MAX_DEPTH, --max-depth MAX_DEPTH
Host the API with uvicorn and FastAPI.
uvicorn api_app:app --host 0.0.0.0 --port 80
Take a look at the example in start_api_and_head_node.sh. Note that the Ray head node needs to be initialized first.
For our use case, we simply use the BERT model implemented by Hugging Face to extract embeddings from the web text. More precisely, we use bert-base-uncased. Note that the code is model-agnostic: new models can be registered and added with a few lines of code. Take a look at llm/best.py.
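As a rough illustration of what "registering" a model could look like, here is a sketch of a name-to-function registry. Everything in it is hypothetical: `EMBEDDERS`, `register_model`, `embed`, and the toy `char-count-demo` model are invented for this example and may differ from how the project actually registers models.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a model name to a function that embeds text.
EMBEDDERS: Dict[str, Callable[[str], List[float]]] = {}


def register_model(name: str):
    """Decorator that adds an embedding function to the registry under `name`."""
    def decorator(fn: Callable[[str], List[float]]):
        if name in EMBEDDERS:
            raise ValueError(f"model {name!r} is already registered")
        EMBEDDERS[name] = fn
        return fn
    return decorator


@register_model("char-count-demo")
def char_count_embedding(text: str) -> List[float]:
    """Toy stand-in for a real model: counts vowels, consonants, and digits."""
    lower = text.lower()
    vowels = sum(c in "aeiou" for c in lower)
    consonants = sum(c.isalpha() for c in lower) - vowels
    digits = sum(c.isdigit() for c in lower)
    return [float(vowels), float(consonants), float(digits)]


def embed(model_name: str, text: str) -> List[float]:
    """Look up a registered model by name and apply it to the text."""
    try:
        fn = EMBEDDERS[model_name]
    except KeyError:
        raise ValueError(f"unknown model {model_name!r}") from None
    return fn(text)


print(embed("char-count-demo", "Web 2.0"))
```

With a pattern like this, the rest of the pipeline only needs the model name, so swapping the toy function for a real BERT wrapper would not touch the crawler code.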
We use Milvus as our main database management system. We use a vector database because of its inherent ability to store and search entries based on vector representations (embeddings).
Start your standalone Milvus server as follows. I suggest using a terminal multiplexer such as tmux:
tmux new -s milvus
milvus-server
Take a look under scripts/ to see some of the basic requests to Milvus.
You can also use the official docker compose template:
docker compose --file milvus-docker-compose.yml up -d
We use Ray, a great Python framework for distributed and parallel processing. Ray follows the master-worker paradigm, where a head node submits tasks to be executed by the connected workers.
ray start --head
import ray
# Connect to the head
ray.init("auto")
In case you want to stop the Ray node:
ray stop
Or check the status:
ray status
ray start
The worker node does not need a copy of the code, as the head node serializes and submits the arguments and implementation to the workers.
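The head/worker split can be pictured with a standard-library stand-in. This sketch is not Ray: it uses `concurrent.futures.ThreadPoolExecutor` so that it runs without a cluster, and the task `fetch_title` and helper `run_on_workers` are hypothetical. In Ray, the rough equivalents would be decorating the task with `@ray.remote`, submitting it with `fetch_title.remote(url)`, and collecting results with `ray.get(...)`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_title(url: str) -> str:
    """Hypothetical task; a real crawler would download and parse the page."""
    return url.rstrip("/").rsplit("/", 1)[-1].upper()


def run_on_workers(urls, num_workers=2):
    """Play the head node: submit one task per URL, then gather the results."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fetch_title, url) for url in urls]
        # Collect in submission order so results line up with the input URLs.
        return [future.result() for future in futures]


titles = run_on_workers(["https://example.com/a", "https://example.com/b/"])
print(titles)
```

The key difference is that Ray also serializes the function and its arguments and ships them to other machines, which is why the workers do not need the code installed.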
The current implementation is a PoC. Many improvements can be made:
All issues and PRs are welcome.