This project provides a powerful web scraping tool that fetches search results and converts them into Markdown using FastAPI, SearXNG, and Browserless. It supports proxies for web scraping, handles HTML-to-Markdown conversion efficiently, and now features AI integration for filtering search results. Alternatives include Jina.ai, FireCrawl AI, Exa AI, and 2markdown, which offer similar web scraping and search solutions for developers.
Ensure you have the following installed:
You can use Docker to simplify the setup process. Follow these steps:
Clone the repository:
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown

Run Docker Compose:

docker compose up --build

With this setup, if you change the .env or main.py file, you no longer need to restart Docker. Changes will be reloaded automatically.
Follow these steps for manual setup:
Clone the repository:
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown

Create and activate a virtual environment:
virtualenv venv
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Create a .env file in the root directory with the following content:
SEARXNG_URL=http://searxng:8080
BROWSERLESS_URL=http://browserless:3000
TOKEN=your_browserless_token_here # Replace with your actual token
# PROXY_PROTOCOL=http
# PROXY_URL=your_proxy_url
# PROXY_USERNAME=your_proxy_username
# PROXY_PASSWORD=your_proxy_password
# PROXY_PORT=your_proxy_port
REQUEST_TIMEOUT=30
# AI Integration for search result filter
FILTER_SEARCH_RESULT_BY_AI=true
AI_ENGINE=groq
# GROQ
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=llama3-8b-8192
# OPENAI
# OPENAI_API_KEY=your_openai_api_key_here
# OPENAI_MODEL=gpt-3.5-turbo-0125

Run Docker containers for SearXNG and Browserless:

./run-services.sh

Start the FastAPI application:

uvicorn main:app --host 0.0.0.0 --port 8000

To perform a search query, send a GET request to the root endpoint / with the query parameters q (search query), num_results (number of results), and format (json for a JSON response; Markdown is returned by default).
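The same request can be issued from Python instead of curl. The helper below is a minimal, illustrative sketch (it is not part of the project's code); the host, port, and parameter names follow the curl examples in this README:

```python
# Hypothetical client helper for the search endpoint; assumes the app is
# served on localhost:8000 as in the examples above.
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"

def build_search_url(query: str, num_results: int = 5, fmt: str = "") -> str:
    """Assemble the GET URL for the root search endpoint."""
    params = {"q": query, "num_results": num_results}
    if fmt:  # omit to get the default Markdown response
        params["format"] = fmt
    return f"{BASE_URL}/?{urlencode(params)}"

print(build_search_url("python", 5, "json"))
# http://localhost:8000/?q=python&num_results=5&format=json

# With the stack running, fetch the results, e.g.:
#   import urllib.request
#   body = urllib.request.urlopen(build_search_url("python", 5)).read()
```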
Example:
curl "http://localhost:8000/?q=python&num_results=5&format=json" # for JSON format
curl "http://localhost:8000/?q=python&num_results=5" # by default MarkdownTo fetch and convert the content of a specific URL to Markdown, send a GET request to the /r/{url:path} endpoint.
Example:
curl "http://localhost:8000/r/https://example.com&format=json" # for JSON format
curl "http://localhost:8000/r/https://example.com" # by default MarkdownTo fetch image search results, send a GET request to the /images endpoint with the query parameters q (search query) and num_results (number of results).
Example:
curl "http://localhost:8000/images?q=puppies&num_results=5"To fetch video search results, send a GET request to the /videos endpoint with the query parameters q (search query) and num_results (number of results).
Example:
curl "http://localhost:8000/videos?q=cooking+recipes&num_results=5"This project uses Geonode proxies for web scraping. You can use my Geonode affiliate link to get started with their proxy services.
For a detailed explanation of the code, see the accompanying article.
This project is licensed under the MIT License. See the LICENSE file for details.
Essa Mamdani - essamamdani.com
Contributions are welcome! Please feel free to submit a Pull Request.