Wunner
A toy search engine that searches the web inside your terminal :p
Features
- Implemented in C++14.
- Crawls webpages progressively starting from seed URL(s).
- Parses the documents and the query, trying to generate more appropriate results.
- Builds an index (hash map) for the parsed documents.
- The crawled documents and index are refreshed periodically.
- Autocompletes the query using a trie built from the most recently asked queries.
- Maintains two threads, so that the index can be refreshed while queries are being served.
- Returns results most relevant first, ranked by the harmonic mean of their PageRank rank (the importance of the webpage) and their Okapi BM25 rank (the relevance to the query).
- Provides query suggestions (only when the input query does not generate any results), based on a list of commonly misspelled words and their corrections. Ranks the suggestions using an n-gram algorithm and a dynamic-programming edit distance between two strings.
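Two of the features above, combining the PageRank and BM25 ranks and comparing strings by edit distance, can be sketched as follows. This is only a minimal illustration and not the code used in Wunner: the function names harmonic_mean and edit_distance are hypothetical, the scores are assumed to be non-negative with larger meaning better, and the edit distance is assumed to be the plain Levenshtein variant with unit costs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: harmonic mean of two non-negative scores.
// It is 0 if either score is 0, so a page has to do well on both
// signals (importance and query relevance) to rank highly.
double harmonic_mean(double page_rank, double bm25)
{
    assert(page_rank >= 0.0 && bm25 >= 0.0);
    if (page_rank == 0.0 || bm25 == 0.0) {
        return 0.0;
    }
    return 2.0 * page_rank * bm25 / (page_rank + bm25);
}

// Hypothetical helper: Levenshtein edit distance (insert, delete and
// substitute each cost 1), computed with a single rolling DP row.
std::size_t edit_distance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        row[j] = j;  // distance from the empty prefix of a
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // value of dp[i-1][j-1]
        row[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t up = row[j];  // value of dp[i-1][j]
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({row[j - 1] + 1, up + 1, diag + cost});
            diag = up;
        }
    }
    return row[b.size()];
}
```

Under these assumptions, a suggestion engine could keep the candidate words with the smallest edit_distance to the mistyped word, and a ranker could sort documents by descending harmonic_mean of their two scores.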
Steps to Run
- Command to run: wunner_search (make sure your PWD is the project's root directory)
- Add the option -f or --fresh, as in wunner_search -f, to start the search engine afresh (i.e., to crawl and index again)
- After indexing completes, type your query and hit Enter to start searching
- To use autocomplete, press Ctrl+G while typing the query, then type the desired result's number to complete the query (it's not of relevance until a web UI is developed)
Steps to Build
- Clone (git clone https://github.com/Anishka0107/Wunner.git) or download this repository
- cd Wunner from where it was cloned/downloaded
Build (tested on Linux)
- Requirements: GCC (5.0 and above) / Clang (3.4 and above), Boost, Wget
- Two options:
  - Requires ar:
    - Run chmod +x wunner_build.sh
    - Run ./wunner_build.sh (note that this defaults to the g++ compiler; append a compiler name to use another one, e.g. ./wunner_build.sh clang++)
  - Requires cmake and make:
    - Run mkdir -p build && cd build && cmake .. && make -j$(nproc)
- Ultimately run wunner_search (either directly as ./build/bin/wunner_search, or do export PATH=$PATH:${PWD}/build/bin before)
Docker based (for Linux/Windows/OS-X)
- Set up Docker on your system (root privileges are needed for docker commands)
- Build the image using docker build -t wunner .
- Run using docker run -v ${PWD}:/tmp wunner wunner_search (append wunner_search options if required)
TODO checklist:
Resources
- Crawler Seed URLs ->
- Erroneous Words ->
- List of Stop Words -> https://www.webconfs.com/stop-words.php