See the code on GitHub: https://github.com/GINK03/minimal-search-engine
The crawler visits a wide variety of sites, and requests will often fail to infer the character encoding, so you should have nkf installed on a Linux environment.
$ sudo apt install nkf
$ which nkf
/usr/bin/nkf
Also install MeCab:
$ sudo apt install mecab libmecab-dev mecab-ipadic
$ sudo apt install mecab-ipadic-utf8
$ sudo apt install python-mecab
$ pip3 install mecab-python3
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ ./bin/install-mecab-ipadic-neologd -n
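To make sure the MeCab bindings work before moving on, a quick sanity check like the following (not part of the repository) should print a space-separated token list:
# quick sanity check that mecab-python3 works (not part of the repository)
import MeCab

tagger = MeCab.Tagger('-Owakati')   # wakati output: space-separated tokens
print(tagger.parse('ふわふわのぬいぐるみを検索する').strip())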
Install the remaining dependencies
$ pip3 install -r requirements.txt
Basically, you can reproduce the GitHub code on Linux (e.g. Ubuntu) by running the steps below in order, starting from A.
If you try the crawler (scraper) yourself, it will run indefinitely, so the assumption is that you choose your own seed URLs and stop it at a reasonable point.
A. Crawling
B. Parsing the crawled HTML into title, description, body, and hrefs to format the data
C. Creating an IDF dictionary
D. Create TFIDF data
F. Creating a mapping between URLs and the hrefs that reference them (a simple in-link feature)
G. Counting how often each URL is referenced and creating training data for PageRank
H. Creating an inverted index of URLs and tfidf weights
I. Creating a correspondence table between a hashed URL and an actual URL
J. Learning PageRank
K. Search interface

We crawl comprehensively, not restricted to any specific domain. I use my own blog as the seed and let the crawler go deeper and deeper from there, without limiting the domains it visits.
It crawls a wide variety of sites and is very heavy, so I use my own distributed KVS as the backend database: files tend to get corrupted with SQLite, and LevelDB only allows single-process access.
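For orientation, here is a minimal crawling sketch under my own assumptions; the actual repo uses a distributed KVS backend and nkf for encoding detection, both of which are simplified away here, and the seed URL is hypothetical.
# minimal crawling sketch (assumptions: plain files instead of a KVS,
# requests' apparent_encoding instead of nkf)
import collections
import hashlib
import pathlib

import requests
from bs4 import BeautifulSoup

SEEDS = ['https://example-blog.example/']   # hypothetical seed URLs
MAX_PAGES = 1000

def crawl(seeds, max_pages):
    queue = collections.deque(seeds)
    seen = set(seeds)
    out_dir = pathlib.Path('tmp/html')
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            r = requests.get(url, timeout=10)
            r.encoding = r.apparent_encoding   # crude stand-in for nkf
        except requests.RequestException:
            continue
        # store the raw HTML keyed by a short sha256 hash of the URL
        key = hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]
        (out_dir / f'{key}.html').write_text(r.text, encoding='utf-8', errors='ignore')
        fetched += 1
        # enqueue outgoing links for later rounds
        soup = BeautifulSoup(r.text, features='lxml')
        for a in soup.find_all('a', href=True):
            href = a['href']
            if href.startswith('http') and href not in seen:
                seen.add(href)
                queue.append(href)

if __name__ == '__main__':
    crawl(SEEDS, MAX_PAGES)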

The data obtained in A is too large as-is, so step B extracts the main features used for TF-IDF search: "title", "description", and "body".
It also extracts all the external URLs that the page links to (see the short sketch after the snippet below).
This can be processed easily with BeautifulSoup:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features='lxml')
# drop script and style tags so they do not pollute the body text
for script in soup(['script', 'style']):
    script.decompose()
title = soup.title.text
description = soup.find('head').find('meta', {'name': 'description'})
if description is None:
    description = ''
else:
    description = description.get('content')
body = soup.find('body').get_text()
body = re.sub('\n', ' ', body)
body = re.sub(r'\s{1,}', ' ', body)
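The external links mentioned above can be collected from the same soup object; this is a small sketch and the variable name hrefs is mine, not necessarily the repository's:
# collect external links from the same parsed page (illustrative sketch)
hrefs = [a.get('href') for a in soup.find_all('a', href=True)
         if a.get('href', '').startswith('http')]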
To reduce the importance of frequently occurring words, step C counts how many documents each word appears in (an IDF dictionary).
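For reference, a minimal sketch of building such an IDF dictionary; the function name and data layout are my own, not the repository's:
# minimal IDF-dictionary sketch (assumed structure, not the repo's exact code)
import math

def build_idf(documents):
    """documents: list of term lists, one list per document."""
    num_docs = len(documents)
    doc_freq = {}
    for terms in documents:
        for term in set(terms):                 # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # rarer terms get larger IDF values; tf * idf then gives the TF-IDF weight
    return {term: math.log(num_docs / df) for term, df in doc_freq.items()}

if __name__ == '__main__':
    docs = [['ふわふわ', 'ぬいぐるみ'], ['ふわふわ', 'パンケーキ'], ['検索', 'エンジン']]
    print(build_idf(docs))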
Step D uses the data from B and C to complete the TF-IDF computation.
The title, description, and body carry different levels of importance; I treated them as title : description : body = 1 : 1 : 0.001.
# title / description weight = 1
text = arow.title + arow.description
text = sanitize(text)
for term in m.parse(text).strip().split():
    if term_freq.get(term) is None:
        term_freq[term] = 0
    term_freq[term] += 1
# body weight = 0.001
text = arow.body
text = sanitize(text)
for term in m.parse(text).strip().split():
    if term_freq.get(term) is None:
        term_freq[term] = 0
    term_freq[term] += 0.001  # set this weight to a small value such as 0.001
I knew it used to be common to boost a page's SEO by placing links to its URL from many different places, so step F captures how much each page is referenced from external sites.
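A rough sketch of that in-link counting, under my own assumption that step B yields (url, hrefs) pairs; this is not the repository's exact code:
# count how often each URL is referenced (refnum)
import collections

def count_refnum(parsed_pages):
    """parsed_pages: iterable of (url, hrefs) pairs produced in step B."""
    refnum = collections.Counter()
    for _url, hrefs in parsed_pages:
        for href in set(hrefs):        # count each referring page at most once
            refnum[href] += 1
    return refnum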
Based on the data created in F, step G builds the training data so that the PageRank weight of each node can be learned with a library called networkx.
The desired input is a dataset like the following (the right-hand hash is the link source, the left-hand hash is the link destination):
d2a88da0ca550a8b 37a3d49657247e61
d2a88da0ca550a8b 6552c5a8ff9b2470
d2a88da0ca550a8b 3bf8e875fc951502
d2a88da0ca550a8b 935b17a90f5fb652
7996001a6e079a31 aabef32c9c8c4c13
d2a88da0ca550a8b e710f0bdab0ac500
d2a88da0ca550a8b a4bcfc4597f138c7
4cd5e7e2c81108be 7de6859b50d1eed2
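A sketch of producing such an edge list from the (url, hrefs) data of step B; the helper and variable names are assumptions that merely match the sample format above:
# write "destination_hash source_hash" lines for networkx (sketch; the right-hand
# hash is the link source, matching the sample above)
import hashlib

def short_hash(url):
    return hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]

def write_edgelist(parsed_pages, out_path='tmp/to_pagerank.txt'):
    """parsed_pages: iterable of (url, hrefs) pairs produced in step B."""
    with open(out_path, 'w') as fp:
        for url, hrefs in parsed_pages:
            for href in hrefs:
                fp.write(f'{short_hash(href)} {short_hash(url)}\n')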
To handle searches even for single simple words, step H creates an index that lets you look up URLs from words. The output is one text file per word (named by the word's hash value); each line holds a URL hash, a weight (tfidf), and a refnum (the number of times the URL is referenced), which together form a concrete inverted index.
0010c40c7ed2c240 0.000029752 4
000ca0244339eb34 0.000029773 0
0017a9b7d83f5d24 0.000029763 0
00163826057db7c3 0.000029773 0
Keeping the raw URLs in memory would cause it to overflow, so each URL is hashed with sha256 and only the first 16 characters are kept as a compact hash value; with this, even documents on the order of one million can be searched with minimally practical resources.
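Putting H and the hashing trick together, a rough sketch of how such per-word index files could be written; the file layout and variable names are my assumptions based on the sample output above, not the repository's code:
# sketch of writing the inverted-index files in H
import hashlib
import os

def short_hash(text):
    # sha256 truncated to the first 16 hex characters to keep memory usage small
    return hashlib.sha256(text.encode('utf-8')).hexdigest()[:16]

def write_inverted_index(term_postings, refnum, out_dir='tmp/index'):
    """term_postings: {term: {url: tfidf_weight}}, refnum: {url: in-link count}."""
    os.makedirs(out_dir, exist_ok=True)
    for term, postings in term_postings.items():
        with open(os.path.join(out_dir, short_hash(term)), 'w') as fp:
            for url, weight in postings.items():
                fp.write(f'{short_hash(url)} {weight:.9f} {refnum.get(url, 0)}\n')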
Step J learns the PageRank value of each URL from the data created in G.
Using networkx, this can be done with very simple code:
import networkx as nx
import json
G = nx.read_edgelist('tmp/to_pagerank.txt', nodetype=str)
# print the number of nodes and edges
print(nx.number_of_nodes(G))
print(nx.number_of_edges(G))
print('start calc pagerank')
pagerank = nx.pagerank(G)
print('finish calc pagerank')
json.dump(pagerank, fp=open('tmp/pagerank.json', 'w'), indent=2)
K provides the search interface.
$ python3 K001_search_query.py
(enter the search query here)
Example:
$ python3 K001_search_query.py
ふわふわ
hurl weight refnum weight_norm url pagerank weight*refnum_score+pagerank
9276 36b736bccbbb95f2 0.000049 1 1.000000 https://bookwalker.jp/dea270c399-d1c5-470e-98bd-af9ba8d8464a/ 0.000146 1.009695
2783 108a6facdef1cf64 0.000037 0 0.758035 http://blog.livedoor.jp/usausa_life/archives/79482577.html 1.000000 0.995498
32712 c3ed3d4afd05fc43 0.000045 1 0.931093 https://item.fril.jp/bc7ae485a59de01d6ad428ee19671dfa 0.000038 0.940083
... When I searched for "Rai-chan", I was able to tune the scoring so that the information I wanted appeared roughly at the top.
Pixiv was never explicitly set as a crawl target, but it was picked up automatically as the crawler in A followed links and built the index.

Other queries, such as "Shokuro", also returned the results I wanted.

After trying various things by hand, I found this scoring to perform best. (I myself served as the ground-truth data.)
