This is a technical blog about OCR (image text recognition), machine learning, and building a simple search engine. It is where I record research results and experiences day by day while working on my graduation project.
OCR (Optical Character Recognition) technology refers to the process in which electronic devices (such as scanners or digital cameras) check characters printed on paper, determine their shapes by detecting dark and bright patterns, and then translate the shapes into computer text using character recognition methods.
The Tesseract OCR engine was first developed at HP Labs starting in 1985, and by 1995 it had become one of the three most accurate recognition engines in the OCR industry. HP soon decided to abandon its OCR business, however, and Tesseract was shelved. Years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a second life: in 2005 HP released Tesseract in cooperation with the Information Science Research Institute at the University of Nevada, Las Vegas, and Google later took on the work of improving it, fixing bugs, and optimizing it. Tesseract has since been hosted as an open-source project on Google Code, and its latest 3.0 release already supports Chinese OCR.
Against this mature technical background, I want to combine OCR with today's popular mobile Internet development and information-retrieval technologies to build a mobile web search engine that can recognize the Chinese characters in a picture, with the goal of getting the information you want out of images more quickly and accurately.
With the rapid development of the Internet and the arrival of big data, people depend more and more on data and information. Today's Internet data is enormous, though, and the accuracy and sensible classification of that data have always been a problem. Because of this, more and more people want a more convenient way to obtain accurate data in daily work and life, and a more efficient way to find the information they are looking for. At the same time, with the popularity of smartphones, more and more people are used to recording things by taking photos instead of copying or typing them out. Inspired by this, I want to use the now fairly mature OCR (optical character recognition) technology, together with popular Internet development and information-retrieval technologies, to build a web search engine that can recognize the text in an image, aiming to let users get the information they want out of pictures more quickly and accurately through convenient actions such as taking photos or screenshots.
The backend architecture is mainly divided into three major modules: OCR module, search engine module, and PHP message middleware module.
``` shell
brew install tesseract
```
```shell
sudo xcodebuild -license
...
agree
```
``` shell
brew install tesseract
```
```shell
Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.
```
where:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...] means: tesseract <image name> <output file name> -l <language data file> -psm <page segmentation mode> <config file>.
For example: tesseract code.jpg result -l chi_sim -psm 7 nobatch
-l chi_sim means using the Simplified Chinese language data (you need to download the Chinese data file, decompress it, and put it in the tessdata directory; the data file extension is .traineddata, and the Simplified Chinese file is named chi_sim.traineddata). -psm 7 tells Tesseract that code.jpg contains a single line of text, which reduces the recognition error rate; the default is 3.
**English font test:**
**Chinese font test:**
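For reference, the two tests above boil down to commands like the following (the image file names are illustrative):

```shell
# English test, using the default English language data
tesseract test_en.png result_en
cat result_en.txt

# Simplified Chinese test, treating the image as a single text line
tesseract test_cn.png result_cn -l chi_sim -psm 7
cat result_cn.txt
```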
Now let's build our own font language library and train it on font sample data.
**font_properties (new in 3.01)**
A new requirement for training in 3.01 is a font_properties file. The purpose of this file is to provide font style information that will appear in the output when the font is recognized. The font_properties file is a text file specified by the -F filename option to mftraining.
Each line of the font_properties file is formatted as follows:
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
where <fontname> is a string naming the font (no spaces allowed!), and <italic>, <bold>, <fixed>, <serif> and <fraktur> are all simple 0 or 1 flags indicating whether the font has the named property.
When running mftraining, each .tr filename must match an entry in the font_properties file, or mftraining will abort. At some point, possibly before the release of 3.01, this matching requirement is likely to shift to the font name in the .tr file itself. The name of the .tr file may be either fontname.tr or [lang].[fontname].exp[num].tr.
**Example:**
font_properties file:
timesitalic 1 0 0 1 0
shapeclustering -F font_properties -U unicharset eng.timesitalic.exp0.tr
mftraining -F font_properties -U unicharset -O eng.unicharset eng.timesitalic.exp0.tr
Note that in 3.03, there is a default font_properties file, that covers 3000 fonts (not necessarily accurately) in training/langdata/font_properties.
**Clustering**
When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes.
The character shape features can be clustered using the shapeclustering, mftraining and cntraining programs:
**shapeclustering (new in 3.02)**
shapeclustering should not be used except for the Indic languages.
shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
shapeclustering creates a master shape table by shape clustering and writes it to a file named shapetable.
**mftraining**
mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...
The -U file is the unicharset generated by unicharset_extractor above, and lang.unicharset is the output unicharset that will be given to combine_tessdata.
mftraining will output two other data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character). In versions 3.00/3.01, a third file called Microfeat is also written by this program, but it is not used. Later versions don't produce this file.
NOTE: mftraining will produce a shapetable file if you didn't run shapeclustering. You must include this shapetable in your traineddata file, whether or not shapeclustering was used.
**cntraining**
cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr ...
This will output the normproto data file (the character normalization sensitivity prototypes).
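Putting these pieces together, a rough sketch of the whole 3.0x training flow for a single hypothetical Simplified Chinese font called "myfont" might look like this (the file names are illustrative, and the box file must be corrected by hand before the box.train step):

```shell
# 1. Generate the box file from the training image, then correct it by hand
tesseract chi_sim.myfont.exp0.tif chi_sim.myfont.exp0 batch.nochop makebox
# 2. Produce the .tr feature file
tesseract chi_sim.myfont.exp0.tif chi_sim.myfont.exp0 box.train
# 3. Extract the character set
unicharset_extractor chi_sim.myfont.exp0.box
# 4. Describe the font: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
echo "myfont 0 0 0 0 0" > font_properties
# 5. Cluster the features
mftraining -F font_properties -U unicharset -O chi_sim.unicharset chi_sim.myfont.exp0.tr
cntraining chi_sim.myfont.exp0.tr
# 6. Prefix the outputs with the language code and combine them
mv inttemp chi_sim.inttemp
mv pffmtable chi_sim.pffmtable
mv shapetable chi_sim.shapetable
mv normproto chi_sim.normproto
combine_tessdata chi_sim.
# The result, chi_sim.traineddata, goes into the tessdata directory
# (back up the stock chi_sim.traineddata first, or use a different language code)
```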
Official wiki
Chinese guidance
On the Mac, use Spotlight to open Terminal, then edit the profile:
vi /etc/profile
Press i to enter insert mode and add the following lines:
```shell
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_77
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
```
Then press Esc to finish editing, type :wq! to save and exit, and run:
source /etc/profile
java -version
If the Java version information appears, it proves that the installation is successful!
Introduction: when a code project grows large, recompiling, packaging, and testing it again and again becomes complicated and repetitive. In the C world, make scripts help batch these tasks. Java applications are platform-independent, so naturally they do not rely on platform-specific make scripts; Ant is the corresponding build-script engine for Java, used to automate compiling, packaging, and testing a project. Besides being platform-independent (it is based on Java), its script format is XML, which is easier to maintain than make scripts.
Version selection: apache-ant-1.9.6-bin.zip
Download address: Ant official website
Start the installation:
sudo sh
cd /usr/local/
chown YourUserName:staff apache-ant-1.9.6
ln -s apache-ant-1.9.6 ant
vi /etc/profile
Press i to enter insert mode and add the following two lines:
```shell
export ANT_HOME=/usr/local/ant
export PATH=${PATH}:${ANT_HOME}/bin
```
Then press Esc to finish editing, type :wq! to save and exit, and run:
source /etc/profile
ant -version
If output like "Apache Ant(TM) version 1.9.6 compiled on ..." appears, the installation succeeded!
Introduction: Nutch currently comes in two major versions, 1.x and 2.x, which Apache develops and maintains independently. The biggest difference is that 1.x stores its data on Hadoop's HDFS file system, while 2.x abstracts the storage layer and can keep data in stores such as HBase or MySQL. Another important point is that up to release 1.2 Nutch shipped as a complete search engine; since 1.3 Nutch itself mainly provides crawling, and if you want to index and search the crawled data you also need the Solr full-text search server. Since both Nutch and Solr are built on Lucene, data crawled by Nutch can easily be indexed in Solr.

The Nutch website offers pre-built packages for 1.x, but 2.x is distributed only as source and has to be compiled yourself. Nutch is built with Ant, so compiling it yourself requires Ant.

On choosing a Nutch version: if you only need to crawl and index a small number of sites, either 1.x or 2.x will do, even stand-alone without a distributed setup. But if you want to crawl a large number of sites, or even the whole web, 1.x in distributed mode is the better choice, because it is based on the Hadoop file system, which was built specifically for processing big data. Using 2.x for large crawls can run into performance problems; with MySQL as the store, performance becomes a nightmare once the page data exceeds tens of billions. The 1.x line has also changed a lot between releases, including major changes to how commands are run, so beginners are advised to download version 1.10, the one used in this tutorial; once you are familiar with Nutch those changes will not matter much.

Nutch is one of the most popular open-source crawlers today and is widely used in industry. Its plug-in mechanism lets developers customize web-crawling strategies flexibly. Nutch has a long history; the famous Hadoop actually grew out of Nutch. It can run both stand-alone and distributed. Nutch only works in Unix-like environments, so it can be used directly on OS X.
Version selection: apache-nutch-1.10-src.zip
Download address: Nutch official website
Start the installation:
unzip apache-nutch-1.10-src.zip
cd apache-nutch-1.10
vi conf/nutch-default.xml
Find the property http.agent.name, copy it into conf/nutch-site.xml, and give it a non-empty custom value; if it is left empty, the crawl command later will report an error. The modified nutch-site.xml (here the value is set to myNutch) is as follows:
```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>myNutch</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
    please set this to a single word uniquely related to your organization.
    NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
    and set their values appropriately.
    </description>
  </property>
</configuration>
```
The http.agent.name property identifies the crawler, so that crawled websites can recognize it.
The properties configured in nutch-site.xml override the defaults in nutch-default.xml. Here we only modify http.agent.name and leave everything else unchanged.
At this point Nutch is configured. Next, compile the source code by switching to the Nutch home directory and running:
ant
The first compilation takes quite a while because many dependency packages have to be downloaded; depending on your network it can take 5-10 minutes when fast and more than 20 minutes when slow.
The following warning will be reported at the start of compilation:
Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
This warning does not affect the compilation result and can be ignored.
Network problems may also occur during compilation. In that case, just clear the previous build output with the following command (already-downloaded dependency packages are not deleted) and build again:
ant clean
In the case of poor network, the above two steps may be repeated multiple times.
When similar information appears, it means that the compilation is successful:
BUILD SUCCESSFUL
Total time: 1 minute 7 seconds
As shown in the figure below:
After Nutch is successfully compiled, a runtime folder will be generated in the home directory. It contains two subfolders deploy and local. deploy is used for distributed crawling, while local is used for local stand-alone crawling. This section first explains the use of local stand-alone crawling, and distributed crawling is placed in the subsequent tutorial.
Enter the local folder and then enter the bin folder. There are two script files, one is nutch and the other is crawl. Among them, nutch contains all the required commands, and crawl is mainly used for one-stop crawl.
As shown in the figure below:
unzip solr-4.10.4.zip
This produces the folder solr-4.10.4. Copy runtime/local/conf/schema-solr4.xml from the Nutch directory into Solr's configuration directory example/solr/collection1/conf:
cp apache-nutch-1.10/runtime/local/conf/schema-solr4.xml solr-4.10.4/example/solr/collection1/conf
Delete the original schema.xml file of solr:
rm -f solr-4.10.4/example/solr/collection1/conf/schema.xml
Then, in schema-solr4.xml, comment out (i.e. wrap in <!-- ... -->) the following line:
<copyField source="latLon" dest="location"/>
Rename schema-solr4.xml to schema.xml:
mv solr-4.10.4/example/solr/collection1/conf/schema-solr4.xml solr-4.10.4/example/solr/collection1/conf/schema.xml
At this point, Solr is configured and enter the solr-4.10.4/example directory:
cd solr-4.10.4/example
Start Solr:
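The Solr 4.x example ships with an embedded Jetty server; it is normally started from the example directory with:

```shell
java -jar start.jar
```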
At this time, you can access port 8983 through the browser and view Solr's control interface:
http://localhost:8983/solr
Enter the Nutch home directory. We execute most commands from the Nutch home directory rather than from its bin directory, because this makes it more convenient to run some of the more complex commands. Have a look at the one-stop crawl command and the nutch command:
bin/crawl
bin/nutch
Entering the above two commands shows their respective usage methods. A few commonly used commands will be explained in detail later, as shown in the figure below:
Check how to use crawl:
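For reference, the synopsis printed by bin/crawl in 1.10 looks roughly like this (treat it as an approximation and check the script's own output):

```shell
Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
```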
-i|index tells Nutch to add the crawled results to the configured indexer.
-D passes a Java property to the Nutch calls; the indexer address can be configured here.
Seed Dir: the seed directory, which holds the seed URLs, i.e. the URLs the crawler starts from.
Crawl Dir: the directory where the crawl data is stored.
Num Rounds: the number of crawl rounds to run.
Example of usage:
Enter Nutch's runtime/local directory and create a new urls folder.
Create a seed file, seed.txt, inside the urls folder to hold the seed URLs.
Add the initial crawl URL to urls/seed.txt: http://www.163.com
Start the Solr service, otherwise the index cannot be built in Solr. (These steps are sketched as commands below.)
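These preparation steps amount to roughly the following (the seed URL is the one listed above; the Solr start command was shown earlier):

```shell
cd apache-nutch-1.10/runtime/local
mkdir urls
echo "http://www.163.com" > urls/seed.txt
```

Then run the one-stop crawl command: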
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
In this command, -i tells Nutch to index the crawled results; solr.server.url=http://localhost:8983/solr/ is the address of the Solr indexer; urls/ is the directory of seed URLs; TestCrawl/ is the folder Nutch uses to store the crawl data (URLs, fetched content, and so on); and the final 2 means the crawl loops for two rounds.
By running the above command you can start crawling web pages. Open http://localhost:8983/solr in the browser, select collection1, and you can search the indexed content by keyword. Note that the crawler has not crawled every page of the given URL; for how to inspect the crawl in detail, see the step-by-step crawling section below.
After the crawl succeeds, the result is shown in the figure below:
Sometimes the one-stop crawl cannot meet our needs well, so here I introduce step-by-step crawling. An actual crawl consists of multiple commands; to simplify operation, the crawl script simply chains these commands together for the user. If you want to study Nutch crawling in depth, using only the crawl command is not enough: you also need to be very familiar with the crawl process itself. Here we reuse the URLs saved in seed.txt earlier, and we also need to delete the contents of the data/crawldb, data/linkdb and data/segments folders, because we are going to re-crawl the data step by step.
After executing the crawl command, a TestCrawl folder will be generated under runtime/local of Nutch, which contains three folders: crawldb, linkdb and segments.
crawldb: It contains all URLs found by Nutch, which contains information about whether the URL was crawled and when it was crawled.
linkdb: It contains all the links corresponding to the URL in crawldb discovered by Nutch, as well as the source URL and anchor text.
segments: It contains multiple segment folders named after time. Each segment is a crawling unit, containing a series of URLs, and each segment contains the following folders:
crawl_generate: the list of URLs to be fetched
crawl_fetch: the fetch status of each URL
content: the raw content fetched from each URL
parse_text: the text parsed out of each URL
parse_data: the outlinks and metadata parsed out of each URL
crawl_parse: the outlink URLs, used to update the crawldb
First, inject the seed URLs into the crawl database (crawldb):
bin/nutch inject data/crawldb urls
In order to crawl the page with the specified URL, we need to generate a crawl list from the database (crawldb):
bin/nutch generate data/crawldb data/segments
After the generate command is executed, a list of pages to be crawled will be generated, and the crawling list will be stored in a newly created segment path. The folder of segment is named according to the time it was created (the folder name of this tutorial is 201507151245).
There are many optional parameters for generate, readers can view them by themselves through the following commands (the same is true for other commands):
bin/nutch generate
Crawl the web page according to the crawl list generated by generate:
bin/nutch fetch data/segments/201507151245
Here 201507151245 is the segment folder name and has to be changed to match your own; alternatively, pass the data/segments folder directly, which applies the operation to every subfolder under segments (the same holds for the commands below).
Parse the fetched content:
bin/nutch parse data/segments/201507151245
Update the database based on the crawled results:
bin/nutch updatedb data/crawldb -dir data/segments/201507151245
The database now contains updated entries for all the initial pages, as well as new entries for the pages newly discovered from the initial set.
Before creating an index, we first invert all links so that we can index the source anchor text of the page.
bin/nutch invertlinks data/linkdb -dir data/segments/201507151245
Start the Solr service, and now we index the crawled resources:
bin/nutch index data/crawldb -linkdb data/linkdb -params solr.server.url=http://localhost:8983/solr -dir data/segments/201507151245
Once the full-text index is built, duplicate URLs must be handled so that each URL is unique:
bin/nutch dedup data/crawldb
This command finds duplicate URLs based on their signatures and marks them STATUS_DB_DUPLICATE; the cleaning and indexing jobs then delete them according to that mark.
bin/nutch clean -D solr.server.url=http://192.168.1.11:8983/solr data/crawldb
This removes documents with HTTP 301 or 404 status, as well as duplicate documents, from Solr.
So far, we have completed all the crawling steps using step-by-step crawling. Under normal crawling, we can search at http://localhost:8983/solr.
readdb is used to read or export Nutch's crawl database, usually to inspect its status information. View the usage of readdb:
```shell
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
  <crawldb>                      directory name where crawldb is located
  -stats [-sort]                 print overall statistics to System.out
    [-sort]                      list status sorted by host
  -dump <out_dir> [-format normal|csv|crawldb]   dump the whole db to a text file in <out_dir>
    [-format csv]                dump in Csv format
    [-format normal]             dump in standard format (default option)
    [-format crawldb]            dump as CrawlDB
    [-regex <expr>]              filter records with expression
    [-retry <num>]               minimum retry count
    [-status <status>]           filter records by CrawlDatum status
  -url <url>                     print information on <url> to System.out
  -topN <nnnn> <out_dir> [<min>] dump top <nnnn> urls sorted by score to <out_dir>
    [<min>]                      skip records with scores below this value.
                                 This can significantly improve performance.
```
Here crawldb is the database that stores the URL information; -stats prints statistics, -dump exports the database, and -url shows information about a specific URL. View the database status:
The statistical results obtained are as follows:
```shell
MacBook-Pro:local root# bin/nutch readdb TestCrawl/crawldb -stats
CrawlDb statistics start: TestCrawl/crawldb
Statistics for CrawlDb: TestCrawl/crawldb
TOTAL urls: 290
retry 0: 290
min score: 0.0
avg score: 0.017355172
max score: 1.929
status 1 (db_unfetched): 270
status 2 (db_fetched): 17
status 3 (db_gone): 2
status 4 (db_redir_temp): 1
CrawlDb statistics: done
```
Here TOTAL urls is the total number of URLs, retry the number of retries, min score the lowest score, max score the highest score, status 1 (db_unfetched) the number of URLs not yet fetched, and status 2 (db_fetched) the number already fetched.
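Likewise, the whole crawl database can be dumped to text for inspection (the output directory name crawldb_dump here is arbitrary):

```shell
bin/nutch readdb TestCrawl/crawldb -dump crawldb_dump
cat crawldb_dump/*
```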
readlinkdb is used to export all URLs and anchor text, view usage:
```shell
Usage: LinkDbReader <linkdb> (-dump <out_dir> [-regex <regex>]) | -url <url>
  -dump <out_dir>   dump whole link db to a text file in <out_dir>
  -regex <regex>    restrict to url's matching expression
  -url <url>        print information about <url> to System.out
```
The dump and url parameters here are the same as the readdb command, exporting data:
bin/nutch readlinkdb data/linkdb -dump linkdb_dump
This exports the data into the linkdb_dump folder. View the exported data:
cat linkdb_dump/*
You can see that the exported information is similar to the following format:
fromUrl: http://www.sanesee.com/article/step-by-step-nutch-introduction anchor: http://archive.apache.org/dist/nutch/
That is, the source URL is recorded.
readseg is used to view or export the data in segment and view usage:
```shell
Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]
* General options:
  -nocontent       ignore content directory
  -nofetch         ignore crawl_fetch directory
  -nogenerate      ignore crawl_generate directory
  -noparse         ignore crawl_parse directory
  -noparsedata     ignore parse_data directory
  -noparsetext     ignore parse_text directory
* SegmentReader -dump <segment_dir> <output> [general options]
  Dumps content of a <segment_dir> as a text file to <output>.
  <segment_dir>    name of the segment directory.
  <output>         name of the (non-existent) output directory.
* SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
  List a synopsis of segments in specified directories, or all segments in
  a directory <segments>, and print it on System.out
  <segment_dir1> ...  list of segment directories to process
  -dir <segments>     directory that contains multiple segments
* SegmentReader -get <segment_dir> <keyValue> [general options]
  Get a specified record from a segment, and print it on System.out.
  <segment_dir>    name of the segment directory.
  <keyValue>       value of the key (url).
  Note: put double-quotes around strings with spaces.
```
Export segment data:
bin/nutch readseg -dump data/segments/20150715124521 segment_dump
This exports the data into the segment_dump folder. View the exported data:
cat segment_dump/*
You can see that it contains very specific web page information.
You can use WAMP/MAMP, or PHPStorm and its built-in server.
For specific operations, please refer to: Portal
Open a terminal, switch to your project path, and install the dependencies:
composer require silex/silex twig/twig thiagoalessio/tesseract_ocr:dev-master
Because we use Silex, a PHP micro-framework, we need to lay the PHP project out in an MVC-style structure (public, uploads, views), as shown in the figure:
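An illustrative layout: public, uploads, and views are the directories named above, while placing index.php under public/ as the front controller is an assumption.

```
project/
├── composer.json
├── vendor/      # Composer dependencies (silex, twig, tesseract_ocr)
├── public/      # web root; index.php front controller (assumed)
├── uploads/     # uploaded images
└── views/       # Twig templates, e.g. index.twig
```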
```php
<?php
// Under an integrated environment such as WAMP, the PATH environment variable has to
// be re-exported, otherwise Tesseract cannot be invoked.
$path = getenv('PATH');
putenv("PATH=$path:/usr/local/bin");

require __DIR__ . '/../vendor/autoload.php';

use Symfony\Component\HttpFoundation\Request;

$app = new Silex\Application();
$app->register(new Silex\Provider\TwigServiceProvider(), [
    'twig.path' => __DIR__ . '/../views',
]);
$app['debug'] = true;

$app->get('/', function () use ($app) {
    return $app['twig']->render('index.twig');
});

$app->post('/', function (Request $request) use ($app) {
    // TODO: handle the uploaded image (see below)
});

$app->run();
```
Inside the POST handler, the uploaded file is handled as follows:
```php
// Grab the uploaded file
$file = $request->files->get('upload');
// Extract some information about the uploaded file
$info = new SplFileInfo($file->getClientOriginalName());
// Generate a time-based file name to reduce file-name collisions
$filename = sprintf('%d.%s', time(), $info->getExtension());
// Copy the file into the uploads directory
$file->move(__DIR__ . '/../uploads', $filename);
```
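After the file has been saved, the OCR step itself could look roughly like the following sketch (not part of the original code). The composer package thiagoalessio/tesseract_ocr wraps the same tesseract binary, but its API differs between versions, so this sketch shells out to the CLI directly; the paths and the -l/-psm values are assumptions for illustration.

```php
// Hypothetical continuation inside the POST handler: run the tesseract CLI on the
// uploaded image and read back the recognized text (paths/options are illustrative).
$image   = __DIR__ . '/../uploads/' . $filename;
$outBase = __DIR__ . '/../uploads/' . pathinfo($filename, PATHINFO_FILENAME);
shell_exec(sprintf(
    'tesseract %s %s -l chi_sim -psm 3',
    escapeshellarg($image),
    escapeshellarg($outBase)
));
$text = file_get_contents($outBase . '.txt'); // tesseract appends ".txt" to the output base
```

The recognized text can then be rendered on the confirmation page or handed to the search-engine module.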
**Interaction process:**
1) The user enters the site URL, lands on the home page, uses the service, and learns about its details.
2) The user uploads the image to be searched through the search box and previews it before uploading.
3) After confirming that the uploaded picture is correct, the user clicks the image-search button to upload and recognize the picture. Because this step is computationally heavy on the server and takes 2-5 seconds to return a result, the user is shown a loading page.
4) When recognition finishes, the loading page disappears and the recognition-result preview/confirmation page is shown.
5) After confirming the recognized content, the user clicks Search to enter the search-engine module and get the search results.
Visual design plays an extremely important role in a product: it directly shapes the user's first impression, the experience during use, and the impression that remains afterwards; in many cases the success or failure of a product hinges on it. For this project I did not treat the system merely as something that gets a job done, but as a product I was carefully crafting, so I paid close attention to its front-end visual design and user experience.

For the site-wide color scheme I chose red and blue, the theme colors of the Baidu search engine that the public is already familiar with. The home-page background uses a red-to-blue gradient with adjusted transparency, drawn entirely in CSS, which saves loading time and gives a strong visual impact. The home-page copy has a subtle drop shadow and uses a thin Microsoft sans-serif typeface to add a sense of depth. The image search box and preview box also carry shadows, and different colors and saturations distinguish the importance of the preview fields, so the page feels clean and users can find what they need at a glance.

The flow then passes through a simple loading page, where a waiting circle zooms in and out so that the wait does not feel irritating, while signalling that the backend is computing. On the recognition-result page, text colors and font sizes are again weighted by the importance of the copy, so users do not have to spend time filtering out the important information, and the layout and colors of the two buttons are chosen to invite the next click.

The final search-results page is designed like the chapters of a book: each entry adjusts layout, font size, and color according to the result's page title, summary, indexing time, and weight, to increase visual impact and recognisability. The aim is to feel familiar, sharing some traits with mainstream search engines, while still showing some personality, and to stay clean, ad-free, and free of distracting information. All of the visual design follows responsive-design principles, so it looks and works well on both PC and mobile.
**Template writing using Twig:**
The front-end experience is shown in the figure below:
It is mainly based on Bootstrap 3.4 and can be packaged with XDK or PhoneGap and compiled into a corresponding native app for publication to the application markets. The mobile experience is shown in the figure below:
Thanks to all the teachers and classmates who helped me during my four years of college and taught me professional knowledge. Through these four years of study and research, not only have my knowledge and research ability reached a new level, but I have also stepped into society and gained solid internship experience, letting me experience, as an undergraduate, the kind of work at Internet companies that many graduate students never get to. In the blink of an eye, four years of college are coming to an end. While finishing this graduation thesis with a nervous heart, I have also grown from an ignorant child into a mature young adult. One thing never changes: only sweat does not deceive you. Finally, I thank the University of Electronic Science and Technology, everyone I met at university, and myself for four years of hard work.
https://github.com/daijiale/OCR_FontsSearchEngine.
http://v.youku.com/v_show/id_XMTYzNDY2NDYxNg==.html.