Many people like the idea of having their own search engine, but how do you actually build one? This article shows how to use a popular data collection technique to implement your own search engine on top of Baidu's results.
1. Understand Baidu Search
Baidu Search, the world's largest Chinese-language search engine, was listed on the Nasdaq in the United States on August 5, 2005. It is currently the most widely used search engine in China and offers searches for web pages, news, pictures, music, maps, and more.
1. Query parameters for Baidu web search
Required parameters
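☆wd--The keyword to search for (Keyword)
☆pn--Which page of results to show (PageNumber); in practice the value is the offset of the first result, so with rn=10 the second page is pn=10
☆cl--Search type (Class); cl=3 is web search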
Optional parameters
☆rn--The number of search results per page (RecordNumber); the value ranges from 10 to 100, and the default is rn=10
☆ie--The encoding of the query text (InputEncoding); the default is ie=gb2312, which is Simplified Chinese
☆tn--The source site that submits the search request
Several useful tn values:
tn=baidulocal runs the search as Baidu's site search. The returned results are very clean, with no ads. For example, search for "happiness" with this parameter and see how uncluttered the results are.
tn=baiducnnic is for embedding Baidu inside a frame. Baidu customized this value for CNNIC.
☆si--Restricts the search to one domain name. For example, to search only Sina's website, use si=sina.com.cn. This parameter only takes effect when it is used together with the ct parameter.
☆ct--The value of this parameter is generally a string of digits, which is probably a verification code for the search request.
Use the si and ct parameters together. For example, to search for "ideal" within sina.com.cn, use: http://www.baidu.com/baidu?ie=utf-8&am...n&cl=3&word=ideal
☆bs--The keyword of the previous search (BeforeSearch), which is probably used for related searches.
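The parameters above can be combined into a request URL. Here is a minimal sketch in Python (not the article's ASP); the function name build_search_url and its page and per_page arguments are hypothetical, and it assumes pn is the offset of the first result rather than a page count.

```python
from urllib.parse import urlencode

BAIDU_SEARCH = "http://www.baidu.com/s"


def build_search_url(keyword, page=1, per_page=10, encoding="gb2312"):
    """Return a Baidu web-search URL for a keyword and a 1-based page number."""
    params = {
        "ie": encoding,               # encoding of the query text
        "wd": keyword,                # keyword to search for
        "pn": (page - 1) * per_page,  # assumption: offset of the first result
        "rn": per_page,               # results per page (10-100)
        "cl": 3,                      # 3 = web search
    }
    # urlencode percent-encodes non-ASCII text using the given encoding
    return BAIDU_SEARCH + "?" + urlencode(params, encoding=encoding)


print(build_search_url("ideal", page=2))
```

Passing the same encoding to both ie and urlencode keeps the declared encoding and the actual bytes of the keyword consistent.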
2. Baidu search results page structure
Reading the page source from top to bottom, the structure is:
Search box
Fixed rankings in the "hot zone" (promoted listings) on the right
Search results
Pagination area
Related Searches
Bottom Search Box
Copyright area
The search results and the pagination area are the data we actually need. By reading the page source you can find a unique marker string before and after each area, and then use those markers to cut out the content. The details are in the code later in this article.
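The idea of cutting content out between two marker strings can be sketched in a few lines of Python (not the article's ASP). The function name str_cut and the markers in the sample page are hypothetical; real markers have to be read from the live page source.

```python
def str_cut(content, start, end, include_markers=False):
    """Return the text between the first `start` and the next `end`.

    Returns None when either marker is missing, so the caller can tell
    that the page layout no longer matches.
    """
    s = content.find(start)
    if s == -1:
        return None
    e = content.find(end, s + len(start))
    if e == -1:
        return None
    if include_markers:
        return content[s:e + len(end)]
    return content[s + len(start):e]


# Hypothetical page with made-up markers around the results area
page = "<div>search box</div><!--begin--><p>result 1</p><!--end--><div>footer</div>"
print(str_cut(page, "<!--begin-->", "<!--end-->"))
```

Returning None instead of an error message keeps the "marker not found" case separate from real page content.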
2. Core function--using the xmlhttp component in asp
The core of a data collection program (commonly known as a "thief" program) is the xmlhttp component. Collecting data with xmlhttp is a somewhat old-fashioned technique, and there is plenty of information about it online. The collection code generally looks like this:
Set Http = Server.CreateObject("MSXML2.XMLHTTP")
Http.open "GET", url, False   ' Open xmlhttp (synchronous request)
Http.send()   ' Send the request
If Http.readyState <> 4 Then
    Exit Function
End If
getHTTPPage = BytesToBstr(Http.responseBody, "GB2312")   ' The result is usually a byte stream; convert it into a string
Set Http = Nothing   ' Release xmlhttp
See the complete code below for how this is used in practice.
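The same fetch-then-decode step can be sketched in Python (not the article's ASP) with only the standard library. The names decode_body and get_http_page and the timeout value are assumptions for illustration; the GB2312 default follows the article.

```python
from urllib.request import urlopen


def decode_body(body, charset="gb2312"):
    """Turn raw response bytes into text, replacing undecodable bytes."""
    return body.decode(charset, errors="replace")


def get_http_page(url, charset="gb2312", timeout=10):
    """Fetch `url` and return its body as text, or "" on a network error."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return decode_body(resp.read(), charset)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return ""


# The decoding step on its own, without touching the network
sample = "百度".encode("gb2312")
print(decode_body(sample))
```

Keeping the decoding in its own function mirrors the split between getHTTPPage and BytesToBstr and makes the charset handling easy to check offline.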
3. Complete code (file name: searchi_bd.asp)
<%
Option Explicit
Dim wd, pn
wd = Request("wd")
pn = Request.QueryString("pn")
' Start error handling
On Error Resume Next
If Err.Number <> 0 Then
    Response.Clear
    ' Show an error message to the user
    Response.Write "<p align='center'><font size=3>An error occurred. Please open Baidu search again.</font></p>"
End If
%>
<HTML>
<HEAD>
<TITLE>Baidu search--<%=Server.HTMLEncode(wd)%></TITLE>
<STYLE type="text/css">
<!--
body,td{font-family:arial}
TD{FONT-SIZE:9pt;LINE-HEIGHT:18px}
.cred{color:#FF0000}
-->
</STYLE>
</HEAD>
<BODY leftmargin="0" topmargin="3" marginwidth="0" marginheight="0">
<table align="center" width="98%" cellspacing="0" cellpadding="0" border="0" bgcolor="#ffffff">
<tr>
<form name="f1" method="post" action="searchi_bd.asp">
<td width="150" height="50">
Your logo
</td>
<td align="left">
<input name="wd" size="40" maxlength="100" title="Enter keywords, then search..." value="<%=Server.HTMLEncode(wd)%>">
<input type="submit" value="Baidu search">
</td></form></tr>
</table>
<%
Dim strUrl, strTmp_bd, strInfo, strPage, strPageSum_bd, strQtime_bd
Dim bNoResult_bd, regEx, patrn
'Baidu query string
strUrl=http://www.baidu.com/s?ie=gb2312&wd=&wd&am...&pn&&cl=3
' Start the collection
strTmp_bd = GetHTTPPage(strUrl)
' The marker strings below must match the wording of Baidu's page exactly
If InStr(strTmp_bd, "not found and your query") <> 0 Then
    bNoResult_bd = 1
End If
' Cut out the content of the search results section
strInfo = strCut(strTmp_bd, "<DIV id=ScriptDiv></DIV>", "<br clear=all>", 2)
patrn = "</td></tr></table><br>"
Set regEx = New RegExp   ' Create a regular expression
regEx.Pattern = patrn    ' Set the pattern
regEx.IgnoreCase = True
regEx.Global = False
strInfo = regEx.Replace(strInfo, "")
' Cut out the content of the paging area
strPage = strCut(strTmp_bd, "<br clear=all>", "<br>", 2)
strPage = Replace(strPage, "href=s?", "href=searchi_bd.asp?")
' The number of results and the query time
strPageSum_bd = strCut(strTmp_bd, "find the relevant web page", "article", 2)
If Not IsNumeric(strPageSum_bd) Then
    strPageSum_bd = strCut(strTmp_bd, "find the relevant web page", "article", 2)
End If
strQtime_bd = strCut(strTmp_bd, "time", "seconds", 2)
strTmp_bd = ""   ' Release the page source (it is a string, so Set ... = Nothing does not apply)
%>
<!--T1-Start-->
<table cellspacing="0" cellpadding="0" border="0" width="98%" align="center">
<tr valign="center" align="middle" height="18">
<td width="1" bgcolor="#999999"></td>
<td nowrap style="FONT-WEIGHT:bold;COLOR:#ffffff;BACKGROUND-COLOR:#0033cc" width="64">Internet</td>
<td align="right" bgcolor="#eeeeee"><nobr>Found <b><%=strPageSum_bd%></b> web pages matching <b><%=Server.HTMLEncode(wd)%></b> in <b><%=strQtime_bd%></b> seconds</nobr></td>
</tr>
<tr><td bgcolor="#999999" colspan="3" height="2"></td></tr>
</table>
<%
If wd = "" Then
    Response.Write "<p align='center'><font size=-1>Hello, please enter keywords in the search box.</font></p>"
ElseIf bNoResult_bd = 1 Then
    Response.Write "<p align='center'><font size=-1>Sorry, nothing matching your query was found. Please try different keywords.</font></p>"
Else
%>
<table width="98%" align="center" cellspacing="0" cellpadding="0" border="0">
<tr>
<td style="line-height:160%" bgcolor="#ffffff" width="75%" valign="top"><br>
<%=strInfo%>
</td>
<td width="25%" valign="top"><br>This space is yours to use!
</td>
</tr>
</table>
<table width="98%" align="center" cellspacing="0" cellpadding="4" border="0">
<tr>
<td align="center">
<br><font size="3"><%=strPage%></font>
</td>
</tr>
</table>
<%
End If
strInfo = ""   ' Release the results HTML
%>
<hr size="1" width="760" color="#0000ff">
<div align="center"><font size="-1">
Please go to <span class="cred">(Knowledge Sharing Forum)</span> to view</font>
</div>
</BODY>
</HTML>
<%
' Collection functions
Function getHTTPPage(url)
    On Error Resume Next
    Dim Http
    Set Http = Server.CreateObject("MSXML2.XMLHTTP")
    Http.open "GET", url, False
    Http.send()
    If Http.readyState <> 4 Then
        Exit Function
    End If
    getHTTPPage = BytesToBstr(Http.responseBody, "GB2312")
    Set Http = Nothing
    If Err.Number <> 0 Then
        Response.Write "<div align='center'><b>The server encountered an error while fetching the page content</b></div>"
        Err.Clear
    End If
End Function
' Convert a byte stream into a string
Function BytesToBstr(body, Cset)
    Dim objstream
    Set objstream = Server.CreateObject("adodb.stream")
    objstream.Type = 1       ' 1 = binary
    objstream.Mode = 3       ' 3 = read/write
    objstream.Open
    objstream.Write body
    objstream.Position = 0
    objstream.Type = 2       ' 2 = text
    objstream.Charset = Cset
    BytesToBstr = objstream.ReadText
    objstream.Close
    Set objstream = Nothing
End Function
' Cut out a substring. CutType 1: include the start and end strings; CutType 2: exclude them
Function strCut(strContent, StartStr, EndStr, CutType)
    Dim S1, S2
    On Error Resume Next
    Select Case CutType
    Case 1
        S1 = InStr(strContent, StartStr)
        S2 = InStr(S1, strContent, EndStr) + Len(EndStr)
    Case 2
        S1 = InStr(strContent, StartStr) + Len(StartStr)
        S2 = InStr(S1, strContent, EndStr)
    End Select
    If Err Then
        strCut = "<p align='center'><font size=-1>An error occurred while cutting the string.</font></p>"
        Err.Clear
        Exit Function
    Else
        strCut = Mid(strContent, S1, S2 - S1)
    End If
End Function
%>
Copy the code above into Notepad and save it as searchi_bd.asp, and it is ready to use. If you want a different file name, also change searchi_bd.asp to your file name in the form's action attribute and in the following line:
strPage = Replace(strPage, "href=s?", "href=searchi_bd.asp?")
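The link rewriting done by that line can be sketched in Python (not the article's ASP). The function name rewrite_paging_links and the sample paging HTML are hypothetical; the point is only that Baidu's relative s? links are redirected to your own script so that paging stays on your site.

```python
def rewrite_paging_links(paging_html, script_name="searchi_bd.asp"):
    """Point Baidu's relative paging links (href=s?...) at our own script."""
    return paging_html.replace("href=s?", "href=" + script_name + "?")


# Hypothetical fragment of a paging area
paging = "<a href=s?wd=ideal&pn=10>2</a> <a href=s?wd=ideal&pn=20>3</a>"
print(rewrite_paging_links(paging))
print(rewrite_paging_links(paging, script_name="mysearch.asp"))
```

Taking the script name as a parameter means a renamed file only needs to be changed in one place.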
A few explanations:
1. Baidu search has essentially no anti-collection measures. The main thing to watch is that Baidu changes the source code of its results page from time to time, so check the results page regularly; if the code changes, update the marker strings to match. When it comes to anti-collection, Baidu is much more generous than Google. So far, Baidu has not been seen to temporarily block the IP of a source site for querying it too often, which happens frequently with Google queries. How to deal with that is discussed in the next article.
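One way to notice such a layout change early is to check that every marker string is still present in the fetched page before cutting. Below is a minimal Python sketch (not the article's ASP); the function name missing_markers and the sample page are hypothetical, while the two marker strings are the ones used in the listing above.

```python
MARKERS = ["<DIV id=ScriptDiv></DIV>", "<br clear=all>"]


def missing_markers(page, markers=MARKERS):
    """Return the marker strings that no longer appear in the page source."""
    return [m for m in markers if m not in page]


# Hypothetical page source in which one marker has disappeared
changed_page = "<html><DIV id=ScriptDiv></DIV><p>results</p></html>"
print(missing_markers(changed_page))
```

An empty list means the markers still match; anything else is a signal to inspect the page source and update the marker strings.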
2. Collection is fairly resource-intensive, and a search thief is no different from any other collection program, so release variables and objects as early as possible in the program. If your hosting resources are limited, it is better not to run this kind of program.
3. Some people may not want to keep any of Baidu's own functional links in their search thief, such as Baidu snapshots and the on-site search function. For them, the download package includes a simplified version without any links to Baidu, which you can use as needed. Its code is not listed in this article because it is very similar to the full version.
That is all for this article. I hope it is helpful for your learning.