Do you want to use ASP to write your own "thief" (collection) program? After reading this article, you will be able to build one yourself.
Principle
The collection program actually fetches web pages from other websites through the XMLHTTP component of Microsoft's XML library (MSXML). For example, many news-gathering programs fetch Sina's news pages, replace some of the HTML in them, and filter out the advertisements. The advantages of a collection program are: the content needs no maintenance, because the data comes from other websites and is updated whenever those websites are updated; and it saves server resources, since a collection program generally consists of only a few files and all of the page content comes from other websites. The disadvantages are: instability, because if the target website fails, the program fails too, and if the target website is redesigned, the collection program has to be modified accordingly; and speed, because a remote call is bound to be slower than reading data from the local server.
1. Example
The following briefly shows how XMLHTTP is used in ASP.
<%
'Commonly used functions
'1. Pass in the address of the target web page as url; the return value of getHTTPPage is the HTML code of that page.
Function getHTTPPage(url)
    Dim Http
    Set Http = Server.CreateObject("MSXML2.XMLHTTP")
    Http.open "GET", url, False
    Http.send()
    If Http.readyState <> 4 Then
        Exit Function
    End If
    getHTTPPage = BytesToBstr(Http.responseBody)
    Set Http = Nothing
    If Err.Number <> 0 Then Err.Clear
End Function
'2. Fix garbled text. If you use XMLHTTP directly to fetch a web page that contains Chinese characters, what you get is garbled; the adodb.stream component can convert it.
Function BytesToBstr(body)
    Dim objstream
    Set objstream = Server.CreateObject("adodb.stream")
    objstream.Type = 1 'Binary
    objstream.Mode = 3 'Read/write
    objstream.Open
    objstream.Write body
    objstream.Position = 0
    objstream.Type = 2 'Text
    objstream.Charset = "GB2312" 'Decode the bytes as GB2312 (use the encoding the target page is actually served in). Otherwise a page with Chinese characters fetched through XMLHTTP comes out garbled.
    BytesToBstr = objstream.ReadText
    objstream.Close
    Set objstream = Nothing
End Function
'Try to fetch the HTML content of http://www.vevb.com
Dim Url, Html
Url = "http://www.vevb.com"
Html = getHTTPPage(Url)
Response.Write Html
%>
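Since the target site can fail or respond slowly, it may help to check the HTTP status and set timeouts before using the response. The following is only a sketch under assumptions: FetchPage and its charset parameter are hypothetical names, and it assumes MSXML 6 (MSXML2.ServerXMLHTTP.6.0) is installed on the server.

```vbscript
' FetchPage is a hypothetical variant of the fetch function; it returns "" on any failure.
Function FetchPage(url, charset)
    Dim http, stream, status
    FetchPage = ""
    status = 0

    On Error Resume Next                       ' a network error should not stop the page
    Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
    http.setTimeouts 5000, 5000, 10000, 30000  ' resolve, connect, send, receive (milliseconds)
    http.open "GET", url, False
    http.send
    If Err.Number = 0 Then status = http.status
    Err.Clear
    On Error GoTo 0

    If status = 200 Then
        If LenB(http.responseBody) > 0 Then
            Set stream = CreateObject("ADODB.Stream")
            stream.Type = 1                    ' binary
            stream.Mode = 3                    ' read/write
            stream.Open
            stream.Write http.responseBody
            stream.Position = 0
            stream.Type = 2                    ' text
            stream.Charset = charset           ' encoding used to decode the bytes
            FetchPage = stream.ReadText
            stream.Close
            Set stream = Nothing
        End If
    End If
    Set http = Nothing
End Function
```

In an ASP page this could be called as Response.Write FetchPage("http://www.vevb.com", "GB2312"); an empty result then means the page could not be fetched.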
2. Several commonly used functions
(1) InStr function
Description
Returns the position at which one string (string2) first occurs within another string (string1).
Syntax
InStr(string1, string2)
For example:
Dim SearchString, SearchChar, MyBK
SearchString = "http://www.vevb.com" 'The string to search in.
SearchChar = "www" 'Search for www.
MyBK = InStr(SearchString, SearchChar) 'Returns 8
'Returns 0 if not found, for example:
SearchChar = "BK"
MyBK = InStr(SearchString, SearchChar) 'Returns 0
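InStr also accepts an optional start position as its first argument and an optional comparison mode as its last. A minimal sketch of how these find a second occurrence or ignore case (the sample markup is made up for illustration):

```vbscript
Dim s, firstPos, secondPos, anyCase
s = "<tr><td></td></tr><tr><td>news</td></tr>"   ' made-up sample markup

firstPos = InStr(s, "</td></tr>")                  ' 9: the first occurrence
secondPos = InStr(firstPos + 1, s, "</td></tr>")   ' 31: the search resumes after the first hit

' vbTextCompare ignores case; the start argument is required when a compare mode is given.
anyCase = InStr(1, "http://www.vevb.com", "VEVB", vbTextCompare)   ' 12
```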
(2) Mid function
Description
Returns a specified number of characters from a string.
Syntax
Mid(string, start, length)
For example:
Dim MyBK
MyBK = Mid("our BK (www.google) design", 9, 10) 'Take 10 characters of "our BK (www.google) design", starting at the 9th character
'The value of MyBK is now www.google
(3) Replace function
Description
Returns a string in which every occurrence of one substring (find) has been replaced with another (replacewith).
Syntax
Replace(string, find, replacewith)
For example:
Dim SearchString
SearchString = "Our BK design is a website building resource website" 'The string to search in.
SearchString = Replace(SearchString, "BK design", "Www.google")
'The value of SearchString is now: Our Www.google is a website building resource website
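By default Replace is case-sensitive, which matters for fetched pages where the same text or tag may appear in different cases. The optional start, count and compare arguments control this. A small sketch with a made-up sample string (WScript.Echo assumes the script is run with cscript; in an ASP page Response.Write would be used instead):

```vbscript
Dim s
s = "Our BK design, our bk design"   ' made-up sample text

' Default (binary) comparison: only the exact-case phrase is replaced.
WScript.Echo Replace(s, "BK design", "Www.google")
' prints: Our Www.google, our bk design

' start = 1, count = -1 (no limit), vbTextCompare: case is ignored.
WScript.Echo Replace(s, "BK design", "Www.google", 1, -1, vbTextCompare)
' prints: Our Www.google, our Www.google
```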
3. Intercept the HTML code of the specified area
For example, I only want to get the text part between <td> and </td> in the following HTML code:
<html>
<title>(www.google)Google search engine</title>
<body>
<table>
<tr><td></td></tr>
<tr><td id="Content">BK (www.google) Google search engine is a site with many resources...</td></tr>
</table>
</body>
</html>
<%
…
Dim StrBK,start,over,RsBK
StrBK = getHTTPPage("the address of the web page")
'Get the position where the string we want begins.
'You may ask: the original HTML is <td id="Content">, so why is it written as <td id=""Content""> here?
'Answer: inside a string in ASP (VBScript, to be precise), two double quotes stand for one double quote, because the double quote itself delimits the string.
start = InStr(StrBK, "<td id=""Content"">")
'Get the position where the string ends.
'You may ask again: why are there three extra dots in front of the HTML being searched for?
'Answer: the row above also contains </td></tr>. If we located the end with </td></tr> alone, the program would mistake the </td></tr> of the row above for the end of the string we want.
over = InStr(StrBK, "...</td></tr>")
'Extract the part of StrBK between start and over. Mid was covered in the previous section; over-start is the distance between the start and end positions, i.e. the number of characters to take.
RsBK = Mid(StrBK, start, over-start)
Response.Write(RsBK) 'Finally, output what the program obtained
%>
Don't celebrate too soon. When you run it, you will find that the HTML of your page is broken. Why? Because the HTML you obtained is:
<td id="Content">BK (www.google) Google search engine is a site with many resources
See that? It is incomplete HTML: an opening <td> tag with no closing tag. What to do? The statement start=InStr(StrBK,"<td id=""Content"">") returns the position of the first character of <td id="Content"> within StrBK. Since <td id="Content"> is 17 characters long, adding 17 to that statement makes start point to the first character after <td id="Content">.
Okay, the program will change to this:
<%
…
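Dim StrBK,start,over,RsBK
StrBK = getHTTPPage("the address of the web page")
start = InStr(StrBK, "<td id=""Content"">") + 17 '17 is the length of <td id="Content">
over = InStr(StrBK, "...</td></tr>") 'over points at the first of the three dots, so the dots are left out of the result; add 3 here (+3) if you want to keep them
RsBK = Mid(StrBK, start, over-start)
Response.Write(RsBK)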
%>
Now it works: we can steal what we want and display it on our own page, haha~
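The same extraction can be wrapped in a reusable function. This is a sketch, not the original program: ExtractBetween is a hypothetical helper, it uses Len instead of the hard-coded 17, and it searches for the end marker only after the start marker, so the three-dots trick is not needed. WScript.Echo assumes the script is run with cscript; in an ASP page Response.Write would be used instead.

```vbscript
' Returns the text between startMarker and endMarker, or "" if either marker is missing.
Function ExtractBetween(text, startMarker, endMarker)
    Dim startPos, endPos
    ExtractBetween = ""
    startPos = InStr(text, startMarker)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(startMarker)       ' skip past the start marker itself
    endPos = InStr(startPos, text, endMarker)    ' look for the end only after the start
    If endPos = 0 Then Exit Function
    ExtractBetween = Mid(text, startPos, endPos - startPos)
End Function

Dim sample
sample = "<tr><td></td></tr><tr><td id=""Content"">BK (www.google) Google search engine...</td></tr>"
WScript.Echo ExtractBetween(sample, "<td id=""Content"">", "</td></tr>")
' prints: BK (www.google) Google search engine...
```

Returning an empty string when a marker is missing avoids the runtime error that Mid raises when start is 0, which is what happens if the target page changes its markup.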
4. Delete or modify the obtained characters
Replace BK (www.google) in RsBK with BK:
RsBK = Replace(RsBK, "BK (www.google)", "BK")
Or delete (www.google), together with the space in front of it, directly:
RsBK = Replace(RsBK, " (www.google)", "")
Okay, now RsBK becomes: BK Google search engine is a site with many resources
But in fact, the Replace function is not suitable for every situation. For example, suppose we want to remove all the links in a string. Links come in many forms, and Replace can only replace one specific string at a time; we can hardly write a separate Replace call for every possible link.
Regular expressions can do this job instead. I won't go into details here.
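As a rough illustration of the regular-expression approach, the sketch below uses VBScript's built-in RegExp object to remove every opening and closing <a> tag while keeping the link text. StripLinks is a hypothetical name and the pattern is a simple one that assumes reasonably well-formed tags. WScript.Echo assumes the script is run with cscript; in an ASP page Response.Write would be used instead.

```vbscript
' Removes <a ...> and </a> tags from html but keeps the text between them.
Function StripLinks(html)
    Dim re
    Set re = New RegExp
    re.Pattern = "<a\b[^>]*>|</a\s*>"   ' an opening <a ...> tag or a closing </a> tag
    re.IgnoreCase = True                ' also match <A HREF=...>
    re.Global = True                    ' replace every match, not just the first
    StripLinks = re.Replace(html, "")
End Function

WScript.Echo StripLinks("See <a href=""1.htm"">page one</a> and <A HREF=2.htm>page two</A>.")
' prints: See page one and page two.
```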
(1) How do we turn the other site's paging into our own?
The answer: use the Replace function and pass the page as a parameter.
For example, the other site's page contains paging code like this: <a href=2.htm>Next page</a>. We can first obtain this string with the method described above, and then apply Replace: RsBK = Replace(RsBK, "<a href=", "<a href=page.asp?Url=")
Then, in page.asp, read the value of the Url parameter, and finally use the collection technique again to fetch the content you want from the next page.
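Because page.asp fetches whatever the Url parameter names, it is worth checking the value before using it. The sketch below is an assumption-laden illustration: IsSafePageName and BASE_URL are hypothetical names, and the commented usage assumes the getHTTPPage function from section 1 is available in the same file.

```vbscript
' Returns True only for plain file names such as "2.htm" or "list_3.html".
Function IsSafePageName(name)
    Dim re
    Set re = New RegExp
    re.Pattern = "^[A-Za-z0-9_\-]+\.html?$"
    re.IgnoreCase = True
    IsSafePageName = re.Test(name)
End Function

' Possible usage inside page.asp (BASE_URL is a placeholder for the target site's folder):
'   Const BASE_URL = "http://www.example.com/news/"
'   Dim pageName
'   pageName = Request.QueryString("Url")
'   If IsSafePageName(pageName) Then
'       Response.Write getHTTPPage(BASE_URL & pageName)
'   End If
```

Accepting only simple file names keeps visitors from making the server fetch arbitrary addresses through page.asp.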
(2) How to store the obtained content into the database
Due to limited space, I will briefly mention it here.
It's actually very simple:
Process the stolen content first, so that single quotes in it do not break the SQL statement or open it to SQL injection when writing to the database, for example: content = Replace(content, "'", "''")
Then execute a SQL INSERT command to write it to the database, and that's it~
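A minimal sketch of the escaping step follows. SqlQuote is a hypothetical helper, and Articles and Content are made-up table and column names; a parameterized ADODB.Command is generally the more robust choice, but the sketch stays with the string approach described above.

```vbscript
' Doubles every single quote and wraps the value in single quotes for use in a SQL string.
Function SqlQuote(value)
    SqlQuote = "'" & Replace(value, "'", "''") & "'"
End Function

Dim sql
sql = "INSERT INTO Articles (Content) VALUES (" & SqlQuote("BK's news") & ")"
' sql now holds: INSERT INTO Articles (Content) VALUES ('BK''s news')
' conn.Execute sql   ' conn would be an already opened ADODB.Connection
```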
The above are just some basic applications of the XMLHTTP component. It can do much more, for example saving remote images to the local server or, together with the adodb.stream component, saving the fetched data into a database. Collection has a wide range of uses.