Crawler project practice
Notes
Author's personal blog
Blog of fried peppers with hot pot
All projects are the author's own practice and sharing projects. If anything here infringes your rights, please contact the author to have it removed. The content is for learning and sharing only and must not be used for any commercial purpose.
Because the projects were written at different times, some of them may no longer work.
See note.txt for practice notes.
This project is updated continuously.
For walkthroughs of some of the projects, see Bilibili: https://space.bilibili.com/35242527/channel/collectiondetail?sid=1590251
Below is the author's personal rating of website crawling difficulty
| Grade | Mark | Difficulty description |
|---|---|---|
| Spider egg | 0 | Getting started |
| Young spider | 00 | Crossed the threshold |
| Little spider | * | Beginner |
| Big spider | ** | Slightly above beginner |
| Giant spider | *** | Medium difficulty |
| Radiant Spider | + | Upper-medium difficulty |
| Poisonous spider | ++ | More difficult |
| Spider King | +++ | Disaster |
| Spider spirit | KING | Hell |
Project Catalog
graph TD;
Basics-->request;
Basics-->parsing["Parsing HTML and regular expressions"];
Basics-->scrapy;
Basics-->asynccrawler["High-performance asynchronous crawler"];
Basics-->feapder;
Automation-->selenium;
Automation-->playwright;
advanced["Advanced chapter"]-->comprehensive["Comprehensive case"];
advanced-->jsreverse["js reverse topic"];
jsreverse-->encryption["Request header or response data encryption"];
jsreverse-->fingerprint["Browser fingerprint detection"];
jsreverse-->webpack["webPack"];
jsreverse-->envdetect["Environment detection"];
jsreverse-->wasm["wasm encryption"];
captcha["Verification code"]-->Slider;
captcha-->clickselect["Click to select"];
Third-party libraries used in the project
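pip install requests # requests library, where crawling starts
pip install curl_cffi # TLS-standard request library
pip install lxml # data extraction with xpath
pip install playwright # needed for automation
pip install ddddocr # captcha recognition
pip install selenium # needed for automation; playwright is recommended
pip install scrapy # crawler framework
pip install "feapder[all]" # new-generation crawler framework
pip install pycryptodome # standard cryptography library for Python
pip install pyexecjs2 # call js code from Python
pip install m3u8 # download m3u8 videos
pip install prettytable # formatted table output
pip install tqdm # progress bars
pip install loguru # powerful logging library
pip install retrying # powerful retry utility
npm install crypto-js # or cryptojs; pick one of the two, standard js cryptography library
npm install jsdom # emulates the browser DOM and BOM in js
npm install tough-cookie # browser cookies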
Basics
Request
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Knight attendant | Baidu web page | The first crawler program | Click here |
| Knight attendant | UA Identification | First look at anti-crawling | Click here |
| Knight attendant | Baidu Translation | Introduction to POST requests | Click here |
| Knight attendant | Douban Movies | Basics | Click here |
| Knight attendant | KFC location query | JSON practice | Click here |
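
The request projects above revolve around two ideas: sending a browser-like User-Agent and issuing a POST with form data. The following is a minimal sketch of both, not the projects' own code. It uses the standard library's `urllib.request` instead of `requests` so that it runs without extra installs. The URL, the UA string and the `kw` form field are placeholders chosen for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder values for illustration only; they are not taken from the projects.
EXAMPLE_URL = "https://example.com/translate"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_post(url: str, form: dict, user_agent: str = BROWSER_UA) -> Request:
    """Build (but do not send) a form-encoded POST with a browser-like UA."""
    body = urlencode(form).encode("utf-8")
    headers = {
        "User-Agent": user_agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_post(EXAMPLE_URL, {"kw": "spider"})
print(req.get_method(), req.full_url)
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
print(req.data)
```

Passing `req` to `urllib.request.urlopen` would actually send it; with `requests`, the equivalent call would be `requests.post(url, data=form, headers=headers)`.
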
Parsing HTML and regular expressions
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Quasi-knight | Get fakeua | lxml parsing | Click here |
| Quasi-knight | 4k picture crawling | lxml, plus fixing encoding errors | Click here |
| Quasi-knight | 58 | lxml and paginated crawling | Click here |
| Quasi-knight | bs basics | Introduction to bs | Click here |
| Quasi-knight | bs case | bs in practice | Click here |
| Quasi-knight | xpath basics | Introduction to xpath | Click here |
| Quasi-knight | xpath parsing | xpath in practice | Click here |
| Quasi-knight | Regular expression basics | Introduction to regular expressions | Click here |
| Quasi-knight | Regular expression exercises | Regular expressions in practice | Click here |
| Quasi-knight | Resume crawling | A small project combining all of the above | Click here |
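
To make the contrast between regular expressions and a real HTML parser concrete, here is a small sketch that extracts the same links both ways. The projects use lxml and bs for parsing; this sketch substitutes the standard library's `html.parser` so it has no dependencies. The HTML snippet and the `LinkParser` class are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up listing page fragment.
SAMPLE_HTML = """
<ul class="list">
  <li><a href="/pic/1.html">First picture</a></li>
  <li><a href="/pic/2.html">Second picture</a></li>
</ul>
"""

# Regex approach: capture href and link text in one pass.
LINK_RE = re.compile(r'<a href="(?P<href>[^"]+)">(?P<text>[^<]+)</a>')
regex_links = [(m["href"], m["text"]) for m in LINK_RE.finditer(SAMPLE_HTML)]


class LinkParser(HTMLParser):
    """Parser approach: remember the current <a> href and pair it with its text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


parser = LinkParser()
parser.feed(SAMPLE_HTML)
print(regex_links)
print(parser.links)
```

The regex is shorter but breaks as soon as the attribute order or quoting changes, which is one reason a parser is usually the safer choice on real pages.
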
scrapy
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Great Knight | bossjob | First-level page crawling; may no longer work | Click here |
| The Great Knight | Double color ball | Basic scrapy operations | Click here |
| The Great Knight | picture | Basic scrapy operations | Click here |
| The Great Knight | Sunshine Policy | Basic scrapy operations | Click here |
| The Great Knight | Yi car data crawling | Includes entry-level js reverse engineering and parsing of large JSON data | Click here |
| The Great Knight | School Beauty Network | Basic scrapy operations | Click here |
| The Great Knight | NetEase News | Basic scrapy operations | Click here |
| The Great Knight | 17k novel crawling | Basic scrapy operations | Click here |
High-performance asynchronous crawler
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Knight attendant | Getting to know flask | Basic knowledge | Click here |
| Knight | Thread pool basics | Basic knowledge | Click here |
| The Great Knight | Meinv image batch crawling | Basics | Click here |
| The Great Knight | Celebrity picture crawling | Basics | Click here |
| The Great Knight | Multitasking coroutine | Basics | Click here |
| The Great Knight | Thread pool application | Basics | Click here |
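
The two concurrency styles covered in this section, thread pools and multitasking coroutines, can be compared side by side. This is a minimal sketch under stated assumptions: the URLs are placeholders, and `fake_download` / `fake_download_async` are hypothetical stand-ins that sleep instead of touching the network.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

URLS = [f"https://example.com/img/{i}.jpg" for i in range(5)]  # placeholder URLs


def fake_download(url: str) -> str:
    """Stand-in for a blocking download; sleeps instead of hitting the network."""
    time.sleep(0.05)
    return url.rsplit("/", 1)[-1]


async def fake_download_async(url: str) -> str:
    """Stand-in for a non-blocking download."""
    await asyncio.sleep(0.05)
    return url.rsplit("/", 1)[-1]


def run_thread_pool(urls):
    # Each blocking call runs in its own worker thread; map keeps input order.
    with ThreadPoolExecutor(max_workers=5) as pool:
        return list(pool.map(fake_download, urls))


async def run_coroutines(urls):
    # All coroutines wait concurrently on a single thread; gather keeps input order.
    return await asyncio.gather(*(fake_download_async(u) for u in urls))


pool_names = run_thread_pool(URLS)
coro_names = asyncio.run(run_coroutines(URLS))
print(pool_names)
print(coro_names)
```

A thread pool fits existing blocking code such as `requests`, while coroutines need an asynchronous HTTP client to pay off.
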
feapder
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Knight of the Raven | Xiaohongshu data collection | Uses feapder in Air mode with a custom CSV storage pipeline. More modes and features are planned, and further details still need to be added. | Click here |
Automation
selenium
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Knight attendant | Basic automatic operation | Basic automation operations | Click here |
| Knight attendant | Simulated login | Automation practice | Click here |
| Knight attendant | Action chains and iframe handling | Automation practice | Click here |
| Knight attendant | Headless browser and anti-detection | Practice | Click here |
| Knight | 12306 simulated login | Most likely no longer works | Click here |
| Knight | damai.com | Most likely no longer works | Click here |
playwright
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Knight | Postal code | Looks up postal codes by address using the synchronous API, including wait operations and choosing among different tables depending on the case; also uses pandas to work with Excel files | Click here |
| The Great Knight | Use local browser for anti-crawling | Automation is sometimes detected. Using the local browser counters this: because it is the local browser, the existing session and cookie state are available, so already logged-in sites can be visited directly without creating a new browser context. | Click here |
| The Earth Knight | Collect information | The difficulty is that every website has a different layout, the data differs from page to page and the volume is large, so both writing the regular expressions and handling the asynchronous flow are hard. The repository lists only 10 of the pages; it requires some understanding of regular expressions and asynchronous playwright. | Click here |
| The Great Knight | Anti-detection browser | Builds an anti-detection browser using a js file written by an expert; it can bypass most detection | Click here |
| The Earth Knight | Qidian VIP novel crawling | Bypasses the CSS anti-crawling on Qidian VIP novels by taking screenshots. Techniques used: locating bounding boxes, taking screenshots, scrolling, handling boundaries, and merging screenshots. This is not the optimal solution; contributions are welcome. | Click here |
Advanced chapter
Comprehensive case
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Knight | A certain poetry website | Captcha-related: login and image captcha solving with ddddocr | Click here |
| The Great Knight | Language crawler | Converts text through an online service; supports Chinese, English and Korean | Click here |
| The Great Knight | Bilibili comprehensive | Checks whether a user has liked you, pulls the message list, and pulls the like list | Click here |
| The Earth Knight | A video website | m3u8 video download, handling playlists both with and without a key; entry-level m3u8 and multi-threaded download | Click here |
| The Earth Knight | ins crawler | Page parameter extraction and JSON parsing | Click here |
| The Earth Knight | douyin site-wide data crawling | Includes video and image download, comment crawling, user information crawling and more. Some interfaces now perform xb detection; to use those interfaces you need to add xb to get data. The signature has been reintegrated: find the js file that generates the signature on GitHub, put it in the same directory as the douyin file and name it xb.js. An uploader provides an open-source code repository on GitHub; the repository is noted in the code, and it currently works. | Click here |
| The Earth Knight | weibo site-wide data crawling | Includes user search, post search, comment download, user album download, user homepage, user information and more | Click here |
| Unknown level | Crawler wheel | The author's own wrappers around commonly used crawler methods, for easier later development | Click here |
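
For the m3u8 case above, the difference between a playlist with a key and one without comes down to whether an `#EXT-X-KEY` tag is present. The sketch below shows one way to detect that and collect segment URLs. It is an illustration only: the playlist text and base URL are made up, `parse_playlist` is a hypothetical helper, and it parses by hand with the standard library rather than using the `m3u8` package the project installs. Decrypting AES-128 segments (for example with pycryptodome) is not shown.

```python
import re
from urllib.parse import urljoin

# A made-up playlist for illustration; real playlists come from the site.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-KEY:METHOD=AES-128,URI="key.key",IV=0x00000000000000000000000000000001
#EXTINF:10.0,
seg-0.ts
#EXTINF:8.5,
seg-1.ts
#EXT-X-ENDLIST
"""

KEY_RE = re.compile(r'#EXT-X-KEY:METHOD=(?P<method>[^,\s]+)(?:,URI="(?P<uri>[^"]+)")?')


def parse_playlist(text: str, base_url: str) -> dict:
    """Return the key info (or None) and absolute segment URLs of a media playlist."""
    key = None
    segments = []
    for line in text.splitlines():
        line = line.strip()
        match = KEY_RE.match(line)
        if match and match["method"] != "NONE":
            # Encrypted playlist: remember where the key must be fetched from.
            key = {"method": match["method"], "uri": urljoin(base_url, match["uri"])}
        elif line and not line.startswith("#"):
            # Any non-tag, non-empty line is a segment URI.
            segments.append(urljoin(base_url, line))
    return {"key": key, "segments": segments}


info = parse_playlist(SAMPLE_PLAYLIST, "https://example.com/video/index.m3u8")
print(info["key"])
print(info["segments"])
```

Once the segment list is known, the downloads are independent of each other, which is why the project pairs m3u8 with multi-threaded download.
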
js reverse topic
Request header or response data encryption
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| Knight | Nenniu data | Request header encryption, response body encryption | Click here |
| Knight | Entertainment Index | Basic introduction | Click here |
| Knight | Yien Data | Response body encryption | Click here |
| Knight | Anyway check | Response body encryption | Click here |
| The Great Knight | fjs public transaction | Obfuscated parameter encryption | Click here |
| The Great Knight | The only art | Runs dynamic js code | Click here |
| The Earth Knight | A weather website | Dynamic js, dynamic key, dynamic parameters, anti-debugging | Click here |
| The Earth Knight | A football website | The request body is encrypted several times, and the encryption location is hard to find | Click here |
| The Earth Knight | wangyiyun music | Site-wide data crawling | Click here |
| The Earth Knight | gds public transaction | Obfuscated parameters; you need to locate where they are generated | Click here |
| The Earth Knight | A certain translation | Request encryption and response decryption; not difficult | Click here |
| The Earth Knight | Login on Bilibili | RSA-encrypted password; the third-generation text click captcha can be used, and it is covered in the verification code section | Click here |
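
Request header encryption in projects like these usually means that a signature header is computed from the request parameters, and the reverse-engineering work is finding how. As a rough illustration of the shape such code takes once recovered, here is a sketch of a purely hypothetical scheme (sorted parameters plus timestamp plus salt, hashed with MD5). The salt, the header names and the scheme itself are assumptions made for this example and do not describe any site listed above.

```python
import hashlib
import time

SECRET_SALT = "demo-salt"  # hypothetical; real sites hide theirs in obfuscated js


def make_sign(params: dict, timestamp: int, salt: str = SECRET_SALT) -> str:
    """Join sorted key=value pairs, append timestamp and salt, then MD5 the result."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    raw = f"{canonical}&ts={timestamp}&salt={salt}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def signed_headers(params: dict, timestamp=None) -> dict:
    """Build the (hypothetical) headers a signed request would carry."""
    ts = int(time.time() * 1000) if timestamp is None else timestamp
    return {"X-Timestamp": str(ts), "X-Sign": make_sign(params, ts)}


headers = signed_headers({"page": 1, "keyword": "spider"}, timestamp=1700000000000)
print(headers)
```

Sorting the keys makes the signature independent of parameter order, and including the timestamp is what makes a replayed request fail on the server side.
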
webPack
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Earth Knight | China Minerals | Basic webpack with a standard encryption algorithm; simple, and can be implemented in several ways (nodejs, python and decode) | Click here |
Environment detection
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Knight of the Raven | redBook | xhs xs environment detection; you need to put cookies and localStorage into the file yourself | Click here |
| The Knight of the Raven | bossjob | __zp_s...__environment detection; the js changes every day, so some environment has to be patched in and the js modified; there is also module detection, etc. | Click here |
| The Knight of the Raven | Ape Man Studies Question 1 2023 | Heavily modified md5 and aes; remove some honeypots and patch the browser environment | Click here |
| The Earth Knight | Ele.me parameters | Gets the bx_et parameter through playwright | Click here |
| The Knight of the Raven | pdd's anti_content parameter | This is not environment patching but algorithm extraction. pdd's encryption is probably the same across its sites, with only some object values differing. The main encryption functions are all plain logic | Click here |
| The Earth Knight | boss direct recruitment update: click captcha to unblock the IP | The click-select trajectory encryption is third-generation GeeTest, so an implementation found online is used | Updated in the boss file |
wasm encryption
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Knight of the Raven | A certain airline | Wasm performs the encryption and decryption of the request header parameters. Updates: Alibaba-family v2 detection and v3 detection (obtained automatically), so all encrypted parameters are now solved | Click here |
Browser fingerprint detection
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Earth Knight | Yi Jiubi | The request body is encrypted first, followed by TLS fingerprint detection. Currently the homepage request gets through using a third-party library. | Click here |
Verification code
Slider
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Great Knight | JD slider | After obtaining the images, ddddocr identifies the slider position, then a trajectory is generated and the request is sent. The trajectory comes from an expert: first prepare a base trajectory by sliding manually from left to right, then a jitter trajectory, then splice the two (the author's own recorded trajectory did not pass for unknown reasons, so the expert's trajectory is used directly) | Click here |
| The Great Knight | Alibaba 226 | In this update it is obtained with playwright, which is relatively simple | Click here |
| The Great Knight | Feigua gets verification code slider | In this update it is obtained with playwright, which is relatively simple | Click here |
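
The JD slider notes describe splicing a base left-to-right trajectory with a jitter trajectory. The sketch below illustrates that idea only; it is not the trajectory code used in the project. The recorded `BASE_TRACK`, the `(x, y, elapsed_ms)` point format, and the helper names are all assumptions, and the target distance (which the project obtains via ddddocr) is passed in as a plain number.

```python
import random

# Hypothetical recorded base track: (x, y, elapsed_ms) points of a 100 px manual slide.
BASE_TRACK = [(0, 0, 0), (12, 1, 80), (35, 2, 160), (62, 2, 240), (85, 1, 320), (100, 0, 400)]


def scale_track(track, distance):
    """Stretch the recorded x positions so the slide ends exactly at `distance`."""
    factor = distance / track[-1][0]
    return [(round(x * factor), y, t) for x, y, t in track]


def jitter_track(end_x, start_t, steps=4, rng=None):
    """Small back-and-forth wobble around the final position, ending on it."""
    rng = rng or random.Random()
    points = []
    for i in range(1, steps + 1):
        # The last point has no offset so the track settles on the target.
        offset = 0 if i == steps else rng.choice([-1, 1])
        points.append((end_x + offset, rng.choice([-1, 0, 1]), start_t + i * 30))
    return points


def build_track(distance, rng=None):
    """Splice the scaled base track with a jitter track."""
    main = scale_track(BASE_TRACK, distance)
    return main + jitter_track(main[-1][0], main[-1][2], rng=rng)


track = build_track(187, rng=random.Random(0))
print(track[0], track[-1], len(track))
```

Reusing a recorded human slide keeps the speed profile realistic, while the jitter at the end imitates the small corrections a person makes before releasing the mouse.
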
Click to choose
| Difficulty mark | Project name | Notes | Quick navigation |
|---|---|---|---|
| The Knight of the Raven | Third-generation click-select | Request the interfaces in order to obtain the images, send the image data to the recognition interface to get the click coordinates, convert the coordinates and pass them to the JS to generate the trajectory, then request the interface with the trajectory to obtain the validate value | Click here |
Star History
Sponsor
If this repository helps you learn crawling and reverse engineering, you are welcome to sponsor the author with a cup of milk tea!
(Your support can keep the author happy all day.)