XXL-CRAWLER
XXL-CRAWLER, a distributed web crawler framework.
Introduction
XXL-CRAWLER is a distributed web crawler framework. A distributed crawler can be developed with a single line of code, with features such as multi-threading, asynchronous operation, dynamic IP proxying, distributed crawling, and JS rendering.
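For example (a sketch based on the Builder API from the project documentation; the seed URL, whitelist regex, and the annotated PageVo class are placeholders, and the import paths are assumed):

```java
import com.xuxueli.crawler.XxlCrawler;
import com.xuxueli.crawler.parser.PageParser;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlerDemo {
    public static void main(String[] args) {
        // build the crawler: seed URL, whitelist, thread pool, and a parser
        // that receives each extracted PageVo
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setUrls("https://my.oschina.net/xuxueli/blog")                    // seed URL (placeholder)
                .setWhiteUrlRegexs("https://my\\.oschina\\.net/xuxueli/blog/\\d+") // only crawl matching URLs
                .setThreadCount(3)                                                 // crawler thread pool size
                .setPageParser(new PageParser<PageVo>() {
                    @Override
                    public void parse(Document html, Element pageVoElement, PageVo pageVo) {
                        // pageVo arrives pre-populated from the annotated PageVo class;
                        // process or persist the extracted data here
                        System.out.println(pageVo);
                    }
                })
                .build();

        crawler.start(true);    // true = run synchronously; false = run asynchronously
    }
}
```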
Documentation
Features
- 1. Concise: the API is intuitive and concise, making it quick to get started;
- 2. Lightweight: the underlying implementation relies only on jsoup, simple and efficient;
- 3. Modular: modular structural design, easy to extend;
- 4. Object-oriented: page data can be mapped onto PageVO objects through annotations; the framework automatically extracts the data and returns populated PageVO objects, and a single page may yield one or more PageVOs (see the annotation sketch after this list);
- 5. Multi-threading: crawler threads run in a thread pool to improve collection efficiency;
- 6. Distributed support: distributed crawling can be achieved by extending the "RunData" module to share running data through Redis or a database; the stand-alone LocalRunData implementation is provided by default (see the Redis sketch after this list);
- 7. JS rendering: by extending the "PageLoader" module, data produced by dynamic JS rendering can be collected. Jsoup (no JS rendering, faster), HtmlUnit (JS rendering), and Selenium+PhantomJS (JS rendering, high compatibility) implementations are provided natively, and further implementations can be freely added (see the PageLoader sketch after this list);
- 8. Failure retry: failed requests are retried, with a configurable retry count;
- 9. Proxy IP: route requests through proxy IPs to get around anti-crawling policies such as WAF rules;
- 10. Dynamic proxy: the proxy pool can be adjusted dynamically at runtime, and proxy-pool routing policies can be customized (see the proxy sketch after this list);
- 11. Asynchronous: supports both synchronous and asynchronous operation (see the configuration sketch after this list);
- 12. Whole-site spreading: the crawl can spread out from the seed URLs and cover the entire site;
- 13. Deduplication: prevents the same URL from being crawled repeatedly;
- 14. URL whitelist: whitelist rules can be set to filter which URLs are crawled;
- 15. Custom request data: request parameters, cookies, headers, UserAgent rotation, Referrer, and so on can be customized;
- 16. Dynamic parameters: request parameters can be adjusted dynamically at runtime;
- 17. Timeout control: the timeout of crawler requests can be configured;
- 18. Active pause: crawler threads actively pause after processing each page, to avoid being blocked for requesting too frequently;
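For the object-oriented mapping (feature 4), a sketch of an annotated PageVO; the @PageSelect / @PageFieldSelect annotations and their cssQuery attribute follow the project documentation, while the selectors, field names, and import paths are illustrative:

```java
import com.xuxueli.crawler.annotation.PageFieldSelect;
import com.xuxueli.crawler.annotation.PageSelect;

// one PageVo instance is created per element matched by the class-level
// cssQuery; each field is filled from its own selector within that element
@PageSelect(cssQuery = "body")
public class PageVo {

    @PageFieldSelect(cssQuery = ".blog-heading .title")
    private String title;   // text of the matched element

    @PageFieldSelect(cssQuery = "#read")
    private int read;       // non-String fields are converted automatically

    // getters, setters and toString omitted for brevity
}
```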
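For distributed support (feature 6), every node must share one URL queue and one deduplication set. A minimal sketch of a Redis-backed RunData using the Jedis client; the addUrl/getUrl/getUrlNum methods follow the RunData contract, while the Redis key names, connection handling, and import paths are assumptions:

```java
import com.xuxueli.crawler.rundata.RunData;
import redis.clients.jedis.Jedis;

import java.util.List;

// shares the URL queue and the dedup set across crawler nodes via Redis
public class RedisRunData extends RunData {
    private static final String URL_QUEUE = "xxl-crawler:url-queue";   // pending URLs (list)
    private static final String SEEN_SET  = "xxl-crawler:seen-urls";   // cluster-wide dedup (set)

    // note: a production version would use a JedisPool, since a single
    // Jedis connection is not thread-safe
    private final Jedis jedis = new Jedis("localhost", 6379);

    @Override
    public boolean addUrl(String link) {
        // enqueue only URLs that no node has seen yet
        if (jedis.sadd(SEEN_SET, link) > 0) {
            jedis.rpush(URL_QUEUE, link);
            return true;
        }
        return false;
    }

    @Override
    public String getUrl() {
        // blocking pop: idle worker threads wait up to 5s for new work
        List<String> popped = jedis.blpop(5, URL_QUEUE);
        return (popped != null && popped.size() > 1) ? popped.get(1) : null;
    }

    @Override
    public int getUrlNum() {
        return jedis.llen(URL_QUEUE).intValue();
    }
}
```

Each node would then pass this RunData to its Builder (assuming a setRunData-style setter), so that all nodes cooperatively drain the shared queue.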
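For JS rendering (feature 7), switching strategies is a matter of handing a different PageLoader to the Builder. A fragment sketch; the loader class names follow the implementations listed above, while the setPageLoader setter and the PhantomJS driver path are assumptions:

```java
// pick the loader that matches the target pages:
PageLoader loader = new HtmlUnitPageLoader();             // executes page JS before parsing
// new JsoupPageLoader()                                  // default: plain fetch, no JS, fastest
// new SeleniumPhantomjsPageLoader("/path/to/phantomjs")  // JS rendering, highest compatibility (driver path assumed)

XxlCrawler crawler = new XxlCrawler.Builder()
        .setUrls("https://example.com/js-rendered-page")  // placeholder
        .setPageLoader(loader)
        // ... setPageParser(...) and other settings as in the basic example
        .build();
```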
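For the proxy features (9 and 10), a fragment sketch of wiring in a round-robin proxy pool; the ProxyMaker module name follows the documentation, while RoundProxyMaker, its addProxy method, the setProxyMaker setter, and the hosts/ports are assumptions:

```java
// requests rotate through the pool, so no single IP hits the target
// frequently enough to trip WAF / anti-crawling rules
ProxyMaker proxyMaker = new RoundProxyMaker();
proxyMaker.addProxy(new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.0.0.1", 8888)));
proxyMaker.addProxy(new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.0.0.2", 8888)));

XxlCrawler crawler = new XxlCrawler.Builder()
        .setUrls("https://example.com/")    // placeholder
        .setProxyMaker(proxyMaker)
        // ... other settings as in the basic example
        .build();
```

Custom routing policies (feature 10) would be implemented the same way, by providing another ProxyMaker implementation.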
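Finally, the retry, async, spreading, whitelist, timeout, and pause features (8, 11, 12, 14, 17, 18) all map to Builder settings. One combined configuration fragment, as referenced in the list; the setter names follow the project documentation and the values are illustrative:

```java
XxlCrawler crawler = new XxlCrawler.Builder()
        .setUrls("https://example.com/")                          // seed URL (placeholder)
        .setAllowSpread(true)                                     // 12: spread out and crawl the whole site
        .setWhiteUrlRegexs("https://example\\.com/article/\\d+")  // 14: whitelist filter for spread URLs
        .setFailRetryCount(3)                                     // 8: retry a failed request up to 3 times
        .setTimeoutMillis(5000)                                   // 17: per-request timeout (ms)
        .setPauseMillis(1000)                                     // 18: pause after each processed page (ms)
        // ... setPageParser(...) as in the basic example
        .build();

crawler.start(false);   // 11: asynchronous start (returns immediately); start(true) blocks
```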
Communication
Contributing
Contributions are welcome! Open a pull request to fix a bug, or open an Issue to discuss a new feature or change.
Access registration
Companies using the product are welcome to register at the registration address; registration is used only for product promotion.
Copyright and License
This product is open source and free, and free community technical support will continue to be provided. Individuals and enterprises are free to access and use it.
- Licensed under the Apache License, Version 2.0.
- Copyright (c) 2015-present, xuxueli.
Donate
Any amount is enough to express your kind thought. Thank you very much :) Go to donate