autoscraper

Other source code

1.1.14

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python

This project is made for automatic web scraping, to make scraping easy. It takes the URL or the HTML content of a web page and a list of sample data that we want to scrape from that page. This data can be text, a URL, or any HTML tag value on that page. It learns the scraping rules and returns the similar elements. You can then use this learned object with new URLs to get similar content, or the exact same element, from those new pages.
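To make the "learn rules from samples, then reuse them" idea concrete, here is a minimal standard-library sketch. It is a toy illustration only, not AutoScraper's actual implementation: the names ElementCollector, learn_rules, and apply_rules, the "tag plus class attribute" rule, and both sample pages are assumptions made for this example.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect (tag, class, text) for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open elements as (tag, class) pairs
        self.elements = []  # (tag, class, text) triples in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, css_class = self._stack[-1]
            self.elements.append((tag, css_class, text))


def collect(html):
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def learn_rules(html, wanted_list):
    """Return the (tag, class) signatures of elements whose text is a wanted sample."""
    return {(tag, cls) for tag, cls, text in collect(html) if text in wanted_list}


def apply_rules(html, rules):
    """Return the text of every element whose signature matches a learned rule."""
    return [text for tag, cls, text in collect(html) if (tag, cls) in rules]


# Hypothetical pages, made up for this sketch.
TRAIN_PAGE = """
<div class="related">
  <a class="question" href="/q/1">What are metaclasses in Python?</a>
  <a class="question" href="/q/2">How to call an external command?</a>
</div>
<a class="nav" href="/">Home</a>
"""
NEW_PAGE = """
<div class="related">
  <a class="question" href="/q/3">Convert bytes to a string</a>
</div>
<a class="nav" href="/">Home</a>
"""

rules = learn_rules(TRAIN_PAGE, ["What are metaclasses in Python?"])
print(apply_rules(TRAIN_PAGE, rules))  # both question titles, not "Home"
print(apply_rules(NEW_PAGE, rules))    # the question title on the new page
```

One sample is enough to pick out every element that shares its signature, and the same rules carry over to a page with the same layout. The real library learns richer rules than this toy does.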

Installation

It is compatible with Python 3.

Install the latest version from the git repository using pip:

$ pip install git+https://github.com/alirezamika/autoscraper.git

Install from PyPI:

$ pip install autoscraper

Install from source:

$ python setup.py install

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Here's the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
    'How to call an external command?',
    'What are metaclasses in Python?',
    'Does Python have a ternary conditional operator?',
    'How do you remove duplicates from a list whilst preserving order?',
    'Convert bytes to a string',
    'How to get line count of a large file cheaply in Python?',
    "Does Python have a string 'contains' substring method?",
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Now you can use the scraper object to get the related topics of any Stack Overflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

Getting exact results

Say we want to scrape live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)

Note that you should update the wanted_list if you copy this code, because the content of the page changes dynamically.

You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers:

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
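The text mentions custom headers as well as proxies, so here is a small sketch of a request_args dictionary that carries both. It assumes, as the text says, that request_args entries are forwarded as requests parameters; the helper name make_request_args, the User-Agent string, and the timeout value are hypothetical and not part of AutoScraper.

```python
def make_request_args(proxy_host=None, user_agent=None, timeout=10):
    """Build a request_args dict; each key is a standard requests.get keyword."""
    args = {"timeout": timeout}  # seconds before the request gives up
    if proxy_host:
        args["proxies"] = {
            "http": f"http://{proxy_host}",
            "https": f"https://{proxy_host}",
        }
    if user_agent:
        args["headers"] = {"User-Agent": user_agent}
    return args


# Hypothetical values: swap in your own proxy address and User-Agent.
request_args = make_request_args(
    proxy_host="127.0.0.1:8001",
    user_agent="my-scraper/0.1",
)
print(request_args)
```

The resulting dictionary can be passed in place of dict(proxies=proxies), as in scraper.build(url, wanted_list, request_args=request_args).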

Now we can get the price of any symbol:

scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

You may want to get other information as well. For example, if you also want the market cap, you can just append it to the wanted list. The get_result_exact method retrieves the data in the exact same order as the wanted list.
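The ordering guarantee is easier to see in a toy model where each wanted item produces its own rule and the rules are replayed in that order. The sketch below uses only the standard library and is not how AutoScraper works internally: the regular expression, the learn_exact and apply_exact helpers, the two HTML snippets, and every value except "124.81" are made up for illustration.

```python
import re

# Toy pattern: captures (class, text) for <span class="...">text</span> only.
SPAN = re.compile(r'<span class="([^"]+)">([^<]*)</span>')


def learn_exact(html, wanted_list):
    """For each wanted value, record (class, index among spans of that class)."""
    spans = SPAN.findall(html)
    rules = []
    for wanted in wanted_list:
        for cls, text in spans:
            if text == wanted:
                same_class = [t for c, t in spans if c == cls]
                rules.append((cls, same_class.index(wanted)))
                break
    return rules


def apply_exact(html, rules):
    """Replay the rules in order, so results follow the wanted-list order."""
    spans = SPAN.findall(html)
    results = []
    for cls, index in rules:
        same_class = [t for c, t in spans if c == cls]
        if index < len(same_class):
            results.append(same_class[index])
    return results


# Hypothetical pages: the market cap comes before the price in the markup.
AAPL = '<div><span class="cap">2.1T</span><span class="price">124.81</span></div>'
MSFT = '<div><span class="cap">1.6T</span><span class="price">210.10</span></div>'

rules = learn_exact(AAPL, ["124.81", "2.1T"])  # price first, then market cap
print(apply_exact(MSFT, rules))
```

Although the market cap appears first in the markup, the price comes back first because it is first in the wanted list.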

Another example: say we want to scrape the about text, the number of stars, and the link to the issues page of a GitHub repo:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)

Simple, right?

Saving the model

We can now save the built model to use it later. To save:

# Give it a file path
scraper.save('yahoo-finance')

And to load:

scraper.load('yahoo-finance')

Tutorials

See this gist for more advanced usage.
AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes

Issues

Feel free to open an issue if you have any problem using the module.

Support the project

Happy coding

More information

Version 1.1.14
Type Other source code
Updated 2025-02-25
Size 12.55KB
Source GitHub
