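pychan is a Python client for interacting with 4chan. 4chan has no official API, and third-party attempts to implement one have tended to languish, so this library provides abstractions for interacting with (scraping) 4chan directly. pychan is object-oriented, and its implementation is lazy where reasonable (using Python generators) in order to optimize performance and minimize superfluous blocking I/O operations.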
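To make the laziness claim concrete, here is a minimal sketch of the general generator technique. It does not use pychan's internals: `fake_fetch_page` and `iter_threads` are hypothetical stand-ins for a blocking HTTP request and a lazy iterator, and the page size of three is arbitrary.

```python
from itertools import islice
from typing import Iterator

fetched_pages: list[int] = []  # records which (simulated) pages were requested


def fake_fetch_page(page: int) -> list[str]:
    """Hypothetical stand-in for one blocking HTTP request."""
    fetched_pages.append(page)
    return [f"thread-{page}-{i}" for i in range(3)]


def iter_threads(max_pages: int = 10) -> Iterator[str]:
    """Yield threads one at a time, fetching a page only when it is needed."""
    for page in range(1, max_pages + 1):
        yield from fake_fetch_page(page)


# Taking four items only triggers the first two (simulated) requests
first_four = list(islice(iter_threads(), 4))
print(first_four)     # ['thread-1-0', 'thread-1-1', 'thread-1-2', 'thread-2-0']
print(fetched_pages)  # [1, 2]
```

A caller that stops iterating early never pays for the remaining eight pages, which is the benefit a lazy design is after.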
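If you have Python >=3.10 and <4.0 installed, you can install pychan from PyPI with something like:
pip install pychan
All 4chan interactions are throttled internally by sleeping the executing thread. If you run pychan in a multithreaded fashion, you will not get the benefit of this throttling, and pychan is not responsible for the consequences of excessive HTTP requests in that case.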
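If you do need several threads, one possible way to keep requests spaced out is to funnel them through a shared lock. The sketch below is not part of pychan: `SerializedThrottle`, its `min_interval` parameter, and `fake_request` are hypothetical names, and the 0.05 second interval is arbitrary.

```python
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SerializedThrottle:
    """Hypothetical helper: run calls one at a time, at least min_interval seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._last_finished = float("-inf")  # no call has finished yet

    def call(self, func: Callable[[], T]) -> T:
        with self._lock:  # only one thread may be inside at a time
            wait = self._min_interval - (time.monotonic() - self._last_finished)
            if wait > 0:
                time.sleep(wait)
            try:
                return func()
            finally:
                self._last_finished = time.monotonic()


throttle = SerializedThrottle(min_interval=0.05)
call_times: list[float] = []


def fake_request() -> None:
    """Stand-in for a blocking request; records when it ran."""
    call_times.append(time.monotonic())


workers = [threading.Thread(target=throttle.call, args=(fake_request,)) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

gaps = [later - earlier for earlier, later in zip(call_times, call_times[1:])]
```

Note that this only spaces out work done inside `func`. Because pychan returns generators in places, the HTTP requests may happen while you iterate rather than when you call the method, so the iteration itself would have to run inside `func` for the lock to have any effect.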
from pychan import FourChan, LogLevel, PychanLogger

# With all defaults (logging disabled, all exceptions raised)
fourchan = FourChan()

# Tell pychan to gracefully ignore HTTP exceptions, if any, within its internal logic
fourchan = FourChan(raise_http_exceptions=False)

# Tell pychan to gracefully ignore parsing exceptions, if any, within its internal logic
fourchan = FourChan(raise_parsing_exceptions=False)

# Configure logging explicitly
logger = PychanLogger(LogLevel.INFO)
fourchan = FourChan(logger=logger)

# Use all of the above settings at once
logger = PychanLogger(LogLevel.INFO)
fourchan = FourChan(logger=logger, raise_http_exceptions=True, raise_parsing_exceptions=True)

The rest of the examples in this README assume that you have already created an instance of the FourChan class as shown above.
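This function fetches the list of boards from 4chan dynamically at call time.
Note: boards that are incompatible with pychan are not returned in this list.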
boards = fourchan.get_boards()
# Sample return value:
# ['a', 'b', 'c', 'd', 'e', 'g', 'gif', 'h', 'hr', 'k', 'm', 'o', 'p', 'r', 's', 't', 'u', 'v', 'vg', 'vm', 'vmg', 'vr', 'vrpg', 'vst', 'w', 'wg', 'i', 'ic', 'r9k', 's4s', 'vip', 'qa', 'cm', 'hm', 'lgbt', 'y', '3', 'aco', 'adv', 'an', 'bant', 'biz', 'cgl', 'ck', 'co', 'diy', 'fa', 'fit', 'gd', 'hc', 'his', 'int', 'jp', 'lit', 'mlp', 'mu', 'n', 'news', 'out', 'po', 'pol', 'pw', 'qst', 'sci', 'soc', 'sp', 'tg', 'toy', 'trv', 'tv', 'vp', 'vt', 'wsg', 'wsr', 'x', 'xs']

# Iterate over all threads in /b/
for thread in fourchan.get_threads("b"):
    # Do stuff with the thread
    print(thread.title)
    # You can also iterate over all the posts in the thread
    for post in fourchan.get_posts(thread):
        # Do stuff with the post - refer to the model documentation in pychan's README for details
        print(post.text)

Note: some boards do not have an archive (e.g. /b/). For such boards, pychan either returns an empty list or raises an exception, depending on how you configured your FourChan instance.
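The threads returned by this function always have a title field containing the text shown under the "Excerpt" column header in 4chan's interface. This text can be either the thread's real title or a preview of the original post's text. Passing any thread returned by this method to get_posts() automatically corrects the title field (if necessary) on the thread attached to the returned posts. For more details, see "Get Posts for a Specific Thread".
Technically, pychan could work around this title behavior by issuing an additional HTTP request per thread to obtain its real title, but in the spirit of making as few HTTP requests as possible, pychan uses the excerpt directly.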
for thread in fourchan.get_archived_threads("pol"):
    # Do stuff with the thread
    print(thread.title)
    # You can also iterate over all the posts in the thread
    for post in fourchan.get_posts(thread):
        # Do stuff with the post - refer to the model documentation in pychan's README for details
        print(post.text)

Searching 4chan is more cumbersome than accessing the rest of 4chan's data. 4chan has a Cloudflare firewall in front of its REST API, so the only way to get data from search is to supply the HTTP request information needed to pass Cloudflare's anti-bot checks. Ultimately, this amounts to sending certain headers along with the HTTP request, but the challenge lies in actually acquiring those headers.
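Generating these headers for you is currently outside pychan's scope, so if you want to automate the circumvention of Cloudflare's protections, you may want to consider using one of the following projects (this list is alphabetized and not exhaustive):
The manual way to obtain these values is to perform a 4chan search in your web browser and use the browser's developer tools to track the network requests made during the search. The request containing the Cloudflare values is made to https://find.4chan.org/api with some query parameters. Once you find this request, copy the User-Agent and Cookie values sent in it and pass them to pychan's search() method. Note that Cloudflare cookies expire, so this manual workaround only returns results until Cloudflare invalidates your cookies. After that, you need to obtain new values.
Note: closed, stickied, and archived threads are never returned in search results.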
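The Cookie value copied from the browser is a single header string, while the search() example in this README passes cookies as a dictionary. A small sketch for converting one into the other follows. `parse_cookie_header` is a hypothetical helper, not part of pychan, and the cookie values are made-up placeholders.

```python
def parse_cookie_header(header: str) -> dict[str, str]:
    """Split a raw Cookie header ("a=1; b=2") into a name -> value dictionary."""
    cookies: dict[str, str] = {}
    for part in header.split(";"):
        # Split on the first "=" only, since cookie values may contain "="
        name, separator, value = part.strip().partition("=")
        if separator and name:
            cookies[name] = value
    return cookies


# Placeholder header, shaped like what a browser's developer tools would show
copied_header = "cf_clearance=abc123-1440859602-0-160; theme=dark"
all_cookies = parse_cookie_header(copied_header)

# Keep only the Cloudflare cookie used in this README's search() example
cloudflare_cookies = {"cf_clearance": all_cookies["cf_clearance"]}
print(cloudflare_cookies)
```

Whether cookies other than cf_clearance are needed is not stated here, so the filtering step is an assumption based on the example that follows.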
# This "threads" variable will contain a Python Generator (not a list) in order to facilitate laziness
threads = fourchan.search(
    board="b",
    text="ylyl",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    cloudflare_cookies={
        "cf_clearance": "bm2RICpcDeR4cXoC2nfI_cnZcbAkN4UYpN6c1zzeb8g-1440859602-0-160"
    },
)

for thread in threads:
    # The thread object is the same class as the one returned by get_threads()
    for post in fourchan.get_posts(thread):
        # Do stuff with the post - refer to the model documentation in pychan's README for details
        print(post.text)

from pychan.models import Thread
# Instantiate a Thread instance with which to query for posts
thread = Thread("int", 168484869)

# Note: the thread contained within the returned posts will have all applicable metadata (such as
# title and sticky status), regardless of whether you provided such data above - pychan will
# "auto-discover" all metadata and include it in the post models' copy of the thread
posts = fourchan.get_posts(thread)

The tables below summarize the data available on the various models used by this library.
Also note that all model classes in pychan implement the following methods:
- __repr__
- __str__
- __hash__
- __eq__
- __iter__ - implemented so that models can be passed to Python's tuple() function
- __copy__
- __deepcopy__

The table below corresponds to the pychan.models.Thread class.
| Field | Type | Example Value(s) |
|---|---|---|
| thread.board | str | "b", "int" |
| thread.number | int | |
| thread.title | Optional[str] | None, "YLYL thread" |
| thread.is_stickied | bool | True, False |
| thread.is_closed | bool | True, False |
| thread.is_archived | bool | True, False |
| thread.url | str | "https://boards.4chan.org/a/thread/251097344" |
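As mentioned earlier, pychan's models implement __iter__ so that they can be passed to tuple(). The sketch below shows how that pattern generally works using a stand-in class. `ThreadSketch` is hypothetical and carries only three of the fields from the table above; the order in which pychan's real models yield their fields is not specified here, so the order shown is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass(frozen=True)
class ThreadSketch:
    """Hypothetical stand-in with a few of the Thread fields from the table above."""

    board: str
    number: int
    title: Optional[str] = None

    def __iter__(self) -> Iterator[Any]:
        # Yield field values in declaration order so tuple(instance) works
        yield self.board
        yield self.number
        yield self.title


thread = ThreadSketch("int", 168484869, "YLYL thread")
as_tuple = tuple(thread)       # ('int', 168484869, 'YLYL thread')
board, number, title = thread  # unpacking works for the same reason
```

A frozen dataclass also generates __eq__ and __hash__, so instances of the stand-in can be compared and used as dictionary keys or set members.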
The table below corresponds to the pychan.models.Post class.
| Field | Type | Example Value(s) |
|---|---|---|
| post.thread | Thread | pychan.models.Thread |
| post.number | int | |
| post.timestamp | datetime.datetime | datetime.datetime |
| post.poster | Poster | pychan.models.Poster |
| post.text | str | ">be me\n>be bored\n>write pychan\n>somehow it works" |
| post.is_original_post | bool | True, False |
| post.file | Optional[File] | None, pychan.models.File |
| post.replies | list[Post] | [], [pychan.models.Post, pychan.models.Post] |
| post.url | str | "https://boards.4chan.org/a/thread/251097344#p251097419" |
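The replies field shown above is purely a convenience provided by pychan for accessing every post in the thread that used the >> operator to "reply" to the current post. You do not need the replies field to reach all of the posts in a thread, however: when you call get_posts(), you still receive every post, in the order it was posted, as a single flat list.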
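To illustrate the relationship between the flat list and the replies field, here is a sketch of how such links could be derived from >> references in post text. It is not pychan's implementation: `PostSketch` and `link_replies` are hypothetical, and the sample posts are invented.

```python
import re
from dataclasses import dataclass, field

# Matches ">>12345" but not greentext (">be me") or cross-board links (">>>/g/123")
REPLY_PATTERN = re.compile(r">>(\d+)")


@dataclass
class PostSketch:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    number: int
    text: str
    replies: list["PostSketch"] = field(default_factory=list)


def link_replies(posts: list[PostSketch]) -> None:
    """Attach each post to the replies list of every post it references."""
    by_number = {post.number: post for post in posts}
    for post in posts:
        # set() avoids double-counting a post that quotes the same target twice
        for target_number in set(map(int, REPLY_PATTERN.findall(post.text))):
            target = by_number.get(target_number)
            if target is not None and target is not post:
                target.replies.append(post)


posts = [
    PostSketch(1, "original post"),
    PostSketch(2, ">>1\nfirst reply"),
    PostSketch(3, ">>1\n>>2\n>be me\nsecond reply"),
]
link_replies(posts)
reply_numbers = {post.number: [reply.number for reply in post.replies] for post in posts}
print(reply_numbers)  # {1: [2, 3], 2: [3], 3: []}
```

The flat list remains the source of truth; the replies lists only hold references to posts that are already in it.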
The table below corresponds to the pychan.models.Poster class.
| Field | Type | Example Value(s) |
|---|---|---|
| poster.name | str | "Anonymous" |
| poster.is_moderator | bool | True, False |
| poster.id | Optional[str] | None, "BYagKQXI" |
| poster.flag | Optional[str] | None, "United States", "Canada" |
The table below corresponds to the pychan.models.File class.
| Field | Type | Example Value(s) |
|---|---|---|
| file.url | str | "https://i.4cdn.org/pol/1658892700380132.jpg" |
| file.name | str | "wojak.jpg", "i feel alone.jpg" |
| file.size | str | "601 KB" |
| file.dimensions | tuple[int, int] | (1920, 1080), (800, 600) |
| file.is_spoiler | bool | True, False |
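Because file.size is a string rather than a number, comparing or summing sizes requires a conversion step. The sketch below is one way to do it. `size_to_bytes` is a hypothetical helper, not part of pychan, and it assumes binary multiples (1 KB = 1024 bytes) and only the B, KB, and MB units, neither of which is confirmed by this README.

```python
import re

# Assumption: sizes use binary multiples (1 KB = 1024 bytes)
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024 ** 2}
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def size_to_bytes(size: str) -> int:
    """Convert a size string such as "601 KB" into an approximate byte count."""
    match = SIZE_PATTERN.match(size)
    if match is None:
        raise ValueError(f"Unrecognized size format: {size!r}")
    amount, unit = match.groups()
    factor = UNIT_FACTORS.get(unit.upper())
    if factor is None:
        raise ValueError(f"Unrecognized size unit: {unit!r}")
    return round(float(amount) * factor)


print(size_to_bytes("601 KB"))  # 615424
print(size_to_bytes("1.5 MB"))  # 1572864
```

The result is approximate, since the displayed size is already rounded.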
For developer-oriented information, see Contributing.