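pychan is a Python client for interacting with 4chan. 4chan does not have an official API, and attempts to implement one through third-party libraries tend to languish, so this library provides abstractions over interacting with (scraping) 4chan directly. pychan is object-oriented, and its implementation is lazy where reasonable (using Python generators) in order to optimize performance and minimize superfluous blocking I/O operations.

If you have Python >=3.10 and <4.0 installed, pychan can be installed from PyPI using something like:

```sh
pip install pychan
```

All 4chan interactions are throttled internally by sleeping the executing thread. If you execute pychan in a multithreaded way, you will not get the benefits of this throttling. In that case, pychan does not take responsibility for the consequences of excessive HTTP requests.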
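
To make the idea of sleep-based throttling concrete, here is a minimal, generic sketch. It is not pychan's internal implementation: the `Throttle` class, its `min_interval` parameter, and the `wait()` method are all hypothetical names chosen for illustration. The lock means that a single shared instance keeps calls spaced out even when several threads use it.

```python
import threading
import time


class Throttle:
    """Hypothetical helper (not part of pychan): spaces out calls by sleeping."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._last_call = None

    def wait(self) -> float:
        """Block until min_interval has elapsed since the previous call.

        Returns the number of seconds slept.
        """
        # The lock serializes callers, so concurrent threads queue up here
        with self._lock:
            slept = 0.0
            if self._last_call is not None:
                remaining = self._min_interval - (self._clock() - self._last_call)
                if remaining > 0:
                    self._sleep(remaining)
                    slept = remaining
            self._last_call = self._clock()
            return slept
```

If you do need multiple threads, one option under these assumptions is to create one `Throttle` and call `throttle.wait()` before each pychan call in every thread.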
```python
from pychan import FourChan, LogLevel, PychanLogger

# With all defaults (logging disabled, all exceptions raised)
fourchan = FourChan()

# Tell pychan to gracefully ignore HTTP exceptions, if any, within its internal logic
fourchan = FourChan(raise_http_exceptions=False)

# Tell pychan to gracefully ignore parsing exceptions, if any, within its internal logic
fourchan = FourChan(raise_parsing_exceptions=False)

# Configure logging explicitly
logger = PychanLogger(LogLevel.INFO)
fourchan = FourChan(logger=logger)

# Use all of the above settings at once
logger = PychanLogger(LogLevel.INFO)
fourchan = FourChan(logger=logger, raise_http_exceptions=True, raise_parsing_exceptions=True)
```

The rest of the examples in this README assume that you have already created an instance of the `FourChan` class as shown above.

This function dynamically fetches the boards from 4chan at call time.

Note: boards that are not compatible with pychan are not returned in this list.

```python
boards = fourchan.get_boards()
# Sample return value:
# ['a', 'b', 'c', 'd', 'e', 'g', 'gif', 'h', 'hr', 'k', 'm', 'o', 'p', 'r', 's', 't', 'u', 'v', 'vg', 'vm', 'vmg', 'vr', 'vrpg', 'vst', 'w', 'wg', 'i', 'ic', 'r9k', 's4s', 'vip', 'qa', 'cm', 'hm', 'lgbt', 'y', '3', 'aco', 'adv', 'an', 'bant', 'biz', 'cgl', 'ck', 'co', 'diy', 'fa', 'fit', 'gd', 'hc', 'his', 'int', 'jp', 'lit', 'mlp', 'mu', 'n', 'news', 'out', 'po', 'pol', 'pw', 'qst', 'sci', 'soc', 'sp', 'tg', 'toy', 'trv', 'tv', 'vp', 'vt', 'wsg', 'wsr', 'x', 'xs']
```

```python
# Iterate over all threads in /b/
for thread in fourchan.get_threads("b"):
    # Do stuff with the thread
    print(thread.title)
    # You can also iterate over all the posts in the thread
    for post in fourchan.get_posts(thread):
        # Do stuff with the post - refer to the model documentation in pychan's README for details
        print(post.text)
```

Note: some boards do not have an archive (e.g. /b/). Such boards will either return an empty list or raise an exception, depending on how you have configured your `FourChan` instance.
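
The threads returned by this function will always have a `title` field containing the text shown in 4chan's interface under the "Excerpt" column header. This text can be either the thread's real title or a preview of the original post's text. Passing any thread returned by this method to the `get_posts()` method will automatically correct the `title` field (if necessary) on the thread attached to the returned posts. See the section on getting posts for a specific thread for more details.

Technically, pychan could work around this `title` behavior by issuing an additional HTTP request for each thread to get its real title, but in the spirit of making as few HTTP requests as possible, pychan uses the excerpt directly.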
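
If you need the real title for a particular archived thread, the behavior described above suggests a small helper along the following lines. This is only a sketch: `resolve_title` is a hypothetical name, not a pychan function, and it relies solely on `get_posts()`, `post.thread`, and `thread.title` as documented in this README. Note that it triggers the extra per-thread request that pychan avoids by default.

```python
from typing import Any, Optional


def resolve_title(fourchan: Any, thread: Any) -> Optional[str]:
    """Hypothetical helper: return the title reported after fetching the thread's posts."""
    # Fetching posts costs one additional request for this thread
    for post in fourchan.get_posts(thread):
        # Each returned post carries its own copy of the thread with corrected metadata
        return post.thread.title
    # No posts were returned, so fall back to the excerpt already on the thread
    return thread.title
```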
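
```python
for thread in fourchan.get_archived_threads("pol"):
    # Do stuff with the thread
    print(thread.title)
    # You can also iterate over all the posts in the thread
    for post in fourchan.get_posts(thread):
        # Do stuff with the post - refer to the model documentation in pychan's README for details
        print(post.text)
```

Performing searches against 4chan is more cumbersome than accessing the rest of 4chan's data. This is because 4chan has a Cloudflare firewall in front of its REST API, so the only way to get data from a search is to provide the HTTP request information needed to pass Cloudflare's anti-bot checks. Ultimately, this amounts to sending certain headers along with the HTTP request, but the challenge comes from actually obtaining those headers.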
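
Generating these headers for you is currently beyond the scope of pychan, so if you want to automate getting past the Cloudflare protections, you may want to consider using a third-party project built for that purpose.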
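
The manual way to obtain these values is to perform a 4chan search in a web browser and use the browser's developer tools to inspect the network requests made during the search. The request containing the Cloudflare values will be made to https://find.4chan.org/api with some query parameters. Once you have found this request, copy the `User-Agent` and `Cookie` values sent in your request and pass them to pychan's `search()` method. Note that Cloudflare cookies expire, so this manual workaround will only return results until Cloudflare invalidates your cookie. After that, you will need to obtain new values.

Note: closed/stickied/archived threads are never returned in search results.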
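
Browsers typically show the `Cookie` request header as a single string, while the `cloudflare_cookies` argument in this README's search example is a dictionary. The sketch below shows one way to bridge the two. `parse_cookie_header` is a hypothetical helper, not part of pychan, the cookie string is a placeholder, and keeping only `cf_clearance` is an assumption based on the example in this README.

```python
def parse_cookie_header(raw_header: str) -> dict[str, str]:
    """Hypothetical helper: split a raw Cookie header value into a name -> value dict."""
    cookies: dict[str, str] = {}
    for part in raw_header.split(";"):
        # partition() splits on the first "=" only, so values may contain "="
        name, separator, value = part.strip().partition("=")
        if name and separator:
            cookies[name] = value
    return cookies


# Placeholder value in the shape a browser's developer tools would display
raw = "cf_clearance=example-token-123; other_cookie=abc"
all_cookies = parse_cookie_header(raw)

# Assumption: only the Cloudflare clearance cookie is needed
cloudflare_cookies = {k: v for k, v in all_cookies.items() if k == "cf_clearance"}
```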

```python
# This "threads" variable will contain a Python Generator (not a list) in order to facilitate laziness
threads = fourchan.search(
    board="b",
    text="ylyl",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    cloudflare_cookies={
        "cf_clearance": "bm2RICpcDeR4cXoC2nfI_cnZcbAkN4UYpN6c1zzeb8g-1440859602-0-160"
    },
)

for thread in threads:
    # The thread object is the same class as the one returned by get_threads()
    for post in fourchan.get_posts(thread):
        # Do stuff with the post - refer to the model documentation in pychan's README for details
        print(post.text)
```

```python
from pychan.models import Thread

# Instantiate a Thread instance with which to query for posts
thread = Thread("int", 168484869)

# Note: the thread contained within the returned posts will have all applicable metadata (such as
# title and sticky status), regardless of whether you provided such data above - pychan will
# "auto-discover" all metadata and include it in the post models' copy of the thread
posts = fourchan.get_posts(thread)
```

The following tables summarize all of the kinds of data available on the various models used by this library.
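
Also note that all model classes in pychan implement the following methods:

- `__repr__`
- `__str__`
- `__hash__`
- `__eq__`
- `__iter__` - this is implemented so that models can be passed to Python's `tuple()` function
- `__copy__`
- `__deepcopy__`

The following table corresponds to the `pychan.models.Thread` class.
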
| Field | Type | Example Value(s) |
|---|---|---|
| `thread.board` | `str` | `"b"`, `"int"` |
| `thread.number` | `int` | `882774935`, `168484869` |
| `thread.title` | `Optional[str]` | `None`, `"YLYL thread"` |
| `thread.is_stickied` | `bool` | `True`, `False` |
| `thread.is_closed` | `bool` | `True`, `False` |
| `thread.is_archived` | `bool` | `True`, `False` |
| `thread.url` | `str` | `"https://boards.4chan.org/a/thread/251097344"` |

The following table corresponds to the `pychan.models.Post` class.

| Field | Type | Example Value(s) |
|---|---|---|
| `post.thread` | `Thread` | `pychan.models.Thread` |
| `post.number` | `int` | `882774935`, `882774974` |
| `post.timestamp` | `datetime.datetime` | `datetime.datetime` |
| `post.poster` | `Poster` | `pychan.models.Poster` |
| `post.text` | `str` | `">be me\n>be bored\n>write pychan\n>somehow it works"` |
| `post.is_original_post` | `bool` | `True`, `False` |
| `post.file` | `Optional[File]` | `None`, `pychan.models.File` |
| `post.replies` | `list[Post]` | `[]`, `[pychan.models.Post, pychan.models.Post]` |
| `post.url` | `str` | `"https://boards.4chan.org/a/thread/251097344#p251097419"` |

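The `replies` field shown above is purely a convenience feature provided by pychan for accessing all posts in the thread that used the `>>` operator to "reply" to the current post. You do not need the `replies` field to access all available posts in the thread, however. When you call the `get_posts()` method, you still receive every post, in the order they were posted, as a single flat list.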
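
As an illustration of how the `replies` field could be used, the sketch below counts every distinct post reachable from a given post by following replies. `StubPost` is a stand-in defined here so the example runs without network access, and `count_descendants` is a hypothetical helper. The sketch assumes that each reply is itself a post with its own `replies` list, and it deduplicates by post number because one post can reply to several others.

```python
from dataclasses import dataclass, field


@dataclass
class StubPost:
    """Stand-in for pychan.models.Post with only the fields used below."""

    number: int
    replies: list["StubPost"] = field(default_factory=list)


def count_descendants(post) -> int:
    """Hypothetical helper: count distinct posts reachable by following replies."""
    seen = set()
    stack = list(post.replies)
    while stack:
        current = stack.pop()
        # A post that replies to several others would otherwise be counted twice
        if current.number in seen:
            continue
        seen.add(current.number)
        stack.extend(current.replies)
    return len(seen)
```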

The following table corresponds to the `pychan.models.Poster` class.

| Field | Type | Example Value(s) |
|---|---|---|
| `poster.name` | `str` | `"Anonymous"` |
| `poster.is_moderator` | `bool` | `True`, `False` |
| `poster.id` | `Optional[str]` | `None`, `"BYagKQXI"` |
| `poster.flag` | `Optional[str]` | `None`, `"United States"`, `"Canada"` |

The following table corresponds to the `pychan.models.File` class.

| Field | Type | Example Value(s) |
|---|---|---|
| `file.url` | `str` | `"https://i.4cdn.org/pol/1658892700380132.jpg"` |
| `file.name` | `str` | `"wojak.jpg"`, `"i feel alone.jpg"` |
| `file.size` | `str` | `"601 KB"` |
| `file.dimensions` | `tuple[int, int]` | `(1920, 1080)`, `(800, 600)` |
| `file.is_spoiler` | `bool` | `True`, `False` |

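
Because `file.size` is a string such as `"601 KB"`, comparing or sorting files by size requires converting it to a number first. The sketch below is one way to do that. `approximate_bytes` is a hypothetical helper, not part of pychan, and it assumes the string is always a number, a space, and one of the units `B`, `KB`, or `MB`, with binary multiples of 1024. The table above only confirms the `"601 KB"` shape.

```python
# Assumption: sizes use binary multiples and only these units appear
_UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024 ** 2}


def approximate_bytes(size: str) -> int:
    """Hypothetical helper: convert a string like "601 KB" to an approximate byte count."""
    value, unit = size.split()
    # An unrecognized unit raises KeyError rather than silently guessing
    return round(float(value) * _UNIT_FACTORS[unit.upper()])
```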
For developer-oriented information, see CONTRIBUTING.