在这个项目中,我试图复制Sergey Brin和Larry Page开发的基于链接的排名系统在1998年的论文“ Pagerank引文排名:将秩序列入网络”。该算法仍然是Google的Web搜索工具的基础。兰维尔(Langville)和迈耶(Meyer)在其2004年的《 Pagerank Inside》(Pagerank)的纸张中提供了有关Pagerank的构建和组成部分的其他指导。
我们首先要构建一个马尔可夫矩阵,每个条目都是“从状态I到州J的行为”(Langville and Meyer 2004)。该超链接矩阵转化为随机,不可还原和原始基质。该马尔可夫矩阵将融合到主要的特征向量。该向量是Pagerank向量,它指示图中每个网页的重要性。为此,BRIN和PAGE使用Power方法,该方法仅存储每次迭代的先前迭代,并为随机,不可减至和原始矩阵快速收敛。
访问pagerank.py文件以查看实现。特别是,power_method和make_personalization_vector函数是实现的核心。 power_method函数实施。该算法已迭代,直到收敛到足够低的误差为止,通常为1E-6。由此产生的特征向量Yeilds Pagerank载体。请注意,由于该函数将功率方法应用于Thex相对稀疏的超链接矩阵,将其转换为随机,不可减至和原始矩阵,因此该算法的运行时比我们明确定义了后者矩阵。同时,Make_Personalization_Vector方法接受查询,然后创建一个提及查询的文章的向量。这允许我们的power_method函数完成最针对性的页面排名,这些排名最常链接到我们已经知道的页面相关的页面。
run pagerank.py --data=./lawfareblog.csv.gz --search_query=weapons
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=4.5715e-03 url=www.lawfareblog.com/why-did-you-wait-moral-emptiness-and-drone-strikes
INFO:root:rank=1 pagerank=3.1107e-03 url=www.lawfareblog.com/dc-district-court-dismisses-journalists-drone-lawsuit
INFO:root:rank=2 pagerank=2.0231e-03 url=www.lawfareblog.com/revived-cia-drone-strike-program-comments-new-policy
INFO:root:rank=3 pagerank=1.9667e-03 url=www.lawfareblog.com/us-court-appeals-dc-circuit-dismisses-suit-over-us-drone-strike
INFO:root:rank=4 pagerank=1.1788e-03 url=www.lawfareblog.com/iran-shoots-down-us-drone-domestic-and-international-legal-implications
INFO:root:rank=5 pagerank=1.1620e-03 url=www.lawfareblog.com/slaughterbots-and-other-anticipated-autonomous-weapons-problems
INFO:root:rank=6 pagerank=1.1276e-03 url=www.lawfareblog.com/german-courts-weigh-legal-responsibility-us-drone-strikes
INFO:root:rank=7 pagerank=8.3738e-04 url=www.lawfareblog.com/shift-jsoc-drone-strikes-does-not-mean-cia-has-been-sidelined
INFO:root:rank=8 pagerank=7.8704e-04 url=www.lawfareblog.com/atomwaffen-division-member-pleads-guilty-firearms-charge
INFO:root:rank=9 pagerank=7.8570e-04 url=www.lawfareblog.com/waiving-imminent-threat-test-cia-drone-strikes-pakistan这些结果的“无人机”和“目标”的底部包含了更多示例,这些示例有助于说明新系统的效果,尽管该文件中的所有结果都应用了新的排名系统。
第1部分:测试基本功率方法实施
实现算法后,我们可以在6个网页图上测试基本算法,类似于Langville和Meyer 2004中使用的算法。我们获得以下结果:请注意,已打开详细标签,以说明在调试语句中,算法收敛到恒定的Epsilon项1E-6。这些页面的排名是,第四个URL排名最高,第一个URL是最低的URL。这些结果通过Mike Izbicki的实施来验证。
[In]: run pagerank.py --data=small.csv.gz --verbose
DEBUG:root:computing indices
DEBUG:root:computing values
DEBUG:root:root=i=0 accuracy=2.7229e-01
DEBUG:root:root=i=1 accuracy=1.3318e-01
DEBUG:root:root=i=2 accuracy=8.1429e-02
DEBUG:root:root=i=3 accuracy=3.7537e-02
DEBUG:root:root=i=4 accuracy=2.4824e-02
DEBUG:root:root=i=5 accuracy=1.2412e-02
DEBUG:root:root=i=6 accuracy=8.0711e-03
DEBUG:root:root=i=7 accuracy=4.3977e-03
DEBUG:root:root=i=8 accuracy=2.7713e-03
DEBUG:root:root=i=9 accuracy=1.5882e-03
DEBUG:root:root=i=10 accuracy=9.7799e-04
DEBUG:root:root=i=11 accuracy=5.7459e-04
DEBUG:root:root=i=12 accuracy=3.4924e-04
DEBUG:root:root=i=13 accuracy=2.0761e-04
DEBUG:root:root=i=14 accuracy=1.2532e-04
DEBUG:root:root=i=15 accuracy=7.4923e-05
DEBUG:root:root=i=16 accuracy=4.5074e-05
DEBUG:root:root=i=17 accuracy=2.6996e-05
DEBUG:root:root=i=18 accuracy=1.6217e-05
DEBUG:root:root=i=19 accuracy=9.7244e-06
DEBUG:root:root=i=20 accuracy=5.8908e-06
DEBUG:root:root=i=21 accuracy=3.5000e-06
DEBUG:root:root=i=22 accuracy=2.0959e-06
DEBUG:root:root=i=23 accuracy=1.2712e-06
INFO:root:rank=0 pagerank=6.8019e-01 url=4
INFO:root:rank=1 pagerank=5.3070e-01 url=6
INFO:root:rank=2 pagerank=4.1114e-01 url=5
INFO:root:rank=3 pagerank=2.0103e-01 url=2
INFO:root:rank=4 pagerank=1.5937e-01 url=3
INFO:root:rank=5 pagerank=1.4440e-01 url=1第2部分:搜索查询接下来,我们可以使用许多命令行参数来完善我们的搜索。该算法现在将返回与我们的查询最相关的这些页面。 --search_query参数接受一个字符串,并将其与每个链接进行比较,并过滤出不包括查询的链接。对于此部分,我使用Mike Izbicki教授准备的数据集,该数据集从Defense Blog www.lawfareblog.com上绘制了超链接。从这一点开始,我不会使用详细命令使结果更加简洁。以下结果与Izbicki教授实施中获得的结果相似。
如果我们进行搜索查询“电晕”以查找有关大流行的文章,则根据Pagerank的最相关链接:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --search_query=corona
DEBUG:root:computing indices
DEBUG:root:computing values
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=1.0038e-03 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=1 pagerank=8.9224e-04 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=2 pagerank=7.0390e-04 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=3 pagerank=6.9153e-04 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=4 pagerank=6.7041e-04 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=5 pagerank=6.6256e-04 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus
INFO:root:rank=6 pagerank=6.5046e-04 url=www.lawfareblog.com/congressional-homeland-security-committees-seek-ways-support-state-federal-responses-coronavirus
INFO:root:rank=7 pagerank=6.3620e-04 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=8 pagerank=6.1248e-04 url=www.lawfareblog.com/house-subcommittee-voices-concerns-over-us-management-coronavirus
INFO:root:rank=9 pagerank=6.0187e-04 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response接下来,我们将调整搜索查询为“特朗普”:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --search_query=trump
DEBUG:root:computing indices
DEBUG:root:computing values
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=5.7826e-03 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=5.2338e-03 url=www.lawfareblog.com/document-trump-revokes-obama-executive-order-counterterrorism-strike-casualty-reporting
INFO:root:rank=2 pagerank=5.1297e-03 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=3 pagerank=4.6599e-03 url=www.lawfareblog.com/dc-circuit-overrules-district-courts-due-process-ruling-qasim-v-trump
INFO:root:rank=4 pagerank=4.5934e-03 url=www.lawfareblog.com/donald-trump-and-politically-weaponized-executive-branch
INFO:root:rank=5 pagerank=4.3071e-03 url=www.lawfareblog.com/how-trumps-approach-middle-east-ignores-past-future-and-human-condition
INFO:root:rank=6 pagerank=4.0935e-03 url=www.lawfareblog.com/why-trump-cant-buy-greenland
INFO:root:rank=7 pagerank=3.7591e-03 url=www.lawfareblog.com/oral-argument-summary-qassim-v-trump
INFO:root:rank=8 pagerank=3.4509e-03 url=www.lawfareblog.com/dc-circuit-court-denies-trump-rehearing-mazars-case
INFO:root:rank=9 pagerank=3.4484e-03 url=www.lawfareblog.com/second-circuit-rules-mazars-must-hand-over-trump-tax-returns-new-york-prosecutors最后,我们将搜索查询更改为“伊朗”:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --search_query=iran
DEBUG:root:computing indices
DEBUG:root:computing values
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=5.1297e-03 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=1 pagerank=5.0168e-03 url=www.lawfareblog.com/update-military-commissions-continued-health-issues-recusal-motion-and-new-cell-al-iraqi
INFO:root:rank=2 pagerank=4.5746e-03 url=www.lawfareblog.com/praise-presidents-iran-tweets
INFO:root:rank=3 pagerank=4.4174e-03 url=www.lawfareblog.com/how-us-iran-tensions-could-disrupt-iraqs-fragile-peace
INFO:root:rank=4 pagerank=4.3659e-03 url=www.lawfareblog.com/haftar-attacking-tripoli-us-needs-re-engage-libya
INFO:root:rank=5 pagerank=3.4237e-03 url=www.lawfareblog.com/france-makes-play-try-foreign-fighters-iraq
INFO:root:rank=6 pagerank=2.6928e-03 url=www.lawfareblog.com/cyber-command-operational-update-clarifying-june-2019-iran-operation
INFO:root:rank=7 pagerank=2.2567e-03 url=www.lawfareblog.com/document-sens-kaine-and-young-introduce-bill-revoke-iraq-force-authorizations
INFO:root:rank=8 pagerank=1.9391e-03 url=www.lawfareblog.com/aborted-iran-strike-fine-line-between-necessity-and-revenge
INFO:root:rank=9 pagerank=1.8263e-03 url=www.lawfareblog.com/its-not-only-iraq-and-syria第3部分:对网页结构的担忧大多数网站都有很多结构,因为大多数页面都连接到主页和其他一些广泛的页面,例如www.lawfareblog.com/topics。因为Pagerank确实链接排名,所以许多页面链接到的站点经常但并非总是具有更高的评分。如果我们检查www.lawfareblog.com上最大的Pageranks,我们可以看到其中几个宽页出现:
run pagerank.py --data=./lawfareblog.csv.gz
DEBUG:root:computing indices
DEBUG:root:computing values
INFO:root:rank=0 pagerank=2.8741e-01 url=www.lawfareblog.com/about-lawfare-brief-history-term-and-site
INFO:root:rank=1 pagerank=2.8741e-01 url=www.lawfareblog.com/lawfare-job-board
INFO:root:rank=2 pagerank=2.8741e-01 url=www.lawfareblog.com/litigation-documents-resources-related-travel-ban
INFO:root:rank=3 pagerank=2.8741e-01 url=www.lawfareblog.com/subscribe-lawfare
INFO:root:rank=4 pagerank=2.8741e-01 url=www.lawfareblog.com/our-comments-policy
INFO:root:rank=5 pagerank=2.8741e-01 url=www.lawfareblog.com/upcoming-events
INFO:root:rank=6 pagerank=2.8741e-01 url=www.lawfareblog.com/support-lawfare
INFO:root:rank=7 pagerank=2.8741e-01 url=www.lawfareblog.com/snowden-revelations
INFO:root:rank=8 pagerank=2.8741e-01 url=www.lawfareblog.com/topics
INFO:root:rank=9 pagerank=2.8741e-01 url=www.lawfareblog.com/documents-related-mueller-investigation
但是,如果我们想了解博客最受欢迎的内容是什么,这些页面通常没有用。为了收集有关文章的数据,这些数据通常比链接的页面较少,而不是更宽的页面,我们可以使用--filter_ratio参数。它删除了“所有页面都超过指定的口粮”(Izbicki教授)。现在,我们可以估计最重要的文章:
run pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2
DEBUG:root:computing indices
DEBUG:root:computing values
INFO:root:rank=0 pagerank=3.4696e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=2.9521e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=2.9040e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=1.5179e-01 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=5 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=6 pagerank=1.5071e-01 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=1.4957e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=1.4367e-01 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=1.4240e-01 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull
第4部分:eigengaps p barbar矩阵的特征由
[In]: run pagerank.py --data=./lawfareblog.csv.gz --verbose
[In]: run pagerank.py --data=./lawfareblog.csv.gz --verbose --alpha=0.99999
[In]: run pagerank.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2
[In]: run pagerank.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2 --alpha=0.99999
虽然前三个命令进行了15、11和22次迭代,但最后一个命令进行了684次迭代。这是因为过滤的P矩阵具有较小的特征,因此在较大的alpha边界,收敛需要更长的时间。 Alpha值的这种变化也导致不同的Pagerank排名。首先,第三个命令的结果,该命令过滤P图,但不使用大型alpha:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2
INFO:root:rank=0 pagerank=3.4696e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=2.9521e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=2.9040e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=1.5179e-01 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=5 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=6 pagerank=1.5071e-01 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=1.4957e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=1.4367e-01 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=1.4240e-01 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull
然后,第四命令的结果,该命令过滤p图并使用一个大的alpha:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2 --alpha=0.99999
INFO:root:rank=0 pagerank=7.0149e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=7.0149e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.0552e-01 url=www.lawfareblog.com/cost-using-zero-days
INFO:root:rank=3 pagerank=3.1755e-02 url=www.lawfareblog.com/lawfare-podcast-former-congressman-brian-baird-and-daniel-schuman-how-congress-can-continue-function
INFO:root:rank=4 pagerank=2.2040e-02 url=www.lawfareblog.com/events
INFO:root:rank=5 pagerank=1.6027e-02 url=www.lawfareblog.com/water-wars-increased-us-focus-indo-pacific
INFO:root:rank=6 pagerank=1.6026e-02 url=www.lawfareblog.com/water-wars-drill-maybe-drill
INFO:root:rank=7 pagerank=1.6023e-02 url=www.lawfareblog.com/water-wars-disjointed-operations-south-china-sea
INFO:root:rank=8 pagerank=1.6020e-02 url=www.lawfareblog.com/water-wars-song-oil-and-fire
INFO:root:rank=9 pagerank=1.6020e-02 url=www.lawfareblog.com/water-wars-sinking-feeling-philippine-china-relations 第1部分:基本功率方法实现个性化向量是通过查询过滤的替代方法。个性化向量确定哪些网页最常见于有关查询本身的页面。这与以前的--search_query参数不同,该参数确定了整个Pagerank,然后是与查询相关的最高匹配的过滤器。为了证明这一点,我们可以在使用个性化向量与--search_query参数时比较相同查询的排名。
[In]: run pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query=corona
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=6.3213e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=6.3211e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.4962e-01 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=1.1626e-01 url=www.lawfareblog.com/brexit-not-immune-coronavirus
INFO:root:rank=4 pagerank=1.1626e-01 url=www.lawfareblog.com/rational-security-my-corona-edition
INFO:root:rank=5 pagerank=8.8833e-02 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=6 pagerank=8.5443e-02 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=7 pagerank=8.5443e-02 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=8 pagerank=7.1883e-02 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=9 pagerank=6.8968e-02 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response [In]: run pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --search_query=corona
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=8.1320e-03 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=1 pagerank=7.7908e-03 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=2 pagerank=5.2262e-03 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response
INFO:root:rank=3 pagerank=3.9584e-03 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=4 pagerank=3.8114e-03 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=5 pagerank=3.3973e-03 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=6 pagerank=3.3633e-03 url=www.lawfareblog.com/cyberlaw-podcast-how-israel-fighting-coronavirus
INFO:root:rank=7 pagerank=3.3557e-03 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=8 pagerank=3.2160e-03 url=www.lawfareblog.com/congress-needs-coronavirus-failsafe-its-too-late
INFO:root:rank=9 pagerank=3.1036e-03 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus第2部分:查找相关的文章,但不要提及我们的查询,个性化向量特别有用,因为它跟踪哪些文章与查询最相关。这意味着,如果我们想了解间接效果我们的查询可能会产生,我们仍然可以使用相同的个性化向量搜索查询。同时,我们可以使用--search_query参数仅介绍这些文章不包括查询本身。因为减号不会在Gensim模型中,所以这些结果通过而无需搜索相关单词。我得到以下“电晕”的结果:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query=corona --search_query=-corona
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=6.3213e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=6.3211e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.4962e-01 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=8.8833e-02 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=4 pagerank=6.8562e-02 url=www.lawfareblog.com/lawfare-podcast-mom-and-dad-talk-clinical-trials-pandemic
INFO:root:rank=5 pagerank=6.5838e-02 url=www.lawfareblog.com/fault-lines-foreign-policy-quarantined
INFO:root:rank=6 pagerank=6.1389e-02 url=www.lawfareblog.com/limits-world-health-organization
INFO:root:rank=7 pagerank=5.5939e-02 url=www.lawfareblog.com/chinatalk-dispatches-shanghai-beijing-and-hong-kong
INFO:root:rank=8 pagerank=5.4060e-02 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=9 pagerank=4.9363e-02 url=www.lawfareblog.com/us-moves-dismiss-case-against-company-linked-ira-troll-farm尽管这些文章主要与冠状病毒有关,但也有一些更明显的主题。例如,有一个重要的导弹防御听证会排名第9,这可能是在大流行期间发生的,并受到它的影响,但显然与冠状病毒直接相关。
第3部分:查找相关的文章,但不要提及我们的查询(续),我们可以查看以上文章相关的另一个示例,但没有明确提及查询。对于“伊朗”,我们得到以下结果:
[In]: run pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query=iran --search_query=-iran
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=3.9977e-01 url=www.lawfareblog.com/omphalos
INFO:root:rank=1 pagerank=2.4971e-01 url=www.lawfareblog.com/haftar-attacking-tripoli-us-needs-re-engage-libya
INFO:root:rank=2 pagerank=2.4398e-01 url=www.lawfareblog.com/cancellation-algerias-elections-opportunity-democratization
INFO:root:rank=3 pagerank=2.3989e-01 url=www.lawfareblog.com/yemen-houthi-strategy-has-promise-and-risk
INFO:root:rank=4 pagerank=2.3954e-01 url=www.lawfareblog.com/how-trumps-approach-middle-east-ignores-past-future-and-human-condition
INFO:root:rank=5 pagerank=2.1294e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=6 pagerank=1.5762e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=7 pagerank=1.5762e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=8 pagerank=1.1823e-01 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=9 pagerank=8.0923e-02 url=www.lawfareblog.com/senate-examines-threats-homeland
我认为这是一个有趣的扩展,特别是要了解有关伊朗国际影响力的更多信息。在过去的几十年中,伊朗在中东和北非的部分地区一直在慢慢发展。在排名最高的文章中,这种影响非常明显,因为有有关阿尔及利亚,利比亚和以色列的文章。关于伊朗与美国关系的一些文章的出现也说明了2020年初引起的轰炸和紧张局势。这些结果表明http://www.lawfareblog.com/认为美国关系和中东地区政治是与伊朗有关的一些重要主题。
这里还有一些突出显示更新的排名系统的示例:这些结果用于“无人机”:
run pagerank.py --data=./lawfareblog.csv.gz --search_query=drones
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=4.5715e-03 url=www.lawfareblog.com/why-did-you-wait-moral-emptiness-and-drone-strikes
INFO:root:rank=1 pagerank=3.1107e-03 url=www.lawfareblog.com/dc-district-court-dismisses-journalists-drone-lawsuit
INFO:root:rank=2 pagerank=2.0231e-03 url=www.lawfareblog.com/revived-cia-drone-strike-program-comments-new-policy
INFO:root:rank=3 pagerank=1.9667e-03 url=www.lawfareblog.com/us-court-appeals-dc-circuit-dismisses-suit-over-us-drone-strike
INFO:root:rank=4 pagerank=1.7738e-03 url=www.lawfareblog.com/whats-point-charging-foreign-state-linked-hackers
INFO:root:rank=5 pagerank=1.2885e-03 url=www.lawfareblog.com/video-justice-department-announces-indictment-two-chinese-government-hackers
INFO:root:rank=6 pagerank=1.1973e-03 url=www.lawfareblog.com/daisy-chain-associated-forces-potential-use-force-niger-against-al-mourabitoun
INFO:root:rank=7 pagerank=1.1788e-03 url=www.lawfareblog.com/iran-shoots-down-us-drone-domestic-and-international-legal-implications
INFO:root:rank=8 pagerank=1.1620e-03 url=www.lawfareblog.com/slaughterbots-and-other-anticipated-autonomous-weapons-problems
INFO:root:rank=9 pagerank=1.1506e-03 url=www.lawfareblog.com/dhss-joint-task-forces-next-chapter这些结果是针对“目标”:
run pagerank.py --data=./lawfareblog.csv.gz --search_query=targets
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=6.0299e-03 url=www.lawfareblog.com/update-military-commissions-big-september-911-case
INFO:root:rank=1 pagerank=5.0168e-03 url=www.lawfareblog.com/update-military-commissions-continued-health-issues-recusal-motion-and-new-cell-al-iraqi
INFO:root:rank=2 pagerank=4.6573e-03 url=www.lawfareblog.com/targeting-al-baghdadi-and-selective-notification-congress-assessing-issues
INFO:root:rank=3 pagerank=3.2598e-03 url=www.lawfareblog.com/federal-judge-dismisses-military-commissions-defendants-8th-amendment-claim
INFO:root:rank=4 pagerank=3.2329e-03 url=www.lawfareblog.com/military-commissions-judge-rules-against-government-privilege-claim-nashiri-case
INFO:root:rank=5 pagerank=3.2204e-03 url=www.lawfareblog.com/week-military-commissions-98-session-kangaroo-lapel-pin-edition
INFO:root:rank=6 pagerank=3.1839e-03 url=www.lawfareblog.com/military-commission-judge-bars-government-using-defendants-statements-fbi-clean-teams-911-case
INFO:root:rank=7 pagerank=3.1759e-03 url=www.lawfareblog.com/last-week-military-commissions-defense-moves-classification-review-process
INFO:root:rank=8 pagerank=3.1533e-03 url=www.lawfareblog.com/court-military-commissions-review-upholds-life-sentence-al-bahlul
INFO:root:rank=9 pagerank=2.9309e-03 url=www.lawfareblog.com/summary-dc-circuit-vacates-military-judges-rulings-al-nashiri