A simple library that lets you crawl, clean, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create a dataset larger than the one OpenAI used for GPT-2.
This library uses Python 3.
git clone https://github.com/chiphuyen/lazynlp.git
cd lazynlp
pip3 install -r requirements.txt
pip3 install .
If you want to uninstall the library, use:
pip3 uninstall lazynlp
There are several major dumps of URLs that you can use.
These are links to all submissions to Reddit by month. You can download the raw dumps and process them to get the links. Keep in mind that each of these dumps is huge (100MB - 1GB).
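If you process the raw dumps yourself, a minimal sketch like the one below might help. It assumes the dump has already been decompressed into newline-delimited JSON and that each submission record carries "url" and "score" fields; the file names and those field names are assumptions about the dump format, not part of this library:

import json

def extract_links(dump_path, out_path, min_karma=3):
    # Extract external URLs with at least `min_karma` score from a
    # decompressed, newline-delimited JSON Reddit submission dump.
    # The "url" and "score" field names are assumptions about the format.
    with open(dump_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            try:
                post = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            url = post.get('url', '')
            if post.get('score', 0) >= min_karma and url.startswith('http'):
                fout.write(url + '\n')

extract_links('RS_2018-10.json', 'reddit_2018-10.urls')  # hypothetical file names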
@jcpeterson was kind enough to provide a list of deduplicated links with at least 3 karma that you can download here.
There are about 23 million URLs from between 2015-06 and 2018-10, of which about 40-60% are bad URLs (URLs that no longer exist or that aren't scraper-friendly). That means after downloading and cleaning all the good URLs, you should have approximately 10 million webpages, or 50GB of pure text.
You can download the list of all URLs to the US Gutenberg books here. There are 50K books, which convert to approximately 14GB of pure text.
You can also run lazynlp.get_us_gutenberg_links() to get the same list. For example, if you want to get all the Gutenberg URLs and store them in the file us_gutenberg.urls, run the following command. This might take half a day.
lazynlp.get_us_gutenberg_links('us_gutenberg.urls')
You can download the list of all URLs to the Australian Gutenberg books here. There are 4K books, which convert to approximately 1GB of pure text.
You can also run lazynlp.get_aus_gutenberg_links() to get the same list. For example, if you want to get all the Gutenberg URLs and store them in the file aus_gutenberg.urls:
lazynlp.get_aus_gutenberg_links('aus_gutenberg.urls')
You can download the Wikipedia dumps here.
You don't want to download the same URL multiple times. There are two functions that help you deduplicate all URLs:
lazynlp.dedup_lines(files, outfold)
This function takes in a list of files (in each file, each line is a URL) and deduplicates each file against all previous files. All the deduplicated files are saved in outfold.
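For example, the call below deduplicates three monthly Reddit URL files against each other and writes the results into a folder (the file names are hypothetical):

import lazynlp

# Deduplicate each monthly URL file against the ones before it;
# the deduplicated copies are written into the output folder.
lazynlp.dedup_lines(['reddit_2018-08.urls', 'reddit_2018-09.urls', 'reddit_2018-10.urls'],
                    'deduped_urls')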
lazynlp.dedup_lines_from_new_file(original_files, new_file, outfile)
This function allows you to deduplicate a new file against all previously deduplicated files (original_files), saving the result in outfile.
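For example, when a new monthly dump arrives, you might deduplicate it against the files you have already deduplicated (the file names are hypothetical):

import lazynlp

# Deduplicate the new month's URLs against the already-deduplicated files.
lazynlp.dedup_lines_from_new_file(['deduped_urls/reddit_2018-09.urls', 'deduped_urls/reddit_2018-10.urls'],
                                  'reddit_2018-11.urls',
                                  'deduped_urls/reddit_2018-11.urls')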
If you want to download each webpage separately, call:
lazynlp.download_page(link, context=None, timeout=None)
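A minimal usage sketch. It assumes download_page() returns the page's textual content (or None on failure); that return behavior is an assumption you should verify against the source:

import lazynlp

# Assumed: download_page() returns the page's text, or None if the download fails.
page = lazynlp.download_page('http://example.com/article.html', timeout=30)  # hypothetical URL
if page is not None:
    with open('page_0001.txt', 'w') as f:
        f.write(page)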
If you want to download from a file that contains a list of URLs, call:
lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True, extensions=[], domains=[])
"""
link_file:
file contains links to webpages to crawl. Each line contains one URL.
folder:
folder that you want to contain your downloaded pages.
timeout:
seconds to wait for a page to respond before abandoning it.
default_skip:
set to True if you want to automatically skip all URLs that contain domains and extensions that are known to be scraper-unfriendly or NSFW.
You can see the list of excluded domains at lazynlp/exclude_domains.txt.
You can see the list of excluded extensions at lazynlp/exclude_extensions.txt
You can also add your own domains and extensions to skip with the domains and extensions arguments.
In the folder:
Each URL is downloaded into a file, indexed by the order in which it is downloaded. The first line of each file is the URL. The rest is the textual content of the page.
index.urls contains all the URLs that have been successfully downloaded.
bad.urls contains the URLs that are bad.
connection.urls contains the URLs that haven't been downloaded because of connection issues.
non_ascii.urls contains the URLs that haven't been downloaded because of bad encoding issues.
empty.urls contains the URLs that have empty textual content.
"""
If you have a lot of URLs, you can divide the list into multiple files and call this function on each of them separately. I was able to run 40 scripts in parallel. I guess I could have parallelized the code; I just found this to be easier.
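If you'd rather drive the parallelism from a single script instead of launching 40 separate ones, a minimal sketch using Python's multiprocessing could look like this; the chunk file and folder names are placeholders:

import multiprocessing
import lazynlp

def download_chunk(idx):
    # Each worker downloads one chunk of the URL list into its own folder.
    lazynlp.download_pages('chunks/urls_{:02d}.txt'.format(idx),
                           'pages/chunk_{:02d}'.format(idx),
                           timeout=30)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=40) as pool:
        pool.map(download_chunk, range(40))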
You can get rid of all the HTML tags, decode UTF-8 into strings, transliterate foreign characters, collapse white space, replace unprintable characters, unescape HTML, etc. using the methods available in lazynlp/cleaner.py.
You can also call the following function to do most of the processing:
lazynlp.clean_page(page)
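A minimal sketch of cleaning a raw page you downloaded yourself. It assumes clean_page() takes the raw page content as a string and returns the cleaned text; verify that against lazynlp/cleaner.py:

import lazynlp

# Assumed: clean_page() takes raw page content and returns cleaned plain text.
with open('raw_page.html') as f:      # hypothetical raw HTML file
    raw = f.read()
text = lazynlp.clean_page(raw)
with open('clean_page.txt', 'w') as f:
    f.write(text)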
In this library, the function lazynlp.download_pages() does both the crawling and the cleaning, so the webpages you end up with are pure text, like this:
http://www.thecannabist.co/2017/03/02/jeff-sessions-russia-resign-democrats/74687/
Attorney general nominee Sen. Jeff Sessions, R-Ala., testifies on Capitol Hill in Washington on Jan. 10, 2017, in the first day of his confirmation hearing before the Senate Judiciary Committee. Top Democrats now say that because he misled the committee about his visits to Russia, he should resign. (Andrew Harnik, The Associated Press)
House Oversight and Government Reform Committee Chairman Jason Chaffetz, R-Utah, tweeted early Thursday that "AG Sessions should clarify his testimony and recuse himself."
Later, Sen. Rob Portman, R-Ohio, said in a statement, "Jeff Sessions is a former colleague and a friend, but I think it would be best for him and for the country to recuse himself from the DOJ Russia probe."
House Majority Leader Kevin McCarthy, R-Calif., also initially said during an appearance on MSNBC's "Morning Joe" that Sessions should bow out.
Asked whether Sessions should recuse himself in this situation, McCarthy replied "I think the trust of the American people -- you recuse yourself in these situations, yes."
McCarthy was pressed a second time about whether he was calling for Sessions to recuse himself and he confirmed that he believed the situation required a recusal.
"I think it would be easier from that standpoint, yes," McCarthy said.
But McCarthy later said his comment had been misinterpreted, telling Fox News' "Fox and Friends," "I'm not calling on him to recuse himself. I was asked on 'Morning Joe,' if he needs to recuse himself as going forward. As you just heard, Attorney General Sessions said he would recuse himself going forward -- appropriate, and that's all my answer was."
The comments from prominent Republicans follow revelations that Sessions met with the Russian ambassador during election season. Under oath in front of the Senate Judiciary Committee for his confirmation hearing in January, Sessions had said that he had not met with any Russian officials.
Senate Minority Leader Charles Schumer, D-N.Y., joined growing Democratic calls for Sessions to either resign or at least recuse himself from any investigations into Russia's meddling in U.S. elections.
"Attorney General Sessions cannot possibly lead an investigation into Russian interference in our elections or come anywhere near it. With these revelations, he may indeed become the subject of it," Schumer told reporters. "Better for the country if he resigns, but let's get an investigation going."
Because the Department of Justice should be above reproach, for the good of the country, the Attorney General should resign.
To avoid any piece of text being over-represented, you only want to include pages that don't overlap too much with other pages.
To estimate the amount of overlap of target files with certain source files, use this function:
lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8, capacity=10000, error_rate=1e-5, header=0, interval=100000)
gran is the granularity of the tokens: 'char' or 'word' level.
n is the n-gram size.
capacity and error_rate are for the BloomFilter used.
header is the number of lines at the head of each file to skip. This is because in our format, the first line of each file is the URL.
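For example, you might estimate how much a batch of newly downloaded pages overlaps with pages you already kept. The file names below are placeholders; header=1 skips the URL line at the top of each downloaded file, and the assumption that the function returns one overlap estimate per target file is worth confirming against the source.

import lazynlp

# Word-level 8-gram overlap of each new page against the existing pages.
overlaps = lazynlp.estimate_overlap(['pages/existing_0001.txt', 'pages/existing_0002.txt'],
                                    ['pages/new_0001.txt', 'pages/new_0002.txt'],
                                    gran='word', n=8, header=1)
print(overlaps)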
To estimate the amount of overlap of a target file with an existing BloomFilter, use this function:
lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8, header=0)
Given a list of files, such as the cleaned webpages, to filter out all the files whose overlap with the other files exceeds threshold, use this function:
lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, capacity=100000000, error_rate=1e-7, header=0, interval=1000000)
The names of all the files that are considered duplicated are stored in dupped_files.list.
The names of all the files to be used for the dataset are stored in clean_files.list.
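For example, to filter out every downloaded page whose word-level 8-gram overlap with the other pages exceeds 50% (the glob pattern is just a placeholder for however your page files are laid out):

import glob
import lazynlp

# Keep pages with <= 50% overlap; the results are recorded in
# clean_files.list and dupped_files.list as described above.
files = glob.glob('pages/**/*.txt', recursive=True)
lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, header=1)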
1GB of text is about 1 billion characters. An English word has on average 4.5 characters, or 5.5 including the trailing whitespace. So 1GB of text is equivalent to about 181 million words.
When I ran 30 scripts in parallel, it took 3 hours to download and clean 1GB of pure text. So getting 50GB of pure text would take about 5 days.
The OpenAI dataset is 40GB, which I estimate contains about 7-8 billion words. If you download all the webpages from the good Reddit URLs and the Gutenberg books, you should have a dataset bigger than OpenAI's WebText.
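As a quick sanity check, the arithmetic behind those estimates:

chars_per_gb = 10 ** 9     # ~1 billion characters per GB of text
chars_per_word = 5.5       # 4.5 characters per English word plus a space
words_per_gb = chars_per_gb / chars_per_word
print(int(words_per_gb))                       # ~181 million words per GB
print(round(40 * words_per_gb / 10 ** 9, 1))   # a 40GB dataset is ~7.3 billion words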
In their GPT-2 paper, OpenAI didn't include Wikipedia articles for fear of overlap. If you want to include them, you can use lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8) to check each article's overlap with the rest of your dataset first.