A simple library that lets you crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one OpenAI used for GPT-2.
This library uses Python 3.
git clone https://github.com/chiphuyen/lazynlp.git
cd lazynlp
pip3 install -r requirements.txt
pip3 install .
If you want to uninstall the library, use:
pip3 uninstall lazynlp
There are several major dumps of URLs that you can use.
These are the links to all submissions to Reddit, by month. You can download the raw dumps and process them to get the links. Keep in mind that each of these dumps is huge (100MB - 1GB).
@jcpeterson was kind enough to provide a list of deduplicated links with at least 3 karma, which you can download here.
There are about 23M URLs from between 2015-06 and 2018-10, of which around 40-60% are bad URLs (URLs that no longer exist or aren't scraper-friendly). This means that after downloading and cleaning all the good URLs, you should have approximately 10M webpages, or 50GB of pure text.
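If you process the raw monthly dumps yourself, the sketch below shows one way to pull out the outbound links. It assumes an uncompressed dump in newline-delimited JSON with "url" and "score" fields; the filenames RS_2017-01 and reddit_2017-01.urls are hypothetical.
import json

# read one monthly submissions dump (newline-delimited JSON) and write out
# the outbound links of submissions with at least 3 karma
with open('RS_2017-01', encoding='utf-8') as dump, open('reddit_2017-01.urls', 'w') as out:
    for line in dump:
        post = json.loads(line)
        url = post.get('url', '')
        if post.get('score', 0) >= 3 and url.startswith('http'):
            out.write(url + '\n')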
You can download the list of all URLs to US Gutenberg books here. There are 50K books, which convert to about 14GB of pure text.
You can also run lazynlp.get_us_gutenberg_links() to get the same list. For example, if you want to get all the US Gutenberg URLs and store them in the file us_gutenberg.urls, run the following command. This might take half a day.
lazynlp.get_us_gutenberg_links('us_gutenberg.urls')
You can download the list of all URLs to Australian Gutenberg books here. There are 4K books, which convert to about 1GB of pure text.
You can also run lazynlp.get_aus_gutenberg_links() to get the same list. For example, if you want to get all the Australian Gutenberg URLs and store them in the file aus_gutenberg.urls:
lazynlp.get_aus_gutenberg_links('aus_gutenberg.urls')
You can download the Wikipedia dumps here.
You don't want to download the same URL multiple times. There are two functions to help you deduplicate all URLs:
lazynlp.dedup_lines(files, outfold)
This function takes in a list of files (in each file, each line is a URL) and deduplicates each file against all the files that precede it. All the deduplicated files are saved in outfold.
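For example, a minimal sketch with hypothetical filenames, assuming each listed file holds one URL per line:
import lazynlp

# each file is deduplicated against the files listed before it;
# the deduplicated copies are written into the folder 'deduped_urls'
url_files = ['reddit_2017-01.urls', 'reddit_2017-02.urls', 'us_gutenberg.urls']
lazynlp.dedup_lines(url_files, 'deduped_urls')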
lazynlp.dedup_lines_from_new_file(original_files, new_file, outfile)
This function lets you deduplicate a new file (new_file) against all the previously deduplicated files (original_files), writing the result to outfile.
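For example, a minimal sketch for when a new batch of URLs arrives later (the filenames are hypothetical):
import lazynlp

# URL files that were already deduplicated in an earlier pass
original_files = ['deduped_urls/reddit_2017-01.urls', 'deduped_urls/reddit_2017-02.urls']

# deduplicate the new batch against them and write the result to the output file
lazynlp.dedup_lines_from_new_file(original_files, 'reddit_2017-03.urls', 'reddit_2017-03.dedup.urls')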
If you want to download each webpage separately, call:
lazynlp.download_page(link, context=None, timeout=None)
If you want to download from a file that contains a list of URLs, call:
lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True, extensions=[], domains=[])
"""
link_file:
file contains links to webpages to crawl. Each line contains one URL.
folder:
folder that you want to contain your downloaded pages.
timeout:
seconds to wait for a page to respond before abandoning it.
default_skip:
set to True if you want to automatically skip all URLs that contain domains and extensions that are known to be scraper-unfriendly or NSFW.
You can see the list of excluded domains at lazynlp/exclude_domains.txt.
You can see the list of excluded extensions at lazynlp/exclude_extensions.txt
You can also add your own domains and extensions to skip using the domains and extensions arguments.
In the folder:
Each URL is downloaded into a file, indexed by the order in which it is downloaded. The first line of each file is the URL. The rest is the textual content of the page.
index.urls contains all the URLs that have been successfully downloaded.
bad.urls contains the URLs that are bad.
connection.urls contains the URLs that haven't been downloaded because of connection issues.
non_ascii.urls contains the URLs that haven't been downloaded because of bad encoding issues.
empty.urls contains the URLs that have empty textual content.
"""
If you have a lot of URLs, you can divide the list into multiple files and call this function separately on each of them. I was able to run 40 scripts in parallel. I suppose I could have parallelized the code itself; I just found this easier.
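The author simply launched separate scripts by hand; the sketch below is one hypothetical way to get a similar effect from a single driver script, by splitting the URL list into chunk files and running download_pages on each chunk in its own process. The chunk filenames are illustrative (e.g. produced with `split -l 100000 all.urls chunk_`).
import lazynlp
from multiprocessing import Process

def crawl_chunk(chunk_file, out_folder):
    # each process crawls and cleans one chunk of URLs into its own folder
    lazynlp.download_pages(chunk_file, out_folder, timeout=30)

if __name__ == '__main__':
    chunks = ['chunk_aa', 'chunk_ab', 'chunk_ac', 'chunk_ad']
    procs = [Process(target=crawl_chunk, args=(c, 'pages_' + c)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()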
You can get rid of all HTML tags, decode UTF-8 into strings, transliterate foreign characters, collapse whitespace, replace unprintable characters, unescape HTML, and so on, using the methods available in lazynlp/cleaner.py.
You can also just call the following function to do most of the processing:
lazynlp.clean_page(page)
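For example, a minimal sketch that cleans a page you have saved to disk; it assumes the page content is passed in as a string, and the filename some_page.html is hypothetical.
import lazynlp

# hypothetical raw HTML saved earlier
with open('some_page.html', encoding='utf-8') as f:
    raw = f.read()

text = lazynlp.clean_page(raw)   # strip HTML tags, collapse whitespace, unescape HTML, ...
print(text[:500])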
Note: in this library, the function lazynlp.download_pages() does both the crawling and the cleaning, so the webpages you end up with are pure text, like this:
http://www.thecannabist.co/2017/03/02/jeff-sessions-russia-resign-democrats/74687/
Attorney general nominee Sen. Jeff Sessions, R-Ala., testifies on Capitol Hill in Washington on Jan. 10, 2017, in the first day of his confirmation hearing before the Senate Judiciary Committee. Top Democrats now say that because he misled the committee about his visits to Russia, he should resign. (Andrew Harnik, The Associated Press)
House Oversight and Government Reform Committee Chairman Jason Chaffetz, R-Utah, tweeted early Thursday that "AG Sessions should clarify his testimony and recuse himself."
Later, Sen. Rob Portman, R-Ohio, said in a statement, "Jeff Sessions is a former colleague and a friend, but I think it would be best for him and for the country to recuse himself from the DOJ Russia probe."
House Majority Leader Kevin McCarthy, R-Calif., also initially said during an appearance on MSNBC's "Morning Joe" that Sessions should bow out.
Asked whether Sessions should recuse himself in this situation, McCarthy replied "I think the trust of the American people -- you recuse yourself in these situations, yes."
McCarthy was pressed a second time about whether he was calling for Sessions to recuse himself and he confirmed that he believed the situation required a recusal.
"I think it would be easier from that standpoint, yes," McCarthy said.
But McCarthy later said his comment had been misinterpreted, telling Fox News' "Fox and Friends," "I'm not calling on him to recuse himself. I was asked on 'Morning Joe,' if he needs to recuse himself as going forward. As you just heard, Attorney General Sessions said he would recuse himself going forward -- appropriate, and that's all my answer was."
The comments from prominent Republicans follow revelations that Sessions met with the Russian ambassador during election season. Under oath in front of the Senate Judiciary Committee for his confirmation hearing in January, Sessions had said that he had not met with any Russian officials.
Senate Minority Leader Charles Schumer, D-N.Y., joined growing Democratic calls for Sessions to either resign or at least recuse himself from any investigations into Russia's meddling in U.S. elections.
"Attorney General Sessions cannot possibly lead an investigation into Russian interference in our elections or come anywhere near it. With these revelations, he may indeed become the subject of it," Schumer told reporters. "Better for the country if he resigns, but let's get an investigation going."
Because the Department of Justice should be above reproach, for the good of the country, the Attorney General should resign.
To avoid any piece of text being over-represented, you only want to include pages that don't significantly overlap with other pages.
To estimate how much a set of target files overlaps with a set of source files, use this function:
lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8, capacity=10000, error_rate=1e-5, header=0, interval=100000)
gran is the granularity of the tokens: 'char' or 'word' level.
n is the n-gram size.
capacity and error_rate are for the BloomFilter used.
header is the number of lines to skip at the top of each file. This is because in our format, the first line of each file is the URL.
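For example, a minimal sketch with hypothetical folder names; header=1 skips the URL line that lazynlp puts at the top of each downloaded page:
import glob
import lazynlp

source_files = glob.glob('webtext_pages/*.txt')   # pages already in the dataset
target_files = glob.glob('wiki_pages/*.txt')      # candidate pages to check

# estimate how much of the targets' word 8-grams already appear in the sources
lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8, header=1)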
To estimate how much a target file overlaps with an existing BloomFilter, use this function:
lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8, header=0)
Given a list of files, e.g. cleaned webpages, to filter out all the files whose overlap with the other files exceeds threshold, use this function:
lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, capacity=100000000, error_rate=1e-7, header=0, interval=1000000)
The names of all the files that are deemed duplicates are stored in dupped_files.list.
The names of all the files used for the dataset are stored in clean_files.list.
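For example, a minimal sketch over the pages downloaded in step 3 (the folder name is hypothetical); again, header=1 skips the URL line at the top of each file:
import glob
import lazynlp

pages = glob.glob('pages/*.txt')   # cleaned pages downloaded earlier

# drop every file whose word 8-gram overlap with the files kept so far exceeds 50%;
# kept names go to clean_files.list, dropped ones to dupped_files.list
lazynlp.filter_files(pages, threshold=0.5, gran='word', n=8, header=1)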
1GB of text is about 1B characters. An English word has on average 4.5 characters, or 5.5 including whitespace, so 1GB of text is about 181M words.
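As a quick back-of-the-envelope check of that figure:
chars_per_gb = 1e9      # roughly 1B characters per GB of text
chars_per_word = 5.5    # average English word length, whitespace included
print(chars_per_gb / chars_per_word)   # ~1.8e8, i.e. roughly the 181M words quoted above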
When I ran 30 scripts in parallel, it took 3 hours to download and clean 1GB of pure text, so getting 50GB of pure text would take roughly 150 hours, or about six days.
The OpenAI dataset has 40GB of text, which I estimate to contain about 7-8 billion words. If you download all the webpages from the good Reddit URLs and the Gutenberg books, you should have a dataset bigger than OpenAI's WebText.
In their GPT-2 paper, OpenAI didn't include Wikipedia articles for fear of overlap. You can choose whether to include them by checking the overlap with lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8).