lazynlp

A straightforward library that lets you crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one OpenAI used for GPT-2.

Setup

This library uses Python 3.

Clone this library and cd into the lazynlp folder:

 git clone https://github.com/chiphuyen/lazynlp.git
cd lazynlp

Install the dependencies:

pip3 install -r requirements.txt

Install the library:

pip3 install .

If you want to uninstall the library, use:

pip3 uninstall lazynlp

How to create a massive dataset using lazynlp:

Step 1. Obtain the URLs of the webpages you want to crawl

There are several major dumps of URLs available that you can use.

Reddit URLs

This is the link to all submissions to Reddit by month. You can download the raw dumps and process them to get the links. Keep in mind that each of these dumps is huge (100 MB - 1 GB).

jcpeterson is kind enough to provide a list of deduplicated links with at least 3 karma that you can download here.

There are about 23M URLs from between 2015-06 and 2018-10, of which around 40-60% are bad URLs (URLs that no longer exist or aren't scraper-friendly). This means that after you've downloaded and cleaned all the good URLs from this list, you should have approximately 10M webpages or 50 GB of pure text.
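The yield estimate above can be checked with a few lines of integer arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative and the inputs are the rough figures quoted in the text.

```python
# Rough figures quoted above for the deduplicated Reddit link list.
total_urls = 23_000_000
bad_pct_low, bad_pct_high = 40, 60  # share of bad URLs, in percent

# Range of URLs expected to download successfully.
good_urls_high = total_urls * (100 - bad_pct_low) // 100
good_urls_low = total_urls * (100 - bad_pct_high) // 100

# Average clean text per page implied by "10M webpages or 50 GB".
avg_bytes_per_page = (50 * 10**9) // (10 * 10**6)

print(good_urls_low, good_urls_high, avg_bytes_per_page)
```

The quoted 10M webpages falls inside the resulting range, and the implied average is about 5 KB of clean text per page.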

Gutenberg

You can download the list of all URLs to US Gutenberg books here. There are 50K books, which convert to about 14 GB of pure text.

You can also run lazynlp.get_us_gutenberg_links() to get the same list. For example, if you want to get all the Gutenberg URLs and store them in the file us_gutenberg.urls, run the following command. This might take half a day.

lazynlp.get_us_gutenberg_links('us_gutenberg.urls')

You can download the list of all URLs to Australian Gutenberg books here. There are 4K books, which convert to about 1 GB of pure text.

You can also run lazynlp.get_aus_gutenberg_links() to get the same list. For example, if you want to get all the Australian Gutenberg URLs and store them in the file aus_gutenberg.urls:

lazynlp.get_aus_gutenberg_links('aus_gutenberg.urls')

Wikipedia

You can download the Wikipedia dumps here.

Step 2. Deduplicate URLs

You don't want to download the same URL multiple times. There are two functions that help you deduplicate all the URLs:

lazynlp.dedup_lines(files, outfold)

This function takes a list of files (in each file, each line is a URL) and deduplicates each file against all previous files. It saves all the deduplicated files in outfold.

lazynlp.dedup_lines_from_new_file(original_files, new_file, outfile)

This function lets you deduplicate a new file against all previously deduplicated files (original_files).
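To make the idea of deduplicating each file against all previous files concrete, here is a minimal sketch that works on in-memory lists rather than files. It is not lazynlp's implementation; the function name dedup_url_lists and the sample URLs are made up for illustration.

```python
def dedup_url_lists(url_lists):
    """Deduplicate each list of URLs against itself and all earlier lists.

    Illustrative stand-in for file-based deduplication; not lazynlp's code.
    """
    seen = set()
    deduped = []
    for urls in url_lists:
        kept = []
        for url in urls:
            url = url.strip()
            # Keep a URL only the first time it appears anywhere.
            if url and url not in seen:
                seen.add(url)
                kept.append(url)
        deduped.append(kept)
    return deduped


# Hypothetical inputs: two URL lists with one repeat inside and one across.
first = ["http://a.example/1", "http://b.example/2", "http://a.example/1"]
second = ["http://b.example/2", "http://c.example/3"]
result = dedup_url_lists([first, second])
print(result)
```

Order matters in this scheme: a URL is kept in the earliest list that contains it and dropped from every later one.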

Step 3. Download the URLs

If you want to download each webpage separately, call:

lazynlp.download_page(link, context=None, timeout=None)

If you want to download from a file that contains a list of URLs, call:

lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True, extensions=[], domains=[])

 """

link_file:

	file contains links to webpages to crawl. Each line contains one URL.

folder:

	folder that you want to contain your downloaded pages.

timeout:

	seconds to wait for a page to respond before abandoning it.

default_skip:

	set to True if you want to automatically skip all URLs that contain domains and extensions that are known to be scraper-unfriendly or NSFW.

	You can see the list of excluded domains at lazynlp/exclude_domains.txt.

	You can see the list of excluded extensions at lazynlp/exclude_extensions.txt

You can also add your own domains and extensions to skip with the domains and extensions arguments.

In the folder:

	Each URL is downloaded into a file, indexed by the order in which it is downloaded. The first line of each file is the URL. The rest is the textual content of the page.
 	
 	index.urls contains all the URLs that have been successfully downloaded.
	
	bad.urls contains the URLs that are bad.
	
	connection.urls contains the URLs that haven't been downloaded because of connection issues.
	
	non_ascii.urls contains the URLs that haven't been downloaded because of bad encoding issues.
	
	empty.urls contains the URLs that have empty textual content.

"""

If you have a lot of URLs, you can divide the list into multiple files and call this function on each one separately. I was able to run 40 scripts in parallel. I guess I could have parallelized the code itself, but I found this easier.
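One way to divide a URL list into multiple files is sketched below. The helper split_url_file and the part_NNN.urls naming are assumptions made for this example, not part of lazynlp; the demo writes to a temporary directory.

```python
import os
import tempfile


def split_url_file(link_file, out_dir, n_parts):
    """Split link_file into n_parts files with roughly equal numbers of URLs."""
    with open(link_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(n_parts):
        part = urls[i::n_parts]  # round-robin assignment of URLs to parts
        path = os.path.join(out_dir, f"part_{i:03d}.urls")
        with open(path, "w", encoding="utf-8") as out:
            for url in part:
                out.write(url + "\n")
        paths.append(path)
    return paths


def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demo with ten placeholder URLs split into three parts.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "all.urls")
    with open(src, "w", encoding="utf-8") as f:
        for i in range(10):
            f.write(f"http://example.com/{i}\n")
    parts = split_url_file(src, os.path.join(tmp, "parts"), 3)
    sizes = [count_lines(p) for p in parts]

print(sizes)
```

Each resulting file can then be passed as link_file to its own lazynlp.download_pages call in a separate script, with a separate output folder per script.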

Step 4. Clean the webpages

You can get rid of all HTML tags, decode UTF-8 into strings, transliterate foreign characters, collapse whitespace, replace unprintable characters, unescape HTML, etc.

You can also just call the following function to do most of the processing.

lazynlp.clean_page(page)
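For a feel of what these cleaning steps involve, here is a rough standard-library sketch. It is not what lazynlp.clean_page does internally: rough_clean is a hypothetical name, tag stripping uses a naive regular expression, and transliteration is approximated by dropping accents and any remaining non-ASCII characters.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def rough_clean(raw_html):
    """Approximate the cleaning steps listed above with the standard library."""
    text = TAG_RE.sub(" ", raw_html)  # strip HTML tags (naive)
    text = html.unescape(text)        # unescape entities such as &amp;
    # Crude transliteration: decompose accented characters, then drop non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace unprintable characters with spaces.
    text = "".join(c if c.isprintable() or c.isspace() else " " for c in text)
    return WS_RE.sub(" ", text).strip()  # collapse whitespace


sample = "<p>Caf&eacute;&nbsp;&amp; <b>cr\u00e8me</b>\x00 br\u00fbl\u00e9e</p>"
cleaned = rough_clean(sample)
print(cleaned)
```

A regex-based tag stripper is fragile on real pages (scripts, comments, malformed markup), so treat this only as an outline of the steps.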

Note:

In this library, the function lazynlp.download_pages() does both the crawling and the cleaning, so the webpages you end up with are pure text, like this:

 http://www.thecannabist.co/2017/03/02/jeff-sessions-russia-resign-democrats/74687/
Attorney general nominee Sen. Jeff Sessions, R-Ala., testifies on Capitol Hill in Washington on Jan. 10, 2017, in the first day of his confirmation hearing before the Senate Judiciary Committee. Top Democrats now say that because he misled the committee about his visits to Russia, he should resign. (Andrew Harnik, The Associated Press)

House Oversight and Government Reform Committee Chairman Jason Chaffetz, R-Utah, tweeted early Thursday that "AG Sessions should clarify his testimony and recuse himself."

Later, Sen. Rob Portman, R-Ohio, said in a statement, "Jeff Sessions is a former colleague and a friend, but I think it would be best for him and for the country to recuse himself from the DOJ Russia probe."

House Majority Leader Kevin McCarthy, R-Calif., also initially said during an appearance on MSNBC's "Morning Joe" that Sessions should bow out.

Asked whether Sessions should recuse himself in this situation, McCarthy replied "I think the trust of the American people -- you recuse yourself in these situations, yes."

McCarthy was pressed a second time about whether he was calling for Sessions to recuse himself and he confirmed that he believed the situation required a recusal.

"I think it would be easier from that standpoint, yes," McCarthy said.

But McCarthy later said his comment had been misinterpreted, telling Fox News' "Fox and Friends," "I'm not calling on him to recuse himself. I was asked on 'Morning Joe,' if he needs to recuse himself as going forward. As you just heard, Attorney General Sessions said he would recuse himself going forward -- appropriate, and that's all my answer was."

The comments from prominent Republicans follow revelations that Sessions met with the Russian ambassador during election season. Under oath in front of the Senate Judiciary Committee for his confirmation hearing in January, Sessions had said that he had not met with any Russian officials.

Senate Minority Leader Charles Schumer, D-N.Y., joined growing Democratic calls for Sessions to either resign or at least recuse himself from any investigations into Russia's meddling in U.S. elections.

"Attorney General Sessions cannot possibly lead an investigation into Russian interference in our elections or come anywhere near it. With these revelations, he may indeed become the subject of it," Schumer told reporters. "Better for the country if he resigns, but let's get an investigation going."

Because the Department of Justice should be above reproach, for the good of the country, the Attorney General should resign.
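Since each saved file starts with its URL, code that reads these files usually needs to separate that first line from the text. A minimal sketch follows, using a hypothetical helper split_page and made-up file content:

```python
def split_page(raw, header=1):
    """Return (header_lines, body) for a saved page, skipping `header` lines."""
    lines = raw.splitlines()
    return lines[:header], "\n".join(lines[header:]).strip()


# Made-up content in the layout shown above: URL first, then the text.
raw_page = "http://example.com/article\nFirst paragraph.\n\nSecond paragraph.\n"
head, body = split_page(raw_page)
print(head[0])
print(body)
```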

Step 5. Remove duplicated webpages

To avoid any piece of text being over-represented, you want to include only pages that don't significantly overlap with other pages.

To estimate how much target files overlap with certain source files, use this function:

lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8, capacity=10000, error_rate=1e-5, header=0, interval=100000)

gran is the granularity of tokens: 'char' or 'word' level.

n is the n-gram size.

capacity and error_rate are for the BloomFilter used.

header: the number of lines to skip at the start of each file. This is because in our format, the first line is the URL.
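The sketch below illustrates n-gram overlap at 'word' or 'char' granularity. It uses an exact Python set as a stand-in for the Bloom filter, so it has no false positives and needs no capacity or error_rate. The function names and the way the ratio is normalized (shared n-grams divided by the target's n-grams) are assumptions for illustration, not lazynlp's code.

```python
def ngrams(text, n=8, gran="word"):
    """List the n-grams of text at word or character granularity."""
    tokens = text.split() if gran == "word" else list(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_fraction(source_texts, target_text, n=8, gran="word"):
    """Fraction of the target's n-grams that also occur in any source text."""
    seen = set()  # exact set standing in for a Bloom filter
    for text in source_texts:
        seen.update(ngrams(text, n, gran))
    target = ngrams(target_text, n, gran)
    if not target:
        return 0.0
    return sum(gram in seen for gram in target) / len(target)


source = ["the quick brown fox jumps over the lazy dog"]
target = "the quick brown fox sleeps under the lazy dog"
ratio = overlap_fraction(source, target, n=3)
print(round(ratio, 3))
```

Here 3 of the target's 7 word trigrams appear in the source. A Bloom filter trades this exactness for far lower memory use, which matters once the source runs to gigabytes of text.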

To estimate how much a target file overlaps with an existing BloomFilter, use this function:

lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8, header=0)

Given a list of files, e.g. cleaned webpages, to filter out all the files that overlap with other files by more than threshold, use this function:

lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, capacity=100000000, error_rate=1e-7, header=0, interval=1000000)

The names of all files deemed duplicated are stored in dupped_files.list

The names of all files used for the dataset are stored in clean_files.list
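Threshold-based filtering can be sketched as a single greedy pass over the pages. This is an assumption-laden illustration, not lazynlp's code: it works on in-memory (name, text) pairs, uses an exact set instead of a Bloom filter, and only pages that are kept contribute their n-grams to later comparisons.

```python
def word_ngrams(text, n):
    """Set of word-level n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_texts(named_texts, threshold=0.5, n=8):
    """Greedily split pages into kept and duplicated by n-gram overlap."""
    seen = set()
    clean, dupped = [], []
    for name, text in named_texts:
        grams = word_ngrams(text, n)
        overlap = len(grams & seen) / len(grams) if grams else 0.0
        if overlap > threshold:
            dupped.append(name)
        else:
            clean.append(name)
            seen |= grams  # assumption: only kept pages feed the filter
    return clean, dupped


# Made-up pages; the second one shares most of its trigrams with the first.
pages = [
    ("0.txt", "a b c d e f g h"),
    ("1.txt", "a b c d e f x y"),
    ("2.txt", "p q r s t u v w"),
]
clean, dupped = filter_texts(pages, threshold=0.5, n=3)
print(clean, dupped)
```

Because the pass is greedy, the result depends on file order: the first of two near-identical pages is kept and the later one is dropped.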

Some notes:

1 GB of text is about 1B characters. An English word has on average 4.5 characters, or 5.5 including whitespace. So 1 GB of text is about 181M words.
When I ran 30 scripts in parallel, it took 3 hours to download and clean 1 GB of pure text. At that rate, 50 GB of pure text takes about 150 hours, or roughly 6 days.
The OpenAI dataset has 40 GB, which I estimate to contain about 7-8 billion words. If you download all the webpages from the good Reddit URLs and the Gutenberg books, you should have a dataset bigger than OpenAI's WebText.
OpenAI, in their paper for GPT-2, didn't include Wikipedia articles for fear of overlap. You can choose to include Wikipedia articles that have less than a certain amount of overlap with the existing dataset using lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8).
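The size and time estimates in these notes can be reproduced with simple arithmetic. The constants below are the rough figures quoted above; the variable names are illustrative.

```python
CHARS_PER_GB = 1e9      # about 1B characters per GB of plain text
CHARS_PER_WORD = 5.5    # 4.5 characters plus one whitespace character

words_per_gb = CHARS_PER_GB / CHARS_PER_WORD   # about 181.8M words
webtext_words = 40 * words_per_gb              # about 7.3 billion words

hours_per_gb = 3                               # observed with 30 parallel scripts
hours_for_50gb = 50 * hours_per_gb             # 150 hours
days_for_50gb = hours_for_50gb / 24            # 6.25 days

print(f"{words_per_gb / 1e6:.1f}M words per GB")
print(f"{webtext_words / 1e9:.2f}B words in 40 GB")
print(f"{days_for_50gb} days for 50 GB")
```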


Additional information

Version: 1.0.0
Type: Other source code
Updated: 2025-04-19
Size: 16.4 KB
Source: GitHub
