
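EDGAR-CRAWLER downloads raw, unstructured SEC filings from EDGAR and converts them into structured JSON files to bootstrap financial NLP experiments.
EDGAR-CRAWLER has 2 core functionalities: downloading the raw filings of your choice, and extracting their structured output.
Beyond downloading raw filings, EDGAR-CRAWLER is the only open-source toolkit that converts complex, unstructured SEC filings into structured JSON output, which makes them easier to integrate into your research and development. Below are examples of this output for each supported filing type: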
Original report: Apple 10-K from 2022
{
  "cik": "320193",
  "company": "Apple Inc.",
  "filing_type": "10-K",
  "filing_date": "2022-10-28",
  "period_of_report": "2022-09-24",
  "sic": "3571",
  "state_of_inc": "CA",
  "state_location": "CA",
  "fiscal_year_end": "0924",
  "filing_html_index": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-22-000108-index.html",
  "htm_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm",
  "complete_text_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-22-000108.txt",
  "filename": "320193_10K_2022_0000320193-22-000108.htm",
  "item_1": "Item 1. Business\nCompany Background\nThe Company designs, manufactures ...",
  "item_1A": "Item 1A. Risk Factors\nThe Company’s business, reputation, results of ...",
  "item_1B": "Item 1B. Unresolved Staff Comments\nNone.",
  "item_1C": "",
  "item_2": "Item 2. Properties\nThe Company’s headquarters are located in Cupertino, California. ...",
  "item_3": "Item 3. Legal Proceedings\nEpic Games\nEpic Games, Inc. (“Epic”) filed a lawsuit ...",
  "item_4": "Item 4. Mine Safety Disclosures\nNot applicable. ...",
  "item_5": "Item 5. Market for Registrant’s Common Equity, Related Stockholder ...",
  "item_6": "Item 6. [Reserved]\nApple Inc. | 2022 Form 10-K | 19",
  "item_7": "Item 7. Management’s Discussion and Analysis of Financial Condition ...",
  "item_8": "Item 8. Financial Statements and Supplementary Data\nAll financial ...",
  "item_9": "Item 9. Changes in and Disagreements with Accountants on Accounting and Financial Disclosure\nNone.",
  "item_9A": "Item 9A. Controls and Procedures\nEvaluation of Disclosure Controls and ...",
  "item_9B": "Item 9B. Other Information\nRule 10b5-1 Trading Plans\nDuring the three months ...",
  "item_9C": "Item 9C. Disclosure Regarding Foreign Jurisdictions that Prevent Inspections\nNot applicable. ...",
  "item_10": "Item 10. Directors, Executive Officers and Corporate Governance\nThe information required ...",
  "item_11": "Item 11. Executive Compensation\nThe information required by this Item will be included ...",
  "item_12": "Item 12. Security Ownership of Certain Beneficial Owners and Management and ...",
  "item_13": "Item 13. Certain Relationships and Related Transactions, and Director Independence ...",
  "item_14": "Item 14. Principal Accountant Fees and Services\nThe information required ...",
  "item_15": "Item 15. Exhibit and Financial Statement Schedules\n(a)Documents filed as part ...",
  "item_16": "Item 16. Form 10-K Summary\nNone.\nApple Inc. | 2022 Form 10-K | 57"
}
Original report: Apple 10-Q from Q1 2024
{
  "cik": "320193",
  "company": "Apple Inc.",
  "filing_type": "10-Q",
  "filing_date": "2024-05-03",
  "period_of_report": "2024-03-30",
  "sic": "3571",
  "state_of_inc": "CA",
  "state_location": "CA",
  "fiscal_year_end": "0928",
  "filing_html_index": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-24-000069-index.html",
  "htm_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/000032019324000069/aapl-20240330.htm",
  "complete_text_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-24-000069.txt",
  "filename": "320193_10Q_2024_0000320193-24-000069.htm",
  "part_1": "PART I - FINANCIAL INFORMATION\nItem 1. Financial Statements\nApple Inc.\nCONDENSED CONSOLIDATED STATEMENTS ...",
  "part_1_item_1": "Item 1. Financial Statements\nApple Inc.\nCONDENSED CONSOLIDATED STATEMENTS ...",
  "part_1_item_2": "Item 2. Management’s Discussion and Analysis of Financial Condition and ...",
  "part_1_item_3": "Item 3. Quantitative and Qualitative Disclosures About Market Risk\nThere have ...",
  "part_1_item_4": "Item 4. Controls and Procedures\nEvaluation of Disclosure Controls and ...",
  "part_2": "PART II - OTHER INFORMATION\nItem 1. Legal Proceedings\nDigital Markets Act Investigations\nOn ...",
  "part_2_item_1": "Item 1. Legal Proceedings\nDigital Markets Act Investigations\nOn March 25, 2024, ...",
  "part_2_item_1A": "Item 1A. Risk Factors\nThe Company’s business, reputation, ...",
  "part_2_item_2": "Item 2. Unregistered Sales of Equity Securities and Use of ...",
  "part_2_item_3": "Item 3. Defaults Upon Senior Securities\nNone.",
  "part_2_item_4": "Item 4. Mine Safety Disclosures\nNot applicable.",
  "part_2_item_5": "Item 5. Other Information\nInsider Trading Arrangements\nNone.",
  "part_2_item_6": "Item 6. Exhibits\nIncorporated by Reference\nExhibit\nNumber\nExhibit Description ..."
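}
Note: part_1 and part_2 contain the full detected text of the respective part. We provide this because, in some older 10-Q filings, it is not possible to extract information at the item level.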
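As a rough illustration of how the part-level fields can serve as a fallback, the sketch below returns an item's text when it was extracted and otherwise falls back to the full text of its part. The helper name get_section and the shortened sample record are hypothetical; only the key names (part_1, part_1_item_1, and so on) are taken from the example output above.

```python
import json


def get_section(filing: dict, part: int, item: str) -> str:
    """Return the item text if it was extracted, else fall back to the whole part."""
    item_text = filing.get(f"part_{part}_item_{item}", "").strip()
    if item_text:
        return item_text
    # Older 10-Q filings may have no item-level text; use the full part instead.
    return filing.get(f"part_{part}", "").strip()


# Hypothetical, heavily shortened record that mimics the structure shown above.
sample = json.loads("""
{
  "filing_type": "10-Q",
  "part_1": "PART I - FINANCIAL INFORMATION\\nItem 1. Financial Statements ...",
  "part_1_item_1": "",
  "part_2": "PART II - OTHER INFORMATION\\nItem 1. Legal Proceedings ...",
  "part_2_item_1": "Item 1. Legal Proceedings ..."
}
""")

print(get_section(sample, 2, "1"))  # item-level text is available
print(get_section(sample, 1, "1"))  # falls back to the full Part I text
```

Falling back to the part keeps downstream code working on older filings, at the cost of returning more text than the single item you asked for.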
Original report: Apple 8-K from 2022-08-19
{
  "cik": "320193",
  "company": "Apple Inc.",
  "filing_type": "8-K",
  "filing_date": "2022-08-19",
  "period_of_report": "2022-08-17",
  "sic": "3571",
  "state_of_inc": "CA",
  "state_location": "CA",
  "fiscal_year_end": "0924",
  "filing_html_index": "https://www.sec.gov/Archives/edgar/data/320193/0001193125-22-225365-index.html",
  "htm_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/000119312522225365/d366128d8k.htm",
  "complete_text_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/0001193125-22-225365.txt",
  "filename": "320193_8K_2022_0001193125-22-225365.htm",
  "item_1.01": "",
  "item_1.02": "",
  "item_1.03": "",
  "item_1.04": "",
  "item_1.05": "",
  "item_2.01": "",
  "item_2.02": "",
  "item_2.03": "",
  "item_2.04": "",
  "item_2.05": "",
  "item_2.06": "",
  "item_3.01": "",
  "item_3.02": "",
  "item_3.03": "",
  "item_4.01": "",
  "item_4.02": "",
  "item_5.01": "",
  "item_5.02": "Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment ...",
  "item_5.03": "Item 5.03 Amendments to Articles of Incorporation or Bylaws; Change in Fiscal Year.\nOn August 17, 2022, Apple’s Board approved and adopted amended and restated bylaws ...",
  "item_5.04": "",
  "item_5.05": "",
  "item_5.06": "",
  "item_5.07": "",
  "item_5.08": "",
  "item_6.01": "",
  "item_6.02": "",
  "item_6.03": "",
  "item_6.04": "",
  "item_6.05": "",
  "item_7.01": "",
  "item_8.01": "",
  "item_9.01": "Item 9.01 Financial Statements and Exhibits.\n(d) Exhibits.\nExhibit\nNumber\nExhibit ..."
}
Clone EDGAR-CRAWLER:
# Method 1: HTTPS
git clone https://github.com/nlpaueb/edgar-crawler.git
# Method 2: SSH
git clone git@github.com:nlpaueb/edgar-crawler.git
conda create -n edgar-crawler-venv python=3.8  # After installing Anaconda, create a venv with Python 3.8+
conda activate edgar-crawler-venv  # Activate the environment
pip install -r requirements.txt  # Install requirements for edgar-crawler
Before running any script, you should edit the config.json file, which configures the behavior of our 2 modules (one for downloading the filings of your choice, the other for extracting their structured output).
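Arguments for download_filings.py, the module that downloads financial reports:
start_year XXXX: the year range to start from (default is 2023).
end_year YYYY: the year range to end at (default is 2023).
quarters: the quarters that you want to download filings from (list). Default is [1, 2, 3, 4].
filing_types: list of filing types to download. Default is ['10-K', '8-K', '10-Q'].
cik_tickers: a list, or a path to a file, containing CIKs or tickers, e.g. [789019, "1018724", "AAPL", "TWTR"].
user_agent: the User-Agent (name/email) that will be declared to SEC EDGAR.
raw_filings_folder: the name of the folder where downloaded filings will be stored. Default is 'RAW_FILINGS'.
indices_folder: the name of the folder where EDGAR TSV files will be stored. These are used to locate the annual reports. Default is 'INDICES'.
filings_metadata_file: CSV filename in which to save metadata from the reports.
skip_present_indices: whether to skip already downloaded EDGAR indices or download them anyway. Default is True.
Arguments for extract_items.py, the module that cleans and extracts textual data from already-downloaded reports:
raw_filings_folder: the name of the folder where the downloaded documents are stored. Default is 'RAW_FILINGS'.
extracted_filings_folder: the name of the folder where extracted documents will be stored. Default is 'EXTRACTED_FILINGS'.
filings_metadata_file: CSV filename from which to load the reports' metadata (provide the same CSV file as in download_filings.py).
filing_types: list of filing types to extract.
include_signature: whether to include the signature section after the last item.
items_to_extract: a list with the specific item sections to extract. For example, ['7','8'] extracts the "Management's Discussion and Analysis" and "Financial Statements" items of 10-K reports.
remove_tables: whether to remove tables containing numerical (financial) data. This work is mostly meant to facilitate NLP research, where numerical tables are often not useful.
skip_extracted_filings: whether to skip already extracted filings or extract them anyway. Default is True.
To download raw financial reports from EDGAR, run python download_filings.py.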
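To make the argument list more concrete, here is a minimal sketch that assembles the settings as a Python dictionary, runs a few sanity checks on the download section, and prints the result as JSON. The two top-level section names (download_filings and extract_items), the helper check_download_config, and the placeholder user agent are assumptions made for illustration; the sketch also omits some arguments, so compare it against the config.json shipped with the repository before relying on it.

```python
import json

ALLOWED_FILING_TYPES = {"10-K", "10-Q", "8-K"}


def check_download_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if cfg["start_year"] > cfg["end_year"]:
        problems.append("start_year must not be after end_year")
    if not set(cfg["quarters"]) <= {1, 2, 3, 4}:
        problems.append("quarters must be a subset of [1, 2, 3, 4]")
    unknown = set(cfg["filing_types"]) - ALLOWED_FILING_TYPES
    if unknown:
        problems.append(f"unsupported filing types: {sorted(unknown)}")
    if not cfg.get("user_agent"):
        problems.append("user_agent should be set (name/email)")
    return problems


# Assumed layout: one section per module. Verify against the repository's config.json.
config = {
    "download_filings": {
        "start_year": 2022,
        "end_year": 2023,
        "quarters": [1, 2, 3, 4],
        "filing_types": ["10-K", "10-Q"],
        "cik_tickers": ["AAPL", 789019],
        "user_agent": "Jane Doe (jane.doe@example.com)",  # placeholder identity
        "raw_filings_folder": "RAW_FILINGS",
        "indices_folder": "INDICES",
        "skip_present_indices": True,
    },
    "extract_items": {
        "raw_filings_folder": "RAW_FILINGS",
        "extracted_filings_folder": "EXTRACTED_FILINGS",
        "filing_types": ["10-K", "10-Q"],
        "items_to_extract": ["7", "8"],
        "skip_extracted_filings": True,
    },
}

print(check_download_config(config["download_filings"]))
print(json.dumps(config, indent=2))
```

Checking the values before a run is cheap compared with discovering a typo after a long download has started.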
To clean and extract specific item sections from already-downloaded documents, run python extract_items.py.
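Once extraction has finished, a quick way to see what was captured is to list the non-empty item fields of each output file. The sketch below assumes the extracted reports are JSON files located somewhere under the EXTRACTED_FILINGS folder (it searches subfolders too, since the exact layout is not described here); the helper name non_empty_items is hypothetical.

```python
import json
from pathlib import Path


def non_empty_items(folder: str) -> dict:
    """Map each extracted JSON filename to the item/part keys that contain text."""
    result = {}
    for path in sorted(Path(folder).rglob("*.json")):
        with open(path, encoding="utf-8") as handle:
            filing = json.load(handle)
        result[path.name] = [
            key
            for key, value in filing.items()
            if key.startswith(("item_", "part_"))
            and isinstance(value, str)
            and value.strip()
        ]
    return result


# Only scan when the (assumed) default output folder exists.
if Path("EXTRACTED_FILINGS").is_dir():
    for name, keys in non_empty_items("EXTRACTED_FILINGS").items():
        print(name, keys)
```

This is particularly handy for 8-K reports, where most item fields are empty and only a few carry text.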
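For 10-Q filings, each full part is also included in the output as a separate entry.
An EDGAR-CRAWLER paper is on its way. Until then, please cite the relevant EDGAR-CORPUS paper published at ECONLP@EMNLP 2021 (Punta Cana, Dominican Republic):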
@inproceedings{loukas-etal-2021-edgar-corpus-and-edgar-crawler,
    title = "{EDGAR}-{CORPUS}: {B}illions of {T}okens {M}ake {T}he {W}orld {G}o {R}ound",
    author = "Loukas, Lefteris and
      Fergadiotis, Manos and
      Androutsopoulos, Ion and
      Malakasiotis, Prodromos",
    booktitle = "Proceedings of the Third Workshop on Economics and Natural Language Processing (ECONLP)",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.econlp-1.2",
    pages = "13--18",
}
Read the EDGAR-CORPUS paper here: https://aclanthology.org/2021.econlp-1.2/
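Below are some other resources created using EDGAR-CRAWLER:
EDGAR-CORPUS: the largest financial NLP corpus, with 6 billion tokens from annual reports (HuggingFace URL | Zenodo URL).
EDGAR-W2V: financial Word2Vec embeddings, pre-trained on EDGAR-CORPUS (Zenodo URL).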
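Do you have any feature requests? Tell us directly via this Google Form: https://forms.gle/bpv8nxmqx8sq2v5z8
PRs and contributions are welcome. We use the feature branch workflow.
Please create issues on GitHub instead of emailing us directly, so that all possible users can benefit from the troubleshooting.
This software is licensed under the GNU General Public License v3.0, a license approved by the Open Source Initiative (OSI).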