
EDGAR-CRAWLER is the only open-source toolkit that downloads raw and unstructured financial SEC filings from EDGAR and converts them into structured JSON files in order to bootstrap financial NLP experiments.
EDGAR-CRAWLER has 2 core functionalities: downloading raw filings from EDGAR, and extracting their item sections into structured JSON.
Besides downloading the raw filings, EDGAR-CRAWLER converts the complex, unstructured SEC filings into structured JSON outputs for easier integration into your research and development. Below are examples of such outputs for each supported filing type:
Original report: Apple 10-K from 2022
{
"cik": "320193",
"company": "Apple Inc.",
"filing_type": "10-K",
"filing_date": "2022-10-28",
"period_of_report": "2022-09-24",
"sic": "3571",
"state_of_inc": "CA",
"state_location": "CA",
"fiscal_year_end": "0924",
"filing_html_index": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-22-000108-index.html",
"htm_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm",
"complete_text_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-22-000108.txt",
"filename": "320193_10K_2022_0000320193-22-000108.htm",
"item_1": "Item 1. Business\nCompany Background\nThe Company designs, manufactures ...",
"item_1A": "Item 1A. Risk Factors\nThe Company’s business, reputation, results of ...",
"item_1B": "Item 1B. Unresolved Staff Comments\nNone.",
"item_1C": "",
"item_2": "Item 2. Properties\nThe Company’s headquarters are located in Cupertino, California. ...",
"item_3": "Item 3. Legal Proceedings\nEpic Games\nEpic Games, Inc. (“Epic”) filed a lawsuit ...",
"item_4": "Item 4. Mine Safety Disclosures\nNot applicable. ...",
"item_5": "Item 5. Market for Registrant’s Common Equity, Related Stockholder ...",
"item_6": "Item 6. [Reserved]\nApple Inc. | 2022 Form 10-K | 19",
"item_7": "Item 7. Management’s Discussion and Analysis of Financial Condition ...",
"item_8": "Item 8. Financial Statements and Supplementary Data\nAll financial ...",
"item_9": "Item 9. Changes in and Disagreements with Accountants on Accounting and Financial Disclosure\nNone.",
"item_9A": "Item 9A. Controls and Procedures\nEvaluation of Disclosure Controls and ...",
"item_9B": "Item 9B. Other Information\nRule 10b5-1 Trading Plans\nDuring the three months ...",
"item_9C": "Item 9C. Disclosure Regarding Foreign Jurisdictions that Prevent Inspections\nNot applicable. ...",
"item_10": "Item 10. Directors, Executive Officers and Corporate Governance\nThe information required ...",
"item_11": "Item 11. Executive Compensation\nThe information required by this Item will be included ...",
"item_12": "Item 12. Security Ownership of Certain Beneficial Owners and Management and ...",
"item_13": "Item 13. Certain Relationships and Related Transactions, and Director Independence ...",
"item_14": "Item 14. Principal Accountant Fees and Services\nThe information required ...",
"item_15": "Item 15. Exhibit and Financial Statement Schedules\n(a)Documents filed as part ...",
"item_16": "Item 16. Form 10-K Summary\nNone.\nApple Inc. | 2022 Form 10-K | 57"
}
Original report: Apple 10-Q from Q1 2024
{
"cik": "320193",
"company": "Apple Inc.",
"filing_type": "10-Q",
"filing_date": "2024-05-03",
"period_of_report": "2024-03-30",
"sic": "3571",
"state_of_inc": "CA",
"state_location": "CA",
"fiscal_year_end": "0928",
"filing_html_index": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-24-000069-index.html",
"htm_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/000032019324000069/aapl-20240330.htm",
"complete_text_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/0000320193-24-000069.txt",
"filename": "320193_10Q_2024_0000320193-24-000069.htm",
"part_1": "PART I - FINANCIAL INFORMATION\nItem 1. Financial Statements\nApple Inc.\nCONDENSED CONSOLIDATED STATEMENTS ...",
"part_1_item_1": "Item 1. Financial Statements\nApple Inc.\nCONDENSED CONSOLIDATED STATEMENTS ...",
"part_1_item_2": "Item 2. Management’s Discussion and Analysis of Financial Condition and ...",
"part_1_item_3": "Item 3. Quantitative and Qualitative Disclosures About Market Risk\nThere have ...",
"part_1_item_4": "Item 4. Controls and Procedures\nEvaluation of Disclosure Controls and ...",
"part_2": "PART II - OTHER INFORMATION\nItem 1. Legal Proceedings\nDigital Markets Act Investigations\nOn ...",
"part_2_item_1": "Item 1. Legal Proceedings\nDigital Markets Act Investigations\nOn March 25, 2024, ...",
"part_2_item_1A": "Item 1A. Risk Factors\nThe Company’s business, reputation, ...",
"part_2_item_2": "Item 2. Unregistered Sales of Equity Securities and Use of ...",
"part_2_item_3": "Item 3. Defaults Upon Senior Securities\nNone.",
"part_2_item_4": "Item 4. Mine Safety Disclosures\nNot applicable.",
"part_2_item_5": "Item 5. Other Information\nInsider Trading Arrangements\nNone.",
"part_2_item_6": "Item 6. Exhibits\nIncorporated by Reference\nExhibit\nNumber\nExhibit Description ..."
}
Note: part_1 and part_2 contain the full detected text of the corresponding Part. They are provided because, in some older 10-Q filings, it is not possible to extract the information at item level.
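The fallback described in the note can be sketched in Python. This is a minimal illustration, not part of EDGAR-CRAWLER: the helper name `get_section` is hypothetical, and the `sample` dictionary is a hand-written stand-in that only mimics the key layout of the 10-Q output shown above.

```python
import json


def get_section(filing: dict, part: str, item: str) -> str:
    """Return the text of one item, or the whole Part if the item is empty.

    Hypothetical helper (not an EDGAR-CRAWLER function). Keys follow the
    10-Q layout shown above, e.g. "part_1" and "part_1_item_2".
    """
    item_text = filing.get(f"{part}_item_{item}", "")
    if item_text.strip():
        return item_text
    # No item-level text was extracted: fall back to the full Part text.
    return filing.get(part, "")


# Hand-written stand-in for an extracted 10-Q JSON file.
sample = json.loads("""
{
  "filing_type": "10-Q",
  "part_1": "PART I - FINANCIAL INFORMATION\\nItem 1. Financial Statements ...",
  "part_1_item_1": "Item 1. Financial Statements ...",
  "part_1_item_2": "",
  "part_2": "PART II - OTHER INFORMATION\\nItem 1. Legal Proceedings ..."
}
""")

print(get_section(sample, "part_1", "1"))  # item-level text is available
print(get_section(sample, "part_1", "2"))  # empty item: falls back to part_1
```

For a real extracted filing, the same helper could be applied to the dictionary returned by `json.load` on the output file.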
Original report: Apple 8-K from 2022-08-19
{
"cik": "320193",
"company": "Apple Inc.",
"filing_type": "8-K",
"filing_date": "2022-08-19",
"period_of_report": "2022-08-17",
"sic": "3571",
"state_of_inc": "CA",
"state_location": "CA",
"fiscal_year_end": "0924",
"filing_html_index": "https://www.sec.gov/Archives/edgar/data/320193/0001193125-22-225365-index.html",
"htm_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/000119312522225365/d366128d8k.htm",
"complete_text_filing_link": "https://www.sec.gov/Archives/edgar/data/320193/0001193125-22-225365.txt",
"filename": "320193_8K_2022_0001193125-22-225365.htm",
"item_1.01": "",
"item_1.02": "",
"item_1.03": "",
"item_1.04": "",
"item_1.05": "",
"item_2.01": "",
"item_2.02": "",
"item_2.03": "",
"item_2.04": "",
"item_2.05": "",
"item_2.06": "",
"item_3.01": "",
"item_3.02": "",
"item_3.03": "",
"item_4.01": "",
"item_4.02": "",
"item_5.01": "",
"item_5.02": "Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment ...",
"item_5.03": "Item 5.03 Amendments to Articles of Incorporation or Bylaws; Change in Fiscal Year.\nOn August 17, 2022, Apple’s Board approved and adopted amended and restated bylaws ...",
"item_5.04": "",
"item_5.05": "",
"item_5.06": "",
"item_5.07": "",
"item_5.08": "",
"item_6.01": "",
"item_6.02": "",
"item_6.03": "",
"item_6.04": "",
"item_6.05": "",
"item_7.01": "",
"item_8.01": "",
"item_9.01": "Item 9.01 Financial Statements and Exhibits.\n(d) Exhibits.\nExhibit\nNumber\nExhibit ..."
}
Clone EDGAR-CRAWLER locally via HTTPS or SSH:
# Method 1: HTTPS
git clone https://github.com/nlpaueb/edgar-crawler.git
# Method 2: SSH
git clone git@github.com:nlpaueb/edgar-crawler.git
conda create -n edgar-crawler-venv python=3.8 # After installing Anaconda, create a venv with python 3.8+
conda activate edgar-crawler-venv # Activate the environment
pip install -r requirements.txt # Install requirements for edgar-crawler
Before running any script, you should edit the config.json file, which configures the behavior of our 2 modules (one for downloading the filings of your choice, the other for producing their structured output).
download_filings.py, the module to download financial reports:
start_year XXXX: the first year of the range to download (default is 2023).
end_year YYYY: the last year of the range to download (default is 2023).
quarters: the quarters that you want to download filings from (List), e.g. [1, 2, 3, 4].
filing_types: list of filing types to download, e.g. ['10-K', '8-K', '10-Q'].
cik_tickers: list or path of file containing CIKs or Tickers, e.g. [789019, "1018724", "AAPL", "TWTR"].
user_agent: the User-agent (name/email) that will be declared to SEC EDGAR.
raw_filings_folder: the name of the folder where downloaded filings will be stored, e.g. 'RAW_FILINGS'.
indices_folder: the name of the folder where EDGAR TSV files will be stored. These are used to locate the annual reports. Default value is 'INDICES'.
filings_metadata_file: CSV filename to save metadata from the reports.
skip_present_indices: whether to skip already downloaded EDGAR indices or download them again, e.g. True.
extract_items.py, the module to clean and extract textual data from already-downloaded reports:
raw_filings_folder: the name of the folder where the downloaded documents are stored, e.g. 'RAW_FILINGS'.
extracted_filings_folder: the name of the folder where extracted documents will be stored, e.g. 'EXTRACTED_FILINGS'.
filings_metadata_file: CSV filename to load reports metadata (provide the same CSV file as in download_filings.py).
filing_types: list of filing types to extract.
include_signature: whether to include the signature section after the last item.
items_to_extract: a list of the item sections to extract, e.g. ['7','8'] to extract the 'Management’s Discussion and Analysis' and 'Financial Statements' sections of 10-K reports.
remove_tables: whether to remove tables containing mostly numerical (financial) data. This is mainly to facilitate NLP research, where numerical tables are often not useful.
skip_extracted_filings: whether to skip already extracted filings or extract them again, e.g. True.
To download the raw financial reports from EDGAR, run python download_filings.py.
To clean and extract specific item sections from already-downloaded documents, run python extract_items.py.
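As an illustration, the settings listed above could be generated from Python before running the two scripts. This is a sketch under assumptions: the top-level keys `download_filings` and `extract_items`, the `user_agent` string, the metadata filename, the `include_signature` and `remove_tables` values, and the output name `config.example.json` are placeholders chosen here, so compare them with the config.json shipped in the repository. The parameter names and the remaining values come from the lists above.

```python
import json
from pathlib import Path

config = {
    # Assumed section name for download_filings.py settings.
    "download_filings": {
        "start_year": 2023,
        "end_year": 2023,
        "quarters": [1, 2, 3, 4],
        "filing_types": ["10-K", "8-K", "10-Q"],
        "cik_tickers": ["AAPL"],
        "user_agent": "Your Name (your.email@example.com)",  # placeholder
        "raw_filings_folder": "RAW_FILINGS",
        "indices_folder": "INDICES",
        "filings_metadata_file": "FILINGS_METADATA.csv",  # placeholder name
        "skip_present_indices": True,
    },
    # Assumed section name for extract_items.py settings.
    "extract_items": {
        "raw_filings_folder": "RAW_FILINGS",
        "extracted_filings_folder": "EXTRACTED_FILINGS",
        "filings_metadata_file": "FILINGS_METADATA.csv",  # same CSV as above
        "filing_types": ["10-K"],
        "include_signature": False,  # placeholder choice
        "items_to_extract": ["7", "8"],
        "remove_tables": True,  # placeholder choice
        "skip_extracted_filings": True,
    },
}

# The extractor reads what the downloader wrote, so both sections should
# point at the same raw folder and the same metadata CSV.
dl, ex = config["download_filings"], config["extract_items"]
assert dl["raw_filings_folder"] == ex["raw_filings_folder"]
assert dl["filings_metadata_file"] == ex["filings_metadata_file"]

# Written under a different name so an existing config.json is not overwritten.
Path("config.example.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
```

Writing to a separate file keeps the repository's own config.json intact until the generated settings have been reviewed.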
part in the output file as a separate entry.
An EDGAR-CRAWLER paper is on its way. Until then, please cite our relevant EDGAR-CORPUS paper published at ECONLP@EMNLP 2021 (Punta Cana, Dominican Republic).
@inproceedings{loukas-etal-2021-edgar-corpus-and-edgar-crawler,
title = "{EDGAR}-{CORPUS}: {B}illions of {T}okens {M}ake {T}he {W}orld {G}o {R}ound",
author = "Loukas, Lefteris and
Fergadiotis, Manos and
Androutsopoulos, Ion and
Malakasiotis, Prodromos",
booktitle = "Proceedings of the Third Workshop on Economics and Natural Language Processing (ECONLP)",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.econlp-1.2",
pages = "13--18",
}
Read the EDGAR-CORPUS paper here: https://aclanthology.org/2021.econlp-1.2/
Here are some additional resources created by using EDGAR-CRAWLER:
EDGAR-CORPUS: The largest financial NLP corpus, 6+ billion tokens from annual reports (HuggingFace URL) | (Zenodo URL).
EDGAR-W2V: Financial Word2Vec embeddings, pre-trained on EDGAR-CORPUS (Zenodo URL)
Do you have a feature request? Tell us directly using this Google Form: https://forms.gle/bpV8nxMqX8Sq2v5z8
PRs and contributions are accepted. We use the Feature Branch Workflow.
Please create an issue on GitHub instead of emailing us directly so all possible users can benefit from the troubleshooting.
This software is licensed under the GNU General Public License v3.0, a license approved by the Open Source Initiative (OSI).