Financial4All

| Pandas | Numpy | Plotly |

Financial4All (F4A) is an open-source passion project for visualizing SEC financial data translated from XBRL. It also lets users easily download Excel files from the SEC to perform financial analysis. This initiative aims to counter financial data privatization and improve financial transparency after the SEC's decision to stop making financial-statement Excel documents easily accessible. I plan to update this project in the future to integrate AI to help with parsing Excel financial statements.

You can check out this project on GitHub.

The Situation:

Thanks to the Financial Data Transparency Act of 2022 (FDTA), the SEC has started requiring companies to file their financial data in more "machine-readable formats". The downside is that these formats are less human-readable and far less accessible to the average person, who relies on the Excel file format. These new requirements wouldn't be detrimental if companies still filed, or the SEC still provided, downloadable XLSX files with properly formatted details ready to analyze in Excel. But as of May 2025, the SEC removed the button for downloading the Excel versions of financial statements and hid them away deep within the "Archives" directory, expecting any user interested in them to know the company's CIK identifier and the accession number of the exact filing they need.

This, in turn, has greatly restricted the accessibility to financial information, having the exact opposite effect of what the FDTA aimed to achieve.

My Solution:

To help make financial data more transparent again, I created Financial4All, which allows a user to input a stock ticker, grab its associated XBRL data from the SEC's API, and translate that data into visualizations and workable Excel spreadsheets that make sense to the average user. This solution is quite nuanced due to the lack of standardization and cleanliness of the data returned by the SEC's API endpoint. To work around the severe incompleteness and disorganization of the data, I had to construct complex algorithms that parse XBRL data using the raw calculation files from the Financial Accounting Standards Board. The result allows for enhanced standardization of the financial statements companies produce in XBRL, letting everyday users analyze their investment prospects.

Despite my efforts to create standardized financial statements, the accuracy of some fields may vary greatly due to the lack of regulation surrounding accuracy and standards when filing in XBRL. Additionally, XBRL’s “tagging” rules vary greatly depending on how a company is filing. If the company files its income statement with gross margin, its XBRL tags will be different from those that file without gross margin. In total, there are about 8 different ways a company can file its income statements, and they can vary from year to year if the company chooses. The SEC’s API doesn’t provide information on what type of income statement format was filed for any given year, so the formula required to fill in missing/uncalculated data is largely guess-and-check, making for an extremely inefficient process.

The Data:

The SEC supplies a public API to retrieve financial data in the form of XBRL. XBRL is the "Business Standard" version of XML, standing for "eXtensible Business Reporting Language". The challenge with XBRL is the sheer number of nuanced reporting possibilities and the lack of care taken in normalizing the data. XBRL follows many structures, but the most prominent one is the US-GAAP Taxonomy. The taxonomy defines which smaller parts combine into the bigger parts, in the case of US-GAAP funneling all the way up to the final calculation of Earnings Per Share or Net Earnings. XBRL.org is nice enough to let the public browse the taxonomies in a tree format.

Alternatively, they allow you to download the calculation files in a .zip file for your convenience.

Parsing the .zip file, I assembled all of the calculations and stored them in an income_statement_formulas.json file, which turned out to be 38,823 lines long. The plan was then to walk the XBRL data received from the API with a depth-first search, starting with Revenue. If the value of Revenue was N/A for a given year, the search would try to compute it from its possible calculations. If a variable required by that calculation was also N/A, it would try to calculate that variable in turn, diving deeper until either the N/A variable had no calculations of its own (at which point it is assumed to be 0) or every variable had a value. The calculations are then performed and collapsed upward, filling each previous layer until the top-level value is computed. This recursive method is very slow and, for some reason, doesn't work as well as one would expect, due to one serious issue: companies don't file with a uniform taxonomy, or the SEC doesn't send company concept data in a consistent order when sending it in bulk. Some values therefore remain missing, which is why Financial4All includes an option to switch between different formulas.
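The calculation files inside that .zip are XBRL calculation linkbases: XML files whose calculationArc elements carry the parent concept, child concept, and a weight. As a rough illustration (not the project's actual parser), here is a minimal sketch that extracts those arcs; the sample XML is a simplified, made-up excerpt, and it assumes the xlink labels already match the concept names (real linkbases resolve them through loc elements):

```python
import xml.etree.ElementTree as ET

# A tiny, simplified excerpt in the style of a US-GAAP calculation linkbase.
# Real files resolve xlink:from/xlink:to through <loc> elements; this sketch
# assumes the locator labels are already the concept names.
SAMPLE_LINKBASE = """<?xml version="1.0"?>
<linkbase xmlns="http://www.xbrl.org/2003/linkbase"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <calculationLink xlink:type="extended">
    <calculationArc xlink:type="arc" xlink:from="GrossProfit"
                    xlink:to="Revenues" weight="1.0" order="1"/>
    <calculationArc xlink:type="arc" xlink:from="GrossProfit"
                    xlink:to="CostOfRevenue" weight="-1.0" order="2"/>
  </calculationLink>
</linkbase>"""

LB = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"

def parse_calculation_arcs(xml_text: str) -> dict:
    """Collect {parent_concept: [(child_concept, weight), ...]} from calculationArc elements."""
    root = ET.fromstring(xml_text)
    formulas = {}
    for arc in root.iter(f"{LB}calculationArc"):
        parent = arc.get(f"{XLINK}from")
        child = arc.get(f"{XLINK}to")
        weight = float(arc.get("weight"))
        formulas.setdefault(parent, []).append((child, weight))
    return formulas

formulas = parse_calculation_arcs(SAMPLE_LINKBASE)
# formulas == {'GrossProfit': [('Revenues', 1.0), ('CostOfRevenue', -1.0)]}
```

Aggregating arcs like these across every calculation file is what produces the master formula JSON.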

The SEC API

My chosen method for retrieving company data was through the SEC's "companyfacts" endpoint, which provides all data from every XBRL disclosure for all financial reporting. The data comes in the form of JSON, which is easy to work with within Pandas with a little tweaking, as the JSON is nested with five dictionaries, a list, and then another dictionary. So what are the dictionaries?

Dictionary 1: Entity Details - Contains Company CIK and name

Dictionary 2: Facts - Contains a dictionary of taxonomy structures (XBRL structures) such as dei and us-gaap

Dictionary 3: XBRL tags for financial entries - Such as “AccountsPayableCurrent”, “Revenues”, “NetIncomeLoss”

Dictionary 4: XBRL tag attributes - Label = description of tag, and Units = What unit it’s reported in (Shares, USD, shares/usd)

Dictionary 5: Units - describes what unit the XBRL entry is in

List - A list of all entries for the XBRL tag

Dictionary - Attributes for each entry

If this seems complicated, it’s because it is. Perhaps this approach wasn’t the best one, but it’s the most accessible via the API, and I wanted all of the data reported at once. This is the data structure of each of the datapoints within each XBRL entry:

We can see that we are provided the End Date, Value, Accession Number, Fiscal Year, Fiscal Period, Form Type, Filing Date, and an optional Frame attribute. On duration-type XBRL values, each entry also has a Start Date attribute.
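To make the nesting concrete, here is a mock of the companyfacts JSON shape described above, with one made-up datapoint (field names match the endpoint's short keys; all values are illustrative):

```python
# A minimal mock of the companyfacts JSON nesting described above.
# Values are made up; real responses contain thousands of entries.
data = {
    "cik": 320193,                     # Dictionary 1: entity details
    "entityName": "EXAMPLE CO",
    "facts": {                         # Dictionary 2: taxonomy structures
        "us-gaap": {
            "Revenues": {              # Dictionary 3: XBRL tag
                "label": "Revenues",   # Dictionary 4: tag attributes
                "units": {
                    "USD": [           # Dictionary 5 + list of entries
                        {"start": "2023-01-01", "end": "2023-12-31",
                         "val": 1_000_000, "accn": "0000000000-23-000001",
                         "fy": 2023, "fp": "FY", "form": "10-K",
                         "filed": "2024-02-01", "frame": "CY2023"},
                    ]
                },
            }
        }
    },
}

# Drilling down through the dictionaries, the list, and the entry dict:
entry = data["facts"]["us-gaap"]["Revenues"]["units"]["USD"][0]
print(entry["val"], entry["form"])  # 1000000 10-K
```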

Filtering The Data

For the uninitiated, the most important thing to look at here is the Form type. Companies file their financial statements on a quarterly and an annual basis, as required by US regulation under the Securities Exchange Act of 1934. The form types are as follows:

10-Q: Quarterly Report, filed for each of the first three fiscal quarters (for a calendar-year filer, Q1 = January - March, Q2 = April - June, Q3 = July - September). There is no Q4 10-Q; the fourth quarter is covered by the annual report.

10-K: Annual Report, combines all of the reports for the year (Q1-Q3) and creates a “12-month ended” report which compares that year with the past 2-3 years. This report contains the most data, providing the true benchmark for the company’s performance.

10-K/A: This is an amendment to the 10-K, the result of fixing an error with the original 10-K or providing "updated" information. This filing usually doesn't contain a full new set of financial statements, only the corrected portions.

8-K: Very broad report form, used to report significant “material” events within the company that are not reported within the normal financial statement cycle and could sway an investor’s decision. - Not necessarily a financial statement.

For the scope of this project, I decided that I only wanted 10-K data, as the extra granularity from 10-Qs wasn't necessary for the average investor. The functions I wrote allow the user to input a stock ticker and receive the company's CIK identifier (which the SEC requires when using the API); that CIK can then be passed to a function that grabs all of the 10-K data from the API.

import requests
import pandas as pd

# CIK_dict is a DataFrame built from the SEC's company_tickers.json mapping
# file, with 'ticker' and 'cik_str' columns.

def get_cik(ticker: str):
    """
    Gets the CIK for the specified ticker
    """
    row = CIK_dict[CIK_dict['ticker'].str.upper() == ticker.upper()]
    if not row.empty:
        return row['cik_str'].values[0]
    raise ValueError(f"Ticker {ticker} not found in CIK_dict")


def get_filing_by_metrics(CIK: str):
    # The endpoint expects the CIK zero-padded to 10 digits
    url = f'https://data.sec.gov/api/xbrl/companyfacts/CIK{int(CIK):010d}.json'
    headers = {
        'User-Agent': 'your_email@email.com' # SEC requires contact info in the header of the request
    }
    response = requests.get(url, headers=headers)
    data = response.json()
    return data['facts']['us-gaap'] # Indexing into the US-GAAP taxonomy to remove the 2 outermost dictionaries


def extract_metrics(filing_metrics: dict):
    """
    Extracts the specified metrics from the filing
    
    Returns a dictionary of metrics with the following structure:
    {
        'metric_name': [
            {
                'entry_attributes': entry_attributes
            }
        ]
    }
    """
    metric_master = {}
    for metric, attributes in filing_metrics.items():
        for unit, entries in attributes['units'].items():
            for entry in entries:
                if entry['form'] == '10-K':
                    if metric + '_' + unit not in metric_master:
                        metric_master[metric + '_' + unit] = []
                    metric_master[metric + '_' + unit].append(entry)
    
    return metric_master

The result gives us the XBRL metric tags with a list of each of the entries from all 10-K forms. I then aggregate the datapoints by each End Date and remove any End Date that has 70% or more N/A values. This removes the datapoints from the “Frame” data or incomplete/partial year data that is sprinkled in. Overall, I’m not sure why this API endpoint returns partial data while still labeling it as “10-K” data.

I then format the data to show the end periods as the index and the XBRL variables as the columns, which can be transposed to show something a little more similar to what analysts are used to.

def format_metrics_efficient(extracted_metrics):
    """
    Converts extracted metrics into a clean, graphable DataFrame.
    This version filters for annual data points ('fp' == 'FY'), processes
    them based on 'end' dates, handles duplicates by prioritizing the most
    recent filing, removes rows with too many N/A values, and formats
    the 'end' date column to show only the date.
    """
    all_entries = []
    # Iterate through each metric and its corresponding entries.
    for metric_name, entries in extracted_metrics.items():
        for entry in entries:
            # Filter for entries that represent a full fiscal year ('FY').
            if entry.get('fp') == 'FY':
                record = {
                    'metric': metric_name,
                    'end': entry['end'],
                    'val': entry['val'],
                    'filed': entry['filed'],
                    'fy': entry['fy']
                }
                if 'start' in entry:
                    record['start'] = entry['start']
                all_entries.append(record)

    if not all_entries:
        return pd.DataFrame()

    # Create a DataFrame from the filtered list of annual entries.
    df = pd.DataFrame(all_entries)

    # Convert date strings to datetime objects for correct sorting.
    df['filed'] = pd.to_datetime(df['filed'])
    df['end'] = pd.to_datetime(df['end'])
    if 'start' in df.columns:
        df['start'] = pd.to_datetime(df['start'])

    # Sort by filing date to ensure the latest filing is first.
    df.sort_values(by='filed', ascending=False, inplace=True)

    # Drop duplicates for the same metric and end date, keeping the most recent filing.
    df.drop_duplicates(subset=['metric', 'end'], keep='first', inplace=True)

    # Pivot the DataFrame to get end dates as the index and metrics as columns.
    df_pivot = df.pivot_table(index='end', columns='metric', values='val')

    # Remove rows where more than 70% of the values are N/A.
    df_pivot.dropna(thresh=int(df_pivot.shape[1] * 0.3), inplace=True)  # thresh must be an integer

    # Standardize common metric names.
    metric_mapping = {
        'SellingAndMarketingExpense': 'SG&A',
        'SellingGeneralAndAdministrativeExpense': 'SG&A',
        'ResearchAndDevelopmentExpense': 'R&D',
        'ResearchAndDevelopment': 'R&D'
    }
    df_pivot.rename(columns=metric_mapping, inplace=True)

    # Consolidate columns if renaming created duplicates.
    if df_pivot.columns.has_duplicates:
        # groupby(axis=1) is deprecated in pandas 2.x; group on the transposed frame instead.
        df_pivot = df_pivot.T.groupby(level=0).first().T

    # Reset index to make 'end' a regular column and clean up.
    df_pivot.reset_index(inplace=True)
    df_pivot.rename_axis(None, axis=1, inplace=True)

    # Format the 'end' column to remove the time component, showing only the date.
    df_pivot['end'] = pd.to_datetime(df_pivot['end']).dt.date

    # Sort the final DataFrame by the end date.
    df_pivot.sort_values(by='end', ascending=True, inplace=True)

    return df_pivot


nv_efficient = format_metrics_efficient(extract_metrics(get_filing_by_metrics(get_cik('NVDA'))))
nv_efficient.T

Data Standardization

I want the user to be able to type in the Ticker, and receive an accurate income statement of all reported XBRL data. To do this effectively, I needed to define a standardized Income Statement structure, which consists of the following:

Revenue
- Cost of Revenue
= Gross Profit
- Research and Development Expense
- Admin & General Expense
- Other Operating Expense
= Operating Income
+ Other Non-Operating Income (Expense)
= Income Before Taxes
- Taxes
= Net Income

This means I need to standardize the XBRL entries and fill in the missing data for all of the variables within this list. This was the crux of the project. To do this, I needed to calculate all of the values for a given end date and income statement variable that was N/A due to either a company not bothering to fill out the full hierarchy or their reporting software handling the calculations in a strange way.

I first had to download the 2025 US-GAAP XBRL calculation files from https://xbrl.us/xbrl-taxonomy/2025-us-gaap/

Then I parsed through the calculation files and created a master list of all calculations, which had a weight of either 1 or -1 assigned. This allows for the value to be multiplied by its weight, then summed to perform subtraction automatically.
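The weight convention reduces every calculation to a plain weighted sum. A tiny example (tag names follow US-GAAP; the numbers are made up):

```python
# Each calculation is a list of (concept, weight) pairs; a weight of -1
# turns subtraction into part of a plain weighted sum.
gross_profit_formula = [("Revenues", 1), ("CostOfRevenue", -1)]
values = {"Revenues": 500, "CostOfRevenue": 300}

gross_profit = sum(values[concept] * weight
                   for concept, weight in gross_profit_formula)
print(gross_profit)  # 200
```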

I resolved this through a fairly involved set of recursive functions that scan the data for values. If a value is N/A, the function tries to compute it from each of its possible formulas; if a value inside that calculation is also N/A, it tries to calculate that value in turn. The recursion continues until it reaches either non-N/A values or an N/A that has no calculations of its own. At that point, we assume those values are truly N/A, set them to zero, perform the calculation, and return the result as the previously N/A variable. Each layer of calculations then collapses upward until we get a value for the target metric of the standardized income statement.
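The recursion can be sketched as follows. This is a simplification of the real resolver (it takes the first candidate formula rather than trying all of them, and the formula and value dictionaries here are made up for illustration):

```python
import math

def resolve(tag, values, formulas, seen=None):
    """
    Depth-first fill of a missing (NaN) value from its calculation tree.
    `values` maps tag -> number or NaN; `formulas` maps tag -> a list of
    candidate formulas, each a list of (child_tag, weight) pairs.
    Tags with no value and no formula are assumed to be 0.
    """
    seen = seen or set()
    val = values.get(tag, math.nan)
    if not math.isnan(val):
        return val
    if tag in seen:            # guard against cyclic calculation arcs
        return 0.0
    seen.add(tag)
    for formula in formulas.get(tag, []):
        total = sum(resolve(child, values, formulas, seen) * weight
                    for child, weight in formula)
        values[tag] = total    # collapse: cache the computed value upward
        return total
    return 0.0                 # leaf with no data and no formula

values = {"Revenues": 500.0, "CostOfRevenue": float("nan"),
          "CostOfGoodsSold": 280.0, "CostOfServices": 20.0}
formulas = {
    "GrossProfit": [[("Revenues", 1), ("CostOfRevenue", -1)]],
    "CostOfRevenue": [[("CostOfGoodsSold", 1), ("CostOfServices", 1)]],
}
print(resolve("GrossProfit", values, formulas))  # 200.0
```

Here CostOfRevenue is missing, so it is rebuilt from its children before GrossProfit can be computed one level up.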

As a result, we’re able to create a table like this:

It is to be noted that this works better on certain company filings than others, and the values displayed shouldn’t be taken as gospel, but after glancing at a few company filings and checking them against their official 10-Ks, this information is generally accurate.

Excel Access

As mentioned before, the SEC made it very hard to access the original Excel financial statements, putting them within the “Archive” section of the SEC website. Luckily, the SEC provides a data.sec.gov/submissions/ API endpoint where you can grab the history of all filings from a company. Using this information, you can find a filing’s “accessionNumber,” which is linked to the folder where you can find the proper Excel files for the financial statements from the filing. You can find the Excel file for financial statements at “https://www.sec.gov/Archives/edgar/data/{CIK}/{accession_number}/Financial_Report.xlsx”.

So I included a button within Financial4All that allows users to automatically download all submitted 10-Ks in their original Excel format from the SEC with ease.

The button will fetch the submissions JSON from the API endpoint, filtering down to all 10-K forms, grabbing their accession numbers, then fetching and aggregating them into a Zip file that gets downloaded within the browser.
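The URL-building part of that flow can be sketched without any network calls. This mock submissions payload follows the shape of the data.sec.gov/submissions/ "recent" filings block (parallel lists keyed by field), with made-up accession numbers; the real button also fetches each URL and zips the responses:

```python
# Mock of the data.sec.gov/submissions/ "recent" block; values are made up.
submissions = {
    "filings": {
        "recent": {
            "form": ["10-K", "10-Q", "10-K"],
            "accessionNumber": ["0000320193-23-000106",
                                "0000320193-23-000077",
                                "0000320193-22-000108"],
        }
    }
}

def excel_urls(cik: int, submissions: dict) -> list:
    """Build Financial_Report.xlsx URLs for every 10-K in the payload."""
    recent = submissions["filings"]["recent"]
    urls = []
    for form, accn in zip(recent["form"], recent["accessionNumber"]):
        if form == "10-K":
            # The Archives folder name is the accession number with dashes removed.
            folder = accn.replace("-", "")
            urls.append(f"https://www.sec.gov/Archives/edgar/data/"
                        f"{cik}/{folder}/Financial_Report.xlsx")
    return urls

for url in excel_urls(320193, submissions):
    print(url)
```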

The Dashboard

My goal for the Financial4All dashboard was to make financial data as accessible to everyday people as possible, so a clean and intuitive dashboard was important.

The Result

Financial4All allows everyone to access financial data easily, removing the added barrier to entry due to the SEC removing the ability to download Excel files for financial statements. Not only does this platform parse the majority of raw XBRL data, but it maintains access to the Excel files for the Financial Statements via bulk download. I hope this open-source project inspires others to contribute and create a fully featured community-made financial platform that is free to everyone.
