- Metadata, metadatalake, Modern Metadata Stack (MMS)
- Overview
- References
- Introduction
- Frameworks
- Tools
This project intends to collect, analyze and synthesize reference material about metadata, in order to facilitate the implementation of metadatalakes. That is, this project is a first contribution to a Modern Metadatalake Stack (MMS), much like the initiatives around the rise of the Modern Data Stack (MDS).
Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.
- Data Engineering Helpers - Knowledge Sharing - Data products
- Data Engineering Helpers - Knowledge Sharing - Data contracts
- Data Engineering Helpers - Knowledge Sharing - Data quality
- Data Engineering Helpers - Knowledge Sharing - Architecture principles
- Data Engineering Helpers - Knowledge Sharing - Data life cycle
- Data Engineering Helpers - Knowledge Sharing - Data management
- Data Engineering Helpers - Knowledge Sharing - Data lakehouse
- Data Engineering Helpers - Knowledge Sharing - Data pipeline deployment
- Data Engineering Helpers - Knowledge Sharing - Semantic layer
- The Rise of the Metadata Lake, Prukalpa, Jun. 2021: https://towardsdatascience.com/the-rise-of-the-metadata-lake-1e95127594de
- The anatomy of an active metadata platform, Prukalpa, Aug. 2021: https://towardsdatascience.com/the-anatomy-of-an-active-metadata-platform-13473091ad0d
- arXiv - The Data Lakehouse: Data Warehousing and More - 2023
- Authors: Dipankar Mazumdar, Jason Hughes, JB Onofré (all working at Dremio at the time)
- Date: October 2023
- What is Apache XTable (formerly OneTable) — Interoperability for Apache Hudi, Iceberg & Delta Lake
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn, Dipankar Mazumdar on Medium)
- Date: Dec. 2023
- The race to own open data, The fight for metadata and access control in the Lakehouse, May 2024, by Roy Hasson: https://royondata.substack.com/p/the-race-to-own-open-data
- DataHub: A generalized metadata search & discovery tool, Mars Lan, Aug. 2019: https://engineering.linkedin.com/blog/2019/data-hub
- Date: June 2025
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn)
- Link to the LinkedIn post: https://www.linkedin.com/posts/dipankar-mazumdar_lakehouse-dataengineering-softwareengineering-activity-7336406995798740992-eji9/
- Title: From Data Catalog 📚 to Data Marketplace 🛒
- Author: Jochen Christ (Jochen Christ on LinkedIn)
- Date: Jan. 2025
- Link to the LinkedIn post: https://www.linkedin.com/posts/jochenchrist_datamarketplace-datamarketplace-dataproducts-activity-7281953125140246528-BExu/
- Link to the Data Mesh Manager blog post: https://datamesh-manager.com/learn/data-catalog-vs-data-marketplace
- Title: The Art of Discoverability and Reverse Engineering User Happiness
- Authors: Animesh Kumar and Travis Thompson
- Date: Dec. 2024
- Link to the article: https://moderndata101.substack.com/p/the-art-of-discoverability-and-reverse
- Title: Big Metadata: When Metadata is Big Data
- Publisher: Google
- Authors:
- Pavan Edara (Pavan Edara on LinkedIn)
- Mosha Pasumansky (Mosha Pasumansky on LinkedIn)
- Link to the PDF article: https://vldb.org/pvldb/vol14/p3083-edara.pdf
In the past 10 years, as the modern data stack has matured and become mainstream, we’ve taken great leaps forward in data infrastructure. However, the modern data stack still has one key missing component: context. That’s where metadata comes in. In this increasingly diverse data world, metadata holds the key to the elusive promised land — a single source of truth. There will always be countless tools and tech in a team’s data infrastructure. By effectively collecting metadata, a team can finally unify context about all their tools, processes, and data.
But what actually is metadata, you ask? Simply put, metadata is “data about data”.
Today, metadata is everywhere. Every component of the modern data stack and every user interaction on it generates metadata. Apart from traditional forms like technical metadata (e.g. schemas) and business metadata (e.g. taxonomy, glossary), our data systems now create entirely new forms of metadata.
Cloud compute ecosystems and orchestration engines generate logs every second, called performance metadata. Users who interact with data assets and one another generate social metadata. Logs from BI tools, notebooks, and other applications, as well as from communication tools like Slack, generate usage metadata. Orchestration engines and raw code (e.g. SQL) used to create data assets generate provenance metadata.
- Homepage: https://hudi.apache.org/docs/metadata/
- Hudi GitHub repository: https://github.com/apache/hudi
- Hudi tracks metadata about a table to remove bottlenecks in achieving great read/write performance, specifically on cloud storage.
- Avoid list operations to obtain the set of files in a table
- Expose column statistics for better query planning and faster queries
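To illustrate why tracking the file listing and column statistics helps, here is a minimal sketch in Python. It does not use Hudi's API: `FileStats`, `MetadataIndex` and `prune_files` are hypothetical names, the file names are made up, and the min/max pruning shown is the general data-skipping idea rather than Hudi's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


@dataclass
class MetadataIndex:
    """Toy stand-in for a table-level metadata index; not Hudi's real layout."""
    files: list = field(default_factory=list)

    def list_files(self):
        # Served from the index, so no LIST call against cloud storage is needed.
        return [f.path for f in self.files]

    def prune_files(self, lower, upper):
        # Keep only files whose [min, max] range overlaps [lower, upper].
        return [
            f.path
            for f in self.files
            if f.max_value >= lower and f.min_value <= upper
        ]


index = MetadataIndex([
    FileStats("part-0001.parquet", 0, 99),
    FileStats("part-0002.parquet", 100, 199),
    FileStats("part-0003.parquet", 200, 299),
])

# A query filtering on values between 150 and 220 only needs two of three files.
candidates = index.prune_files(150, 220)
```

In this sketch the query planner never touches `part-0001.parquet`, which is the kind of saving that column statistics are meant to enable.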
- Motto: "A Metadata Platform for the Modern Data Stack"
- Home page: https://datahubproject.io/
- GitHub: https://github.com/linkedin/datahub
- Companies behind: LinkedIn and Acryl Data (see below)
- Open source: yes
- Overview: DataHub is an open-source metadata platform for the modern data stack.
- References:
- Read about the architectures of different metadata systems and why DataHub excels.
- Also read the LinkedIn Engineering blog post,
- Check out the Strata presentation
- And watch the Crunch Conference Talk.
- You should also visit DataHub Architecture to get a better understanding of how DataHub is implemented
- And DataHub Onboarding Guide to understand how to extend DataHub for your own use cases.
- Motto: Bring clarity to your data
- Home page: https://www.acryldata.io/
- Open source: no
- Overview: Acryl Cloud is a comprehensive metadata platform that joins a best-in-class catalog with data observability. Built by the team behind DataHub (see above).
- Motto: "Data Mastery for the Whole Company" "A modern data catalog powered by social data intelligence and AI - from the creators of DataHub"
- Home page: https://metaphor.io/
- Open source: no
- Articles on the principles:
- The Grand Rewrite of DataHub, by Mars Lan et al, Sep. 2023 - https://metaphor.io/blog/the-grand-rewrite-of-datahub
- The Modern Metadata Platform (MMP): What, Why, and How? by Mars Lan et al, Jan. 2022 - https://metaphor.io/blog/the-modern-metadata-platform-what-why-and-how
- DataHub: A generalized metadata search & discovery tool, by Mars Lan et al, Aug. 2019 - https://engineering.linkedin.com/blog/2019/data-hub
- Motto: "A Single place to Discover, Collaborate, and Get your data right"
- Home page: https://open-metadata.org/
- GitHub: https://github.com/open-metadata/OpenMetadata
- Organization behind: former employees of Uber and Hortonworks
- Open source: yes
- GitHub: https://github.com/MarquezProject/marquez
- Organization behind: WeWork / Datakin
- Open source: yes
Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata, going together with OpenLineage (see below). It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.
- Home page: http://openlineage.io/
- GitHub: https://github.com/OpenLineage/OpenLineage
- Organization behind: WeWork / Datakin
- Open source: yes
OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. It is the standard on which Marquez (see above) is built. It defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible by defining specific facets to enrich those entities.
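The run, job and dataset entities can be pictured with a short Python sketch that builds an event as a plain dictionary. This is a simplified illustration, not the official client: the field names follow the general shape of an OpenLineage run event as far as I understand it, while `make_run_event`, `dataset`, the namespace, the producer URL and the dataset names are all hypothetical. Real facets carry additional bookkeeping fields that are omitted here, so check the specification before relying on this layout.

```python
import json
import uuid
from datetime import datetime, timezone


def dataset(name, facets=None):
    """Describe a dataset; the namespace value is a made-up example."""
    return {"namespace": "example-namespace", "name": name, "facets": facets or {}}


def make_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like a (simplified) OpenLineage run event."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",
        "run": {"runId": run_id or str(uuid.uuid4()), "facets": {}},
        "job": {"namespace": "example-namespace", "name": job_name, "facets": {}},
        "inputs": inputs,
        "outputs": outputs,
    }


# A simplified schema facet enriching the output dataset.
orders_clean = dataset(
    "warehouse.orders_clean",
    facets={"schema": {"fields": [{"name": "order_id", "type": "BIGINT"}]}},
)

event = make_run_event(
    "COMPLETE",
    job_name="clean_orders",
    inputs=[dataset("warehouse.orders_raw")],
    outputs=[orders_clean],
)
payload = json.dumps(event, indent=2)
```

A lineage backend such as Marquez would receive payloads of this kind and stitch them into a graph of jobs and datasets.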
- GitHub: https://github.com/opendatadiscovery/opendatadiscovery-specification
- Open standard: yes
Open Data Discovery Specification (ODD Spec): A Universal Standard for Metadata Collection
- GitHub: https://github.com/amundsen-io/amundsen
- Company behind: Lyft
- Open source: yes
Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.
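The idea of ranking search results by usage can be sketched in a few lines of Python. This is only an illustration of the principle, not Amundsen's actual ranking algorithm: the `search` function, the table names and the query counts are hypothetical.

```python
def search(tables, usage_counts, term):
    """Return tables whose name contains `term`, most queried first."""
    matches = [t for t in tables if term.lower() in t.lower()]
    # Tables without recorded usage are treated as never queried.
    return sorted(matches, key=lambda t: usage_counts.get(t, 0), reverse=True)


tables = ["staging.orders_tmp", "sales.orders", "sales.customers"]
usage_counts = {"sales.orders": 120, "staging.orders_tmp": 3}

results = search(tables, usage_counts, "orders")
```

Here the heavily queried `sales.orders` ranks above the rarely used staging table, even though both match the search term equally well.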
- Motto: "Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor"
- Organization behind it: Linux Foundation
- Home page: https://egeria-project.org/
- GitHub: https://github.com/odpi/egeria
- Open source: yes
- Motto: "Turning Big Data into Knowledge with Metadata at Uber"
- Uber: https://eng.uber.com/databook/
- Company behind: Uber
- Open source: no
Like many technologies at Uber, Databook is well described in articles, but has not been open sourced so far.
- Motto: "Data discovery at Facebook"
- Facebook: https://engineering.fb.com/2020/10/09/data-infrastructure/nemo/
- Company behind: Facebook
- Open source: no
- Motto: "Democratizing Data at Airbnb"
- Medium: https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
- Company behind: Airbnb
- Open source: no
- Motto: "Making Big Data Discoverable and Meaningful at Netflix"
- Medium: https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520
- Company behind: Netflix
- Open source: ?
- Home page:
- Documentation: https://docs.delta.io/latest/delta-uniform.html
- Open source: yes
- Articles:
- Delta Lake Universal Format (UniForm) for Iceberg compatibility, now Generally Available (GA):
- Link to the article: https://www.databricks.com/blog/delta-lake-universal-format-uniform-iceberg-compatibility-now-ga
- Authors: Jonathan Brito, Fred Liu and Susan Pierce
- Date: June 2024
- Home page: https://projectnessie.org/
- GitHub: https://github.com/projectnessie/nessie
- Open source: yes
- Documentation: https://projectnessie.org/nessie-latest/
- Article from Apr. 2024, by Ciro Greco: https://towardsdatascience.com/write-audit-publish-for-data-lakes-in-pure-python-no-jvm-25fbd971b17d
- Overview: Transactional catalog for data lakes
- Git-inspired data version control
- Cross-table transactions and visibility
- Open data lake approach, supporting Hive, Spark, Dremio, AWS Athena, etc.
- Works with Apache Iceberg tables
- Run as a Docker image or on Kubernetes
- Home page: https://iceberg.apache.org/concepts/catalog/
- Iceberg REST catalog OpenAPI specification: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml
- Open source: yes
- Iceberg was initially contributed by Netflix
- Overview:
- You may think of Iceberg as a format for managing data in a single table, but the Iceberg library needs a way to keep track of those tables by name. Tasks like creating, dropping, and renaming tables are the responsibility of a catalog. Catalogs manage a collection of tables that are usually grouped into namespaces. The most important responsibility of a catalog is tracking a table’s current metadata, which is provided by the catalog when you load a table.
- Iceberg catalogs are flexible and can be implemented using almost any backend system. They can be plugged into any Iceberg runtime, and allow any processing engine that supports Iceberg to load the tracked Iceberg tables. Iceberg also comes with a number of catalog implementations that are ready to use out of the box. This includes:
- REST - a server-side catalog that’s exposed through a REST API
- Hive Metastore - tracks namespaces and tables using a Hive metastore
- JDBC - tracks namespaces and tables in a simple JDBC database
- Nessie - a transactional catalog that tracks namespaces and tables in a database with git-like version control
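The catalog responsibilities described above (creating, dropping and renaming tables, and tracking each table's current metadata) can be sketched as a small in-memory class. This is a toy model, not the Iceberg library's API: `InMemoryCatalog`, its method names and the `s3://` paths are hypothetical, and the compare-and-swap in `commit` is an assumption about how a catalog can guard the metadata pointer against concurrent writers.

```python
class InMemoryCatalog:
    """Toy catalog: maps (namespace, table) to the current metadata location."""

    def __init__(self):
        self._tables = {}

    def create_table(self, namespace, name, metadata_location):
        key = (namespace, name)
        if key in self._tables:
            raise ValueError(f"table already exists: {namespace}.{name}")
        self._tables[key] = metadata_location

    def load_table(self, namespace, name):
        # Returns the pointer to the table's current metadata.
        return self._tables[(namespace, name)]

    def commit(self, namespace, name, expected, new_location):
        # Swap the pointer only if nobody else committed in between.
        key = (namespace, name)
        if self._tables[key] != expected:
            raise RuntimeError("concurrent update detected; retry the commit")
        self._tables[key] = new_location

    def rename_table(self, namespace, old_name, new_name):
        self._tables[(namespace, new_name)] = self._tables.pop((namespace, old_name))

    def drop_table(self, namespace, name):
        del self._tables[(namespace, name)]

    def list_tables(self, namespace):
        return sorted(n for (ns, n) in self._tables if ns == namespace)


catalog = InMemoryCatalog()
catalog.create_table("analytics", "events", "s3://bucket/events/metadata/v1.json")
catalog.commit(
    "analytics",
    "events",
    expected="s3://bucket/events/metadata/v1.json",
    new_location="s3://bucket/events/metadata/v2.json",
)
catalog.rename_table("analytics", "events", "page_events")
```

Because the catalog only stores a pointer, any engine that can reach the catalog can find the same current table metadata, whichever backend holds the mapping.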
- Apache Iceberg - What Is It, Diving Deep With A Guest Post, May 2024, by Julien Hurault: https://seattledataguy.substack.com/p/apache-iceberg-what-is-it
- GitHub - Tabular.io - Iceberg REST Docker image
- Simple project to expose a catalog over REST using a Java catalog backend
- For instance, uses AWS Glue as a backend and exposes an Iceberg REST catalog
- Home page (part of the Hive documentation on the Apache wiki): https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore
- GitHub (part of the Hive repository on GitHub): https://github.com/apache/hive/tree/master/metastore
- Open source: yes
- Apache wiki - AdminManual Metastore 3.0 Administration
- Articles:
- Home page: https://xtable.apache.org/
- GitHub: https://github.com/apache/incubator-xtable
- Open source: yes
- Articles:
- arXiv - The Data Lakehouse: Data Warehousing and More - 2023
- Authors: Dipankar Mazumdar, Jason Hughes, JB Onofré (all working at Dremio at the time)
- Date: October 2023
- What is Apache XTable (formerly OneTable) — Interoperability for Apache Hudi, Iceberg & Delta Lake
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn, Dipankar Mazumdar on Medium)
- Date: Dec. 2023
- Home page: https://www.openhousedb.org/
- GitHub: https://github.com/linkedin/openhouse
- Open source: yes
- Articles:
- Open sourcing OpenHouse
- Author: Sumedh Sakdeo
- Date: March 2024
- Link to the article: https://www.linkedin.com/blog/engineering/open-source/open-sourcing-openhouse
- Taking Charge of Tables: Introducing OpenHouse for Big Data Management
- Author: Sumedh Sakdeo
- Date: July 2023
- Link to the article: https://www.linkedin.com/blog/engineering/data-management/taking-charge-of-tables--introducing-openhouse-for-big-data-mana
- Company behind: LinkedIn
- Overview: OpenHouse is an open source control plane designed for efficient management of tables within open data lakehouse deployments. The control plane comprises a declarative catalog and a suite of data services. Users can seamlessly define Tables, their schemas, and associated metadata declaratively within the catalog. OpenHouse reconciles the observed state of Tables with the desired state by orchestrating various data services.
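The reconciliation of observed state with desired state can be pictured with a minimal Python sketch. It does not use OpenHouse's API: the `reconcile` function, the table names and the `retention_days` property are hypothetical, and a real control plane would dispatch the resulting actions to data services rather than return them.

```python
def reconcile(desired, observed):
    """Compare declared table specs with what exists and list needed actions."""
    actions = []
    for table, spec in sorted(desired.items()):
        current = observed.get(table)
        if current is None:
            actions.append(("create", table, spec))
        elif current != spec:
            actions.append(("update", table, spec))
    return actions


# Declarative specs written by users (desired state).
desired = {
    "db.events": {"retention_days": 30},
    "db.users": {"retention_days": 365},
}
# What the control plane currently sees in the lakehouse (observed state).
observed = {
    "db.events": {"retention_days": 90},
}

actions = reconcile(desired, observed)
```

Running the same function again once the actions have been applied would yield an empty list, which is the steady state a declarative control plane aims for.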
- Home page: https://data-dot-all.github.io/dataall/
- GitHub: https://github.com/data-dot-all/dataall
- Open source: yes
- Company behind: AWS
- Home page: https://sqlglot.com/sqlglot.html
- GitHub page: https://github.com/tobymao/sqlglot
- Companion of SQLMesh
- SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine.
- It can be used to format SQL or translate between 21 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.
- It is a very comprehensive generic SQL parser with a robust test suite. It is also quite performant, while being written purely in Python.
- You can easily customize the parser, analyze queries, traverse expression trees, and programmatically build SQL.
- Syntax errors are highlighted and dialect incompatibilities can warn or raise depending on configurations. However, SQLGlot does not aim to be a SQL validator, so it may fail to detect certain syntax errors.
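As a brief illustration of translating between dialects and traversing expression trees, here is a hedged Python sketch. It assumes the third-party `sqlglot` package is installed (`pip install sqlglot`) and degrades to returning `None` when it is not; `translate` and `referenced_tables` are hypothetical helper names, and the queries are made up.

```python
try:
    import sqlglot
    from sqlglot import exp
except ImportError:  # sqlglot is not part of the standard library
    sqlglot = None


def translate(sql, read, write):
    """Transpile one SQL string from the `read` dialect to the `write` dialect."""
    if sqlglot is None:
        return None
    # transpile returns one output string per input statement.
    return sqlglot.transpile(sql, read=read, write=write)


def referenced_tables(sql, dialect=None):
    """Collect the names of the tables a query reads from."""
    if sqlglot is None:
        return None
    tree = sqlglot.parse_one(sql, read=dialect)
    return sorted({table.name for table in tree.find_all(exp.Table)})


translated = translate("SELECT CAST(x AS TEXT) FROM t", read="duckdb", write="spark")
tables = referenced_tables("SELECT a FROM t1 JOIN t2 ON t1.id = t2.id")
```

Extracting table references this way is one route to lineage-style metadata, although, as noted above, SQLGlot does not aim to be a SQL validator.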
- GitHub/home page: https://github.com/tdoehmen/gitschemas
- This project features scripts that crawl SQL files from GitHub, parse them, and extract structured database schema information. The goal is to learn about the semantics of database tables in the wild (table names, column names, foreign key relations, etc.).
