data-engineering-helpers/metadata

Metadata, metadatalake, Modern Metadata Stack (MMS)

Overview

This project intends to collect, analyze and synthesize reference material about metadata, in order to facilitate the implementation of metadata lakes. That is, this project is a first contribution to a Modern Metadatalake Stack (MMS), much like the initiatives around the rise of the Modern Data Stack (MDS).

Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.

Other repositories of Data Engineering helpers

References

Articles

Metadata is king

From Data Catalog to Data Marketplace

The Art of Discoverability

Google paper - Big Metadata: When Metadata is Big Data

Introduction

In the past 10 years, as the modern data stack has matured and become mainstream, we’ve taken great leaps forward in data infrastructure. However, the modern data stack still has one key missing component: context. That’s where metadata comes in. In this increasingly diverse data world, metadata holds the key to the elusive promised land — a single source of truth. There will always be countless tools and tech in a team’s data infrastructure. By effectively collecting metadata, a team can finally unify context about all their tools, processes, and data.

But what actually is metadata, you ask? Simply put, metadata is “data about data”.

Today, metadata is everywhere. Every component of the modern data stack and every user interaction on it generates metadata. Apart from traditional forms like technical metadata (e.g. schemas) and business metadata (e.g. taxonomy, glossary), our data systems now create entirely new forms of metadata.

Cloud compute ecosystems and orchestration engines generate logs every second, called performance metadata. Users who interact with data assets and one another generate social metadata. Logs from BI tools, notebooks, and other applications, as well as from communication tools like Slack, generate usage metadata. Orchestration engines and raw code (e.g. SQL) used to create data assets generate provenance metadata.

Metadata lake

Frameworks

Hudi metadata table

  • Homepage: https://hudi.apache.org/docs/metadata/
  • Hudi GitHub repository: https://github.com/apache/hudi
  • Hudi tracks metadata about a table to remove bottlenecks in achieving great read/write performance, specifically on cloud storage.
    • Avoid list operations to obtain the set of files in a table
    • Expose column statistics for better query planning and faster queries
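
As a minimal sketch, the options below show how the metadata table and the column statistics index might be switched on from PySpark. The option keys are, to the best of our knowledge, those of the Hudi configuration reference and may differ between Hudi versions. The table name, the path and the two helper functions are hypothetical.

```python
# Hypothetical table name and storage path, for illustration only.
TABLE_NAME = "trips"
BASE_PATH = "/tmp/hudi/trips"


def hudi_write_options(table_name: str) -> dict[str, str]:
    """Writer options that maintain the metadata table and column statistics."""
    return {
        "hoodie.table.name": table_name,
        # Maintain the internal metadata table, so that the set of files
        # can be read from it instead of listing the storage.
        "hoodie.metadata.enable": "true",
        # Additionally index per-column statistics in the metadata table.
        "hoodie.metadata.index.column.stats.enable": "true",
    }


def hudi_read_options() -> dict[str, str]:
    """Reader options that let the query planner use the column statistics."""
    return {
        "hoodie.metadata.enable": "true",
        "hoodie.enable.data.skipping": "true",
    }


# Usage with an existing SparkSession `spark` and DataFrame `df` (not run here;
# other required write options, such as the record key, are omitted):
# df.write.format("hudi").options(**hudi_write_options(TABLE_NAME)) \
#     .mode("append").save(BASE_PATH)
# spark.read.format("hudi").options(**hudi_read_options()).load(BASE_PATH)
```

Check the option names against the documentation of the Hudi release in use before relying on them.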

DataHub

Acryl data

  • Motto: Bring clarity to your data
  • Home page: https://www.acryldata.io/
  • Open source: no
  • Overview: Acryl Cloud is a comprehensive metadata platform that joins a best-in-class catalog with data observability. Built by the team behind DataHub (see above).

Metaphor

Open Metadata

Marquez

Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata. It goes together with OpenLineage (see below). It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralizes dataset lifecycle management, and much more. Marquez was released and open sourced by WeWork.

OpenLineage

OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. It is the standard on which Marquez (see above) is built. It defines a generic model of run, job, and dataset entities identified using consistent naming strategies. The core lineage model is extensible by defining specific facets to enrich those entities.
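
The run, job and dataset entities can be illustrated with a plain Python dictionary shaped like an OpenLineage run event. This is a sketch under assumptions: the field names follow our reading of the OpenLineage specification, while the `build_run_event` helper, the producer URL, the schema URL placeholder and all namespaces and names are hypothetical. Real facets also carry base fields that are omitted here.

```python
import json
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, run_id, inputs, outputs):
    """Assemble a dictionary shaped like an OpenLineage run event.

    `inputs` and `outputs` are lists of (namespace, name) pairs.
    """

    def dataset(namespace, name):
        # Facets would enrich the dataset (schema, ownership, ...); left empty.
        return {"namespace": namespace, "name": name, "facets": {}}

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": "https://example.com/placeholder-for-the-spec-schema-url",
        "run": {"runId": run_id, "facets": {}},
        "job": {"namespace": job_namespace, "name": job_name, "facets": {}},
        "inputs": [dataset(ns, name) for ns, name in inputs],
        "outputs": [dataset(ns, name) for ns, name in outputs],
    }


event = build_run_event(
    event_type="START",
    job_namespace="example_scheduler",
    job_name="daily_orders_load",
    run_id="00000000-0000-0000-0000-000000000001",
    inputs=[("example_warehouse", "raw.orders")],
    outputs=[("example_warehouse", "curated.orders")],
)
print(json.dumps(event, indent=2))
```

Because the job and the datasets are identified by a namespace and a name, events emitted by different tools for the same dataset can be stitched together into one lineage graph by a backend such as Marquez.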

Open Data Discovery (ODD) Spec

Open Data Discovery Specification (ODD Spec): A Universal Standard for Metadata Collection

Amundsen

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.
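
The idea of ranking search results by usage can be illustrated with a toy example. This is not Amundsen's implementation: it simply orders the matching tables by a query counter, and the `Table` class, the `search` function and the sample catalog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    description: str
    query_count: int  # how often the table was queried, e.g. from query logs


def search(tables, term):
    """Return the tables matching `term`, most queried first."""
    needle = term.lower()
    matches = [
        t for t in tables
        if needle in t.name.lower() or needle in t.description.lower()
    ]
    return sorted(matches, key=lambda t: t.query_count, reverse=True)


catalog = [
    Table("tmp.orders_backup", "Ad hoc copy of orders", query_count=3),
    Table("core.orders", "Curated orders fact table", query_count=1250),
    Table("core.customers", "Customer dimension", query_count=900),
    Table("staging.orders_raw", "Raw orders landing table", query_count=40),
]

for table in search(catalog, "orders"):
    print(table.name, table.query_count)
```

Here the heavily queried `core.orders` table is listed before the staging and temporary copies, which is the behavior described above.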

Egeria

  • Motto: "Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor"
  • Organization behind it: Linux Foundation
  • Home page: https://egeria-project.org/
  • GitHub: https://github.com/odpi/egeria
  • Open source: yes

Databook

Like many technologies at Uber, Databook is well described in articles, but has not been open sourced so far.

Nemo

Dataportal

Metacat

Delta UniForm

Nessie

Iceberg catalogs

  • Home page: https://iceberg.apache.org/concepts/catalog/
  • Iceberg REST catalog OpenAPI specification: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml
  • Open source: yes
  • Iceberg was initially contributed by Netflix
  • Overview:
    • You may think of Iceberg as a format for managing data in a single table, but the Iceberg library needs a way to keep track of those tables by name. Tasks like creating, dropping, and renaming tables are the responsibility of a catalog. Catalogs manage a collection of tables that are usually grouped into namespaces. The most important responsibility of a catalog is tracking a table’s current metadata, which is provided by the catalog when you load a table.
    • Iceberg catalogs are flexible and can be implemented using almost any backend system. They can be plugged into any Iceberg runtime, and allow any processing engine that supports Iceberg to load the tracked Iceberg tables. Iceberg also comes with a number of catalog implementations that are ready to use out of the box. These include:
    • REST - a server-side catalog that’s exposed through a REST API
    • Hive Metastore - tracks namespaces and tables using a Hive metastore
    • JDBC - tracks namespaces and tables in a simple JDBC database
    • Nessie - a transactional catalog that tracks namespaces and tables in a database with git-like version control
  • Apache Iceberg - What Is It, Diving Deep With A Guest Post, May 2024, by Julien Hurault: https://seattledataguy.substack.com/p/apache-iceberg-what-is-it
  • GitHub - Tabular.io - Iceberg REST Docker image
    • Simple project to expose a catalog over REST using a Java catalog backend
    • For instance, uses AWS Glue as a backend and exposes an Iceberg REST catalog
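
The responsibilities of a catalog described in the overview above (creating, dropping and renaming tables, grouping them into namespaces, and returning the current metadata of a table on load) can be sketched with a toy in-memory class. This is not the Iceberg API: the `InMemoryCatalog` class, its method names and the metadata file paths are hypothetical, and a real catalog must also handle concurrent updates.

```python
class InMemoryCatalog:
    """Toy catalog mapping (namespace, table) to a current metadata location."""

    def __init__(self):
        self._tables = {}

    def create_table(self, namespace, name, metadata_location):
        key = (namespace, name)
        if key in self._tables:
            raise ValueError(f"table already exists: {namespace}.{name}")
        self._tables[key] = metadata_location

    def load_table(self, namespace, name):
        # Loading a table means returning its current metadata pointer.
        return self._tables[(namespace, name)]

    def rename_table(self, namespace, old_name, new_name):
        if (namespace, new_name) in self._tables:
            raise ValueError(f"table already exists: {namespace}.{new_name}")
        self._tables[(namespace, new_name)] = self._tables.pop((namespace, old_name))

    def drop_table(self, namespace, name):
        del self._tables[(namespace, name)]

    def list_tables(self, namespace):
        return sorted(name for ns, name in self._tables if ns == namespace)


catalog = InMemoryCatalog()
catalog.create_table("sales", "orders", "s3://example-bucket/orders/metadata/v1.json")
catalog.create_table("sales", "refunds", "s3://example-bucket/refunds/metadata/v1.json")
catalog.rename_table("sales", "refunds", "returns")
print(catalog.list_tables("sales"))
print(catalog.load_table("sales", "orders"))
```

The REST, Hive Metastore, JDBC and Nessie catalogs listed above differ mainly in where this mapping is stored and how it is updated.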

Hive Metastore

Apache XTable

OpenHouse

Data.all

Tools

SQL parsers

SQLGlot

GitSchemas

  • GitHub/home page: https://github.com/tdoehmen/gitschemas
  • This project provides scripts that crawl SQL files from GitHub, parse them, and extract structured database schema information. The goal is to learn about the semantics of database tables in the wild (table names, column names, foreign key relations, etc.)
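
To give an idea of what extracting schema information from a SQL file can look like, here is a minimal sketch that is not the GitSchemas approach. It assumes the DDL is valid in the SQLite dialect, loads it into an in-memory SQLite database from the Python standard library, and reads back the table names, column names and foreign keys. The `extract_schema` function and the sample DDL are hypothetical.

```python
import sqlite3


def extract_schema(ddl: str) -> dict:
    """Return {table: {"columns": [...], "foreign_keys": [...]}} for some DDL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
            # foreign_key_list rows: (id, seq, table, from, to, ...)
            foreign_keys = [
                (row[3], row[2], row[4])  # (column, referenced table, referenced column)
                for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
            ]
            schema[table] = {"columns": columns, "foreign_keys": foreign_keys}
        return schema
    finally:
        conn.close()


SAMPLE_DDL = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
"""

schema = extract_schema(SAMPLE_DDL)
for table, info in schema.items():
    print(table, info)
```

SQL files found in the wild use many dialects, so a dialect-aware parser is generally needed in practice. The SQLite route is only a convenient way to show the kind of output involved.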
