This repository was archived by the owner on Aug 10, 2025. It is now read-only.

The-AI-Alliance/dpk-alliance

DPK Alliance

This project simplifies and extends the Data Prep Kit project to better support AI Alliance needs, especially those of the Open Trusted Data Initiative.

Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers face the enormous challenge of preparing use case-specific unstructured data to fine-tune or instruct-tune LLMs, or to build RAG applications. As the variety of use cases grows, so does the need to support:

  • New ways of transforming data to enhance the performance of the resulting LLMs for each specific use case.
  • A wide range of data scales, from laptop-scale to datacenter-scale.
  • Different data modalities, including language, code, vision, and multimodal data.

For more information about the implementation architecture, please read this blog post.

The main components of the framework are:

  • transforms - base classes for creating transforms
  • runtime - the runtime implementation responsible for starting and executing transforms
  • data access - an extensible, configurable data access implementation that supports local files, S3, and Hugging Face datasets
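
The way these three components fit together can be sketched as follows. This is a minimal illustration only, not the project's actual API: the names `AbstractTransform`, `LowercaseTransform`, `LocalDataAccess`, and `run` are all hypothetical, and the sketch assumes that a transform maps the bytes of one input file to the bytes of one output file.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class AbstractTransform(ABC):
    """Hypothetical base class: maps one input payload to one output payload."""

    @abstractmethod
    def transform(self, data: bytes) -> bytes:
        ...


class LowercaseTransform(AbstractTransform):
    """Toy transform used only for illustration: lowercases UTF-8 text."""

    def transform(self, data: bytes) -> bytes:
        return data.decode("utf-8").lower().encode("utf-8")


class LocalDataAccess:
    """Hypothetical data access for local folders.

    An S3 or Hugging Face backend would expose the same three methods.
    """

    def __init__(self, input_dir: Path, output_dir: Path) -> None:
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)

    def list_files(self) -> list[str]:
        return sorted(p.name for p in self.input_dir.iterdir() if p.is_file())

    def read(self, name: str) -> bytes:
        return (self.input_dir / name).read_bytes()

    def write(self, name: str, data: bytes) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        (self.output_dir / name).write_bytes(data)


def run(transform: AbstractTransform, data_access: LocalDataAccess) -> int:
    """Hypothetical runtime: applies the transform to every input file.

    Returns the number of files processed.
    """
    count = 0
    for name in data_access.list_files():
        data_access.write(name, transform.transform(data_access.read(name)))
        count += 1
    return count


# Usage: process a single file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    (src / "a.txt").write_text("Hello DPK", encoding="utf-8")
    processed = run(LowercaseTransform(), LocalDataAccess(src, dst))
    result = (dst / "a.txt").read_text(encoding="utf-8")
```

Keeping the transform unaware of where its data comes from is what would let the same transform run on a laptop or at datacenter scale by swapping only the data access and runtime pieces.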

We also provide several examples of the framework usage:

The project's code was tested extensively using the transform testing framework.
