This project simplifies and extends the Data Prep Kit project to better support AI Alliance needs, especially those of the Open Trusted Data Initiative.
Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers face the enormous challenge of preparing use case-specific unstructured data to fine-tune or instruct-tune LLMs, or to build RAG applications. As the variety of use cases grows, so does the need to support:
- New ways of transforming the data to enhance the performance of the resulting LLMs for each specific use case
- A wide range of data scales, from laptop-scale to datacenter-scale
- Different data modalities, including language, code, vision, and multimodal data
 
For more information about the implementation architecture, please read this blog post.
The main components of the framework are:
- transforms - base classes for creating transforms
- runtime - the runtime responsible for starting and executing transforms
- data access - an extensible, configurable data access implementation that supports local files, S3, and Hugging Face datasets
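
To make the division of responsibilities above more concrete, here is a minimal sketch of how the three components might fit together. It is an illustration only and does not use the project's actual API: the names `Transform`, `LowercaseTransform`, `LocalDataAccess`, and `run` are hypothetical, and the sketch covers only local files, not S3 or Hugging Face datasets.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Transform(ABC):
    """Hypothetical base class: a transform maps one file's bytes to new bytes."""

    @abstractmethod
    def transform(self, data: bytes) -> bytes:
        ...


class LowercaseTransform(Transform):
    """Toy transform that lowercases UTF-8 text."""

    def transform(self, data: bytes) -> bytes:
        return data.decode("utf-8").lower().encode("utf-8")


class LocalDataAccess:
    """Hypothetical data access layer backed by local folders."""

    def __init__(self, input_dir: Path, output_dir: Path) -> None:
        self.input_dir = input_dir
        self.output_dir = output_dir

    def list_files(self) -> list[Path]:
        # Sorted so that processing order is deterministic.
        return sorted(p for p in self.input_dir.iterdir() if p.is_file())

    def read(self, path: Path) -> bytes:
        return path.read_bytes()

    def write(self, name: str, data: bytes) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        (self.output_dir / name).write_bytes(data)


def run(transform: Transform, data_access: LocalDataAccess) -> int:
    """Hypothetical runtime: apply the transform to every input file.

    Returns the number of files processed.
    """
    count = 0
    for path in data_access.list_files():
        data_access.write(path.name, transform.transform(data_access.read(path)))
        count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in", Path(tmp) / "out"
        src.mkdir()
        (src / "doc.txt").write_text("Hello WORLD", encoding="utf-8")
        processed = run(LowercaseTransform(), LocalDataAccess(src, dst))
        print(processed, (dst / "doc.txt").read_text(encoding="utf-8"))
```

In this sketch the transform knows nothing about where data lives, and the runtime knows nothing about what the transform does, so a different storage backend or execution strategy could in principle be swapped in without changing the transform code.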
 
We also provide several examples of the framework usage:
The project's code has been extensively tested using the transform testing framework.