
Conversation

@AbdulmalikDS

What does this PR do?

Support pre-tokenized datasets in Parquet format and skip the tokenization step when it has already been done.

Motivation

I needed to use Parquet files with pre-tokenized data, but LLaMA-Factory didn't support the format. This adds Parquet support and skips tokenization when data is already tokenized.

Changes

New files:

  • src/llamafactory/data/tokenized_parquet.py - Parquet loader
  • src/llamafactory/data/collator_tokenized.py - Data collator for pre-tokenized samples

Usage Example

dataset_format: tokenized_ids
data_files:
  - /path/to/chunk_000.parquet
  - /path/to/chunk_001.parquet

Each Parquet file should have an input_ids column (required) and an attention_mask column (optional).

Before submitting

  • Did you read the contributor guideline?
  • Did you run make style && make quality?
  • Did you write any new necessary tests?

@gemini-code-assist
Contributor

Summary of Changes

Hello @AbdulmalikDS, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances LLaMA-Factory's data loading capabilities by introducing native support for pre-tokenized datasets stored in Parquet files. This allows users to directly feed already processed token IDs into the training pipeline, streamlining workflows and potentially reducing data preparation overhead by avoiding redundant tokenization.

Highlights

  • Parquet Dataset Support: Introduces the ability to load and process datasets stored in Parquet format, specifically designed for pre-tokenized data.
  • Skip Tokenization: Implements a mechanism to bypass the tokenization step during training when the input data is already pre-tokenized, improving efficiency.
  • Dedicated Data Collator: Adds a new TokenizedIdsCollator to handle batching and padding for pre-tokenized input_ids and attention_mask from Parquet files.
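The collator's code is not reproduced in this thread. Purely as an illustration of what batching pre-tokenized samples involves, a minimal list-based sketch could right-pad each sequence to the longest in the batch; the real TokenizedIdsCollator presumably returns torch tensors and uses the tokenizer's pad token, and pad_token_id=0 here is an assumption.

```python
def collate_tokenized(batch, pad_token_id=0):
    """Right-pad pre-tokenized samples to the longest sequence in the batch.

    Each sample is a dict with "input_ids" and optionally "attention_mask".
    Returns plain lists for illustration; a real collator would convert
    these to tensors.
    """
    max_len = max(len(s["input_ids"]) for s in batch)
    input_ids, attention_mask = [], []
    for s in batch:
        ids = list(s["input_ids"])
        mask = list(s.get("attention_mask", [1] * len(ids)))
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)  # pad tokens on the right
        attention_mask.append(mask + [0] * pad)       # padded positions masked out
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```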

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature to support pre-tokenized datasets in Parquet format, which can significantly speed up data loading by skipping the tokenization step. The implementation adds a new data loader for Parquet files and a corresponding data collator. The changes are well-structured.

However, I've identified a few areas for improvement:

  • There is a critical issue with class inheritance in the new TokenizedIdsCollator that will lead to a runtime error.
  • The Parquet file loader is currently limited to local files and should be extended to support cloud storage paths.
  • The current implementation does not support evaluation datasets for the pre-tokenized format, which is a significant limitation.
  • The new functionality lacks unit tests, which are important for ensuring correctness and preventing future regressions.

Detailed comments are provided below.

@AbdulmalikDS
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature by adding support for pre-tokenized datasets in Parquet format, which can significantly speed up data loading for large-scale training. The implementation is well-structured, with new modules for loading parquet files and a dedicated data collator. My review includes a critical fix for a missing import that would prevent the code from running, along with several suggestions to improve performance, code clarity, and maintainability. Overall, this is a great addition, and with these changes, it will be even more robust and efficient.

AbdulmalikDS and others added 4 commits October 26, 2025 16:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@AbdulmalikDS
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for pre-tokenized datasets in Parquet format, which is a valuable addition for users who want to skip the tokenization step. The implementation includes a new Parquet loader, a data collator for pre-tokenized samples, and updates to the data loading logic and arguments. The code is well-structured, but I have identified a few areas for improvement, mainly concerning efficiency and robustness. My suggestions aim to make the code more concise and performant.

AbdulmalikDS and others added 3 commits October 29, 2025 13:45
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>