
Conversation

@AbdulmalikDS

What does this PR do?

Support pre-tokenized datasets in Parquet format and skip the tokenization step when it has already been done.

Motivation

I needed to use Parquet files with pre-tokenized data, but LLaMA-Factory didn't support the format. This adds Parquet support and skips tokenization when data is already tokenized.

Changes

New files:

  • src/llamafactory/data/tokenized_parquet.py - Parquet loader
  • src/llamafactory/data/collator_tokenized.py - Data collator for pre-tokenized samples

Usage Example

dataset_format: tokenized_ids
data_files:
  - /path/to/chunk_000.parquet
  - /path/to/chunk_001.parquet

Each Parquet file should have an input_ids column (required) and an attention_mask column (optional).

Before submitting

  • Did you read the contributor guideline?
  • Did you run make style && make quality?
  • Did you write any new necessary tests?

@gemini-code-assist
Contributor

Summary of Changes

Hello @AbdulmalikDS, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances LLaMA-Factory's data loading capabilities by introducing native support for pre-tokenized datasets stored in Parquet files. This allows users to directly feed already processed token IDs into the training pipeline, streamlining workflows and potentially reducing data preparation overhead by avoiding redundant tokenization.

Highlights

  • Parquet Dataset Support: Introduces the ability to load and process datasets stored in Parquet format, specifically designed for pre-tokenized data.
  • Skip Tokenization: Implements a mechanism to bypass the tokenization step during training when the input data is already pre-tokenized, improving efficiency.
  • Dedicated Data Collator: Adds a new TokenizedIdsCollator to handle batching and padding for pre-tokenized input_ids and attention_mask from Parquet files.
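The collator's code is not reproduced in this thread. Purely as an illustration of what batching pre-tokenized samples involves, a minimal list-based sketch could right-pad each sequence to the longest in the batch; the real TokenizedIdsCollator presumably returns torch tensors and uses the tokenizer's pad token, and pad_token_id=0 here is an assumption.

```python
def collate_tokenized(batch, pad_token_id=0):
    """Right-pad pre-tokenized samples to the longest sequence in the batch.

    Each sample is a dict with "input_ids" and optionally "attention_mask".
    Returns plain lists for illustration; a real collator would convert
    these to tensors.
    """
    max_len = max(len(s["input_ids"]) for s in batch)
    input_ids, attention_mask = [], []
    for s in batch:
        ids = list(s["input_ids"])
        mask = list(s.get("attention_mask", [1] * len(ids)))
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)  # pad tokens on the right
        attention_mask.append(mask + [0] * pad)       # padded positions masked out
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```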

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature to support pre-tokenized datasets in Parquet format, which can significantly speed up data loading by skipping the tokenization step. The implementation adds a new data loader for Parquet files and a corresponding data collator. The changes are well-structured.

However, I've identified a few areas for improvement:

  • There is a critical issue with class inheritance in the new TokenizedIdsCollator that will lead to a runtime error.
  • The Parquet file loader is currently limited to local files and should be extended to support cloud storage paths.
  • The current implementation does not support evaluation datasets for the pre-tokenized format, which is a significant limitation.
  • The new functionality lacks unit tests, which are important for ensuring correctness and preventing future regressions.

Detailed comments are provided below.

@AbdulmalikDS
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature by adding support for pre-tokenized datasets in Parquet format, which can significantly speed up data loading for large-scale training. The implementation is well-structured, with new modules for loading parquet files and a dedicated data collator. My review includes a critical fix for a missing import that would prevent the code from running, along with several suggestions to improve performance, code clarity, and maintainability. Overall, this is a great addition, and with these changes, it will be even more robust and efficient.

AbdulmalikDS and others added 4 commits October 26, 2025 16:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@AbdulmalikDS
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for pre-tokenized datasets in Parquet format, which is a valuable addition for users who want to skip the tokenization step. The implementation includes a new Parquet loader, a data collator for pre-tokenized samples, and updates to the data loading logic and arguments. The code is well-structured, but I have identified a few areas for improvement, mainly concerning efficiency and robustness. My suggestions aim to make the code more concise and performant.

AbdulmalikDS and others added 3 commits October 29, 2025 13:45
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>