support pre-tokenized parquet datasets #9351
Conversation
Gemini Code Assist summary of changes: This pull request enhances LLaMA-Factory's data loading capabilities by introducing native support for pre-tokenized datasets stored in Parquet files. This allows users to feed already-processed token IDs directly into the training pipeline, streamlining workflows and reducing data preparation overhead by avoiding redundant tokenization.
Code Review
This pull request introduces a valuable feature to support pre-tokenized datasets in Parquet format, which can significantly speed up data loading by skipping the tokenization step. The implementation adds a new data loader for Parquet files and a corresponding data collator. The changes are well-structured.
However, I've identified a few areas for improvement:
- There is a critical issue with class inheritance in the new TokenizedIdsCollator that will lead to a runtime error.
- The Parquet file loader is currently limited to local files and should be extended to support cloud storage paths.
- The current implementation does not support evaluation datasets for the pre-tokenized format, which is a significant limitation.
- The new functionality lacks unit tests, which are important for ensuring correctness and preventing future regressions.
Detailed comments are provided below.
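For context on the collator concern above: a collator for pre-tokenized samples typically only needs to pad `input_ids` to the batch maximum and mirror that padding in the attention mask and labels. A minimal sketch follows; the class name, defaults, and list-based return type are hypothetical and not necessarily how the PR's actual TokenizedIdsCollator is written (a real collator would also convert the padded lists to tensors).

```python
class PreTokenizedCollator:
    """Pads pre-tokenized samples to the longest sequence in the batch.

    Hypothetical sketch; the PR's TokenizedIdsCollator may differ.
    """

    def __init__(self, pad_token_id=0, label_pad_token_id=-100):
        self.pad_token_id = pad_token_id
        self.label_pad_token_id = label_pad_token_id

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = list(f["input_ids"])
            pad = max_len - len(ids)
            # Use the provided attention_mask if present, else all ones.
            mask = list(f.get("attention_mask", [1] * len(ids)))
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append(mask + [0] * pad)
            # Padding positions are masked out of the loss with -100.
            labels.append(ids + [self.label_pad_token_id] * pad)
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }
```

Inheriting from an incompatible base class (as flagged above) would break the `__call__` contract at runtime, which is why a standalone callable like this is the simplest safe shape.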
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request introduces a valuable feature by adding support for pre-tokenized datasets in Parquet format, which can significantly speed up data loading for large-scale training. The implementation is well-structured, with new modules for loading parquet files and a dedicated data collator. My review includes a critical fix for a missing import that would prevent the code from running, along with several suggestions to improve performance, code clarity, and maintainability. Overall, this is a great addition, and with these changes, it will be even more robust and efficient.
/gemini review
Code Review
This pull request introduces support for pre-tokenized datasets in Parquet format, which is a valuable addition for users who want to skip the tokenization step. The implementation includes a new Parquet loader, a data collator for pre-tokenized samples, and updates to the data loading logic and arguments. The code is well-structured, but I have identified a few areas for improvement, mainly concerning efficiency and robustness. My suggestions aim to make the code more concise and performant.
What does this PR do?
Supports pre-tokenized datasets in Parquet format and skips the tokenization step when the data has already been tokenized.
Motivation
I needed to use Parquet files with pre-tokenized data, but LLaMA-Factory didn't support the format. This adds Parquet support and skips tokenization when data is already tokenized.
Changes
New files:
- src/llamafactory/data/tokenized_parquet.py - Parquet loader
- src/llamafactory/data/collator_tokenized.py - Data collator for pre-tokenized samples

Usage Example
Parquet files should have input_ids (required) and attention_mask (optional) columns.

Before submitting
Did you run make style && make quality?