TypeError when converting to GGUF with Sentencepiece tokenizer in Colab #117

@CuriosityWave

Description

I have hit this error fairly consistently over the past two months when using your "ChatML + chat templates + Mistral 7b full example" notebook to finetune a number of different models, from Gemma to Llama-2 and Llama-3. This particular run was with a Mistral model, this one to be exact: teknium/OpenHermes-2.5-Mistral-7B

Everything goes fine except saving the finetuned model as a GGUF file, which always fails with this TypeError about a duplicate sentencepiece file name. I have been, and still am, able to push the finetuned model in its non-GGUF form (I call it its full form) to a Hugging Face repo, usually as 4-bit. I then download it and convert it to GGUF myself locally with the llama.cpp Python conversion script, so it was not a huge deal for me personally. But since I have seen the exact same error for a while (and I finally believe it is not due to something dumb I am doing - I think...), I thought you might want to know about it in case it is a bug or something you would want to, or could, address on your end.
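
For reference, my local conversion looks roughly like the sketch below. It is only a minimal, hypothetical example of driving llama.cpp's conversion script and quantizer from Python; the script and binary names (convert_hf_to_gguf.py, llama-quantize) and the paths are assumptions that depend on the llama.cpp checkout and version.

```python
# Minimal sketch of the local workaround (assumptions: llama.cpp is cloned to
# ./llama.cpp and already built, and the merged 16-bit model was downloaded to
# ./xyntrai-mistral-2.5-7b; script/binary names differ between llama.cpp versions).
import subprocess

model_dir = "./xyntrai-mistral-2.5-7b"
f16_gguf = "./xyntrai-mistral-2.5-7b-f16.gguf"
q4_gguf = "./xyntrai-mistral-2.5-7b-q4_k_m.gguf"

# 1) Convert the Hugging Face checkpoint to a 16-bit GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# 2) Quantize the 16-bit GGUF down to Q4_K_M.
subprocess.run(
    ["llama.cpp/llama-quantize", f16_gguf, q4_gguf, "Q4_K_M"],
    check=True,
)
```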

The full error is attached below. It looks like your sentencepiece tokenizer fix imports sentencepiece_model_pb2 and calls ModelProto().

Everything looks fine until a Google protobuf library pulls in its dependencies, with sentencepiece likely one of them. At that point both that sentencepiece_model_pb2 and the one you imported appear to get registered in the same global descriptor pool, which may be where the duplicate-file error is coming from.
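
To show what I mean (and, again, I may be wrong), here is a minimal diagnostic sketch of that theory: checking whether a file named sentencepiece_model.proto is already registered in protobuf's default descriptor pool before transformers.utils.sentencepiece_model_pb2 tries to register it a second time. The import ordering is only my assumption about what happens in the notebook.

```python
# Minimal diagnostic sketch for the duplicate-registration theory: the TypeError
# suggests "sentencepiece_model.proto" gets registered twice in the same default
# descriptor pool (e.g. once by sentencepiece's own generated module and once by
# transformers.utils.sentencepiece_model_pb2).
from google.protobuf import descriptor_pool

pool = descriptor_pool.Default()
try:
    fd = pool.FindFileByName("sentencepiece_model.proto")
    # A file with this name is already in the pool; a second AddSerializedFile
    # call with the same name would raise the duplicate-file TypeError.
    print("Already registered, package:", fd.package)
except KeyError:
    print("Not registered yet; the first *_pb2 import will add it.")
```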

But then again, as of today I have been learning and working with AI and Python for all of three months. So, I could be embarrassingly wrong.

I did not change much from the default notebook; the main difference is that I used my own Jinja2 chat template and applied it to the tokenizer (although it really is just a subset of your main Unsloth template and seems pretty standard for a ChatML template).
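
For context, attaching the template looks roughly like the sketch below; the template string shown is a generic ChatML example, not my exact template.

```python
# Rough sketch of attaching a ChatML-style Jinja2 chat template to the tokenizer.
# The template string is illustrative only, not the exact one from my notebook.
chatml_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)
tokenizer.chat_template = chatml_template

# Quick sanity check of the rendered prompt on a tiny conversation.
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
))
```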

It happens on this call, which attempts to save the model as a Q4_K_M-quantized GGUF file:
model.push_to_hub_gguf("b22000r/xyntrai-mistral-2.5-7b/", tokenizer, quantization_method = "q4_k_m", token = "mah'_hf_token")

Here is a link to the Colab Notebook:
https://colab.research.google.com/drive/1naZeXgRT7cEiOWR-C2RD2Uc9jSwCak_y?authuser=1#scrollTo=FqfebeAdT073

I will see if I can upload the notebook and attach it to this issue submission as well.

I am using a Windows 11 machine with 32 GB of RAM and no local GPU (only a baseline 128 MB of onboard graphics), but I use a rented Google T4 in Colab.

This is the Error that is produced on screen in Colab:

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.61 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
100%|██████████| 32/32 [02:54<00:00,  5.44s/it]
Unsloth: Saving tokenizer... Done.
Unsloth: Saving b22000/xyntrai-mistral-2.5-7b/pytorch_model.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at b22000/xyntrai-mistral-2.5-7b into f16 GGUF format.
The output location will be /content/b22000/xyntrai-mistral-2.5-7b/unsloth.F16.gguf
This might take 3 minutes...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipython-input-4103010664.py in <cell line: 0>()
      7 is_sentencepiece = isinstance(tokenizer, AutoTokenizer) and tokenizer.vocab_file is not None
      8 if True and not is_sentencepiece:
----> 9     model.push_to_hub_gguf("b22000/xyntrai-mistral-2.5-7b", tokenizer, quantization_method = "q4_k_m", token = "hf_...")
     10 elif True and is_sentencepiece:
     11     print("Skipping GGUF push due to sentencepiece error.")

4 frames
/usr/local/lib/python3.12/dist-packages/unsloth/save.py in unsloth_push_to_hub_gguf(self, repo_id, tokenizer, quantization_method, first_conversion, use_temp_dir, commit_message, private, token, max_shard_size, create_pr, safe_serialization, revision, commit_description, tags, temporary_location, maximum_memory_usage)
   2075 
   2076     # Save to GGUF
-> 2077     all_file_locations, want_full_precision = save_to_gguf(
   2078         model_type, model_dtype, is_sentencepiece_model,
   2079         new_save_directory, quantization_method, first_conversion, makefile,

/usr/local/lib/python3.12/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
   1185         vocab_type = "spm,hfft,bpe"
   1186         # Fix Sentencepiece model as well!
-> 1187         fix_sentencepiece_gguf(model_directory)
   1188     else:
   1189         vocab_type = "bpe"

/usr/local/lib/python3.12/dist-packages/unsloth/tokenizer_utils.py in fix_sentencepiece_gguf(saved_location)
    408     """
    409     from copy import deepcopy
--> 410     from transformers.utils import sentencepiece_model_pb2
    411     import json
    412     from enum import IntEnum

/usr/local/lib/python3.12/dist-packages/transformers/utils/sentencepiece_model_pb2.py in <module>
     26 
     27 
---> 28 DESCRIPTOR = _descriptor.FileDescriptor(
     29     name="sentencepiece_model.proto",
     30     package="sentencepiece",

/usr/local/lib/python3.12/dist-packages/google/protobuf/descriptor.py in __new__(cls, name, package, options, serialized_options, serialized_pb, dependencies, public_dependencies, syntax, edition, pool, create_key)
   1226       # pylint: disable=g-explicit-bool-comparison
   1227       if serialized_pb:
-> 1228         return _message.default_pool.AddSerializedFile(serialized_pb)
   1229       else:
   1230         return super(FileDescriptor, cls).__new__(cls)

TypeError: Couldn't build proto file into descriptor pool: duplicate file name sentencepiece_model.proto
