
Synthetic Data Generation Notebook broken #69

Description

@Satej

On a recent Colab T4 runtime, running the Synthetic Data Generation notebook, the following line throws the error below:

Code:

from unsloth.dataprep import SyntheticDataKit

generator = SyntheticDataKit.from_pretrained(
    # Choose any model from https://huggingface.co/unsloth
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 2048, # Longer sequence lengths will be slower!
)
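
For reference, the GPU's dtype support can be checked before loading (a quick sketch using standard PyTorch calls; on this T4 it reports compute capability 7.5, matching the error below):

import torch

# Tesla T4 reports compute capability (7, 5); bfloat16 requires >= 8.0
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")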

Error:

tokenizer_config.json: 54.7k/? [00:00<00:00, 6.33MB/s]
tokenizer.json: 100% 17.2M/17.2M [00:01<00:00, 12.6MB/s]
special_tokens_map.json: 100% 454/454 [00:00<00:00, 63.0kB/s]
chat_template.jinja: 3.83k/? [00:00<00:00, 419kB/s]

Unsloth: Using dtype = torch.float16 for vLLM.
Unsloth: vLLM loading unsloth/Llama-3.2-3B-Instruct with actual GPU utilization = 89.39%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 7.19 GB. Also swap space = 4 GB.
vLLM STDOUT: INFO 07-12 07:47:36 [__init__.py:239] Automatically detected platform cuda.
vLLM STDOUT: INFO 07-12 07:47:41 [api_server.py:1043] vLLM API server version 0.8.5.post1
vLLM STDOUT: INFO 07-12 07:47:41 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='unsloth/Llama-3.2-3B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='unsloth/Llama-3.2-3B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=2048, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.8938626454842437, swap_space=4.0, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=0, max_logprobs=0, disable_log_stats=True, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=2048, max_num_seqs=192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config={"level":3,"splitting_ops":[]}, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x785110c3bd80>)
vLLM STDOUT: INFO 07-12 07:47:55 [config.py:717] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
vLLM STDOUT: WARNING 07-12 07:47:55 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
vLLM STDOUT: INFO 07-12 07:47:55 [api_server.py:246] Started engine process with PID 2511
vLLM STDOUT: INFO 07-12 07:48:03 [__init__.py:239] Automatically detected platform cuda.
vLLM STDOUT: INFO 07-12 07:48:06 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='unsloth/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.2-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":192}, use_cached_outputs=True,
vLLM STDOUT: INFO 07-12 07:48:08 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
vLLM STDOUT: INFO 07-12 07:48:08 [cuda.py:289] Using XFormers backend.
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448] Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448] Traceback (most recent call last):
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     return cls(
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]            ^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 275, in __init__
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     self.model_executor = executor_class(vllm_config=vllm_config)
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     self._init_executor()
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     self.collective_rpc("init_device")
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     answer = run_method(self.driver_worker, method, args, kwargs)
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2456, in run_method
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     return func(*args, **kwargs)
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 604, in init_device
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     self.worker.init_device()  # type: ignore
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 177, in init_device
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     _check_if_gpu_supports_dtype(self.model_config.dtype)
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 546, in _check_if_gpu_supports_dtype
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448]     raise ValueError(
vLLM STDOUT: ERROR 07-12 07:48:08 [engine.py:448] ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
Stdout stream ended before readiness message detected.
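
The logs above show the mismatch: Unsloth prints "Using dtype = torch.float16 for vLLM", yet the engine config line from llm_engine.py shows dtype=torch.bfloat16, which a Turing GPU (compute capability 7.5) cannot run, and the vLLM error suggests forcing float16 via the dtype flag. As a sanity check only (this bypasses SyntheticDataKit entirely, so it does not fix the notebook flow), one could try loading the model through vLLM's Python API with the dtype pinned to half:

from vllm import LLM

# Standalone check, not the notebook's code path: pin dtype to float16 ("half"),
# equivalent to passing --dtype=half to the vLLM CLI as the error message suggests.
llm = LLM(
    model = "unsloth/Llama-3.2-3B-Instruct",
    dtype = "half",
    max_model_len = 2048,
)

If this loads on the T4, the problem is presumably that the float16 choice Unsloth reports is not being propagated to the vLLM server that SyntheticDataKit spawns.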
