Conversation

**codeflash-ai** bot commented Nov 22, 2025

📄 **104% (2.04x) speedup** for `SpeechT5FeatureExtractor.zero_mean_unit_var_norm` in `src/transformers/models/speecht5/feature_extraction_speecht5.py`

⏱️ Runtime: 37.3 milliseconds → 18.3 milliseconds (best of 78 runs)

📝 Explanation and details

The optimized code achieves a **104% speedup** (37.3 ms → 18.3 ms, ≈2.04×) by implementing two key optimizations in the `zero_mean_unit_var_norm` method:

**1. Vectorized Batch Processing for Same-Length Arrays (No Attention Mask)**
When no attention mask is provided and all input arrays have the same length, the optimization uses `np.stack()` to combine the arrays into a single 2D array, then performs vectorized mean and variance calculations along `axis=1`. This eliminates the expensive per-array Python loop and leverages NumPy's highly optimized C-level reductions.
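
A minimal sketch of this fast path (the helper name is illustrative, and the real method also falls back to a per-array loop for mixed lengths):

```python
import numpy as np

def normalize_same_length(input_values):
    # Sketch of the vectorized fast path: all 1-D arrays share one length,
    # so they can be stacked and normalized in a single pass.
    stacked = np.stack(input_values)                 # shape: (batch, length)
    mean = stacked.mean(axis=1, keepdims=True)       # per-row means, one call
    var = stacked.var(axis=1, keepdims=True)         # per-row variances, one call
    normed = (stacked - mean) / np.sqrt(var + 1e-7)  # broadcasted normalization
    return list(normed)                              # back to a list of 1-D arrays
```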

**2. Optimized Memory Allocation and Computation Order (With Attention Mask)**
For attention-mask cases, the optimization (see the sketch after this list):

- Pre-computes `attention_mask.sum(-1)` once instead of recalculating it on each iteration
- Uses `np.empty_like()` for faster memory allocation without zero-initialization
- Hoists the normalization factor (`1.0 / np.sqrt(var + 1e-7)`) into a single value so the square root is computed once per vector
- Uses more efficient slicing operations
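
A corresponding sketch of the masked path (illustrative helper name; it assumes all masks in a batch share one padded length so `sum(-1)` vectorizes):

```python
import numpy as np

def normalize_with_mask(input_values, attention_mask, padding_value=0.0):
    # Valid lengths for every vector, computed once up front.
    lengths = np.asarray(attention_mask).sum(-1)
    normed_input_values = []
    for vector, length in zip(input_values, lengths):
        normed = np.empty_like(vector)             # uninitialized buffer, no memset cost
        valid = vector[:length]
        scale = 1.0 / np.sqrt(valid.var() + 1e-7)  # one sqrt per vector
        normed[:length] = (valid - valid.mean()) * scale
        normed[length:] = padding_value            # fill the padding tail
        normed_input_values.append(normed)
    return normed_input_values
```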

**Performance Impact Analysis:**

- **Large batch scenarios** see dramatic improvements: `test_large_batch_no_attention_mask` shows a **482% speedup** (8.95 ms → 1.54 ms)
- **Same-length array batches** benefit most from vectorization: `test_large_batch_size` achieves an **1887% speedup** (12.6 ms → 633 μs)
- **Individual small arrays** show modest overhead (20-30% slower) due to the additional length-checking logic, but this is negligible in real workloads

**Why This Matters:**
Audio feature extraction typically processes batches of similar-length audio segments. For that common case, the vectorized approach replaces n per-array Python-level passes with a single batched NumPy operation, making it highly effective for speech processing pipelines where consistent audio chunk sizes are standard. The optimizations are particularly beneficial when processing multiple audio files or segments in batch inference scenarios.

The slight overhead for single small vectors is acceptable given the substantial gains for batch processing, which is the primary use case in production ML workflows.
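
For reference, a minimal usage sketch of the method as exercised by the tests below (the call pattern matches the tests; sizes are illustrative):

```python
import numpy as np
from transformers.models.speecht5.feature_extraction_speecht5 import SpeechT5FeatureExtractor

# Same-length batch, no mask: this hits the new vectorized fast path.
batch = [np.random.randn(16_000).astype(np.float32) for _ in range(8)]
normed = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(batch, None)

# Padded batch: only the unmasked prefix of each vector is normalized,
# and the tail is filled with padding_value.
masks = [np.array([1] * 12_000 + [0] * 4_000) for _ in range(8)]
normed_padded = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(batch, masks, padding_value=0.0)
```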

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
**🌀 Generated Regression Tests and Runtime**

```python
import numpy as np

# imports
from transformers.models.speecht5.feature_extraction_speecht5 import SpeechT5FeatureExtractor


# ----------------------- UNIT TESTS -----------------------

# --------- BASIC TEST CASES ---------


def test_single_vector_no_attention_mask():
    # Test normalization of a single vector without attention mask
    arr = np.array([1.0, 2.0, 3.0, 4.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 50.3μs -> 70.1μs (28.2% slower)
    normed = result[0]


def test_multiple_vectors_no_attention_mask():
    # Test normalization of multiple vectors without attention mask
    arr1 = np.array([0.0, 2.0, 4.0])
    arr2 = np.array([-1.0, 1.0, 3.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr1, arr2], None)
    result = codeflash_output  # 63.7μs -> 63.8μs (0.222% slower)
    for normed in result:
        pass


def test_single_vector_with_attention_mask_full():
    # Attention mask covers the whole vector (should be same as no mask)
    arr = np.array([0.0, 1.0, 2.0, 3.0])
    attn = np.array([1, 1, 1, 1])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn])
    result = codeflash_output  # 51.9μs -> 56.5μs (8.08% slower)
    normed = result[0]


def test_single_vector_with_attention_mask_partial():
    # Attention mask covers only part of the vector
    arr = np.array([1.0, 2.0, 3.0, 100.0])
    attn = np.array([1, 1, 0, 0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn])
    result = codeflash_output  # 53.8μs -> 55.4μs (2.83% slower)
    normed = result[0]


def test_multiple_vectors_with_attention_mask():
    # Test batch with different attention masks
    arr1 = np.array([5.0, 6.0, 7.0, 8.0])
    arr2 = np.array([10.0, 20.0, 30.0, 40.0])
    attn1 = np.array([1, 1, 1, 0])
    attn2 = np.array([1, 1, 0, 0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr1, arr2], [attn1, attn2])
    result = codeflash_output  # 72.1μs -> 74.7μs (3.54% slower)
    # arr1: first 3 normalized, last is padding
    normed1 = result[0]
    # arr2: first 2 normalized, last two are padding
    normed2 = result[1]


def test_padding_value_custom():
    # Test that custom padding_value is respected
    arr = np.array([1.0, 2.0, 3.0, 4.0])
    attn = np.array([1, 1, 0, 0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn], padding_value=-999.0)
    result = codeflash_output  # 51.8μs -> 50.7μs (2.00% faster)
    normed = result[0]


# --------- EDGE TEST CASES ---------


def test_empty_input_list():
    # Should return empty list if input is empty
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([], None)
    result = codeflash_output  # 784ns -> 1.92μs (59.1% slower)


def test_zero_length_vector():
    # Should handle zero-length vector
    arr = np.array([])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 72.2μs -> 92.9μs (22.3% slower)


def test_attention_mask_all_zeros():
    # All attention mask values are zero; normalization over zero-length
    arr = np.array([1.0, 2.0, 3.0])
    attn = np.array([0, 0, 0])
    # This will try to normalize over zero elements, which will result in nan in mean/var
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn])
    result = codeflash_output  # 74.1μs -> 76.7μs (3.39% slower)
    normed = result[0]


def test_vector_with_constant_values():
    # All values are the same; variance is zero, so denominator is sqrt(1e-7)
    arr = np.array([5.0, 5.0, 5.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 61.9μs -> 81.8μs (24.2% slower)
    normed = result[0]


def test_vector_with_nan_and_inf():
    # Contains nan and inf
    arr = np.array([1.0, np.nan, np.inf, -np.inf])
    # Should not crash, but result will be nan/inf
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 46.8μs -> 64.6μs (27.5% slower)
    normed = result[0]


def test_attention_mask_with_nonbinary_values():
    # Non-binary attention mask: should sum as length, but not standard usage
    arr = np.array([1.0, 2.0, 3.0, 4.0])
    attn = np.array([2, 0, 0, 0])  # sum is 2, so first two are normalized
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn])
    result = codeflash_output  # 58.3μs -> 60.7μs (3.92% slower)
    normed = result[0]


def test_vector_with_negative_values():
    # Negative values should normalize as usual
    arr = np.array([-10.0, 0.0, 10.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 44.4μs -> 64.5μs (31.2% slower)
    normed = result[0]


# --------- LARGE SCALE TEST CASES ---------


def test_large_vector_no_attention_mask():
    # Test with a large vector (under 100MB)
    arr = np.random.randn(500_000).astype(np.float32)  # ~2MB
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 784μs -> 963μs (18.6% slower)
    normed = result[0]


def test_large_batch_no_attention_mask():
    # Test with a large batch of medium vectors
    batch = [np.random.randn(1000).astype(np.float32) for _ in range(500)]
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(batch, None)
    result = codeflash_output  # 8.95ms -> 1.54ms (482% faster)
    for normed in result:
        pass


def test_large_vector_with_attention_mask():
    # Large vector with partial attention mask
    arr = np.random.randn(100_000).astype(np.float32)
    attn = np.array([1] * 50_000 + [0] * 50_000)
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn])
    result = codeflash_output  # 227μs -> 200μs (13.2% faster)
    normed = result[0]


def test_large_batch_with_attention_mask():
    # Large batch, each with partial attention mask
    batch_size = 200
    length = 2000
    batch = [np.random.randn(length).astype(np.float32) for _ in range(batch_size)]
    attn = [np.array([1] * 1000 + [0] * 1000) for _ in range(batch_size)]
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(batch, attn)
    result = codeflash_output  # 4.18ms -> 4.10ms (2.07% faster)
    for normed in result:
        pass


def test_large_batch_with_varied_attention_mask():
    # Large batch, each with different attention mask lengths
    batch_size = 100
    length = 1000
    batch = [np.random.randn(length).astype(np.float32) for _ in range(batch_size)]
    attn = []
    for i in range(batch_size):
        mask_len = np.random.randint(1, length + 1)
        mask = np.array([1] * mask_len + [0] * (length - mask_len))
        attn.append(mask)
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(batch, attn)
    result = codeflash_output  # 1.93ms -> 1.93ms (0.028% slower)
    for i, normed in enumerate(result):
        mask_len = attn[i].sum()
        if mask_len > 1:
            pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np

# imports
from transformers.models.speecht5.feature_extraction_speecht5 import SpeechT5FeatureExtractor


# -------------------------------
# Unit tests for zero_mean_unit_var_norm
# -------------------------------

# 1. Basic Test Cases


def test_basic_single_vector_no_attention_mask():
    # Single vector, no attention mask
    arr = np.array([1.0, 2.0, 3.0, 4.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 49.8μs -> 71.8μs (30.6% slower)
    out = result[0]


def test_basic_batch_vectors_no_attention_mask():
    # Batch of vectors, no attention mask
    arr1 = np.array([1.0, 2.0, 3.0])
    arr2 = np.array([10.0, 20.0, 30.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr1, arr2], None)
    result = codeflash_output  # 59.2μs -> 64.0μs (7.53% slower)
    out1, out2 = result


def test_basic_single_vector_with_attention_mask():
    # Single vector, with attention mask (masking last element)
    arr = np.array([5.0, 5.0, 5.0, 0.0])
    attn = np.array([1, 1, 1, 0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn], padding_value=-99.0)
    result = codeflash_output  # 54.1μs -> 57.8μs (6.44% slower)
    out = result[0]


def test_basic_batch_vectors_with_attention_mask():
    # Batch of vectors with attention masks
    arr1 = np.array([1.0, 2.0, 3.0, 0.0])
    arr2 = np.array([10.0, 20.0, 30.0, 40.0])
    attn1 = np.array([1, 1, 1, 0])
    attn2 = np.array([1, 1, 1, 1])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(
        [arr1, arr2], [attn1, attn2], padding_value=-1.0
    )
    result = codeflash_output  # 71.7μs -> 78.6μs (8.81% slower)
    out1, out2 = result


def test_basic_padding_value_default():
    # Default padding_value is 0.0
    arr = np.array([1.0, 2.0, 3.0, 0.0])
    attn = np.array([1, 1, 1, 0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn])
    result = codeflash_output  # 49.2μs -> 50.7μs (2.94% slower)
    out = result[0]


# 2. Edge Test Cases


def test_edge_empty_input_values():
    # Empty input_values list
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([], None)
    result = codeflash_output  # 805ns -> 2.13μs (62.2% slower)


def test_edge_single_element_vector():
    # Single element vector (variance should be zero, so denominator is sqrt(1e-7))
    arr = np.array([42.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 62.8μs -> 79.8μs (21.3% slower)
    out = result[0]


def test_edge_attention_mask_all_zeros():
    # Attention mask all zeros (should pad everything)
    arr = np.array([1.0, 2.0, 3.0])
    attn = np.array([0, 0, 0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn], padding_value=123.0)
    result = codeflash_output  # 81.6μs -> 81.0μs (0.670% faster)
    out = result[0]


def test_edge_attention_mask_partial():
    # Attention mask with some zeros in the middle
    arr = np.array([10.0, 20.0, 30.0, 40.0])
    attn = np.array([1, 0, 1, 1])
    # sum(attn) == 3, so only first 3 elements are normalized, last is padding
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn], padding_value=-77.0)
    result = codeflash_output  # 57.1μs -> 58.4μs (2.21% slower)
    out = result[0]


def test_edge_vector_with_nans():
    # Vector containing NaN values
    arr = np.array([1.0, np.nan, 3.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 62.0μs -> 82.1μs (24.5% slower)
    out = result[0]


def test_edge_vector_with_infs():
    # Vector containing Inf values
    arr = np.array([1.0, np.inf, 3.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 60.4μs -> 77.9μs (22.5% slower)
    out = result[0]


def test_edge_vector_all_same_value():
    # All values the same, variance is zero
    arr = np.array([7.0, 7.0, 7.0, 7.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 44.4μs -> 63.2μs (29.7% slower)
    out = result[0]


def test_edge_vector_length_zero():
    # Zero-length vector
    arr = np.array([])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 63.8μs -> 82.3μs (22.5% slower)
    out = result[0]


def test_edge_attention_mask_none_and_non_none():
    # attention_mask is None, should ignore it
    arr = np.array([1.0, 2.0, 3.0])
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 46.5μs -> 62.3μs (25.3% slower)
    out = result[0]


# 3. Large Scale Test Cases


def test_large_batch_size():
    # Large batch of vectors (up to 1000)
    batch_size = 1000
    arrs = [np.random.randn(10) for _ in range(batch_size)]
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(arrs, None)
    result = codeflash_output  # 12.6ms -> 633μs (1887% faster)
    for out in result:
        pass


def test_large_vector_length():
    # Large vector length (up to 100000 floats, ~800KB, well below 100MB)
    arr = np.random.randn(100000)
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 273μs -> 310μs (12.0% slower)
    out = result[0]


def test_large_batch_with_attention_mask():
    # Large batch with attention masks
    batch_size = 500
    arrs = [np.random.randn(50) for _ in range(batch_size)]
    attns = [np.array([1] * 40 + [0] * 10) for _ in range(batch_size)]
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm(arrs, attns, padding_value=-42.0)
    result = codeflash_output  # 6.76ms -> 6.73ms (0.515% faster)
    for out in result:
        pass


def test_large_vector_with_partial_attention_mask():
    # Large vector with partial attention mask
    arr = np.random.randn(999)
    attn = np.array([1] * 900 + [0] * 99)
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn], padding_value=999.0)
    result = codeflash_output  # 63.1μs -> 63.4μs (0.511% slower)
    out = result[0]


def test_large_vector_all_padding():
    # Large vector, attention mask all zeros
    arr = np.random.randn(1000)
    attn = np.zeros(1000, dtype=int)
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], [attn], padding_value=123456.0)
    result = codeflash_output  # 80.1μs -> 77.7μs (3.10% faster)
    out = result[0]


def test_large_vector_all_ones():
    # Large vector, all values the same
    arr = np.ones(1000)
    codeflash_output = SpeechT5FeatureExtractor.zero_mean_unit_var_norm([arr], None)
    result = codeflash_output  # 45.1μs -> 71.3μs (36.7% slower)
    out = result[0]


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, run `git checkout codeflash/optimize-SpeechT5FeatureExtractor.zero_mean_unit_var_norm-mia34bri` and push.


**codeflash-ai** bot requested a review from mashraf-222 on Nov 22, 2025 at 09:26
**codeflash-ai** bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Nov 22, 2025