@codeflash-ai codeflash-ai bot commented Nov 22, 2025

📄 114% (1.14x) speedup for EnglishNormalizer.collapse_whitespace in src/transformers/models/clvp/number_normalizer.py

⏱️ Runtime : 1.90 milliseconds → 886 microseconds (best of 206 runs)

📝 Explanation and details

The optimization achieves a 114% speedup by eliminating redundant regex compilation in the collapse_whitespace method.

Key Changes:

  1. Precompiled regex pattern: The whitespace regex r"\s+" is now compiled once during __init__ and stored as self._whitespace_re, instead of being recompiled on every method call.
  2. Removed redundant import: The import regex as re line was removed since only the standard re module is used.
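A minimal sketch of the change described above. The class names are hypothetical and the standard `re` module is assumed; this illustrates the before/after shape and is not the actual transformers source.

```python
import re


class PerCallNormalizer:
    """Hypothetical 'before' shape: the pattern passes through re.compile on every call."""

    def collapse_whitespace(self, text: str) -> str:
        return re.sub(re.compile(r"\s+"), " ", text)


class PrecompiledNormalizer:
    """Hypothetical 'after' shape: the pattern is compiled once, when the instance is built."""

    def __init__(self) -> None:
        # Compiled a single time and reused by every later call on this instance.
        self._whitespace_re = re.compile(r"\s+")

    def collapse_whitespace(self, text: str) -> str:
        return self._whitespace_re.sub(" ", text)
```

Both classes return the same string for the same input; only the per-call work differs.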

Why This Works:

  • Regex compilation overhead: In the original code, re.compile(r"\s+") was called every time collapse_whitespace was invoked, which is expensive (65,318ns per hit vs 18,744ns per hit in the optimized version).
  • Memory efficiency: Precompiling eliminates repeated pattern parsing and compilation, reducing both CPU cycles and memory allocations.
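One way to observe the per-call overhead locally is a small `timeit` comparison. This is a sketch under assumptions: it uses the standard `re` module, which keeps an internal cache of compiled patterns, so the per-call path here mostly pays for the lookup and extra function calls. The function names are hypothetical, and absolute numbers will differ by machine and by regex library.

```python
import re
import timeit

PATTERN = r"\s+"
SAMPLE = "hello    world"
RUNS = 100_000

# Compiled once, outside the timed functions.
precompiled = re.compile(PATTERN)


def per_call() -> str:
    # Passes the pattern through re.compile on every invocation.
    return re.sub(re.compile(PATTERN), " ", SAMPLE)


def reuse() -> str:
    # Reuses the pattern object compiled above.
    return precompiled.sub(" ", SAMPLE)


per_call_s = timeit.timeit(per_call, number=RUNS)
reuse_s = timeit.timeit(reuse, number=RUNS)

print(f"per-call compile: {per_call_s / RUNS * 1e6:.2f} µs per call")
print(f"precompiled:      {reuse_s / RUNS * 1e6:.2f} µs per call")
```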

Performance Impact:
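Per-test gains depend on input size, because the fixed per-call overhead that was removed shrinks relative to the matching work as inputs grow:

  • Empty strings: about 1600% faster (for example, 18.2μs -> 1.05μs)
  • Small inputs: roughly 300-630% faster
  • Large inputs that form one long run (all whitespace or no whitespace): roughly 140-320% faster
  • Large inputs with many separate whitespace runs: roughly 10-40% faster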
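The percentages in this report appear to follow the relation (old / new - 1) × 100, which is consistent with the reported runtimes. The sketch below checks that arithmetic against the headline figures; the helper names are hypothetical.

```python
def percent_faster(old_runtime: float, new_runtime: float) -> float:
    """Relative gain in percent: (old - new) / new * 100."""
    return (old_runtime / new_runtime - 1.0) * 100.0


def times_as_fast(old_runtime: float, new_runtime: float) -> float:
    """Plain runtime ratio: how many times as fast the new code is."""
    return old_runtime / new_runtime


# Headline figures from the report, both in microseconds.
old_us, new_us = 1900.0, 886.0
print(f"{percent_faster(old_us, new_us):.0f}% faster, "
      f"{times_as_fast(old_us, new_us):.2f}x as fast")
# prints "114% faster, 2.14x as fast"
```

Read this way, "114% faster" and "2.14x as fast" describe the same measurement.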

Context Benefits:
Since EnglishNormalizer.__call__ invokes collapse_whitespace as part of a text processing pipeline, every normalization call avoids the per-call compile step. The benefit is largest for batch workloads that reuse one normalizer instance, because the pattern is then compiled only once, in __init__.
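A short sketch of the batch scenario, assuming a hypothetical stand-in class rather than the real EnglishNormalizer. The point is only that the instance, and therefore the compiled pattern, is created once and shared across documents.

```python
import re
from typing import Iterable, List


class WhitespaceNormalizer:
    """Hypothetical stand-in that precompiles its whitespace pattern."""

    def __init__(self) -> None:
        self._whitespace_re = re.compile(r"\s+")

    def collapse_whitespace(self, text: str) -> str:
        return self._whitespace_re.sub(" ", text)


def normalize_batch(documents: Iterable[str]) -> List[str]:
    # Build the normalizer once, outside the loop, so compilation is not repeated.
    normalizer = WhitespaceNormalizer()
    return [normalizer.collapse_whitespace(doc) for doc in documents]
```

Constructing a new instance per document would move the compile cost back into the loop, which is the pattern this change is meant to avoid.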

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 114 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
# imports
import pytest

from transformers.models.clvp.number_normalizer import EnglishNormalizer


# unit tests


@pytest.fixture
def normalizer():
    """Fixture to provide an instance of EnglishNormalizer."""
    return EnglishNormalizer()


# 1. BASIC TEST CASES


def test_single_space_remains_unchanged(normalizer):
    """A string with no extra whitespace should remain unchanged."""
    codeflash_output = normalizer.collapse_whitespace("hello world")  # 22.0μs -> 4.88μs (352% faster)


def test_multiple_spaces_between_words(normalizer):
    """Multiple consecutive spaces between words should be collapsed to one."""
    codeflash_output = normalizer.collapse_whitespace("hello    world")  # 22.4μs -> 4.30μs (421% faster)


def test_tabs_and_spaces(normalizer):
    """Tabs and spaces mixed should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("hello\t   world")  # 22.1μs -> 4.40μs (402% faster)


def test_newlines_and_spaces(normalizer):
    """Newlines and spaces should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("hello\n   world")  # 21.2μs -> 4.23μs (403% faster)


def test_leading_and_trailing_spaces(normalizer):
    """Leading and trailing spaces should be collapsed to a single space (not stripped)."""
    # Note: collapse_whitespace does NOT strip leading/trailing whitespace, only collapses
    codeflash_output = normalizer.collapse_whitespace("   hello world   ")  # 22.3μs -> 4.97μs (349% faster)


def test_mixed_whitespace_types(normalizer):
    """Mixed tabs, newlines, and spaces should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("a\t\n  b \r\n\tc")  # 22.4μs -> 4.55μs (393% faster)


def test_no_whitespace(normalizer):
    """A string with no whitespace should remain unchanged."""
    codeflash_output = normalizer.collapse_whitespace("abc")  # 19.8μs -> 2.73μs (626% faster)


def test_only_spaces(normalizer):
    """A string of only spaces should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("     ")  # 21.3μs -> 4.02μs (429% faster)


def test_only_tabs(normalizer):
    """A string of only tabs should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("\t\t\t")  # 21.3μs -> 3.58μs (493% faster)


def test_only_newlines(normalizer):
    """A string of only newlines should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("\n\n\n")  # 21.2μs -> 3.68μs (477% faster)


def test_spaces_tabs_newlines_only(normalizer):
    """A string with a mix of whitespace characters only should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace(" \t\n\r  \n\t")  # 21.2μs -> 3.71μs (472% faster)


# 2. EDGE TEST CASES


def test_empty_string(normalizer):
    """An empty string should remain unchanged."""
    codeflash_output = normalizer.collapse_whitespace("")  # 18.2μs -> 1.05μs (1634% faster)


def test_string_of_length_one_space(normalizer):
    """A single space should remain a single space."""
    codeflash_output = normalizer.collapse_whitespace(" ")  # 22.1μs -> 4.52μs (390% faster)


def test_string_of_length_one_nonspace(normalizer):
    """A single non-whitespace character should remain unchanged."""
    codeflash_output = normalizer.collapse_whitespace("a")  # 21.2μs -> 2.92μs (625% faster)


def test_unicode_whitespace(normalizer):
    """Unicode whitespace (e.g., non-breaking space, em space) should be collapsed."""
    # \u00A0 = non-breaking space, \u2003 = em space
    s = "foo\u00a0\u2003bar"
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 23.6μs -> 5.46μs (332% faster)


def test_whitespace_between_punctuation(normalizer):
    """Whitespace between punctuation should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("hello ,   world !")  # 22.5μs -> 5.20μs (332% faster)


def test_multiple_whitespace_at_edges(normalizer):
    """Multiple whitespace at both ends should be collapsed but not stripped."""
    codeflash_output = normalizer.collapse_whitespace("   a b   ")  # 22.3μs -> 4.71μs (373% faster)


def test_whitespace_inside_word(normalizer):
    """Whitespace inside a word should be collapsed (though this is a strange input)."""
    codeflash_output = normalizer.collapse_whitespace("hel   lo")  # 22.4μs -> 4.25μs (426% faster)


def test_long_run_of_whitespace(normalizer):
    """A long run of whitespace should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("a" + " " * 100 + "b")  # 21.6μs -> 4.30μs (402% faster)


def test_string_with_carriage_returns(normalizer):
    """Carriage returns should be treated as whitespace and collapsed."""
    codeflash_output = normalizer.collapse_whitespace("foo\rbar\r\nbaz")  # 22.3μs -> 4.89μs (356% faster)


def test_string_with_formfeed_and_vertical_tab(normalizer):
    """Formfeed (\f) and vertical tab (\v) are whitespace and should be collapsed."""
    codeflash_output = normalizer.collapse_whitespace("a\f\v b")  # 21.6μs -> 4.38μs (394% faster)


def test_string_with_mixed_unicode_and_ascii_whitespace(normalizer):
    """Mix of unicode and ASCII whitespace should be collapsed."""
    s = "a\u2002\t b\u2005\nc"
    codeflash_output = normalizer.collapse_whitespace(s)  # 22.2μs -> 4.79μs (364% faster)


def test_whitespace_between_numbers(normalizer):
    """Numbers separated by whitespace should be collapsed to a single space."""
    codeflash_output = normalizer.collapse_whitespace("123    456")  # 22.1μs -> 4.44μs (399% faster)


def test_whitespace_between_emojis(normalizer):
    """Whitespace between emojis should be collapsed."""
    codeflash_output = normalizer.collapse_whitespace("😀   😃")  # 23.0μs -> 5.03μs (357% faster)


# 3. LARGE SCALE TEST CASES


def test_large_string_all_spaces(normalizer):
    """A very large string of only spaces should be collapsed to a single space."""
    s = " " * 1000
    codeflash_output = normalizer.collapse_whitespace(s)  # 24.0μs -> 6.21μs (286% faster)


def test_large_string_with_words_and_whitespace(normalizer):
    """A large string with many words and random whitespace should be collapsed correctly."""
    # Create a string of 500 words separated by random whitespace
    import random

    words = [f"word{i}" for i in range(500)]
    whitespace_types = [" ", "\t", "\n", "\r", "\v", "\f", "\u00a0", "\u2003"]
    s = ""
    for w in words:
        s += w + random.choice(whitespace_types) * random.randint(1, 3)
    # Every word is followed by at least one whitespace character, so the
    # collapsed result ends with a single trailing space.
    expected = " ".join(words) + " "
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 134μs -> 113μs (19.3% faster)
    assert result == expected


def test_large_string_with_no_whitespace(normalizer):
    """A large string with no whitespace should remain unchanged."""
    s = "a" * 1000
    codeflash_output = normalizer.collapse_whitespace(s)  # 24.3μs -> 5.74μs (323% faster)


def test_large_string_with_leading_and_trailing_whitespace(normalizer):
    """A large string with leading and trailing whitespace should collapse but not strip."""
    s = " " * 100 + "abc" + " " * 100
    codeflash_output = normalizer.collapse_whitespace(s)  # 23.4μs -> 5.79μs (305% faster)


def test_large_string_with_alternating_whitespace(normalizer):
    """A large string with alternating word and whitespace patterns."""
    s = ""
    for i in range(500):
        s += "x" + (" " * (i % 5 + 1))
    # Should collapse all runs of whitespace to a single space
    expected = " ".join(["x"] * 500) + " "
    codeflash_output = normalizer.collapse_whitespace(s)  # 108μs -> 89.8μs (20.7% faster)


# Additional edge case: ensure function does not throw on very large input
def test_very_large_input_no_crash(normalizer):
    """Ensure function does not crash or hang on very large input."""
    s = ("a " * 1000).replace(" ", "    ")
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 185μs -> 167μs (10.8% faster)
    # Should be 1000 'a's, each followed by a single space (including a trailing one)
    expected = "a " * 1000
    assert result == expected


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
import pytest  # used for our unit tests

from transformers.models.clvp.number_normalizer import EnglishNormalizer


# unit tests


@pytest.fixture
def normalizer():
    # Fixture to provide a fresh EnglishNormalizer instance for each test
    return EnglishNormalizer()


# 1. Basic Test Cases


def test_single_space_between_words(normalizer):
    # Basic: Should not change single spaces
    codeflash_output = normalizer.collapse_whitespace("hello world")  # 23.6μs -> 5.00μs (371% faster)


def test_multiple_spaces_between_words(normalizer):
    # Basic: Should collapse multiple spaces to one
    codeflash_output = normalizer.collapse_whitespace("hello    world")  # 21.9μs -> 4.44μs (392% faster)


def test_tabs_between_words(normalizer):
    # Basic: Tabs should be collapsed to a single space
    codeflash_output = normalizer.collapse_whitespace("hello\tworld")  # 21.8μs -> 4.34μs (402% faster)


def test_newlines_between_words(normalizer):
    # Basic: Newlines should be collapsed to a single space
    codeflash_output = normalizer.collapse_whitespace("hello\nworld")  # 21.7μs -> 4.15μs (423% faster)


def test_mixed_whitespace_between_words(normalizer):
    # Basic: Mixed whitespace should be collapsed to a single space
    codeflash_output = normalizer.collapse_whitespace("hello \t\n  world")  # 22.3μs -> 4.21μs (428% faster)


def test_leading_and_trailing_spaces(normalizer):
    # Basic: Leading and trailing spaces are not stripped, only collapsed
    codeflash_output = normalizer.collapse_whitespace("   hello world   ")  # 22.2μs -> 4.84μs (359% faster)


def test_string_with_no_whitespace(normalizer):
    # Basic: No whitespace should remain unchanged
    codeflash_output = normalizer.collapse_whitespace("helloworld")  # 20.7μs -> 2.90μs (613% faster)


# 2. Edge Test Cases


def test_empty_string(normalizer):
    # Edge: Empty string should remain unchanged
    codeflash_output = normalizer.collapse_whitespace("")  # 18.6μs -> 1.09μs (1599% faster)


def test_string_only_spaces(normalizer):
    # Edge: String with only spaces should collapse to a single space
    codeflash_output = normalizer.collapse_whitespace("     ")  # 22.4μs -> 4.52μs (395% faster)


def test_string_only_tabs(normalizer):
    # Edge: String with only tabs should collapse to a single space
    codeflash_output = normalizer.collapse_whitespace("\t\t\t")  # 21.3μs -> 3.66μs (483% faster)


def test_string_only_newlines(normalizer):
    # Edge: String with only newlines should collapse to a single space
    codeflash_output = normalizer.collapse_whitespace("\n\n\n")  # 20.8μs -> 3.57μs (482% faster)


def test_string_only_mixed_whitespace(normalizer):
    # Edge: String with only mixed whitespace should collapse to a single space
    codeflash_output = normalizer.collapse_whitespace(" \t \n\t\n  ")  # 20.7μs -> 3.56μs (481% faster)


def test_whitespace_between_punctuation(normalizer):
    # Edge: Whitespace between punctuation should be collapsed
    codeflash_output = normalizer.collapse_whitespace("hello ,   world !")  # 22.2μs -> 5.10μs (336% faster)


def test_unicode_whitespace(normalizer):
    # Edge: Unicode whitespace characters should be collapsed
    # U+2003 EM SPACE, U+2009 THIN SPACE, U+202F NARROW NO-BREAK SPACE
    s = "hello\u2003world\u2009test\u202fagain"
    # All unicode whitespace should be collapsed to a single space
    codeflash_output = normalizer.collapse_whitespace(s)  # 22.8μs -> 5.22μs (337% faster)


def test_whitespace_at_string_boundaries(normalizer):
    # Edge: Leading and trailing whitespace should be collapsed, not stripped
    codeflash_output = normalizer.collapse_whitespace("   hello   ")  # 22.1μs -> 4.43μs (399% faster)


def test_multiple_whitespace_groups(normalizer):
    # Edge: Multiple groups of whitespace should each collapse to a single space
    codeflash_output = normalizer.collapse_whitespace("a  b   c    d")  # 21.7μs -> 4.78μs (355% faster)


def test_whitespace_between_numbers(normalizer):
    # Edge: Numbers separated by whitespace should be collapsed
    codeflash_output = normalizer.collapse_whitespace("1   2\t3\n4")  # 22.8μs -> 4.54μs (402% faster)


def test_whitespace_between_emojis(normalizer):
    # Edge: Emojis separated by whitespace should be collapsed
    codeflash_output = normalizer.collapse_whitespace("😀  😃   😄")  # 22.0μs -> 5.54μs (297% faster)


def test_whitespace_between_non_ascii(normalizer):
    # Edge: Non-ASCII characters with whitespace should be collapsed
    codeflash_output = normalizer.collapse_whitespace("你好   世界")  # 22.2μs -> 5.02μs (342% faster)


def test_whitespace_in_long_repeated_pattern(normalizer):
    # Edge: Long repeated whitespace patterns should be collapsed
    codeflash_output = normalizer.collapse_whitespace("a" + " " * 50 + "b")  # 22.0μs -> 4.36μs (405% faster)


def test_whitespace_between_empty_strings(normalizer):
    # Edge: Empty string between whitespace should collapse to a single space
    codeflash_output = normalizer.collapse_whitespace(" \t \n ")  # 21.2μs -> 3.57μs (493% faster)


# 3. Large Scale Test Cases


def test_large_string_with_lots_of_whitespace(normalizer):
    # Large: Collapse whitespace in a large string
    s = ("word" + " " * 10) * 200  # 200 words separated by 10 spaces
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 67.4μs -> 48.6μs (38.5% faster)


def test_large_string_with_mixed_whitespace(normalizer):
    # Large: Collapse whitespace in a large string with mixed whitespace
    s = ("word" + "\t\n  ") * 300  # 300 words, each followed by mixed whitespace
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 81.7μs -> 62.2μs (31.3% faster)


def test_large_string_with_unicode_whitespace(normalizer):
    # Large: Collapse whitespace in a large string with unicode whitespace
    s = ("word" + "\u2003\u2009\u202f") * 400  # 400 words, each followed by unicode whitespace
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 99.6μs -> 81.5μs (22.2% faster)


def test_large_string_with_leading_and_trailing_whitespace(normalizer):
    # Large: Collapse whitespace in a large string with leading/trailing whitespace
    s = " " * 100 + ("word " * 500) + " " * 100
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 108μs -> 92.0μs (18.4% faster)


def test_large_string_all_whitespace(normalizer):
    # Large: Collapse whitespace in a large string of only whitespace
    s = " \t\n" * 1000
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 28.8μs -> 10.8μs (166% faster)


def test_large_string_no_whitespace(normalizer):
    # Large: Large string with no whitespace should remain unchanged
    s = "word" * 1000
    codeflash_output = normalizer.collapse_whitespace(s)
    result = codeflash_output  # 30.6μs -> 12.9μs (138% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
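
The comparison described in the comment above can be sketched as a direct equivalence check between the two implementations. This is an illustration under assumptions: the function names and sample list are hypothetical, the standard `re` module is used, and it is not how Codeflash itself performs the comparison.

```python
import re


def original_collapse(text: str) -> str:
    # Hypothetical 'original' shape: the pattern goes through re.compile per call.
    return re.sub(re.compile(r"\s+"), " ", text)


_WHITESPACE_RE = re.compile(r"\s+")


def optimized_collapse(text: str) -> str:
    # Hypothetical 'optimized' shape: reuses the precompiled pattern.
    return _WHITESPACE_RE.sub(" ", text)


SAMPLES = [
    "",
    "abc",
    "hello    world",
    "   a b   ",
    "a\t\n  b \r\n\tc",
    "foo\u00a0\u2003bar",
]


def outputs_match(samples) -> bool:
    """True when both implementations agree on every sample."""
    return all(original_collapse(s) == optimized_collapse(s) for s in samples)
```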

To edit these changes git checkout codeflash/optimize-EnglishNormalizer.collapse_whitespace-mia8jzgf and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 22, 2025 11:58
@codeflash-ai codeflash-ai bot added labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) Nov 22, 2025