Word chunk overlap and duplicate timestamps at 30s boundaries #41

@saveli

Issue Description

CrisperWhisper generates overlapping and duplicate word chunks at 30-second boundaries because
the chunking mechanism does not properly merge overlapping segments. The result is duplicated
content and corrupted timestamps.

Example of Duplicate Chunks

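When sorted by start time, the chunks show clear duplication. Some pairs are exact copies;
others differ only by a few hundredths of a second in the end time: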

  193 = array(['we,', (68.0, 68.1)], dtype=object)
  194 = array(['we,', (68.0, 68.14)], dtype=object)  # Duplicate
  195 = array(['we', (68.24, 68.26)], dtype=object)
  196 = array(['we', (68.24, 68.28)], dtype=object)  # Duplicate
  197 = array(['love', (68.36, 68.48)], dtype=object)
  198 = array(['love', (68.36, 68.48)], dtype=object)  # Duplicate

Invalid Timestamps

Some chunks have invalid timestamps where start > end (values in milliseconds):

  (69640, 61500, 'So what and relax.')  # start=69.64s > end=61.5s

Overlap Detection

Debug logging shows overlaps occurring at chunk boundaries:

  OVERLAP DETECTED at RAW MODEL OUTPUT: 1 overlapping chunks
    Chunk 182: 'what' ends at 69.820
    Chunk 183: 'and' starts at 60.780
    Overlap duration: 9.040s
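
For reference, a check of this kind can be sketched as follows. This is a minimal illustration, not the exact debug code that produced the log above; the `(text, start, end)` tuple layout and the name `find_overlaps` are assumptions, and the start time of 'what' and the end time of 'and' are made-up placeholders (only 69.82 and 60.78 come from the log).

```python
from typing import List, Tuple

# Assumed layout for illustration: (text, start_seconds, end_seconds)
Word = Tuple[str, float, float]


def find_overlaps(words: List[Word]) -> List[Tuple[int, float]]:
    """Return (index, overlap_seconds) for every word that starts
    before the preceding word in the list has ended."""
    overlaps = []
    for i in range(1, len(words)):
        prev_end = words[i - 1][2]
        start = words[i][1]
        if start < prev_end:
            overlaps.append((i, round(prev_end - start, 3)))
    return overlaps


# 69.82 and 60.78 are taken from the log; 69.70 and 60.90 are placeholders.
raw_output = [("what", 69.70, 69.82), ("and", 60.78, 60.90)]
for index, seconds in find_overlaps(raw_output):
    print(f"word {index} overlaps the previous word by {seconds:.3f}s")
```

Running the check on the raw model output, before any sorting, is what exposes the backwards jump in time at the chunk boundary.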

Configuration

  • Model: nyrahealth/CrisperWhisper
  • Chunk length: 30s (model's training chunk size)
  • Languages: Both English and German affected
  • Return timestamps: word-level
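
A setup along these lines should reproduce the configuration above. It is a sketch under assumptions: it loads the model through the generic `transformers` `pipeline` factory, which may differ from the loading code actually used for this report, and `transcribe` is a hypothetical helper name.

```python
# Settings taken from the configuration list above.
PIPELINE_KWARGS = {
    "task": "automatic-speech-recognition",
    "model": "nyrahealth/CrisperWhisper",
    "chunk_length_s": 30,         # 30s chunks, as in the report
    "return_timestamps": "word",  # word-level timestamps
}


def transcribe(audio_path: str):
    """Return the word chunks for one audio file.

    transformers is imported lazily so the settings above can be
    inspected without the library or the model being available.
    """
    from transformers import pipeline

    asr = pipeline(**PIPELINE_KWARGS)
    result = asr(audio_path)
    # Each chunk is a dict with "text" and a "timestamp" (start, end) pair.
    return result["chunks"]
```

Any audio file longer than 30s should exercise the chunking path.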

Root Cause

The issue appears to stem from the HuggingFace pipeline's chunking mechanism creating
overlapping 30s segments to avoid cutting mid-word, but the timestamp merging logic fails to
properly deduplicate chunks that appear in multiple segments.
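
To make the overlap concrete, here is a simplified model of how the windows would be laid out. It assumes the documented pipeline default of a stride of `chunk_length_s / 6` on each side of a chunk; the real pipeline works on sample indices rather than seconds, and `chunk_windows` and `windows_covering` are illustrative names, not library functions.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_s=None):
    """List the (start, end) windows, in seconds, covering total_s of audio.

    Assumption: consecutive windows advance by
    chunk_length_s - 2 * stride_s, so neighbours share 2 * stride_s.
    """
    if stride_s is None:
        stride_s = chunk_length_s / 6  # assumed default stride per side
    step = chunk_length_s - 2 * stride_s
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


def windows_covering(t, windows):
    """Return every window that contains time t."""
    return [w for w in windows if w[0] <= t <= w[1]]


windows = chunk_windows(90.0)
print(windows)
print(windows_covering(68.0, windows))
```

Under these assumptions, a word at 68.0s falls inside both the 40-70s and the 60-90s window, which is consistent with the duplicates and the overlap reported between roughly 60s and 70s. If the merge step does not drop one of the two copies, both end up in the output.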

Impact

  • Duplicate word chunks in output
  • Invalid timestamps (some chunks have start > end)
  • Corrupted transcript text when chunks are concatenated
  • Issues with downstream processing that relies on clean timestamps

Attempted Workaround

We've implemented post-processing deduplication that removes chunks whose text is identical
and whose timestamps lie within 0.12s of each other. However:

  • This handles most duplicates and overlaps, but not all cases
  • We cannot fix invalid timestamps where start > end, as there's no way to determine the
    correct timing
  • This workaround shouldn't be necessary if the chunking worked correctly
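
The deduplication pass is roughly of the following shape. This is a simplified sketch rather than our exact code; the `(text, start, end)` tuple layout and the name `dedupe_words` are assumptions for illustration, and only the 0.12s threshold comes from the description above.

```python
def dedupe_words(words, tol=0.12):
    """Drop a word when the previously kept word has the same text and
    both its start and end lie within `tol` seconds.

    `words` is an iterable of (text, start_seconds, end_seconds) tuples.
    """
    kept = []
    for text, start, end in sorted(words, key=lambda w: (w[1], w[2])):
        if kept:
            prev_text, prev_start, prev_end = kept[-1]
            same_text = text == prev_text
            close = abs(start - prev_start) <= tol and abs(end - prev_end) <= tol
            if same_text and close:
                continue  # near-duplicate of the word just kept
        kept.append((text, start, end))
    return kept


# The six chunks from the example near the top of this issue.
chunks = [
    ("we,", 68.0, 68.1), ("we,", 68.0, 68.14),
    ("we", 68.24, 68.26), ("we", 68.24, 68.28),
    ("love", 68.36, 68.48), ("love", 68.36, 68.48),
]
print(dedupe_words(chunks))
```

Because each word is compared only with the one kept immediately before it, duplicates that are separated by a different word after sorting slip through, and a chunk with start > end is passed along unchanged.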

Expected Behavior

Word chunks should have non-overlapping timestamps and no duplicates, even when processing
long audio files that require chunking. All timestamps should be valid (start ≤ end).
