Issue Description
CrisperWhisper generates overlapping and duplicate word chunks at 30-second boundaries due to
the chunking mechanism not properly merging overlapping segments. This results in duplicate
content and corrupted timestamps.
Example of Duplicate Chunks
When sorted by start time, chunks show clear duplication:
193 = array(['we,', (68.0, 68.1)], dtype=object)
194 = array(['we,', (68.0, 68.14)], dtype=object) # Duplicate
195 = array(['we', (68.24, 68.26)], dtype=object)
196 = array(['we', (68.24, 68.28)], dtype=object) # Duplicate
197 = array(['love', (68.36, 68.48)], dtype=object)
198 = array(['love', (68.36, 68.48)], dtype=object) # Duplicate
Invalid Timestamps
Some chunks have invalid timestamps where start > end:
( 69640, 61500, 'So what and relax.') # start=69.64s > end=61.5s
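A check for this condition can be sketched as follows. It assumes segments are `(start_ms, end_ms, text)` tuples, as in the example above; the function name `find_invalid_segments` and the second sample segment are hypothetical and only there for illustration.

```python
def find_invalid_segments(segments):
    """Return every (start_ms, end_ms, text) segment whose start is after its end."""
    return [seg for seg in segments if seg[0] > seg[1]]


segments = [
    (69640, 61500, "So what and relax."),  # from the report: start > end
    (70000, 70500, "example"),             # hypothetical valid segment
]

invalid = find_invalid_segments(segments)
for start_ms, end_ms, text in invalid:
    print(f"invalid: start={start_ms / 1000}s > end={end_ms / 1000}s, text={text!r}")
```

A check like this only flags the bad segments; it cannot recover the correct timing.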
Overlap Detection
Debug logging shows overlaps occurring at chunk boundaries:
OVERLAP DETECTED at RAW MODEL OUTPUT: 1 overlapping chunks
Chunk 182: 'what' ends at 69.820
Chunk 183: 'and' starts at 60.780
Overlap duration: 9.040s
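The detection behind this log could look roughly like the sketch below. It is not the exact debug code, only an illustration: it assumes chunks are `(text, (start, end))` pairs in raw model output order, and the name `find_overlaps` is hypothetical. Only the end of 'what' and the start of 'and' come from the log; the other two timestamps are made up.

```python
def find_overlaps(chunks):
    """Return (prev_index, index, overlap_seconds) for each chunk that
    starts before the previous chunk in the list has ended."""
    overlaps = []
    for i in range(1, len(chunks)):
        _, (_, prev_end) = chunks[i - 1]
        _, (start, _) = chunks[i]
        if start < prev_end:
            overlaps.append((i - 1, i, round(prev_end - start, 3)))
    return overlaps


chunks = [
    ("what", (69.70, 69.82)),  # start is hypothetical; end is from the log
    ("and", (60.78, 60.90)),   # start is from the log; end is hypothetical
]

overlaps = find_overlaps(chunks)
for prev_i, i, duration in overlaps:
    print(f"Chunk {prev_i} -> {i}: overlap duration {duration:.3f}s")
```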
Configuration
- Model: nyrahealth/CrisperWhisper
- Chunk length: 30s (model's training chunk size)
- Languages: Both English and German affected
- Return timestamps: word-level
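For reference, a setup matching this configuration might look like the sketch below. It is an approximation, not our exact script: the wrapper name `build_asr_pipeline` and the audio path are placeholders, and device, dtype, and batch settings are left out. `chunk_length_s` and `return_timestamps="word"` are standard arguments of the HuggingFace ASR pipeline.

```python
def build_asr_pipeline(model_id="nyrahealth/CrisperWhisper", chunk_length_s=30):
    """Build a HuggingFace ASR pipeline with 30s chunking and word-level timestamps."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,  # long audio is split into chunks of this length
        return_timestamps="word",       # request word-level timestamps
    )


# Usage (downloads the model; "audio.wav" is a placeholder path):
# asr = build_asr_pipeline()
# result = asr("audio.wav")
# result["chunks"] is a list of {"text": ..., "timestamp": (start, end)} entries
```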
Root Cause
The issue appears to stem from the HuggingFace pipeline's chunking mechanism creating
overlapping 30s segments to avoid cutting mid-word, but the timestamp merging logic fails to
properly deduplicate chunks that appear in multiple segments.
Impact
- Duplicate word chunks in output
- Invalid timestamps (some chunks have start > end)
- Corrupted transcript text when chunks are concatenated
- Issues with downstream processing that relies on clean timestamps
Attempted Workaround
We've implemented post-processing deduplication that removes chunks that have identical text
and whose timestamps fall within 0.12s of each other. However:
- It catches most duplicates and overlaps, but not all of them
- It cannot fix invalid timestamps where start > end, as there's no way to determine the
correct timing
- It shouldn't be necessary if the chunking worked correctly
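A simplified sketch of this kind of deduplication is shown below. It assumes chunks are `(text, (start, end))` pairs; the function name `dedupe_chunks` is hypothetical, and only the 0.12s threshold is taken from our actual workaround. The sample data is the duplicated output shown earlier in this report.

```python
def dedupe_chunks(chunks, threshold=0.12):
    """Drop chunks whose text matches an already-kept chunk and whose
    start and end both lie within `threshold` seconds of that chunk."""
    kept = []
    for text, (start, end) in sorted(chunks, key=lambda c: c[1]):
        is_duplicate = False
        # `kept` is ordered by start time, so scan backwards and stop once
        # earlier chunks start too far before this one to be duplicates.
        for kept_text, (kept_start, kept_end) in reversed(kept):
            if start - kept_start > threshold:
                break
            if kept_text == text and abs(end - kept_end) <= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append((text, (start, end)))
    return kept


chunks = [
    ("we,", (68.0, 68.1)),
    ("we,", (68.0, 68.14)),
    ("we", (68.24, 68.26)),
    ("we", (68.24, 68.28)),
    ("love", (68.36, 68.48)),
    ("love", (68.36, 68.48)),
]

deduped = dedupe_chunks(chunks)
```

This keeps the first occurrence of each duplicate. A word that is genuinely repeated within 0.12s would also be dropped, which is one reason a fix in the chunk merging itself would be preferable.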
Expected Behavior
Word chunks should have non-overlapping timestamps and no duplicates, even when processing
long audio files that require chunking. All timestamps should be valid (start ≤ end).