Word chunk overlap and duplicate timestamps at 30s boundaries #41

### Issue Description

  CrisperWhisper generates overlapping and duplicate word chunks at 30-second boundaries due to
  the chunking mechanism not properly merging overlapping segments. This results in duplicate
  content and corrupted timestamps.

  ### Example of Duplicate Chunks

  When sorted by start time, chunks show clear duplication:
```
  193 = array(['we,', (68.0, 68.1)], dtype=object)
  194 = array(['we,', (68.0, 68.14)], dtype=object)  # Duplicate
  195 = array(['we', (68.24, 68.26)], dtype=object)
  196 = array(['we', (68.24, 68.28)], dtype=object)  # Duplicate
  197 = array(['love', (68.36, 68.48)], dtype=object)
  198 = array(['love', (68.36, 68.48)], dtype=object)  # Duplicate
```

  ### Invalid Timestamps

  Some chunks have invalid timestamps where start > end (values below are in milliseconds):
`  ( 69640,  61500, 'So what and relax.')  # start=69.64s > end=61.5s`
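
  A check like the following can flag such entries before they reach downstream code. It is a minimal sketch that assumes segments are `(start_ms, end_ms, text)` tuples as in the line above; the function name `find_inverted` and the first sample segment are made up for illustration.

```python
def find_inverted(segments):
    """Return (index, segment) pairs whose start is later than their end."""
    return [(i, seg) for i, seg in enumerate(segments) if seg[0] > seg[1]]


segments = [
    (58000, 61000, "hypothetical well-formed segment"),  # invented sample
    (69640, 61500, "So what and relax."),                # taken from this report
]

for index, (start_ms, end_ms, text) in find_inverted(segments):
    print(f"segment {index}: start={start_ms / 1000}s > end={end_ms / 1000}s: {text!r}")
```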

  ### Overlap Detection

  Debug logging shows overlaps occurring at chunk boundaries:
```
  OVERLAP DETECTED at RAW MODEL OUTPUT: 1 overlapping chunks
    Chunk 182: 'what' ends at 69.820
    Chunk 183: 'and' starts at 60.780
    Overlap duration: 9.040s
```
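
  The debug check behind this log is roughly equivalent to the sketch below, which walks the chunks in model output order and reports every place where a chunk starts before the previous one ended. It assumes chunks are `(text, (start, end))` pairs in seconds; the function name is hypothetical, and the start of 'what' and the end of 'and' are invented because the log does not show them.

```python
def find_backward_jumps(chunks):
    """Return (prev_index, index, overlap_seconds) wherever a chunk starts
    before the preceding chunk (in output order) has ended."""
    overlaps = []
    for i in range(1, len(chunks)):
        _, (_, prev_end) = chunks[i - 1]
        _, (start, _) = chunks[i]
        if start < prev_end:
            overlaps.append((i - 1, i, round(prev_end - start, 3)))
    return overlaps


raw = [
    ("what", (69.70, 69.82)),  # end from the log; start is an assumed value
    ("and", (60.78, 60.90)),   # start from the log; end is an assumed value
]

for prev_i, i, overlap in find_backward_jumps(raw):
    print(f"chunk {prev_i} -> {i}: overlap {overlap:.3f}s")
```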

  ### Configuration

  - Model: nyrahealth/CrisperWhisper
  - Chunk length: 30s (model's training chunk size)
  - Languages: Both English and German affected
  - Return timestamps: word-level
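
  For reference, these settings correspond roughly to the pipeline arguments below. The exact call used is not part of this report, so treat this as an assumed reconstruction; other arguments (device, batch size, stride) are unknown and left at their defaults, and `build_asr_pipeline` is a hypothetical helper.

```python
# Settings listed above, expressed as keyword arguments for transformers.pipeline.
PIPELINE_KWARGS = {
    "task": "automatic-speech-recognition",
    "model": "nyrahealth/CrisperWhisper",
    "chunk_length_s": 30,
    "return_timestamps": "word",
}


def build_asr_pipeline(**overrides):
    """Create the ASR pipeline; transformers is imported lazily so the
    configuration can be inspected without the library installed."""
    from transformers import pipeline

    return pipeline(**{**PIPELINE_KWARGS, **overrides})
```

  Calling the resulting pipeline on an audio file returns a dict whose `"chunks"` entry is a list of `{"text": ..., "timestamp": (start, end)}` items, which is where the duplicates show up.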

  ### Root Cause

  The issue appears to stem from the HuggingFace pipeline's chunking mechanism creating
  overlapping 30s segments to avoid cutting mid-word, but the timestamp merging logic fails to
  properly deduplicate chunks that appear in multiple segments.
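
  To make the overlap concrete, here is a small sketch of the window arithmetic. It assumes the pipeline's documented default of `stride_length_s = chunk_length_s / 6` on each side (we have not confirmed which stride was in effect), and it only mirrors the arithmetic; it is not the pipeline's actual code. Both function names are hypothetical.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) windows in seconds for chunked inference."""
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed default stride
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride too large for the chunk length")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


def overlap_regions(windows):
    """Return the (start, end) span shared by each pair of adjacent windows."""
    return [(nxt[0], prev[1]) for prev, nxt in zip(windows, windows[1:])]


print(chunk_windows(90.0))
print(overlap_regions(chunk_windows(90.0)))
```

  Under that assumption, 90s of audio is split into windows starting every 20s, and adjacent windows share 20-30s, 40-50s and 60-70s. The overlap logged above (60.78s to 69.82s) and the duplicates around 68s both fall inside the 60-70s region, which would be consistent with words from the shared region being emitted by both windows.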

  ### Impact

  - Duplicate word chunks in output
  - Invalid timestamps (some chunks have start > end)
  - Corrupted transcript text when chunks are concatenated
  - Issues with downstream processing that relies on clean timestamps

  ### Attempted Workaround

  We've implemented post-processing deduplication that removes chunks whose text is identical and
  whose timestamps fall within 0.12s of each other. However:
  - This catches most duplicates/overlaps, but not all of them
  - We cannot fix invalid timestamps where start > end, as there's no way to determine the
  correct timing
  - This workaround shouldn't be necessary if the chunking worked correctly
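
  For anyone who needs a stopgap, the workaround looks roughly like the sketch below. It is a simplified illustration rather than our exact code: it assumes `(text, (start, end))` chunks with float timestamps in seconds, treats two chunks as duplicates when the text matches and both start and end differ by at most the threshold, and keeps the first occurrence. The name `dedupe_chunks` is hypothetical.

```python
def dedupe_chunks(chunks, tol=0.12):
    """Drop chunks that repeat an already kept chunk's text with start and
    end times both within `tol` seconds; output is sorted by start time."""
    ordered = sorted(chunks, key=lambda c: c[1])
    kept = []
    for text, (start, end) in ordered:
        is_dup = False
        # Only recently kept chunks can be within `tol` of this start time.
        for k_text, (k_start, k_end) in reversed(kept):
            if start - k_start > tol:
                break
            if k_text == text and abs(end - k_end) <= tol:
                is_dup = True
                break
        if not is_dup:
            kept.append((text, (start, end)))
    return kept


chunks = [
    ("we,", (68.0, 68.1)),
    ("we,", (68.0, 68.14)),
    ("we", (68.24, 68.26)),
    ("we", (68.24, 68.28)),
    ("love", (68.36, 68.48)),
    ("love", (68.36, 68.48)),
]
print(dedupe_chunks(chunks))
```

  On the six chunks from the example above this keeps one copy each of 'we,', 'we' and 'love'. It does nothing for chunks with start > end, and it can wrongly drop a word that is genuinely repeated within the threshold.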

  ### Expected Behavior

  Word chunks should have non-overlapping timestamps and no duplicates, even when processing
  long audio files that require chunking. All timestamps should be valid (start ≤ end).

