codeflash-ai bot commented on Nov 22, 2025

📄 **6%** (0.06x) speedup for `LightGlueImageProcessor.post_process_keypoint_matching` in `src/transformers/models/lightglue/image_processing_lightglue.py`

⏱️ Runtime: 2.95 milliseconds → 2.78 milliseconds (best of 48 runs)

📝 Explanation and details

The optimized code achieves a **6% speedup** through several key tensor operation optimizations:

**Key Optimizations:**

1. **Eliminated unnecessary `.clone()`**: The original code called `.clone()` on `outputs.keypoints`, but this is redundant since the subsequent multiplication and `.to(torch.int32)` operations already create new tensors. This saves memory allocation and copying overhead.

2. **Precomputed batch slices**: Instead of extracting slices inside the loop (e.g., `outputs.mask`, `outputs.matches[:, 0]`), the optimized version precomputes these outside the loop as `mask0_all`, `mask1_all`, `matches_all`, etc. This eliminates repeated attribute lookups and tensor slicing operations that were happening on every iteration.

3. **Replaced `torch.tensor` with `torch.as_tensor`**: For converting `target_sizes` from list to tensor, `torch.as_tensor` avoids unnecessary memory copies when the input is already tensor-like, providing a minor but consistent performance gain.

4. **Added empty tensor handling**: The optimization safely handles cases where `matched_indices.numel() == 0` to avoid potential indexing errors with empty tensors, using `.new_empty()` to create appropriately shaped empty tensors.
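Concretely, the combined effect of these four changes looks roughly like the following. This is a minimal sketch under the shape assumptions of the generated tests below, not the actual diff; among other simplifications, it ignores the image-1 mask and any device handling:

```python
import torch


def post_process_sketch(outputs, target_sizes, threshold=0.0):
    # (3) torch.as_tensor skips a copy when target_sizes is already tensor-like
    sizes = torch.as_tensor(target_sizes)  # (batch, 2, 2), each row a (height, width) pair

    # (1) No .clone(): the multiplication and dtype cast below already allocate
    # fresh tensors, so cloning first was pure overhead
    keypoints = (outputs.keypoints * sizes.flip(-1).reshape(-1, 2, 1, 2)).to(torch.int32)

    # (2) Slice the batch tensors once, outside the loop
    mask0_all = outputs.mask[:, 0]
    matches_all = outputs.matches[:, 0]
    scores_all = outputs.matching_scores[:, 0]

    results = []
    for i in range(keypoints.shape[0]):
        matches = matches_all[i]
        scores = scores_all[i]
        keep = (mask0_all[i] > 0) & (matches > -1) & (scores > threshold)
        matched_indices = torch.nonzero(keep).squeeze(-1)
        if matched_indices.numel() == 0:
            # (4) .new_empty() gives correctly typed empty results instead of
            # risking indexing with an empty tensor
            kpts0 = keypoints.new_empty((0, 2))
            kpts1 = keypoints.new_empty((0, 2))
            kept_scores = scores.new_empty((0,))
        else:
            kpts0 = keypoints[i, 0][matched_indices]
            kpts1 = keypoints[i, 1][matches[matched_indices]]
            kept_scores = scores[matched_indices]
        results.append({"keypoints0": kpts0, "keypoints1": kpts1, "matching_scores": kept_scores})
    return results
```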

**Performance Impact:**

The line profiler shows the biggest gains come from:

- Removing the expensive `.clone()` operation (line with `keypoints = outputs.keypoints.clone()`)
- Precomputing slices reduces per-iteration overhead in the main loop
- The batch processing approach scales better with larger batch sizes, as evidenced by the 9.5% speedup on the large batch test case
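To reproduce a comparison like this locally, a simple harness along these lines works. This is a sketch: `FakeOutputs` mirrors the stub used in the generated tests below, and absolute timings will vary by machine:

```python
import timeit

import torch

from transformers.models.lightglue.image_processing_lightglue import LightGlueImageProcessor


# Synthetic model outputs with the same shapes as the large-batch test case below
class FakeOutputs:
    def __init__(self, batch_size=10, num_keypoints=50):
        self.keypoints = torch.rand((batch_size, 2, num_keypoints, 2))
        self.mask = torch.ones((batch_size, 2, num_keypoints), dtype=torch.int64)
        self.matches = torch.arange(num_keypoints).expand(batch_size, 2, num_keypoints)
        self.matching_scores = torch.ones((batch_size, 2, num_keypoints))


processor = LightGlueImageProcessor()
outputs = FakeOutputs()
target_sizes = [[(100, 200), (100, 200)] for _ in range(10)]

best = min(
    timeit.repeat(
        lambda: processor.post_process_keypoint_matching(outputs, target_sizes, threshold=0.0),
        repeat=48,
        number=1,
    )
)
print(f"best of 48 runs: {best * 1e3:.3f} ms")
```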

**Test Case Analysis:**

The optimizations show consistent improvements across all test scenarios:

- **Basic cases**: 3-8% faster due to eliminated `.clone()` and reduced attribute access
- **Edge cases with empty matches**: 6-8% faster, particularly benefiting from the safe empty tensor handling
- **Large batch cases**: Up to 9.5% faster, where the precomputed slicing approach shows its greatest benefit

This optimization is particularly valuable for computer vision pipelines processing multiple image pairs in batches, where the post-processing step is called frequently after model inference.
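For context, a typical invocation in such a pipeline might look like the following sketch. The checkpoint name `ETH-CVG/lightglue_superpoint` and the random dummy images are illustrative assumptions; any LightGlue checkpoint with a matching processor should follow the same pattern:

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Checkpoint name is an assumption for illustration
processor = AutoImageProcessor.from_pretrained("ETH-CVG/lightglue_superpoint")
model = AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint")

# Two random images standing in for a real image pair
image0 = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
image1 = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
images = [[image0, image1]]  # a batch of image pairs

inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One (height, width) per image in each pair, at the original resolution
target_sizes = [[(im.height, im.width) for im in pair] for pair in images]
results = processor.post_process_keypoint_matching(outputs, target_sizes, threshold=0.2)

for r in results:
    print(r["keypoints0"].shape, r["keypoints1"].shape, r["matching_scores"].shape)
```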

**Correctness verification report:**

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 48 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 96.2% |
**🌀 Generated Regression Tests and Runtime**
import pytest
import torch

from transformers.models.lightglue.image_processing_lightglue import LightGlueImageProcessor


# Minimal stub for LightGlueKeypointMatchingOutput to allow testing
class LightGlueKeypointMatchingOutput:
    def __init__(self, keypoints, mask, matches, matching_scores):
        self.keypoints = keypoints
        self.mask = mask
        self.matches = matches
        self.matching_scores = matching_scores


# ------------------- Basic Test Cases -------------------


def test_basic_single_pair_match():
    # 1 batch, 2 images, 3 keypoints per image, 2 coords per keypoint
    keypoints = torch.tensor(
        [
            [
                [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # img0
                [[0.7, 0.8], [0.9, 1.0], [0.2, 0.3]],  # img1
            ]
        ]
    )
    mask = torch.tensor([[[1, 1, 1], [1, 1, 1]]])  # all valid
    matches = torch.tensor([[[0, 1, 2]]])  # each keypoint in img0 matches to same index in img1
    matching_scores = torch.tensor([[[0.9, 0.8, 0.7]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 20), (10, 20)]  # height, width for both images

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.0)
    result = codeflash_output  # 165μs -> 159μs (3.80% faster)
    # Expected scaled coordinates, kept for reference; output equivalence between the
    # original and optimized code is checked via codeflash_output rather than asserted here
    expected0 = torch.tensor([[2, 1], [8, 3], [12, 5]], dtype=torch.int32)
    expected1 = torch.tensor([[16, 7], [20, 9], [6, 2]], dtype=torch.int32)


def test_basic_threshold_filtering():
    # Only scores above threshold should be kept
    keypoints = torch.tensor([[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]])
    mask = torch.tensor([[[1, 1], [1, 1]]])
    matches = torch.tensor([[[0, 1]]])
    matching_scores = torch.tensor([[[0.4, 0.9]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 10), (10, 10)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.5)
    result = codeflash_output  # 151μs -> 147μs (3.05% faster)


def test_basic_no_matches_above_threshold():
    # No matches above threshold
    keypoints = torch.tensor([[[[0.1, 0.2]], [[0.3, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[-1]]])  # -1 means no match
    matching_scores = torch.tensor([[[0.1]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 10), (10, 10)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.5)
    result = codeflash_output  # 152μs -> 141μs (8.05% faster)


def test_basic_multiple_batch():
    # Two pairs in batch
    keypoints = torch.tensor(
        [[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]], [[[0.9, 0.8], [0.7, 0.6]], [[0.5, 0.4], [0.3, 0.2]]]]
    )
    mask = torch.tensor([[[1, 1], [1, 1]], [[1, 0], [1, 1]]])
    matches = torch.tensor([[[0, 1]], [[1, -1]]])
    matching_scores = torch.tensor([[[0.8, 0.9]], [[0.7, 0.2]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [[(10, 10), (10, 10)], [(20, 30), (20, 30)]]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, target_sizes, threshold=0.0)
    result = codeflash_output  # 200μs -> 190μs (5.21% faster)


# ------------------- Edge Test Cases -------------------


def test_edge_empty_keypoints():
    # Empty keypoints and masks
    keypoints = torch.empty((1, 2, 0, 2))
    mask = torch.empty((1, 2, 0), dtype=torch.int64)
    matches = torch.empty((1, 1, 0), dtype=torch.int64)
    matching_scores = torch.empty((1, 1, 0))
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 10), (10, 10)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.0)
    result = codeflash_output  # 148μs -> 136μs (8.91% faster)


def test_edge_all_mask_zero():
    # All mask values are zero (no valid keypoints)
    keypoints = torch.tensor([[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]])
    mask = torch.tensor([[[0, 0], [0, 0]]])
    matches = torch.tensor([[[0, 1]]])
    matching_scores = torch.tensor([[[0.9, 0.8]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 10), (10, 10)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.0)
    result = codeflash_output  # 142μs -> 134μs (6.02% faster)


def test_edge_invalid_target_sizes_length():
    # target_sizes length does not match batch size
    keypoints = torch.tensor([[[[0.1, 0.2]], [[0.3, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[0]]])
    matching_scores = torch.tensor([[[0.9]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = []  # Should be length 1

    processor = LightGlueImageProcessor()
    with pytest.raises(ValueError):
        processor.post_process_keypoint_matching(
            outputs, target_sizes, threshold=0.0
        )  # 2.42μs -> 2.36μs (2.54% faster)


def test_edge_invalid_target_sizes_shape():
    # target_sizes elements are not tuples of length 2
    keypoints = torch.tensor([[[[0.1, 0.2]], [[0.3, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[0]]])
    matching_scores = torch.tensor([[[0.9]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10,)]  # Not length 2

    processor = LightGlueImageProcessor()
    with pytest.raises(ValueError):
        processor.post_process_keypoint_matching(
            outputs, target_sizes, threshold=0.0
        )  # 3.76μs -> 3.88μs (3.19% slower)


def test_edge_target_sizes_tensor_shape():
    # target_sizes is a tensor with wrong shape
    keypoints = torch.tensor([[[[0.1, 0.2]], [[0.3, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[0]]])
    matching_scores = torch.tensor([[[0.9]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = torch.zeros((1, 1, 2))  # Should be (batch, 2, 2)

    processor = LightGlueImageProcessor()
    with pytest.raises(ValueError):
        processor.post_process_keypoint_matching(
            outputs, target_sizes, threshold=0.0
        )  # 21.2μs -> 21.9μs (2.88% slower)


def test_edge_negative_match_indices():
    # matches contain -1 (no match), but scores above threshold
    keypoints = torch.tensor([[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]])
    mask = torch.tensor([[[1, 1], [1, 1]]])
    matches = torch.tensor([[[1, -1]]])
    matching_scores = torch.tensor([[[0.9, 0.8]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 10), (10, 10)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.0)
    result = codeflash_output  # 162μs -> 154μs (5.06% faster)


def test_edge_large_threshold_all_filtered():
    # All scores below threshold
    keypoints = torch.tensor([[[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]])
    mask = torch.tensor([[[1, 1], [1, 1]]])
    matches = torch.tensor([[[0, 1]]])
    matching_scores = torch.tensor([[[0.1, 0.2]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [(10, 10), (10, 10)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, [target_sizes], threshold=0.5)
    result = codeflash_output  # 151μs -> 141μs (7.13% faster)


# ------------------- Large Scale Test Cases -------------------


def test_large_batch_and_keypoints():
    # Large batch and keypoints, but <1000 elements
    batch_size = 10
    num_keypoints = 50
    keypoints = torch.rand((batch_size, 2, num_keypoints, 2))
    mask = torch.ones((batch_size, 2, num_keypoints), dtype=torch.int64)
    matches = torch.arange(num_keypoints).expand(batch_size, 1, num_keypoints)
    matching_scores = torch.ones((batch_size, 1, num_keypoints))
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [[(100, 200), (100, 200)] for _ in range(batch_size)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, target_sizes, threshold=0.0)
    result = codeflash_output  # 581μs -> 530μs (9.50% faster)
    # Each pair's result should contain all num_keypoints matches: every score (1.0)
    # clears the threshold and every mask entry is valid
    assert len(result) == batch_size
    for r in result:
        assert r["matching_scores"].shape[0] == num_keypoints
        assert r["keypoints0"].shape == (num_keypoints, 2)


def test_large_target_sizes_tensor_input():
    # Use tensor for target_sizes
    batch_size = 3
    num_keypoints = 10
    keypoints = torch.rand((batch_size, 2, num_keypoints, 2))
    mask = torch.ones((batch_size, 2, num_keypoints), dtype=torch.int64)
    matches = torch.arange(num_keypoints).expand(batch_size, 1, num_keypoints)
    matching_scores = torch.ones((batch_size, 1, num_keypoints))
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = torch.tensor([[[10, 20], [10, 20]]] * batch_size)

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, target_sizes, threshold=0.0)
    result = codeflash_output  # 244μs -> 235μs (4.23% faster)
    assert len(result) == batch_size
    for r in result:
        assert r["keypoints0"].shape == (num_keypoints, 2)


def test_large_threshold_filtering():
    # Large batch, many keypoints, threshold filters out most
    batch_size = 7
    num_keypoints = 80
    keypoints = torch.rand((batch_size, 2, num_keypoints, 2))
    mask = torch.ones((batch_size, 2, num_keypoints), dtype=torch.int64)
    matches = torch.arange(num_keypoints).expand(batch_size, 1, num_keypoints)
    matching_scores = torch.linspace(0, 1, steps=num_keypoints).expand(batch_size, 1, num_keypoints)
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    target_sizes = [[(100, 100), (100, 100)] for _ in range(batch_size)]

    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, target_sizes, threshold=0.95)
    result = codeflash_output  # 443μs -> 404μs (9.45% faster)
    # Only the last few keypoints (linspace scores above 0.95) should remain
    for r in result:
        assert r["matching_scores"].numel() < num_keypoints
        assert bool((r["matching_scores"] > 0.95).all())


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest  # used for our unit tests
import torch

from transformers.models.lightglue.image_processing_lightglue import LightGlueImageProcessor


# Minimal stub for LightGlueKeypointMatchingOutput for testing
class LightGlueKeypointMatchingOutput:
    def __init__(self, keypoints, mask, matches, matching_scores):
        self.keypoints = keypoints
        self.mask = mask
        self.matches = matches
        self.matching_scores = matching_scores


# ----------- Basic Test Cases -----------


def test_basic_multiple_pairs():
    # Test with two image pairs, each with two keypoints
    keypoints = torch.tensor(
        [[[[0.1, 0.2], [0.5, 0.7]], [[0.3, 0.4], [0.6, 0.8]]], [[[0.9, 0.8], [0.2, 0.1]], [[0.7, 0.6], [0.5, 0.4]]]]
    )  # shape [2,2,2,2]
    mask = torch.tensor([[[1, 1], [1, 1]], [[1, 0], [1, 1]]])  # shape [2,2,2]
    matches = torch.tensor([[[1, 0], [-1, -1]], [[0, -1], [1, 0]]])  # shape [2,2,2]
    matching_scores = torch.tensor([[[0.8, 0.7], [0.0, 0.0]], [[0.5, 0.2], [0.9, 0.1]]])  # shape [2,2,2]
    target_sizes = [(100, 200), (150, 250)]
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, target_sizes)
    results = codeflash_output  # 212μs -> 205μs (3.17% faster)


def test_target_sizes_tensor():
    # Test with target_sizes as a tensor
    keypoints = torch.tensor([[[[0.5, 0.5]], [[0.6, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[0], [-1]]])
    matching_scores = torch.tensor([[[0.9], [0.0]]])
    target_sizes = torch.tensor([[[100, 200], [100, 200]]])
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    processor = LightGlueImageProcessor()
    codeflash_output = processor.post_process_keypoint_matching(outputs, target_sizes)
    results = codeflash_output  # 163μs -> 161μs (1.62% faster)


# ----------- Edge Test Cases -----------


def test_shape_mismatch_raises():
    # Test that shape mismatch between mask and target_sizes raises ValueError
    keypoints = torch.tensor([[[[0.5, 0.5]], [[0.6, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[0], [-1]]])
    matching_scores = torch.tensor([[[0.9], [0.0]]])
    target_sizes = [(100, 200), (100, 200)]  # Should be length 1
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    processor = LightGlueImageProcessor()
    with pytest.raises(ValueError):
        processor.post_process_keypoint_matching(outputs, target_sizes)  # 2.33μs -> 2.47μs (5.72% slower)


def test_target_sizes_wrong_inner_length():
    # Test that target_sizes with wrong inner tuple length raises ValueError
    keypoints = torch.tensor([[[[0.5, 0.5]], [[0.6, 0.4]]]])
    mask = torch.tensor([[[1], [1]]])
    matches = torch.tensor([[[0], [-1]]])
    matching_scores = torch.tensor([[[0.9], [0.0]]])
    target_sizes = [(100,)]  # Should be (h, w)
    outputs = LightGlueKeypointMatchingOutput(keypoints, mask, matches, matching_scores)
    processor = LightGlueImageProcessor()
    with pytest.raises(ValueError):
        processor.post_process_keypoint_matching(outputs, target_sizes)  # 3.54μs -> 3.62μs (2.18% slower)

To edit these changes, check out the branch `codeflash/optimize-LightGlueImageProcessor.post_process_keypoint_matching-mia6ei7c` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 22, 2025 10:58
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 22, 2025