
Conversation

@jasonyuezhang
Owner

PR Details

  • Reworking my approach to updating this function for triage-signals: I am removing the cache-based check (added a few days ago) and adding an if/else in this function for the new behaviour.
  • Logic that is common to both approaches is being moved into helper functions so that it can be reused.

Copied from original PR: getsentry#103345

@propel-test-bot

Refactor Seer automation kickoff & split permission / rate-limit checks

Refactors `kick_off_seer_automation` in `src/sentry/tasks/post_process.py` by extracting the long inline guard logic into two reusable helpers: `seer_automation_permission_and_type_check` and `seer_automation_rate_limit_check`. The cache-based debounce previously used to avoid duplicate executions (`seer_automation_queued:<group_id>`) is removed; duplication avoidance now relies on an existing short-lived Redis lock plus the new helpers. Associated unit tests in `tests/sentry/tasks/test_post_process.py` are cleaned up: old cache/atomic-add scenarios are deleted, and focused tests for the new helpers are added. Mixin wiring is updated to include the new test mixin.

Key Changes

• Added `seer_automation_permission_and_type_check` and `seer_automation_rate_limit_check` to `src/sentry/tasks/post_process.py`
• Simplified `kick_off_seer_automation` to rely on the new helpers and the issue-summary lock; removed the cache debounce and the redundant budget/flag checks now inside the helpers
• Deleted the cache / atomic-add based tests; added `SeerAutomationHelperFunctionsTestMixin` with exhaustive unit coverage for both helpers
• Updated import lists and mixin registration to include the new tests
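Both helpers take the group and return a boolean gate. A minimal, hypothetical sketch of their shape — the `Group` dataclass, the `automation_enabled` field, and `SUPPORTED_CATEGORIES` below are illustrative stand-ins, not the real Sentry models:

```python
from dataclasses import dataclass

# Illustrative stand-ins; the real helpers in src/sentry/tasks/post_process.py
# consult feature flags, org/project settings, and quota state.
SUPPORTED_CATEGORIES = {"error", "performance"}

@dataclass
class Group:
    issue_category: str
    automation_enabled: bool

def seer_automation_permission_and_type_check(group: Group) -> bool:
    # Gate on project opt-in and a supported issue category.
    return group.automation_enabled and group.issue_category in SUPPORTED_CATEGORIES

def seer_automation_rate_limit_check(group: Group) -> bool:
    # Placeholder always-allow; the real helper checks rate-limit/budget state.
    return True

group = Group(issue_category="error", automation_enabled=True)
print(seer_automation_permission_and_type_check(group))  # True
print(seer_automation_rate_limit_check(group))  # True
```

Splitting the guards this way lets the caller exit early on the cheap permission/type check before paying for the rate-limit lookup.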

Affected Areas

src/sentry/tasks/post_process.py
tests/sentry/tasks/test_post_process.py

This summary was automatically generated by @propel-code-bot

Comment on lines +1677 to 1700
```python
def kick_off_seer_automation(job: PostProcessJob) -> None:
    from sentry.seer.autofix.issue_summary import get_issue_summary_lock_key
    from sentry.tasks.autofix import start_seer_automation

    event = job["event"]
    group = event.group

    # Only run on issues with no existing scan - TODO: Update condition for triage signals V0
    if group.seer_fixability_score is not None:
        return

    if seer_automation_permission_and_type_check(group) is False:
        return

    # Don't run if there's already a task in progress for this issue
    lock_key, lock_name = get_issue_summary_lock_key(group.id)
    lock = locks.get(lock_key, duration=1, name=lock_name)
    if lock.locked():
        return

    if seer_automation_rate_limit_check(group) is False:
        return

    start_seer_automation.delay(group.id)
```

[CriticalError]

Critical: The removal of the cache-based deduplication mechanism (lines checking `cache_key`, `cache.get(cache_key)`, `group.seer_autofix_last_triggered`, and `cache.add(cache_key, True, timeout=600)`) introduces a potential race condition.

The old code used `cache.add()` which is atomic (Redis SETNX) to ensure only one process proceeds. Without this, if multiple events for the same group are processed concurrently, `start_seer_automation.delay(group.id)` could be called multiple times for the same group.

The lock check on lines 1693-1695 only prevents concurrent execution *of this function*, but doesn't prevent the same group from being queued multiple times if this function completes and is called again before the async task starts.

Consider: Was this deduplication moved elsewhere, or is the `seer_fixability_score` check on line 1685 intended to be sufficient? If so, document why multiple task queuing is now acceptable.

File: src/sentry/tasks/post_process.py
Line: 1700
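The atomic check-and-set pattern the comment describes can be illustrated with a minimal stand-in for Django's `cache.add` (which maps to Redis SETNX on a Redis backend). `FakeCache` and `maybe_queue` below are illustrative, not Sentry code:

```python
import threading

class FakeCache:
    """Illustrative in-memory stand-in for a cache backend whose add() is atomic."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, value, timeout=None):
        # Atomically set the key only if it is absent; return True on success.
        # This mirrors the SETNX semantics the review comment relies on.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

cache = FakeCache()
queued = []

def maybe_queue(group_id):
    # Only the first caller wins the check-and-set race and queues the task.
    if not cache.add(f"seer_automation_queued:{group_id}", True, timeout=600):
        return
    queued.append(group_id)

threads = [threading.Thread(target=maybe_queue, args=(123,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(queued))  # 1 — every other thread lost the race and bailed out
```

A plain `cache.get` followed by `cache.set` would not give this guarantee: two processes can both observe the key as absent before either writes it, which is exactly the window the atomic `add` closes.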

Comment on lines 1628 to 1639
```python
# check currently supported issue categories for Seer
if group.issue_category not in [
    GroupCategory.ERROR,
    GroupCategory.PERFORMANCE,
    GroupCategory.MOBILE,
    GroupCategory.FRONTEND,
    GroupCategory.DB_QUERY,
    GroupCategory.HTTP_CLIENT,
] or group.issue_category in [
    GroupCategory.REPLAY,
    GroupCategory.FEEDBACK,
]:
```

[BestPractice]

The `seer_automation_permission_and_type_check` function checks the issue category with both an inclusion check and an exclusion check that overlap. Lines 1629-1639 check if the category is in a list that includes `REPLAY` and `FEEDBACK`, then immediately checks if it's in `[REPLAY, FEEDBACK]` and returns False. This logic is confusing.

The condition will always return False for REPLAY and FEEDBACK due to the second check, making their inclusion in the first list unnecessary. Consider simplifying:

```python
if group.issue_category not in [
    GroupCategory.ERROR,
    GroupCategory.PERFORMANCE,
    GroupCategory.MOBILE,
    GroupCategory.FRONTEND,
    GroupCategory.DB_QUERY,
    GroupCategory.HTTP_CLIENT,
]:
    return False
```
File: src/sentry/tasks/post_process.py
Line: 1639

```python
if group.seer_fixability_score is not None:
    return

if seer_automation_permission_and_type_check(group) is False:
```

[BestPractice]

The comparison `is False` is used here and on line 1697. While technically correct, it's more Pythonic to use `not seer_automation_permission_and_type_check(group)` and `not seer_automation_rate_limit_check(group)` since these functions return boolean values, not None/False ambiguity.

Suggestion:

```python
if not seer_automation_permission_and_type_check(group):
    return
```
File: src/sentry/tasks/post_process.py
Line: 1688
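The practical difference shows up if a guard ever falls through and implicitly returns `None`: `is False` treats that as a pass, while `not` still triggers the early exit. A small illustrative example (`guard_returning_none` is hypothetical):

```python
def guard_returning_none():
    # A guard that forgot an explicit return on some code path,
    # so it implicitly returns None instead of True/False.
    return None

# `is False` only matches the exact False object, so a None return
# silently skips the early exit; `not` rejects every falsy value.
print(guard_returning_none() is False)  # False -> early return NOT taken
print(not guard_returning_none())       # True  -> early return taken
```

This is also the style PEP 8 recommends: don't compare boolean values to True or False using `is` or `==`.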

3 participants