
Conversation


@thlorenz thlorenz commented Nov 6, 2025

Summary by CodeRabbit

  • New Features

    • Subscription metrics and new account-fetch counters (success/failed/found/not-found)
    • Automatic pub/sub reconnection with health checks and subscription resync
    • Test helpers and an integration helper to fetch monitored-account metrics remotely
  • Bug Fixes

    • More robust eviction/subscription flow with rollback on upstream failures
    • Improved undelegation detection, counting, and clearer logs
  • Improvements

    • Config option to enable/disable subscription metrics and refined LRU subscription management
    • Exposed inspection helpers for subscription cache and metrics; expanded tests for large-scale subscriptions

Enhances subscription management and reliability by adding metrics and robust reconnection logic with automatic resubscription.

Details

Subscription Management Improvements

  • Atomic Subscribe/Unsubscribe: Improved subscribe logic to handle evictions and rollbacks atomically, preventing race conditions between LRU cache and pubsub client
  • Enhanced Error Handling: Better error propagation and logging for subscription failures
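The eviction-with-rollback flow can be sketched as follows. This is a minimal, synchronous model with illustrative names (`MockPubsub`, `SubManager`) — the actual code uses an async LRU cache and pubsub client, so treat this as a sketch of the invariant, not the real API:

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical stand-in for the upstream pubsub client.
struct MockPubsub {
    subscribed: HashSet<u64>,
    fail_unsubscribe: bool,
}

impl MockPubsub {
    fn unsubscribe(&mut self, key: u64) -> Result<(), String> {
        if self.fail_unsubscribe {
            return Err(format!("upstream unsubscribe failed for {key}"));
        }
        self.subscribed.remove(&key);
        Ok(())
    }
    fn subscribe(&mut self, key: u64) {
        self.subscribed.insert(key);
    }
}

struct SubManager {
    lru: VecDeque<u64>, // front = least recently used
    capacity: usize,
    client: MockPubsub,
}

impl SubManager {
    // Insert the new key, evict the LRU entry if over capacity, and roll
    // back both the insert and the eviction if the upstream unsubscribe
    // fails, so the cache and the client never disagree.
    fn subscribe(&mut self, key: u64) -> Result<(), String> {
        self.lru.push_back(key);
        if self.lru.len() > self.capacity {
            let evicted = self.lru.pop_front().expect("over capacity");
            if let Err(err) = self.client.unsubscribe(evicted) {
                self.lru.pop_back(); // undo the insert of `key`
                self.lru.push_front(evicted); // restore the evicted entry
                return Err(err);
            }
        }
        self.client.subscribe(key);
        Ok(())
    }
}

fn main() {
    let mut mgr = SubManager {
        lru: VecDeque::from([1, 2]),
        capacity: 2,
        client: MockPubsub {
            subscribed: HashSet::from([1, 2]),
            fail_unsubscribe: true,
        },
    };
    // Upstream failure: nothing changes, no subscription leak.
    assert!(mgr.subscribe(3).is_err());
    assert_eq!(mgr.lru, VecDeque::from([1, 2]));
    // Once upstream recovers, eviction proceeds normally.
    mgr.client.fail_unsubscribe = false;
    assert!(mgr.subscribe(3).is_ok());
    assert_eq!(mgr.lru, VecDeque::from([2, 3]));
    assert!(!mgr.client.subscribed.contains(&1));
}
```

The key point is that a failed upstream unsubscribe leaves both the LRU and the pubsub client exactly as they were before the call, which is the race the review thread below discusses.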

Metrics and Monitoring

  • Subscription Metrics: Added subscription_count method to ChainPubsubClient trait for tracking active subscriptions across all clients
  • Monitored Accounts Gauge: Updated metrics to accurately report monitored accounts count
  • Configurable Metrics: Added metrics enable/disable flag to RemoteAccountProviderConfig

Reconnection and Reliability

  • Reconnection Logic: Replaced periodic connection recycling with event-driven reconnection using the new ReconnectableClient trait
  • Fibonacci Backoff: Implemented reconnection backoff with Fibonacci-sequence delays, capped at 10 minutes
  • Automatic Resubscription: Clients automatically resubscribe to existing subscriptions after successful reconnection
  • Improved SubMuxClient: Simplified constructor and removed deprecated recycling logic
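The Fibonacci delay schedule can be sketched as follows. The function name and where the cap is applied are assumptions for illustration; the actual implementation lives in the reconnection logic:

```rust
use std::time::Duration;

// Hypothetical sketch: delays grow 1, 1, 2, 3, 5, ... seconds along the
// Fibonacci sequence and saturate at a 10 minute cap, matching the
// behavior described above.
fn fibonacci_backoff_secs(attempt: u32) -> Duration {
    const MAX_SECS: u64 = 600; // 10 minute cap
    let (mut a, mut b) = (1u64, 1u64);
    for _ in 0..attempt {
        // Capping inside the loop also prevents u64 overflow for
        // very large attempt counts.
        let next = (a + b).min(MAX_SECS);
        a = b;
        b = next;
    }
    Duration::from_secs(a.min(MAX_SECS))
}

fn main() {
    // First attempts: 1s, 1s, 2s, 3s, 5s, 8s ...
    let delays: Vec<u64> =
        (0..6).map(|i| fibonacci_backoff_secs(i).as_secs()).collect();
    assert_eq!(delays, vec![1, 1, 2, 3, 5, 8]);
    // Large attempt counts saturate at the 10 minute cap.
    assert_eq!(fibonacci_backoff_secs(100).as_secs(), 600);
}
```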

Testing and Integration

  • Integration Test: Added 07_subscription_limits.rs to test large-scale subscription scenarios (400 accounts)
  • Reconnection Tests: Comprehensive tests for reconnection logic with failed resubscription recovery
  • Updated Test Configurations: All tests now use metrics-enabled configurations

Code Quality

  • Trait Refactoring: Introduced ReconnectableClient trait for better abstraction
  • Logging Improvements: Enhanced debug/trace logging throughout subscription and reconnection flows
  • Dependency Updates: Added ureq and url for metrics fetching in integration tests
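As a rough sketch of the metrics-fetching helper: the integration test fetches the validator's Prometheus text endpoint (via ureq) and extracts a single value. The HTTP call itself is elided here and the metric names are illustrative, so only the parsing step is shown:

```rust
// Extract the value of one metric from a Prometheus text-format body.
// Note: this minimal parser skips `# HELP`/`# TYPE` lines and does not
// handle metrics with labels (`name{...} value`).
fn parse_metric(body: &str, name: &str) -> Option<u64> {
    for line in body.lines() {
        if line.starts_with('#') {
            continue;
        }
        if let Some((metric, value)) = line.rsplit_once(' ') {
            if metric == name {
                return value.trim().parse().ok();
            }
        }
    }
    None
}

fn main() {
    // In the integration test the body would come from an HTTP GET against
    // the validator's metrics port; here we inline a sample body instead.
    let body = "\
# HELP monitored_accounts Number of monitored accounts
# TYPE monitored_accounts gauge
monitored_accounts 400
account_fetches_success 1234
";
    assert_eq!(parse_metric(body, "monitored_accounts"), Some(400));
    assert_eq!(parse_metric(body, "account_fetches_success"), Some(1234));
    assert_eq!(parse_metric(body, "missing_metric"), None);
}
```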


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77c33e1 and 2cea109.

📒 Files selected for processing (1)
  • magicblock-chainlink/src/chainlink/blacklisted_accounts.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run_make_ci_test
  • GitHub Check: run_make_ci_lint
  • GitHub Check: Build Project
🔇 Additional comments (1)
magicblock-chainlink/src/chainlink/blacklisted_accounts.rs (1)

69-69: Verify the intent and scope of blacklisting the SPL Token program.

The implementation correctly adds the SPL Token program to the blacklist. However, please confirm:

  1. This blacklists only the Token program account itself (TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA), not individual token accounts created by the program.
  2. The Token-2022/Token Extensions program (TokenzQdBNbLqP5VEhdkAS6EPFLC1PHnBqCXEpPxuEb) remains unblacklisted—is this intentional?
  3. What subscription management issue prompted this addition?

Comment on lines +51 to +52
const NATIVE_TOKEN_PROGRAM_ID: Pubkey =
solana_sdk::pubkey!("TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA");

🧹 Nitpick | 🔵 Trivial

Consider adding a comment explaining the blacklisting rationale.

Similar to NATIVE_SOL_ID (lines 10-11), consider adding a comment explaining why the SPL Token program account is being blacklisted. This helps future maintainers understand the intent.

Example:

+    // The SPL Token program account should not be monitored directly
+    // (individual token accounts are fine, but the program account itself is blacklisted)
     const NATIVE_TOKEN_PROGRAM_ID: Pubkey =
         solana_sdk::pubkey!("TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const NATIVE_TOKEN_PROGRAM_ID: Pubkey =
solana_sdk::pubkey!("TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA");
// The SPL Token program account should not be monitored directly
// (individual token accounts are fine, but the program account itself is blacklisted)
const NATIVE_TOKEN_PROGRAM_ID: Pubkey =
solana_sdk::pubkey!("TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA");
🤖 Prompt for AI Agents
In magicblock-chainlink/src/chainlink/blacklisted_accounts.rs around lines 51 to
52, add a concise comment above NATIVE_TOKEN_PROGRAM_ID explaining why the SPL
Token program account is blacklisted (for example: to prevent
token-program-driven account creation/interaction or to avoid treating the token
program as a user-owned account in checks), mirroring the style and level of
detail used for NATIVE_SOL_ID on lines 10–11; keep the comment short,
action-oriented, and focused on intent for future maintainers.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cea109 and 917f177.

📒 Files selected for processing (2)
  • magicblock-api/src/tickers.rs (1 hunks)
  • magicblock-metrics/src/service.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.
📚 Learning: 2025-11-07T14:20:31.457Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.

Applied to files:

  • magicblock-metrics/src/service.rs
🧬 Code graph analysis (1)
magicblock-api/src/tickers.rs (2)
magicblock-api/src/magic_validator.rs (1)
  • ledger (735-737)
magicblock-metrics/src/metrics/mod.rs (1)
  • set_ledger_size (297-299)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Project
  • GitHub Check: run_make_ci_test
  • GitHub Check: run_make_ci_lint
🔇 Additional comments (1)
magicblock-api/src/tickers.rs (1)

111-130: LGTM! Clean implementation with proper cancellation handling.

The simplified implementation correctly:

  • Handles ledger storage size retrieval errors with appropriate logging
  • Uses tokio::select! for graceful cancellation
  • Clones the Arc<Ledger> before moving into the async block

Comment on lines 104 to 110
#[allow(unused_variables)]
pub fn init_system_metrics_ticker(
tick_duration: Duration,
ledger: &Arc<Ledger>,
accountsdb: &Arc<AccountsDb>,
token: CancellationToken,
) -> tokio::task::JoinHandle<()> {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the unused accountsdb parameter is intentional.

The accountsdb parameter is no longer used in the simplified implementation, yet it remains in the function signature. While the #[allow(unused_variables)] attribute acknowledges this, keeping unused parameters can mislead callers about the function's behavior.

Please confirm:

  1. Is this parameter kept for API compatibility during a transition period?
  2. Should it be removed in a future breaking change?
  3. Should the function documentation explicitly note that accountsdb is currently unused but kept for compatibility?

To verify the impact of removing this parameter, run:


🏁 Script executed:

#!/bin/bash
# Search for all call sites of init_system_metrics_ticker
rg -n -A5 -B2 'init_system_metrics_ticker\s*\(' --type rust

Length of output: 1143



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the complete function implementation to see if accountsdb is used
sed -n '104,130p' magicblock-api/src/tickers.rs

Length of output: 838


🏁 Script executed:

#!/bin/bash
# Search comprehensively for all references to init_system_metrics_ticker (function calls only, not definition)
rg -n 'init_system_metrics_ticker' --type rust | grep -v '^\s*pub fn init_system_metrics_ticker'

Length of output: 353


🏁 Script executed:

#!/bin/bash
# Check the call site in magic_validator.rs to see the full context
sed -n '220,235p' magicblock-api/src/magic_validator.rs

Length of output: 466


Remove the unused accountsdb parameter from function signature and call site.

The accountsdb parameter is declared but never used in the function implementation. The #[allow(unused_variables)] attribute is suppressing a compiler warning that indicates incomplete refactoring.

Required fixes:

  1. Remove accountsdb: &Arc<AccountsDb>, from function signature at magicblock-api/src/tickers.rs:108
  2. Remove &accountsdb, from the function call at magicblock-api/src/magic_validator.rs:227
🤖 Prompt for AI Agents
In magicblock-api/src/tickers.rs around lines 104–110, the function
init_system_metrics_ticker declares an unused parameter accountsdb and silences
the warning with #[allow(unused_variables)]; remove the accountsdb:
&Arc<AccountsDb>, parameter from the function signature and delete the
#[allow(unused_variables)] attribute, then update all call sites (specifically
remove the &accountsdb, argument at magicblock-api/src/magic_validator.rs:227)
so the function is invoked with the remaining parameters; also remove any
now-unused AccountsDb imports or bindings to keep the code clean.

Comment on lines 132 to 204
/*
#[allow(unused_variables)]
pub fn init_system_metrics_ticker_old(
tick_duration: Duration,
ledger: &Arc<Ledger>,
accountsdb: &Arc<AccountsDb>,
token: CancellationToken,
) -> tokio::task::JoinHandle<()> {
fn try_set_ledger_counts(ledger: &Ledger) {
macro_rules! try_set_ledger_count {
($name:ident) => {
paste::paste! {
match ledger.[< count_ $name >]() {
Ok(count) => {
metrics::[< set_ledger_ $name _count >](count);
}
Err(err) => warn!(
"Failed to get ledger {} count: {:?}",
stringify!($name),
err
),
}
}
};
}
try_set_ledger_count!(block_times);
try_set_ledger_count!(blockhashes);
try_set_ledger_count!(slot_signatures);
try_set_ledger_count!(address_signatures);
try_set_ledger_count!(transaction_status);
try_set_ledger_count!(transaction_successful_status);
try_set_ledger_count!(transaction_failed_status);
try_set_ledger_count!(transactions);
try_set_ledger_count!(transaction_memos);
try_set_ledger_count!(perf_samples);
try_set_ledger_count!(account_mod_data);
}
fn try_set_ledger_storage_size(ledger: &Ledger) {
match ledger.storage_size() {
Ok(byte_size) => metrics::set_ledger_size(byte_size),
Err(err) => warn!("Failed to get ledger storage size: {:?}", err),
}
}
fn set_accounts_storage_size(bank: &Bank) {
let byte_size = bank.accounts_db_storage_size();
metrics::set_accounts_size(byte_size);
}
fn set_accounts_count(bank: &Bank) {
metrics::set_accounts_count(bank.accounts_db.get_accounts_count());
}
let ledger = ledger.clone();
let bank = bank.clone();
tokio::task::spawn(async move {
loop {
tokio::select! {
_ = tokio::time::sleep(tick_duration) => {
try_set_ledger_storage_size(&ledger);
set_accounts_storage_size(&bank);
try_set_ledger_counts(&ledger);
set_accounts_count(&bank);
},
_ = token.cancelled() => {
break;
}
}
}
})
tokio::task::spawn(async move {})
}
*/

🧹 Nitpick | 🔵 Trivial

Consider removing the commented-out old implementation.

The 72-line commented block preserves the previous multi-metric implementation. While this may serve as temporary reference during the transition, large commented sections can clutter the codebase and create maintenance overhead.

Recommendations:

  1. If this is a temporary reference during a transition period, add a TODO comment with a timeline for removal (e.g., "TODO: Remove after v2.0 release").
  2. If the old implementation may be needed for rollback, consider preserving it in a separate branch or commit rather than in comments.
  3. If permanently removed, delete the commented code—Git history will preserve it if needed for future reference.
🤖 Prompt for AI Agents
In magicblock-api/src/tickers.rs around lines 132 to 204, there is a 72-line
commented-out old implementation (init_system_metrics_ticker_old) cluttering the
file; remove the large commented block or replace it with a short TODO note. If
you need a temporary reference during the transition, replace the commented code
with a single-line TODO including a removal timeline (e.g., "TODO: remove after
vX.Y") or move the old implementation to a separate commit/branch, otherwise
delete the commented block entirely so Git history retains the prior code.

Comment on lines +131 to +137
// We must consume the body fully to keep the connection alive. We
// iterate over all chunks and simply drop them. This prevents garbage
// data of previous requests from being stuck in connection buffer.
let mut body = req.into_body();
while (body.frame().await).is_some() {}

result

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Add timeout and error handling to body draining loop.

The body draining logic has the following concerns:

  1. DoS vulnerability: The loop has no timeout, allowing a malicious or slow client to tie up the server indefinitely by sending a large body slowly. Consider adding a timeout (e.g., using tokio::time::timeout) or a maximum frame count limit.

  2. Silent error handling: Errors from body.frame().await are silently ignored. At minimum, log errors for debugging purposes.

  3. Performance: The body is drained for every request, including GET requests that typically have no body. While the overhead is minimal for empty bodies, consider checking req.headers().get("content-length") first to skip draining when there's no body.

Apply this diff to add timeout and error logging:

-    // We must consume the body fully to keep the connection alive. We
-    // iterate over all chunks and simply drop them. This prevents garbage
-    // data of previous requests from being stuck in connection buffer.
-    let mut body = req.into_body();
-    while (body.frame().await).is_some() {}
+    // We must consume the body fully to keep the connection alive. We
+    // iterate over all chunks and simply drop them. This prevents garbage
+    // data of previous requests from being stuck in connection buffer.
+    let mut body = req.into_body();
+    let drain_timeout = tokio::time::Duration::from_secs(30);
+    match tokio::time::timeout(drain_timeout, async {
+        while let Some(frame_result) = body.frame().await {
+            if let Err(e) = frame_result {
+                warn!("Error draining request body: {:?}", e);
+            }
+        }
+    })
+    .await
+    {
+        Err(_) => warn!("Timeout draining request body"),
+        Ok(_) => {}
+    }

Additionally, verify whether this body draining is necessary with the current version of Hyper. Modern versions typically handle request body consumption automatically for keep-alive connections:


🌐 Web query:

Does Hyper automatically consume request bodies for HTTP/1.1 keep-alive connections?

💡 Result:

Short answer: No. Hyper does not implicitly drain/consume an unread request body for you — if you want the TCP connection to be reused for HTTP/1.1 keep‑alive you must read (or explicitly discard) the request body before returning a response, or configure your server/connection behavior accordingly. See hyper's docs on request bodies and collecting them, and the HTTP/1.1 messaging guidance that servers should read or discard a body if they intend to keep a connection open. [1][2][3]

Sources:

  • hyper guide: request/response bodies and collecting the request body. [1]
  • hyper server/conn::Http options (http1_keep_alive behavior). [2]
  • HTTP messaging spec: servers SHOULD read/discard request bodies if they will keep the transport open (HTTP/1.1 guidance). [3]


Add timeout and error handling to body draining loop to prevent DoS and improve observability.

The body draining logic correctly implements HTTP/1.1 keep-alive support, as Hyper requires explicit body consumption before reusing the connection. However, the implementation needs hardening:

  1. DoS vulnerability: The loop has no timeout, allowing a malicious or slow client to tie up the server indefinitely by sending a large body slowly. Add a timeout (e.g., tokio::time::timeout) or a maximum frame count limit.

  2. Silent error handling: Errors from body.frame().await are silently ignored, making debugging difficult. At minimum, log errors.

  3. Performance optimization (optional): Skip body draining for requests without a body by checking req.headers().get("content-length") first.

Apply this diff to add timeout and error logging:

-    // We must consume the body fully to keep the connection alive. We
-    // iterate over all chunks and simply drop them. This prevents garbage
-    // data of previous requests from being stuck in connection buffer.
-    let mut body = req.into_body();
-    while (body.frame().await).is_some() {}
+    // We must consume the body fully to keep the connection alive. We
+    // iterate over all chunks and simply drop them. This prevents garbage
+    // data of previous requests from being stuck in connection buffer.
+    let mut body = req.into_body();
+    let drain_timeout = tokio::time::Duration::from_secs(30);
+    match tokio::time::timeout(drain_timeout, async {
+        while let Some(frame_result) = body.frame().await {
+            if let Err(e) = frame_result {
+                warn!("Error draining request body: {:?}", e);
+            }
+        }
+    })
+    .await
+    {
+        Err(_) => warn!("Timeout draining request body"),
+        Ok(_) => {}
+    }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In magicblock-metrics/src/service.rs around lines 131 to 137, the body-draining
loop needs hardening: wrap the await in a tokio::time::timeout (e.g., a few
seconds) to avoid slow-client DoS, track a maximum frame count (e.g., a
reasonable upper bound) to prevent infinite loops, and check whether the request
actually has a body first (inspect Content-Length or absence of
transfer-encoding) to skip draining when unnecessary. On timeout or any Err from
body.frame().await, log the error with context and break the loop; on Ok(None)
stop normally. Ensure the timeout and frame count are configurable constants and
that logging includes request id/context for observability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test-integration/test-schedule-intent/tests/test_schedule_intents.rs (1)

128-141: Add verification after second undelegation at line 140.

The test verifies undelegation after the first schedule_intent (line 128), but the second schedule_intent at line 132 with Some(vec![102]) also triggers undelegation since any Some(...) value in counter_diffs sets is_undelegate = true. For consistency and completeness, add verify_undelegation_in_ephem_via_owner(&[payer.pubkey()], &ctx); after the assertion block ends at line 140.

magicblock-chainlink/src/remote_account_provider/mod.rs (2)

911-970: Retry scope for ForUser errors could be narrower.

The code retries on all RpcError::ForUser errors (line 930), but the comment (lines 915-919) suggests only the "AccountNotFound: Minimum context slot has not been reached" case requires retries. Other ForUser errors (authentication failures, rate limits, etc.) won't resolve with retries and waste retry budget.

This is acceptable for a background task with limited retries, but consider narrowing the retry condition:

 RpcError::ForUser(ref rpc_user_err) => {
-    // When an account is not present for the desired
-    // min-context slot then we normally get the below
-    // handled `RpcResponseError`, but may also get the
-    // following error from the RPC.
-    // See test::ixtest_existing_account_for_future_slot
-    // ```
-    // RpcError(
-    //   ForUser(
-    //       "AccountNotFound: \
-    // pubkey=DaeruQ4SukTQaJA5muyv51MQZok7oaCAF8fAW19mbJv5: \
-    //        RPC response error -32016: \
-    //        Minimum context slot has not been reached; ",
-    //   ),
-    // )
-    // ```
-    retry!("Fetching accounts failed: {rpc_user_err:?}");
+    // Retry only for context slot issues
+    if rpc_user_err.contains("Minimum context slot") 
+        || rpc_user_err.contains("AccountNotFound") {
+        retry!("Fetching accounts failed: {rpc_user_err:?}");
+    } else {
+        let err_msg = format!(
+            "RpcError fetching accounts {}: ForUser({rpc_user_err})", 
+            pubkeys_str(&pubkeys)
+        );
+        notify_error(&err_msg);
+        return;
+    }
 }

728-758: Subscription leak on eviction failure—reinsertion of evicted account required for rollback consistency.

When pubsub_client.unsubscribe(evicted) fails at line 738, the code correctly removes pubkey from the LRU cache to prevent subscription of the new account. However, evicted was already removed from the LRU by add(*pubkey) on line 734 and is not reinserted during rollback.

This creates a mismatch: evicted remains subscribed in the pubsub client but is no longer tracked in the LRU cache, causing a subscription leak. The diagnostic code at lines 223–254 will detect and warn about this inconsistency.

Restore the pre-add state by reinserting evicted:

 if let Err(err) = self.pubsub_client.unsubscribe(evicted).await {
     warn!(
         "Failed to unsubscribe from pubsub for evicted account {evicted}: {err:?}");
     // Rollback the LRU add since eviction failed
     self.lrucache_subscribed_accounts.remove(pubkey);
+    // Restore evicted to maintain consistency between LRU and pubsub
+    self.lrucache_subscribed_accounts.add(evicted);
     return Err(err);
 }

Since pubkey was just removed, capacity is available and reinserting evicted will not trigger another eviction.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 917f177 and 0b1b245.

📒 Files selected for processing (3)
  • magicblock-chainlink/src/chainlink/mod.rs (5 hunks)
  • magicblock-chainlink/src/remote_account_provider/mod.rs (28 hunks)
  • test-integration/test-schedule-intent/tests/test_schedule_intents.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
📚 Learning: 2025-11-07T14:20:31.457Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
📚 Learning: 2025-10-21T14:00:54.642Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
📚 Learning: 2025-11-07T13:20:13.793Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: magicblock-processor/src/scheduler/coordinator.rs:227-238
Timestamp: 2025-11-07T13:20:13.793Z
Learning: In magicblock-processor's ExecutionCoordinator (scheduler/coordinator.rs), the `account_contention` HashMap intentionally does not call `shrink_to_fit()`. Maintaining slack capacity is beneficial for performance by avoiding frequent reallocations during high transaction throughput. As long as empty entries are removed from the map (which `clear_account_contention` does), the capacity overhead is acceptable.

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
📚 Learning: 2025-10-14T09:56:14.047Z
Learnt from: taco-paco
Repo: magicblock-labs/magicblock-validator PR: 564
File: test-integration/programs/flexi-counter/src/processor/call_handler.rs:122-125
Timestamp: 2025-10-14T09:56:14.047Z
Learning: The file test-integration/programs/flexi-counter/src/processor/call_handler.rs contains a test smart contract used for integration testing, not production code.

Applied to files:

  • test-integration/test-schedule-intent/tests/test_schedule_intents.rs
📚 Learning: 2025-11-07T13:09:52.253Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: test-kit/src/lib.rs:275-0
Timestamp: 2025-11-07T13:09:52.253Z
Learning: In test-kit, the transaction scheduler in ExecutionTestEnv is not expected to shut down during tests. Therefore, using `.unwrap()` in test helper methods like `schedule_transaction` is acceptable and will not cause issues in the test environment.

Applied to files:

  • test-integration/test-schedule-intent/tests/test_schedule_intents.rs
📚 Learning: 2025-10-26T16:53:29.820Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 587
File: magicblock-chainlink/src/remote_account_provider/mod.rs:134-0
Timestamp: 2025-10-26T16:53:29.820Z
Learning: In magicblock-chainlink/src/remote_account_provider/mod.rs, the `Endpoint::separate_pubsub_url_and_api_key()` method uses `split_once("?api-key=")` because the api-key parameter is always the only query parameter right after `?`. No additional query parameter parsing is needed for this use case.

Applied to files:

  • magicblock-chainlink/src/remote_account_provider/mod.rs
🧬 Code graph analysis (3)
magicblock-chainlink/src/chainlink/mod.rs (2)
magicblock-chainlink/src/chainlink/fetch_cloner.rs (1)
  • new (134-158)
magicblock-chainlink/src/chainlink/blacklisted_accounts.rs (1)
  • blacklisted_accounts (6-30)
test-integration/test-schedule-intent/tests/test_schedule_intents.rs (1)
test-integration/programs/flexi-counter/src/state.rs (1)
  • pda (32-35)
magicblock-chainlink/src/remote_account_provider/mod.rs (6)
magicblock-metrics/src/metrics/mod.rs (5)
  • inc_account_fetches_failed (399-401)
  • inc_account_fetches_found (403-405)
  • inc_account_fetches_not_found (407-409)
  • inc_account_fetches_success (395-397)
  • set_monitored_accounts_count (357-359)
magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs (1)
  • new (126-158)
magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs (3)
  • new (46-54)
  • new (311-322)
  • try_new_from_url (165-180)
magicblock-chainlink/src/submux/mod.rs (1)
  • new (147-158)
magicblock-chainlink/src/remote_account_provider/lru_cache.rs (4)
  • new (25-34)
  • pubkeys (38-41)
  • pubkeys (129-135)
  • len (117-123)
magicblock-chainlink/src/remote_account_provider/config.rs (2)
  • default (65-72)
  • try_new_with_metrics (27-42)
🔇 Additional comments (16)
test-integration/test-schedule-intent/tests/test_schedule_intents.rs (3)

66-69: LGTM! Verification helper appropriately placed.

The undelegation verification is correctly positioned after the schedule_intent operation. Minor observation: line 247 includes a debug log after verification, but this call site doesn't—consider adding one for consistency with other tests.


193-196: LGTM! Multi-account verification correctly implemented.

The verification helper is properly invoked with multiple pubkeys, correctly mapping from the payers vector. This ensures all undelegated accounts are verified in the ephemeral layer.


243-247: LGTM! Good example of verification with logging.

This call site demonstrates the complete pattern with a debug log after verification. This pattern could be applied to other call sites (lines 69, 128) for consistency.

magicblock-chainlink/src/chainlink/mod.rs (4)

1-4: LGTM!

The atomic imports are appropriate for the new counter-based account removal tracking in reset_accounts_bank.


139-191: LGTM!

The refactored account removal logic with atomic counters provides clear telemetry. The corrected non_empty calculation using saturating_sub (from previous fixes) properly computes non-empty accounts without the fetch_sub pitfall.
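A minimal illustration of the pitfall being avoided (counter names are illustrative, not the actual fields in `reset_accounts_bank`):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    // Illustrative counters; the real field names differ.
    let removed_total = AtomicUsize::new(5);
    let removed_empty = AtomicUsize::new(2);

    // Deriving non_empty from loaded values with saturating_sub clamps at
    // zero if the counts ever cross.
    let non_empty = removed_total
        .load(Ordering::Relaxed)
        .saturating_sub(removed_empty.load(Ordering::Relaxed));
    assert_eq!(non_empty, 3);

    // fetch_sub, by contrast, mutates the counter and returns the *previous*
    // value, which is easy to misread and wraps on underflow.
    let prev = removed_total.fetch_sub(2, Ordering::Relaxed);
    assert_eq!(prev, 5);
    assert_eq!(removed_total.load(Ordering::Relaxed), 3);
}
```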


317-338: LGTM!

The enhanced logging correctly gates on Trace level and uses the trace!() macro consistently. The mark_empty_if_not_found details provide valuable debugging context for account fetching.


361-377: LGTM!

Upgrading undelegation logging from trace!() to debug!() is appropriate for these important operational events, making them visible without requiring trace-level verbosity.

magicblock-chainlink/src/remote_account_provider/mod.rs (9)

1-74: LGTM!

The new imports support the enhanced functionality: Entry API for atomic fetch operations, metrics integration, and background task management. The FetchResult type alias improves code clarity, and the Helius-specific error code constant documents a provider-specific variation.


75-109: LGTM!

The field rename to lrucache_subscribed_accounts is more descriptive, and the _active_subscriptions_task_handle properly maintains the background task lifetime through Drop semantics.


272-330: LGTM!

The constructor properly initializes the LRU cache and conditionally starts the metrics updater based on configuration. The NonZeroUsize construction is safe due to the config validation invariant documented in the comment.


332-377: LGTM!

The per-endpoint abort channels enable independent lifecycle management for each pubsub client, supporting the event-driven reconnection strategy mentioned in the PR objectives.


610-706: LGTM!

The Entry API usage (lines 634-643) correctly appends to existing waiter lists instead of replacing them, fixing the concurrent fetch issue from previous reviews. The nested Result handling for FetchResult is appropriate for the channel-based architecture.
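A toy sketch of why the Entry-based append matters (stand-in key and waiter types; the real map stores response waiters keyed by `Pubkey`):

```rust
use std::collections::HashMap;

// Illustrative stand-in; the real code stores oneshot senders.
type Waiter = u32;

fn register_waiter(
    pending: &mut HashMap<&'static str, Vec<Waiter>>,
    key: &'static str,
    w: Waiter,
) {
    // entry().or_default() appends to an existing waiter list instead of
    // replacing it, so a second concurrent fetch never drops the first waiter.
    pending.entry(key).or_default().push(w);
}

fn main() {
    let mut pending = HashMap::new();
    register_waiter(&mut pending, "pubkey-A", 1);
    register_waiter(&mut pending, "pubkey-A", 2);
    assert_eq!(pending["pubkey-A"], vec![1, 2]);

    // A plain insert would have overwritten the first list:
    let mut naive: HashMap<&str, Vec<Waiter>> = HashMap::new();
    naive.insert("pubkey-A", vec![1]);
    naive.insert("pubkey-A", vec![2]);
    assert_eq!(naive["pubkey-A"], vec![2]);
}
```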


708-725: LGTM!

The subscription setup cleanly delegates to the subscribe method for each pubkey, which properly handles deduplication via is_watching checks.


797-825: LGTM!

The unsubscribe logic correctly maintains consistency between the LRU cache and pubsub client by only removing from the LRU after successful pubsub unsubscription (lines 809-813). The early return for non-tracked pubkeys (lines 801-807) prevents spurious errors.


849-886: LGTM!

The notify_error helper and retry! macro cleanly encapsulate error handling and retry logic, properly integrating metrics and notifying all waiting requests on failure.


976-1055: LGTM!

The metrics integration properly tracks found/not-found counts and updates gauges after successful fetches. The handling of mark_empty_if_not_found accounts (lines 991-1004) correctly distinguishes between truly missing accounts and those explicitly marked as empty.

Comment on lines +200 to +270
/// Creates a background task that periodically updates the active subscriptions gauge
fn start_active_subscriptions_updater<PubsubClient: ChainPubsubClient>(
subscribed_accounts: Arc<AccountsLruCache>,
pubsub_client: Arc<PubsubClient>,
) -> task::JoinHandle<()> {
task::spawn(async move {
let mut interval = time::interval(Duration::from_millis(
ACTIVE_SUBSCRIPTIONS_UPDATE_INTERVAL_MS,
));
let never_evicted = subscribed_accounts.never_evicted_accounts();

loop {
interval.tick().await;
let lru_count = subscribed_accounts.len();
let (pubsub_total, pubsub_without_never_evict) = pubsub_client
.subscription_count(Some(&never_evicted))
.await;

let all_pubsub_subs = if log::log_enabled!(log::Level::Debug) {
pubsub_client.subscriptions()
} else {
vec![]
};
if lru_count != pubsub_without_never_evict {
warn!(
"User account subscription counts LRU cache={} pubsub client={} don't match",
lru_count, pubsub_without_never_evict
);
if log::log_enabled!(log::Level::Debug) {
// Log all pubsub subscriptions for debugging
trace!(
"All pubsub subscriptions: {:?}",
all_pubsub_subs
);

// Find extra keys in pubsub that are not in LRU cache
let lru_pubkeys = subscribed_accounts.pubkeys();
let pubsub_subs_without_never_evict: HashSet<_> =
all_pubsub_subs
.iter()
.filter(|pk| !never_evicted.contains(pk))
.copied()
.collect();
let lru_pubkeys_set: HashSet<_> =
lru_pubkeys.into_iter().collect();

let extra_in_pubsub: Vec<_> =
pubsub_subs_without_never_evict
.difference(&lru_pubkeys_set)
.cloned()
.collect();
let extra_in_lru: Vec<_> = lru_pubkeys_set
.difference(&pubsub_subs_without_never_evict)
.cloned()
.collect();

if !extra_in_pubsub.is_empty() {
debug!("Extra pubkeys in pubsub client not in LRU cache: {:?}", extra_in_pubsub);
}
if !extra_in_lru.is_empty() {
debug!("Extra pubkeys in LRU cache not in pubsub client: {:?}", extra_in_lru);
}
}
}

debug!("Updating active subscriptions: count={}", pubsub_total);
trace!("All subscriptions: {}", pubkeys_str(&all_pubsub_subs));
set_monitored_accounts_count(pubsub_total);
}
})
}
🧹 Nitpick | 🔵 Trivial

LGTM with optional improvement suggestion.

The background updater provides valuable subscription-count reconciliation. The diagnostic logging (lines 228-262) is properly gated behind debug-level checks to avoid overhead in production.

Optional: Consider graceful shutdown for the metrics task.

The background task runs indefinitely and will be aborted when the provider is dropped. For a metrics task this is acceptable, but you could add a CancellationToken parameter to allow clean shutdown if needed in the future.

 fn start_active_subscriptions_updater<PubsubClient: ChainPubsubClient>(
     subscribed_accounts: Arc<AccountsLruCache>,
     pubsub_client: Arc<PubsubClient>,
+    shutdown: tokio_util::sync::CancellationToken,
 ) -> task::JoinHandle<()> {
     task::spawn(async move {
         let mut interval = time::interval(Duration::from_millis(
             ACTIVE_SUBSCRIPTIONS_UPDATE_INTERVAL_MS,
         ));
         let never_evicted = subscribed_accounts.never_evicted_accounts();

         loop {
-            interval.tick().await;
+            tokio::select! {
+                _ = interval.tick() => {},
+                _ = shutdown.cancelled() => {
+                    debug!("Active subscriptions updater shutting down");
+                    break;
+                }
+            }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
magicblock-chainlink/src/remote_account_provider/mod.rs around lines 200 to 270:
the background updater currently runs forever and is only aborted when the
provider is dropped; add an explicit CancellationToken parameter to
start_active_subscriptions_updater and use it inside the spawned task to stop
the loop cleanly (e.g., use tokio::select! to await either interval.tick().await
or token.cancelled().await and break/return on cancellation), propagate the
token from the caller when spawning the task, and ensure the task exits promptly
so the JoinHandle completes gracefully.

Comment on lines +460 to +492
fn verify_undelegation_in_ephem_via_owner(
pubkeys: &[Pubkey],
ctx: &IntegrationTestContext,
) {
const RETRY_LIMIT: usize = 20;
let mut retries = 0;

loop {
ctx.wait_for_next_slot_ephem().unwrap();
let mut not_verified = vec![];
for pk in pubkeys.iter() {
let counter_pda = FlexiCounter::pda(pk).0;
let owner = ctx.fetch_ephem_account_owner(counter_pda).unwrap();
if owner == delegation_program_id() {
not_verified.push(*pk);
}
}
if not_verified.is_empty() {
break;
}
retries += 1;
if retries >= RETRY_LIMIT {
panic!(
"Failed to verify undelegation for pubkeys: {}",
not_verified
.iter()
.map(|k| k.to_string())
.collect::<Vec<_>>()
.join(", ")
);
}
}
}
🧹 Nitpick | 🔵 Trivial

Consider adding progress logging to aid debugging.

The verification logic is correct: it polls until all accounts' owners are no longer delegation_program_id() or the retry limit is exceeded. The retry limit of 20 and the polling approach are reasonable for test scenarios.

Consider adding debug logging to track verification progress, which would help diagnose issues if the retry limit is reached:

 fn verify_undelegation_in_ephem_via_owner(
     pubkeys: &[Pubkey],
     ctx: &IntegrationTestContext,
 ) {
     const RETRY_LIMIT: usize = 20;
     let mut retries = 0;
 
     loop {
         ctx.wait_for_next_slot_ephem().unwrap();
         let mut not_verified = vec![];
         for pk in pubkeys.iter() {
             let counter_pda = FlexiCounter::pda(pk).0;
             let owner = ctx.fetch_ephem_account_owner(counter_pda).unwrap();
             if owner == delegation_program_id() {
                 not_verified.push(*pk);
             }
         }
         if not_verified.is_empty() {
+            debug!("✅ Verified undelegation for {} accounts", pubkeys.len());
             break;
         }
         retries += 1;
+        debug!(
+            "⏳ Undelegation verification retry {}/{}: {} accounts remaining",
+            retries,
+            RETRY_LIMIT,
+            not_verified.len()
+        );
         if retries >= RETRY_LIMIT {
             panic!(
                 "Failed to verify undelegation for pubkeys: {}",
                 not_verified
                     .iter()
                     .map(|k| k.to_string())
                     .collect::<Vec<_>>()
                     .join(", ")
             );
         }
     }
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In test-integration/test-schedule-intent/tests/test_schedule_intents.rs around
lines 460 to 492, add progress logging inside the polling loop to emit the
current retry count and the list of pubkeys still not verified each iteration
and a success log when verification completes; specifically, log (e.g., via
tracing::info! or println!) the retries and not_verified pubkeys before
incrementing retries so failures are easier to diagnose, and optionally log a
final confirmation once not_verified.is_empty() before breaking.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b1b245 and 7880a7d.

📒 Files selected for processing (3)
  • magicblock-chainlink/src/chainlink/fetch_cloner.rs (5 hunks)
  • magicblock-chainlink/src/chainlink/mod.rs (5 hunks)
  • magicblock-metrics/src/metrics/mod.rs (9 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
📚 Learning: 2025-10-14T09:56:14.047Z
Learnt from: taco-paco
Repo: magicblock-labs/magicblock-validator PR: 564
File: test-integration/programs/flexi-counter/src/processor/call_handler.rs:122-125
Timestamp: 2025-10-14T09:56:14.047Z
Learning: The file test-integration/programs/flexi-counter/src/processor/call_handler.rs contains a test smart contract used for integration testing, not production code.

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
📚 Learning: 2025-11-07T13:20:13.793Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: magicblock-processor/src/scheduler/coordinator.rs:227-238
Timestamp: 2025-11-07T13:20:13.793Z
Learning: In magicblock-processor's ExecutionCoordinator (scheduler/coordinator.rs), the `account_contention` HashMap intentionally does not call `shrink_to_fit()`. Maintaining slack capacity is beneficial for performance by avoiding frequent reallocations during high transaction throughput. As long as empty entries are removed from the map (which `clear_account_contention` does), the capacity overhead is acceptable.

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
📚 Learning: 2025-10-21T14:00:54.642Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner.rs
  • magicblock-metrics/src/metrics/mod.rs
📚 Learning: 2025-11-07T14:20:31.457Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.

Applied to files:

  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner.rs
📚 Learning: 2025-10-26T16:53:29.820Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 587
File: magicblock-chainlink/src/remote_account_provider/mod.rs:134-0
Timestamp: 2025-10-26T16:53:29.820Z
Learning: In magicblock-chainlink/src/remote_account_provider/mod.rs, the `Endpoint::separate_pubsub_url_and_api_key()` method uses `split_once("?api-key=")` because the api-key parameter is always the only query parameter right after `?`. No additional query parameter parsing is needed for this use case.

Applied to files:

  • magicblock-chainlink/src/chainlink/fetch_cloner.rs
🧬 Code graph analysis (2)
magicblock-chainlink/src/chainlink/mod.rs (2)
magicblock-chainlink/src/chainlink/blacklisted_accounts.rs (1)
  • blacklisted_accounts (6-30)
magicblock-metrics/src/metrics/mod.rs (1)
  • inc_undelegation_requested (427-429)
magicblock-chainlink/src/chainlink/fetch_cloner.rs (2)
magicblock-metrics/src/metrics/mod.rs (1)
  • inc_undelegation_completed (431-433)
magicblock-chainlink/src/remote_account_provider/config.rs (1)
  • try_new_with_metrics (27-42)

Comment on lines 224 to 235
if let Some(in_bank) =
self.accounts_bank.get_account(&pubkey)
{
if !in_bank.delegated()
&& in_bank.owner().eq(&dlp::id())
&& !account.owner().eq(&dlp::id())
{
debug!(
"Undelegation completed for account: {pubkey}"
);
magicblock_metrics::metrics::inc_undelegation_completed();
}
⚠️ Potential issue | 🟠 Major

Fix undelegation completion detection logic.

During cloning we explicitly set delegated accounts to delegated = true and overwrite their owner with the delegation record’s owner (see the set_owner(...).set_delegated(is_delegated_to_us) path in this file). That means a still-delegated account lives in the bank with delegated() == true and owner() != dlp::id(). The new guard checks the exact opposite (!in_bank.delegated() and in_bank.owner() == dlp::id()), so it never trips for the happy path. As a result inc_undelegation_completed() will never run, making the new metric useless. Flip the predicate to look for an entry we previously considered delegated and an update whose owner is no longer dlp::id(), e.g.:

-        if !in_bank.delegated()
-            && in_bank.owner().eq(&dlp::id())
-            && !account.owner().eq(&dlp::id())
-        {
+        if in_bank.delegated()
+            && !account.delegated()
+            && !account.owner().eq(&dlp::id())
+        {

This fires exactly once on the first post-undelegation update and keeps the metric aligned with the new observability story.

🤖 Prompt for AI Agents
In magicblock-chainlink/src/chainlink/fetch_cloner.rs around lines 224 to 235,
the undelegation completion check is inverted and never matches the actual
cloned state; change the guard to detect an entry that was previously marked
delegated in the bank (in_bank.delegated() == true and in_bank.owner() ==
dlp::id()) and which is now updated so its owner is no longer dlp::id() (i.e.
!account.owner().eq(&dlp::id())), then run the debug log and
magicblock_metrics::metrics::inc_undelegation_completed() in that branch so the
metric fires once on the first post-undelegation update.
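The suggested predicate can be exercised in isolation. This sketch uses an illustrative struct and owner strings, not the real account types:

```rust
// Illustrative account shape; the real types come from the accounts bank.
struct Acc {
    delegated: bool,
    owner: &'static str,
}

const DLP: &str = "dlp-program-id";

// The corrected predicate from the suggestion above, over the toy types.
fn undelegation_completed(in_bank: &Acc, update: &Acc) -> bool {
    in_bank.delegated && !update.delegated && update.owner != DLP
}

fn main() {
    let delegated = Acc { delegated: true, owner: "user-program" };
    let undelegated = Acc { delegated: false, owner: "user-program" };

    // Fires on the first update after undelegation...
    assert!(undelegation_completed(&delegated, &undelegated));
    // ...not while the account is still delegated...
    assert!(!undelegation_completed(&delegated, &delegated));
    // ...and not again once the bank entry is no longer delegated.
    assert!(!undelegation_completed(&undelegated, &undelegated));
}
```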

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7880a7d and 9f8a0b2.

📒 Files selected for processing (1)
  • magicblock-ledger/src/ledger_truncator.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.
📚 Learning: 2025-10-28T13:15:42.706Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 596
File: magicblock-processor/src/scheduler.rs:1-1
Timestamp: 2025-10-28T13:15:42.706Z
Learning: In magicblock-processor, transaction indexes were always set to 0 even before the changes in PR #596. The proper transaction indexing within slots will be addressed during the planned ledger rewrite.

Applied to files:

  • magicblock-ledger/src/ledger_truncator.rs
📚 Learning: 2025-11-07T13:20:13.793Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: magicblock-processor/src/scheduler/coordinator.rs:227-238
Timestamp: 2025-11-07T13:20:13.793Z
Learning: In magicblock-processor's ExecutionCoordinator (scheduler/coordinator.rs), the `account_contention` HashMap intentionally does not call `shrink_to_fit()`. Maintaining slack capacity is beneficial for performance by avoiding frequent reallocations during high transaction throughput. As long as empty entries are removed from the map (which `clear_account_contention` does), the capacity overhead is acceptable.

Applied to files:

  • magicblock-ledger/src/ledger_truncator.rs
🔇 Additional comments (1)
magicblock-ledger/src/ledger_truncator.rs (1)

206-207: Good use of spawn_blocking to offload heavy deletion work.

Moving RocksDB deletions to a blocking task prevents blocking the Tokio async runtime, which improves responsiveness during large truncation operations.

Comment on lines 233 to 237
if let Err(err) = delete_handle.await {
error!("Ledger delete task cancelled: {err}");
}

Self::compact_slot_range(ledger, from_slot, to_slot).await;

⚠️ Potential issue | 🟠 Major

Consider skipping compaction if the deletion task fails.

If the blocking deletion task is cancelled or panics (line 233), the function logs the error but still proceeds to compact (line 237). This means compaction runs even when slot deletions may have failed or only partially completed, which could be wasteful or lead to unexpected ledger state.

Consider returning early if delete_handle.await fails, or at minimum document why compaction should proceed despite deletion failures.

Apply this diff to skip compaction when deletion fails:

-        if let Err(err) = delete_handle.await {
-            error!("Ledger delete task cancelled: {err}");
+        match delete_handle.await {
+            Ok(_) => {
+                Self::compact_slot_range(ledger, from_slot, to_slot).await;
+            }
+            Err(err) => {
+                error!("Ledger delete task cancelled, skipping compaction: {err}");
+                return;
+            }
         }
-
-        Self::compact_slot_range(ledger, from_slot, to_slot).await;
🤖 Prompt for AI Agents
In magicblock-ledger/src/ledger_truncator.rs around lines 233 to 237, the code
logs an error when delete_handle.await fails but still proceeds to call
Self::compact_slot_range(...); modify the flow to skip compaction when the
deletion task failed by returning early (or otherwise short-circuiting) after
logging the error so compaction only runs on successful deletion; ensure the
function returns or exits appropriately in the error branch and keep the
existing log message (or augment it) to preserve diagnostics.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 021ef0d and a8900e1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • Cargo.toml (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.
📚 Learning: 2025-10-26T16:54:39.084Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 587
File: test-manual/Cargo.toml:0-0
Timestamp: 2025-10-26T16:54:39.084Z
Learning: In the magicblock-validator repository, use git branch references (not commit hashes or tags) for the helius-laserstream dependency to allow automatic updates when the branch is pushed to.

Applied to files:

  • Cargo.toml
🔇 Additional comments (2)
Cargo.toml (2)

94-94: Confirm the log feature removal is intentional.

Removing features = ["release_max_level_info"] means debug and trace logs will now compile into release builds. This increases binary size and runtime logging overhead in production. Verify this aligns with the PR's metrics/logging objectives and the trade-off is acceptable.


94-232: Verification complete—no issues found.

The ureq = "2.9.6" dependency is correctly added in test-integration/Cargo.toml, with test-integration/test-tools/Cargo.toml referencing it as a workspace dependency. The url = "2.5.0" dependency is present in the root workspace Cargo.toml (line 218 in the provided snippet) and also declared in integration test modules. Both dependencies are properly positioned for integration test metrics fetching as described in the PR objectives.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37d6839 and c6f75fd.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • Cargo.toml (3 hunks)
  • magicblock-processor/src/executor/processing.rs (4 hunks)
  • test-integration/Cargo.toml (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 621
File: magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs:457-495
Timestamp: 2025-11-07T14:20:31.457Z
Learning: In magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs, the unsubscribe closure returned by PubSubConnection::account_subscribe(...) resolves to () (unit), not a Result. Downstream code should not attempt to inspect an unsubscribe result and can optionally wrap it in a timeout to guard against hangs.
📚 Learning: 2025-11-07T13:20:13.793Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: magicblock-processor/src/scheduler/coordinator.rs:227-238
Timestamp: 2025-11-07T13:20:13.793Z
Learning: In magicblock-processor's ExecutionCoordinator (scheduler/coordinator.rs), the `account_contention` HashMap intentionally does not call `shrink_to_fit()`. Maintaining slack capacity is beneficial for performance by avoiding frequent reallocations during high transaction throughput. As long as empty entries are removed from the map (which `clear_account_contention` does), the capacity overhead is acceptable.

Applied to files:

  • magicblock-processor/src/executor/processing.rs
📚 Learning: 2025-11-13T09:38:43.804Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 589
File: magicblock-processor/src/scheduler/locks.rs:64-102
Timestamp: 2025-11-13T09:38:43.804Z
Learning: In magicblock-processor's TransactionScheduler (scheduler/mod.rs line 59), the executor count is clamped to MAX_SVM_EXECUTORS (63) at initialization time, and executor IDs are assigned sequentially from 0 to count-1. This architectural guarantee ensures that executor IDs used in the bitmask-based AccountLock (scheduler/locks.rs) will always be within valid bounds for bit shifting operations, making runtime bounds checks unnecessary.

Applied to files:

  • magicblock-processor/src/executor/processing.rs
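A sketch of the invariant that learning relies on (the constant name matches; the lock helpers and mask layout are illustrative):

```rust
// Constant matches the learning; lock helpers are illustrative.
const MAX_SVM_EXECUTORS: u64 = 63;

fn lock(mask: u64, executor_id: u64) -> u64 {
    // Ids are assigned sequentially 0..count-1 with count clamped to
    // MAX_SVM_EXECUTORS, so the shift is always in range for a u64.
    debug_assert!(executor_id < MAX_SVM_EXECUTORS);
    mask | (1u64 << executor_id)
}

fn is_locked_by(mask: u64, executor_id: u64) -> bool {
    mask & (1u64 << executor_id) != 0
}

fn main() {
    let mask = lock(0, 5);
    assert!(is_locked_by(mask, 5));
    assert!(!is_locked_by(mask, 6));

    // The largest valid id (62) still shifts within u64 bounds.
    let mask = lock(mask, MAX_SVM_EXECUTORS - 1);
    assert!(is_locked_by(mask, 62));
}
```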
📚 Learning: 2025-10-21T14:00:54.642Z
Learnt from: bmuddha
Repo: magicblock-labs/magicblock-validator PR: 578
File: magicblock-aperture/src/requests/websocket/account_subscribe.rs:18-27
Timestamp: 2025-10-21T14:00:54.642Z
Learning: In magicblock-aperture account_subscribe handler (src/requests/websocket/account_subscribe.rs), the RpcAccountInfoConfig fields data_slice, commitment, and min_context_slot are currently ignored—only encoding is applied. This is tracked as technical debt in issue #579: https://github.com/magicblock-labs/magicblock-validator/issues/579

Applied to files:

  • magicblock-processor/src/executor/processing.rs
📚 Learning: 2025-10-26T16:54:39.084Z
Learnt from: thlorenz
Repo: magicblock-labs/magicblock-validator PR: 587
File: test-manual/Cargo.toml:0-0
Timestamp: 2025-10-26T16:54:39.084Z
Learning: In the magicblock-validator repository, use git branch references (not commit hashes or tags) for the helius-laserstream dependency to allow automatic updates when the branch is pushed to.

Applied to files:

  • Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run_make_ci_lint
  • GitHub Check: run_make_ci_test
  • GitHub Check: Build Project
🔇 Additional comments (6)
test-integration/Cargo.toml (2)

100-101: Approve: ureq and url dependencies for test metrics collection.

The addition of ureq 2.9.6 and url 2.5.0 as workspace dependencies is appropriate for the new metrics collection capability in test integration (as referenced in the PR summary for fetching monitored account counts). ureq 2.9.6 upgrades rustls to 0.22 and ring to 0.17, without introducing breaking changes. While ureq 2.9.6 is not the latest version (3.x or 2.12.x exist), it is suitable for test-only usage and maintains backward compatibility.


77-77: Use a stable released version or main branch of solana-account, not a feature branch.

The revision 8f7050a is on the feature branch bmuddha/fix/pre-cow-check rather than a stable release or main branch. The commit was made today, allowing no time for testing or validation. Feature branch dependencies introduce significant risk and should be reserved for development only—not production or CI/CD test suites.

Replace this with either:

  • A tagged release from the solana-account repository, or
  • The main branch revision if the fix is already merged

Lines affected: test-integration/Cargo.toml 77, 109

Cargo.toml (3)

94-94: Confirm intent: log feature removal enables runtime logging in release builds.

Removing the release_max_level_info feature from the log dependency (line 94) changes compile-time behavior. Libraries should avoid using the max level features because they're global and can't be changed once they're set. This change allows all log levels (debug, trace) to be available at runtime in release builds, rather than being stripped at compile time.

Confirm:

  • Is this intentional to support enhanced observability for the new reconnection and metrics features described in the PR?
  • Are there any performance or binary-size implications in production that should be documented?
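For reference, the compile-time cap is controlled by the log crate's Cargo features, so restoring the old behavior later is a one-line manifest change (illustrative fragment showing the two alternatives, not the repository's exact entry):

```toml
# Before: debug!/trace! call sites are stripped from release builds at compile time.
log = { version = "0.4", features = ["release_max_level_info"] }

# After this change: all levels are compiled in and filtered only at runtime
# via the configured logger, at a small cost in binary size and call overhead.
log = { version = "0.4" }
```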

218-218: Confirm url dependency is shared across workspaces.

The url = "2.5.0" dependency is already declared in the main workspace dependencies (line 218), so test-integration will correctly reuse it. This is consistent with the dependency management pattern and avoids duplication.


154-154: Solana-account revision 8f7050a is consistently referenced across all manifests and exists in the repository.

Verification confirms that revision 8f7050a is uniformly specified in:

  • Main Cargo.toml workspace dependencies (line 154)
  • Main Cargo.toml patch.crates-io (line 230)
  • test-integration/Cargo.toml (lines 77 and 109)

The revision exists in the solana-account repository on branch bmuddha/fix/pre-cow-check and is associated with PR #20. All references are consistent across manifests.

magicblock-processor/src/executor/processing.rs (1)

10-21: New imports align with downstream usage

Importing ReadableAccount for lamports() and TransactionError for the new gasless guard is consistent with later code; no issues here.

Comment on lines +315 to 333
// The first loaded account is always a feepayer, check
// whether we are running in privileged execution mode
let privileged = accounts
    .first()
    .map(|feepayer| feepayer.1.privileged())
    .unwrap_or_default();

for (pubkey, account) in accounts {
    // only persist account's update if it was actually modified, ignore
    // the rest, even if an account was writeable in the transaction.
    //
    // We also don't persist accounts that are empty, with an exception
    // for special cases, when those are inserted forcefully as placeholders
    // (for example by the chainlink), those cases can be distinguished from
    // others by the fact that such a transaction is always running in a
    // privileged mode.
    if !account.is_dirty() || (account.lamports() == 0 && !privileged) {
        continue;
    }

🧹 Nitpick | 🔵 Trivial

Account persistence filter depends on is_dirty and zero‑lamport semantics — please confirm invariants

The new gating logic:

  • Uses the first loaded account to infer a per‑transaction privileged mode, and
  • Skips persistence when !account.is_dirty() or when account.lamports() == 0 && !privileged.

This is a good direction to avoid unnecessary writes and to special‑case chainlink/placeholder flows, but there are a couple of subtle invariants to verify:

  1. Fee‑only / rollback path: In the ProcessedTransaction::FeesOnly branch, the single (feepayer, fee_payer_account) pair is now subject to !account.is_dirty(). You need fee_payer_account.is_dirty() to be reliably true whenever fees should be charged, otherwise the fee payer state will not be persisted (and the validator may under‑charge). Please confirm the implementation of RollbackAccounts::FeePayerOnly ensures this.

  2. Zero‑lamport account deletions: For non‑privileged transactions that intentionally close accounts by setting lamports to zero, the condition account.lamports() == 0 && !privileged means we will not call insert_account at all for those keys. That’s fine if AccountsDb treats “no insert” as the correct deletion semantics (e.g., it will already have removed the account or infer deletion from elsewhere), but if insert_account expects to see a zero‑lamport write to purge an existing account, this could leave stale state in the DB and omit “account closed” notifications on accounts_tx.

  3. Notifications vs. storage: Because the same if gates both insert_account and the accounts_tx notification, observers on accounts_tx will no longer see updates for non‑privileged zero‑lamport accounts. If external consumers rely on seeing a final update when an account is closed, you may need to differentiate “store or purge in AccountsDb” from “notify observers”.

If the above invariants all hold, the optimization looks good. Otherwise, you may want to adjust the conditions (e.g., always ensuring fee‑payer writes occur when fees are due, and explicitly modeling deletions in AccountsDb rather than skipping inserts).
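The gate itself can be factored into a small predicate, which makes the invariants above easy to exercise in isolation (a sketch only; `should_persist` and its flattened parameters are hypothetical, not the actual executor code, which reads these values off the loaded account):

```rust
/// Decide whether an account's post-execution state should be written back,
/// mirroring the gate in processing.rs: skip untouched accounts, and skip
/// empty (zero-lamport) accounts unless the transaction ran in privileged
/// mode (e.g. chainlink placeholder inserts).
fn should_persist(is_dirty: bool, lamports: u64, privileged: bool) -> bool {
    // Equivalent to negating the skip condition:
    // !is_dirty || (lamports == 0 && !privileged)
    is_dirty && (lamports > 0 || privileged)
}

fn main() {
    // Untouched account: never persisted.
    assert!(!should_persist(false, 100, false));
    // Dirty, funded account: persisted.
    assert!(should_persist(true, 100, false));
    // Dirty zero-lamport account in a normal tx: skipped —
    // this is exactly the account-closure case flagged above.
    assert!(!should_persist(true, 0, false));
    // Dirty zero-lamport placeholder in privileged mode: persisted.
    assert!(should_persist(true, 0, true));
    println!("ok");
}
```

Unit tests over such a predicate would pin down the fee-payer and zero-lamport cases without needing a full transaction pipeline.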

🤖 Prompt for AI Agents
In magicblock-processor/src/executor/processing.rs around lines 315 to 333, the
new persistence gate (skipping when !account.is_dirty() or account.lamports()==0
&& !privileged) can suppress necessary fee-payer writes and account-closure
semantics; ensure invariants by (1) guaranteeing fee-payer writes are always
persisted in the FeesOnly / RollbackAccounts::FeePayerOnly path — either by
ensuring fee_payer_account.is_dirty() is set when fees are deducted or by
unconditionally persisting the fee-payer when fees were charged, (2) handling
zero-lamport deletions explicitly rather than skipping insert_account when a
non-privileged tx sets lamports to zero — either call the AccountsDb
deletion/purge API or still call insert_account with the zero-lamports
representation if AccountsDb expects that to trigger removal, and (3) separating
storage vs notification semantics so observers still receive an accounts_tx
update for account closures even if physical storage is handled differently
(i.e., always emit the notification for closed accounts or add an explicit
notify-on-close step). Ensure these changes are covered by unit tests for
FeesOnly and zero-lamport close cases.
