
Conversation

@nikhilsinhaparseable (Contributor) commented on Nov 11, 2025

Instead of using file_size from the manifest, which is the size of the JSON, we should use ingestion_size, which is the compressed size.

Summary by CodeRabbit

  • Bug Fixes
    • Metrics now use compressed data size instead of uncompressed file size for table operations and billing, improving accuracy of reported bytes scanned.

coderabbitai bot (Contributor) commented on Nov 11, 2025

Walkthrough

The StandardTableProvider::partitioned_files implementation now drops file_size and instead computes per-file compressed bytes by summing column.compressed_size, accumulating into total_compressed_size. Billing metrics (e.g., increment_bytes_scanned_in_query_by_date) are updated to use total_compressed_size.
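
For orientation, here is a minimal sketch of the aggregation pattern described above, using a stand-in Column type (the real struct lives in src/catalog/column.rs and carries more fields); this is an illustration under those assumptions, not the actual provider code:

// Stand-in for the manifest column metadata referenced in this PR; the real
// Column struct lives in src/catalog/column.rs.
struct Column {
    compressed_size: u64,
}

// Per-file compressed bytes: the sum of each column's compressed size.
fn file_compressed_bytes(columns: &[Column]) -> u64 {
    columns.iter().map(|col| col.compressed_size).sum()
}

fn main() {
    let columns = vec![
        Column { compressed_size: 1_024 },
        Column { compressed_size: 2_048 },
    ];
    // Accumulate into the running total that feeds the billing metric.
    let mut total_compressed_size: u64 = 0;
    total_compressed_size += file_compressed_bytes(&columns);
    assert_eq!(total_compressed_size, 3_072);
}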

Changes

Cohort: Compressed size aggregation
File(s): src/query/stream_schema_provider.rs
Summary: Replaced file_size usage with per-column compressed_size summation; removed file_size from the destructuring; accumulated total_compressed_size and switched the billing increment to use total_compressed_size instead of total_file_size.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify column.compressed_size is present and populated for all file types processed here.
  • Confirm total_compressed_size is the correct metric for billing/bytes-scanned semantics.
  • Check for other codepaths or docs still referencing file_size or total_file_size.
  • Run/inspect existing tests around metrics and partitioned file accounting.

Poem

A rabbit counts what storage keeps,
Summing columns where compression sleeps,
Dropped the old size, embraced the compressed,
Hopping through bytes now fairly addressed. 🐇📦

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: The description is minimal and lacks the required template structure, including rationale, key changes detail, testing confirmation, and documentation checklist items. Resolution: expand the description to follow the template: add a detailed rationale for using ingestion_size over file_size, explain the impact, and confirm that the testing and documentation checklist items have been completed.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title 'fix: bytes scanned in query' clearly and concisely describes the main change: correcting how bytes scanned are calculated in queries.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot previously approved these changes on Nov 11, 2025

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/query/stream_schema_provider.rs (1)

337-365: Critical compilation error: file.file_size is inaccessible after destructuring.

At lines 338-343, the File struct is destructured and the file variable is moved. The .. pattern drops the file_size field. Then at line 365, the code attempts to access file.file_size, which will fail to compile because file has been consumed.

To fix this, capture file_size in the destructuring pattern:

 let File {
     mut file_path,
     num_rows,
     columns,
+    file_size,
     ..
 } = file;

Alternatively, if PartitionedFile should also use compressed bytes (consistent with the billing metrics change), calculate compressed_bytes earlier and use it:

 let File {
     mut file_path,
     num_rows,
     columns,
     ..
 } = file;

 // Track billing metrics for files scanned in query
 file_count += 1;
 // Calculate actual compressed bytes that will be read from storage
 let compressed_bytes: u64 = columns.iter().map(|col| col.compressed_size).sum();
 total_compressed_size += compressed_bytes;

 // ... (Windows path handling code) ...
 
-let pf = PartitionedFile::new(file_path, file.file_size);
+let pf = PartitionedFile::new(file_path, compressed_bytes);

Which approach to use depends on whether DataFusion's PartitionedFile sizing should reflect compressed or uncompressed bytes. Since the PR aims to use compressed size for accuracy, the second approach may be more consistent.

🧹 Nitpick comments (1)
src/query/stream_schema_provider.rs (1)

347-349: Consider defensive handling if compressed_size could be zero or missing.

If there's any possibility that column.compressed_size could be 0 (e.g., empty columns, metadata not populated), the billing metrics might underreport bytes scanned. Consider adding validation or a fallback.

For example:

let compressed_bytes: u64 = columns.iter().map(|col| col.compressed_size).sum();
if compressed_bytes == 0 {
    tracing::warn!(
        "Compressed size is 0 for file {} in stream {}. This may indicate missing metadata.",
        file_path, self.stream
    );
}
total_compressed_size += compressed_bytes;

However, only add this if zero compressed_size is a realistic scenario in your system. If compressed_size is always populated correctly, this check is unnecessary.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31f0062 and 1c7f996.

📒 Files selected for processing (1)
  • src/query/stream_schema_provider.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-25T01:31:41.786Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metadata.rs:63-68
Timestamp: 2025-08-25T01:31:41.786Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metadata.rs and src/storage/object_storage.rs are designed to track total events across all streams, not per-stream. They use labels [origin, parsed_date] to aggregate by format and date, while per-stream metrics use [stream_name, origin, parsed_date] labels.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-25T01:32:25.980Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:163-173
Timestamp: 2025-08-25T01:32:25.980Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metrics/mod.rs are intentionally designed to track global totals across all streams for a given date, using labels ["format", "date"] rather than per-stream labels. This is the correct design for global aggregation purposes.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.

Applied to files:

  • src/query/stream_schema_provider.rs
🧬 Code graph analysis (1)
src/query/stream_schema_provider.rs (1)
src/metrics/mod.rs (1)
  • increment_bytes_scanned_in_query_by_date (563-567)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/query/stream_schema_provider.rs (2)

407-411: Good: Billing metrics now use compressed size for accuracy.

The change to track total_compressed_size instead of total_file_size for the increment_bytes_scanned_in_query_by_date metric aligns with the PR objective. The comment clearly explains the rationale—compressed size represents actual bytes read from object storage, which is what cloud providers charge for.

However, this assumes the critical compilation error at line 365 is resolved and that the compressed bytes calculation is verified.
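
For illustration only, a hedged sketch of how the accumulated total feeds the metric; the real increment_bytes_scanned_in_query_by_date lives in src/metrics/mod.rs (lines 563-567 per the code graph below), and its actual parameter list is not shown in this review, so the signature used here is an assumption:

// Hypothetical stand-in for the real metric helper in src/metrics/mod.rs;
// the parameter list is assumed for this sketch only.
fn increment_bytes_scanned_in_query_by_date(stream: &str, date: &str, bytes: u64) {
    // The real implementation updates a counter labeled by stream and date.
    let _ = (stream, date, bytes);
}

fn main() {
    // Running total accumulated while iterating the partitioned files.
    let total_compressed_size: u64 = 3_072;
    // Report compressed bytes, i.e. what object storage actually serves.
    increment_bytes_scanned_in_query_by_date("example-stream", "2025-11-11", total_compressed_size);
}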


347-349: Implementation verified - no issues found.

The Column struct correctly has a compressed_size field (defined in src/catalog/column.rs:133), which is populated from parquet metadata and properly summed across all columns for billing calculations. The sum of per-column compressed sizes represents the file's total compressed ingestion size, which aligns with the PR's intent to use ingestion_size instead of file_size. No edge cases identified—empty column lists correctly yield 0.

nikhilsinhaparseable merged commit 7adda55 into parseablehq:main on Nov 18, 2025. 13 checks passed.