
Conversation

@nikhilsinhaparseable (Contributor) commented on Nov 11, 2025

Instead of using file_size from the manifest, which is the size of the JSON, we should use ingestion_size, which is the compressed size.

Summary by CodeRabbit

  • Bug Fixes
    • Metrics now use compressed data size instead of uncompressed file size for table operations and billing, improving accuracy of reported bytes scanned.

coderabbitai bot (Contributor) commented on Nov 11, 2025

Walkthrough

The StandardTableProvider::partitioned_files implementation now drops file_size and instead computes per-file compressed bytes by summing column.compressed_size, accumulating into total_compressed_size. Billing metrics (e.g., increment_bytes_scanned_in_query_by_date) are updated to use total_compressed_size.
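
For orientation, here is a minimal sketch of the aggregation pattern described above, using a stand-in Column type (the real struct lives in src/catalog/column.rs and carries more fields); this is an illustration under those assumptions, not the actual provider code:

// Stand-in for the manifest column metadata referenced in this PR; the real
// Column struct lives in src/catalog/column.rs.
struct Column {
    compressed_size: u64,
}

// Per-file compressed bytes: the sum of each column's compressed size.
fn file_compressed_bytes(columns: &[Column]) -> u64 {
    columns.iter().map(|col| col.compressed_size).sum()
}

fn main() {
    let columns = vec![
        Column { compressed_size: 1_024 },
        Column { compressed_size: 2_048 },
    ];
    // Accumulate into the running total that feeds the billing metric.
    let mut total_compressed_size: u64 = 0;
    total_compressed_size += file_compressed_bytes(&columns);
    assert_eq!(total_compressed_size, 3_072);
}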

Changes

Cohort: Compressed size aggregation
File(s): src/query/stream_schema_provider.rs
Summary: Replaced file_size usage with per-column compressed_size summation; removed file_size from the destructuring; accumulated total_compressed_size and switched the billing increment to use total_compressed_size instead of total_file_size.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify column.compressed_size is present and populated for all file types processed here.
  • Confirm total_compressed_size is the correct metric for billing/bytes-scanned semantics.
  • Check for other codepaths or docs still referencing file_size or total_file_size.
  • Run/inspect existing tests around metrics and partitioned file accounting.

Poem

A rabbit counts what storage keeps,
Summing columns where compression sleeps,
Dropped the old size, embraced the compressed,
Hopping through bytes now fairly addressed. 🐇📦

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: The description is minimal and lacks the required template structure, including rationale, key changes detail, testing confirmation, and documentation checklist items. Resolution: expand the description to follow the template: add a detailed rationale for using ingestion_size over file_size, explain the impact, and confirm that the testing and documentation checklist items have been completed.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title 'fix: bytes scanned in query' clearly and concisely describes the main change: correcting how bytes scanned are calculated in queries.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot previously approved these changes on Nov 11, 2025

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/query/stream_schema_provider.rs (1)

337-365: Critical compilation error: file.file_size is inaccessible after destructuring.

At lines 338-343, the File struct is destructured and the file variable is moved. The .. pattern drops the file_size field. Then at line 365, the code attempts to access file.file_size, which will fail to compile because file has been consumed.

To fix this, capture file_size in the destructuring pattern:

 let File {
     mut file_path,
     num_rows,
     columns,
+    file_size,
     ..
 } = file;

Alternatively, if PartitionedFile should also use compressed bytes (consistent with the billing metrics change), calculate compressed_bytes earlier and use it:

 let File {
     mut file_path,
     num_rows,
     columns,
     ..
 } = file;

 // Track billing metrics for files scanned in query
 file_count += 1;
 // Calculate actual compressed bytes that will be read from storage
 let compressed_bytes: u64 = columns.iter().map(|col| col.compressed_size).sum();
 total_compressed_size += compressed_bytes;

 // ... (Windows path handling code) ...
 
-let pf = PartitionedFile::new(file_path, file.file_size);
+let pf = PartitionedFile::new(file_path, compressed_bytes);

Which approach to use depends on whether DataFusion's PartitionedFile sizing should reflect compressed or uncompressed bytes. Since the PR aims to use compressed size for accuracy, the second approach may be more consistent.

🧹 Nitpick comments (1)
src/query/stream_schema_provider.rs (1)

347-349: Consider defensive handling if compressed_size could be zero or missing.

If there's any possibility that column.compressed_size could be 0 (e.g., empty columns, metadata not populated), the billing metrics might underreport bytes scanned. Consider adding validation or a fallback.

For example:

let compressed_bytes: u64 = columns.iter().map(|col| col.compressed_size).sum();
if compressed_bytes == 0 {
    tracing::warn!(
        "Compressed size is 0 for file {} in stream {}. This may indicate missing metadata.",
        file_path, self.stream
    );
}
total_compressed_size += compressed_bytes;

However, only add this if zero compressed_size is a realistic scenario in your system. If compressed_size is always populated correctly, this check is unnecessary.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31f0062 and 1c7f996.

📒 Files selected for processing (1)
  • src/query/stream_schema_provider.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-25T01:31:41.786Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metadata.rs:63-68
Timestamp: 2025-08-25T01:31:41.786Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metadata.rs and src/storage/object_storage.rs are designed to track total events across all streams, not per-stream. They use labels [origin, parsed_date] to aggregate by format and date, while per-stream metrics use [stream_name, origin, parsed_date] labels.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-25T01:32:25.980Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:163-173
Timestamp: 2025-08-25T01:32:25.980Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metrics/mod.rs are intentionally designed to track global totals across all streams for a given date, using labels ["format", "date"] rather than per-stream labels. This is the correct design for global aggregation purposes.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.

Applied to files:

  • src/query/stream_schema_provider.rs
🧬 Code graph analysis (1)
src/query/stream_schema_provider.rs (1)
src/metrics/mod.rs (1)
  • increment_bytes_scanned_in_query_by_date (563-567)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/query/stream_schema_provider.rs (2)

407-411: Good: Billing metrics now use compressed size for accuracy.

The change to track total_compressed_size instead of total_file_size for the increment_bytes_scanned_in_query_by_date metric aligns with the PR objective. The comment clearly explains the rationale—compressed size represents actual bytes read from object storage, which is what cloud providers charge for.

However, this assumes the critical compilation error at line 365 is resolved and that the compressed bytes calculation is verified.
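
For illustration only, a hedged sketch of how the accumulated total feeds the metric; the real increment_bytes_scanned_in_query_by_date lives in src/metrics/mod.rs (lines 563-567 per the code graph below), and its actual parameter list is not shown in this review, so the signature used here is an assumption:

// Hypothetical stand-in for the real metric helper in src/metrics/mod.rs;
// the parameter list is assumed for this sketch only.
fn increment_bytes_scanned_in_query_by_date(stream: &str, date: &str, bytes: u64) {
    // The real implementation updates a counter labeled by stream and date.
    let _ = (stream, date, bytes);
}

fn main() {
    // Running total accumulated while iterating the partitioned files.
    let total_compressed_size: u64 = 3_072;
    // Report compressed bytes, i.e. what object storage actually serves.
    increment_bytes_scanned_in_query_by_date("example-stream", "2025-11-11", total_compressed_size);
}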


347-349: Implementation verified - no issues found.

The Column struct correctly has a compressed_size field (defined in src/catalog/column.rs:133), which is populated from parquet metadata and properly summed across all columns for billing calculations. The sum of per-column compressed sizes represents the file's total compressed ingestion size, which aligns with the PR's intent to use ingestion_size instead of file_size. No edge cases identified—empty column lists correctly yield 0.

nikhilsinhaparseable merged commit 7adda55 into parseablehq:main on Nov 18, 2025. 13 checks passed.