fix: bytes scanned in query #1464
Conversation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Instead of using `file_size` from the manifest -- which is the size of the JSON -- we should use `ingestion_size`, which is the compressed size.
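A minimal sketch of that idea, with hypothetical `File`/`Column` stand-ins (not the real structs in src/catalog): the value fed to the bytes-scanned metric becomes the sum of per-column compressed sizes rather than the manifest's `file_size`.

```rust
// Hypothetical, simplified stand-ins for the manifest types; field names
// follow the PR discussion, not the actual src/catalog definitions.
struct Column {
    compressed_size: u64,
}

struct File {
    // Per the PR description, this reflects a JSON size, not billed bytes.
    file_size: u64,
    columns: Vec<Column>,
}

/// Bytes actually read from object storage for one file: the sum of the
/// per-column compressed sizes.
fn bytes_scanned(file: &File) -> u64 {
    file.columns.iter().map(|c| c.compressed_size).sum()
}
```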
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/query/stream_schema_provider.rs (1)
337-365: Critical compilation error: `file.file_size` is inaccessible after destructuring.

At lines 338-343, the `File` struct is destructured and the `file` variable is moved. The `..` pattern drops the `file_size` field. Then at line 365, the code attempts to access `file.file_size`, which will fail to compile because `file` has been consumed.

To fix this, capture `file_size` in the destructuring pattern:

```diff
 let File {
     mut file_path,
     num_rows,
     columns,
+    file_size,
     ..
 } = file;
```

Alternatively, if `PartitionedFile` should also use compressed bytes (consistent with the billing metrics change), calculate `compressed_bytes` earlier and use it:

```diff
 let File {
     mut file_path,
     num_rows,
     columns,
     ..
 } = file;

 // Track billing metrics for files scanned in query
 file_count += 1;
 // Calculate actual compressed bytes that will be read from storage
 let compressed_bytes: u64 = columns.iter().map(|col| col.compressed_size).sum();
 total_compressed_size += compressed_bytes;

 // ... (Windows path handling code) ...

-let pf = PartitionedFile::new(file_path, file.file_size);
+let pf = PartitionedFile::new(file_path, compressed_bytes);
```

Which approach to use depends on whether DataFusion's `PartitionedFile` sizing should reflect compressed or uncompressed bytes. Since the PR aims to use compressed size for accuracy, the second approach may be more consistent.
🧹 Nitpick comments (1)
src/query/stream_schema_provider.rs (1)
347-349: Consider defensive handling if `compressed_size` could be zero or missing.

If there's any possibility that `column.compressed_size` could be 0 (e.g., empty columns, metadata not populated), the billing metrics might underreport bytes scanned. Consider adding validation or a fallback. For example:

```rust
let compressed_bytes: u64 = columns.iter().map(|col| col.compressed_size).sum();
if compressed_bytes == 0 {
    tracing::warn!(
        "Compressed size is 0 for file {} in stream {}. This may indicate missing metadata.",
        file_path,
        self.stream
    );
}
total_compressed_size += compressed_bytes;
```

However, only add this if zero `compressed_size` is a realistic scenario in your system. If `compressed_size` is always populated correctly, this check is unnecessary.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/query/stream_schema_provider.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-25T01:31:41.786Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metadata.rs:63-68
Timestamp: 2025-08-25T01:31:41.786Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metadata.rs and src/storage/object_storage.rs are designed to track total events across all streams, not per-stream. They use labels [origin, parsed_date] to aggregate by format and date, while per-stream metrics use [stream_name, origin, parsed_date] labels.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-25T01:32:25.980Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:163-173
Timestamp: 2025-08-25T01:32:25.980Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metrics/mod.rs are intentionally designed to track global totals across all streams for a given date, using labels ["format", "date"] rather than per-stream labels. This is the correct design for global aggregation purposes.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.
Applied to files:
src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.
Applied to files:
src/query/stream_schema_provider.rs
🧬 Code graph analysis (1)
src/query/stream_schema_provider.rs (1)
src/metrics/mod.rs (1)
increment_bytes_scanned_in_query_by_date (563-567)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/query/stream_schema_provider.rs (2)
407-411: Good: Billing metrics now use compressed size for accuracy.

The change to track `total_compressed_size` instead of `total_file_size` for the `increment_bytes_scanned_in_query_by_date` metric aligns with the PR objective. The comment clearly explains the rationale: compressed size represents actual bytes read from object storage, which is what cloud providers charge for.

However, this assumes the critical compilation error at line 365 is resolved and that the compressed bytes calculation is verified.
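For illustration only: the actual metric and its helper live in src/metrics/mod.rs and their exact definitions are not shown here, but a date-labelled counter of this kind is typically built on the prometheus crate roughly as follows (the metric name and labels below are assumptions).

```rust
use prometheus::{IntCounterVec, Opts, Registry};

// Hypothetical stand-in for the real metric; only the increment pattern matters.
fn record_bytes_scanned() -> prometheus::Result<()> {
    let registry = Registry::new();
    let bytes_scanned = IntCounterVec::new(
        Opts::new("bytes_scanned_in_query", "Compressed bytes scanned by queries"),
        &["date"],
    )?;
    registry.register(Box::new(bytes_scanned.clone()))?;

    // Roughly what increment_bytes_scanned_in_query_by_date(...) would do with
    // the total_compressed_size accumulated while planning the query.
    let total_compressed_size: u64 = 42_000;
    bytes_scanned
        .with_label_values(&["2025-09-18"])
        .inc_by(total_compressed_size);
    Ok(())
}
```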
347-349: Implementation verified - no issues found.

The `Column` struct correctly has a `compressed_size` field (defined in src/catalog/column.rs:133), which is populated from parquet metadata and properly summed across all columns for billing calculations. The sum of per-column compressed sizes represents the file's total compressed ingestion size, which aligns with the PR's intent to use `ingestion_size` instead of `file_size`. No edge cases identified; empty column lists correctly yield 0.
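As a quick sanity check of the edge case noted above, a small sketch (again with a hypothetical minimal `Column`, not the one in src/catalog/column.rs) showing that an empty column list sums to 0:

```rust
// Hypothetical minimal stand-in for the catalog's Column type.
struct Column {
    compressed_size: u64,
}

fn total_compressed(columns: &[Column]) -> u64 {
    columns.iter().map(|c| c.compressed_size).sum()
}

fn main() {
    let cols = [
        Column { compressed_size: 1_024 },
        Column { compressed_size: 2_048 },
    ];
    assert_eq!(total_compressed(&cols), 3_072);
    // An empty slice sums to 0, matching the "no edge cases" observation above.
    assert_eq!(total_compressed(&[]), 0);
}
```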