@mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Oct 29, 2025

This commit introduces logic that prevents appends to the head chunk
while it is dispatched, i.e. while its content is being DMA written.

Previously the head chunk buffer might have been passed to the
`file::dma_write` call while appends could still write to that chunk.
This caused a problem on some filesystems and some hardware RAID
controllers, which manifested as corruption in the last page of a file
on disk.

This PR prevents the head chunk from being appended to while it is being
written to disk. The mechanism used here is copying the remainder, which
wasn't flushed and spans beyond the last page-aligned address, to a new
chunk. The remainder is copied only if the current head write was
dispatched. After copying, the chunk writer is requested to recycle the
old chunk and return it to the cache.
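
The mechanism described above can be sketched roughly as follows. This is a simplified, single-threaded toy model and not the actual `segment_appender` code: `toy_chunk`, `toy_appender`, `in_flight`, the 4 KiB page size and the 16 KiB chunk size are all assumptions made for illustration, and no real DMA write is issued.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Illustrative constants (assumptions, not Redpanda's configuration).
constexpr std::size_t toy_page_size = 4096;
constexpr std::size_t toy_chunk_size = 16384;

struct toy_chunk {
    std::vector<char> buf = std::vector<char>(toy_chunk_size);
    std::size_t pos = 0;         // next append offset
    std::size_t flushed_pos = 0; // bytes already handed to the writer

    // Start of the page that contains flushed_pos.
    std::size_t aligned_begin() const {
        return flushed_pos - (flushed_pos % toy_page_size);
    }
};

struct toy_appender {
    std::unique_ptr<toy_chunk> head = std::make_unique<toy_chunk>();
    // Chunks whose buffers are (conceptually) still being written.
    std::vector<std::unique_ptr<toy_chunk>> in_flight;
    bool head_dispatched = false;
    std::size_t bytes_copied = 0;

    // Stand-in for handing the head buffer to the disk write.
    void dispatch_write() {
        head->flushed_pos = head->pos;
        head_dispatched = true;
    }

    // Stand-in for write completion: old chunks can now be recycled.
    void write_completed() {
        in_flight.clear();
        head_dispatched = false;
    }

    void append(std::string_view data) {
        if (head_dispatched) {
            // The head buffer is in flight: copy the tail that starts at
            // the last page-aligned address into a fresh chunk and leave
            // the dispatched buffer untouched.
            auto next = std::make_unique<toy_chunk>();
            const std::size_t begin = head->aligned_begin();
            const std::size_t len = head->pos - begin;
            std::memcpy(next->buf.data(), head->buf.data() + begin, len);
            next->pos = len;
            next->flushed_pos = head->flushed_pos - begin;
            bytes_copied += len;
            in_flight.push_back(std::move(head));
            head = std::move(next);
            head_dispatched = false;
        }
        assert(head->pos + data.size() <= toy_chunk_size);
        std::memcpy(head->buf.data() + head->pos, data.data(), data.size());
        head->pos += data.size();
    }
};
```

In this sketch an append that arrives after `dispatch_write()` never touches the dispatched buffer; it lands in the new head, and the old chunk is released only in `write_completed()`.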

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Improvements

  • potentially better behavior when working with some hardware RAID controllers and Portworx

@mmaslankaprv
Member Author

/dt

@mmaslankaprv mmaslankaprv force-pushed the segment-appender-rework-2 branch from d3756c9 to 180b333 Compare October 29, 2025 15:37
@mmaslankaprv
Member Author

/dt

@vbotbuildovich
Collaborator

Retry command for Build#75218

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

@vbotbuildovich
Collaborator

vbotbuildovich commented Oct 29, 2025

CI test results

test results on build#75218
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| log_segment_appender_test | test_concurrent_append_flush | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75218#019a30d9-5c5d-4902-bd55-798236739cdc | FLAKY | 11/21 | upstream reliability is '100.0'. current run reliability is '52.38095238095239'. drift is 47.61905 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75218#019a30d9-5c5f-4731-97bf-e30df1d19f23 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| ScalingUpTest | test_fast_node_addition | null | integration | https://buildkite.com/redpanda/redpanda/builds/75218#019a30d9-5c5c-40e6-bb43-867f2f1fef4a | FLAKY | 19/21 | upstream reliability is '99.57627118644068'. current run reliability is '90.47619047619048'. drift is 9.10008 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_fast_node_addition |
| src/v/cloud_storage/tests/cloud_storage_e2e_test | src/v/cloud_storage/tests/cloud_storage_e2e_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cloud_storage/tests/topic_recovery_service_test | src/v/cloud_storage/tests/topic_recovery_service_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cloud_topics/level_one/metastore/tests/replicated_metastore_test | src/v/cloud_topics/level_one/metastore/tests/replicated_metastore_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cluster/archival/tests/ntp_archiver_reupload_test | src/v/cluster/archival/tests/ntp_archiver_reupload_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cluster/archival/tests/ntp_archiver_test | src/v/cluster/archival/tests/ntp_archiver_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cluster/cloud_metadata/tests/controller_snapshot_reconciliation_test | src/v/cluster/cloud_metadata/tests/controller_snapshot_reconciliation_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cluster/cloud_metadata/tests/offsets_lookup_test | src/v/cluster/cloud_metadata/tests/offsets_lookup_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cluster/cloud_metadata/tests/uploader_test | src/v/cluster/cloud_metadata/tests/uploader_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/cluster/tests/controller_state_test | src/v/cluster/tests/controller_state_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/datalake/tests/translator_fixture_test | src/v/datalake/tests/translator_fixture_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/kafka/client/test/cluster_test | src/v/kafka/client/test/cluster_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/kafka/server/tests/alter_config_test | src/v/kafka/server/tests/alter_config_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/kafka/server/tests/alter_user_scram_credentials_test | src/v/kafka/server/tests/alter_user_scram_credentials_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/kafka/server/tests/consumer_groups_test | src/v/kafka/server/tests/consumer_groups_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/kafka/server/tests/kafka_fetch_plan_rpbench_test | src/v/kafka/server/tests/kafka_fetch_plan_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/kafka/server/tests/write_at_offset_stm_test | src/v/kafka/server/tests/write_at_offset_stm_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/pandaproxy/schema_registry/test/api_rpbench_test | src/v/pandaproxy/schema_registry/test/api_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/raft/tests/basic_raft_fixture_test | src/v/raft/tests/basic_raft_fixture_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/raft/tests/raft_replicate_rpbench_test | src/v/raft/tests/raft_replicate_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/security/audit/tests/audit_test | src/v/security/audit/tests/audit_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/storage/tests/compaction_e2e_test | src/v/storage/tests/compaction_e2e_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/storage/tests/segment_appender_rpbench_test | src/v/storage/tests/segment_appender_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/storage/tests/segment_appender_test | src/v/storage/tests/segment_appender_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
| src/v/storage/tests/storage_e2e_fixture_test | src/v/storage/tests/storage_e2e_fixture_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75218#019a309f-5e49-458f-9734-29dc7d1c5fba | FAIL | 0/1 | | |
test results on build#75299
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EndToEndCloudTopicsTxTest | test_write | null | integration | https://buildkite.com/redpanda/redpanda/builds/75299#019a3471-576c-442c-9853-6c2cae3180b3 | FLAKY | 17/21 | upstream reliability is '93.40974212034384'. current run reliability is '80.95238095238095'. drift is 12.45736 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndCloudTopicsTxTest&test_method=test_write |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75299#019a3471-576f-4113-98c4-af69903f811d | FLAKY | 19/21 | upstream reliability is '84.4017094017094'. current run reliability is '90.47619047619048'. drift is -6.07448 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
| src/v/storage/tests/segment_appender_rpbench_test | src/v/storage/tests/segment_appender_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/75299#019a3438-9d32-47e4-9cfe-dbfbe91e6a02 | FAIL | 0/1 | | |
test results on build#75472
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AuditLogTestKafkaApi | test_no_auth_enabled | {"audit_transport_mode": "rpc"} | integration | https://buildkite.com/redpanda/redpanda/builds/75472#019a499c-2616-4c84-b026-3240432cad91 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_no_auth_enabled |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75472#019a499c-2609-41a5-b4f3-006930ca3852 | FLAKY | 20/21 | upstream reliability is '97.9689366786141'. current run reliability is '95.23809523809523'. drift is 2.73084 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| ShadowLinkingReplicationTests | test_replication_timestamps_match | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}, "timestamp_type": "CreateTime"} | integration | https://buildkite.com/redpanda/redpanda/builds/75472#019a4995-ad94-441c-a99f-844c11a03f4e | FLAKY | 20/21 | upstream reliability is '95.54794520547945'. current run reliability is '95.23809523809523'. drift is 0.30985 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_timestamps_match |
| ShadowLinkingReplicationTests | test_replication_timestamps_match | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}, "timestamp_type": "CreateTime"} | integration | https://buildkite.com/redpanda/redpanda/builds/75472#019a499c-260a-4486-a755-c63906aa85ab | FLAKY | 20/21 | upstream reliability is '95.54794520547945'. current run reliability is '95.23809523809523'. drift is 0.30985 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_timestamps_match |
| QuickTerminateTest | test_terminate | null | integration | https://buildkite.com/redpanda/redpanda/builds/75472#019a499c-2619-463a-bf8b-1af290c65052 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=QuickTerminateTest&test_method=test_terminate |

@mmaslankaprv mmaslankaprv force-pushed the segment-appender-rework-2 branch 2 times, most recently from 98d8d0d to 8d9bf0c Compare October 30, 2025 08:23
@mmaslankaprv
Member Author

/dt

@mmaslankaprv
Member Author

/dt

Added a method that copies the remainder that spans from the last
page-aligned address to the end of the buffer. The method copies the data
and sets the chunk positions to make sure the number of unflushed bytes
is preserved.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
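
To illustrate the position bookkeeping described in this commit message, here is a minimal sketch of the arithmetic only. It assumes a 4 KiB page and assumes that the "last page-aligned address" is the flushed position rounded down to a page boundary; `chunk_positions`, `align_down_to_page`, `unflushed_bytes` and `rebase_to_new_chunk` are hypothetical names, not the real chunk API.

```cpp
#include <cassert>
#include <cstddef>

// Assumed page size, for illustration only.
constexpr std::size_t sketch_page_size = 4096;

// Round an offset down to the nearest page boundary.
constexpr std::size_t align_down_to_page(std::size_t offset) {
    return offset - (offset % sketch_page_size);
}

struct chunk_positions {
    std::size_t pos;         // next append offset within the chunk
    std::size_t flushed_pos; // offset up to which data was flushed
};

// Number of bytes appended but not yet flushed.
constexpr std::size_t unflushed_bytes(chunk_positions p) {
    return p.pos - p.flushed_pos;
}

// Positions of the destination chunk after the remainder is copied:
// both offsets are shifted so that the source's last page-aligned
// address becomes offset 0 in the new chunk.
inline chunk_positions rebase_to_new_chunk(chunk_positions src) {
    assert(src.flushed_pos <= src.pos);
    const std::size_t begin = align_down_to_page(src.flushed_pos);
    return chunk_positions{src.pos - begin, src.flushed_pos - begin};
}
```

For example, with `pos` at 10 KiB and `flushed_pos` at 9 KiB the remainder starts at 8 KiB, so the new chunk ends up with `pos` at 2 KiB and `flushed_pos` at 1 KiB, and 1 KiB is unflushed in both chunks.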
This commit introduces logic that prevents appends to the head chunk
while it is dispatched, i.e. while its content is being DMA written.

Previously the head chunk buffer might have been passed to the
`file::dma_write` call while appends could still write to that chunk.
This caused a problem on some filesystems and some hardware RAID
controllers, which manifested as corruption in the last page of a file
on disk.

This PR prevents the head chunk from being appended to while it is being
written to disk. The mechanism used here is copying the remainder, which
wasn't flushed and spans beyond the last page-aligned address, to a new
chunk. The remainder is copied only if the current head write was
dispatched. After copying, the chunk writer is requested to recycle the
old chunk and return it to the cache.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
Sometimes the benchmark takes too long to execute. Increase the timeout
to eliminate intermittent failures.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv force-pushed the segment-appender-rework-2 branch from c6c178a to b10ea9b Compare November 3, 2025 11:00
@mmaslankaprv mmaslankaprv marked this pull request as ready for review November 3, 2025 11:02
Copilot AI review requested due to automatic review settings November 3, 2025 11:02
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a mechanism to prevent the segment appender from writing to the head chunk buffer while it's being DMA written to disk. This addresses data corruption issues that could occur on certain filesystems and hardware RAID controllers when concurrent appends modified data that was already dispatched for writing.

Key changes:

  • Prevents concurrent modification of dispatched head chunks by copying unwritten data to a new chunk
  • Adds tracking for bytes copied in chunk remainders to detect when the prevention mechanism is active
  • Introduces comprehensive tests to verify the fix works correctly under various concurrent scenarios

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/v/storage/segment_appender.cc | Core logic to detect dispatched writes and copy chunk remainders to prevent concurrent modification |
| src/v/storage/segment_appender.h | API changes including new stats tracking and chunk dispatch detection methods |
| src/v/storage/segment_appender_chunk.h | New `copy_reminder_from` method to safely copy unwritten data between chunks |
| src/v/storage/tests/segment_appender_test.cc | Comprehensive test coverage for concurrent flush scenarios and chunk copying behavior |
| src/v/storage/tests/BUILD | Test timeout configuration update |

* pending_aligned_begin() i.e. it is the part that overflows last full page
* boundary.
*
* This method preservers the position and flushed position relative to the

Copilot AI Nov 3, 2025

Corrected spelling of 'preservers' to 'preserves'.

Suggested change
- * This method preservers the position and flushed position relative to the
+ * This method preserves the position and flushed position relative to the

*
* This method preservers the position and flushed position relative to the
* pending_aligned_begin()
* IMPORTANT: this method will reset the chunk conent before copying the

Copilot AI Nov 3, 2025

Corrected spelling of 'conent' to 'content'.

Suggested change
- * IMPORTANT: this method will reset the chunk conent before copying the
+ * IMPORTANT: this method will reset the chunk content before copying the

Comment on lines +241 to +243
// true if the write extends to the end of the chunk or current write is
// the last one using current chunk i.e., this is the last write that
// will use the current chunk before it is recycled

Copilot AI Nov 3, 2025

The comment should be updated to reflect the new field name. The field was renamed from 'full' to 'last_write_to_current_chunk' but the comment still references the old meaning.

Suggested change
- // true if the write extends to the end of the chunk or current write is
- // the last one using current chunk i.e., this is the last write that
- // will use the current chunk before it is recycled
+ // true if this is the last write that will use the current chunk before
+ // it is recycled. This does not necessarily mean the write extends to
+ // the end of the chunk.


chunk_1.flush();
auto chunk_3 = make_chunk(16_KiB);
// only last 2 KiB should be

Copilot AI Nov 3, 2025

The comment on line 323 is incomplete. It should specify what the 'only last 2 KiB should be' refers to (e.g., 'only last 2 KiB should be copied').

Suggested change
- // only last 2 KiB should be
+ // only last 2 KiB should be copied to chunk_3 from chunk_1 after flush

@vbotbuildovich
Collaborator

Retry command for Build#75472

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/quick_terminate_test.py::QuickTerminateTest.test_terminate
