@hardcoretime hardcoretime commented Nov 8, 2025

Description

Why do we need it, and what problem does it solve?

When the virtual disk snapshot controller creates a batch of snapshots with required consistency, a race condition between the filesystem freeze and unfreeze operations can occasionally produce an inconsistent snapshot. To prevent this, the KubeVirt virtual machine instance is now annotated with a "freeze" or "unfreeze" request, and the service checks that the fsFreezeStatus matches the type of this request. The fsFreezeStatus is trusted only if it matches the request type.
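
The trust rule described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the type FreezeRequest, the function IsFreezeStatusTrusted, the "frozen" status value, and the handling of an empty or unknown request are all assumptions made for the example.

```go
package main

import "fmt"

// FreezeRequest is a hypothetical type for the request recorded in the
// annotation on the KubeVirt virtual machine instance.
type FreezeRequest string

const (
	RequestFreeze   FreezeRequest = "freeze"
	RequestUnfreeze FreezeRequest = "unfreeze"
)

// statusFrozen is the value assumed to be reported in fsFreezeStatus
// while the guest filesystem is frozen (an assumption for this sketch).
const statusFrozen = "frozen"

// IsFreezeStatusTrusted reports whether the observed fsFreezeStatus agrees
// with the last recorded request, so that it can be relied on.
func IsFreezeStatusTrusted(request FreezeRequest, fsFreezeStatus string) bool {
	switch request {
	case RequestFreeze:
		// A freeze was requested: trust the status only once it says frozen.
		return fsFreezeStatus == statusFrozen
	case RequestUnfreeze:
		// An unfreeze was requested: trust the status only once it is no longer frozen.
		return fsFreezeStatus != statusFrozen
	case "":
		// Assumption: with no request recorded, nothing is in flight.
		return true
	default:
		// Unknown request values are never trusted.
		return false
	}
}

func main() {
	fmt.Println(IsFreezeStatusTrusted(RequestFreeze, statusFrozen))
	fmt.Println(IsFreezeStatusTrusted(RequestFreeze, ""))
	fmt.Println(IsFreezeStatusTrusted(RequestUnfreeze, statusFrozen))
	fmt.Println(IsFreezeStatusTrusted(RequestUnfreeze, ""))
}
```

The point of the comparison is that a stale status left over from the previous operation is rejected until it catches up with the latest request.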

What is the expected result?

Snapshots with required consistency now execute without race conditions in the filesystem freeze process.

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: vdsnapshot
type: fix
summary: "Snapshots with required consistency now execute without race conditions in the filesystem freeze process."

@hardcoretime hardcoretime added this to the v1.2.0 milestone Nov 8, 2025
@hardcoretime hardcoretime force-pushed the fix/vdsnapshot/freeze-request-race-condition branch 12 times, most recently from 6918581 to 04b73ba Compare November 13, 2025 12:07
Roman Sysoev added 16 commits November 14, 2025 15:52
Signed-off-by: Roman Sysoev <roman.sysoev@flant.com>
Use kvvmi status instead of vm.

@hardcoretime hardcoretime force-pushed the fix/vdsnapshot/freeze-request-race-condition branch from dffed0c to 9065409 Compare November 14, 2025 12:52
@hardcoretime hardcoretime marked this pull request as ready for review November 14, 2025 13:48

filesystemFrozen, _ := conditions.GetCondition(vmcondition.TypeFilesystemFrozen, vm.Status.Conditions)
if _, ok := kvvmi.Annotations[annotations.AnnVMFilesystemFrozenRequest]; ok {
	return false, fmt.Errorf("failed to check %s/%s fsFreezeStatus: %w", kvvmi.Namespace, kvvmi.Name, ErrUntrustedFilesystemFrozenCondition)

Let’s include in the error the exact value that was found in the annotation.
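
One possible shape for this, as a sketch only: the annotation key, the sentinel error's message, and the helper names checkFreezeRequest and isUntrusted below are stand-ins invented for the example, not the identifiers from the PR's packages.

```go
package main

import (
	"errors"
	"fmt"
)

// annFilesystemFrozenRequest is a hypothetical annotation key used only here.
const annFilesystemFrozenRequest = "example.io/vm-filesystem-frozen-request"

// ErrUntrustedFilesystemFrozenCondition stands in for the sentinel error in
// the diff; its message text is an assumption.
var ErrUntrustedFilesystemFrozenCondition = errors.New("the filesystem frozen condition is untrusted")

// checkFreezeRequest returns an error that names the annotation and the exact
// value found in it, while still wrapping the sentinel with %w.
func checkFreezeRequest(namespace, name string, annotations map[string]string) error {
	if value, ok := annotations[annFilesystemFrozenRequest]; ok {
		return fmt.Errorf("failed to check %s/%s fsFreezeStatus: pending request %q found in annotation %s: %w",
			namespace, name, value, annFilesystemFrozenRequest, ErrUntrustedFilesystemFrozenCondition)
	}
	return nil
}

// isUntrusted shows that callers can still match the sentinel with errors.Is.
func isUntrusted(err error) bool {
	return errors.Is(err, ErrUntrustedFilesystemFrozenCondition)
}

func main() {
	err := checkFreezeRequest("default", "vm-a", map[string]string{annFilesystemFrozenRequest: "freeze"})
	fmt.Println(err)
	fmt.Println(isUntrusted(err))
}
```

Because the sentinel stays wrapped with %w, adding the value to the message does not change how callers detect the error.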

if vm != nil && vm.Status.Phase != v1alpha2.MachineStopped && !isFSFrozen {
	canFreeze, err := h.snapshotter.CanFreeze(ctx, kvvmi)
	if err != nil {
		if errors.Is(err, service.ErrUntrustedFilesystemFrozenCondition) {

If the kvvmi is fetched from the cache once for all functions (in GetKubeVirtVirtualMachineInstance), then there’s no point in checking for Untrusted every time. You already checked that at the very beginning in SyncFSFreezeRequest on line 173.
At this stage of the handler’s processing, such an error should no longer occur. This means it’s an unexpected case that should be handled as an internal error (return reconcile.Result{}, err)

} else {
	switch {
	case vm == nil, vm.Status.Phase == v1alpha2.MachineStopped:
		if vdSnapshot.Status.Consistent == nil || !*vdSnapshot.Status.Consistent {

The condition can be just vdSnapshot.Status.Consistent == nil. There is no need to overwrite *vdSnapshot.Status.Consistent if it has already been set.

case vm == nil, vm.Status.Phase == v1alpha2.MachineStopped:
	if vdSnapshot.Status.Consistent == nil || !*vdSnapshot.Status.Consistent {
		vdSnapshot.Status.Consistent = ptr.To(true)
		return reconcile.Result{RequeueAfter: 2 * time.Second}, nil

Let’s add a comment explaining why reconciliation is needed here and what problem it helps prevent

Message(service.CapitalizeFirstLetter(err.Error() + "."))
}

func (h LifeCycleHandler) unfreezeFilesystemIfFailed(ctx context.Context, vdSnapshot *v1alpha2.VirtualDiskSnapshot) error {

This function should be removed; use unfreezeFilesystem instead.

}
return reconcile.Result{}, err
}
} else {

Terrible cyclomatic complexity


This way is much simpler:

		if vdSnapshot.Status.Consistent == nil {
			if vm == nil || vm.Status.Phase == v1alpha2.MachineStopped || isFSFrozen {
				vdSnapshot.Status.Consistent = ptr.To(true)
				return reconcile.Result{RequeueAfter: 2 * time.Second}, nil
			}

			if vdSnapshot.Spec.RequiredConsistency {
				err = fmt.Errorf("virtual disk snapshot is not consistent because the virtual machine %s has not been stopped or its filesystem has not been frozen", vm.Name)
				setPhaseConditionToFailed(cb, &vdSnapshot.Status.Phase, err)
				return reconcile.Result{}, err
			}
		}

		err = h.unfreezeFilesystem(ctx, vdSnapshot.Name, vm, kvvmi)
		if err != nil {
			if k8serrors.IsConflict(err) {
				return reconcile.Result{RequeueAfter: 5 * time.Second}, nil
			}
			return reconcile.Result{}, err
		}
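
To make the branch order of the suggestion above easier to check in isolation, the decision can be pulled into a small pure function. This is a sketch under assumptions: the decision type, its constant names, and the decide function are invented for the example and do not exist in the PR.

```go
package main

import "fmt"

// decision is a hypothetical enum naming the three outcomes of the snippet.
type decision int

const (
	markConsistent    decision = iota // set Consistent to true and requeue
	failSnapshot                      // required consistency cannot be met
	proceedToUnfreeze                 // nothing left to decide, go on to unfreeze
)

func (d decision) String() string {
	switch d {
	case markConsistent:
		return "markConsistent"
	case failSnapshot:
		return "failSnapshot"
	default:
		return "proceedToUnfreeze"
	}
}

// decide mirrors the branch order of the suggested snippet. A nil consistent
// pointer means consistency has not been decided yet; once it is set, it is
// never overwritten.
func decide(consistent *bool, vmAbsentOrStopped, fsFrozen, requiredConsistency bool) decision {
	if consistent == nil {
		if vmAbsentOrStopped || fsFrozen {
			return markConsistent
		}
		if requiredConsistency {
			return failSnapshot
		}
	}
	return proceedToUnfreeze
}

func main() {
	decided := true
	fmt.Println(decide(nil, true, false, true))
	fmt.Println(decide(nil, false, false, true))
	fmt.Println(decide(nil, false, false, false))
	fmt.Println(decide(&decided, false, false, true))
}
```

With the nil check as the only outer guard, each outcome is reached through one path, which is what keeps the branching flat.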

3 participants