@d-v-b d-v-b commented Oct 28, 2025

This PR adds a Bytes dtype that is nearly identical to the existing VariableLengthBytes dtype, save for a few differences:

  • The V3 JSON form is "bytes" instead of "variable_length_bytes".
  • The fill value representation is an array of ints instead of a base64-encoded string.
  • Bytes is described by a published spec, whereas VariableLengthBytes is not described by any spec.

The last point is the most important.

Because Bytes is nearly identical to VariableLengthBytes, it could be a drop-in replacement for VariableLengthBytes, save for the difference in JSON fill value encoding between the two data types. That difference raises two questions:

  • is base64 encoding a bytestring a better or worse JSON serialization than representing the same bytestring as an array of integers?
  • could (or should) we amend the bytes data type spec to recommend reading (or reading and writing) both fill value encodings? If so, the bytes data type and the variable_length_bytes data types could be fused completely.
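
To make the first question concrete, here is a minimal sketch of the two JSON encodings for the same bytestring. The sample fill value is an arbitrary assumption chosen for illustration; it is not taken from the spec or from zarr-python.

```python
import base64
import json

# Arbitrary example fill value (an assumption for illustration only)
fill_value = b"\x00\x01\xff"

# Form 1: array of integers, one entry per byte
as_ints = list(fill_value)

# Form 2: base64-encoded string
as_base64 = base64.standard_b64encode(fill_value).decode("ascii")

# Both forms are valid JSON; they differ in size and in how easy they are to read
print(json.dumps({"fill_value": as_ints}))
print(json.dumps({"fill_value": as_base64}))
```

The integer array is readable without a decoder but grows faster with the length of the bytestring; the base64 string is more compact but opaque.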

Since zarr python 2.x was saving bytes fill values as base64-encoded strings, there's a bit of inertia there. It would also be good to hear thoughts from other implementers (@LDeakin, @jbms, @manzt).

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Oct 28, 2025

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 51.85185% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.72%. Comparing base (54dcede) to head (3a61a14).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/core/dtype/npy/bytes.py 57.14% 30 Missing ⚠️
src/zarr/core/dtype/__init__.py 53.84% 6 Missing ⚠️
src/zarr/core/dtype/registry.py 66.66% 4 Missing ⚠️
src/zarr/errors.py 0.00% 3 Missing ⚠️
src/zarr/core/dtype/npy/bool.py 0.00% 1 Missing ⚠️
src/zarr/core/dtype/npy/common.py 50.00% 1 Missing ⚠️
src/zarr/core/dtype/npy/complex.py 0.00% 1 Missing ⚠️
src/zarr/core/dtype/npy/float.py 0.00% 1 Missing ⚠️
src/zarr/core/dtype/npy/int.py 0.00% 1 Missing ⚠️
src/zarr/core/dtype/npy/string.py 0.00% 1 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3559      +/-   ##
==========================================
- Coverage   61.89%   61.72%   -0.18%     
==========================================
  Files          85       85              
  Lines       10153    10246      +93     
==========================================
+ Hits         6284     6324      +40     
- Misses       3869     3922      +53     
Files with missing lines Coverage Δ
src/zarr/core/dtype/common.py 28.39% <ø> (+0.68%) ⬆️
src/zarr/core/dtype/npy/bool.py 45.61% <0.00%> (-0.82%) ⬇️
src/zarr/core/dtype/npy/common.py 61.53% <50.00%> (-0.19%) ⬇️
src/zarr/core/dtype/npy/complex.py 47.61% <0.00%> (-0.58%) ⬇️
src/zarr/core/dtype/npy/float.py 46.87% <0.00%> (-0.50%) ⬇️
src/zarr/core/dtype/npy/int.py 53.41% <0.00%> (-0.17%) ⬇️
src/zarr/core/dtype/npy/string.py 44.11% <0.00%> (-0.33%) ⬇️
src/zarr/core/dtype/npy/structured.py 55.78% <0.00%> (-0.60%) ⬇️
src/zarr/core/dtype/npy/time.py 52.84% <0.00%> (-0.31%) ⬇️
src/zarr/dtype.py 0.00% <0.00%> (ø)
... and 4 more

@d-v-b d-v-b marked this pull request as ready for review November 16, 2025 15:00
…ions for bytes / variable-length bytes; set up alias logic for variable-length bytes; parse_dtype now takes the bytes type as an alias for the bytes dtype
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Nov 16, 2025

d-v-b commented Nov 17, 2025

I am assuming that zarr-developers/zarr-extensions#38 will be merged soon, as it was approved by @normanrz.

The Bytes data type now reads the tuple-of-ints JSON fill value defined in the current text of the "bytes" dtype spec, and it also reads (and writes) the base64-encoded string form that the future version of the "bytes" dtype spec softly recommends.
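
A reader that accepts both fill value forms could look roughly like the sketch below. The helper name `parse_bytes_fill_value` is hypothetical and is not the function used in this PR; the sketch only illustrates dispatching on the JSON type of the fill value.

```python
import base64
from collections.abc import Sequence


def parse_bytes_fill_value(data: object) -> bytes:
    """Hypothetical helper: decode a JSON fill value given in either form."""
    # Check str first, because str is itself a Sequence
    if isinstance(data, str):
        # base64-encoded string form; validate=True rejects non-alphabet characters
        return base64.b64decode(data, validate=True)
    if isinstance(data, Sequence) and all(
        isinstance(x, int) and not isinstance(x, bool) and 0 <= x <= 255 for x in data
    ):
        # array-of-ints form, one integer per byte
        return bytes(data)
    raise TypeError(f"Cannot interpret {data!r} as a bytes fill value")


from_ints = parse_bytes_fill_value([0, 1, 255])
from_base64 = parse_bytes_fill_value("AAH/")
```

Both calls decode to the same bytestring, so a writer that emits base64 stays readable by this kind of reader alongside data written with the integer form.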

The scope of this PR has broadened somewhat because we are effectively replacing one data type (VariableLengthBytes) with a new one (Bytes).

These two data types cannot co-exist in the data type registry, because they have the same zarr v2 JSON representation ({"id": "|O", "object_codec_id": "vlen-bytes"}), and the registry treats "1 JSON input matches 2 or more data type classes" as an error.

So I removed VariableLengthBytes from the registry, and added logic to Bytes to ensure that the JSON string "variable_length_bytes" is an alias for "bytes", which guarantees that old data will be readable by the new data type.
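
The conflict and the alias can be illustrated with a toy registry. Everything in this sketch (`ToyRegistry`, `ToyBytes`, `ToyVariableLengthBytes`, and the `matches` method) is a hypothetical stand-in, not zarr-python's actual registry API; only the JSON values are taken from the description above.

```python
# Zarr v2 JSON representation shared by both data types
V2_VLEN_BYTES = {"id": "|O", "object_codec_id": "vlen-bytes"}


class ToyVariableLengthBytes:
    @classmethod
    def matches(cls, data: object) -> bool:
        return data == "variable_length_bytes" or data == V2_VLEN_BYTES


class ToyBytes:
    @classmethod
    def matches(cls, data: object) -> bool:
        # "variable_length_bytes" is accepted as an alias for "bytes"
        return data in ("bytes", "variable_length_bytes") or data == V2_VLEN_BYTES


class ToyRegistry:
    def __init__(self) -> None:
        self.contents: dict[str, type] = {}

    def register(self, key: str, cls: type) -> None:
        self.contents[key] = cls

    def resolve(self, data: object) -> type:
        # Exactly one registered class may claim a given JSON input
        matches = [cls for cls in self.contents.values() if cls.matches(data)]
        if len(matches) != 1:
            names = [cls.__name__ for cls in matches]
            raise ValueError(f"expected exactly 1 match for {data!r}, got {names}")
        return matches[0]


# With both classes registered, the shared v2 JSON is ambiguous
both = ToyRegistry()
both.register("bytes", ToyBytes)
both.register("variable_length_bytes", ToyVariableLengthBytes)
try:
    both.resolve(V2_VLEN_BYTES)
    ambiguous = False
except ValueError:
    ambiguous = True

# With only the new class registered, the old name still resolves via the alias
only_bytes = ToyRegistry()
only_bytes.register("bytes", ToyBytes)
resolved = only_bytes.resolve("variable_length_bytes")
```

In this toy model, removing the old class from the registry is what resolves the ambiguity, and the alias is what keeps the old V3 name readable.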

For any users who are committed to the VariableLengthBytes data type for some reason, I added a function to reset the data type registry to the old state (de-registering Bytes and registering VariableLengthBytes in its place), and a function to reverse this transformation. We need to foreground these functions in our documentation for the bytes dtype.

This is not a breaking change for runtime functionality (Bytes and VariableLengthBytes are effectively identical), but I think this is a breaking change for anyone who relies on the round-trip integrity of old data -- data saved with "data_type" : "variable_length_bytes" will be resaved with "data_type": "bytes" after this change. For that reason I think we should ship this in a 3.2 release.

Comment on lines +287 to +289
if dtype_spec is bytes:
    # Treat the bytes type as a request for the Bytes dtype
    return Bytes()

flagging this change -- parse_dtype(bytes, zarr_format = 3) will now return an instance of Bytes

Comment on lines +296 to +319
def enable_legacy_bytes_dtype() -> None:
    """
    Unregister the new Bytes data type from the registry, and replace it with the
    VariableLengthBytes dtype instead. Used for backwards compatibility.
    """
    if (
        "bytes" in data_type_registry.contents
        and "variable_length_bytes" not in data_type_registry.contents
    ):
        data_type_registry.unregister("bytes")
        data_type_registry.register("variable_length_bytes", VariableLengthBytes)


def disable_legacy_bytes_dtype() -> None:
    """
    Unregister the old VariableLengthBytes dtype from the registry, and replace it with
    the new Bytes dtype. Used to reverse the effect of enable_legacy_bytes_dtype
    """
    if (
        "variable_length_bytes" in data_type_registry.contents
        and "bytes" not in data_type_registry.contents
    ):
        data_type_registry.unregister("variable_length_bytes")
        data_type_registry.register("bytes", Bytes)

these functions let users effectively disable the changes in this PR

Comment on lines -154 to -159
class DataTypeValidationError(ValueError): ...


class ScalarTypeValidationError(ValueError): ...



these errors were moved to the main errors module

class ScalarTypeValidationError(ValueError): ...


class DataTypeResolutionError(ValueError):

this is a new error class that models problems encountered during data type resolution, i.e. the process of mapping an input to one of the available data types. This is different from data type validation, which we have defined to mean the process of mapping an input to a specific data type.
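
The distinction between the two processes can be sketched as follows. The `Toy*` error classes and the `validate` and `resolve` functions are hypothetical stand-ins that only mirror the naming above; they are not the implementations in this PR.

```python
class ToyDataTypeValidationError(ValueError): ...


class ToyDataTypeResolutionError(ValueError): ...


def validate(data: object, expected_name: str) -> str:
    """Validation: check an input against one specific data type."""
    if data != expected_name:
        raise ToyDataTypeValidationError(f"{data!r} is not {expected_name!r}")
    return expected_name


def resolve(data: object, candidates: list[str]) -> str:
    """Resolution: map an input to one of the available data types."""
    matches = []
    for name in candidates:
        try:
            matches.append(validate(data, name))
        except ToyDataTypeValidationError:
            # Failing one candidate is expected during resolution
            continue
    if len(matches) != 1:
        raise ToyDataTypeResolutionError(
            f"{data!r} matched {len(matches)} data types, expected exactly 1"
        )
    return matches[0]


resolved_name = resolve("bytes", ["bool", "bytes", "int8"])
```

Under this reading, a validation failure says "this input is not that data type", while a resolution failure says "no single available data type claims this input".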
