Next major release #219
Append each combined_factors value to a list and use that list as the final column, avoiding row_index, which seems to fail mypy.
The failure was not caused by an unhandled = in the CS= value but by the space after ;, which meant that "CS" did not match the key " CS" parsed from the cell.
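A minimal sketch of the kind of parsing this fix implies, assuming cells of the form `KEY=value; KEY=value`. The function name `parse_key_values` and the example cell are illustrative and not taken from the PR. Keys are stripped of surrounding whitespace, and only the first `=` splits key from value, so a `CS` regex containing `=` survives intact.

```python
def parse_key_values(cell: str) -> dict[str, str]:
    """Split a 'KEY=value; KEY=value' cell into a dict (hypothetical helper)."""
    pairs: dict[str, str] = {}
    for part in cell.split(";"):
        if not part.strip():
            continue  # skip empty fragments, e.g. after a trailing ';'
        # partition splits on the first '=' only, so '=' inside the value is kept
        key, _, value = part.partition("=")
        # strip so that " CS" and "CS" resolve to the same key
        pairs[key.strip()] = value.strip()
    return pairs


cell = "NT=Trypsin; AC=MS:1001251; CS=(?<=[KR])(?!P)"
parsed = parse_key_values(cell)
```

This sketch does not handle values that themselves contain `;`.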
The date of retrieval for all databases used was 29/08/2025.
Update the validate method so that it no longer fails immediately on an error in self.ontology_term_parser or self.validate_ontology_terms. Errors raised by them are now caught in a try/except block and added to the errors list.
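The collect-instead-of-fail pattern described here can be sketched as follows. The check functions and `run_checks` are hypothetical stand-ins, not the PR's `ontology_term_parser` or `validate_ontology_terms`.

```python
from typing import Callable


def check_not_empty(value: str) -> None:
    if not value.strip():
        raise ValueError("value is empty")


def check_no_trailing_whitespace(value: str) -> None:
    if value != value.rstrip():
        raise ValueError("trailing whitespace")


def run_checks(checks: list[Callable[[str], None]], value: str) -> list[str]:
    """Run every check and collect failures instead of stopping at the first."""
    errors: list[str] = []
    for check in checks:
        try:
            check(value)
        except ValueError as exc:
            errors.append(str(exc))
    return errors


checks = [check_not_empty, check_no_trailing_whitespace]
```

Calling `run_checks(checks, " ")` reports both problems in one pass, which is the behaviour the commit aims for.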
Matching of characteristics[age] range annotations is broken with the current regex. This update makes the regex somewhat more complex so that ranges are handled. Add the annotated SDRF file for PXD001474 from bigbio/proteomics-sample-metadata as a reference for testing ranges. Also update a docstring to refer to the correct input parameter name in OntologyValidator.validate.
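For illustration only, a regex that accepts single ages such as `40Y` or `30Y6M` and ranges such as `40Y-85Y` might look like the sketch below. This is an assumption about the age format, not the regex used in the PR.

```python
import re

# One age: optional year/month/week/day parts in that order.
# The lookahead (?=\d) requires at least one digit, so the empty string is rejected.
AGE = r"(?=\d)(?:\d+Y)?(?:\d+M)?(?:\d+W)?(?:\d+D)?"

# A single age, or two ages joined by '-' to form a range.
AGE_OR_RANGE = re.compile(rf"{AGE}(?:-{AGE})?")


def is_valid_age(value: str) -> bool:
    """Return True if value is a single age or an age range (illustrative rule)."""
    return AGE_OR_RANGE.fullmatch(value) is not None
```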
add to reference folder under tests
Update the validator to include a unique values validator that checks whether a column contains only unique values. It is then used to validate the assay name column.
Remove the single column unique values validator from minimum.yml and add a multiple column unique values validator that can be used in yml to check that combinations of several columns are unique. By default it is applied to the assay name, source name combination.
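A rough sketch of a multi-column uniqueness check, written against plain dictionaries rather than the PR's validator classes. `find_duplicate_combinations` and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_duplicate_combinations(
    rows: list[dict[str, str]], columns: list[str]
) -> dict[tuple[str, ...], list[int]]:
    """Map each repeated combination of column values to the row indices holding it."""
    seen: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in columns)
        seen[key].append(index)
    # keep only combinations that occur more than once
    return {key: indices for key, indices in seen.items() if len(indices) > 1}


rows = [
    {"source name": "sample 1", "assay name": "run 1"},
    {"source name": "sample 1", "assay name": "run 2"},
    {"source name": "sample 1", "assay name": "run 1"},
]
duplicates = find_duplicate_combinations(rows, ["source name", "assay name"])
```

Here the same source name appears three times without error. Only the repeated (source name, assay name) pair is flagged.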
Search for TMT and iTRAQ mods in lowercase
Pull Request Overview
This PR represents a major overhaul of the SDRF pipelines package, transitioning from a legacy validation system to a modern schema-driven validation architecture. The changes span the entire codebase with a complete refactor of the validation framework, build system migration to Poetry, and extensive modernization.
- Complete validation system replacement: Legacy validation replaced with Pydantic-based schema system with YAML/JSON schema definitions and modular validators
- New schema templates: Added organism-specific SDRF schemas (minimum, default, human, plants, vertebrates, cell lines) with inheritance support
- Modernized codebase: Migrated from `pkg_resources` to `pathlib.Path`, updated type annotations, and enhanced error handling throughout
Reviewed Changes
Copilot reviewed 57 out of 81 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sdrf_pipelines/sdrf/validators.py | New modular validator system with comprehensive validation rules |
| sdrf_pipelines/sdrf/schemas.py | Schema-driven validation system with inheritance support |
| sdrf_pipelines/parse_sdrf.py | CLI refactoring with new schema-based validation system |
| tests/test_ols.py | Complete test coverage for OLS client functionality |
| sdrf_pipelines/openms/openms.py | OpenMS converter refactoring with improved type safety |
| sdrf_pipelines/ols/ols.py | OLS client improvements with better error handling |
| setup.py | Removed (migrated to Poetry) |
| sdrf_pipelines/sdrf/sdrf.py | Complete refactor using Pydantic models |
Comments suppressed due to low confidence (1)
sdrf_pipelines/sdrf/schemas.py:1
- Using bare `except Exception` catches all exceptions including system errors. Consider catching specific exceptions like `ValueError`, `KeyError`, or `RuntimeError` that are expected during schema validation.
import json
sdrf_pipelines/sdrf/validators.py (Outdated)

    def __init__(self, params: dict[str, Any] | None = None, **data: Any):
        super().__init__(**data)
        logging.info(params)

Copilot AI, Sep 8, 2025
[nitpick] Logging parameters at INFO level in a validator initialization may be too verbose for production use. Consider using DEBUG level or removing this logging statement.
Suggested change: replace `logging.info(params)` with `logging.debug(params)`.
Rename multiple_collumn_unique_values_validator to combination_of_columns_no_duplicate_validator. Add single_cardinality_validator to help validate that a column contains exactly one unique value. Remove mentions of cardinality outside the yml validators section.
Add proteomics data acquisition method to the minimum template. Add column_name_warning to the combination no-duplicate validator so that combinations can be specified that are reported as warnings rather than errors. Add the ability to output a data table of all errors and warnings from the parse_sdrf validate_sdrf function via the optional -o or --out parameter.
Add a ValidationProof class that can generate a SHA512 hash for a validation action. The hash combines the content of the SDRF, the content of the template, the version of the validator, a timestamp, and an optional unique salt string. The validate_sdrf function in parse_sdrf also gains optional parameters for producing a validation proof JSON and/or just printing the hash to the CLI.
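How such a hash could be derived is sketched below with `hashlib`. The function name `proof_hash`, the field names, and the JSON serialization are assumptions for illustration. The actual ValidationProof class may combine its inputs differently.

```python
import hashlib
import json


def proof_hash(
    sdrf_content: str,
    template_content: str,
    validator_version: str,
    timestamp: str,
    salt: str = "",
) -> str:
    """Return a SHA512 hex digest over all inputs of a validation run (sketch)."""
    # Serialize with sorted keys so the same inputs always give the same bytes.
    payload = json.dumps(
        {
            "sdrf": sdrf_content,
            "template": template_content,
            "version": validator_version,
            "timestamp": timestamp,
            "salt": salt,
        },
        sort_keys=True,
    )
    return hashlib.sha512(payload.encode("utf-8")).hexdigest()


digest = proof_hash("source name\tassay name\n", "name: minimum\n", "1.0.0", "2025-09-08T00:00:00Z")
```

Changing any input, including the salt, yields a different digest, so the hash ties a result to one specific file, template, and validator version.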
Change most print statements in schemas.py to logging.debug, keeping only the print statement in the debug_col function, which some of the tests require.
Update the read_sdrf method so that it can return an SDRFDataFrame object instead of just a pd.DataFrame, and update SDRFDataFrame to accept both a plain pd.DataFrame and an SDRFDataFrame. Add an SDRFMetadata class to allow automated parsing of metadata lines from an SDRF file.
User description
PR Type
Enhancement, Bug fix, Tests, Documentation
Description
• Major refactoring: Complete replacement of legacy validation system with new schema-driven validation using Pydantic models and YAML/JSON schema definitions
• New validation framework: Introduced modular validator system with plugin architecture supporting ontology validation, pattern matching, and customizable error levels
• Schema templates: Added multiple organism-specific SDRF schema templates (minimum, default, human, plants, vertebrates, cell lines, non-vertebrates) with inheritance support
• CLI improvements: Added simplified `validate-sdrf-simple` command with color-coded error/warning output and enhanced help text
• Code modernization: Migrated from `pkg_resources` to `pathlib.Path`, replaced deprecated functions, improved type annotations throughout codebase
• Build system migration: Switched from setuptools to Poetry with updated dependencies and development tooling
• Enhanced testing: Added comprehensive test coverage for new validation system, OLS client functionality, and TMT label inference
• Bug fixes: Fixed XML element creation in MaxQuant converter, improved pandas operations, and enhanced error handling
• Documentation: Updated README and CONTRIBUTING guides to reflect new schema system and validation workflow
• CI/CD improvements: Added mypy type checking, updated Python version support (3.10-3.13), and enhanced pre-commit hooks
Diagram Walkthrough
File Walkthrough
10 files
openms.py
OpenMS converter refactoring with improved type safety and documentation
sdrf_pipelines/openms/openms.py
• Added comprehensive docstring with example command usage
• Improved type annotations throughout the class and functions
• Refactored modification parsing logic into smaller, more maintainable methods
• Added `infer_tmtplex` function and moved TMT plex dictionaries to module level
• Enhanced error handling and logging with better warning messages
• Replaced deprecated `_removesuffix` function with built-in `removesuffix` method
• Improved file writing with proper encoding specification
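As a rough illustration of plex inference, the sketch below picks the smallest plex whose channel set contains every observed label. `PLEXES` and `infer_plex_sketch` are hypothetical and cover only two plexes. They do not reproduce the PR's `TMT_PLEXES` dictionary or the real `infer_tmtplex` signature.

```python
# Hypothetical, abbreviated plex table; the module-level dictionaries in the PR may differ.
PLEXES: dict[str, list[str]] = {
    "tmt6plex": ["TMT126", "TMT127", "TMT128", "TMT129", "TMT130", "TMT131"],
    "tmt10plex": [
        "TMT126", "TMT127N", "TMT127C", "TMT128N", "TMT128C",
        "TMT129N", "TMT129C", "TMT130N", "TMT130C", "TMT131",
    ],
}


def infer_plex_sketch(observed: set[str]) -> str:
    """Return the smallest plex that contains all observed labels."""
    for name, labels in sorted(PLEXES.items(), key=lambda item: len(item[1])):
        if observed <= set(labels):
            return name
    raise ValueError(f"no known plex contains labels: {sorted(observed)}")
```

An incomplete label set such as `{"TMT126", "TMT131"}` still resolves to a plex, which is the "incomplete plex" scenario the new tests mention.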
ols.py
OLS client improvements with better error handling and type safety
sdrf_pipelines/ols/ols.py
• Enhanced type annotations and function signatures throughout the module
• Improved error handling with connection error fallback to cache
• Added comprehensive docstrings for all public methods
• Replaced deprecated `pkg_resources` with `pathlib.Path` for resource handling
• Enhanced OBO and OWL file parsing with better error handling
• Added retry logic for OLS API requests with exponential backoff
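Retry with exponential backoff can be sketched with the standard library alone. `call_with_retry` and `flaky_lookup` are hypothetical names. The OLS client in the PR may use different exception types, delays, and retry counts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with delays of base_delay * 2**attempt."""
    for attempt in range(retries):
        try:
            return func()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller fall back, e.g. to a cache
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")


# Usage: a lookup that fails twice before succeeding; delays are recorded, not slept.
delays: list[float] = []
attempts = {"count": 0}


def flaky_lookup() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retry(flaky_lookup, retries=4, base_delay=0.5, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.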
parse_sdrf.py
CLI refactoring with new schema-based validation system
sdrf_pipelines/parse_sdrf.py
• Replaced legacy SDRF validation system with new schema-based validation
• Added new `validate-sdrf-simple` command for streamlined validation
• Improved CLI help text and option descriptions
• Enhanced error reporting with color-coded output (red for errors, yellow for warnings)
• Removed deprecated validation flags and simplified validation workflow
validators.py
New modular validator system with comprehensive validation rules
sdrf_pipelines/sdrf/validators.py
• New comprehensive validator system with plugin architecture
• Implemented multiple validator types: trailing whitespace, minimum columns, ontology, pattern, column order, empty cells
• Added validator registry system for extensible validation framework
• Enhanced ontology validation with OLS integration and caching support
• Improved error reporting with detailed location information
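A validator registry of the kind listed above is often built with a decorator that maps a name to a callable. The sketch below shows that general pattern under assumed names (`VALIDATORS`, `register`, `validate`). It is not the registry implemented in validators.py.

```python
from typing import Callable

Validator = Callable[[str], list[str]]

# Name -> validator function; a schema could then refer to validators by name.
VALIDATORS: dict[str, Validator] = {}


def register(name: str) -> Callable[[Validator], Validator]:
    """Decorator that adds a validator function to the registry under name."""
    def decorator(func: Validator) -> Validator:
        VALIDATORS[name] = func
        return func
    return decorator


@register("trailing_whitespace")
def trailing_whitespace(value: str) -> list[str]:
    return ["trailing whitespace"] if value != value.rstrip() else []


@register("empty_cell")
def empty_cell(value: str) -> list[str]:
    return ["empty cell"] if value.strip() == "" else []


def validate(value: str, names: list[str]) -> list[str]:
    """Run the named validators on value and concatenate their messages."""
    errors: list[str] = []
    for name in names:
        errors.extend(VALIDATORS[name](value))
    return errors
```

New validators then only need the decorator to become available by name, which is what makes such a framework extensible.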
normalyzerde.py
NormalyzerDE converter improvements and modernization
sdrf_pipelines/normalyzerde/normalyzerde.py
• Improved pandas column operations using vectorized methods
• Enhanced type annotations for method parameters
• Fixed regex pattern matching with proper None comparison
• Improved file I/O with proper encoding specification
specification.py
New SDRF specification constants module
sdrf_pipelines/sdrf/specification.py
• Added new specification constants module
• Defined standard values for SDRF validation: `NOT_AVAILABLE`, `NOT_APPLICABLE`, `NORM`
schemas.py
New schema-driven validation system with inheritance support
sdrf_pipelines/sdrf/schemas.py
• Added comprehensive schema-based validation system with `SchemaRegistry` and `SchemaValidator`
• Implemented YAML/JSON schema loading with inheritance support via `extends` field
• Added ontology validation with multiple ontology support and rich error messages
• Created modular validator system with configurable parameters
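Inheritance through an `extends` field can be resolved by merging a schema's columns over those of its parent, recursively. The dictionary layout and `resolve_schema` below are assumptions for illustration. The real `SchemaRegistry` loads YAML/JSON files and may merge differently.

```python
def resolve_schema(name: str, schemas: dict[str, dict]) -> dict:
    """Return the schema with columns inherited along its 'extends' chain (sketch)."""
    schema = schemas[name]
    parent_name = schema.get("extends")
    # Start from the parent's resolved columns, if there is a parent.
    columns = dict(resolve_schema(parent_name, schemas)["columns"]) if parent_name else {}
    # Child definitions override or extend inherited ones.
    columns.update(schema.get("columns", {}))
    return {"name": name, "columns": columns}


# Hypothetical in-memory schemas mirroring the minimum -> default -> human chain.
schemas = {
    "minimum": {"columns": {"source name": {"required": True}}},
    "default": {"extends": "minimum", "columns": {"characteristics[disease]": {"required": True}}},
    "human": {"extends": "default", "columns": {"characteristics[age]": {"required": True}}},
}
resolved = resolve_schema("human", schemas)
```

The sketch has no cycle detection, so a schema that extends itself, directly or indirectly, would recurse until Python raises `RecursionError`.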
unimod.py
Modernize resource handling and code formatting
sdrf_pipelines/openms/unimod.py
• Replaced `pkg_resources` with `pathlib.Path` for resource file access
• Updated string formatting to use f-strings instead of `.format()`
• Fixed boolean comparison from `== True` to `is True`
• Added type hints for method parameters
msstats.py
Type hints and pandas operations modernization
sdrf_pipelines/msstats/msstats.py
• Added type hints for instance variables
• Replaced `map()` with `.str.lower()` for pandas column operations
• Updated dictionary key checking to use `in` operator
• Improved code readability and type safety
utils.py
Added TSV line utility function
sdrf_pipelines/utils/utils.py
• Added utility function `tsv_line()` for creating tab-separated value lines
• Simple helper function for TSV file generation
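The signature of `tsv_line()` is not shown on this page. One plausible shape, given here purely as an assumption under a different name, joins its arguments with tabs and appends a newline:

```python
def tsv_line_sketch(*values: object) -> str:
    """Join values with tabs and terminate the line (assumed behaviour)."""
    return "\t".join(str(value) for value in values) + "\n"


line = tsv_line_sketch("sample 1", 1, "run 1")
```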
3 files
maxquant.py
MaxQuant converter modernization and bug fixes
sdrf_pipelines/maxquant/maxquant.py
• Replaced deprecated `pkg_resources` with `pathlib.Path` for resource handling
• Fixed XML element creation bug (changed `type` to `name`)
• Improved pandas operations using vectorized methods
• Enhanced file I/O with proper encoding specification
• Added utility function usage for TSV line formatting
pythonpublish.yml
Fixed build dependency installation command
.github/workflows/pythonpublish.yml
• Fixed typo in pip install command for build dependencies
param2sdrf.yml
Fixed YAML syntax in parameter configuration
sdrf_pipelines/maxquant/param2sdrf.yml
• Fixed YAML comment syntax error in commented parameter
7 files
test_sdrfchecker.py
Test updates for new validation system and version checking
tests/test_sdrfchecker.py
• Added version output validation test with semantic versioning regex
• Updated test expectations to match new validation system output
• Enhanced help command testing
• Modified error message assertions to work with new validation framework
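A version check of this kind usually searches the CLI output for a `MAJOR.MINOR.PATCH` pattern. The regex and the sample output strings below are illustrative assumptions, not copied from tests/test_sdrfchecker.py.

```python
import re

# Three dot-separated integers, e.g. 1.2.3 (pre-release suffixes are not checked here).
SEMVER = re.compile(r"\b\d+\.\d+\.\d+\b")


def has_semantic_version(output: str) -> bool:
    """Return True if the text contains something shaped like MAJOR.MINOR.PATCH."""
    return SEMVER.search(output) is not None
```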
test_ols.py
Complete test coverage for OLS client functionality
tests/test_ols.py
• Added comprehensive test suite for OLS client functionality
• Tests cover real API calls for EFO and GO ontologies
• Added tests for OBO and OWL file parsing functions
• Includes tests for ontology index building and error handling
test_min_columns.py
Tests for schema-based minimum column validation
tests/test_min_columns.py
• Added tests for new schema validation system
• Tests minimum column requirements and error counting
• Validates specific error messages and types
• Uses new `SDRFDataFrame` and `SchemaValidator` classes
test_openms.py
Added tests for TMT label inference functionality
tests/test_openms.py
• Added imports for `TMT_PLEXES` and `infer_tmtplex` functions
• Added parametrized tests for TMT label inference functionality
• Tests both full and incomplete TMT plex scenarios
test_sdrf.py
Basic tests for new SDRFDataFrame implementation
tests/test_sdrf.py
• Added basic tests for new `SDRFDataFrame` class
• Tests DataFrame creation, column access, and shape properties
• Validates the new Pydantic-based SDRF data structure
test_ontology.py
Ontology test cleanup and standardization
tests/test_ontology.py
• Updated ontology names to lowercase (`NCBITaxon` to `ncbitaxon`)
• Removed debug print statements
• Cleaned up test code for better maintainability
test_unimod.py
Enhanced Unimod tests with proper assertions
tests/test_unimod.py
• Added assertions to validate test results instead of just printing
• Removed debug print statements and main execution block
• Improved test reliability with proper assertions
1 files
helpers.py
Test helper improvements with better type annotations
tests/helpers.py
• Updated import statements to use more specific Click types
• Enhanced type annotations for better code clarity
1 files
sdrf.py
Complete refactor of SDRF data structure using Pydantic
sdrf_pipelines/sdrf/sdrf.py
• Replaced legacy `SdrfDataFrame` class with new `SDRFDataFrame` Pydantic model
• Removed complex validation methods and schema imports
• Added simple `read_sdrf` function for parsing SDRF files
• Simplified DataFrame operations with delegation pattern
4 files
exceptions.py
Remove pandas_schema dependency with custom validation classes
sdrf_pipelines/utils/exceptions.py
• Removed dependency on `pandas_schema.ValidationWarning`
• Implemented custom `ValidationWarning` class with same interface
• Enhanced `LogicError` class with proper error type handling
• Improved string formatting and type hints
environment.yml
Cleaned up conda environment dependencies
environment.yml
• Removed build-specific dependencies (conda-build, anaconda-client)
• Removed deprecated pandas_schema and setuptools
• Simplified dependency list to runtime requirements only
requirements.txt
Updated dependencies for Pydantic validation system
requirements.txt
• Removed deprecated pandas_schema dependency
• Added pydantic>=2.0.0 for new validation system
• Removed setuptools as it's no longer needed
requirements-dev.txt
Enhanced development dependencies with type checking
requirements-dev.txt
• Added mypy and related type checking dependencies
• Added pandas-stubs for better pandas type checking
• Enhanced development tooling for type safety
2 files
test_convert_openms.py
Code cleanup and import optimization
tests/test_convert_openms.py
• Updated import statements to use tuple unpacking
• Replaced unused variable assignments with underscore
• Minor code cleanup and formatting improvements
modifications.xml
XML schema simplification for modifications
sdrf_pipelines/maxquant/modifications.xml
• Removed XML schema declarations from modifications file
• Simplified XML structure while maintaining content
3 files
__init__.py
Module documentation for SDRF v2
sdrf_pipelines/sdrf/__init__.py
• Added module docstring describing the new Pydantic-based validation system
README.md
Comprehensive documentation for new validation system
README.md
• Added extensive documentation for new schema-based validation system
• Documented YAML schema features, inheritance, and custom template creation
• Added examples of JSON schema definitions and validation commands
• Explained the new simplified validation interface
CONTRIBUTING.md
Updated development workflow documentation
CONTRIBUTING.md
• Updated development setup instructions for Poetry and modern Python
• Added mypy type checking documentation
• Enhanced pre-commit hooks documentation
• Improved development workflow guidance
14 files
minimum.yaml
Minimum SDRF schema definition with ontology validation
sdrf_pipelines/sdrf/schemas/minimum.yaml
• Defined minimum SDRF schema with 12 required columns
• Specified ontology validators for organism, cell type, label, and instrument fields
• Added comprehensive field descriptions and validation rules
ci.yml
Enhanced CI with dev branch support and type checking
.github/workflows/ci.yml
• Added `dev` branch to CI triggers for push and pull requests
• Updated Python version from 3.12 to 3.10 for consistency
• Added mypy type checking job with proper dependencies
• Updated package installation to use `pip install .`
pyproject.toml
Migration to Poetry build system with modern tooling
pyproject.toml
• Migrated from setup.py to Poetry-based configuration
• Added comprehensive project metadata and dependencies
• Configured development tools (black, isort, mypy, pydantic-mypy)
• Added proper package structure and script entry points
human.yaml
Human-specific SDRF schema with specialized validation
sdrf_pipelines/sdrf/schemas/human.yaml
• Defined human-specific SDRF schema extending default schema
• Added fields for ancestry, age, sex, developmental stage, and individual
• Configured ontology validation for EFO terms and pattern matching for age
pythonpackage.yml
Python version updates and installation improvements
.github/workflows/pythonpackage.yml
• Updated Python version matrix to remove 3.9 and add 3.13
• Standardized pip installation commands with `python -m pip`
• Updated flake8 line length to 120 characters
• Improved package installation process
pythonapp.yml
Poetry integration and Python version standardization
.github/workflows/pythonapp.yml
• Updated Python version from 3.11 to 3.10
• Added Poetry installation and usage for dependency management
• Updated test execution to use specific test selection
.pre-commit-config.yaml
Updated pre-commit hooks with mypy integration
.pre-commit-config.yaml
• Updated all hook versions to latest releases
• Added mypy pre-commit hook with additional dependencies
• Enhanced code quality checks with type validation
meta.yaml
Conda recipe update for Poetry build system
recipe/meta.yaml
• Updated build system to use Poetry instead of setuptools
• Reorganized dependencies and removed deprecated packages
• Added pydantic and updated package versions
cell_lines.yaml
Cell line SDRF schema with ontology validation
sdrf_pipelines/sdrf/schemas/cell_lines.yaml
• Created cell line-specific schema extending human schema
• Added cell line field with CLO and BTO ontology validation
• Provided examples of common cell lines (HeLa, HEK293, MCF7)
default.yaml
Default SDRF schema with disease field validation
sdrf_pipelines/sdrf/schemas/default.yaml
• Created default schema extending minimum schema
• Added disease field with MONDO and EFO ontology validation
• Configured warning-level validation for disease terms
nonvertebrates.yaml
Non-vertebrate SDRF schema definition
sdrf_pipelines/sdrf/schemas/nonvertebrates.yaml
• Created non-vertebrate schema extending default schema
• Added developmental stage and strain/breed fields
• Configured appropriate validation rules for non-vertebrate samples
plants.yaml
Plant-specific SDRF schema definition
sdrf_pipelines/sdrf/schemas/plants.yaml
• Created plant-specific schema extending minimum schema
• Added developmental stage and strain/breed fields for plant samples
• Configured validation rules appropriate for plant biology
conda-build.yml
Added conda recipe verification to build process
.github/workflows/conda-build.yml
• Added conda recipe verification step before building
• Enhanced build process with validation checks
vertebrates.yaml
Vertebrate SDRF schema definition
sdrf_pipelines/sdrf/schemas/vertebrates.yaml
• Created vertebrate schema extending default schema
• Added developmental stage field for vertebrate-specific validation
4 files