
Conversation

@fabianegli
Collaborator

@fabianegli fabianegli commented May 13, 2025

User description

Summary by CodeRabbit

  • New Features

    • Introduced a schema-driven SDRF validation system with modular, extensible schema definitions and validator plugins.
    • Added multiple organism- and use case-specific SDRF schema templates (e.g., minimum, default, human, plants, vertebrates, cell lines).
    • Added a simplified SDRF validation CLI command and improved error/warning output formatting.
    • Enabled ontology validation and pattern checks for SDRF columns with customizable error levels.
    • Added utility and specification constants for SDRF handling.
  • Bug Fixes

    • Improved encoding handling for file operations and error reporting in CLI and IO routines.
    • Fixed pandas column handling and DataFrame operations in several modules.
  • Documentation

    • Expanded and clarified README and CONTRIBUTING guides to reflect the new schema system and validation workflow.
  • Refactor

    • Replaced legacy validation logic with schema-based validation using Pydantic and modular validators.
    • Simplified SDRF data structures and removed deprecated or redundant modules and configuration files.
    • Enhanced type annotations, error handling, and logging throughout the codebase.
  • Chores

    • Updated development and CI dependencies, Python version requirements, and packaging configuration (migrated to Poetry).
    • Added and updated tests to cover new validation logic and ontology utilities.
    • Improved pre-commit hooks and code formatting standards.
  • Revert

    • Removed legacy SDRF validation and parameter merging scripts, along with associated configuration files.

PR Type

Enhancement, Bug fix, Tests, Documentation


Description

Major refactoring: Complete replacement of legacy validation system with new schema-driven validation using Pydantic models and YAML/JSON schema definitions
New validation framework: Introduced modular validator system with plugin architecture supporting ontology validation, pattern matching, and customizable error levels
Schema templates: Added multiple organism-specific SDRF schema templates (minimum, default, human, plants, vertebrates, cell lines, non-vertebrates) with inheritance support
CLI improvements: Added simplified validate-sdrf-simple command with color-coded error/warning output and enhanced help text
Code modernization: Migrated from pkg_resources to pathlib.Path, replaced deprecated functions, improved type annotations throughout codebase
Build system migration: Switched from setuptools to Poetry with updated dependencies and development tooling
Enhanced testing: Added comprehensive test coverage for new validation system, OLS client functionality, and TMT label inference
Bug fixes: Fixed XML element creation in MaxQuant converter, improved pandas operations, and enhanced error handling
Documentation: Updated README and CONTRIBUTING guides to reflect new schema system and validation workflow
CI/CD improvements: Added mypy type checking, updated Python version support (3.10-3.13), and enhanced pre-commit hooks


Diagram Walkthrough

flowchart LR
  A["Legacy Validation System"] --> B["Schema-driven Validation"]
  B --> C["YAML/JSON Schema Templates"]
  B --> D["Pydantic Models"]
  B --> E["Modular Validators"]
  C --> F["Organism-specific Schemas"]
  E --> G["Ontology Validation"]
  E --> H["Pattern Matching"]
  I["setuptools"] --> J["Poetry Build System"]
  K["pkg_resources"] --> L["pathlib.Path"]

File Walkthrough

Relevant files
Enhancement
10 files
openms.py
OpenMS converter refactoring with improved type safety and documentation

sdrf_pipelines/openms/openms.py

• Added comprehensive docstring with example command usage
• Improved type annotations throughout the class and functions
• Refactored modification parsing logic into smaller, more maintainable methods
• Added infer_tmtplex function and moved TMT plex dictionaries to module level
• Enhanced error handling and logging with better warning messages
• Replaced deprecated _removesuffix function with built-in removesuffix method
• Improved file writing with proper encoding specification

+329/-359
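
The TMT plex inference mentioned above can be illustrated with a minimal sketch. It assumes the inference picks the smallest plex whose label set covers every observed label; the `TMT_PLEXES` contents and the `infer_tmtplex` signature shown here are hypothetical and may differ from the ones in openms.py.

```python
# Hypothetical label sets; the real TMT_PLEXES dictionary in openms.py may differ.
TMT_PLEXES = {
    "tmt6plex": ["TMT126", "TMT127", "TMT128", "TMT129", "TMT130", "TMT131"],
    "tmt10plex": [
        "TMT126", "TMT127N", "TMT127C", "TMT128N", "TMT128C",
        "TMT129N", "TMT129C", "TMT130N", "TMT130C", "TMT131",
    ],
}


def infer_tmtplex(labels: set[str]) -> str:
    """Return the smallest plex whose label set contains all observed labels."""
    candidates = [
        (len(plex_labels), name)
        for name, plex_labels in TMT_PLEXES.items()
        if labels <= set(plex_labels)
    ]
    if not candidates:
        raise ValueError(f"No known TMT plex covers labels: {sorted(labels)}")
    # Tuples sort by size first, so min() picks the smallest matching plex.
    return min(candidates)[1]
```

Choosing the smallest covering plex means an incomplete label set (only some channels used) still resolves to a single plex.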
ols.py
OLS client improvements with better error handling and type safety

sdrf_pipelines/ols/ols.py

• Enhanced type annotations and function signatures throughout the module
• Improved error handling with connection error fallback to cache
• Added comprehensive docstrings for all public methods
• Replaced deprecated pkg_resources with pathlib.Path for resource handling
• Enhanced OBO and OWL file parsing with better error handling
• Added retry logic for OLS API requests with exponential backoff

+234/-82
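
Retry with exponential backoff, as listed above, might look roughly like the sketch below. The helper name `with_backoff`, its parameters, and the use of the built-in `ConnectionError` as a stand-in for the HTTP client's own exception are all assumptions, not the code in ols.py.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on ConnectionError with delays base_delay * 2**attempt."""
    for attempt in range(retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == retries:
                raise  # retries exhausted; a caller could fall back to a cache here
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.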
parse_sdrf.py
CLI refactoring with new schema-based validation system   

sdrf_pipelines/parse_sdrf.py

• Replaced legacy SDRF validation system with new schema-based validation
• Added new validate-sdrf-simple command for streamlined validation
• Improved CLI help text and option descriptions
• Enhanced error reporting with color-coded output (red for errors, yellow for warnings)
• Removed deprecated validation flags and simplified validation workflow

+154/-66
validators.py
New modular validator system with comprehensive validation rules

sdrf_pipelines/sdrf/validators.py

• New comprehensive validator system with plugin architecture
• Implemented multiple validator types: trailing whitespace, minimum columns, ontology, pattern, column order, empty cells
• Added validator registry system for extensible validation framework
• Enhanced ontology validation with OLS integration and caching support
• Improved error reporting with detailed location information

+468/-0 
normalyzerde.py
NormalyzerDE converter improvements and modernization       

sdrf_pipelines/normalyzerde/normalyzerde.py

• Improved pandas column operations using vectorized methods
• Enhanced type annotations for method parameters
• Fixed regex pattern matching with proper None comparison
• Improved file I/O with proper encoding specification

+19/-8   
specification.py
New SDRF specification constants module                                   

sdrf_pipelines/sdrf/specification.py

• Added new specification constants module
• Defined standard values for SDRF validation: NOT_AVAILABLE, NOT_APPLICABLE, NORM

+4/-0     
schemas.py
New schema-driven validation system with inheritance support

sdrf_pipelines/sdrf/schemas.py

• Added comprehensive schema-based validation system with SchemaRegistry and SchemaValidator
• Implemented YAML/JSON schema loading with inheritance support via extends field
• Added ontology validation with multiple ontology support and rich error messages
• Created modular validator system with configurable parameters

+365/-0 
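
Schema inheritance through an `extends` field can be pictured as a recursive merge of parent columns into the child. The sketch below uses in-memory dictionaries instead of YAML files and assumes that child definitions override parent ones; the `resolve_schema` helper and the dictionary layout are hypothetical.

```python
from typing import Any

# Hypothetical in-memory schemas; the real ones are YAML files such as minimum.yaml.
SCHEMAS: dict[str, dict[str, Any]] = {
    "minimum": {"columns": {"source name": {"required": True}}},
    "default": {
        "extends": "minimum",
        "columns": {"characteristics[disease]": {"required": True}},
    },
}


def resolve_schema(name: str, schemas: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Return the schema with all inherited columns merged in, parents first."""
    schema = schemas[name]
    parent_name = schema.get("extends")
    columns: dict[str, Any] = {}
    if parent_name is not None:
        columns.update(resolve_schema(parent_name, schemas)["columns"])
    columns.update(schema.get("columns", {}))  # assumed: child overrides parent
    return {"name": name, "columns": columns}
```

Resolving `"default"` here yields the `minimum` column followed by the disease column.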
unimod.py
Modernize resource handling and code formatting                   

sdrf_pipelines/openms/unimod.py

• Replaced pkg_resources with pathlib.Path for resource file access
• Updated string formatting to use f-strings instead of .format()
• Fixed boolean comparison from == True to is True
• Added type hints for method parameters

+20/-12 
msstats.py
Type hints and pandas operations modernization                     

sdrf_pipelines/msstats/msstats.py

• Added type hints for instance variables
• Replaced map() with .str.lower() for pandas column operations
• Updated dictionary key checking to use in operator
• Improved code readability and type safety

+9/-4     
utils.py
Added TSV line utility function                                                   

sdrf_pipelines/utils/utils.py

• Added utility function tsv_line() for creating tab-separated value lines
• Simple helper function for TSV file generation

+6/-0     
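
A helper like the `tsv_line()` mentioned above could be as small as the following sketch. The variadic signature and the trailing newline are assumptions; the actual function in utils.py may take different arguments.

```python
def tsv_line(*fields: str) -> str:
    """Join the given fields with tabs and terminate the line with a newline."""
    return "\t".join(fields) + "\n"
```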
Bug fix
3 files
maxquant.py
MaxQuant converter modernization and bug fixes                     

sdrf_pipelines/maxquant/maxquant.py

• Replaced deprecated pkg_resources with pathlib.Path for resource handling
• Fixed XML element creation bug (changed type to name)
• Improved pandas operations using vectorized methods
• Enhanced file I/O with proper encoding specification
• Added utility function usage for TSV line formatting

+48/-46 
pythonpublish.yml
Fixed build dependency installation command                           

.github/workflows/pythonpublish.yml

• Fixed typo in pip install command for build dependencies

+1/-1     
param2sdrf.yml
Fixed YAML syntax in parameter configuration                         

sdrf_pipelines/maxquant/param2sdrf.yml

• Fixed YAML comment syntax error in commented parameter

+1/-1     
Tests
7 files
test_sdrfchecker.py
Test updates for new validation system and version checking

tests/test_sdrfchecker.py

• Added version output validation test with semantic versioning regex
• Updated test expectations to match new validation system output
• Enhanced help command testing
• Modified error message assertions to work with new validation framework

+47/-7   
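
A version check of the kind listed above can be approximated with a simple MAJOR.MINOR.PATCH pattern. The regex and the sample output string in this sketch are illustrative assumptions, not the ones used in test_sdrfchecker.py.

```python
import re

# Simplified pattern: three dot-separated integers (ignores pre-release/build suffixes).
SEMVER = re.compile(r"\d+\.\d+\.\d+")


def has_semver(output: str) -> bool:
    """Return True if the text contains something shaped like MAJOR.MINOR.PATCH."""
    return SEMVER.search(output) is not None
```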
test_ols.py
Complete test coverage for OLS client functionality           

tests/test_ols.py

• Added comprehensive test suite for OLS client functionality
• Tests cover real API calls for EFO and GO ontologies
• Added tests for OBO and OWL file parsing functions
• Includes tests for ontology index building and error handling

+205/-0 
test_min_columns.py
Tests for schema-based minimum column validation                 

tests/test_min_columns.py

• Added tests for new schema validation system
• Tests minimum column requirements and error counting
• Validates specific error messages and types
• Uses new SDRFDataFrame and SchemaValidator classes

+57/-0   
test_openms.py
Added tests for TMT label inference functionality               

tests/test_openms.py

• Added imports for TMT_PLEXES and infer_tmtplex functions
• Added parametrized tests for TMT label inference functionality
• Tests both full and incomplete TMT plex scenarios

+16/-2   
test_sdrf.py
Basic tests for new SDRFDataFrame implementation                 

tests/test_sdrf.py

• Added basic tests for new SDRFDataFrame class
• Tests DataFrame creation, column access, and shape properties
• Validates the new Pydantic-based SDRF data structure

+31/-0   
test_ontology.py
Ontology test cleanup and standardization                               

tests/test_ontology.py

• Updated ontology names to lowercase (NCBITaxon to ncbitaxon)
• Removed debug print statements
• Cleaned up test code for better maintainability

+2/-5     
test_unimod.py
Enhanced Unimod tests with proper assertions                         

tests/test_unimod.py

• Added assertions to validate test results instead of just printing
• Removed debug print statements and main execution block
• Improved test reliability with proper assertions

+4/-8     
Miscellaneous
1 files
helpers.py
Test helper improvements with better type annotations       

tests/helpers.py

• Updated import statements to use more specific Click types
• Enhanced type annotations for better code clarity

+3/-4     
Refactor
1 files
sdrf.py
Complete refactor of SDRF data structure using Pydantic   

sdrf_pipelines/sdrf/sdrf.py

• Replaced legacy SdrfDataFrame class with new SDRFDataFrame Pydantic model
• Removed complex validation methods and schema imports
• Added simple read_sdrf function for parsing SDRF files
• Simplified DataFrame operations with delegation pattern

+60/-254
Dependencies
4 files
exceptions.py
Remove pandas_schema dependency with custom validation classes

sdrf_pipelines/utils/exceptions.py

• Removed dependency on pandas_schema.ValidationWarning
• Implemented custom ValidationWarning class with same interface
• Enhanced LogicError class with proper error type handling
• Improved string formatting and type hints

+47/-8   
environment.yml
Cleaned up conda environment dependencies                               

environment.yml

• Removed build-specific dependencies (conda-build, anaconda-client)
• Removed deprecated pandas_schema and setuptools
• Simplified dependency list to runtime requirements only

+5/-9     
requirements.txt
Updated dependencies for Pydantic validation system           

requirements.txt

• Removed deprecated pandas_schema dependency
• Added pydantic>=2.0.0 for new validation system
• Removed setuptools as it's no longer needed

+1/-6     
requirements-dev.txt
Enhanced development dependencies with type checking         

requirements-dev.txt

• Added mypy and related type checking dependencies
• Added pandas-stubs for better pandas type checking
• Enhanced development tooling for type safety

+5/-0     
Formatting
2 files
test_convert_openms.py
Code cleanup and import optimization                                         

tests/test_convert_openms.py

• Updated import statements to use tuple unpacking
• Replaced unused variable assignments with underscore
• Minor code cleanup and formatting improvements

+3/-4     
modifications.xml
XML schema simplification for modifications                           

sdrf_pipelines/maxquant/modifications.xml

• Removed XML schema declarations from modifications file
• Simplified XML structure while maintaining content

+2/-2     
Documentation
3 files
__init__.py
Module documentation for SDRF v2                                                 

sdrf_pipelines/sdrf/__init__.py

• Added module docstring describing the new Pydantic-based validation system

+3/-0     
README.md
Comprehensive documentation for new validation system       

README.md

• Added extensive documentation for new schema-based validation system
• Documented YAML schema features, inheritance, and custom template creation
• Added examples of JSON schema definitions and validation commands
• Explained the new simplified validation interface

+106/-1 
CONTRIBUTING.md
Updated development workflow documentation                             

CONTRIBUTING.md

• Updated development setup instructions for Poetry and modern Python
• Added mypy type checking documentation
• Enhanced pre-commit hooks documentation
• Improved development workflow guidance

+25/-12 
Configuration changes
14 files
minimum.yaml
Minimum SDRF schema definition with ontology validation   

sdrf_pipelines/sdrf/schemas/minimum.yaml

• Defined minimum SDRF schema with 12 required columns
• Specified ontology validators for organism, cell type, label, and instrument fields
• Added comprehensive field descriptions and validation rules

+176/-0 
ci.yml
Enhanced CI with dev branch support and type checking       

.github/workflows/ci.yml

• Added dev branch to CI triggers for push and pull requests
• Updated Python version from 3.12 to 3.10 for consistency
• Added mypy type checking job with proper dependencies
• Updated package installation to use pip install .

+39/-19 
pyproject.toml
Migration to Poetry build system with modern tooling         

pyproject.toml

• Migrated from setup.py to Poetry-based configuration
• Added comprehensive project metadata and dependencies
• Configured development tools (black, isort, mypy, pydantic-mypy)
• Added proper package structure and script entry points

+65/-3   
human.yaml
Human-specific SDRF schema with specialized validation     

sdrf_pipelines/sdrf/schemas/human.yaml

• Defined human-specific SDRF schema extending default schema
• Added fields for ancestry, age, sex, developmental stage, and individual
• Configured ontology validation for EFO terms and pattern matching for age

+66/-0   
pythonpackage.yml
Python version updates and installation improvements         

.github/workflows/pythonpackage.yml

• Updated Python version matrix to remove 3.9 and add 3.13
• Standardized pip installation commands with python -m pip
• Updated flake8 line length to 120 characters
• Improved package installation process

+7/-7     
pythonapp.yml
Poetry integration and Python version standardization       

.github/workflows/pythonapp.yml

• Updated Python version from 3.11 to 3.10
• Added Poetry installation and usage for dependency management
• Updated test execution to use specific test selection

+7/-6     
.pre-commit-config.yaml
Updated pre-commit hooks with mypy integration                     

.pre-commit-config.yaml

• Updated all hook versions to latest releases
• Added mypy pre-commit hook with additional dependencies
• Enhanced code quality checks with type validation

+11/-6   
meta.yaml
Conda recipe update for Poetry build system                           

recipe/meta.yaml

• Updated build system to use Poetry instead of setuptools
• Reorganized dependencies and removed deprecated packages
• Added pydantic and updated package versions

+8/-9     
cell_lines.yaml
Cell line SDRF schema with ontology validation                     

sdrf_pipelines/sdrf/schemas/cell_lines.yaml

• Created cell line-specific schema extending human schema
• Added cell line field with CLO and BTO ontology validation
• Provided examples of common cell lines (HeLa, HEK293, MCF7)

+20/-0   
default.yaml
Default SDRF schema with disease field validation               

sdrf_pipelines/sdrf/schemas/default.yaml

• Created default schema extending minimum schema
• Added disease field with MONDO and EFO ontology validation
• Configured warning-level validation for disease terms

+21/-0   
nonvertebrates.yaml
Non-vertebrate SDRF schema definition                                       

sdrf_pipelines/sdrf/schemas/nonvertebrates.yaml

• Created non-vertebrate schema extending default schema
• Added developmental stage and strain/breed fields
• Configured appropriate validation rules for non-vertebrate samples

+14/-0   
plants.yaml
Plant-specific SDRF schema definition                                       

sdrf_pipelines/sdrf/schemas/plants.yaml

• Created plant-specific schema extending minimum schema
• Added developmental stage and strain/breed fields for plant samples
• Configured validation rules appropriate for plant biology

+14/-0   
conda-build.yml
Added conda recipe verification to build process                 

.github/workflows/conda-build.yml

• Added conda recipe verification step before building
• Enhanced build process with validation checks

+4/-0     
vertebrates.yaml
Vertebrate SDRF schema definition                                               

sdrf_pipelines/sdrf/schemas/vertebrates.yaml

• Created vertebrate schema extending default schema
• Added developmental stage field for vertebrate-specific validation

+9/-0     
Additional files
4 files
sdrf_schema.py +0/-631 
add_data_analysis_param.py +0/-237 
param2sdrf.yml +0/-227 
setup.py +0/-65   

Append combined_factors values to a list and use that list as the final column, avoiding row_index, which seems to fail mypy.
The failure was not caused by the = inside the CS= value being unhandled, but by the space after ;, which left the key as " CS" so that it did not match "CS".
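
The key mismatch described above can be reproduced and avoided with a small parser that strips whitespace around keys and splits on the first `=` only. This is an illustrative sketch; `parse_cell` and the sample cell value are hypothetical and not taken from the repository.

```python
def parse_cell(cell: str) -> dict[str, str]:
    """Parse 'KEY=value; KEY=value' cells into a dict with whitespace-free keys."""
    pairs: dict[str, str] = {}
    for part in cell.split(";"):
        if "=" not in part:
            continue
        # Split on the first '=' only, so '=' inside the value is preserved.
        key, value = part.split("=", 1)
        # Without strip(), a part such as ' CS=...' would produce the key ' CS'.
        pairs[key.strip()] = value.strip()
    return pairs
```

For example, `parse_cell("NT=Trypsin; CS=(?<=[KR])(?!P)")` yields a `"CS"` key even though the raw part starts with a space and the value itself contains `=`.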
All databases used were retrieved on 29/08/2025.
Update the validate method so that it no longer fails on the first error raised by self.ontology_term_parser or self.validate_ontology_terms; both calls are now wrapped in try/except and any error is added to the errors list.
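
Collecting errors instead of failing on the first one, as described above, follows a common pattern. The sketch below uses stand-in `parse_term` and `check_term` functions in place of self.ontology_term_parser and self.validate_ontology_terms, whose real signatures are not shown here.

```python
from typing import Callable


def parse_term(cell: str) -> dict[str, str]:
    """Stand-in parser: accepts a single 'KEY=value' pair."""
    if "=" not in cell:
        raise ValueError(f"cannot parse {cell!r}")
    key, value = cell.split("=", 1)
    return {key.strip(): value.strip()}


def check_term(terms: dict[str, str]) -> None:
    """Stand-in check: requires an NT key."""
    if "NT" not in terms:
        raise ValueError("missing NT key")


def collect_errors(
    cells: list[str],
    parse: Callable[[str], dict[str, str]],
    check: Callable[[dict[str, str]], None],
) -> list[str]:
    """Validate every cell, recording failures rather than stopping at the first."""
    errors: list[str] = []
    for row, cell in enumerate(cells):
        try:
            check(parse(cell))
        except ValueError as exc:
            errors.append(f"row {row}: {exc}")
    return errors
```

The caller then receives the full list of problems in one pass.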
@ypriverol ypriverol requested a review from Copilot September 1, 2025 11:12

This comment was marked as outdated.

Range matching for characteristics[age] annotations is broken with the current regex. This update makes the regex somewhat more complex to handle ranges, and adds the annotated SDRF file for PXD001474 from bigbio/proteomics-sample-metadata as a reference for testing ranges.

Also update a docstring to correctly refer to the input parameter name in OntologyValidator.validate
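
To make the age range matching described above concrete, here is a hypothetical pattern that accepts single ages such as `40Y` or `40Y6M` and ranges such as `25Y-35Y`. The unit letters and the overall shape are assumptions for illustration; the regex actually committed in the schema may differ.

```python
import re

# One age: optional years, months, weeks, days, in that order (e.g. 40Y6M).
_AGE = r"(?:\d+Y)?(?:\d+M)?(?:\d+W)?(?:\d+D)?"
# The lookahead (?=\d) rejects the empty string, since every part of _AGE is optional.
AGE_PATTERN = re.compile(rf"(?=\d){_AGE}(?:-(?=\d){_AGE})?")


def is_valid_age(value: str) -> bool:
    """Return True for a single age or an age range written as 'start-end'."""
    return AGE_PATTERN.fullmatch(value) is not None
```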
add to reference folder under tests
Update the validators to include a unique values validator that checks whether a column contains only unique values. It is then used to validate the assay name column.
Remove the single-column unique values validator from minimum.yml and add a multiple-column unique values validator that can be used in yml to validate unique combinations of several columns. By default it is applied to the combination of assay name and source name.
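
A check for duplicate combinations across several columns, as described above, can be sketched with plain Python rows. The function name and the list-of-dicts row format are assumptions; the real validator operates on the SDRF data frame.

```python
from collections import defaultdict


def duplicate_combinations(
    rows: list[dict[str, str]], columns: list[str]
) -> dict[tuple[str, ...], list[int]]:
    """Map each repeated combination of column values to the row indices where it occurs."""
    seen: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        seen[tuple(row[c] for c in columns)].append(index)
    return {combo: idx for combo, idx in seen.items() if len(idx) > 1}
```

An empty result means every combination, for instance of source name and assay name, is unique.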
@ypriverol ypriverol requested a review from Copilot September 8, 2025 10:01
Contributor

Copilot AI left a comment


Pull Request Overview

This PR represents a major overhaul of the SDRF pipelines package, transitioning from a legacy validation system to a modern schema-driven validation architecture. The changes span the entire codebase with a complete refactor of the validation framework, build system migration to Poetry, and extensive modernization.

  • Complete validation system replacement: Legacy validation replaced with Pydantic-based schema system with YAML/JSON schema definitions and modular validators
  • New schema templates: Added organism-specific SDRF schemas (minimum, default, human, plants, vertebrates, cell lines) with inheritance support
  • Modernized codebase: Migrated from pkg_resources to pathlib.Path, updated type annotations, and enhanced error handling throughout

Reviewed Changes

Copilot reviewed 57 out of 81 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdrf_pipelines/sdrf/validators.py | New modular validator system with comprehensive validation rules |
| sdrf_pipelines/sdrf/schemas.py | Schema-driven validation system with inheritance support |
| sdrf_pipelines/parse_sdrf.py | CLI refactoring with new schema-based validation system |
| tests/test_ols.py | Complete test coverage for OLS client functionality |
| sdrf_pipelines/openms/openms.py | OpenMS converter refactoring with improved type safety |
| sdrf_pipelines/ols/ols.py | OLS client improvements with better error handling |
| setup.py | Removed (migrated to Poetry) |
| sdrf_pipelines/sdrf/sdrf.py | Complete refactor using Pydantic models |
Comments suppressed due to low confidence (1)

sdrf_pipelines/sdrf/schemas.py:1

  • Using bare except Exception catches all exceptions including system errors. Consider catching specific exceptions like ValueError, KeyError, or RuntimeError that are expected during schema validation.
import json



def __init__(self, params: dict[str, Any] | None = None, **data: Any):
    super().__init__(**data)
    logging.info(params)

Copilot AI Sep 8, 2025


[nitpick] Logging parameters at INFO level in a validator initialization may be too verbose for production use. Consider using DEBUG level or removing this logging statement.

Suggested change
- logging.info(params)
+ logging.debug(params)

Rename multiple_collumn_unique_values_validator to combination_of_columns_no_duplicate_validator. Add single_cardinality_validator to validate that a column contains exactly one unique value. Remove mentions of cardinality outside the yml validators section.
Add proteomics data acquisition method to the minimum template. Add column_name_warning to the combination no-duplicate validator so that combinations can be specified that are reported as warnings rather than errors. Add the ability to write a data table of all errors and warnings from the parse_sdrf validate_sdrf function via an optional -o or --out parameter.
Add a ValidationProof class that generates a SHA512 hash for a validation run from a combination of the SDRF content, the template content, the validator version, a timestamp, and an optional unique salt string. The validate_sdrf function in parse_sdrf also gains optional parameters for producing a validation proof JSON file and/or printing the hash to the CLI.
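
The validation proof described above amounts to hashing several inputs together with SHA512. The sketch below shows one way to do that with `hashlib`; the function name, argument order, and the use of a NUL separator between fields are assumptions, not the ValidationProof implementation.

```python
import hashlib


def validation_proof(
    sdrf_content: str,
    template_content: str,
    validator_version: str,
    timestamp: str,
    salt: str = "",
) -> str:
    """Return a SHA512 hex digest over all inputs, in a fixed order."""
    digest = hashlib.sha512()
    for part in (sdrf_content, template_content, validator_version, timestamp, salt):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so field boundaries are unambiguous
    return digest.hexdigest()
```

Identical inputs always give the same 128-character hex string, while changing any field, including the salt, changes the hash.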
Change most print statements in schemas.py to logging.debug, keeping only the print statement in the debug_col function, which some of the tests require.
Update the read_sdrf method so that it can return an SDRFDataFrame object instead of just a pd.DataFrame, and update SDRFDataFrame to accept either a plain pd.DataFrame or an SDRFDataFrame. Add an SDRFMetadata class for automated parsing of the metadata line of an SDRF file.
Fix console output behaviour when validate_sdrf finds only warnings and no errors. Add initial support for merging schemas after they have been loaded from files.

8 participants