Releases: scienxlab/redflag
v0.5.0
- This release makes more changes to the tests and documentation in response to the review process for the submission to JOSS (see below).
- In particular, see the following issue: #97
- Changed the method of handling dynamic versioning. For now the package `__version__` attribute is still defined, but it is deprecated and will be removed in 0.6.0. Use `importlib.metadata.version('redflag')` to get the version information instead.
- Changed the default `get_outliers()` method from isolation forest (`'iso'`) to Mahalanobis (`'mah'`) to match other functions, e.g. `has_outliers()` and the `sklearn` pipeline object.
- Updated `actions/setup-python` to use v5.
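The versioning change above can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `get_version` is hypothetical and not part of `redflag`, and returning `None` for a missing distribution is a choice made for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def get_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; not part of redflag.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the installed version string, or None if redflag is absent.
    print(get_version("redflag"))
```

Reading the version from the installed metadata avoids relying on the deprecated `__version__` attribute.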
v0.5.0-rc1
Checking CI pipeline
v0.4.2
- This is a minor release making changes to the tests and documentation in response to the review process for a submission to The Journal of Open Source Software (JOSS).
- See the following issues: #89, #90, #91, #92, #93, #94 and #95.
- Now building and testing on Windows and MacOS as well as Linux.
- Python version `3.12` added to package classifiers.
- Python version `3.12` tested during CI.
v0.4.1
- This is a minor release intended to preview new `pandas`-related features for version 0.5.0.
- Added another `pandas` Series accessor, `is_imbalanced()`.
- Added two `pandas` DataFrame accessors, `feature_importances()` and `correlation_detector()`. These are experimental features.
v0.4.1-rc1
Testing CI
v0.4.0
- `redflag` can now be installed by the `conda` package and environment manager. To do so, use `conda install -c conda-forge redflag`.
- All of the `sklearn` components can now be instantiated with `warn=False` in order to trigger a `ValueException` instead of a warning. This allows you to build pipelines that will break if a detector is triggered.
- Added `redflag.target.is_ordered()` to check if a single-label categorical target is ordered in some way. The test uses a Markov chain analysis, applying a chi-squared test to the transition matrix. In general, the Boolean result should only be used on targets with several classes, perhaps at least 10. Below that, it seems to give a lot of false positives.
- You can now pass `groups` to `redflag.distributions.is_multimodal()`. If present, the modality will be checked for each group, returning a Boolean array of values (one for each group). This allows you to check a feature partitioned by target class, for example.
- Added `redflag.sklearn.MultimodalityDetector` to provide a way to check for multimodal features. If `y` is passed and is categorical, it will be used to partition the data and modality will be checked for each class.
- Added `redflag.sklearn.InsufficientDataDetector` which checks that there are at least M² records (rows in `X`), where M is the number of features (i.e. columns) in `X`.
- Removed `RegressionMultimodalDetector`. Use `MultimodalityDetector` instead.
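As a rough sketch of two ideas from this release, the example below combines the M² record check with the warn-or-raise switch. It is not the `redflag` implementation: the function `check_sufficient_data` is hypothetical, it works on plain nested lists rather than arrays, and it raises Python's built-in `ValueError` as an assumption about what the error looks like.

```python
import warnings
from typing import Sequence


def check_sufficient_data(X: Sequence[Sequence[float]], warn: bool = True) -> bool:
    """Check that X has at least M**2 rows, where M is the number of columns.

    Hypothetical helper: emits a warning when warn=True, and raises
    ValueError (an assumed exception type) when warn=False.
    """
    n_rows = len(X)
    n_cols = len(X[0]) if n_rows else 0
    if n_rows >= n_cols ** 2:
        return True
    message = (
        f"Dataset has {n_rows} records; at least {n_cols ** 2} "
        f"are needed for {n_cols} features."
    )
    if warn:
        warnings.warn(message)
        return False
    raise ValueError(message)
```

With `warn=False`, a failed check stops execution, which is the behaviour you would want in a pipeline that should break rather than continue with a warning.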
v0.3.0
- Added some accessors to give access to `redflag` functions directly from `pandas.Series` objects, via an 'accessor'. For example, for a Series `s`, one can call `minority_classes = s.redflag.minority_classes()` instead of `redflag.minority_classes(s)`. Other functions include `imbalance_degree()`, `dummy_scores()` (see below). Probably not very useful yet, but future releases will add some reporting functions that wrap multiple Redflag functions. This is an experimental feature and subject to change.
- Added a Series accessor `report()` to perform a range of tests and make a small text report suitable for printing. For a Series `s`, call it with `s.redflag.report()`. This is an experimental feature and subject to change.
- Added new documentation page for the Pandas accessor.
- Added `redflag.target.dummy_classification_scores()`, `redflag.target.dummy_regression_scores()`, which train a dummy (i.e. naive) model and compute various relevant scores (MSE and R² for regression, F1 and ROC-AUC for classification tasks). Additionally, both `most_frequent` and `stratified` strategies are tested for classification tasks; only the `mean` strategy is employed for regression tasks. The helper function `redflag.target.dummy_scores()` tries to guess what kind of task suits the data and calls the appropriate function.
- Moved `redflag.target.update_p()` to `redflag.utils`.
- Added `is_imbalanced()` to return a Boolean depending on a threshold of imbalance degree. Default threshold is 0.5 but the best value is up for debate.
- Removed `utils.has_low_distance_stdev`.
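To illustrate what a dummy regression baseline measures, here is a minimal standard-library sketch of the `mean` strategy. The function name `dummy_regression_scores_sketch` and its return format are assumptions for this example; the real `redflag` function may compute and report its scores differently.

```python
from statistics import fmean
from typing import Dict, Sequence


def dummy_regression_scores_sketch(y: Sequence[float]) -> Dict[str, float]:
    """Score a naive model that always predicts the mean of y.

    Hypothetical illustration, evaluated on the same data it was 'trained' on.
    """
    values = list(y)
    prediction = fmean(values)  # the 'mean' strategy
    residual_ss = sum((v - prediction) ** 2 for v in values)
    total_ss = sum((v - fmean(values)) ** 2 for v in values)
    mse = residual_ss / len(values)
    # R2 is zero by construction for the mean predictor on its own data.
    r2 = 1.0 - residual_ss / total_ss if total_ss > 0 else 0.0
    return {"mse": mse, "r2": r2}
```

Any real model should beat these numbers; if it does not, the features probably carry little information about the target.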
v0.2.0
- Moved to something more closely resembling semantic versioning, which is the main reason this is version 0.2.0.
- Builds and tests on Python 3.11 have been successful, so now supporting this version.
- Added custom 'alarm' `Detector`, which can be instantiated with a function and a warning to emit when the function returns True for a 1D array. You can easily write your own detectors with this class.
- Added `make_detector_pipeline()` which can take sequences of functions and warnings (or a mapping of functions to warnings) and returns a `sklearn.pipeline.Pipeline` containing a `Detector` for each function.
- Added `RegressionMultimodalDetector` to allow detection of non-unimodal distributions in features, when considered across the entire dataset. (Coming soon, a similar detector for classification tasks that will partition the data by class.)
- Redefined `is_standardized` (deprecated) as `is_standard_normal`, which implements the Kolmogorov–Smirnov test. It seems more reliable than assuming the data will have a mean of almost exactly 0 and standard deviation of exactly 1, when all we really care about is that the feature is roughly normal.
- Changed the wording slightly in the existing detector warning messages.
- No longer warning if `y` is `None` in, e.g., `ImportanceDetector`, since you most likely know this.
- Some changes to `ImportanceDetector`. It now uses KNN estimators instead of SVMs as the third measure of importance; the SVMs were too unstable, causing numerical issues. It also now requires that the number of important features is less than the total number of features to be triggered. So if you have 2 features and both are important, it does not trigger.
- Improved `is_continuous()` which was erroneously classifying integer arrays with many consecutive values as non-continuous.
- Note that `wasserstein` no longer checks that the data are standardized; this check will probably return in the future, however.
- Added a `Tutorial.ipynb` notebook to the docs.
- Added a Copy button to code blocks in the docs.
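The 'alarm' detector idea, a function plus a warning applied to each 1D feature, can be sketched without any dependencies. The class `SimpleDetector` and the function `has_negative` below are hypothetical stand-ins, not the `redflag` API, and the choice to warn in `transform()` is an assumption made for this example.

```python
import warnings
from typing import Callable, List, Sequence


class SimpleDetector:
    """Hypothetical stand-in for an 'alarm' detector; not redflag's class."""

    def __init__(self, func: Callable[[List[float]], bool], message: str):
        self.func = func
        self.message = message

    def fit(self, X: Sequence[Sequence[float]], y=None) -> "SimpleDetector":
        # Nothing is learned; the detector only inspects data.
        return self

    def transform(self, X: Sequence[Sequence[float]]):
        # Apply the function to each column (a 1D sequence) in turn.
        columns = [list(col) for col in zip(*X)]
        flagged = [i for i, col in enumerate(columns) if self.func(col)]
        if flagged:
            warnings.warn(f"Feature(s) {flagged} {self.message}")
        return X  # The data pass through unchanged.


def has_negative(values: List[float]) -> bool:
    """Example alarm function: True if any value is below zero."""
    return any(v < 0 for v in values)


if __name__ == "__main__":
    X = [[1.0, -2.0], [3.0, 4.0]]
    detector = SimpleDetector(has_negative, "contain negative values.")
    detector.fit(X).transform(X)  # Emits a warning naming feature 1.
```

Because the detector returns its input untouched, it can sit anywhere in a chain of transformers without affecting the data.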
v0.1.10
- Added `redflag.importance.least_important_features()` and `redflag.importance.most_important_features()`. These functions are complementary (in other words, if the same threshold is used in each, then between them they return all of the features). The default threshold for importance is half the expected value. E.g. if there are 5 features, then the default threshold is half of 0.2, or 0.1. Part of Issue 2.
- Added `redflag.sklearn.ImportanceDetector` class, which warns if 1 or 2 features have anomalously high importance, or if some features have anomalously low importance. Part of Issue 2.
- Added `redflag.sklearn.ImbalanceComparator` class, which learns the imbalance present in the training data, then compares what is observed in subsequent data (evaluation, test, or production data). If there's a difference, it throws a warning. Note: it does not warn if there is imbalance present in the training data; use `ImbalanceDetector` for that.
- Added `redflag.sklearn.RfPipeline` class, which is needed to include the `ImbalanceComparator` in a pipeline (because the common-or-garden `sklearn.pipeline.Pipeline` class does not pass `y` into a transformer's `transform()` method). Also added the `redflag.sklearn.make_rf_pipeline()` function to help make pipelines with this special class. These components are straight-up forks of the code in `scikit-learn` (3-clause BSD licensed).
- Added example to `docs/notebooks/Using_redflag_with_sklearn.ipynb` to show how to use these new objects.
- Improved `redflag.is_continuous()`, which was buggy; see Issue 3. It still fails on some cases. I'm not sure a definitive test for continuousness (or, conversely, discreteness) is possible; it's just a heuristic.
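The default importance threshold and the complementary nature of the two functions can be shown with a small sketch. The function names below are hypothetical, the importances are assumed to sum to 1, and which side of the threshold a tie falls on is an assumption; the `redflag` functions may differ in all of these details.

```python
from typing import List, Optional, Sequence


def default_threshold(n_features: int) -> float:
    """Half the expected importance of 1 / n_features."""
    return 0.5 / n_features


def most_important(importances: Sequence[float],
                   threshold: Optional[float] = None) -> List[int]:
    """Indices of features at or above the threshold (tie handling is assumed)."""
    t = default_threshold(len(importances)) if threshold is None else threshold
    return [i for i, v in enumerate(importances) if v >= t]


def least_important(importances: Sequence[float],
                    threshold: Optional[float] = None) -> List[int]:
    """Indices of features strictly below the threshold."""
    t = default_threshold(len(importances)) if threshold is None else threshold
    return [i for i, v in enumerate(importances) if v < t]
```

With five features the default threshold is 0.1, and every feature index lands in exactly one of the two lists.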
v0.1.9
- Added some experimental `sklearn` transformers that implement various `redflag` tests. These do not transform the data in any way, they just inspect the data and emit warnings if tests fail. The main ones are: `redflag.sklearn.ClipDetector`, `redflag.sklearn.OutlierDetector`, `redflag.sklearn.CorrelationDetector`, `redflag.sklearn.ImbalanceDetector`, and `redflag.sklearn.DistributionComparator`.
- Added tests for the `sklearn` transformers. These are in the `redflag/tests/test_redflag.py` file, whereas all other tests are doctests. You can run all the tests at once with `pytest`; coverage is currently 94%.
- Added `docs/notebooks/Using_redflag_with_sklearn.ipynb` to show how to use these new objects in an `sklearn` pipeline.
- Since there's quite a bit of `sklearn` code in the `redflag` package, it is now a hard dependency. I removed the other dependencies because they are all dependencies of `sklearn`.
- Added `redflag.has_outliers()` to make it easier to check for excessive outliers in a dataset. This function only uses Mahalanobis distance and always works in a multivariate sense.
- Reorganized the `redflag.features` module into new modules: `redflag.distributions`, `redflag.outliers`, and `redflag.independence`. All of the functions are still imported into the `redflag` namespace, so this doesn't affect existing code.
- Added examples to `docs/notebooks/Basic_usage.ipynb`.
- Removed the `class_imbalance()` function, which was confusing. Use `imbalance_ratio()` instead.
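For readers unfamiliar with the Mahalanobis distance mentioned above, the following sketch computes it for two features using only the standard library. It is not how `redflag` does it: the function `mahalanobis_2d` is hypothetical, it is restricted to two dimensions so the covariance inverse can be written in closed form, and it does not decide how many outliers count as excessive.

```python
from math import sqrt
from statistics import fmean
from typing import List, Sequence, Tuple


def mahalanobis_2d(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Mahalanobis distance of each 2D point from the sample mean.

    Hypothetical illustration limited to two features.
    """
    n = len(points)
    mx = fmean([p[0] for p in points])
    my = fmean([p[1] for p in points])
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]], with n - 1 in the denominator.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("Covariance matrix is singular; features may be collinear.")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^T S^-1 d using the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(max(d2, 0.0)))
    return distances
```

Unlike a per-feature z-score, this distance accounts for correlation between the features, which is what working in a multivariate sense means here.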