Handle Imbalanced Data in Machine Learning

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

scikit-learn-contrib/imbalanced-learn is a Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning.

Type of Imbalanced Data

Impact of Imbalanced Data on Decision Tree

Evaluation

Ways to Handle Imbalanced Data

Undersampling

Border-based Approaches

Tomek Links: a Tomek link is a pair of examples from opposite classes that are each other's nearest neighbor. Removing the majority-class example of each Tomek link makes the border between the classes clearer.
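The Tomek link definition can be sketched in plain Python as below. This is only an illustration under simple assumptions (Euclidean distance, brute-force neighbor search, a toy dataset made up for the example); the function names `tomek_links` and `remove_majority_in_links` are hypothetical. For real work, imbalanced-learn ships a `TomekLinks` under-sampler.

```python
from math import dist


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        # Brute-force nearest neighbor of X[i] among all other points.
        return min((k for k in range(len(X)) if k != i),
                   key=lambda k: dist(X[i], X[k]))

    nn = [nearest(i) for i in range(len(X))]
    # A Tomek link: mutual nearest neighbors with different labels.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy data (made up): label 0 is the majority class, label 1 the minority.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5), (9.0, 9.0), (9.0, 10.0)]
y = [0, 0, 1, 0, 0, 0]

links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)
```

Here only the minority point at index 2 and the majority point at index 3 are mutual nearest neighbors of opposite classes, so only the majority point at index 3 is removed.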

SMOTE

SMOTE (Synthetic Minority Oversampling TEchnique) synthesizes new minority-class examples instead of duplicating existing ones. This breaks the ties introduced by simple oversampling and augments the original data. It has shown great success in various applications.
Similar to mixup for deep learning
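The core interpolation step of SMOTE can be sketched as follows. This is a minimal stdlib-only illustration, not the imbalanced-learn implementation: the function name `smote`, its parameters, and the toy points are assumptions made for the example. Each synthetic point is placed at a random position on the line segment between a minority example and one of its k nearest minority neighbors.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority examples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of base (excluding itself).
        neighbors = sorted((j for j in range(len(minority)) if j != i),
                           key=lambda j: dist(base, minority[j]))[:k]
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class (made up for the example).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (1.2, 1.4)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies between two existing minority examples, the new points stay inside the region already occupied by the minority class. The library version is available as `SMOTE` in `imblearn.over_sampling`.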

Variation of SMOTE

Borderline-SMOTE, ADASYN, SMOTE + Undersampling, SMOTE-NC (nominal and continuous features), SMOTE-N (nominal features only).

Sampling + Data Cleaning

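OSS (One-Sided Selection): CNN (Condensed Nearest Neighbour) + Tomek Links.

NCL (Neighbourhood Cleaning Rule): based on ENN (Edited Nearest Neighbours).

SMOTE + ENN.

SMOTE + Tomek Links.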

Adjusting Algorithms

Class weights.

Decision threshold.

Modify an algorithm to be more sensitive to rare classes.
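The first two adjustments can be sketched without any library. The weighting rule below, n_samples / (n_classes * class_count), is the same heuristic scikit-learn uses for `class_weight="balanced"`; the function names, the 90/10 label split, and the probabilities are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def predict_with_threshold(probs, threshold=0.5):
    """Turn predicted minority-class probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


# Toy labels: 90 majority (0) and 10 minority (1) examples.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)

# Hypothetical predicted probabilities of the minority class.
probs = [0.05, 0.30, 0.45, 0.80]
default_preds = predict_with_threshold(probs)                  # threshold 0.5
lowered_preds = predict_with_threshold(probs, threshold=0.25)  # favors recall
```

With these toy labels the minority class gets a weight of 5.0 and the majority class a weight of about 0.56. Lowering the threshold from 0.5 to 0.25 flags three of the four examples as minority instead of one, trading precision for recall.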

Box Drawings

Construct boxes (axis-parallel hyper-rectangles) around minority-class examples. This gives a concise, intelligible representation of the minority class. The number of boxes is penalized.

Exact Boxes: a mixed-integer programming formulation; an exact but fairly expensive solution.

Fast Boxes: a faster clustering method generates the initial boxes, which are then refined.

Both perform well across a large set of test datasets.

Anomaly Detection - Isolation Forest

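Identify anomalies in the data by learning a forest of random trees. Measure the average number of random splits needed to isolate each point; points that are isolated in fewer splits are more anomalous. From this, calculate each data point's anomaly score (its likelihood of belonging to the minority).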
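The idea of counting splits can be sketched as below. This is a simplified stdlib-only illustration, not the full algorithm: it omits the leaf-size correction and the normalized score of the original method, and the function names, parameters, and toy data are assumptions. scikit-learn provides a complete implementation as `sklearn.ensemble.IsolationForest`.

```python
import random


def build_tree(points, rng, depth=0, max_depth=10):
    """Recursively split on a random feature at a random value."""
    if len(points) <= 1 or depth >= max_depth:
        return None  # leaf
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return None  # cannot split identical values
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return (dim, split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def path_length(tree, x):
    """Number of splits x passes through before reaching a leaf."""
    depth = 0
    while tree is not None:
        dim, split, left, right = tree
        tree = left if x[dim] < split else right
        depth += 1
    return depth


def average_path_length(data, x, n_trees=100, sample_size=32, seed=0):
    """Average path length of x over trees built on random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(build_tree(sample, rng), x)
    return total / n_trees


# Toy data (made up): a Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
data.append((8.0, 8.0))

outlier_depth = average_path_length(data, (8.0, 8.0))
inlier_depth = average_path_length(data, (0.0, 0.0))
```

The far-away point is separated from the cluster after only a few random splits, so its average path length is shorter than that of a point in the middle of the cluster. A shorter average path corresponds to a higher anomaly score.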

References

https://www.youtube.com/watch?v=YMPMZmlH5Bo

http://storm.cis.fordham.edu/~gweiss/small_disjuncts.html

https://www.svds.com/learning-imbalanced-classes/