Ben Chuanlong Du's Blog

It is never too late to learn.

Handle Imbalanced Data in Machine Learning

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

scikit-learn-contrib/imbalanced-learn is a Python package for tackling the curse of imbalanced datasets in machine learning.
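Most samplers in the library share a common fit_resample interface. A minimal sketch, assuming the package is installed via `pip install imbalanced-learn` and using a toy dataset purely for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# toy binary classification problem with roughly a 9:1 class ratio
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# naive oversampling: duplicate minority examples until the classes are balanced
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```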

Types of Imbalanced Data

  • Intrinsic: the imbalance is a direct result of the nature of the data space.
  • Extrinsic: the imbalance is caused by external factors such as time and/or storage constraints.

  • Between-class imbalance

    • Relative imbalance: one class outnumbers the other by orders of magnitude, but the minority class is not necessarily rare in absolute terms.
    • Rare instances, a.k.a. absolute rarity (e.g., the "pink blood" patient: only a handful of such cases exist at all).
  • Within-class imbalance

  • Data complexity (the primary source of difficulty)

    • Overlapping classes
    • Lack of representative data
    • Small disjuncts
  • The imbalance itself
  • Small sample size

Impact of Imbalanced Data on Decision Trees

  • Fewer and fewer observations of minority class examples result in fewer leaves describing minority concepts and successively weaker confidence estimates.
  • Concepts that depend on different feature space conjunctions can go unlearned because of the sparseness introduced through partitioning.

Evaluation

  • Don't use accuracy (or error rate).
  • Use ROC curves, PR curves, the F1 score, etc.
  • Don't settle for hard classifications; get probability estimates instead (see the sketch after this list).
  • Don't use a 0.5 decision threshold blindly.
  • Check performance curves.
  • Test on data that is representative of what the model will operate on.
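A hedged sketch of this workflow with scikit-learn; the classifier, the toy dataset, and the F1-based threshold rule are illustrative assumptions, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability estimates, not hard labels

print("ROC AUC          :", roc_auc_score(y_test, proba))
print("average precision:", average_precision_score(y_test, proba))

# pick the F1-maximizing threshold from the PR curve instead of a blind 0.5
# (in practice, tune the threshold on a validation set, not the test set)
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
threshold = thresholds[f1.argmax()]
print("chosen threshold :", threshold)
print("F1 at threshold  :", f1_score(y_test, proba >= threshold))
```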

Ways to Handle Imbalanced Data

  • Do nothing.
  • Balance the training set (see the pipeline sketch after this list):

    • Oversampling creates exact copies (ties) of minority examples, which can lead to overfitting.
    • Undersampling throws data away and may miss important concepts.
    • Overall, undersampling is preferred if there is enough data; oversampling might be better when the dataset is very small.

  • Border-based approaches

  • Sampling with Data Cleaning
  • Adjust algorithms
  • Cluster-based Sampling
  • Sampling + Boosting
  • New algorithms
  • Anomaly detection
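A hedged sketch of the "balance the training set" option: imbalanced-learn's pipeline applies the sampler only when fitting, i.e. to the training folds during cross-validation, so the evaluation data stays untouched. The sampler and classifier choices below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

model = make_pipeline(
    RandomUnderSampler(random_state=0),  # drop majority examples to balance classes
    LogisticRegression(max_iter=1_000),
)
scores = cross_val_score(model, X, y, cv=5, scoring="average_precision")
print(scores.mean())
```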

Undersampling

  • EasyEnsemble (recommended)
  • BalanceCascade
  • KNN-based (NearMiss-1, NearMiss-2, NearMiss-3, Most Distant)
  • One-sided selection (OSS)
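A hedged illustration of two of these methods via their imbalanced-learn implementations (EasyEnsembleClassifier and NearMiss); the parameters shown are illustrative:

```python
from sklearn.datasets import make_classification
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# EasyEnsemble: train an ensemble of boosted learners, each on a balanced
# subset that keeps all minority examples plus a random majority subsample
easy = EasyEnsembleClassifier(n_estimators=10, random_state=0).fit(X, y)

# NearMiss-1: keep the majority samples whose average distance to the
# closest minority samples is smallest
X_res, y_res = NearMiss(version=1).fit_resample(X, y)
```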

Border-based Approaches

A Tomek link is a pair of minimally distanced nearest neighbors from opposite classes. Removing the majority-class instance of each Tomek link makes the border between the classes clearer.
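A minimal sketch of Tomek-link cleaning with imbalanced-learn; by default TomekLinks drops only the majority-class member of each link (toy data for illustration):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# remove the majority-class sample of every Tomek link
X_res, y_res = TomekLinks().fit_resample(X, y)
print(len(y), "->", len(y_res))  # a few borderline majority samples are dropped
```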

SMOTE

Synthetic Minority Oversampling TEchnique:

  • Synthesizes new minority class examples, which breaks the ties introduced by simple oversampling and augments the original data.
  • Has shown great success in various applications.
  • Similar in spirit to mixup for deep learning.
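A minimal SMOTE sketch with imbalanced-learn (k_neighbors and the toy dataset are illustrative): each synthetic example is an interpolation between a minority point and one of its k nearest minority neighbors, rather than an exact copy.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# synthesize minority examples by interpolating between nearest minority neighbors
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```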

Variations of SMOTE

  • Borderline-SMOTE
  • ADASYN
  • SMOTE + undersampling
  • SMOTE-NC (mixed nominal and continuous features)
  • SMOTE-N (nominal features only)
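Hedged examples of two of these variants as implemented in imbalanced-learn (SMOTE-NC and SMOTE-N are also available there as SMOTENC and SMOTEN, but require explicit categorical feature indices):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, BorderlineSMOTE

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# Borderline-SMOTE: only synthesize around minority samples near the class border
X_b, y_b = BorderlineSMOTE(random_state=0).fit_resample(X, y)

# ADASYN: synthesize more examples where the minority class is harder to learn
X_a, y_a = ADASYN(random_state=0).fit_resample(X, y)
```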

Sampling + Data Cleaning

  • OSS (CNN + Tomek links)
  • NCL (based on ENN)
  • SMOTE + ENN
  • SMOTE + Tomek links
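A hedged sketch of the oversample-then-clean combinations that imbalanced-learn ships ready-made (toy data for illustration):

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)

# SMOTE followed by Edited Nearest Neighbours cleaning
X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)

# SMOTE followed by removal of Tomek links
X_tl, y_tl = SMOTETomek(random_state=0).fit_resample(X, y)
```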

Adjusting Algorithms

  • Class weights
  • Decision threshold
  • Modify an algorithm to be more sensitive to rare classes
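A hedged sketch of the first two adjustments with scikit-learn: many estimators accept class_weight, and the decision threshold can be moved after training. The 0.3 threshold below is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors inversely to class frequencies
clf = LogisticRegression(max_iter=1_000, class_weight="balanced").fit(X_train, y_train)

# lower the decision threshold to trade precision for recall on the rare class
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)  # 0.3 is an arbitrary illustrative threshold
```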

Box Drawings

  • Construct boxes (axis-parallel hyper-rectangles) around minority class examples.
  • Gives a concise, intelligible representation of the minority class.
  • Penalizes the number of boxes.
  • Exact Boxes: mixed-integer programming; an exact but fairly expensive solution.
  • Fast Boxes: a faster clustering method generates the initial boxes, which are then refined.
  • Both perform well across a large set of test datasets.

Anomaly Detection - Isolation Forest

  • Identifies anomalies in data by learning an ensemble of random trees and measuring the average number of decision splits needed to isolate each point.
  • Computes an anomaly score for each data point (the likelihood that it belongs to the minority).
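A hedged sketch using scikit-learn's IsolationForest; the contamination value is an illustrative guess at the minority fraction:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=2_000, weights=[0.99, 0.01], random_state=0)

# fit random isolation trees; points isolated in few splits look anomalous
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = -iso.score_samples(X)  # higher score = more anomalous
flagged = iso.predict(X) == -1  # -1 marks points judged anomalous
print("flagged as anomalies:", flagged.sum())
```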

References

https://www.youtube.com/watch?v=YMPMZmlH5Bo

http://storm.cis.fordham.edu/~gweiss/small_disjuncts.html

https://www.svds.com/learning-imbalanced-classes/
