
Machine Learning Libraries, Computing Frameworks and Programming Languages

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. GPUs are more accessible to average individuals, and they are still the main tool for deep learning right now.

  2. Python distributed computing frameworks (Ray, Modin, etc.) serve as a middle-ground solution between GPU and Spark: they can handle more data than a GPU but less than Spark, and they are easier to use and maintain than Spark (see the Ray sketch after this list).

  3. Even though many libraries make it possible to run deep learning on Spark, I still don't think it is the right choice unless you have really large data that cannot be handled by other frameworks. Such situations are rare: real big data mostly occurs in the ETL and preprocessing stages rather than in the model training stage.

  4. Python and Rust are good language choices. C is not productive, and C++ is too complicated. JVM-based languages are currently first-class citizens for production, but Rust seems to have a bright future.

  5. As Kubernetes develops, there will be distributed computing frameworks that do not limit you to a specific language. Once that becomes common, people will start shifting away from JVM languages toward solutions with better performance and ease of use. Rust is a good choice for performance, while Python is a good, easy-to-use glue language.
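To make item 2 concrete, here is a minimal Ray sketch; the `preprocess` function and its inputs are made up for illustration. Turning an ordinary Python function into a distributed task takes little more than a decorator, which is the kind of low friction that Spark cannot match.

```python
import ray

ray.init()  # starts Ray locally; connects to a cluster if one is configured


@ray.remote
def preprocess(chunk):
    # placeholder for real per-chunk work
    return sum(chunk)


# fan tasks out across cores (or cluster nodes) and gather the results
futures = [preprocess.remote(list(range(i, i + 100))) for i in range(0, 1000, 100)]
print(ray.get(futures))
```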

Machine Learning Frameworks

scikit-learn

LightGBM / XGBoost

PyTorch

TensorFlow 2

Apache MXNet

Caffe2 is often used for productionizing models trained in PyTorch, and it is now part of the PyTorch project.
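The usual bridge from PyTorch training to Caffe2 (or any other ONNX-capable runtime) is an ONNX export. A minimal sketch, assuming a torchvision model as a stand-in for your own trained network:

```python
import torch
import torchvision

# any trained torch.nn.Module works here; a pretrained ResNet is just a stand-in
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input with the expected shape

# write an ONNX graph that Caffe2 can load for inference
torch.onnx.export(model, dummy_input, "resnet18.onnx")
```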

Notice that H2O-3 (which is less popular and of lower quality than the libraries above), AI-Blocks, and NVIDIA DIGITS provide user-friendly UIs for training models.

Computing Frameworks

Multi-threading and multi-processing are not discussed here since they are relatively simple for scientific computing.

GPU

DeepSpeed (with its ZeRO optimizer) is a deep learning optimization library that makes distributed training on GPU clusters easy, efficient, and effective.
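A minimal sketch of how DeepSpeed wraps an existing PyTorch model; the toy model, random data, and config values below are illustrative assumptions rather than a tested recipe, and a real job would be launched with the `deepspeed` launcher on a GPU cluster.

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 10)  # toy stand-in for a real network

# ZeRO stage 2 partitions optimizer state and gradients across GPUs
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},
}

# deepspeed.initialize returns an engine that hides the distributed details
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# random batches as a stand-in for a real DataLoader
for _ in range(4):
    batch = torch.randn(8, 1024).to(engine.device)
    labels = torch.randint(0, 10, (8,)).to(engine.device)
    loss = torch.nn.functional.cross_entropy(engine(batch), labels)
    engine.backward(loss)  # replaces loss.backward()
    engine.step()          # replaces optimizer.step()
```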

Ray

Python Distributed Computing Frameworks (Ray, Celery, Dask, Modin, etc.)

Spark

TPU

Model Serving

[cortexlabs/cortex](https://github.com/cortexlabs/cortex): multi-framework machine learning model serving infrastructure.

Ray Serve
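As a sketch of what serving looks like with Ray Serve (following the Ray 2.x API; the toy sentiment model is made up to keep the example self-contained):

```python
import requests
from ray import serve
from starlette.requests import Request


@serve.deployment
class SentimentModel:
    def __init__(self):
        # a real deployment would load model weights here
        self.positive_words = {"good", "great", "excellent"}

    async def __call__(self, request: Request) -> dict:
        text = (await request.json())["text"]
        score = sum(word in self.positive_words for word in text.lower().split())
        return {"positive": score > 0}


serve.run(SentimentModel.bind())  # starts an HTTP endpoint at localhost:8000
print(requests.post("http://localhost:8000/", json={"text": "a great result"}).json())
```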

TFX

Torch Serve

Programming Languages

Python

Rust

JVM (Java, Scala, Kotlin)

C/C++

References

  • [Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey](https://link.springer.com/article/10.1007/s10462-018-09679-z)

  • caffe2/AICamera is a demonstration of using Caffe2 inside an Android application.

  • Comparison of AI Frameworks

  • Accelerating Deep Learning Using Distributed SGD — An Overview

  • Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

  • Scalable Distributed DL Training: Batching Communication and Computation

  • Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

  • A Hitchhiker’s Guide On Distributed Training of Deep Neural Networks

  • Distributed training of Deep Learning models with PyTorch

  • Awesome Distributed Deep Learning

  • Intro to Distributed Deep Learning Systems

  • Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
