Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Computing Frameworks¶
Ray¶
A fast and simple framework for building and running distributed applications.
Ray does not handle large data well (as of 2018/05/28). Please refer to the discussion for details.
https://
https://
https://
pai¶
Resource scheduling and cluster management for AI.
Horovod¶
A framework for distributed training (on GPUs) for TensorFlow, Keras, PyTorch, and Apache MXNet. https://
PetaStorm¶
Petastorm is a Parquet access library that can be used from TensorFlow, PyTorch, or pure Python to load data from Parquet stores directly into an ML framework.
AiiDA¶
Automated interactive infrastructure and database for computational science.
mars¶
It seems to me that Mars focuses on tensor computation. Mars is a tensor-based unified framework for large-scale data computation which scales NumPy, pandas, and scikit-learn.
modin-project/modin¶
Modin specifically scales pandas pipelines. Modin is a DataFrame for datasets from 1KB to 1TB+. Notice that Modin leverages the Ray project.
Modin seems to be a better solution than Dask if you work with data frames. Query: What is the difference between Dask and Modin?
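A rough sketch of why Modin is convenient for data frame work: it exposes the pandas API, so usually only the import line changes. This assumes Modin and one of its engines (Ray or Dask) are installed. The fallback to plain pandas is added here only so the snippet still runs without Modin, and the column names are made up for illustration.

```python
try:
    # Modin's pandas-compatible API; needs an engine such as Ray or Dask.
    import modin.pandas as pd
except ImportError:
    # Fallback for this sketch only: the rest of the code is unchanged.
    import pandas as pd

# Hypothetical toy data, just to show that the usual pandas calls work.
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# The same groupby/sum you would write with pandas.
totals = df.groupby("group")["value"].sum()

# Convert to a plain dict of Python ints for easy inspection.
totals_dict = {key: int(val) for key, val in totals.to_dict().items()}
print(totals_dict)
```

With Dask, by contrast, you work with `dask.dataframe` objects that are evaluated lazily and need an explicit `.compute()`, which is one practical difference between the two.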
Celery¶
https://
RQ¶
RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers. It is backed by Redis and it is designed to have a low barrier to entry. It can be integrated in your web stack easily.
Dask¶
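A minimal sketch of Dask's lazy task graphs using `dask.delayed`, assuming only the core `dask` package is installed. The helper functions `inc` and `add` are hypothetical and exist only for this example.

```python
from dask import delayed


@delayed
def inc(x):
    """Increment a number (wrapped as a lazy Dask task)."""
    return x + 1


@delayed
def add(x, y):
    """Add two numbers (wrapped as a lazy Dask task)."""
    return x + y


# Nothing is computed yet: these calls only build a task graph.
total = add(inc(1), inc(2))

# compute() executes the graph; the two inc tasks are independent,
# so the scheduler is free to run them in parallel.
result = total.compute()
print(result)
```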
GPU Computing¶
Please refer to GPU Computing in Python for more details.
Array Specific¶
numpy
DataFrame Specific¶
cudf, dask, modin, numba, PySpark DataFrame