Please log in to watch this conference skillscast.
Popular ML techniques like Reinforcement learning (RL) and Hyperparameter Optimization (HPO) require a variety of computational patterns for data processing, simulation (e.g., game engines), model search, training, and serving, and other tasks. Few frameworks efficiently support all these patterns, especially when scaling to clusters.
Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales applications from a laptop to a cluster. It was created to address the needs of reinforcement learning and hyperparameter tuning, in particular, but it is broadly applicable for almost any distributed Python-based application, with support for other languages forthcoming.
I'll explain the problems Ray solves and how Ray works. Then I'll discuss RLlib and Tune, the RL and HPO systems implemented with Ray. You'll learn when to use Ray versus alternatives, and how to adopt it for your projects.
Q&A
Question: Does Ray broadcast to nodes when distributing state? If so is that a bottleneck? And is it difficult to reason about when the broadcast might happen when writing code for Ray?
Answer: Ray doesn’t give you a lot of control (yet) about this fine-tuning of behavior. Mostly it tries to balance resources across a cluster. For example, an actor’s state is stored in a distributed object store (with some optimizations to skip it…), so if another actor/task needs that state on another node, Ray will copy it over. That has the nice effect of collocating the data where it should be, with the compute, over time. There are a few best practices about not asking for results when you don’t need them (like the results of those calls to make_array
, but otherwise, Ray is reasonably smart about where it schedules stuff to avoid too many network hops moving data around.
Question: What are your thoughts on frameworks vs containerisation for MLOps?
Answer: Fortunately, you can use Ray for fine-grained distribution (at the MB level for example), and larger frameworks, like K8s, for macro-level scheduling. Ray has an autoscaler that people use to dynamically scale pods.
Question: I worked at a company that was reluctant to allow provisioning of cloud resources. So we made our own cluster from our work computers overnight.
Does Ray work with local commodity hardware networks? It seems like Ray could have reduced a lot of overhead work for us coordinating our “poor-mans-clustering”.
Answer: Yes, it works great that way. That’s mostly how the Berkeley researchers worked at first, on donated clusters.
Question: Any comments on performance of Ray vs alternatives?
Answer: two popular multithreading/multiprocessing libs in Python are joblib
and multiprocessing.Pool
. Ray provides API-compatible replacements that are a bit slower on the same node, but break the node boundary if you want to scale beyond a single machine. If you use asyncio
, Ray works nicely with that, too, as an alternative syntax from what I demonstrated.
YOU MAY ALSO LIKE:
Cluster-wide Scaling of Machine Learning with Ray
Dean Wampler
Product Engineering Director for Accelerated Discovery
IBM Research