BreezeML offers an automated, affordable, and reliable solution for deep learning jobs by enabling safe, SLA-guaranteed use of preemptible instances in a GPU cloud.
Seamless Drop-in Library
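import breezeml

# config, model, and dataloader come from the existing training script.
dist = breezeml.init_dist(config=config)
engine = breezeml.init_engine(model, dist, train=True)
for step, batch in enumerate(dataloader):
    ...  # training step goes here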
Public Cloud: For companies that use public clouds to run AI jobs, BreezeML provides a user-friendly virtual cloud interface to which users can submit their jobs and their SLAs. Based on the specified SLA, our cloud automatically selects a combination of on-demand and spot GPU instances from a public cloud to run the job with SLA guarantees.
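To make the idea of SLA-driven instance selection concrete, here is a minimal sketch of how a planner might pick a mix of on-demand and spot instances. It is not BreezeML's scheduler: the function `plan_instances`, the `Plan` type, and every parameter are hypothetical, and the model is deliberately simple. It assumes each spot instance is usable for a fixed fraction of the time, treats that fraction as an expected value rather than a guarantee, and bills spot instances only while they run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    on_demand: int   # number of on-demand instances
    spot: int        # number of spot instances
    hours: float     # expected wall-clock time to finish
    cost: float      # expected total cost


def plan_instances(
    work_gpu_hours: float,
    deadline_hours: float,
    max_instances: int,
    on_demand_price: float,
    spot_price: float,
    spot_availability: float,
) -> Optional[Plan]:
    """Return the cheapest mix that meets the deadline in expectation.

    Hypothetical model: an on-demand instance contributes 1 GPU-hour per
    hour; a spot instance contributes `spot_availability` GPU-hours per
    hour and is billed only for the time it is actually running.
    """
    best: Optional[Plan] = None
    for total in range(1, max_instances + 1):
        for on_demand in range(total + 1):
            spot = total - on_demand
            throughput = on_demand + spot * spot_availability
            if throughput <= 0:
                continue
            hours = work_gpu_hours / throughput
            if hours > deadline_hours:
                continue  # this mix would miss the deadline
            hourly = on_demand * on_demand_price + spot * spot_availability * spot_price
            cost = hours * hourly
            if best is None or cost < best.cost:
                best = Plan(on_demand, spot, hours, cost)
    return best


if __name__ == "__main__":
    # Made-up numbers: 100 GPU-hours of work, 10-hour deadline, at most 12 instances.
    print(plan_instances(100, 10, 12, on_demand_price=3.0, spot_price=1.0, spot_availability=0.5))
```

With these made-up inputs, an all-spot fleet of 12 instances cannot meet the deadline, so the sketch settles on 8 on-demand and 4 spot instances at an expected cost of 260. A looser deadline or a higher instance cap shifts the mix toward spot.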
On-Premise Cluster: For enterprises or government organizations that run on-premise clusters, our preemption-resilient system lets them share GPU resources between high-priority inference jobs and low-priority training jobs. Low-priority jobs run continuously and are preempted when high-priority jobs arrive, without losing any data.
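One common way to let a low-priority job survive preemption is checkpoint-and-resume, sketched below. This illustrates the general technique only and is not a description of how BreezeML implements resilience. The names `run_training` and `Preempted`, the JSON checkpoint format, and the `preempt_at` parameter (which simulates a high-priority job arriving) are all hypothetical.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Raised when the (simulated) scheduler reclaims the GPU."""


def save_checkpoint(state: dict, path: str) -> None:
    # Write to a temp file, then rename, so a preemption mid-write
    # never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_training(total_steps: int, ckpt_path: str, preempt_at=None) -> dict:
    # Resume from the last checkpoint if one exists.
    state = {"step": 0, "loss_sum": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise Preempted(f"preempted at step {state['step']}")
        # Stand-in for a real training step.
        state["loss_sum"] += 1.0 / (state["step"] + 1)
        state["step"] += 1
        save_checkpoint(state, ckpt_path)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        try:
            run_training(5, path, preempt_at=3)
        except Preempted as e:
            print(e)
        # The second run picks up from the saved step instead of restarting.
        print(run_training(5, path))
```

Checkpointing after every step keeps the sketch simple. A real job would checkpoint less often and trade a little repeated work for lower I/O overhead.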
Key Value Points
1. Significantly reduced costs or increased workloads (large models/more users) under a fixed budget
2. SLAs fully preserved
3. Customizable to both public and on-premise clouds
Figure legend: the red line is the non-preemptible baseline.
BreezeML, recently launched by a team of researchers from UCLA and Princeton, is an ML-infrastructure company that aims to significantly reduce the costs and development/management challenges of ML (training and inference) jobs by virtualizing heterogeneous cloud resources (e.g., different computation/hardware families on AWS) and leveraging their diverse pricing models. At the core of BreezeML’s virtual cloud service are two innovations backed by years of research: (1) a common runtime that simultaneously powers multiple ML platforms/frameworks (e.g., TensorFlow, PyTorch, and Ray) to enable seamless integration of different ML stacks; and (2) a platform-independent scheduling and orchestration system that makes intelligent use of economically favorable hardware (e.g., lambdas, spot instances) without (a) requiring a single line of code modification to existing jobs, and (b) impacting the SLA of a given ML job.
Professor of Computer Science at UCLA
Professor of Computer Science at Princeton University
Director of Product
Graduating Ph.D. Student at UCLA
Director of Engineering
Ph.D. Student of Computer Science at UCLA