TensorOpera and Aethir Partner to Advance Massive-Scale LLM Training on Decentralized Cloud
PALO ALTO, Calif., June 20, 2024 — TensorOpera, the company providing “Your Generative AI Platform at Scale,” has partnered with Aethir, a distributed cloud infrastructure provider, to accelerate its newest foundation model, TensorOpera Fox-1, highlighting the first mass-scale LLM training use case on a decentralized physical infrastructure network.
Introduced last week, TensorOpera Fox-1 is a cutting-edge open-source small language model (SLM) with 1.6 billion parameters, outperforming other models in its class from tech giants like Apple, Google, and Alibaba. This decoder-only transformer was trained from scratch on three trillion tokens using a novel three-stage curriculum. It features an innovative architecture that is 78% deeper than comparable models such as Google’s Gemma 2B, and it surpasses competitors on standard LLM benchmarks such as GSM8K and MMLU despite having significantly fewer parameters.
The partnership with Aethir equips TensorOpera with the advanced GPU resources needed to train Fox-1. Aethir’s collaboration with NVIDIA Cloud Partners, Infrastructure Funds, and various enterprise-grade hardware providers has established a global, large-scale GPU cloud. This network delivers cost-effective, scalable GPU resources, which are essential for high throughput, substantial memory capacity, and efficient parallel processing. With the support of Aethir’s decentralized cloud infrastructure, TensorOpera gains the tools it needs for streamlined AI development that depends on high network bandwidth and ample GPU power.
Through this collaboration, TensorOpera is further integrating a pool of GPU resources from Aethir that can be used seamlessly via TensorOpera’s AI platform for a variety of jobs, such as model deployment and serving, fine-tuning, and full training. Aethir’s distributed GPU cloud network lets AI platforms dynamically adjust how much GPU power they use on the go. Together, Aethir and TensorOpera aim to empower the next generation of large language model (LLM) training and give AI developers the assets they need to create powerful models and applications.
“I am thrilled about our partnership with Aethir,” said Salman Avestimehr, Co-Founder and CEO of TensorOpera. “In the dynamic landscape of generative AI, the ability to efficiently scale up and down during various stages of model development and in-production deployment is essential. Aethir’s decentralized infrastructure offers this flexibility, combining cost-effectiveness with high-quality performance. Having experienced these benefits firsthand during the training of our Fox-1 model, we decided to deepen our collaboration by integrating Aethir’s GPU resources into TensorOpera’s AI platform to empower developers with the resources necessary for pioneering the next generation of AI technologies.”
Aethir’s operational model is based on a globally distributed network of top-shelf GPUs capable of serving enterprise clients in the AI and machine learning industry regardless of their physical location. To provide lag-free, highly scalable GPU power worldwide, Aethir decentralizes its GPU resources across many locations in smaller clusters. Instead of pooling resources in a few massive data centers, as traditional centralized cloud service providers do, Aethir distributes its infrastructure to cover the network’s edge and cut the physical distance between GPU resources and end users.
“TensorOpera is the premier AI platform for LLMs and generative AI applications, and we are excited to be their supplier of enterprise GPU infrastructure,” said Kyle Okamoto, CTO of Aethir.
“Aethir is firmly dedicated to supporting the AI and machine learning sector in developing and launching groundbreaking solutions that can improve the everyday lives of people around the world. TensorOpera provides developers with a comprehensive AI platform, while Aethir will provide them with a steady supply of GPU power that can handle even the most demanding LLM training and AI inference. Thanks to our vast decentralized cloud infrastructure, Aethir is capable of powering large-scale AI development and deployment worldwide,” said Daniel Wang, Aethir’s CEO.
About TensorOpera, Inc.
TensorOpera, Inc. (formerly FedML, Inc.) is an innovative AI company based in Palo Alto, California, in Silicon Valley. TensorOpera specializes in developing scalable and secure AI platforms, offering two flagship products tailored for enterprises and developers. The TensorOpera AI Platform, available at TensorOpera.ai, is a comprehensive generative AI platform for model deployment and serving, model training and fine-tuning, AI agent creation, and more. It supports launching training and inference jobs on a serverless/decentralized GPU cloud, experiment tracking for distributed training, and enhanced security and privacy measures. The TensorOpera FedML Platform, accessible at FedML.ai, leads in federated learning and analytics with zero-code implementation. It includes a lightweight, cross-platform Edge AI SDK suitable for edge GPUs, smartphones, and IoT devices. Additionally, it offers a user-friendly MLOps platform to streamline decentralized machine learning and deployment in real-world applications. Founded in February 2022, TensorOpera has quickly grown to support a large number of enterprises and developers worldwide.
About Aethir
Aethir is the only enterprise-grade, AI-focused GPU-as-a-service provider in the market. Its decentralized cloud computing infrastructure connects GPU providers with enterprise clients who need powerful GPUs for professional AI/ML tasks. Thanks to a constantly growing network of over 40,000 top-shelf GPUs, including over 3,000 NVIDIA H100s, Aethir can provide enterprise-grade GPU computing wherever it’s needed, at scale.
Source: TensorOpera