Nvidia Unveils GPUs for Generative Inference Workloads like ChatGPT
Today at its GPU Technology Conference, Nvidia took the wraps off three new GPUs designed to accelerate inference workloads for generative AI applications, including generating text, images, and videos. It also launched a new GPU for recommendation models, vector databases, and graph neural nets.
Generative AI has surged in popularity since November, when OpenAI released ChatGPT to the world. Companies are now looking to use conversational AI systems (sometimes called chatbots) to service customer needs. That is great news for Nvidia, which makes the GPUs that are typically used to train large language models (LLMs) such as ChatGPT, GPT-4, BERT, or Google’s PaLM.
But in addition to training LLMs and generative computer vision models such as OpenAI’s DALL-E, GPUs can also be used to accelerate the inference side of the AI workload. To that end, Nvidia today unveiled three new GPUs designed to accelerate inference workloads.
The first is the Nvidia H100 NVL for Large Language Model Deployment. Nvidia says this new offering is “ideal for deploying massive LLMs like ChatGPT at scale.” It sports 188GB of memory and features a “transformer engine” that the company claims delivers up to 12x faster inference performance for GPT-3 compared to the prior-generation A100, at data center scale.
The H100 NVL for LLM Deployment is composed of two previously announced H100 GPUs built on the PCI form factor connected via an NVLink bridge, and “will supercharge” LLM inferencing, says Ian Buck, Nvidia’s vice president of hyperscale and HPC computing.
“These two GPUs work as one to deploy large language models and GPT models anywhere from 5 billion parameters all the way up to 200 [billion parameters],” Buck said during a press briefing Monday. “It has 188 gigabytes of memory and is 12x faster, this one GPU, than the throughput of a DGX A100 system that’s being used today everywhere. I’m really excited about the Nvidia H100 NVL. It’s going to help democratize the ChatGPT use cases and bring that capability to every server in every cloud.”
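To see why 188GB of memory matters for LLM inference, a back-of-envelope calculation helps: the weights of a GPT-3-class model alone occupy hundreds of gigabytes at 16-bit precision. The sketch below is illustrative arithmetic only (it ignores activations, the KV cache, and overhead), and the figures are assumptions, not Nvidia's numbers; note that Hopper's transformer engine supports 8-bit (FP8) precision, which roughly halves the weight footprint.

```python
# Back-of-envelope estimate of the GPU memory needed just to hold a
# model's weights at inference time. Illustrative only: it ignores
# activations, the KV cache, and framework overhead.

def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Memory in GB for the weights alone."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

H100_NVL_MEMORY_GB = 188  # combined memory of the two-GPU H100 NVL

# A GPT-3-class model (~175B parameters), at two precisions:
fp16 = weight_memory_gb(175, 2)  # 16-bit weights -> 350 GB
fp8 = weight_memory_gb(175, 1)   # 8-bit weights  -> 175 GB

print(f"FP16 weights: {fp16:.0f} GB, fits in {H100_NVL_MEMORY_GB} GB: {fp16 <= H100_NVL_MEMORY_GB}")
print(f"FP8 weights:  {fp8:.0f} GB, fits in {H100_NVL_MEMORY_GB} GB: {fp8 <= H100_NVL_MEMORY_GB}")
```

Under these assumptions, a 175-billion-parameter model fits on the H100 NVL only at 8-bit precision, which is consistent with the pairing of large memory and a transformer engine in one product.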
The Santa Clara, California-based company also revealed more about the L40 GPU for Image Generation, which it introduced last September. The new GPU SKU is optimized for graphics and AI-enabled 2D, video, and 3D image generation. Compared to the previous-generation chip, the L40 for Image Generation delivers 7x the inference performance for Stable Diffusion (an AI image generator) and 12x the performance for powering Omniverse workloads.
Nvidia also revealed the L4 for AI Video. This GPU, which can serve as a general GPU for any workload, can deliver 120 times faster video inference than CPU servers, the company claims.
Finally, the company talked up its Grace Hopper processor as being ideal for graph recommendation models, vector databases, and graph neural nets. Sporting a 900 GB/s NVLink-C2C connection between CPU and GPU, the Grace Hopper “superchip” will be able to deliver 7x faster data transfers and queries compared to PCIe Gen 5, Nvidia says.
“The Grace CPU and the Hopper GPU combined really excel at those very large memory AI tasks for inference, for workloads like large recommender systems, where they have huge embedding tables to help predict what customers need, want, and want to buy,” Buck says. “We see Grace Hopper superchip [bringing] amazing value in the areas of large recommender systems and vector databases.”
All of the new inference GPUs ship with Nvidia software, such as its AI Enterprise suite. This suite includes Nvidia’s TensorRT software development kit (SDK) for high-performance deep learning inference and the Triton Inference Server, open-source inference-serving software that helps standardize model deployment.
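Triton standardizes deployment by reading each model from a repository directory containing a `config.pbtxt` that describes its inputs and outputs. The fragment below is a minimal sketch of such a file; the model and tensor names are hypothetical, and the exact fields depend on the backend in use.

```
# config.pbtxt -- minimal, hypothetical Triton model configuration
name: "my_llm"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
```

Dimensions of -1 mark variable-length axes, such as sequence length, which matters for serving LLMs whose prompt sizes vary per request.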
Some of Nvidia’s partners have already adopted the new products. Google Cloud, for instance, is using the L4 in its Vertex AI service. The startup Descript is using the L4 GPU on Google Cloud to power its generative AI service for video and podcast creators, another startup, WOMBO, is using it for its text-to-art generation service, and Kuaishou is using it to power its short-video service.
The L4 GPU is available as a private preview on Google Cloud as well as through 30 server makers, including ASUS, Dell Technologies, HPE, Lenovo, and Supermicro. The L40 is available from a select number of system builders, while Grace Hopper and H100 NVL are expected to be available in the second half of the year.
Related Items:
GPT-4 Has Arrived: Here’s What to Know
Like ChatGPT? You Haven’t Seen Anything Yet
Hallucinations, Plagiarism, and ChatGPT