December 08, 2023
Choosing the Right Infrastructure for Generative AI – 3 Keys to Success
To capitalize on artificial intelligence, organizations must avoid common pitfalls associated with choosing infrastructure to support development.
1. Developer Inefficiency Is Costing You
To make the most of their time, AI developers and data scientists need a user experience that’s push-button simple. These users don’t need to know (or care) about infrastructure; they want to build prototypes, experiment and get to production-ready models sooner. They need a simplified user interface and tools that let them start from pretrained, ready-to-customize models rather than building from scratch. An IT platform should streamline model development, giving developers access to resources without making them manage infrastructure.
Why is their productivity critical? Why can’t they make do with the compute resources they use today? Data science talent doesn’t come cheap, and retaining it can be hard. When these developers are waiting on resources, the business is essentially burning cash: workloads that should take only a couple of hours to run might take days. Many spend up to a month hand-assembling a do-it-yourself software stack just to run on the infrastructure provided to them. A dollar spent on traditional, non-optimized infrastructure may actually cost you three if it leaves your developers idling or expending effort that adds no value, such as re-engineering a software stack to make that infrastructure usable.
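To see how a dollar can turn into three, consider a rough back-of-envelope sketch in Python. Every figure below is a hypothetical assumption chosen for illustration, not a number from this article:

    # Back-of-envelope sketch of the hidden cost of developer idle time.
    # All figures are illustrative assumptions, not measured data.
    loaded_cost_per_hour = 120.0   # hypothetical fully loaded data scientist cost, $/hr
    infra_cost_per_hour = 40.0     # hypothetical GPU instance cost, $/hr

    # Scenario A: optimized platform. A 2-hour training run with no waiting.
    run_hours = 2
    cost_optimized = run_hours * (loaded_cost_per_hour + infra_cost_per_hour)

    # Scenario B: ad hoc infrastructure. The same 2-hour job, but the developer
    # also burns 6 hours queuing, debugging the stack and babysitting reruns.
    blocked_hours = 6
    cost_ad_hoc = (run_hours * infra_cost_per_hour
                   + (run_hours + blocked_hours) * loaded_cost_per_hour)

    print(f"Optimized platform:    ${cost_optimized:,.0f}")
    print(f"Ad hoc infrastructure: ${cost_ad_hoc:,.0f}")
    print(f"Effective multiplier:  {cost_ad_hoc / cost_optimized:.1f}x")

With these assumed numbers, the ad hoc path costs roughly three times as much, and nearly all of the difference is developer time, not the GPU bill.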
2. IaaS vs. PaaS
When provisioning AI resources, many IT leaders instinctively turn to Infrastructure as a Service (IaaS) offerings, accessing bare-metal server instances in the cloud for the lowest possible price per GPU-hour. This is understandable, given the way organizations have become accustomed to provisioning resources for more traditional enterprise workloads. However, when it comes to AI, it often makes more sense to move up the stack and adopt a full-stack AI platform.
Platforms optimized for AI include the right infrastructure, such as multinode GPU clusters interconnected with ultrahigh-bandwidth, low-latency networking. They also include a developer workflow hub that insulates teams from the complexity of the infrastructure while letting them collaborate, share their work and dynamically allocate resources across multiple projects at once. And to jump-start projects, they include accelerated data science libraries, optimized AI frameworks and even pretrained models that unleash productivity.
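To make “start from a pretrained model” concrete, here is a minimal sketch using the open-source Hugging Face Transformers library. This is a generic illustration of the workflow, not a specific component of any vendor’s platform; the model name and prompt are arbitrary:

    # Minimal sketch: build on a pretrained model instead of training from scratch.
    # Requires the open-source "transformers" package (pip install transformers).
    from transformers import pipeline

    # Download a small pretrained text-generation model and run it locally.
    generator = pipeline("text-generation", model="gpt2")
    result = generator("Generative AI infrastructure should", max_new_tokens=20)
    print(result[0]["generated_text"])

From a starting point like this, a team can fine-tune on its own data in days instead of spending months training a foundation model from nothing.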
3. Filling the AI Expertise Gap
AI talent can be extremely hard to find and expensive to retain; for some organizations, it’s essentially unavailable at any price. Because enterprise AI is still a nascent space full of unsupported, unproven technology, today’s businesses need enterprise-grade, 24/7 support and access to AI-fluent practitioners who know how to solve problems.
This was an important consideration as NVIDIA developed the DGX™ platform, and it’s why NVIDIA makes its AI expertise available, on demand, to every DGX customer, helping them achieve better results faster. That expertise ranges from optimizing models for faster training runs to finding the root cause of software incompatibilities that crash training jobs. A full-stack platform that comes with integrated access to AI expertise can help ensure applications reach market quickly and cost-effectively.
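As one concrete example of the “faster training runs” category, here is a minimal sketch of automatic mixed precision (AMP) in PyTorch, a common optimization on modern GPUs. The toy model, data and hyperparameters are illustrative assumptions, not anything specific to DGX support:

    # Minimal sketch: speed up training with PyTorch automatic mixed precision.
    # The toy linear model and random data stand in for a real workload.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(512, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    inputs = torch.randn(64, 512, device=device)
    targets = torch.randint(0, 10, (64,), device=device)

    for step in range(10):
        optimizer.zero_grad()
        # Run the forward pass in reduced precision where it is numerically safe.
        with torch.autocast(device_type=device, enabled=(device == "cuda")):
            loss = loss_fn(model(inputs), targets)
        # Scale the loss so small float16 gradients don't underflow, then step.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    print(f"final loss: {loss.item():.4f}")

On recent NVIDIA GPUs with Tensor Cores, a change like this alone can meaningfully shorten training time without touching the model’s architecture.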
Story by Tony Paikeday, senior director of AI systems at NVIDIA, responsible for go-to-market for NVIDIA’s DGX platform. In this role, Tony helps enterprise organizations infuse their businesses with the power of AI through infrastructure solutions that enable faster insights from data. He was previously with VMware, where he was responsible for bringing desktop and application virtualization solutions to market, as well as key enabling technologies including GPU virtualization and the software-defined data center.