Chapter 20

System Software, Firmware & Orchestration

20.1 Overview

Between the physical hardware described in Chapters 2 through 18 and the AI models that run on top of it sits a layer of system software that makes the hardware function as a coherent whole. This chapter covers the software and firmware that no AI data center can operate without: GPU programming stacks (CUDA, ROCm), server firmware (BIOS, BMC), cluster orchestration (Slurm, Kubernetes), network operating systems (SONiC, Arista EOS), and virtualization platforms (VMware, KVM).

The reason this layer exists as a separate chapter is concentration. NVIDIA’s CUDA ecosystem has approximately 80% [1] of the AI accelerator software market, 5.9 million developers, and 20 years of accumulated ecosystem depth. Aspeed Technology in Taiwan holds roughly 70% [2] of the BMC chip market. AMI and Insyde together control the vast majority of x86 BIOS/UEFI firmware. Broadcom (via its VMware acquisition) holds approximately 72% [3] of virtualization revenue share. These positions are less visible than hardware monopolies (nobody photographs a BIOS chip the way they photograph an EUV machine), but they are equally concentrated and in some cases harder to displace.

NVIDIA’s vertical integration through this layer deserves specific attention. In 2025, NVIDIA acquired SchedMD (the company behind Slurm, the dominant GPU cluster scheduler) [4] and Run:AI ($700 million [5], GPU orchestration). Combined with its existing CUDA toolkit, DGX OS, Base Command Manager, Slinky (Slurm-on-Kubernetes), Cumulus Linux (network OS), and custom BMC firmware for DGX systems, NVIDIA now controls the deepest vertical stack in AI infrastructure. A customer running NVIDIA GPUs in NVIDIA DGX servers with NVIDIA networking, managed by NVIDIA’s scheduler, on NVIDIA’s network OS, programmed in NVIDIA’s CUDA, faces switching costs at every layer simultaneously. This is lock-in measured not in one dimension but in six.

Morgan Stanley estimates the CPU-side orchestration market (a subset of this chapter) at $60-110 billion TAM by 2030 [6], noting that CPU-side orchestration accounts for 50-90% [7] of workload latency in agentic AI applications. The markets covered in this chapter are small by current revenue but large by strategic importance and future TAM.


20.2 Market Sizing

CUDA ecosystem: NVIDIA does not separately disclose CUDA revenue because the toolkit is bundled free with GPU purchases. NVIDIA AI Enterprise (NVAIE), the subscription overlay that provides enterprise support and certified containers, is priced per-GPU-per-year but revenue is not broken out. The economic value of CUDA is embedded in NVIDIA’s GPU ASP premium: customers pay more for NVIDIA GPUs partly because the CUDA ecosystem reduces total cost of ownership for AI workloads.

BIOS/UEFI firmware: A niche market dominated by two companies. AMI (being acquired by Lattice Semiconductor for $1.65B [8]) and Insyde Software (6231.TPEx, market cap ~$290M [9]) together supply firmware for the vast majority of x86 server platforms. The market is small in revenue terms (hundreds of millions) but sits at a critical chokepoint: a BIOS vulnerability or supply disruption affects every server built on that firmware.

BMC chips: Aspeed Technology (5274.TPEx, ~$19B market cap [10]) holds approximately 70% [11] of the global BMC chip market. A GB300 rack requires 71 BMC chips compared to 15 for an H100-based system, a 4.7x multiplier driven by higher GPU count per rack. Every server needs a BMC for out-of-band management (remote power control, health monitoring, firmware updates). This creates a demand multiplier that tracks the AI rack buildout linearly.
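To make the out-of-band role concrete, the sketch below queries a server's power state and requests a reset through the DMTF Redfish REST API that modern BMC firmware (including OpenBMC builds running on Aspeed silicon) exposes. This is a minimal sketch, not a definitive implementation: the BMC address and credentials are placeholders, and the `requests` library and a reachable BMC are assumed.

```python
import requests

# Hypothetical BMC address and credentials -- replace with real values.
BMC = "https://10.0.0.42"
AUTH = ("admin", "changeme")

# Redfish exposes managed servers under /redfish/v1/Systems.
# verify=False only because many BMCs ship with self-signed certificates.
systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
system_uri = systems["Members"][0]["@odata.id"]  # e.g. /redfish/v1/Systems/1

# Out-of-band health and power query: works even if the host OS is down.
state = requests.get(f"{BMC}{system_uri}", auth=AUTH, verify=False).json()
print("PowerState:", state.get("PowerState"),
      "Health:", state.get("Status", {}).get("Health"))

# Remote power control via the standard ComputerSystem.Reset action.
requests.post(
    f"{BMC}{system_uri}/Actions/ComputerSystem.Reset",
    json={"ResetType": "GracefulRestart"},
    auth=AUTH,
    verify=False,
)
```

Because this channel is independent of the host CPU and OS, it is the path through which fleets of AI servers are power-cycled, monitored, and re-flashed at scale, which is why every additional BMC per rack matters.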

Cluster orchestration: Slurm (now NVIDIA-owned via the SchedMD acquisition, December 2025 [12]) is the dominant scheduler for GPU clusters. Kubernetes is the dominant container orchestrator for cloud-native workloads. The two are converging (NVIDIA’s Slinky runs Slurm workloads on Kubernetes). Run:AI ($700M [13] acquisition by NVIDIA, January 2025) provided GPU-aware scheduling that optimizes utilization across a shared GPU pool.
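As an illustration of what a Slurm-scheduled GPU job looks like in practice, the sketch below generates a batch script and submits it with `sbatch`. The `#SBATCH` directives (`--nodes`, `--gres=gpu:...`, etc.) are standard Slurm syntax; the partition name, node counts, and training command are placeholder assumptions.

```python
import subprocess
import tempfile

# Placeholder job parameters -- partition name and training script are assumptions.
batch_script = """#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --partition=gpu            # cluster-specific partition name
#SBATCH --nodes=2                  # two GPU servers
#SBATCH --ntasks-per-node=8        # one task per GPU
#SBATCH --gres=gpu:8               # request 8 GPUs on each node
#SBATCH --time=04:00:00

# srun launches one task per GPU across the whole allocation.
srun python train.py --config config.yaml
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(batch_script)
    path = f.name

# sbatch only queues the job; Slurm decides when and where it runs.
result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```

The value Run:AI and Slinky add sits above this layer: deciding how many such jobs share a GPU pool and how idle allocations are reclaimed.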

Virtualization: Broadcom/VMware holds approximately 72% [3] of virtualization revenue share. VMware’s aggressive licensing changes post-Broadcom-acquisition have driven some customers toward KVM/QEMU (open source) and Nutanix (NTNX, ~$12.4B market cap [14], revenue $2.54B [15]). Proxmox VE, an open-source alternative, saw 340% year-over-year growth in evaluations.


20.3 Key Companies

20.3.1 GPU Software Stacks

| Company | Ticker | Exchange | Approx. Mkt Cap | Role | Key Metric |
| --- | --- | --- | --- | --- | --- |
| NVIDIA (CUDA) | NVDA | NASDAQ | ~$5.2T | CUDA, cuDNN, TensorRT, NCCL, Triton, DGX OS, Base Command Manager | 5.9M developers, 450M+ toolkit downloads, 3000+ GPU-accelerated apps. ~80% of the AI accelerator ecosystem. |
| AMD (ROCm) | AMD | NASDAQ | ~$742B | ROCm open-source GPU programming stack; HIP translation layer | Native PyTorch support. HIP converts most CUDA code. Ecosystem is 5+ years behind in library depth. |
| Intel (oneAPI) | INTC | NASDAQ | ~$628B | oneAPI cross-architecture programming model | Attempting to create a hardware-agnostic alternative. Limited traction in AI workloads. |

20.3.2 BIOS/UEFI Firmware

| Company | Ticker | Exchange | Approx. Mkt Cap | Role | Key Metric |
| --- | --- | --- | --- | --- | --- |
| AMI (American Megatrends) | Private | Being acquired | $1.65B [8] (acq. price) | Dominant x86 BIOS/UEFI firmware vendor | Being acquired by Lattice Semiconductor. Near-monopoly position in server BIOS. |
| Lattice Semiconductor | LSCC | NASDAQ | ~$17.4B | Acquiring AMI; FPGA + firmware combined platform | $1.65B [8] AMI acquisition creates a hardware + firmware security platform. |
| Insyde Software | 6231 | TPEx | ~$290M [9] | #2 BIOS/UEFI vendor; sole remaining public pure-play | Serves major server OEMs. Small market cap relative to critical position. |

20.3.3 BMC (Baseboard Management Controller)

| Company | Ticker | Exchange | Approx. Mkt Cap | Role | Key Metric |
| --- | --- | --- | --- | --- | --- |
| Aspeed Technology | 5274 | TPEx | ~$19.0B [10] | ~70% [11] monopoly in BMC chips | Every server needs a BMC. GB300 needs 71 per rack (vs 15 for H100). Demand multiplied 4.7x by AI rack density. |
| OpenBMC | N/A | Open source (Linux Foundation) | N/A | Open-source BMC firmware (backed by Meta, Google, Microsoft, NVIDIA) | Runs on Aspeed chips. Reduces firmware vendor lock-in but not chip vendor lock-in. |

20.3.4 Cluster Orchestration

| Company | Ticker | Exchange | Approx. Mkt Cap | Role | Key Metric |
| --- | --- | --- | --- | --- | --- |
| NVIDIA (Slurm/SchedMD) | NVDA | NASDAQ | ~$5.2T | Acquired SchedMD (Dec 2025); Slurm is the dominant GPU cluster scheduler | NVIDIA now controls the scheduler. Slinky bridges Slurm and Kubernetes. |
| NVIDIA (Run:AI) | NVDA | NASDAQ | ~$5.2T | Acquired Run:AI ($700M [13], Jan 2025); GPU-aware orchestration | Optimizes GPU utilization across shared pools. Integrated into DGX Cloud. |
| Anyscale | Private | Private | ~$1.0B (est.) | Ray distributed computing framework | Open-source Ray used by OpenAI, Uber, Spotify for distributed training. Anyscale provides a managed Ray platform. |

20.3.5 Network Operating Systems

| Company | Ticker | Exchange | Approx. Mkt Cap | Role | Key Metric |
| --- | --- | --- | --- | --- | --- |
| Arista Networks (EOS) | ANET | NYSE | ~$218B | Arista EOS; ~35% of DC Ethernet switch market (see also Chapter 10) | Revenue ~$9B. EOS is the leading commercial NOS for cloud data centers. |
| NVIDIA (Cumulus Linux) | NVDA | NASDAQ | ~$5.2T | Open NOS for white-box switches; part of NVIDIA networking stack | Acquired Cumulus Networks (2020). Integrated with Spectrum switches. |
| SONiC | N/A | Open source (Linux Foundation, orig. Microsoft) | N/A | Open-source NOS for data center switches | Adopted by Azure, Alibaba, and other hyperscalers for white-box switching. Runs on Broadcom, NVIDIA, and other ASICs. |

20.3.6 Virtualization / Hypervisor

| Company | Ticker | Exchange | Approx. Mkt Cap | Role | Key Metric |
| --- | --- | --- | --- | --- | --- |
| Broadcom (VMware) | AVGO | NASDAQ | ~$2.0T | VMware vSphere; ~72% [3] virtualization revenue share | Aggressive licensing changes post-acquisition driving customer migration. VMware is bundled into Broadcom’s infrastructure software segment. |
| Nutanix | NTNX | NASDAQ | ~$12.4B [14] | Hyperconverged infrastructure; VMware alternative | Revenue $2.54B [15]. Benefiting from VMware pricing backlash. AHV hypervisor gaining share. |
| Red Hat (IBM) | IBM | NYSE | ~$250B | OpenShift container platform, RHEL, KVM | Enterprise Kubernetes and Linux. OpenShift competes with VMware for cloud-native workloads. |

20.4 Bottleneck Analysis

CUDA ecosystem (EXTREME, RPN 90). CUDA is the most important software moat in the AI industry. The switching costs are enormous: rewriting CUDA kernels to HIP (AMD) or oneAPI (Intel) takes months of engineering effort, risks performance regression, and requires retraining development teams. Every major AI framework (PyTorch, TensorFlow, JAX) is first optimized for CUDA. The 5.9 million developer base creates a self-reinforcing ecosystem where the best AI tools are CUDA-first because most developers use CUDA, and most developers use CUDA because the best tools are CUDA-first. Breaking this cycle would require a competing ecosystem to offer materially better performance at the same maturity level, which neither ROCm nor oneAPI has achieved.
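To make the switching cost tangible, the sketch below shows the kind of hand-written kernel that sits beneath many production AI pipelines: CUDA C embedded in Python via CuPy's RawKernel. CuPy and an NVIDIA GPU are assumptions here, not something the text above specifies; the kernel itself is a deliberately simple example. Everything inside the kernel string and the launch configuration is CUDA-specific and is exactly what a ROCm migration would have to translate (for example with HIP's hipify tools) and then re-tune for different hardware.

```python
import numpy as np
import cupy as cp  # assumes CuPy built against the CUDA toolkit

# Hand-written CUDA C kernel: the part that does not port for free.
saxpy_src = r"""
extern "C" __global__
void saxpy(const float a, const float* x, const float* y, float* out, const int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // CUDA thread indexing
    if (i < n) {
        out[i] = a * x[i] + y[i];
    }
}
"""

saxpy = cp.RawKernel(saxpy_src, "saxpy")

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

# Launch configuration (grid/block sizes) is also tuned per vendor and per GPU.
threads = 256
blocks = (n + threads - 1) // threads
saxpy((blocks,), (threads,), (np.float32(2.0), x, y, out, np.int32(n)))

# Sanity check against the portable, framework-level equivalent.
assert cp.allclose(out, 2.0 * x + y)
```

The final framework-level line is the portable part; the embedded kernel, its memory layout assumptions, and its launch parameters are what the "months of engineering effort" above refers to, multiplied across hundreds of kernels in a real training stack.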

However, while the severity of a CUDA disruption is extreme, the composite RPN (90) is moderate rather than extreme because the occurrence probability is very low. NVIDIA has no incentive to disrupt its own ecosystem. The risk scenarios (antitrust forced divestiture, sudden licensing change, catastrophic security vulnerability) are all low probability.

Aspeed BMC chips (HIGH, RPN 112). This is the hidden bottleneck in this layer. Every server needs a BMC chip. The 4.7x demand multiplier from AI rack density (71 BMCs per GB300 rack vs 15 per H100 rack) means Aspeed’s production must scale linearly with the AI server buildout. At ~70% [11] market share and a $19B [10] market cap, Aspeed is one of the most concentrated dependencies in the supply chain that almost nobody discusses. There is no credible alternative BMC chip vendor at comparable scale.

NVIDIA vertical integration (HIGH, structural). The concentration of CUDA + Slurm + Run:AI + Cumulus Linux + DGX OS + custom BMC in one company creates a lock-in stack where switching any single layer is hard, and switching all of them simultaneously is nearly impossible. This is a structural competitive advantage, but from a supply chain resilience perspective, it means that regulatory action against NVIDIA (antitrust, export controls) would disrupt six layers of the software stack simultaneously.

BIOS/UEFI firmware consolidation (MODERATE). AMI’s acquisition by Lattice reduces the number of independent BIOS vendors from three (AMI, Insyde, Phoenix) to effectively two (Lattice/AMI, Insyde). Phoenix Technologies’ firmware business was sold to Lenovo. Insyde (6231.TPEx, ~$290M [9] market cap) is the sole remaining publicly traded pure-play BIOS vendor. A firmware vulnerability at this level affects every server built on that firmware, making BIOS a supply chain risk that is low-frequency but high-severity.


20.5 Risks

CUDA lock-in may erode, particularly for inference. CUDA’s moat is strongest in training, where developers write custom kernels and optimize low-level memory access patterns. For inference, workloads run through higher-level frameworks (PyTorch, TensorRT, ONNX Runtime, Triton compiler) that abstract away the hardware. As inference grows to represent the majority of AI compute (some estimates suggest 70-80% by 2028), the fraction of the market that requires deep CUDA expertise shrinks. AMD’s ROCm has native PyTorch support, and the HIP translation layer converts most CUDA code with minimal changes. If ROCm reaches parity in library depth and debugging tooling, the switching cost drops significantly. OpenAI’s 6GW AMD GPU deal [16] suggests that at least one frontier lab believes ROCm is viable for large-scale training.
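The portability argument at the framework level can be seen in a few lines of PyTorch. In ROCm builds of PyTorch, AMD GPUs are exposed through the same `torch.cuda` / "cuda" device interface, so code written at this level typically runs unchanged on either vendor's hardware; only the lower-level custom kernels and performance tuning differ. A minimal sketch (the model and tensor sizes are arbitrary placeholders):

```python
import torch

# On a CUDA build this selects an NVIDIA GPU; on a ROCm build of PyTorch the
# same "cuda" device string maps to an AMD GPU -- no source changes needed.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).to(device).eval()

x = torch.randn(8, 4096, device=device)

with torch.inference_mode():
    y = model(x)

print(f"ran on {device}: output shape {tuple(y.shape)}")
```

Workloads that stay at this level of abstraction are the ones for which the CUDA moat is thinnest; the moat reasserts itself as soon as teams drop down to custom kernels for performance.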

Open-source alternatives. SONiC (network OS), OpenBMC (BMC firmware), KVM (hypervisor), and Kubernetes (orchestration) are all open-source alternatives to proprietary stacks. The trend toward open-source infrastructure software could limit the pricing power of proprietary vendors. However, enterprise support, certification, and integration still favor commercial offerings.
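On the orchestration side of that list, GPU scheduling through the open-source Kubernetes API already looks vendor-neutral at the request level. The sketch below creates a single-GPU pod with the official Kubernetes Python client; the namespace, image tag, and pod name are placeholder assumptions, and the cluster is assumed to run NVIDIA's device plugin, which is what advertises the "nvidia.com/gpu" resource to the scheduler.

```python
from kubernetes import client, config

# Assumes a local kubeconfig and a cluster with the NVIDIA device plugin
# installed (it advertises the "nvidia.com/gpu" extended resource).
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image tag
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    # The scheduler places this pod on a node with a free GPU.
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The API is open, but note that even here the resource name, device plugin, and container image come from NVIDIA, which is the pattern the commercial-support argument above describes.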

Antitrust scrutiny. NVIDIA’s acquisition of SchedMD (Slurm) and Run:AI, combined with its CUDA monopoly, could attract regulatory attention. If antitrust authorities require NVIDIA to divest or open-source any of these acquisitions, the vertical lock-in thesis weakens.

First principles check. Is CUDA lock-in real or overstated? The existence of the OpenAI-AMD deal suggests that at very large scale, with dedicated engineering teams, CUDA lock-in can be broken. But for most enterprises and researchers, who do not have OpenAI’s engineering resources, CUDA remains the path of least resistance. The lock-in is real for the vast majority of the market [17][18] even if a handful of hyperscalers can escape it.