You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require access to multiple GPUs across different nodes, but inter-node communication seems slow, impacting performance.
What is a potential networking configuration you would implement to optimize inter-node communication for distributed training?
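One common approach is to move the all-reduce traffic onto an RDMA-capable fabric (InfiniBand or RoCE) and steer NCCL, which TensorFlow's distributed strategies use for collectives, onto it. A minimal sketch, assuming ConnectX adapters with placeholder device/interface names (`mlx5_0`, `ib0`) and a hypothetical `train.py`:

```shell
# Point NCCL at the RDMA fabric instead of the management Ethernet.
export NCCL_IB_HCA=mlx5_0        # HCA to use for GPUDirect RDMA transfers
export NCCL_SOCKET_IFNAME=ib0    # bootstrap/control traffic over the IB interface
export NCCL_DEBUG=INFO           # log which transport NCCL actually selected
python train.py
```

The `NCCL_DEBUG=INFO` output is a quick way to confirm the job is using IB/RDMA rather than falling back to TCP sockets.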
You are managing a high-performance computing environment. Users have reported storage performance degradation, particularly during peak usage hours when both small metadata-intensive operations and large sequential I/O operations are being performed simultaneously. You suspect that the mixed workload is causing contention on the storage system.
Which of the following actions is most likely to improve overall storage performance in this mixed workload environment?
What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?
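A first diagnostic step in this scenario is to inspect the interconnect topology and the GPUDirect-related kernel modules; a sketch:

```shell
# Show the GPU interconnect matrix; PHB/SYS where NV# (NVLink) or PIX is
# expected often explains slow GPU-to-GPU transfers.
nvidia-smi topo -m

# Check that the peer-memory module needed for GPUDirect RDMA is loaded
# (module name varies by driver generation).
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'
```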
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
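A hedged sketch of how core dumps might be enabled for a container: raise the core-size ulimit on the container and direct dumps to a writable path on the host (note that `kernel.core_pattern` is host-global, not per container). The image name `myimage:latest` and the `/tmp/cores` path are placeholders.

```shell
# Write cores to a known host directory.
sudo mkdir -p /tmp/cores
echo '/tmp/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# Run the container with unlimited core size and the dump directory mounted.
docker run --ulimit core=-1 \
  --mount type=bind,source=/tmp/cores,target=/tmp/cores \
  myimage:latest

# After a crash, analyze the dump against the matching binary, e.g.:
#   gdb /path/to/binary /tmp/cores/core.<name>.<pid>
```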
A cloud engineer is looking to provision a virtual machine for machine learning using the NVIDIA Virtual Machine Image (VMI) and RAPIDS.
What technology stack will be set up for the development team automatically when the VMI is deployed?
A system administrator wants to run these two commands in Base Command Manager.
main
showprofile
device status apc01
What command should the system administrator use from the management node system shell?
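For context, `cmsh` can execute cluster-manager-shell commands non-interactively from the management node's system shell, with successive commands separated by semicolons; a sketch along the lines of the question (exact quoting per the BCM documentation):

```shell
# Run cmsh commands in one shot from the head node's system shell.
cmsh -c "main showprofile; device status apc01"
```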
You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?
Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?
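As a quick way to see those components on a live cluster (assuming a kubeadm-style deployment where the control plane runs as pods in `kube-system`):

```shell
# List nodes and their roles (control-plane vs. worker).
kubectl get nodes -o wide

# Control-plane components (kube-apiserver, etcd, kube-scheduler,
# kube-controller-manager) plus node agents such as kube-proxy.
kubectl get pods -n kube-system
```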
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a stand-alone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)
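Typical prerequisites are generating an NGC API key and authenticating Docker against the `nvcr.io` registry (the username is literally `$oauthtoken`); the NVIDIA Container Toolkit must also be installed so Docker can expose the GPUs. A sketch, with a placeholder container tag:

```shell
# Log in to the NGC registry with your API key.
docker login nvcr.io          # Username: $oauthtoken  Password: <NGC API key>

# Pull and smoke-test a framework container (tag is illustrative).
docker run --rm --gpus all nvcr.io/nvidia/tensorflow:24.03-tf2-py3 nvidia-smi
```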
Your Kubernetes cluster is running a mixture of AI training and inference workloads. You want to ensure that inference services have higher priority over training jobs during peak resource usage times.
How would you configure Kubernetes to prioritize inference workloads?
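One mechanism Kubernetes provides for this is pod priority and preemption via a `PriorityClass`; a minimal sketch with placeholder names:

```shell
# Define a high priority class for inference services.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high
value: 1000000
globalDefault: false
description: "Priority for latency-sensitive inference services"
EOF
```

Inference pod specs would then set `priorityClassName: inference-high`, allowing the scheduler to preempt lower-priority training pods when resources are scarce.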
You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.
What would be the first step to troubleshoot this issue?
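A reasonable first check is whether the NVIDIA device plugin DaemonSet is running and whether the nodes actually advertise the `nvidia.com/gpu` resource; a sketch (label selector and node name are placeholders that depend on how the plugin was deployed):

```shell
# Is the device plugin pod running on each GPU node?
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

# Does the node report allocatable GPUs?
kubectl describe node <node-name> | grep -A2 'nvidia.com/gpu'
```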
A Slurm user is experiencing a frequent issue where a Slurm job is getting stuck in the “PENDING” state and unable to progress to the “RUNNING” state.
Which Slurm command can help the user identify the reason for the job’s pending status?
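For reference, `squeue` exposes a REASON field that explains a pending state (e.g. `Resources`, `Priority`, a QOS limit); a sketch with a placeholder job ID:

```shell
# Long listing includes the pending reason.
squeue -j <jobid> -l

# Or select just the fields of interest.
squeue -j <jobid> -O JobID,State,Reason
```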
Which two platforms should be used with Fabric Manager? (Choose two.)
You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.
What is a key requirement for enabling cloudbursting across multiple cloud providers?
A system administrator needs to scale a Kubernetes Job to 4 replicas.
What command should be used?
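For context, older `kubectl` releases accepted scaling a Job directly, while current releases manage a Job's concurrency through its `parallelism` field; both forms sketched below with a placeholder Job name:

```shell
# Legacy form (removed in newer kubectl versions):
kubectl scale job <job-name> --replicas=4

# Current equivalent: patch the Job's parallelism.
kubectl patch job <job-name> -p '{"spec":{"parallelism":4}}'
```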
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
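For background, Fabric Manager runs as a host service that configures the NVSwitch fabric, and in shared-NVSwitch virtualization deployments it partitions the NVLink fabric so GPU subsets can be passed through to VMs; the service itself is managed like any systemd unit:

```shell
# Enable and start the Fabric Manager service on the HGX host.
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
```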
You are configuring networking for a new AI cluster in your data center. The cluster will handle large-scale distributed training jobs that require fast communication between servers.
What type of networking architecture can maximize performance for these AI workloads?
An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams of a specific container.
What command should be used?
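For reference, `docker attach` connects the local terminal to a running container's STDIN, STDOUT, and STDERR streams, whereas `docker logs` replays STDOUT/STDERR without attaching; a sketch with a placeholder container name:

```shell
# Attach to the container's I/O streams.
docker attach <container-id-or-name>

# Follow STDOUT/STDERR without attaching STDIN.
docker logs -f <container-id-or-name>
```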
What must be done before installing new versions of DOCA drivers on a BlueField DPU?