
NVIDIA NCP-AIO NVIDIA AI Operations Exam Practice Test

Page: 1 / 7
Total 66 questions

NVIDIA AI Operations Questions and Answers

Question 1

You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require access to multiple GPUs across different nodes, but inter-node communication seems slow, impacting performance.

What is a potential networking configuration you would implement to optimize inter-node communication for distributed training?

Options:

A.

Increase the number of replicas for each job to reduce the load on individual nodes.

B.

Use standard Ethernet networking with jumbo frames enabled to reduce packet overhead during communication.

C.

Configure a dedicated storage network to handle data transfer between nodes during training.

D.

Use InfiniBand networking between nodes to reduce latency and increase throughput for distributed training jobs.

Question 2

You are managing a high-performance computing environment. Users have reported storage performance degradation, particularly during peak usage hours when both small metadata-intensive operations and large sequential I/O operations are being performed simultaneously. You suspect that the mixed workload is causing contention on the storage system.

Which of the following actions is most likely to improve overall storage performance in this mixed workload environment?

Options:

A.

Reduce the stripe count for large files to decrease parallelism across storage targets.

B.

Separate metadata-intensive operations and large sequential I/O operations by using different storage pools for each type of workload.

C.

Increase the number of Object Storage Targets (OSTs) to handle more metadata operations.

D.

Disable GPUDirect Storage (GDS) during peak hours to reduce I/O load on the Lustre file system.

Question 3

What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?

Options:

A.

Limit the number of GPUs used in the system to reduce congestion.

B.

Increase the system's RAM capacity to improve communication speed.

C.

Disable InfiniBand to reduce network complexity.

D.

Verify the configuration of NCCL or NVSHMEM.
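For option D, a common first check is NCCL's debug output together with the bandwidth reported by the nccl-tests suite. A hedged sketch (the binary path and GPU count are illustrative):

```shell
# Surface NCCL's transport selection (NVLink, InfiniBand, or plain
# sockets) in the job's logs before launching the training run.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Benchmark collective bandwidth with nccl-tests (built separately
# from github.com/NVIDIA/nccl-tests); -g 8 assumes 8 local GPUs.
./build/all_reduce_perf -b 8M -e 256M -f 2 -g 8
```

If the debug log shows NCCL falling back to sockets instead of InfiniBand or NVLink, the fabric configuration is the likely culprit.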

Question 4

A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.

Why would generating core dumps be a critical step in troubleshooting this issue?

Options:

A.

Core dumps prevent future crashes by stopping any further execution of the faulty process.

B.

Core dumps provide real-time logs that can be used to monitor ongoing application performance.

C.

Core dumps restore the process to its previous state, often fixing the error-causing crash.

D.

Core dumps capture the memory state of the process at the time of the crash.
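To act on option D, core dumps must actually be enabled for the container. A minimal sketch, assuming a crashing image named my-crashing-image and a host dump directory of /tmp/cores (both hypothetical):

```shell
# core_pattern is a host-wide kernel setting shared with containers;
# write dumps into /tmp/cores, tagged with executable name and PID.
echo '/tmp/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# Lift the core-file size limit for the container and bind-mount the
# dump directory so the files survive the container's exit.
docker run --ulimit core=-1 \
  --mount type=bind,source=/tmp/cores,target=/tmp/cores \
  my-crashing-image

# Analyze a dump with gdb against the matching binary.
gdb /path/to/binary /tmp/cores/core.<name>.<pid>
```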

Question 5

A cloud engineer is looking to provision a virtual machine for machine learning using the NVIDIA Virtual Machine Image (VMI) and RAPIDS.

What technology stack will be set up for the development team automatically when the VMI is deployed?

Options:

A.

Ubuntu Server, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI, NVIDIA Driver

B.

CentOS, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI

C.

Ubuntu Server, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI, NVIDIA Driver, RAPIDS

D.

Ubuntu Server, Docker-CE, NVIDIA Container Toolkit, CSP CLI, NGC CLI

Question 6

A system administrator wants to run these two commands in Base Command Manager.

main showprofile

device status apc01

What command should the system administrator use from the management node system shell?

Options:

A.

cmsh -c "main showprofile; device status apc01"

B.

cmsh -p "main showprofile; device status apc01"

C.

system -c "main showprofile; device status apc01"

D.

cmsh-system -c "main showprofile; device status apc01"
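The -c flag passes a semicolon-separated command string to cmsh non-interactively, so option A runs both commands directly from the management node shell:

```shell
# One non-interactive cmsh invocation running two commands:
# "main showprofile" prints the active profile, then
# "device status apc01" reports the status of device apc01.
cmsh -c "main showprofile; device status apc01"
```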

Question 7

You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.

How can you configure NVIDIA Fleet Command to achieve this?

Options:

A.

Use Secure NFS support for data redundancy.

B.

Set up over-the-air updates to automatically restart failed applications.

C.

Enable high availability for edge clusters.

D.

Configure Fleet Command's multi-instance GPU (MIG) to handle failover.

Question 8

Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?

Options:

A.

The control plane consists of the kube-apiserver, etcd, kube-scheduler, and kube-controller-manager, while worker nodes run kubelet and kube-proxy.

B.

Worker nodes manage the kube-apiserver and etcd, while the control plane handles all container runtimes.

C.

The control plane is responsible for running all application containers, while worker nodes manage network traffic through etcd.

D.

The control plane includes the kubelet and kube-proxy, and worker nodes are responsible for running etcd and the scheduler.

Question 9

You are tasked with deploying a deep learning framework container from NVIDIA NGC on a stand-alone GPU-enabled server.

What must you complete before pulling the container? (Choose two.)

Options:

A.

Install Docker and the NVIDIA Container Toolkit on the server.

B.

Set up a Kubernetes cluster to manage the container.

C.

Install TensorFlow or PyTorch manually on the server before pulling the container.

D.

Generate an NGC API key and log in to the NGC container registry using docker login.
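Options A and D together cover the typical prerequisite flow. A sketch of the registry login and pull (the container tag is illustrative):

```shell
# Authenticate to the NGC registry; the username is literally the
# string '$oauthtoken' and the password is your NGC API key.
docker login nvcr.io --username '$oauthtoken'

# With Docker and the NVIDIA Container Toolkit installed, pull a
# framework container and confirm GPU access inside it.
docker pull nvcr.io/nvidia/pytorch:24.05-py3   # tag is illustrative
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.05-py3 nvidia-smi
```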

Question 10

Your Kubernetes cluster is running a mixture of AI training and inference workloads. You want to ensure that inference services have higher priority over training jobs during peak resource usage times.

How would you configure Kubernetes to prioritize inference workloads?

Options:

A.

Increase the number of replicas for inference services so they always have more resources than training jobs.

B.

Set up a separate namespace for inference services and limit resource usage in other namespaces.

C.

Use Horizontal Pod Autoscaling (HPA) based on memory usage to scale up inference services during peak times.

D.

Implement ResourceQuotas and PriorityClasses to assign higher priority and resource guarantees to inference workloads over training jobs.
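A minimal sketch of option D using kubectl directly (class names, values, and the training namespace are illustrative):

```shell
# Create a high-priority class for inference and a lower one for
# training; pods reference them via spec.priorityClassName.
kubectl create priorityclass inference-high --value=100000 \
  --description="Inference pods take precedence over training pods"
kubectl create priorityclass training-low --value=1000

# Cap aggregate training resources in its namespace with a quota.
kubectl create quota training-quota -n training \
  --hard=requests.cpu=64,requests.memory=256Gi
```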

Question 11

You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.

What would be the first step to troubleshoot this issue?

Options:

A.

Verify that the NVIDIA GPU Operator is installed and running on the cluster.

B.

Ensure that all pods are using the latest version of TensorFlow or PyTorch.

C.

Check if the nodes have sufficient memory allocated for AI workloads.

D.

Increase the number of CPU cores allocated to each pod to ensure better resource utilization.
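A quick way to perform the check in option A (the namespace name assumes a default GPU Operator install, and the node name is illustrative):

```shell
# Confirm the GPU Operator pods (driver, container toolkit, device
# plugin) are all in the Running state.
kubectl get pods -n gpu-operator

# Verify the node actually advertises GPUs as an allocatable resource.
kubectl describe node <node-name> | grep -i nvidia.com/gpu
```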

Question 12

A Slurm user frequently encounters jobs that get stuck in the “PENDING” state and never progress to the “RUNNING” state.

Which Slurm command can help the user identify the reason for the job’s pending status?

Options:

A.

sinfo -R

B.

scontrol show job

C.

sacct -j

D.

squeue -u
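Option B prints a Reason field that explains the pending state. A sketch (the job ID is illustrative):

```shell
# Show full details for job 12345; the Reason= field explains why it
# is pending (e.g. Resources, Priority, or a QOS limit).
scontrol show job 12345

# squeue can also print the reason column for a quick overview:
# job ID, partition, state, and reason/nodelist.
squeue -u $USER -o "%.10i %.9P %.8T %R"
```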

Question 13

Which two platforms should be used with Fabric Manager? (Choose two.)

Options:

A.

HGX

B.

L40S Certified

C.

GeForce Series

D.

DGX

Question 14

You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.

What is a key requirement for enabling cloudbursting across multiple cloud providers?

Options:

A.

You only need to configure credentials for one cloud provider, as BCM will automatically replicate them across other providers.

B.

You need to set up a single set of credentials that works across both AWS and Azure for seamless integration.

C.

You must configure separate credentials for each cloud provider in BCM to enable their use in the cluster extension process.

D.

BCM automatically detects and configures credentials for all supported cloud providers without requiring admin input.

Question 15

A system administrator needs to scale a Kubernetes Job to 4 replicas.

What command should be used?

Options:

A.

kubectl stretch job --replicas=4

B.

kubectl autoscale deployment job --min=1 --max=10

C.

kubectl scale job --replicas=4

D.

kubectl scale job -r 4
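The correct syntax takes the Job name plus a --replicas flag. Note that on recent Kubernetes releases Job scaling is expressed through .spec.parallelism rather than the scale subresource, so the sketch shows both forms (the Job name train-job is illustrative):

```shell
# Exam-style form: scale the Job named train-job to 4 replicas.
kubectl scale job train-job --replicas=4

# Equivalent on current Kubernetes, where Jobs use parallelism:
kubectl patch job train-job -p '{"spec":{"parallelism":4}}'
```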

Question 16

A GPU administrator needs to virtualize AI/ML training in an HGX environment.

How can the NVIDIA Fabric Manager be used to meet this demand?

Options:

A.

Video encoding acceleration

B.

Enhance graphical rendering

C.

Manage NVLink and NVSwitch resources

D.

GPU memory upgrade

Question 17

You are configuring networking for a new AI cluster in your data center. The cluster will handle large-scale distributed training jobs that require fast communication between servers.

What type of networking architecture can maximize performance for these AI workloads?

Options:

A.

Implement a leaf-spine network topology using standard Ethernet switches to ensure scalability as more nodes are added.

B.

Prioritize out-of-band management networks over compute networks to ensure efficient job scheduling across nodes.

C.

Use standard Ethernet networking with a focus on increasing bandwidth through multiple connections per server.

D.

Use InfiniBand networking to provide low-latency, high-throughput communication between servers in the cluster.

Question 18

An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams of a specific container.

What command should be used?

Options:

A.

docker top CONTAINER-NAME

B.

docker stats CONTAINER-NAME

C.

docker logs CONTAINER-NAME

D.

docker inspect CONTAINER-NAME
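docker logs streams the I/O the container has written to its captured streams. Common variations (the container name is illustrative):

```shell
# Dump everything the container has written to STDOUT/STDERR so far.
docker logs my-container

# Follow the streams live, showing only the most recent 100 lines,
# with a timestamp prepended to each line.
docker logs --follow --tail 100 --timestamps my-container
```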

Question 19

What must be done before installing new versions of DOCA drivers on a BlueField DPU?

Options:

A.

Uninstall any previous versions of DOCA drivers.

B.

Re-flash the firmware every time.

C.

Disable network interfaces during installation.

D.

Reboot the host system.
