NVIDIA NCP-AII NVIDIA AI Infrastructure Exam Practice Test

Page: 1 / 7
Total 71 questions

NVIDIA AI Infrastructure Questions and Answers

Question 1

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

Options:

A.

Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.

B.

Connect the BMC port directly to the production network and retain default admin credentials for convenience.

C.

Leave the BMC port disconnected until after the operating system is fully configured and in production.

D.

Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

Question 2

An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?

Options:

A.

cmsh status to check HA status and active/standby roles.

B.

nvsm show health to validate GPU status on both head nodes.

C.

systemctl restart cmdaemon to force a failover test.

D.

ping to test basic connectivity.

Question 3

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

Options:

A.

Disable Secure Boot permanently in BIOS/UEFI settings.

B.

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Question 4

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A.

Inconclusive; rerun with point-to-point tests.

B.

Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.

C.

Critical failure; bus bandwidth exceeds hardware capabilities.

D.

Suboptimal performance; algorithm bandwidth should match bus bandwidth.
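
For context on the two figures in this question, the sketch below shows how the nccl-tests suite relates algorithm bandwidth to bus bandwidth for an all-reduce: bus bandwidth is algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks. The function name and the sample rank counts are illustrative assumptions, not part of the exam material.

```python
def allreduce_bus_bandwidth(alg_bw_gbs: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth (GB/s) to bus bandwidth (GB/s).

    Uses the nccl-tests convention: busbw = algbw * 2 * (n - 1) / n.
    """
    if n_ranks < 2:
        raise ValueError("an all-reduce needs at least two ranks")
    return alg_bw_gbs * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Illustrative rank counts; 350 GB/s is the algorithm bandwidth from the question.
    for ranks in (8, 16, 64):
        print(ranks, allreduce_bus_bandwidth(350.0, ranks))
```

Because the factor 2(n-1)/n approaches 2 as the rank count grows, bus bandwidth for an all-reduce is expected to be well above algorithm bandwidth rather than equal to it.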

Question 5

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

Options:

A.

A single VLAN for all types of network traffic.

B.

Two networks: one for management and one for compute.

C.

Four networks: compute, storage, out-of-band, and management.

Question 6

You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?

Options:

A.

Reboot the host system to apply the repository changes and proceed.

B.

Install the nvidia-container-toolkit package using your package manager.

C.

Format the disk to clear any existing NVIDIA-related dependencies first.

D.

Download the CUDA toolkit installer from NVIDIA's official website.

Question 7

A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?

Options:

A.

Create separate usernames for BMC and GRUB to maximize flexibility.

B.

Skip the creation of a new user and retain the default admin account for BMC and GRUB access.

C.

Create a unique, strong, lower-case username and password that will be used for both BMC and GRUB access, avoiding default or weak credentials.

D.

Use “sysadmin” as the username and a simple password for ease of management.

Question 8

When updating the firmware on an NVLink switch transceiver, how can an engineer apply new firmware without interrupting the network?

Options:

A.

mlxfwreset -d -lid 27 reset --yes to reset the transceiver

B.

Physically disconnect and reconnect the transceiver.

C.

flint -d -lid 27 --linkx --linkx_auto_update --activate

D.

nv action reboot system to force immediate activation.

Question 9

You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNICs. The BlueField-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. Which of the following steps and commands are necessary to configure the BlueField-3 devices in NIC mode for the E/W Spectrum-X fabric? (Select the two correct responses.)

Options:

A.

Use the command sudo mlxconfig -d /dev/mst/ set LINK_TYPE_P1=2 to enable Ethernet on the BlueField-3 devices.

B.

Use the command sudo mlxconfig -d /dev/mst/ set DISABLE_SPECTRUM_X=1 to reduce overhead.

C.

Use the command sudo mlxconfig -d /dev/mst/ set INTERNAL_CPU_OFFLOAD_ENGINE=1 to configure the SuperNIC to operate in NIC mode.

D.

Use the command sudo mlxconfig -d /dev/mst/ set DPU_MODE=1 to set up the BlueField-3 devices in DPU mode.

Question 10

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.

Reduce message size to decrease network utilization

C.

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Question 11

What command sequence is used to identify the exact name of the server that runs as the master SM in a multi-node fabric?

Options:

A.

sminfo, then smpquery ND

B.

ibstat, then sminfo

C.

ibnetdiscover, then ibsim

D.

sminfo, then smpquery NI

Question 12

During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?

Options:

A.

device use dgx001 ; interfaces add vlan vlan100 ; set parent bond0 ; set mode 1 ; set network internalnet

B.

device use dgx001 ; interfaces add bond bond0 ; append interfaces enp225s0f1np1 enp97s0f1np1 ; set mode 1 ; set network internalnet

C.

device use dgx001 ; interfaces set enp225s0f1np1 network internalnet ; interfaces set enp97s0f1np1 network internalnet

D.

device use dgx001 ; interfaces delete enp225s0f1np1 ; interfaces delete enp97s0f1np1

Question 13

A system administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. Which command achieves this?

Options:

A.

esxcli system module parameters set -m nvidia -p

B.

esxcli -i 0 -mig 18

C.

nvidia-smi -i 0 -mig 1

D.

mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2

Question 14

One node in a cluster is running slower than the others, and the system administrator needs to check the status of the GPUs on that system. Which command should be used?

Options:

A.

lspci | grep NVIDIA

B.

nvidia-smi

C.

nvidia-gpu-status

D.

iblinkinfo
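
As a hedged illustration of scripting a GPU status check, the sketch below parses the CSV output that nvidia-smi produces in its query mode (`--query-gpu=... --format=csv,noheader,nounits`). The chosen fields, the function names, and the sample text in the usage note are assumptions made for illustration; they are not taken from the exam material.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; this particular selection is an assumption.
QUERY_FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu"]


def parse_gpu_csv(text: str) -> list[dict[str, str]]:
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return rows


def query_gpus() -> list[dict[str, str]]:
    """Run nvidia-smi on the local node and parse its output."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `parse_gpu_csv("0, NVIDIA H100 80GB HBM3, 34, 0\n")` on that made-up line yields a single dictionary keyed by the field names, which makes it straightforward to compare temperature or utilization between a slow node and its peers.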

Question 15

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

Options:

A.

ngc config get

B.

ngc init

C.

ngc config set

D.

ngc config update

Question 16

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

Options:

A.

Reboot all leaf switches to force LLDP rediscovery.

B.

Replace all affected cables with higher-grade OM5 fiber optics.

C.

Verify LLDP data against topology files and remediate.

D.

Disable FEC on all switches to bypass neighbor validation.

Question 17

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

Options:

A.

The command output is ignored if the system powers on without errors.

B.

At least half of the GPUs report Status_Health = OK.

C.

All GPUs report Status_Health = OK and Health = OK for each device.

D.

Only the head node's GPUs need to be healthy.

Question 18

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

Options:

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

Question 19

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.

Question 20

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?

Options:

A.

The network card has no link or connection.

B.

A boot disk has failed.

C.

Multiple GPUs have failed.

D.

There are more than two failed power supplies.

Question 21

After running a 24-hour stress test on a DGX node, which two key metrics should the administrator verify to confirm system stability?

Options:

A.

Average CPU usage >80% and Docker container uptime.

B.

No thermal throttling events and consistent GPU utilization >95% throughout the test.

C.

SSD write endurance and RAM capacity.

D.

Total energy consumption and NVLink bandwidth.
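
The throttling and utilization criteria named among the options above could be evaluated from sampled telemetry roughly as follows. This is a minimal sketch: the `GpuSample` record, the function name, and the way samples are collected are all hypothetical, and the 95% threshold is simply the figure quoted in the options.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One telemetry sample for one GPU (hypothetical record layout)."""
    utilization_pct: float
    thermal_throttled: bool


def stress_run_is_stable(samples: list[GpuSample],
                         min_utilization_pct: float = 95.0) -> bool:
    """Return True when no sample shows thermal throttling and
    every sample stays above the utilization threshold."""
    if not samples:
        return False  # no data means stability cannot be confirmed
    no_throttling = not any(s.thermal_throttled for s in samples)
    steady_load = all(s.utilization_pct > min_utilization_pct for s in samples)
    return no_throttling and steady_load
```

An empty sample list is treated as a failed check so that a broken collector is not mistaken for a clean run.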
