
Databricks Certified Data Engineer Associate Exam Practice Test

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

A data engineer wants to create a new table containing the names of customers that live in France.

They have written the following command:

(The command, which contains a blank to fill in, is shown as an image and is not reproduced here.)

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

A.

There is no way to indicate whether a table contains PII.

B.

" COMMENT PII "

C.

TBLPROPERTIES PII

D.

COMMENT " Contains PII "

E.

PII
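
For reference, both mechanisms mentioned in the options can be attached when the table is created. A minimal sketch, assuming a hypothetical customers source table; the names and the property key are illustrative:

    # Hypothetical CREATE TABLE AS SELECT with a comment and an explicit table property
    spark.sql("""
        CREATE TABLE customers_in_france
        COMMENT "Contains PII"
        TBLPROPERTIES ('contains_pii' = 'true')
        AS SELECT customer_name FROM customers WHERE country = 'France'
    """)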

Question 2

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

Options:

A.

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."

B.

They can turn on the Auto Stop feature for the SQL endpoint.

C.

They can increase the cluster size of the SQL endpoint.

D.

They can turn on the Serverless feature for the SQL endpoint.

E.

They can increase the maximum bound of the SQL endpoint's scaling range.

Question 3

In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?

Options:

A.

When the location of the data needs to be changed

B.

When the target table is an external table

C.

When the source table can be deleted

D.

When the target table cannot contain duplicate records

E.

When the source is not a Delta table
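
For context, MERGE INTO can update matching records in place instead of appending, which is what keeps a target free of duplicates. A minimal sketch, assuming hypothetical target and updates tables keyed on id:

    # Upsert: matched rows are updated, unmatched rows are inserted
    spark.sql("""
        MERGE INTO target t
        USING updates u
        ON t.id = u.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)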

Question 4

Which two components function in the Databricks platform architecture's control plane? (Choose two.)

Options:

A.

Virtual Machines

B.

Compute Orchestration

C.

Serverless Compute

D.

Compute

E.

Unity Catalog

Question 5

Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

Options:

A.

Parquet files can be partitioned

B.

CREATE TABLE AS SELECT statements cannot be used on files

C.

Parquet files have a well-defined schema

D.

Parquet files have the ability to be optimized

E.

Parquet files will become Delta tables
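
The difference can be seen in a CTAS over each format. A sketch with illustrative paths; Parquet files carry their schema and types in the file metadata, while CSV requires inference or explicit options:

    # CTAS from Parquet: schema and types come from the file metadata
    spark.sql("CREATE TABLE sales AS SELECT * FROM parquet.`/data/sales_parquet/`")

    # CTAS from CSV: schema must be inferred or declared through options
    spark.sql("CREATE TABLE sales_from_csv AS SELECT * FROM csv.`/data/sales_csv/`")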

Question 6

A data engineer needs to ingest from both streaming and batch sources for a firm that relies on highly accurate data. Occasionally, some of the data picked up by the sensors that provide a streaming input are outside the expected parameters. If this occurs, the data must be dropped, but the stream should not fail.

Which feature of Delta Live Tables meets this requirement?

Options:

A.

Monitoring

B.

Change Data Capture

C.

Expectations

D.

Error Handling
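
In Delta Live Tables, this drop-without-failing behavior is declared on the dataset. A minimal Python sketch, assuming a hypothetical sensor_raw source table and reading column:

    import dlt

    @dlt.table
    @dlt.expect_or_drop("within_expected_range", "reading BETWEEN 0 AND 100")
    def sensor_clean():
        # Rows that violate the expectation are dropped; the stream keeps running
        return spark.readStream.table("sensor_raw")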

Question 7

Which of the following must be specified when creating a new Delta Live Tables pipeline?

Options:

A.

A key-value pair configuration

B.

The preferred DBU/hour cost

C.

A path to cloud storage location for the written data

D.

A location of a target database for the written data

E.

At least one notebook library to be executed

Question 8

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which location can the data engineer review their permissions on the table?

Options:

A.

Jobs

B.

Dashboards

C.

Catalog Explorer

D.

Repos
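
Permissions can also be checked from SQL, which is convenient inside a pipeline. A sketch, assuming a hypothetical three-level table name:

    # List the grants on a table (Unity Catalog)
    spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()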

Question 9

A data engineer is attempting to write Python and SQL in the same command cell and is running into an error. The engineer thought it was possible to use a Python variable in a SELECT statement.

Why does the command fail?

Options:

A.

Databricks supports multiple languages but only one per notebook.

B.

Databricks supports language interoperability in the same cell but only between Scala and SQL

C.

Databricks supports language interoperability but only if a special character is used.

D.

Databricks supports one language per cell.
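
A common workaround is to keep the cell in one language and interpolate the Python variable into a spark.sql call. A minimal sketch; the variable and table names are illustrative:

    country = "France"  # hypothetical Python variable
    df = spark.sql(f"SELECT * FROM customers WHERE country = '{country}'")
    df.show()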

Question 10

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

Options:

A.

Records that violate the expectation cause the job to fail.

B.

Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

C.

Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

D.

Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

Question 11

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

Options:

A.

Silver tables contain a less refined, less clean view of data than Bronze data.

B.

Silver tables contain aggregates while Bronze data is unaggregated.

C.

Silver tables contain more data than Bronze tables.

D.

Silver tables contain a more refined and cleaner view of data than Bronze tables.

E.

Silver tables contain less data than Bronze tables.

Question 12

A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).

Which of the following code blocks creates this SQL UDF?

Options:

A.

B.

C.

D.

E.
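
The option code blocks are not reproduced in this extract. For reference, a SQL UDF generally takes the following shape; this is a sketch with an illustrative function body:

    # Hypothetical SQL UDF applying custom logic to a string column
    spark.sql("""
        CREATE OR REPLACE FUNCTION clean_city(city STRING)
        RETURNS STRING
        RETURN initcap(trim(city))
    """)
    spark.sql("SELECT clean_city(city) FROM stores").show()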

Question 13

A data engineer is writing a script to ingest new data from cloud storage. In the event of a schema change, the ingestion should fail, and it should keep failing until the schema changes in the source can be found and verified as intended.

Which command will meet the requirements?

Options:

A.

addNewColumns

B.

failOnNewColumns

C.

rescue

D.

none
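
These option values correspond to Auto Loader's schema evolution modes. A sketch of how such a mode is set, with illustrative paths:

    # failOnNewColumns: the stream fails on a schema change and keeps failing
    # until the new schema is reviewed and accepted
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/tmp/schemas/ingest")  # illustrative
          .option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
          .load("/input/path"))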

Question 14

A Python file is ready to go into production, and the client wants the cheapest but most efficient type of cluster possible. The workload is quite small, processing only 10 GB of data with simple joins and no complex aggregations or wide transformations.

Which cluster meets the requirement?

Options:

A.

Job cluster with Photon enabled

B.

Interactive cluster

C.

Job cluster with spot instances disabled

D.

Job cluster with spot instances enabled

Question 15

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

Options:

A.

The table was managed

B.

The table's data was smaller than 10 GB

C.

The table's data was larger than 10 GB

D.

The table was external

E.

The table did not have a location
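
The managed-versus-external distinction is fixed at creation time. A sketch with an illustrative storage URI; dropping the first table removes its files, while dropping the second leaves the files at the external location:

    spark.sql("CREATE TABLE managed_example (id INT)")  # files live in managed storage
    spark.sql("CREATE TABLE external_example (id INT) LOCATION 's3://bucket/path'")  # illustrative URI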

Question 16

A data engineer who is new to Python needs to create a Python function that adds two integers together and returns the sum.

Which of the following code blocks can the data engineer use to complete this task?

(The five candidate code blocks, labeled A through E, are shown as images and are not reproduced here.)

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E
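
Since the option images are not reproduced, here is a minimal sketch of what such a function looks like:

    def add_integers(x: int, y: int) -> int:
        # Return the sum of two integers
        return x + y

    add_integers(2, 3)  # returns 5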

Question 17

A data engineer wants to delegate day-to-day permission management for the schema main.marketing to the mkt-admins group, without making them workspace admins. They should be able to grant and revoke privileges for other users on objects within that schema.

Which approach aligns with Unity Catalog’s ownership and privilege model?

Options:

A.

Transfer ownership of the schema main.marketing to mkt-admins; owners can manage privileges on the schema and its contained objects.

B.

Grant MANAGE permissions on the metastore to mkt-admins, which allows managing privileges for all schemas and tables globally.

C.

Grant USE SCHEMA on main.marketing, and MODIFY on all tables to mkt-admins, which enables the management of grants within the schema.

D.

Make mkt-admins a workspace-level admins group, then assign SELECT on main.marketing to allow privilege delegation.
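
For context, ownership transfer in Unity Catalog is a single DDL statement. A sketch using the names from the question:

    # Make the mkt-admins group the owner of the schema
    spark.sql("ALTER SCHEMA main.marketing OWNER TO `mkt-admins`")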

Question 18

A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.

Which of the following explains why the data files are no longer present?

Options:

A.

The VACUUM command was run on the table

B.

The TIME TRAVEL command was run on the table

C.

The DELETE HISTORY command was run on the table

D.

The OPTIMIZE command was run on the table

E.

The HISTORY command was run on the table
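
For context, VACUUM permanently removes data files that are no longer referenced by the current table version once they age past the retention window, after which time travel to versions that need those files fails. A sketch; the retention period and version number are illustrative:

    spark.sql("VACUUM my_table RETAIN 168 HOURS")        # removes aged-out data files
    spark.sql("SELECT * FROM my_table VERSION AS OF 5")  # fails if its files were removed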

Question 19

A data engineer needs to conduct exploratory analysis on data residing in a database that is within the company's custom-defined network in the cloud. The data engineer is using SQL for this task.

Which type of SQL Warehouse will enable the data engineer to process large numbers of queries quickly and cost-effectively?

Options:

A.

Serverless compute for notebooks

B.

Serverless SQL Warehouse

C.

Classic SQL Warehouse

D.

Pro SQL Warehouse

Question 20

A departing platform owner currently holds ownership of multiple catalogs and controls storage credentials and external locations. The data engineer wants to ensure continuity: transfer catalog ownership to the platform team group, delegate ongoing privilege management, and retain the ability to receive and share data via Delta Sharing.

Which role must be in place to perform these actions across the metastore?

Options:

A.

Account Admin, because account admins can only create metastores but cannot change ownership of catalogs.

B.

Workspace Admin, because workspace admins can transfer ownership of any Unity Catalog object.

C.

Metastore Admin, because metastore admins can transfer ownership and manage privileges across all metastore objects, including shares and recipients.

D.

Catalog Owner, because catalog owners can transfer any object in any catalog in the metastore.

Question 21

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

(The code block used by the data engineer, which contains a blank for the trigger setting, is shown as an image and is not reproduced here.)

Which line of code should the data engineer use to fill in the blank if the data engineer only wants the query to execute a micro-batch to process data every 5 seconds?

Options:

A.

trigger( " 5 seconds " )

B.

trigger(continuous="5 seconds")

C.

trigger(once="5 seconds")

D.

trigger(processingTime="5 seconds")
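
For context, the trigger setting belongs on the streaming writer. A sketch of the surrounding write, with illustrative source, checkpoint, and target names:

    df = spark.readStream.table("source_table")  # illustrative source
    (df.writeStream
       .format("delta")
       .option("checkpointLocation", "/tmp/checkpoints/new_table")  # illustrative
       .trigger(processingTime="5 seconds")  # run a micro-batch every 5 seconds
       .toTable("new_table"))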

Question 22

Which of the following data lakehouse features results in improved data quality over a traditional data lake?

Options:

A.

A data lakehouse provides storage solutions for structured and unstructured data.

B.

A data lakehouse supports ACID-compliant transactions.

C.

A data lakehouse allows the use of SQL queries to examine data.

D.

A data lakehouse stores data in open formats.

E.

A data lakehouse enables machine learning and artificial Intelligence workloads.

Question 23

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Development mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

Options:

A.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.

C.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

D.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

E.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

Question 24

A data engineer is using the OPTIMIZE command on a Delta table. What happens when OPTIMIZE is run twice on the same table with the same data?

Options:

A.

It further reduces file sizes by re-clustering the data

B.

Triggers a full liquid clustering process

C.

Changes the number of tuples per file significantly

D.

It has no effect because it is idempotent.
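
A rerun needs no special arguments. A minimal illustration, assuming a hypothetical table name:

    spark.sql("OPTIMIZE my_table")  # compacts small files
    spark.sql("OPTIMIZE my_table")  # rerun on unchanged data: no files qualify, nothing is rewritten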

Question 25

A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.

They run the following command:

(The command, which contains a blank to fill in, is shown as an image and is not reproduced here.)

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

A.

None of these lines of code are needed to successfully complete the task

B.

USING CSV

C.

FROM CSV

D.

USING DELTA

E.

FROM " path/to/csv "

Question 26

Which compute option should be chosen in a scenario where small-scale ad hoc Python scripts need to be run at high frequency and should wind down quickly after these queries have finished running?

Options:

A.

All-purpose cluster

B.

Job cluster

C.

Serverless compute

D.

SQL Warehouse

Question 27

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

Options:

A.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

B.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

C.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

D.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

E.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

Question 28

Identify how the count_if function and the count function behave when a value is NULL.

Consider a table random_values with a single column col1 containing the following values:

0
1
2
NULL
2
3

What would be the output of the query below?

    SELECT count_if(col1 > 1) AS count_a, count(*) AS count_b, count(col1) AS count_c FROM random_values

Options:

A.

3 6 5

B.

4 6 5

C.

3 6 6

D.

4 6 6
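
The behaviors being tested can be reproduced directly: count_if counts only rows matching the predicate, count(*) counts every row including NULLs, and count(col1) skips NULLs. A sketch that rebuilds the table as a temp view and runs the query:

    spark.sql("""
        CREATE OR REPLACE TEMP VIEW random_values AS
        SELECT * FROM VALUES (0), (1), (2), (CAST(NULL AS INT)), (2), (3) AS t(col1)
    """)
    spark.sql("""
        SELECT count_if(col1 > 1) AS count_a, count(*) AS count_b, count(col1) AS count_c
        FROM random_values
    """).show()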

Question 29

A data engineer is working on a Databricks project that utilizes cloud storage. The data engineer wants to load several JSON files from containers in a storage account as soon as each file arrives in the storage account.

Which syntax should the data engineer follow to first load the files into a DataFrame and check that it is working as expected using Python?

Options:

A.

df = spark.readStream.format( " json " ).load( " input/path " )

B.

df = spark.readStream.format( " cloud " ),option( " json " ).load( " /input/path " )

C.

df = spark.readStream.format( " cloudFiles " ) .option( " cloudFiles.format " , " json " ) .load( " /input/path " )

D.

df = spark.read.json( " inp i./path " )

Question 30

A data engineer wants to create an external table in Databricks that references data stored in an Azure Data Lake Storage (ADLS) location. The goal is to enable Databricks to access and query this external data without moving it into Databricks-managed storage.

Which step should the data engineer take to successfully create the external table?

Options:

A.

Use the CREATE TABLE statement and specify the LOCATION clause with the path to the external data.

B.

Use the CREATE UNMANAGED TABLE statement without specifying a LOCATION clause.

C.

Use the CREATE EXTERNAL TABLE statement without specifying a LOCATION clause.

D.

Use the CREATE MANAGED TABLE statement and specify the LOCATION clause with the path to the external data.

Question 31

What is the maximum notebook output size supported by a job cluster before the notebook fails?

Options:

A.

10 MB

B.

25 MB

C.

30 MB

D.

15 MB

Question 32

Which Databricks Asset Bundle format is valid?

Options:

A.

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          existing_cluster_id: 1234-567890-abcde123
          notebook_task:
            notebook_path: ./hello.py

B.

{
  "resources": {
    "jobs": {
      "name": "hello-job",
      "tasks": {
        "task_key": "hello-task",
        "existing_cluster_id": "1234-567890-abcde123",
        "notebook_task": {
          "notebook_path": "./hello.py"
        }
      }
    }
  }
}

C.

configuration = {
  "resources": {
    "jobs": {
      "name": "hello-job",
      "tasks": {
        "task_key": "hello-task",
        "existing_cluster_id": "1234-567890-abcde123",
        "notebook_task": {
          "notebook_path": "./hello.py"
        }
      }
    }
  }
}

D.

resources {
  jobs {
    name = "hello-job"
    tasks {
      task_key = "hello-task"
      existing_cluster_id = "1234-567890-abcde123"
      notebook_task {
        notebook_path = "./hello.py"
      }
    }
  }
}

Question 33

A data engineer is working in a Python notebook on Databricks to process data, but notices that the output is not as expected. The data engineer wants to investigate the issue by stepping through the code and checking the values of certain variables during execution.

Which tool should the data engineer use to inspect the code execution and variables in real-time?

Options:

A.

Python Notebook Interactive Debugger

B.

Cluster Logs

C.

SQL Analytics

D.

Job Execution Dashboard

Question 34

In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

Options:

A.

When another task needs to be replaced by the new task

B.

When another task needs to fail before the new task begins

C.

When another task has the same dependency libraries as the new task

D.

When another task needs to use as little compute resources as possible

E.

When another task needs to successfully complete before the new task begins

Question 35

A data engineer is managing a data pipeline in Databricks, where multiple Delta tables are used for various transformations. The team wants to track how data flows through the pipeline, including identifying dependencies between Delta tables, notebooks, jobs, and dashboards. The data engineer is utilizing the Unity Catalog lineage feature to monitor this process.

How does Unity Catalog’s data lineage feature support the visualization of relationships between Delta tables, notebooks, jobs, and dashboards?

Options:

A.

Unity Catalog lineage visualizes dependencies between Delta tables, notebooks, and jobs, but does not provide column-level tracing or relationships with dashboards.

B.

Unity Catalog lineage only supports visualizing relationships at the table level and does not extend to notebooks, jobs, or dashboards.

C.

Unity Catalog lineage provides an interactive graph that tracks dependencies between tables and notebooks but excludes any job-related dependencies or dashboard visualizations.

D.

Unity Catalog provides an interactive graph that visualizes the dependencies between Delta tables, notebooks, jobs, and dashboards, while also supporting column-level tracking of data transformations.

Question 36

A data engineer at a company that uses Databricks with Unity Catalog needs to share a collection of tables with an external partner who also uses a Databricks workspace enabled for Unity Catalog. The data engineer decides to use Delta Sharing to accomplish this.

What is the first piece of information the data engineer should request from the external partner to set up Delta Sharing?

Options:

A.

Their Databricks account password

B.

The name of their Databricks cluster

C.

The IP address of their Databricks workspace

D.

The sharing identifier of their Unity Catalog metastore
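
For context, in Databricks-to-Databricks Delta Sharing the provider registers the partner using that identifier. A sketch; the identifier format is cloud:region:metastore-uuid, and the value below is a placeholder:

    # <metastore-uuid> is a placeholder for the partner's actual sharing identifier
    spark.sql("CREATE RECIPIENT partner_org USING ID 'aws:us-west-2:<metastore-uuid>'")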

Question 37

Which of the following is stored in the Databricks customer's cloud account?

Options:

A.

Databricks web application

B.

Cluster management metadata

C.

Repos

D.

Data

E.

Notebooks

Question 38

A data engineering project involves processing large batches of data on a daily schedule using ETL. The jobs are resource-intensive and vary in size, requiring a scalable, cost-efficient compute solution that can automatically scale based on the workload.

Which compute approach will satisfy the needs described?

Options:

A.

Databricks SQL Serverless

B.

Dedicated Cluster

C.

All-Purpose Cluster

D.

Job Cluster

Question 39

A data engineer is working on a personal laptop and needs to perform complex transformations on data stored in a Delta Lake on cloud storage. The engineer decides to use Databricks Connect to interact with Databricks clusters and work in their local IDE.

How does Databricks Connect enable the engineer to develop, test, and debug code seamlessly on their local machine while interacting with Databricks clusters?

Options:

A.

By allowing direct execution of Spark jobs from the local machine without needing a network connection

B.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using a specific IDE that is required by Databricks

C.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using their preferred IDE

D.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code only through Databricks' own web interface
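
A minimal Databricks Connect sketch, assuming the databricks-connect package is installed and workspace credentials are configured locally; the sample table name is illustrative:

    from databricks.connect import DatabricksSession

    # A Spark session whose work executes on a remote Databricks cluster
    spark = DatabricksSession.builder.getOrCreate()
    spark.read.table("samples.nyctaxi.trips").limit(5).show()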

Question 40

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

(The code block is shown as an image and is not reproduced here.)

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

Options:

A.

Replace predict with a stream-friendly prediction function

B.

Replace schema(schema) with option("maxFilesPerTrigger", 1)

C.

Replace " transactions " with the path to the location of the Delta table

D.

Replace format( " delta " ) with format( " stream " )

E.

Replace spark.read with spark.readStream
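
For context, the batch and streaming readers differ mainly in the entry point, which is what the change targets. A minimal sketch:

    batch_df = spark.read.table("transactions")         # one-time batch read
    stream_df = spark.readStream.table("transactions")  # incremental stream source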

Question 41

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

Options:

A.

Databricks Filesystem

B.

Jobs

C.

Dashboards

D.

Repos

E.

Data Explorer

Question 42

A data engineer is developing a small proof of concept in a notebook. When running the entire notebook, cluster usage spikes. The data engineer wants to keep the development experience and get real-time results.

Which cluster meets these requirements?

Options:

A.

All-Purpose Cluster with a large fixed memory size

B.

All-Purpose Cluster with autoscaling

C.

Job Cluster with autoscaling enabled

D.

Job Cluster with Photon enabled and autoscaling

Question 43

A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.

Which of the following describes how a data lakehouse could alleviate this issue?

Options:

A.

Both teams would autoscale their work as data size evolves

B.

Both teams would use the same source of truth for their work

C.

Both teams would reorganize to report to the same department

D.

Both teams would be able to collaborate on projects in real-time

E.

Both teams would respond more quickly to ad-hoc requests

Question 44

Which of the following SQL keywords can be used to convert a table from a long format to a wide format?

Options:

A.

PIVOT

B.

CONVERT

C.

WHERE

D.

TRANSFORM

E.

SUM
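
A sketch of long-to-wide reshaping with PIVOT, assuming a hypothetical sales table with quarter and amount columns:

    spark.sql("""
        SELECT * FROM sales
        PIVOT (SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
    """).show()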

Question 45

What is the functionality of Auto Loader in Databricks?

Options:

A.

Auto Loader automatically ingests and processes new files from cloud storage, handling batch data with support for schema evolution.

B.

Auto Loader automatically ingests and processes new files from cloud storage, handling only streaming data with no support for schema evolution.

C.

Auto Loader automatically ingests and processes new files from cloud storage, handling batch and streaming data with no support for schema evolution.

D.

Auto Loader automatically ingests and processes new files from cloud storage, handling both batch and streaming data with support for schema evolution.

Question 46

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

Options:

A.

They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.

B.

They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.

C.

They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.

D.

They can schedule the query to run every 1 day from the Jobs UI.

E.

They can schedule the query to run every 12 hours from the Jobs UI.

Question 47

Which of the following tools is used by Auto Loader to process data incrementally?

Options:

A.

Checkpointing

B.

Spark Structured Streaming

C.

Data Explorer

D.

Unity Catalog

E.

Databricks SQL

Question 48

A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.

Which of the following is an advantage of using Databricks Repos over the Databricks Notebooks versioning?

Options:

A.

Databricks Repos automatically saves development progress

B.

Databricks Repos supports the use of multiple branches

C.

Databricks Repos allows users to revert to previous versions of a notebook

D.

Databricks Repos provides the ability to comment on specific changes

E.

Databricks Repos is wholly housed within the Databricks Lakehouse Platform

Question 49

Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?

Options:

A.

Manually programming in an alert system in each cell of the Notebook

B.

Setting up an Alert in the Job page

C.

Setting up an Alert in the Notebook

D.

There is no way to notify the Job owner in the case of Job failure

E.

MLflow Model Registry Webhooks

Question 50

A team creates YAML manifests that declare jobs, resources, and dependencies, then deploys them to Databricks using the Databricks CLI. The deployment succeeds.

Which feature are they using?

Options:

A.

Databricks Asset Bundles

B.

GitHub

C.

Terraform

D.

DataOps

Question 51

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group, named team.

Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

Options:

A.

GRANT VIEW ON CATALOG customers TO team;

B.

GRANT CREATE ON DATABASE customers TO team;

C.

GRANT USAGE ON CATALOG team TO customers;

D.

GRANT CREATE ON DATABASE team TO customers;

E.

GRANT USAGE ON DATABASE customers TO team;
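
For reference, the grant itself is a single statement; a sketch using the names from the question (legacy Hive metastore privilege model):

    # USAGE lets the group reference objects in the database; pair it with SELECT to read them
    spark.sql("GRANT USAGE ON DATABASE customers TO `team`")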

Question 52

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.

What is the expected outcome after clicking Start to update the pipeline assuming previously unprocessed data exists and all definitions are valid?

Options:

A.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

C.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

D.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.