Weekend Sale Special Limited Time Flat 70% Discount offer - Ends in 0d 00h 00m 00s - Coupon code: 70spcl

Databricks Databricks-Certified-Data-Engineer-Associate Databricks Certified Data Engineer Associate Exam Exam Practice Test

Page: 1 / 10
Total 99 questions

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

Which of the following tools is used by Auto Loader process data incrementally?

Options:

A.

Checkpointing

B.

Spark Structured Streaming

C.

Data Explorer

D.

Unity Catalog

E.

Databricks SQL

Question 2

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS

SELECT customer_id -

FROM STREAM(LIVE.customers)

WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

Options:

A.

The STREAM function is not needed and will cause an error.

B.

The table being created is a live table.

C.

The customers table is a streaming live table.

D.

The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.

E.

The data in the customers table has been updated since its last run.

Question 3

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The table is configured to run in Development mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

Options:

A.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.

C.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

D.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

E.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

Question 4

A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.

Which of the following approaches could be used by the data engineering team to complete this task?

Options:

A.

They could submit a feature request with Databricks to add this functionality.

B.

They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.

C.

They could only run the entire program on Sundays.

D.

They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.

E.

They could redesign the data model to separate the data used in the final query into a new table.

Question 5

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The cade block used by the data engineer is below:

Question # 5

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

Options:

A.

trigger("5 seconds")

B.

trigger()

C.

trigger(once="5 seconds")

D.

trigger(processingTime="5 seconds")

E.

trigger(continuous="5 seconds")

Question 6

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

Options:

A.

None of these changes will need to be made

B.

The pipeline will need to stop using the medallion-based multi-hop architecture

C.

The pipeline will need to be written entirely in SQL

D.

The pipeline will need to use a batch source in place of a streaming source

E.

The pipeline will need to be written entirely in Python

Question 7

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

Options:

A.

They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.

B.

They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.

C.

They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.

D.

They can schedule the query to run every 1 day from the Jobs UI.

E.

They can schedule the query to run every 12 hours from the Jobs UI.

Question 8

Which file format is used for storing Delta Lake Table?

Options:

A.

Parquet

B.

Delta

C.

SV

D.

JSON

Question 9

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".

Today, the data engineer runs the following command to complete this task:

Question # 9

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

Which of the following describes why the statement might not have copied any new records into the table?

Options:

A.

The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.

B.

The names of the files to be copied were not included with the FILES keyword.

C.

The previous day’s file has already been copied into the table.

D.

The PARQUET file format does not support COPY INTO.

E.

The COPY INTO statement requires the table to be refreshed to view the copied rows.

Question 10

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

Question # 10

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

Options:

A.

processingTime(1)

B.

trigger(availableNow=True)

C.

trigger(parallelBatch=True)

D.

trigger(processingTime="once")

E.

trigger(continuous="once")

Question 11

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

Options:

A.

The table was managed

B.

The table's data was smaller than 10 GB

C.

The table's data was larger than 10 GB

D.

The table was external

E.

The table did not have a location

Question 12

A data engineer has been given a new record of data:

id STRING = 'a1'

rank INTEGER = 6

rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

Options:

A.

INSERT INTO my_table VALUES ('a1', 6, 9.4)

B.

my_table UNION VALUES ('a1', 6, 9.4)

C.

INSERT VALUES ( 'a1' , 6, 9.4) INTO my_table

D.

UPDATE my_table VALUES ('a1', 6, 9.4)

E.

UPDATE VALUES ('a1', 6, 9.4) my_table

Question 13

Which of the following is hosted completely in the control plane of the classic Databricks architecture?

Options:

A.

Worker node

B.

JDBC data source

C.

Databricks web application

D.

Databricks Filesystem

E.

Driver node

Question 14

Which of the following Git operations must be performed outside of Databricks Repos?

Options:

A.

Commit

B.

Pull

C.

Push

D.

Clone

E.

Merge

Question 15

A data engineer has left the organization. The data team needs to transfer ownership of the data engineer’s Delta tables to a new data engineer. The new data engineer is the lead engineer on the data team.

Assuming the original data engineer no longer has access, which of the following individuals must be the one to transfer ownership of the Delta tables in Data Explorer?

Options:

A.

Databricks account representative

B.

This transfer is not possible

C.

Workspace administrator

D.

New lead data engineer

E.

Original data engineer

Question 16

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

Options:

A.

When they are working interactively with a small amount of data

B.

When they are running automated reports to be refreshed as quickly as possible

C.

When they are working with SQL within Databricks SQL

D.

When they are concerned about the ability to automatically scale with larger data

E.

When they are manually running reports with a large amount of data

Question 17

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

Question # 17

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

Options:

A.

Replace predict with a stream-friendly prediction function

B.

Replace schema(schema) with option ("maxFilesPerTrigger", 1)

C.

Replace "transactions" with the path to the location of the Delta table

D.

Replace format("delta") with format("stream")

E.

Replace spark.read with spark.readStream

Question 18

A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.

Which of the following explains why the data files are no longer present?

Options:

A.

The VACUUM command was run on the table

B.

The TIME TRAVEL command was run on the table

C.

The DELETE HISTORY command was run on the table

D.

The OPTIMIZE command was nun on the table

E.

The HISTORY command was run on the table

Question 19

A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case.

Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?

Options:

A.

They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.

B.

They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.

C.

They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.

D.

There is no way to determine why a Job task is running slowly.

E.

They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.

Question 20

A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.

Which of the following relational objects should the data engineer create?

Options:

A.

Spark SQL Table

B.

View

C.

Database

D.

Temporary view

E.

Delta Table

Question 21

A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.

Which of the following describes how a data lakehouse could alleviate this issue?

Options:

A.

Both teams would autoscale their work as data size evolves

B.

Both teams would use the same source of truth for their work

C.

Both teams would reorganize to report to the same department

D.

Both teams would be able to collaborate on projects in real-time

E.

Both teams would respond more quickly to ad-hoc requests

Question 22

A data engineer is working with two tables. Each of these tables is displayed below in its entirety.

Question # 22

The data engineer runs the following query to join these tables together:

Question # 22

Which of the following will be returned by the above query?

Question # 22

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

Question 23

Which of the following commands will return the location of database customer360?

Options:

A.

DESCRIBE LOCATION customer360;

B.

DROP DATABASE customer360;

C.

DESCRIBE DATABASE customer360;

D.

ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user'};

E.

USE DATABASE customer360;

Question 24

A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.

Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?

Options:

A.

if day_of_week = 1 and review_period:

B.

if day_of_week = 1 and review_period = "True":

C.

if day_of_week == 1 and review_period == "True":

D.

if day_of_week == 1 and review_period:

E.

if day_of_week = 1 & review_period: = "True":

Question 25

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

Options:

A.

An automated report needs to be refreshed as quickly as possible.

B.

An automated report needs to be made reproducible.

C.

An automated report needs to be tested to identify errors.

D.

An automated report needs to be version-controlled across multiple collaborators.

E.

An automated report needs to be runnable by all stakeholders.

Question 26

Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?

Options:

A.

Manually programming in an alert system in each cell of the Notebook

B.

Setting up an Alert in the Job page

C.

Setting up an Alert in the Notebook

D.

There is no way to notify the Job owner in the case of Job failure

E.

MLflow Model Registry Webhooks

Question 27

Which of the following commands will return the number of null values in the member_id column?

Options:

A.

SELECT count(member_id) FROM my_table;

B.

SELECT count(member_id) - count_null(member_id) FROM my_table;

C.

SELECT count_if(member_id IS NULL) FROM my_table;

D.

SELECT null(member_id) FROM my_table;

E.

SELECT count_null(member_id) FROM my_table;

Question 28

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which change will need to be made to the pipeline when migrating to Delta Live Tables?

Options:

A.

The pipeline can have different notebook sources in SQL & Python.

B.

The pipeline will need to be written entirely in SQL.

C.

The pipeline will need to be written entirely in Python.

D.

The pipeline will need to use a batch source in place of a streaming source.

Question 29

What is stored in a Databricks customer's cloud account?

Options:

A.

Data

B.

Cluster management metadata

C.

Databricks web application

D.

Notebooks

Page: 1 / 10
Total 99 questions