
Databricks Certified Associate Developer for Apache Spark 3.0 (Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0) Exam Practice Test


Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following statements about DAGs is correct?

Options:

A.

DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

B.

DAG stands for "Directing Acyclic Graph".

C.

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

D.

In contrast to transformations, DAGs are never lazily executed.

E.

DAGs can be decomposed into tasks that are executed in parallel.

Question 2

Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?

Options:

A.

1.spark.udf.register("LIMIT_FCN", to_limit)

2.spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")

(Correct)

B.

1.spark.udf.register("LIMIT_FCN", to_limit)

2.spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")

C.

1.spark.udf.register("LIMIT_FCN", to_limit)

2.spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")

D.

spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")

E.

1.spark.udf.register(to_limit, "LIMIT_FCN")

2.spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")

Question 3

Which of the following DataFrame methods is classified as a transformation?

Options:

A.

DataFrame.count()

B.

DataFrame.show()

C.

DataFrame.select()

D.

DataFrame.foreach()

E.

DataFrame.first()
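Illustration (not part of the original question): a short sketch of the transformation/action split the options refer to, assuming a DataFrame named transactionsDf already exists.

# select() is a transformation: Spark only records it in the query plan
projected = transactionsDf.select("storeId")

# count(), show(), first() and foreach() are actions: calling one runs the plan
n_rows = projected.count()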

Question 4

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

Options:

A.

array_remove(transactionsDf, "*")

B.

transactionsDf.unpersist()

(Correct)

C.

del transactionsDf

D.

transactionsDf.clearCache()

E.

transactionsDf.persist()
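Illustration (not part of the original question): a minimal caching round trip, assuming transactionsDf already exists.

# persist() caches at the default MEMORY_AND_DISK level; an action materializes the cache
transactionsDf.persist()
transactionsDf.count()

# unpersist() drops the cached blocks from memory and disk;
# blocking=True waits until they are actually removed
transactionsDf.unpersist(blocking=True)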

Question 5

Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?

Schema of first partition:

1.root

2. |-- transactionId: integer (nullable = true)

3. |-- predError: integer (nullable = true)

4. |-- value: integer (nullable = true)

5. |-- storeId: integer (nullable = true)

6. |-- productId: integer (nullable = true)

7. |-- f: integer (nullable = true)

Schema of second partition:

1.root

2. |-- transactionId: integer (nullable = true)

3. |-- predError: integer (nullable = true)

4. |-- value: integer (nullable = true)

5. |-- storeId: integer (nullable = true)

6. |-- rollId: integer (nullable = true)

7. |-- f: integer (nullable = true)

8. |-- tax_id: integer (nullable = false)

Options:

A.

spark.read.parquet(filePath, mergeSchema='y')

B.

spark.read.option("mergeSchema", "true").parquet(filePath)

C.

spark.read.parquet(filePath)

D.

1.nx = 0

2.for file in dbutils.fs.ls(filePath):

3. if not file.name.endswith(".parquet"):

4. continue

5. df_temp = spark.read.parquet(file.path)

6. if nx == 0:

7. df = df_temp

8. else:

9. df = df.union(df_temp)

10. nx = nx+1

11.df

E.

1.nx = 0

2.for file in dbutils.fs.ls(filePath):

3. if not file.name.endswith(".parquet"):

4. continue

5. df_temp = spark.read.parquet(file.path)

6. if nx == 0:

7. df = df_temp

8. else:

9. df = df.join(df_temp, how="outer")

10. nx = nx+1

11.df
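Illustration (not part of the original question): a sketch of schema merging on read, assuming filePath points at a parquet directory whose partitions have the differing schemas shown above.

# mergeSchema combines the partition schemas so every column appears exactly once
df = (spark.read
          .option("mergeSchema", "true")
          .parquet(filePath))

df.printSchema()  # union of both partition schemas, no duplicated columns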

Question 6

Which of the following statements about data skew is incorrect?

Options:

A.

Spark will not automatically optimize skew joins by default.

B.

Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.

C.

In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.

D.

To mitigate skew, Spark automatically disregards null values in keys when joining.

E.

Salting can resolve data skew.

Question 7

Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?

Options:

A.

transactionsDf.withColumnRenamed("productId", "productNumber")

B.

transactionsDf.withColumn("productId", "productNumber")

C.

transactionsDf.withColumnRenamed("productNumber", "productId")

D.

transactionsDf.withColumnRenamed(col(productId), col(productNumber))

E.

transactionsDf.withColumnRenamed(productId, productNumber)
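Illustration (not part of the original question): the rename pattern in isolation, assuming transactionsDf exists; withColumnRenamed takes the existing column name first and returns a new DataFrame.

# Returns a copy with productId renamed; transactionsDf itself is unchanged
renamed = transactionsDf.withColumnRenamed("productId", "productNumber")
renamed.printSchema()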

Question 8

Which of the following code blocks returns a single row from DataFrame transactionsDf?

Full DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.| 4| null| null| 3| 2|null|

8.| 5| null| null| null| 2|null|

9.| 6| 3| 2| 25| 2|null|

10.+-------------+---------+-----+-------+---------+----+

Options:

A.

transactionsDf.where(col("storeId").between(3,25))

B.

transactionsDf.filter((col("storeId")!=25) | (col("productId")==2))

C.

transactionsDf.filter(col("storeId")==25).select("predError","storeId").distinct()

D.

transactionsDf.select("productId", "storeId").where("storeId == 2 OR storeId != 25")

E.

transactionsDf.where(col("value").isNull()).select("productId", "storeId").distinct()

Question 9

Which of the following describes a shuffle?

Options:

A.

A shuffle is a process that is executed during a broadcast hash join.

B.

A shuffle is a process that compares data across executors.

C.

A shuffle is a process that compares data across partitions.

D.

A shuffle is a Spark operation that results from DataFrame.coalesce().

E.

A shuffle is a process that allocates partitions to executors.

Question 10

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

Options:

A.

spark.mode("parquet").read("/FileStore/imports.parquet")

B.

spark.read.path("/FileStore/imports.parquet", source="parquet")

C.

spark.read().parquet("/FileStore/imports.parquet")

D.

spark.read.parquet("/FileStore/imports.parquet")

E.

spark.read().format('parquet').open("/FileStore/imports.parquet")

Question 11

The code block shown below should set the number of partitions that Spark uses when shuffling data for joins or aggregations to 100. Choose the answer that correctly fills the blanks in the code block to accomplish this.

spark.sql.shuffle.partitions

__1__.__2__.__3__(__4__, 100)

Options:

A.

1. spark

2. conf

3. set

4. "spark.sql.shuffle.partitions"

B.

1. pyspark

2. config

3. set

4. spark.shuffle.partitions

C.

1. spark

2. conf

3. get

4. "spark.sql.shuffle.partitions"

D.

1. pyspark

2. config

3. set

4. "spark.sql.shuffle.partitions"

E.

1. spark

2. conf

3. set

4. "spark.sql.aggregate.partitions"

Question 12

The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.

Find the error.

Code block:

transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")

Options:

A.

The "outer" argument should be eliminated, since "outer" is the default join type.

B.

The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.

C.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.

D.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").

E.

The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.

Question 13

Which of the following code blocks reads all CSV files in directory filePath into a single DataFrame, with column names defined in the CSV file headers?

Content of directory filePath:

1._SUCCESS

2._committed_2754546451699747124

3._started_2754546451699747124

4.part-00000-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-298-1-c000.csv.gz

5.part-00001-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-299-1-c000.csv.gz

6.part-00002-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-300-1-c000.csv.gz

7.part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz

spark.option("header",True).csv(filePath)

Options:

A.

spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)

B.

spark.read().option("header",True).load(filePath)

C.

spark.read.format("csv").option("header",True).load(filePath)

D.

spark.read.load(filePath)
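Illustration (not part of the original question): the directory-level CSV read in isolation, assuming filePath points at the directory listed above; Spark skips the bookkeeping files, decompresses the .csv.gz parts, and takes column names from each file's header row.

# Read every CSV part file in the directory into a single DataFrame
df = (spark.read
          .format("csv")
          .option("header", True)
          .load(filePath))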

Question 14

The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.

Code block:

transactionsDf.write.partitionOn("storeId").parquet(filePath)

Options:

A.

The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.

B.

The partitionOn method should be called before the write method.

C.

The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.

D.

Column storeId should be wrapped in a col() operator.

E.

No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
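Illustration (not part of the original question): the corrected write, assuming transactionsDf and filePath exist; the mode("overwrite") call is an optional addition shown here only so the sketch can be rerun.

# partitionBy (not partitionOn) creates one output subdirectory per storeId value
(transactionsDf.write
     .partitionBy("storeId")
     .mode("overwrite")
     .parquet(filePath))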

Question 15

The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 5| null| null| null| 2|null|

5.| 6| 3| 2| 25| 2|null|

6.+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

1.+-------+-------------+---------+-----+

2.|storeId|transactionId|productId|value|

3.+-------+-------------+---------+-----+

4.| 25| 1| 1| 4|

5.| 2| 2| 2| 7|

6.| 3| 4| 2| null|

7.| null| 5| 2| null|

8.+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])

Options:

A.

The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.

B.

Instead of union, the concat method should be used, making sure to not use its default arguments.

C.

Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.

D.

Instead of the Spark context, transactionsDfMonday should be called with the union method.

E.

Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.
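Illustration (not part of the original question): the corrected merge called on the DataFrame rather than the Spark context; note that the allowMissingColumns argument of unionByName was added in Spark 3.1.

# Matches columns by name; allowMissingColumns=True fills columns that exist in
# only one of the DataFrames (such as predError and f) with nulls
merged = transactionsDfMonday.unionByName(
    transactionsDfTuesday,
    allowMissingColumns=True,
)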

Question 16

Which of the following describes characteristics of the Dataset API?

Options:

A.

The Dataset API does not support unstructured data.

B.

In Python, the Dataset API mainly resembles Pandas' DataFrame API.

C.

In Python, the Dataset API's schema is constructed via type hints.

D.

The Dataset API is available in Scala, but it is not available in Python.

E.

The Dataset API does not provide compile-time type safety.

Question 17

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

Options:

A.

Instead of avg("value"), avg(col("value")) should be used.

B.

The avg("value") should be specified as a second argument to agg() instead of being appended to it.

C.

All column names should be wrapped in col() operators.

D.

agg should be replaced by groupBy.

E.

"storeId" and "value" should be swapped.

Question 18

Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

Options:

A.

transactionsDf.sort("storeId", asc("productId"))

B.

transactionsDf.sort(col(storeId)).desc(col(productId))

C.

transactionsDf.order_by(col(storeId), desc(col(productId)))

D.

transactionsDf.sort("storeId", desc("productId"))

E.

transactionsDf.sort("storeId").sort(desc("productId"))

Question 19

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

A.

Spark uses the catalog to resolve the optimized logical plan.

B.

The catalog assigns specific resources to the optimized memory plan.

C.

The executed physical plan depends on a cost optimization from a previous stage.

D.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

E.

The catalog assigns specific resources to the physical plan.

Question 20

Which of the following statements about the differences between actions and transformations is correct?

Options:

A.

Actions are evaluated lazily, while transformations are not evaluated lazily.

B.

Actions generate RDDs, while transformations do not.

C.

Actions do not send results to the driver, while transformations do.

D.

Actions can be queued for delayed execution, while transformations can only be processed immediately.

E.

Actions can trigger Adaptive Query Execution, while transformations cannot.

Question 21

The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

Options:

A.

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

B.

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

C.

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

D.

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

E.

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question 22

Which of the following describes a difference between Spark's cluster and client execution modes?

Options:

A.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

B.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

C.

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

D.

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

E.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

Question 23

Which of the following statements about lazy evaluation is incorrect?

Options:

A.

Predicate pushdown is a feature resulting from lazy evaluation.

B.

Execution is triggered by transformations.

C.

Spark will fail a job only during execution, but not during definition.

D.

Accumulators do not change the lazy evaluation model of Spark.

E.

Lineages allow Spark to coalesce transformations into stages.

Question 24

Which of the following options describes the responsibility of the executors in Spark?

Options:

A.

The executors accept jobs from the driver, analyze those jobs, and return results to the driver.

B.

The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.

C.

The executors accept tasks from the driver, execute those tasks, and return results to the driver.

D.

The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.

E.

The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.

Question 25

The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

Options:

A.

1. withColumn

2. "transactionDateForm"

3. "MMM d (EEEE)"

4. "transactionDate"

B.

1. select

2. "transactionDate"

3. "transactionDateForm"

4. "MMM d (EEEE)"

C.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MMM d (EEEE)"

D.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MM d (EEE)"

E.

1. withColumnRenamed

2. "transactionDate"

3. "transactionDateForm"

4. "MM d (EEE)"

Question 26

Which of the following describes the difference between client and cluster execution modes?

Options:

A.

In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.

B.

In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.

C.

In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.

D.

In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.

E.

In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.

Question 27

Which of the following code blocks generally causes a great amount of network traffic?

Options:

A.

DataFrame.select()

B.

DataFrame.coalesce()

C.

DataFrame.collect()

D.

DataFrame.rdd.map()

E.

DataFrame.count()
