
Databricks Certified Associate Developer for Apache Spark 3.0 (Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0) Exam Practice Test


Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following statements about DAGs is correct?

Options:

A.

DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

B.

DAG stands for "Directing Acyclic Graph".

C.

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

D.

In contrast to transformations, DAGs are never lazily executed.

E.

DAGs can be decomposed into tasks that are executed in parallel.

Question 2

Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?

Options:

A.

1.spark.udf.register("LIMIT_FCN", to_limit)

2.spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")

(Correct)

B.

1.spark.udf.register("LIMIT_FCN", to_limit)

2.spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")

C.

1.spark.udf.register("LIMIT_FCN", to_limit)

2.spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")

D.

spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")

E.

1.spark.udf.register(to_limit, "LIMIT_FCN")

2.spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")

Question 3

Which of the following DataFrame methods is classified as a transformation?

Options:

A.

DataFrame.count()

B.

DataFrame.show()

C.

DataFrame.select()

D.

DataFrame.foreach()

E.

DataFrame.first()
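Illustration (not part of the original question): a short sketch of the transformation/action split the options refer to, assuming a DataFrame named transactionsDf already exists.

# select() is a transformation: Spark only records it in the query plan
projected = transactionsDf.select("storeId")

# count(), show(), first() and foreach() are actions: calling one runs the plan
n_rows = projected.count()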

Question 4

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

Options:

A.

array_remove(transactionsDf, "*")

B.

transactionsDf.unpersist()

(Correct)

C.

del transactionsDf

D.

transactionsDf.clearCache()

E.

transactionsDf.persist()
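Illustration (not part of the original question): a minimal caching round trip, assuming transactionsDf already exists.

# persist() caches at the default MEMORY_AND_DISK level; an action materializes the cache
transactionsDf.persist()
transactionsDf.count()

# unpersist() drops the cached blocks from memory and disk;
# blocking=True waits until they are actually removed
transactionsDf.unpersist(blocking=True)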

Question 5

Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?

Schema of first partition:

1.root

2. |-- transactionId: integer (nullable = true)

3. |-- predError: integer (nullable = true)

4. |-- value: integer (nullable = true)

5. |-- storeId: integer (nullable = true)

6. |-- productId: integer (nullable = true)

7. |-- f: integer (nullable = true)

Schema of second partition:

1.root

2. |-- transactionId: integer (nullable = true)

3. |-- predError: integer (nullable = true)

4. |-- value: integer (nullable = true)

5. |-- storeId: integer (nullable = true)

6. |-- rollId: integer (nullable = true)

7. |-- f: integer (nullable = true)

8. |-- tax_id: integer (nullable = false)

Options:

A.

spark.read.parquet(filePath, mergeSchema='y')

B.

spark.read.option("mergeSchema", "true").parquet(filePath)

C.

spark.read.parquet(filePath)

D.

1.nx = 0

2.for file in dbutils.fs.ls(filePath):

3. if not file.name.endswith(".parquet"):

4. continue

5. df_temp = spark.read.parquet(file.path)

6. if nx == 0:

7. df = df_temp

8. else:

9. df = df.union(df_temp)

10. nx = nx+1

11.df

E.

1.nx = 0

2.for file in dbutils.fs.ls(filePath):

3. if not file.name.endswith(".parquet"):

4. continue

5. df_temp = spark.read.parquet(file.path)

6. if nx == 0:

7. df = df_temp

8. else:

9. df = df.join(df_temp, how="outer")

10. nx = nx+1

11.df
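Illustration (not part of the original question): a sketch of schema merging on read, assuming filePath points at a parquet directory whose partitions have the differing schemas shown above.

# mergeSchema combines the partition schemas so every column appears exactly once
df = (spark.read
          .option("mergeSchema", "true")
          .parquet(filePath))

df.printSchema()  # union of both partition schemas, no duplicated columns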

Question 6

Which of the following statements about data skew is incorrect?

Options:

A.

Spark will not automatically optimize skew joins by default.

B.

Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.

C.

In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.

D.

To mitigate skew, Spark automatically disregards null values in keys when joining.

E.

Salting can resolve data skew.

Question 7

Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?

Options:

A.

transactionsDf.withColumnRenamed("productId", "productNumber")

B.

transactionsDf.withColumn("productId", "productNumber")

C.

transactionsDf.withColumnRenamed("productNumber", "productId")

D.

transactionsDf.withColumnRenamed(col(productId), col(productNumber))

E.

transactionsDf.withColumnRenamed(productId, productNumber)
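Illustration (not part of the original question): the rename pattern in isolation, assuming transactionsDf exists; withColumnRenamed takes the existing column name first and returns a new DataFrame.

# Returns a copy with productId renamed; transactionsDf itself is unchanged
renamed = transactionsDf.withColumnRenamed("productId", "productNumber")
renamed.printSchema()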

Question 8

Which of the following code blocks returns a single row from DataFrame transactionsDf?

Full DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.| 4| null| null| 3| 2|null|

8.| 5| null| null| null| 2|null|

9.| 6| 3| 2| 25| 2|null|

10.+-------------+---------+-----+-------+---------+----+

Options:

A.

transactionsDf.where(col("storeId").between(3,25))

B.

transactionsDf.filter((col("storeId")!=25) | (col("productId")==2))

C.

transactionsDf.filter(col("storeId")==25).select("predError","storeId").distinct()

D.

transactionsDf.select("productId", "storeId").where("storeId == 2 OR storeId != 25")

E.

transactionsDf.where(col("value").isNull()).select("productId", "storeId").distinct()

Question 9

Which of the following describes a shuffle?

Options:

A.

A shuffle is a process that is executed during a broadcast hash join.

B.

A shuffle is a process that compares data across executors.

C.

A shuffle is a process that compares data across partitions.

D.

A shuffle is a Spark operation that results from DataFrame.coalesce().

E.

A shuffle is a process that allocates partitions to executors.

Question 10

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

Options:

A.

spark.mode("parquet").read("/FileStore/imports.parquet")

B.

spark.read.path("/FileStore/imports.parquet", source="parquet")

C.

spark.read().parquet("/FileStore/imports.parquet")

D.

spark.read.parquet("/FileStore/imports.parquet")

E.

spark.read().format('parquet').open("/FileStore/imports.parquet")

Question 11

The code block shown below should set the number of partitions that Spark uses when shuffling data for joins or aggregations to 100. Choose the answer that correctly fills the blanks in the code block to accomplish this.

spark.sql.shuffle.partitions

__1__.__2__.__3__(__4__, 100)

Options:

A.

1. spark

2. conf

3. set

4. "spark.sql.shuffle.partitions"

B.

1. pyspark

2. config

3. set

4. spark.shuffle.partitions

C.

1. spark

2. conf

3. get

4. "spark.sql.shuffle.partitions"

D.

1. pyspark

2. config

3. set

4. "spark.sql.shuffle.partitions"

E.

1. spark

2. conf

3. set

4. "spark.sql.aggregate.partitions"

Question 12

The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.

Find the error.

Code block:

transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")

Options:

A.

The "outer" argument should be eliminated, since "outer" is the default join type.

B.

The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.

C.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.

D.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").

E.

The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.

Question 13

Which of the following code blocks reads all CSV files in directory filePath into a single DataFrame, with column names defined in the CSV file headers?

Content of directory filePath:

1._SUCCESS

2._committed_2754546451699747124

3._started_2754546451699747124

4.part-00000-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-298-1-c000.csv.gz

5.part-00001-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-299-1-c000.csv.gz

6.part-00002-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-300-1-c000.csv.gz

7.part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz

spark.option("header",True).csv(filePath)

Options:

A.

spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)

B.

spark.read().option("header",True).load(filePath)

C.

spark.read.format("csv").option("header",True).load(filePath)

D.

spark.read.load(filePath)
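Illustration (not part of the original question): the directory-level CSV read in isolation, assuming filePath points at the directory listed above; Spark skips the bookkeeping files, decompresses the .csv.gz parts, and takes column names from each file's header row.

# Read every CSV part file in the directory into a single DataFrame
df = (spark.read
          .format("csv")
          .option("header", True)
          .load(filePath))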

Question 14

The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.

Code block:

transactionsDf.write.partitionOn("storeId").parquet(filePath)

Options:

A.

The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.

B.

The partitionOn method should be called before the write method.

C.

The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.

D.

Column storeId should be wrapped in a col() operator.

E.

No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
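Illustration (not part of the original question): the corrected write, assuming transactionsDf and filePath exist; the mode("overwrite") call is an optional addition shown here only so the sketch can be rerun.

# partitionBy (not partitionOn) creates one output subdirectory per storeId value
(transactionsDf.write
     .partitionBy("storeId")
     .mode("overwrite")
     .parquet(filePath))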

Question 15

The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 5| null| null| null| 2|null|

5.| 6| 3| 2| 25| 2|null|

6.+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

1.+-------+-------------+---------+-----+

2.|storeId|transactionId|productId|value|

3.+-------+-------------+---------+-----+

4.| 25| 1| 1| 4|

5.| 2| 2| 2| 7|

6.| 3| 4| 2| null|

7.| null| 5| 2| null|

8.+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])

Options:

A.

The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.

B.

Instead of union, the concat method should be used, making sure to not use its default arguments.

C.

Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.

D.

Instead of the Spark context, transactionsDfMonday should be called with the union method.

E.

Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.
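Illustration (not part of the original question): the corrected merge called on the DataFrame rather than the Spark context; note that the allowMissingColumns argument of unionByName was added in Spark 3.1.

# Matches columns by name; allowMissingColumns=True fills columns that exist in
# only one of the DataFrames (such as predError and f) with nulls
merged = transactionsDfMonday.unionByName(
    transactionsDfTuesday,
    allowMissingColumns=True,
)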

Question 16

Which of the following describes characteristics of the Dataset API?

Options:

A.

The Dataset API does not support unstructured data.

B.

In Python, the Dataset API mainly resembles Pandas' DataFrame API.

C.

In Python, the Dataset API's schema is constructed via type hints.

D.

The Dataset API is available in Scala, but it is not available in Python.

E.

The Dataset API does not provide compile-time type safety.

Question 17

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

Options:

A.

Instead of avg("value"), avg(col("value")) should be used.

B.

The avg("value") should be specified as a second argument to agg() instead of being appended to it.

C.

All column names should be wrapped in col() operators.

D.

agg should be replaced by groupBy.

E.

"storeId" and "value" should be swapped.

Question 18

Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

Options:

A.

transactionsDf.sort("storeId", asc("productId"))

B.

transactionsDf.sort(col(storeId)).desc(col(productId))

C.

transactionsDf.order_by(col(storeId), desc(col(productId)))

D.

transactionsDf.sort("storeId", desc("productId"))

E.

transactionsDf.sort("storeId").sort(desc("productId"))

Question 19

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

A.

Spark uses the catalog to resolve the optimized logical plan.

B.

The catalog assigns specific resources to the optimized memory plan.

C.

The executed physical plan depends on a cost optimization from a previous stage.

D.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

E.

The catalog assigns specific resources to the physical plan.

Question 20

Which of the following statements about the differences between actions and transformations is correct?

Options:

A.

Actions are evaluated lazily, while transformations are not evaluated lazily.

B.

Actions generate RDDs, while transformations do not.

C.

Actions do not send results to the driver, while transformations do.

D.

Actions can be queued for delayed execution, while transformations can only be processed immediately.

E.

Actions can trigger Adaptive Query Execution, while transformations cannot.

Question 21

The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

Options:

A.

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

B.

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

C.

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

D.

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

E.

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question 22

Which of the following describes a difference between Spark's cluster and client execution modes?

Options:

A.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

B.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

C.

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

D.

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

E.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

Question 23

Which of the following statements about lazy evaluation is incorrect?

Options:

A.

Predicate pushdown is a feature resulting from lazy evaluation.

B.

Execution is triggered by transformations.

C.

Spark will fail a job only during execution, but not during definition.

D.

Accumulators do not change the lazy evaluation model of Spark.

E.

Lineages allow Spark to coalesce transformations into stages.

Question 24

Which of the following options describes the responsibility of the executors in Spark?

Options:

A.

The executors accept jobs from the driver, analyze those jobs, and return results to the driver.

B.

The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.

C.

The executors accept tasks from the driver, execute those tasks, and return results to the driver.

D.

The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.

E.

The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.

Question 25

The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

Options:

A.

1. withColumn

2. "transactionDateForm"

3. "MMM d (EEEE)"

4. "transactionDate"

B.

1. select

2. "transactionDate"

3. "transactionDateForm"

4. "MMM d (EEEE)"

C.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MMM d (EEEE)"

D.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MM d (EEE)"

E.

1. withColumnRenamed

2. "transactionDate"

3. "transactionDateForm"

4. "MM d (EEE)"

Question 26

Which of the following describes the difference between client and cluster execution modes?

Options:

A.

In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.

B.

In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.

C.

In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.

D.

In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.

E.

In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.

Question 27

Which of the following code blocks generally causes a great amount of network traffic?

Options:

A.

DataFrame.select()

B.

DataFrame.coalesce()

C.

DataFrame.collect()

D.

DataFrame.rdd.map()

E.

DataFrame.count()
