Question 61:
Which SQL function is used to extract the year from a timestamp column in Databricks SQL?
A) EXTRACT(YEAR FROM timestamp_column)
B) YEAR(timestamp_column)
C) GET_YEAR(timestamp_column)
D) DATEPART(year, timestamp_column)
Answer: B
Explanation:
The YEAR function is used to extract the year from a timestamp column in Databricks SQL. This function takes a date or timestamp value as input and returns the year component as an integer, making it the simplest and most direct method for year extraction in Databricks. The YEAR function is part of Databricks SQL’s extensive date and time function library designed for temporal data analysis, commonly used in queries for grouping data by year, filtering records from specific years, or calculating year-over-year metrics.
The YEAR function syntax is straightforward: YEAR(column_name) where column_name contains date or timestamp values. The function handles various input formats including DATE types, TIMESTAMP types, and string representations of dates that Databricks can implicitly convert. For example, SELECT YEAR(order_date) AS order_year FROM orders returns the year portion of each order date. The function is particularly useful in GROUP BY clauses for aggregating data annually, such as SELECT YEAR(sale_date) AS year, SUM(amount) AS total_sales FROM sales GROUP BY YEAR(sale_date) for yearly sales summaries.
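As a compact illustration, the following sketch (using a hypothetical orders table with an order_date column) counts orders per year:
SELECT YEAR(order_date) AS order_year, COUNT(*) AS order_count
FROM orders
GROUP BY YEAR(order_date)
ORDER BY order_year;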
Databricks SQL provides a family of related date extraction functions including MONTH for extracting month numbers, DAY for day of month, HOUR for hour component, MINUTE for minutes, and SECOND for seconds. These functions enable comprehensive temporal analysis without requiring complex string manipulation or date arithmetic. The YEAR function performs efficiently even on large datasets because it operates as a built-in function optimized in the query engine, and its results can be cached when used in repeated calculations.
While EXTRACT(YEAR FROM timestamp_column) is valid ANSI SQL syntax and is also accepted by Databricks SQL, YEAR(timestamp_column) is the simpler, more direct form. GET_YEAR and DATEPART are not Databricks SQL functions. The YEAR function is the recommended, idiomatic approach for extracting years from timestamp columns in Databricks SQL.
Question 62:
A data analyst needs to create a query that returns the top 10 customers by total purchase amount. Which clause should be used?
A) FETCH FIRST 10 ROWS ONLY
B) TOP 10
C) LIMIT 10
D) ROWNUM <= 10
Answer: C
Explanation:
The LIMIT clause should be used to return the top 10 customers by total purchase amount in Databricks SQL. LIMIT is the standard SQL clause in Databricks for restricting the number of rows returned by a query, placing a cap on result set size regardless of how many rows match the query conditions. When combined with an ORDER BY clause that sorts results by an aggregated metric like total purchase amount, LIMIT effectively retrieves top-N results, making it essential for creating leaderboards, identifying high-value customers, or focusing analysis on the most significant data points.
The typical query pattern for top-N analysis combines GROUP BY for aggregation, ORDER BY for sorting in descending order, and LIMIT for row restriction. For example: SELECT customer_id, SUM(purchase_amount) AS total_purchases FROM transactions GROUP BY customer_id ORDER BY total_purchases DESC LIMIT 10. This query groups transactions by customer, calculates total purchases for each, sorts customers from highest to lowest total, and returns only the top 10 results. The LIMIT clause appears at the end of the query after all filtering, grouping, and sorting operations.
LIMIT accepts numeric literals or expressions specifying the maximum number of rows to return. Databricks SQL also supports LIMIT with OFFSET for pagination, such as LIMIT 10 OFFSET 20 to skip the first 20 rows and return the next 10, useful for implementing paged result displays in dashboards or applications. The LIMIT clause executes after ORDER BY, ensuring consistent results when combined with sorting. Without ORDER BY, LIMIT returns an arbitrary subset of rows, which may vary between query executions due to parallel processing and data distribution.
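Putting these pieces together, a minimal sketch (assuming a hypothetical transactions table with customer_id and purchase_amount columns) looks like this:
SELECT customer_id, SUM(purchase_amount) AS total_purchases
FROM transactions
GROUP BY customer_id
ORDER BY total_purchases DESC
LIMIT 10;  -- add OFFSET 10, OFFSET 20, and so on to page through subsequent rows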
FETCH FIRST is ANSI SQL syntax that works in some databases but LIMIT is preferred in Databricks. TOP is SQL Server syntax. ROWNUM is Oracle syntax. The LIMIT clause is the standard, supported method for restricting result set size in Databricks SQL queries.
Question 63:
Which visualization type in Databricks SQL is most appropriate for showing the distribution of a continuous numerical variable?
A) Bar chart
B) Histogram
C) Line chart
D) Pie chart
Answer: B
Explanation:
A histogram is the most appropriate visualization type for showing the distribution of a continuous numerical variable in Databricks SQL. Histograms divide continuous data ranges into discrete intervals called bins and display the frequency or count of values falling within each bin using vertical bars. This visualization reveals data distribution patterns including central tendency, spread, skewness, and presence of outliers or multiple modes, making histograms essential tools for exploratory data analysis and understanding variable characteristics before deeper analysis.
Histograms in Databricks SQL visualizations automatically bin continuous data into appropriate intervals based on data range and variability, though analysts can customize bin sizes for specific analytical needs. The x-axis represents the continuous variable divided into bins (such as age ranges 0-10, 10-20, 20-30), while the y-axis shows the count or frequency of observations in each bin. The height of each bar indicates how many data points fall within that range, with taller bars representing more common values. Unlike bar charts where categories are discrete and independent, histogram bars are contiguous representing continuous ranges.
Histograms reveal important distribution characteristics, including whether data follows a normal distribution (bell-shaped and symmetric), exhibits skewness with a longer tail on one side, contains outliers that appear as isolated bars far from the main distribution, or shows bimodal patterns with two distinct peaks suggesting multiple underlying populations. Databricks SQL allows customizing histogram appearance, including bin width for finer or coarser granularity, color schemes for visual emphasis, and overlay options like normal distribution curves for comparison. Histograms work best for variables with sufficient data points (typically 30 or more) to show meaningful distribution patterns.
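Although histograms are normally configured in the visualization editor rather than in SQL, bin counts can also be computed directly in a query; a rough sketch (assuming a hypothetical measurements table with a numeric value column and an illustrative bin width of 10) is:
SELECT FLOOR(value / 10) * 10 AS bin_start, COUNT(*) AS frequency
FROM measurements
GROUP BY FLOOR(value / 10) * 10
ORDER BY bin_start;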
Bar charts display categorical data not continuous distributions. Line charts show trends over time. Pie charts display parts of a whole. Histograms are the specialized visualization for revealing continuous variable distributions in Databricks SQL analytics.
Question 64:
A data analyst needs to replace NULL values in a column with a default value. Which function should be used?
A) ISNULL
B) COALESCE
C) IFNULL
D) NVL
Answer: B
Explanation:
The COALESCE function should be used to replace NULL values with default values in Databricks SQL. COALESCE is a versatile SQL function that accepts multiple arguments and returns the first non-NULL value from the list, making it ideal for NULL handling, providing fallback values, and consolidating data from multiple columns. The function evaluates arguments from left to right and returns immediately upon finding the first non-NULL value, or returns NULL if all arguments are NULL, providing flexible and readable NULL value replacement logic.
COALESCE syntax is straightforward: COALESCE(column_name, default_value) where column_name is the column potentially containing NULLs and default_value is the replacement value. For example, SELECT COALESCE(phone_number, 'Not Provided') FROM customers replaces NULL phone numbers with the string 'Not Provided'. COALESCE accepts any number of arguments, enabling sophisticated fallback chains like COALESCE(primary_email, secondary_email, work_email, 'no-email@company.com') that tries multiple columns before using a final default, useful when data might be present in alternative locations.
The function works with any data type including strings, numbers, dates, and complex types, as long as all arguments are compatible types or can be implicitly converted. COALESCE is particularly valuable in calculated columns, joins where missing values could cause issues, and aggregations where NULL handling affects results. The function follows ANSI SQL standards ensuring code portability across database systems. Databricks SQL optimizes COALESCE execution, efficiently evaluating only as many arguments as necessary to find a non-NULL value.
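A short sketch of such a fallback chain (assuming a hypothetical customers table with primary_email, secondary_email, and work_email columns):
SELECT customer_id,
       COALESCE(primary_email, secondary_email, work_email, 'no-email@company.com') AS contact_email
FROM customers;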
In Databricks SQL, ISNULL takes a single argument and only tests whether a value is NULL; it does not perform replacement. IFNULL and NVL are available in Databricks SQL as two-argument alternatives, but each accepts only a single fallback value and is less portable across dialects. COALESCE is the standard, flexible function for NULL value replacement in Databricks SQL, supporting multiple fallback values.
Question 65:
Which SQL clause is used to filter results after aggregation in a GROUP BY query?
A) WHERE
B) HAVING
C) FILTER
D) QUALIFY
Answer: B
Explanation:
The HAVING clause is used to filter results after aggregation in GROUP BY queries in Databricks SQL. While the WHERE clause filters individual rows before grouping and aggregation occur, HAVING filters grouped results after aggregation functions have been calculated, enabling conditions based on aggregate values like sums, counts, averages, or other computed metrics. This distinction makes HAVING essential for analytical queries that need to identify groups meeting specific aggregate criteria, such as finding customers with total purchases exceeding thresholds or products with average ratings above certain levels.
The HAVING clause appears after the GROUP BY clause in SQL statements and before ORDER BY if present. Its conditions reference either grouped columns or aggregate functions applied during grouping. For example: SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent FROM orders GROUP BY customer_id HAVING COUNT(*) >= 5 AND SUM(amount) > 1000. This query groups orders by customer, calculates order counts and spending totals, then filters to show only customers with at least 5 orders and spending over 1000, demonstrating HAVING filtering on aggregated values.
HAVING conditions can use any aggregate function including COUNT, SUM, AVG, MIN, MAX, and more complex expressions combining multiple aggregates. The clause supports logical operators like AND, OR, and NOT for compound conditions, comparison operators for numerical and date comparisons, and even subqueries for advanced filtering. Understanding when to use WHERE versus HAVING is crucial: WHERE filters rows before grouping (reducing data volume for aggregation), while HAVING filters groups after aggregation (applying conditions to computed metrics). Using WHERE for non-aggregated conditions improves performance by reducing rows processed during aggregation.
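The following sketch (assuming a hypothetical orders table; the date and thresholds are illustrative) shows WHERE and HAVING working together:
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
FROM orders
WHERE order_date >= '2024-01-01'              -- row-level filter applied before grouping
GROUP BY customer_id
HAVING COUNT(*) >= 5 AND SUM(amount) > 1000;  -- group-level filter applied after aggregation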
WHERE filters pre-aggregation rows not post-aggregation groups. FILTER is for conditional aggregation within queries. QUALIFY is used in window function contexts. HAVING is the specific clause for filtering aggregated results in GROUP BY queries in Databricks SQL.
Question 66:
A dashboard needs to update automatically when new data arrives in the underlying tables. Which feature should be configured?
A) Manual refresh
B) Auto refresh schedule
C) Real-time streaming
D) Incremental load
Answer: B
Explanation:
Auto refresh schedule should be configured to make dashboards update automatically when new data arrives in underlying tables in Databricks SQL. Dashboard auto refresh is a built-in feature that periodically re-executes all queries in a dashboard and updates visualizations with the latest data according to a defined schedule. This capability ensures stakeholders always see current information without manual intervention, making dashboards reliable tools for monitoring business metrics, operational KPIs, or analytical insights that change as new data arrives.
Auto refresh schedules in Databricks SQL are configured at the dashboard level through the dashboard settings interface. Administrators specify refresh intervals ranging from minutes to days depending on data freshness requirements and query execution costs. Common patterns include hourly refreshes for operational dashboards monitoring real-time business activity, daily refreshes for executive dashboards summarizing yesterday’s performance, or weekly refreshes for strategic dashboards showing longer-term trends. The scheduler automatically executes queries at specified intervals, caches results, and updates dashboard displays for all users viewing the dashboard.
When configuring auto refresh, considerations include query execution time ensuring queries complete before the next scheduled refresh, computational costs as frequent refreshes consume cluster resources, data update patterns aligning refresh timing with when new data typically arrives, and cache expiration settings controlling how long results remain valid. Databricks SQL provides visibility into refresh history showing execution times, success or failure status, and any errors encountered. Dashboards can also be manually refreshed on-demand by users needing the absolute latest data between scheduled refreshes.
Manual refresh requires user action not automatic updates. Real-time streaming is for continuous data ingestion not dashboard refresh. Incremental load is a data engineering pattern not a dashboard feature. Auto refresh schedule is the specific dashboard capability for automatic, periodic updates in Databricks SQL.
Question 67:
Which SQL function converts a string to uppercase in Databricks SQL?
A) UCASE
B) UPPER
C) TO_UPPER
D) UPPERCASE
Answer: B
Explanation:
The UPPER function converts strings to uppercase in Databricks SQL. This function takes a string value as input and returns a new string with all alphabetic characters converted to their uppercase equivalents, leaving numbers, punctuation, and special characters unchanged. UPPER is commonly used for data standardization ensuring consistent case for comparison and matching, cleaning data where case variations cause duplicates, and formatting text for display purposes where uppercase presentation is desired.
The UPPER function syntax is simple: UPPER(string_expression) where string_expression can be a column reference, string literal, or any expression resulting in a string value. For example, SELECT UPPER(first_name) FROM customers converts all first names to uppercase. The function is particularly valuable in WHERE clauses for case-insensitive matching: WHERE UPPER(email) = UPPER('user@example.com') matches emails regardless of case. UPPER also helps deduplicate data where the same entity appears with different case variations, enabling SELECT DISTINCT UPPER(city) FROM locations to get unique cities regardless of capitalization.
Databricks SQL provides complementary string case functions including LOWER for converting to lowercase, INITCAP for title case with first letters capitalized, and case-insensitive comparison operators for queries not requiring explicit conversion. The UPPER function processes strings efficiently handling Unicode characters and international alphabets correctly according to Unicode case mapping rules. The function works with string columns, literal values, and results from other string functions, enabling composition like UPPER(TRIM(column_name)) for combined operations.
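Two small sketches of these patterns (assuming a hypothetical customers table with email and city columns):
-- Case-insensitive lookup, regardless of how the email was stored
SELECT * FROM customers WHERE UPPER(email) = UPPER('User@Example.com');
-- Collapse case (and whitespace) variants of the same city into one row
SELECT DISTINCT UPPER(TRIM(city)) AS city FROM customers;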
UCASE exists in some SQL dialects but UPPER is standard. TO_UPPER and UPPERCASE are not valid Databricks SQL functions. UPPER is the standard, universally supported function for uppercase string conversion in Databricks SQL.
Question 68:
A query needs to combine rows from two tables where matching rows exist in both tables. Which join type should be used?
A) LEFT JOIN
B) RIGHT JOIN
C) INNER JOIN
D) FULL OUTER JOIN
Answer: C
Explanation:
INNER JOIN should be used to combine rows from two tables where matching rows exist in both tables in Databricks SQL. An INNER JOIN returns only rows where the join condition is satisfied in both tables, excluding rows from either table that do not have matching counterparts. This join type is the most common and restrictive, producing result sets containing only records with relationships in both source tables, making it ideal for analysis requiring complete information from related entities like orders with customer details or transactions with product information.
INNER JOIN syntax follows the pattern: SELECT columns FROM table1 INNER JOIN table2 ON table1.key = table2.key. For example, SELECT orders.order_id, customers.customer_name, orders.amount FROM orders INNER JOIN customers ON orders.customer_id = customers.customer_id returns only orders that have corresponding customer records, omitting any orphaned orders without matching customers. The ON clause specifies the join condition, typically matching primary and foreign keys, though any comparable columns can be used including composite keys with multiple conditions joined by AND.
INNER JOINs offer several advantages including reducing result set size by excluding non-matching rows improving query performance, ensuring referential integrity in results by requiring relationships exist, and providing clean datasets without NULL values from missing matches. Multiple INNER JOINs can be chained to combine three or more tables: FROM orders INNER JOIN customers ON orders.customer_id = customers.customer_id INNER JOIN products ON orders.product_id = products.product_id. The order of INNER JOINs does not affect results due to their commutative property, though query optimizers may reorder joins for performance.
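A brief sketch chaining two INNER JOINs (assuming hypothetical orders, customers, and products tables related by customer_id and product_id):
SELECT o.order_id, c.customer_name, p.product_name, o.amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id;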
LEFT JOIN includes all left table rows with NULLs for non-matches. RIGHT JOIN includes all right table rows. FULL OUTER JOIN includes all rows from both tables. INNER JOIN specifically returns only rows with matches in both tables, providing the intersection of related data.
Question 69:
Which aggregate function calculates the number of non-NULL values in a column?
A) COUNT(*)
B) COUNT(column_name)
C) COUNT_NON_NULL(column_name)
D) SUM(column_name)
Answer: B
Explanation:
COUNT(column_name) calculates the number of non-NULL values in a specific column in Databricks SQL. This function variant of COUNT specifically counts rows where the specified column contains non-NULL values, excluding NULL entries from the count. This behavior differs from COUNT(*) which counts all rows regardless of NULL values, making COUNT(column_name) essential for data quality analysis, understanding column completeness, and calculating metrics where NULL values should not contribute to totals.
The distinction between COUNT(*) and COUNT(column_name) is important for accurate analysis. COUNT(*) returns the total number of rows in a table or group, including rows where all columns are NULL. COUNT(column_name) returns only the number of rows where the specified column has a non-NULL value. For example, if a customers table has 1000 rows but only 800 have email addresses, SELECT COUNT(*) returns 1000 while SELECT COUNT(email) returns 800. This difference enables calculating data completeness percentages: SELECT COUNT(email) * 100.0 / COUNT(*) AS email_completion_rate.
COUNT with column names is commonly used in data profiling to assess column population rates, in aggregations where NULL values represent missing data that should not be counted, and in conditional logic where presence of values matters. The function works with any data type including numbers, strings, dates, and complex types. COUNT can also be combined with DISTINCT for unique non-NULL value counts: COUNT(DISTINCT column_name) returns the number of unique values excluding NULLs, useful for understanding cardinality and diversity in data columns.
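A small data-completeness sketch (assuming a hypothetical customers table with an email column):
SELECT COUNT(*) AS total_rows,
       COUNT(email) AS rows_with_email,
       COUNT(DISTINCT email) AS unique_emails,
       COUNT(email) * 100.0 / COUNT(*) AS email_completion_rate
FROM customers;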
COUNT(*) counts all rows including those with all NULLs. COUNT_NON_NULL is not a valid function name. SUM aggregates values not counts them. COUNT(column_name) is the specific function for counting non-NULL values in a column in Databricks SQL.
Question 70:
A data analyst needs to create a calculated column that shows “High”, “Medium”, or “Low” based on sales amount ranges. Which SQL construct should be used?
A) IF function
B) CASE expression
C) DECODE function
D) IIF function
Answer: B
Explanation:
The CASE expression should be used to create calculated columns with multiple conditional outcomes based on ranges in Databricks SQL. CASE is SQL’s primary conditional expression that evaluates multiple conditions sequentially and returns corresponding values when conditions are met, similar to if-then-else logic in programming languages. CASE expressions enable complex categorization, bucketing continuous variables into discrete groups, and implementing business logic directly in SQL queries without requiring user-defined functions or procedural code.
CASE expressions come in two forms: simple CASE that matches a value against multiple possible values, and searched CASE that evaluates multiple conditions. For the sales categorization scenario, searched CASE is appropriate: CASE WHEN sales_amount >= 10000 THEN 'High' WHEN sales_amount >= 5000 THEN 'Medium' WHEN sales_amount >= 0 THEN 'Low' ELSE 'Unknown' END AS sales_category. This expression evaluates conditions in order, returning the first matching result, making condition ordering important with the most restrictive conditions first. The ELSE clause provides a default value when no conditions match, ensuring the expression always returns a value.
CASE expressions can appear anywhere values are used including SELECT lists for calculated columns, WHERE clauses for complex filtering, ORDER BY for custom sorting, and aggregate functions for conditional aggregation. The expressions handle any data type with conditions and results being type-compatible. Nested CASE expressions enable hierarchical categorization though complex nesting reduces readability. Databricks SQL optimizes CASE evaluation efficiently, especially when used with indexes or partition pruning, making it performant even in large datasets.
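A compact sketch of the categorization described above (assuming a hypothetical sales table with a sales_amount column; the thresholds are illustrative):
SELECT sales_amount,
       CASE
         WHEN sales_amount >= 10000 THEN 'High'
         WHEN sales_amount >= 5000  THEN 'Medium'
         WHEN sales_amount >= 0     THEN 'Low'
         ELSE 'Unknown'
       END AS sales_category
FROM sales;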
IF function exists in some SQL dialects but CASE is standard SQL. DECODE is Oracle-specific syntax. IIF is SQL Server function not standard in Databricks. CASE expression is the standard, flexible conditional construct for multi-way branching in Databricks SQL queries.
Question 71:
Which Databricks SQL feature allows analysts to create reusable SQL queries that can be shared across multiple dashboards?
A) Stored procedures
B) Query snippets
C) SQL endpoints
D) Views
Answer: B
Explanation:
Query snippets allow analysts to create reusable SQL queries that can be shared across multiple dashboards in Databricks SQL. Query snippets are saved SQL code fragments that can be inserted into new queries, promoting code reusability, maintaining consistency across analyses, and accelerating query development by eliminating repetitive typing. Snippets are particularly valuable for complex SQL patterns used frequently like date filtering logic, standard aggregations, common join patterns, or specific business calculations that appear in multiple analyses across different dashboards and reports.
Query snippets in Databricks SQL are created and managed through the snippets interface where analysts define a snippet name, write the SQL code, and optionally include parameters that get replaced when the snippet is used. Snippets are accessible workspace-wide making them available to all users with appropriate permissions, fostering collaboration and ensuring teams use standardized logic for common calculations. When creating queries, analysts insert snippets using simple syntax, and the snippet code automatically expands into the query editor where it can be used as-is or customized for specific needs.
Snippets serve multiple purposes including standardizing metrics definitions ensuring revenue, profit, or customer counts are calculated consistently across all analyses, accelerating development by providing starting templates for common query patterns, reducing errors by using tested, validated SQL logic instead of rewriting from scratch, and documenting best practices by capturing recommended approaches in reusable form. Snippets can include parameters indicated by special syntax that prompt users to provide values when inserting, enabling flexible templates that adapt to different contexts.
Stored procedures are not supported in Databricks SQL in the same way as traditional databases. SQL endpoints are compute resources not reusable queries. Views create virtual tables but query snippets provide more flexible reusability. Query snippets are the specific Databricks SQL feature for creating and sharing reusable SQL code fragments.
Question 72:
A data analyst needs to calculate the cumulative sum of sales over time. Which SQL function should be used?
A) RUNNING_TOTAL
B) CUMSUM
C) SUM with OVER
D) AGGREGATE
Answer: C
Explanation:
SUM with OVER (window function) should be used to calculate cumulative sums of sales over time in Databricks SQL. Window functions, also called analytical functions, perform calculations across sets of rows related to the current row without collapsing rows into groups like traditional aggregates. The SUM window function combined with appropriate window specifications calculates running totals, cumulative sums, and moving aggregates essential for time-series analysis, financial reporting, and trend visualization showing how metrics accumulate over time periods.
The syntax for cumulative sum using window functions is: SUM(column_name) OVER (ORDER BY date_column ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). This expression calculates the sum of column_name from the beginning of the dataset up to and including the current row when ordered by date_column. A simpler equivalent syntax is: SUM(column_name) OVER (ORDER BY date_column) which defaults to cumulative behavior. For example, SELECT date, daily_sales, SUM(daily_sales) OVER (ORDER BY date) AS cumulative_sales FROM sales calculates both daily sales and running total, showing how sales accumulate over time.
Window functions provide powerful capabilities beyond simple cumulative sums including partitioning with PARTITION BY to calculate separate running totals for different categories like per-region cumulative sales, framing with ROWS or RANGE to define moving windows for moving averages, and ordering variations to control calculation direction. The functions maintain row-level detail unlike GROUP BY aggregations, enabling queries that show both individual transactions and cumulative metrics simultaneously. Databricks SQL optimizes window function execution using efficient sorting and accumulation algorithms handling large datasets with millions of rows.
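A short sketch adding a per-region running total (assuming a hypothetical sales table with sale_date, region, and daily_sales columns):
SELECT sale_date, region, daily_sales,
       SUM(daily_sales) OVER (PARTITION BY region ORDER BY sale_date) AS cumulative_sales
FROM sales
ORDER BY region, sale_date;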
RUNNING_TOTAL and CUMSUM are not standard SQL functions. AGGREGATE is a general term not a specific function. SUM with OVER (window function syntax) is the standard SQL approach for calculating cumulative sums and running totals in Databricks SQL.
Question 73:
Which SQL clause is used to sort query results in Databricks SQL?
A) SORT BY
B) ORDER BY
C) ARRANGE BY
D) RANK BY
Answer: B
Explanation:
The ORDER BY clause is used to sort query results in Databricks SQL. ORDER BY specifies one or more columns or expressions by which result rows should be sorted, controlling the sequence in which data appears in query outputs, dashboard visualizations, and exported results. Sorting is fundamental to data analysis for identifying top or bottom performers, presenting data in logical sequences, and creating ordered datasets for cumulative calculations or sequential processing. ORDER BY appears at the end of SQL statements after SELECT, FROM, WHERE, GROUP BY, and HAVING clauses.
ORDER BY syntax allows sorting by multiple columns with different directions: ORDER BY column1 ASC, column2 DESC sorts first by column1 in ascending order, then by column2 in descending order for rows with equal column1 values. ASC (ascending) is the default sort direction when not specified, while DESC explicitly requests descending order. ORDER BY accepts column names, column numbers (positional references like ORDER BY 2 referring to the second SELECT list item), expressions like ORDER BY YEAR(date_column), and even complex calculations like ORDER BY sales_amount / quantity DESC for sorting by unit price.
ORDER BY is essential for top-N queries when combined with LIMIT: SELECT product_name, sales FROM products ORDER BY sales DESC LIMIT 10 returns the 10 highest-selling products. The clause affects query performance especially on large datasets, though Databricks optimizes sorting using distributed algorithms and leveraging data locality. Sorted results are deterministic when sort keys uniquely identify rows, but non-deterministic when multiple rows have identical sort values, potentially returning different orderings between executions. Adding secondary sort columns ensures consistent results.
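A short sketch with a secondary sort key for deterministic ordering (assuming a hypothetical products table):
SELECT product_name, sales
FROM products
ORDER BY sales DESC, product_name ASC
LIMIT 10;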
SORT BY exists in Spark-based SQL but only orders rows within each partition, so it does not guarantee a globally sorted result. ARRANGE BY and RANK BY are not valid SQL clauses. ORDER BY is the standard, universal SQL clause for sorting query results in Databricks SQL.
Question 74:
A query needs to combine results from two SELECT statements into a single result set without duplicates. Which set operator should be used?
A) UNION
B) UNION ALL
C) INTERSECT
D) MINUS
Answer: A
Explanation:
The UNION operator should be used to combine results from two SELECT statements into a single result set without duplicates in Databricks SQL. UNION takes the results of two or more SELECT queries and combines them into one result set, automatically removing duplicate rows that appear in multiple queries. This operator is valuable for consolidating data from multiple sources, combining similar data from different time periods or regions, or creating unified datasets from tables with similar but not identical structures.
UNION syntax requires the combined SELECT statements to have the same number of columns with compatible data types in corresponding positions. The column names in the final result come from the first SELECT statement. For example: SELECT customer_id, order_date FROM online_orders UNION SELECT customer_id, order_date FROM retail_orders produces a combined list of all orders from both online and retail channels with duplicates removed if the same customer_id and order_date appear in both sources. UNION performs duplicate elimination by comparing all columns, keeping only distinct rows across all combined queries.
The duplicate removal behavior of UNION involves sorting and comparison overhead making it potentially slower than UNION ALL on large datasets. UNION is appropriate when duplicate elimination is desired or necessary for correct results, such as creating unique customer lists from multiple sources, consolidating reference data from different systems, or generating distinct event listings from multiple logs. The operator supports combining more than two queries: query1 UNION query2 UNION query3, applying duplicate elimination across all combined results. Parentheses can group UNION operations with other set operators for complex combinations.
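A minimal sketch (assuming hypothetical online_orders and retail_orders tables with matching column layouts):
SELECT customer_id, order_date FROM online_orders
UNION
SELECT customer_id, order_date FROM retail_orders;
-- Replace UNION with UNION ALL to keep duplicate rows and skip the deduplication step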
UNION ALL combines results but retains duplicates. INTERSECT returns only rows appearing in both queries. MINUS returns rows in the first query not in the second. UNION is the specific operator for combining query results while eliminating duplicates in Databricks SQL.
Question 75:
Which Databricks SQL feature provides row-level access control to restrict which data users can see?
A) Table ACLs
B) Row-level security
C) Column masking
D) Dynamic views
Answer: B
Explanation:
Row-level security provides row-level access control to restrict which data users can see in Databricks SQL. Row-level security (RLS) is a fine-grained access control mechanism that filters table rows based on user identity or group membership, ensuring users only see data they are authorized to access. Unlike table-level permissions that grant or deny access to entire tables, RLS allows multiple users to query the same table while each seeing only their permitted subset of data, essential for multi-tenant applications, privacy compliance, and role-based data access in shared analytics environments.
Row-level security in Databricks SQL is implemented through row filters defined using SQL expressions in table properties. Administrators create filter functions that accept user identity as parameters and return boolean conditions determining which rows each user can access. For example, a sales table might have row-level security filtering rows based on region: users see only rows where region matches their assigned territory. The filter expressions can reference user attributes from identity providers, group memberships, or custom attributes stored in configuration tables, enabling sophisticated access patterns like hierarchical access where managers see their team’s data plus their own.
Row-level security applies automatically to all queries accessing protected tables, with filters transparently added to WHERE clauses without requiring query modifications. Users remain unaware of filtering—they simply see their authorized data subset. RLS works with all Databricks SQL features including dashboards, queries, and SQL endpoints, and integrates with Unity Catalog for centralized security governance across workspaces. Administrators monitor RLS effectiveness through query logs and audit trails showing which filters applied to each query, ensuring compliance with data access policies.
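As a rough sketch only (assuming a Unity Catalog-enabled workspace; the function, table, column, and group names are hypothetical, and the exact syntax should be verified against current Databricks documentation), a row filter might look like:
-- Members of the admins group see every row; everyone else sees only the US region
CREATE FUNCTION us_region_filter(region STRING)
  RETURN IS_ACCOUNT_GROUP_MEMBER('admins') OR region = 'US';
ALTER TABLE sales SET ROW FILTER us_region_filter ON (region);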
Table ACLs provide table-level not row-level permissions. Column masking restricts column access not rows. Dynamic views can implement row filtering but row-level security is the native, optimized feature. Row-level security is the specific Databricks SQL capability for fine-grained, row-level access control based on user identity.