DiscoverX - Lakehouse Mapping and Search
DiscoverX is a Python library developed under Databricks Labs, designed as a "Swiss-Army-knife" for Lakehouse administration. It automates tasks like inspecting and operating on a large number of Lakehouse assets, particularly through multi-table operations with SQL templates. The current version is 0.0.9, released on May 2, 2025. It is provided for exploration and is not formally supported by Databricks with Service Level Agreements (SLAs).
Common errors
-
ModuleNotFoundError: No module named 'discoverx'
cause The `dbl-discoverx` package is either not installed or, in a Databricks notebook, the Python kernel has not been restarted after installation.
fix Run `%pip install dbl-discoverx` (or `pip install dbl-discoverx` in non-Databricks environments), then call `dbutils.library.restartPython()` if you are in a Databricks notebook.
-
Py4JJavaError: An error occurred while calling o.s.sql.functions.regexp_extract. Invalid regular expression
cause This error occurs in Databricks Runtime 15.4 LTS+ when a regular expression provided in a `with_sql` command (or other Spark SQL regex functions) is considered invalid by Photon's updated regex engine.
fix Review and correct the SQL regular expression pattern so that it adheres strictly to Apache Spark's regex syntax. Test patterns in a standard Spark SQL context if unsure.
-
Performance degradation or OutOfMemoryError for operations on a very large number of tables.
cause The default concurrency (10) for multi-table operations might not be optimal for extremely large lakehouses or specific workload patterns, leading to resource contention or inefficient execution.
fix Adjust the concurrency level using the `with_concurrency()` method (e.g., `dx.from_tables(...).with_concurrency(20).with_sql(...).apply()`) to match your Databricks cluster's capacity and the nature of the operations.
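The effect of a concurrency cap can be illustrated without DiscoverX itself. The sketch below is a plain-Python analogy, not DiscoverX's implementation: `run_for_tables` and `fake_query` are hypothetical names, and the thread pool's `max_workers` merely stands in for the value passed to `with_concurrency()`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_query(table_name):
    # Hypothetical stand-in for running one SQL statement against one table.
    return (table_name, len(table_name))


def run_for_tables(table_names, task, concurrency=10):
    # At most `concurrency` tasks run at the same time; results keep input order.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(task, table_names))


tables = ["cat.sch.orders", "cat.sch.customers", "cat.sch.items"]
results = run_for_tables(tables, fake_query, concurrency=2)
```

In a model like this, lowering the cap reduces the number of tasks in flight, which is the usual direction to try when the symptom is memory pressure. Raising it is more likely to help when the cluster is underused.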
Warnings
- breaking DiscoverX is a Databricks Labs project and is provided "AS-IS" without formal Service Level Agreements (SLAs). Issues should be filed as GitHub Issues and will be reviewed as time permits.
- deprecated The `scan` command has been deprecated. Users should migrate to `intro` or `scan (experimental)` for semantic classification and other scanning functionalities.
- gotcha When installing `dbl-discoverx` within a Databricks notebook using `%pip install`, it is often necessary to restart the Python kernel (`dbutils.library.restartPython()`) for the newly installed package to be properly loaded and available.
- gotcha On Databricks Runtime 15.4 LTS and above, regular expression handling in Photon is updated to match Apache Spark behavior. Previously accepted invalid regex patterns in `with_sql` commands might now cause queries to fail.
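One way to catch grossly malformed patterns before they reach a cluster is a local syntax check. The sketch below uses Python's `re` module as a rough pre-check only: Apache Spark's regex functions follow Java regular-expression syntax, which differs from Python's in places, so a pattern that passes here is not guaranteed to be valid in Spark (and vice versa). `find_invalid_patterns` is a hypothetical helper name.

```python
import re


def find_invalid_patterns(patterns):
    # Return {pattern: error message} for patterns that fail to compile locally.
    invalid = {}
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            invalid[pattern] = str(exc)
    return invalid


# One well-formed pattern, one with an unclosed group, one with an unclosed set.
candidates = [r"(\d{3})-(\d{4})", r"(\d{3}-(\d{4})", r"[a-z"]
problems = find_invalid_patterns(candidates)
```

Patterns flagged here are worth fixing before they are embedded in a `with_sql` template. Patterns that pass should still be confirmed in a Spark SQL context.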
Install
-
pip install dbl-discoverx
-
%pip install dbl-discoverx
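To confirm that the package is importable in the current Python process (for example, after restarting the kernel), a minimal check along these lines may help. It relies only on the standard library. `is_importable` is a hypothetical helper name, and the module name `discoverx` is taken from the error message quoted in this document.

```python
import importlib.util


def is_importable(module_name):
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(module_name) is not None


if not is_importable("discoverx"):
    print("discoverx not found: install dbl-discoverx, then restart Python")
```

The check does not import the package; it only reports whether Python can locate it.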
Imports
- DX
from discoverx import DX
Quickstart
from discoverx import DX
# Initialize DiscoverX. 'locale' can be set for region-specific rules.
dx = DX(locale="US")
# Define the tables to operate on using a wildcard pattern
# Example: all tables in 'my_catalog.my_schema'
from_tables = "my_catalog.my_schema.*"
# Example: Count rows in all selected tables and display the results
# The '{full_table_name}' placeholder is automatically replaced.
table_counts = dx.from_tables(from_tables).with_sql("SELECT COUNT(*) FROM {full_table_name}").apply()
# Display the resulting DataFrame
table_counts.display()
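To make the wildcard and placeholder mechanics concrete, the following plain-Python sketch mimics them on a hard-coded list of table names. It is an illustration only, not how DiscoverX resolves tables or renders SQL internally: `expand_sql`, the sample table names, and the use of `fnmatch` for pattern matching are all assumptions.

```python
from fnmatch import fnmatchcase


def expand_sql(pattern, sql_template, known_tables):
    # Keep tables whose full name matches the wildcard pattern,
    # then fill the {full_table_name} placeholder for each match.
    matched = [name for name in known_tables if fnmatchcase(name, pattern)]
    return {name: sql_template.format(full_table_name=name) for name in matched}


known_tables = [
    "my_catalog.my_schema.orders",
    "my_catalog.my_schema.customers",
    "my_catalog.other_schema.events",
]
statements = expand_sql(
    "my_catalog.my_schema.*",
    "SELECT COUNT(*) FROM {full_table_name}",
    known_tables,
)
```

Here `statements` maps each matched table to one rendered SQL string, which mirrors the idea of one query per table in the Quickstart.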