{"id":7143,"library":"dbl-discoverx","title":"DiscoverX - Lakehouse Mapping and Search","description":"DiscoverX is a Python library developed under Databricks Labs, designed as a \"Swiss-Army-knife\" for Lakehouse administration. It automates tasks like inspecting and operating on a large number of Lakehouse assets, particularly through multi-table operations with SQL templates. The current version is 0.0.9, released on May 2, 2025. It is provided for exploration and is not formally supported by Databricks with Service Level Agreements (SLAs).","status":"active","version":"0.0.9","language":"en","source_language":"en","source_url":"https://github.com/databrickslabs/discoverx","tags":["lakehouse","databricks","data governance","data discovery","sql","automation","etl","administration","metadata"],"install":[{"cmd":"pip install dbl-discoverx","lang":"bash","label":"Standard pip install"},{"cmd":"%pip install dbl-discoverx","lang":"python","label":"Databricks notebook"}],"dependencies":[{"reason":"Core dependency for Databricks Lakehouse operations.","package":"pyspark","optional":false},{"reason":"Required for data manipulation and DataFrame operations. PyPI specifies <2.0.0,>=1.0.0.","package":"pandas","optional":false},{"reason":"Required for numerical operations. PyPI specifies <1.24.0,>=1.16.0.","package":"numpy","optional":false}],"imports":[{"symbol":"DX","correct":"from discoverx import DX"}],"quickstart":{"code":"from discoverx import DX\n\n# Initialize DiscoverX. 
'locale' can be set for region-specific rules.\ndx = DX(locale=\"US\")\n\n# Define the tables to operate on using a wildcard pattern\n# Example: all tables in 'my_catalog.my_schema'\nfrom_tables = \"my_catalog.my_schema.*\"\n\n# Example: Count rows in all selected tables and display the results\n# The '{full_table_name}' placeholder is automatically replaced.\ntable_counts = dx.from_tables(from_tables).with_sql(\"SELECT COUNT(*) FROM {full_table_name}\").apply()\n\n# Display the resulting DataFrame\ntable_counts.display()","lang":"python","description":"This quickstart demonstrates how to initialize DiscoverX, define a set of tables using a wildcard pattern, and then apply a SQL template (counting rows) concurrently across all matching tables in a Databricks environment."},"warnings":[{"fix":"Be aware of the experimental nature; do not rely on it for critical production workloads without internal support.","message":"DiscoverX is a Databricks Labs project and is provided \"AS-IS\" without formal Service Level Agreements (SLAs). Issues should be filed as GitHub Issues and will be reviewed as time permits.","severity":"breaking","affected_versions":"All versions"},{"fix":"Replace `.scan(...)` with `.intro()` for general overview or `.scan(experimental=True, ...)` for detailed scanning.","message":"The `scan` command has been deprecated. 
Users should migrate to `intro` or `scan (experimental)` for semantic classification and other scanning functionalities.","severity":"deprecated","affected_versions":"0.0.9 and earlier"},{"fix":"Always follow `%pip install dbl-discoverx` with `dbutils.library.restartPython()` in Databricks notebooks.","message":"When installing `dbl-discoverx` within a Databricks notebook using `%pip install`, it is often necessary to restart the Python kernel (`dbutils.library.restartPython()`) for the newly installed package to be properly loaded and available.","severity":"gotcha","affected_versions":"All versions in Databricks notebooks"},{"fix":"Ensure all regex patterns used in `with_sql` or similar operations are strictly valid according to Apache Spark's regular expression syntax.","message":"On Databricks Runtime 15.4 LTS and above, regular expression handling in Photon is updated to match Apache Spark behavior. Previously accepted invalid regex patterns in `with_sql` commands might now cause queries to fail.","severity":"gotcha","affected_versions":"Databricks Runtime 15.4 LTS and above"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `%pip install dbl-discoverx` (or `pip install dbl-discoverx` in non-Databricks environments) and then `dbutils.library.restartPython()` if in a Databricks notebook.","cause":"The `dbl-discoverx` package is either not installed or, in a Databricks notebook, the Python kernel has not been restarted after installation.","error":"ModuleNotFoundError: No module named 'discoverx'"},{"fix":"Review and correct the SQL regular expression pattern to adhere strictly to Apache Spark's regex syntax. 
Test patterns in a standard Spark SQL context if unsure.","cause":"On Databricks Runtime 15.4 LTS and above, Photon's updated regex handling rejects a regular expression in a `with_sql` template (or in another Spark SQL regex function call) that was previously accepted despite being invalid.","error":"Py4JJavaError: An error occurred while calling o.s.sql.functions.regexp_extract. Invalid regular expression"},{"fix":"Tune the concurrency level with the `with_concurrency()` method (e.g., `dx.from_tables(...).with_concurrency(20).with_sql(...).apply()`) to match your Databricks cluster's capacity and the nature of the operations: lower it if the cluster runs out of memory, raise it if the cluster is underused.","cause":"The default concurrency (10) for multi-table operations might not be optimal for extremely large lakehouses or specific workload patterns, leading to resource contention or inefficient execution.","error":"Performance degradation or OutOfMemoryError for operations on a very large number of tables."}]}