Technical documentation and notes based on the Unity Catalog Masterclass. This guide covers the architectural hierarchy, data governance, and security implementation within the Databricks ecosystem.
Unity Catalog (UC) is the governance layer for Databricks. It provides a centralized location to manage access control, auditing, and data discovery across multiple workspaces.
Key Concepts:
- Unified Governance: Manage data, ML models, and files in one place.
- Metastore: The top-level container for metadata in Unity Catalog.
- Cross-Workspace Access: Policies defined in UC apply to all attached workspaces.
Unity Catalog organizes data assets in a three-level namespace (catalog.schema.object) that sits beneath a metastore.
Hierarchy:
- Metastore: The root container.
- Catalog: A high-level grouping of schemas.
- Schema (Database): A grouping of tables, views, and volumes.
- Tables/Views/Volumes: The final data objects.
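The hierarchy above can be sketched in Databricks SQL. This is a minimal illustration rather than a setup taken from the course: the names `sales_catalog`, `retail`, and `orders` are hypothetical, and it assumes a metastore is already attached to the workspace and that you hold the privileges needed to create each object. Depending on how the metastore is configured, `CREATE CATALOG` may also need a managed storage location.

```sql
-- Hypothetical names throughout; assumes an attached metastore and
-- sufficient privileges to create each object.
CREATE CATALOG IF NOT EXISTS sales_catalog;
CREATE SCHEMA IF NOT EXISTS sales_catalog.retail;
CREATE TABLE IF NOT EXISTS sales_catalog.retail.orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
);

-- Objects are addressed by the three-level name catalog.schema.object.
SELECT order_id, amount
FROM sales_catalog.retail.orders;
```

The metastore never appears in the object name; it is implied by the workspace the query runs in.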
Governance Principle: UC acts as a governance layer on top of cloud object storage (Azure Data Lake Storage Gen2 in these notes). Files can therefore be governed in place, without moving them from their original cloud location.
Unity Catalog implements Role-Based Access Control (RBAC) to ensure data privacy.
- Define Once, Secure Everywhere: Security policies are set at the account level and enforced across all workspaces.
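- Granular Permissions: Access can be restricted at the catalog, schema, or table level, and down to individual rows and columns through row filters and column masks.
- Auditing: Audit logs record user actions on governed objects, such as data access and permission changes, for compliance.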
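As a rough sketch of how these permissions are expressed, Unity Catalog privileges are granted with SQL `GRANT` statements. The group `data_analysts` and the object names below are hypothetical placeholders, not names from the course.

```sql
-- Hypothetical group and object names.
-- A principal needs USE CATALOG and USE SCHEMA on the parent objects
-- before a table-level privilege takes effect.
GRANT USE CATALOG ON CATALOG sales_catalog TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA sales_catalog.retail TO `data_analysts`;
GRANT SELECT ON TABLE sales_catalog.retail.orders TO `data_analysts`;

-- Review what has been granted on the table.
SHOW GRANTS ON TABLE sales_catalog.retail.orders;

-- Remove access when it is no longer needed.
REVOKE SELECT ON TABLE sales_catalog.retail.orders FROM `data_analysts`;
```

Granting to groups rather than individual users keeps the policy stable as people join or leave a team.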
| Table Type | Metadata Management | Data Management | Deletion Behavior |
|---|---|---|---|
| Managed | Databricks | Databricks | Deleting the table deletes the data. |
| External | Databricks | User-Defined Cloud Path | Deleting the table only removes metadata. |
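The difference between the two table types shows up in the `CREATE TABLE` statement. The following is a minimal sketch with hypothetical table names. The `abfss://` URI is a placeholder to be filled in, and it assumes an external location covering that path has already been configured in Unity Catalog.

```sql
-- Managed table: no LOCATION clause, so Databricks chooses and manages
-- the storage path.
CREATE TABLE sales_catalog.retail.orders_managed (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
);

-- External table: data lives at a user-defined cloud path.
-- The URI below is a placeholder, not a real path.
CREATE TABLE sales_catalog.retail.orders_external (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
)
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>';

-- Dropping the managed table also removes its data files.
DROP TABLE sales_catalog.retail.orders_managed;

-- Dropping the external table removes only the metadata;
-- the files at LOCATION remain in cloud storage.
DROP TABLE sales_catalog.retail.orders_external;
```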
Dynamic Data Masking allows sensitive data to be redacted based on the user's permissions.
Implementation Steps:
- Create a masking function using SQL.
- Use the "is_account_group_member" function to check user roles.
- Apply the mask to a specific column using the ALTER TABLE command.
Code Example: ALTER TABLE catalog_name.schema_name.employee ALTER COLUMN salary SET MASK masking_function;
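The three implementation steps can be sketched end to end as follows. The group name `hr_admins` is hypothetical, and the `DECIMAL(10, 2)` parameter type is an assumption that must match the actual type of the `salary` column.

```sql
-- Step 1 and 2: a masking function that checks group membership.
-- 'hr_admins' is a hypothetical group; DECIMAL(10, 2) is an assumed type.
CREATE OR REPLACE FUNCTION catalog_name.schema_name.masking_function(salary DECIMAL(10, 2))
RETURNS DECIMAL(10, 2)
RETURN CASE
  WHEN is_account_group_member('hr_admins') THEN salary  -- members see the real value
  ELSE NULL                                              -- everyone else sees NULL
END;

-- Step 3: attach the mask to the column.
ALTER TABLE catalog_name.schema_name.employee
  ALTER COLUMN salary SET MASK catalog_name.schema_name.masking_function;

-- Remove the mask again if it is no longer needed.
ALTER TABLE catalog_name.schema_name.employee
  ALTER COLUMN salary DROP MASK;
```

The function is evaluated at query time for each user, so the same `SELECT` returns real salaries to members of the group and `NULL` to everyone else.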
Glossary:
- Metastore: The central repository for metadata.
- Catalog: A logical grouping of schemas.
- Schema: A collection of tables, views, and volumes within a catalog.
- RBAC: Role-Based Access Control.
- Volumes: Objects for managing non-tabular data like CSV or PDF files.
- Data Discovery: Searchable metadata to find data assets within the organization.
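To make the Volumes entry concrete, here is a small sketch of creating a managed volume and reading a file from it. The volume name `raw_files` and the file `orders.csv` are hypothetical, and the example assumes the file has already been uploaded to the volume.

```sql
-- Hypothetical catalog, schema, volume, and file names.
CREATE VOLUME IF NOT EXISTS sales_catalog.retail.raw_files;

-- Files in a volume are addressed by path:
-- /Volumes/<catalog>/<schema>/<volume>/<file>
LIST '/Volumes/sales_catalog/retail/raw_files/';

-- Read a CSV file from the volume; assumes orders.csv exists there.
SELECT *
FROM read_files(
  '/Volumes/sales_catalog/retail/raw_files/orders.csv',
  format => 'csv',
  header => true
);
```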
Key Takeaways:
- Unity Catalog is the standard governance layer for Databricks.
- The three-tier namespace (Catalog.Schema.Table) is essential for data organization.
- Managed tables are best for internal workflows; External tables are best for shared cloud storage.
- Security can be automated via SQL-based masking functions.
- Auditing and quality monitoring are built-in features of UC.
Tools and Technologies:
- Databricks
- Azure
- Azure connectors
- Azure Data Lake Storage Gen2
- Azure storage
- Azure IAM
- SQL