CAG9/Unity-Catalog-Databricks

Unity Catalog Masterclass: The Backbone of Databricks

Technical documentation and notes based on the Unity Catalog Masterclass. This guide covers the architectural hierarchy, data governance, and security implementation within the Databricks ecosystem.

1. Introduction to Unity Catalog

Unity Catalog (UC) is the governance layer for Databricks. It provides a centralized location to manage access control, auditing, and data discovery across multiple workspaces.

Key Concepts:

  • Unified Governance: Manage data, ML models, and files in one place.
  • Metastore: The top-level container for metadata in Unity Catalog.
  • Cross-Workspace Access: Policies defined in UC apply to all attached workspaces.

2. Governance and Object Hierarchy

Unity Catalog organizes data assets under a metastore and addresses them with a three-tier namespace (catalog.schema.object).

Hierarchy:

  1. Metastore: The root container.
  2. Catalog: A high-level grouping of schemas.
  3. Schema (Database): A grouping of tables, views, and volumes.
  4. Tables/Views/Volumes: The final data objects.
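The hierarchy above can be sketched in Databricks SQL. This is a minimal illustration, not the masterclass's own code: the names demo_catalog, hr, employee, and raw_files are hypothetical, and it assumes a workspace already attached to a metastore with default managed storage, plus a user who holds the privileges to create catalogs.

```sql
-- Hypothetical names throughout: demo_catalog, hr, employee, raw_files.
-- The metastore is the root container and does not appear in object names.

-- Level 1: catalog
CREATE CATALOG IF NOT EXISTS demo_catalog;

-- Level 2: schema (database) inside the catalog
CREATE SCHEMA IF NOT EXISTS demo_catalog.hr;

-- Level 3: data objects inside the schema
CREATE TABLE IF NOT EXISTS demo_catalog.hr.employee (
  id     INT,
  name   STRING,
  salary DECIMAL(10, 2)
);

-- A volume holds non-tabular files (CSV, PDF, and so on)
CREATE VOLUME IF NOT EXISTS demo_catalog.hr.raw_files;

-- Fully qualified three-level reference: catalog.schema.table
SELECT id, name FROM demo_catalog.hr.employee;

-- Alternatively, set defaults for the session and use the short name
USE CATALOG demo_catalog;
USE SCHEMA hr;
SELECT id, name FROM employee;
```

Fully qualified names keep queries unambiguous when several catalogs are visible from the same workspace.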

Governance Principle: UC acts as a governance layer on top of cloud object storage (in this guide, Azure Data Lake Storage Gen2). Files can be governed in place, without moving them from their original cloud location.

3. Security and Access Control

Unity Catalog implements Role-Based Access Control (RBAC) to ensure data privacy.

  • Define Once, Secure Everywhere: Security policies are set at the account level and enforced across all workspaces.
  • Granular Permissions: Access can be restricted at the table or column level.
  • Auditing: Audit logs record user actions and data access events for compliance.
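As a rough sketch of how these points look in SQL, the statements below grant a group read access down the hierarchy and then inspect recent audit events. The group data_analysts and the object names are hypothetical. The audit query assumes the system.access.audit system table is enabled and readable in your account, which may not be the case by default.

```sql
-- Hypothetical principal and objects: data_analysts, demo_catalog.hr.employee.
-- A reader needs USE CATALOG and USE SCHEMA on the parents, plus SELECT on the table.
GRANT USE CATALOG ON CATALOG demo_catalog TO `data_analysts`;
GRANT USE SCHEMA  ON SCHEMA  demo_catalog.hr TO `data_analysts`;
GRANT SELECT      ON TABLE   demo_catalog.hr.employee TO `data_analysts`;

-- Review who holds which privileges on the table
SHOW GRANTS ON TABLE demo_catalog.hr.employee;

-- Remove the privilege when it is no longer needed
REVOKE SELECT ON TABLE demo_catalog.hr.employee FROM `data_analysts`;

-- Assumption: audit system tables are enabled for the account.
-- Lists the most recent Unity Catalog events and who triggered them.
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
ORDER BY event_time DESC
LIMIT 20;
```

Granting to groups rather than individual users keeps the policy stable as people join or leave teams.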

4. Managed vs External Tables

Table Type | Metadata Management | Data Management | Deletion Behavior
Managed | Databricks | Databricks | Deleting the table deletes the data.
External | Databricks | User-Defined Cloud Path | Deleting the table only removes metadata.
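The difference can be illustrated with two CREATE TABLE statements. This is a sketch under assumptions: the table names are hypothetical, the abfss path is a placeholder to fill in, and the external table requires that the path is already covered by a Unity Catalog external location you are allowed to use.

```sql
-- Hypothetical names: demo_catalog.hr.employee_managed / employee_external.

-- Managed table: no LOCATION clause, so Databricks decides where the data lives.
CREATE TABLE demo_catalog.hr.employee_managed (
  id   INT,
  name STRING
);

-- External table: data lives at a user-defined cloud path.
-- Replace the placeholders; the path must fall under a registered external location.
CREATE TABLE demo_catalog.hr.employee_external (
  id   INT,
  name STRING
)
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>';

-- The Type row in the output shows MANAGED or EXTERNAL
DESCRIBE TABLE EXTENDED demo_catalog.hr.employee_external;

-- Dropping the managed table also removes its underlying data.
DROP TABLE demo_catalog.hr.employee_managed;

-- Dropping the external table removes only the metadata; files at the path remain.
DROP TABLE demo_catalog.hr.employee_external;
```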

5. Dynamic Data Masking

Dynamic Data Masking allows sensitive data to be redacted based on the user's permissions.

Implementation Steps:

  1. Create a masking function using SQL.
  2. Use the "is_account_group_member" function to check user roles.
  3. Apply the mask to a specific column using the ALTER TABLE command.

Code Example:

    ALTER TABLE catalog_name.schema_name.employee ALTER COLUMN salary SET MASK masking_function;
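The three steps can be put together as follows. This is a minimal sketch, not the masterclass's exact function: the function name salary_mask and the group hr_admins are hypothetical, and it assumes the salary column is DECIMAL(10, 2). Adjust the parameter and return types to match the real column.

```sql
-- Step 1 and 2: masking function that checks account-level group membership.
-- Hypothetical names: salary_mask, hr_admins. Assumes salary is DECIMAL(10, 2).
CREATE OR REPLACE FUNCTION catalog_name.schema_name.salary_mask(salary DECIMAL(10, 2))
RETURNS DECIMAL(10, 2)
RETURN CASE
  WHEN is_account_group_member('hr_admins') THEN salary  -- members see the real value
  ELSE NULL                                              -- everyone else sees NULL
END;

-- Step 3: attach the mask to the column
ALTER TABLE catalog_name.schema_name.employee
  ALTER COLUMN salary SET MASK catalog_name.schema_name.salary_mask;

-- Queries stay the same; the mask is evaluated per user at read time
SELECT * FROM catalog_name.schema_name.employee;

-- Remove the mask if it is no longer required
ALTER TABLE catalog_name.schema_name.employee
  ALTER COLUMN salary DROP MASK;
```

Because the mask is attached to the column rather than to a view, every query path against the table goes through the same rule.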

6. Glossary of Terms

  • Metastore: The central repository for metadata.
  • Catalog: A logical grouping of schemas.
  • Schema: A collection of tables and views.
  • RBAC: Role-Based Access Control.
  • Volumes: Objects for managing non-tabular data like CSV or PDF files.
  • Data Discovery: Searchable metadata to find data assets within the organization.

7. Key Takeaways

  • Unity Catalog is the built-in, centralized governance layer for Databricks.
  • The three-tier namespace (Catalog.Schema.Table) is essential for data organization.
  • Managed tables are best for internal workflows; External tables are best for shared cloud storage.
  • Security can be automated via SQL-based masking functions.
  • Auditing and quality monitoring are built-in features of UC.

Tools and Technologies

  • Databricks
  • Azure
  • Azure connectors
  • Azure Data Lake Gen2
  • Azure storage
  • Azure IAM
  • SQL
