Configure Lakehouse Safety for Databricks

Learn how to integrate Granica Screen with Databricks for safer analytics, ML and AI

Granica Screen is deployed as a managed Kubernetes cluster within customer cloud infrastructure on AWS and GCP, with support for Azure coming soon. This enables Granica to process customer data without data leaving the customer-managed environment. Granica manages updates, scaling, and monitoring of the deployed infrastructure.

Screen’s users are typically data infrastructure or platform engineers who interact with the Kubernetes cluster either directly or through the Granica CLI.

Specifically:

  1. The Granica cluster is configured through the Granica CLI with credentials for a storage provider and a policy configuration specifying which assets to scan.
  2. Screen automatically scans new and existing assets matching the policy configuration via batch processing and publishes reports of scan findings to a configured location in cloud storage or a Databricks table.

Databricks integration basics

Screen integrates with Databricks through the following touchpoints:

  1. Unity Catalog API (via the Databricks Go SDK) to enumerate the catalogs, schemas, and tables to scan.
  2. Databricks SQL Driver for Go to read table data for scanning.
  3. Databricks SQL Driver for Go to write detection results to a UC table.

Other integration details:

  1. Screen authenticates using a designated Granica service principal configured with permissions to read tables to be scanned, and write to the designated result location.
  2. The user-agent string is granica-screen/<version>.

The detailed customer steps for configuring the Databricks integration are included in the next section.

Granica Screen and Databricks integration architecture

Databricks integration - customer process

1. Create service principal and auth configuration

Create a service principal for Granica to access your Unity Catalog. Ensure the principal has access to all of the following:

  • Databricks SQL access
  • Workspace access
  • Allow cluster creation (needed only if creating warehouse automatically in next step)

Once the account is created, follow the instructions provided by Databricks to create a .databrickscfg file:

Example .databrickscfg file using service principal OAuth credentials:

[DEFAULT]
host = <workspace_url>
client_secret = <client_secret>
client_id = <application_id for service principal>

Note that the client_secret cannot be accessed again after it is initially created, so be sure to note it down before moving on.

Alternatively, an example .databrickscfg file using a personal access token:

[DEFAULT]
host = <workspace_url>
token = <token_value for access token>

2. Grant compute permissions

Granica uses Databricks compute to read data; we recommend configuring a dedicated SQL warehouse.

  • Option 1 (recommended): Granica automatically creates and manages a SQL warehouse.
    • Grant SQL warehouse creation permissions to our service principal (see previous step).
  • Option 2: Manually create a SQL warehouse and grant Granica access.
    • Create a SQL warehouse named granica-screen in your workspace.
    • Give our service principal access to your SQL warehouse ("can use").


For information about updating SQL permissions, including SQL warehouse permissions, see this page.

3. Grant Unity Catalog permissions

Read access

Granica requires read access to tables designated for scanning.

In order to grant these permissions, go to your Unity Catalog, select the entity you want to grant permissions for, select the Permissions tab, and click Grant. Then, select the Granica service principal and check all of the permissions you would like to grant.

For any catalog, schema, or table to be scanned, the following permissions are needed:

  • SELECT [CATALOG|SCHEMA|TABLE]
  • USE [CATALOG|SCHEMA] (for all parent catalogs/schemas)
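For example, assuming a hypothetical table my_catalog.my_schema.customers, the Unity Catalog grants to the Granica service principal would look like:

```sql
-- Hypothetical names; substitute your own catalog/schema/table and
-- the application ID of the Granica service principal.
GRANT USE CATALOG ON CATALOG my_catalog TO `<application_id>`;
GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `<application_id>`;
GRANT SELECT ON TABLE my_catalog.my_schema.customers TO `<application_id>`;
```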

Write access

In addition, if Granica Screen is configured to write reports of discovered PII to Unity Catalog, the following permissions are also needed:

  • Option 1 (recommended): Granica automatically creates a catalog to store detection reports.
    • Grant catalog creation permissions (CREATE CATALOG on the metastore, granted to the service principal's application ID), for example:
    • GRANT CREATE CATALOG ON METASTORE TO `53fb14a8-ddc7-486e-bb40-17aae59e8261`
  • Option 2: Manually create a catalog and make the Granica service principal its OWNER. The name of this catalog must be configured in the Granica policy.

4. Add .databrickscfg to Screen

On the Granica admin server, add the .databrickscfg file as a Kubernetes secret named databricks-config. An empty placeholder secret already exists, so it must be overwritten, for example by deleting and recreating it:

kubectl delete secret databricks-config
kubectl create secret generic databricks-config --from-file=/home/ec2-user/.databrickscfg

The --from-file flag populates the secret with the contents of the .databrickscfg file created in step 1.

5. Configure Granica policy

Granica’s policy configuration allows specification of how to scan entities and which entities to scan. The Screen configuration page outlines the various available parameters.

Here is the minimal policy:

standard:
  crunch-enable: false
  screen:
    enable: true

This sample policy enables a simple scan for all tables and schemas in catalogs matching the pattern b1-* but NOT b1-eu-*:

# Default
standard:
  screen:
    enabled: true
    classification-types:
      - type: PHONE_NUMBER
      - type: SSN
    # Patterns to include for scanning
    # All catalogs will be scanned if this
    # field is not present
    include:
      - catalogs: b1-*
    # Patterns to exclude for scanning
    exclude:
      - catalogs: b1-eu-*

Policies are applied using the Granica CLI:

granica policy set --auto-approve demo_policy.yml

6. Consuming results

Customers can consume results as documented here. Note that the public documentation does not yet cover specifying Databricks as a report location; this is a new feature added with this integration.
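As an illustration only, a query over a detection report table written to Unity Catalog might look like the following; the catalog, table, and column names are hypothetical, since the report schema is not specified here:

```sql
-- Hypothetical report location and schema.
SELECT table_name, column_name, detection_type, detection_count
FROM granica_screen.reports.detections
WHERE detection_type IN ('SSN', 'PHONE_NUMBER')
ORDER BY detection_count DESC;
```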

Here is an example table:

Example Databricks table

See also