Intro to Lakehouse Safety for Databricks

Learn how Granica Screen improves data safety for Databricks environments

As companies race to leverage their proprietary data to build effective analytical systems and differentiated LLM-powered applications, many struggle to ensure their data is AI-ready and safe for use. Large, data-intensive enterprises manage anywhere from 10,000 to 100,000+ tables in Databricks. These tables host vast amounts of structured and unstructured data that often contains sensitive information.

Unstructured data growth, in particular, is driven by the prevalence of Large Language Models (LLMs), as the majority of training data for LLMs is unstructured text. GenAI and other NLP-based models have increased the complexity of managing sensitive information at scale, such as Personally Identifiable Information (PII). The risks of this sensitive information being inadvertently processed and output by models and personnel have also increased.

Challenges of current approaches

Data engineers oversee how data is collected, stored, processed, and transmitted within an organization, playing a crucial role in data security and compliance. To meet these responsibilities, a data engineer must:

  1. Identify sensitive information found in source tables
  2. Appropriately classify sensitive information identified in source tables
  3. Utilize findings to apply appropriate access controls for sensitive data

Today, data engineers rely on SQL User-Defined Functions (UDFs), Spark NLP, or third-party data governance tools (e.g., Immuta) to secure their respective Databricks workspaces. Each of these tools is often incomplete on its own, so data engineers typically use a combination of them for their respective use cases - but each tool has its own merits and shortcomings.

For example, UDFs are best suited to small-scale structured datasets but require manual setup, maintenance, and internal know-how. Spark NLP is generally effective for both structured and unstructured datasets of varying sizes, but imposes fine-tuning overhead to get the models to function as intended and requires internal ML know-how. Immuta is designed for large-scale datasets and is well suited to structured data, but struggles with less well-defined types of PII and free text. Additionally, Immuta relies on existing discovery tags and classifiers found in metadata (typically provided by a tool with direct scanning functionality, such as a UDF or Spark NLP) to apply access control policies.
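To make the UDF approach concrete, here is a minimal sketch of the kind of hand-rolled pattern matching it involves. The patterns and function names are illustrative assumptions, not part of any product; real coverage requires far more patterns, which is exactly the maintenance burden described above.

```python
import re

# Illustrative (not exhaustive) regex patterns for a few common PII classes.
# Maintaining and extending this dictionary is the manual overhead that
# makes hand-rolled UDFs hard to scale.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify_pii(text):
    """Return the set of PII classes whose patterns match `text`."""
    if text is None:
        return set()
    return {label for label, pattern in PII_PATTERNS.items()
            if pattern.search(text)}
```

In a Databricks notebook this function would typically be registered as a Spark UDF (e.g., `spark.udf.register("classify_pii", ...)`) and then applied column by column in SQL - one reason the approach scales poorly beyond small structured datasets.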

Hence, data engineers need an effective, general purpose tool that is suitable for both structured and unstructured data of varying volumes, and does not require significant manual overhead or internal “know-how” to successfully accomplish core data security and compliance responsibilities.

How Screen helps

Granica Screen brings state-of-the-art table safety to Databricks Workspaces, allowing data engineers to make data ready for analytical tasks, model pre-training, and fine-tuning. As a result, data engineers can ensure all sensitive information is appropriately managed and accessible only to appropriate parties, while maintaining compliance with data protection regulations.

With native support for Databricks Tables, including managed Delta tables and external Delta tables, as well as deep Unity Catalog (UC) integration, data engineers can significantly enhance the safety of large-scale data used for analytics, pre-training, and fine-tuning.

Key highlights include:

  • State-of-the-art accuracy: Automatically discovers and classifies sensitive data in tables with high precision (few false positives) and high recall (few false negatives), providing highly accurate visibility into potential privacy vulnerabilities
  • Computationally efficient scanning: Compute-efficient scanning saves costs when handling large-scale datasets, eliminating the need for cost/benefit/risk tradeoffs.
  • Schema Sentinel: Automatically monitors and adjusts to data additions or schema changes, ensuring continuous safety.
  • Extensive language and locale support: Out-of-the-box support for 100+ languages and locales eliminates fine-tuning overhead, while supporting global operations and enabling compliance with global data protection regulations.
  • Comprehensive scan reporting: Intuitive scan reporting shows where privacy vulnerabilities exist; the findings can be leveraged to remediate issues and apply access control policies as appropriate.
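The last highlight - turning scan findings into access controls - can be sketched with Unity Catalog's governance SQL. The `findings` structure and the mask function name below are hypothetical placeholders; the generated statements use standard Databricks SQL (`ALTER TABLE ... ALTER COLUMN ... SET TAGS` and `SET MASK`).

```python
def access_control_statements(findings,
                              mask_func="governance.masks.redact_string"):
    """Translate scan findings into Unity Catalog governance SQL.

    `findings` is a hypothetical list of (table, column, pii_class)
    tuples, e.g. produced by whatever scan discovered the PII.
    `mask_func` is a placeholder name for a SQL masking function.
    """
    statements = []
    for table, column, pii_class in findings:
        # Tag the column so governance tools (e.g., Immuta) can key off it.
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN {column} "
            f"SET TAGS ('pii_class' = '{pii_class}')"
        )
        # Attach a column mask so non-privileged readers see redacted values.
        statements.append(
            f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_func}"
        )
    return statements
```

In a notebook each statement would then be executed with `spark.sql(stmt)`; this sketch only builds the SQL so the mapping from findings to policy is visible.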

Use cases

Through Screen’s table safety functionality for Databricks Workspaces, the following use cases are made possible for data engineers:

  1. Discover PII in Training and Fine-Tuning Data: Automatically identify and classify PII in training and fine-tuning datasets, preventing LLMs from inadvertently processing or leaking sensitive data.
  2. Accelerate ETL / ELT Processes: Speed up ETL / ELT workflows by automatically detecting and classifying data during extraction, enabling more efficient data handling, transformation, and storage.
  3. Ensure Regulatory Compliance: Easily scan large-scale datasets with high precision and recall to meet GDPR, CCPA, HIPAA, and PCI DSS requirements, ensuring adherence to regulatory standards.
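The first use case - keeping PII out of training and fine-tuning corpora - ultimately requires redacting sensitive spans once they are identified. A minimal sketch, assuming the scanner yields (start, end, class) character spans per document (this span format is an assumption for illustration, not a documented output format):

```python
def redact(text, spans):
    """Replace each detected (start, end, pii_class) span with a typed
    placeholder. Spans are applied right-to-left so earlier character
    offsets stay valid as the string is edited."""
    for start, end, pii_class in sorted(spans, reverse=True):
        text = text[:start] + f"[{pii_class}]" + text[end:]
    return text
```

Redacted text like `"Email [EMAIL] or call [PHONE]."` can then flow into fine-tuning pipelines without exposing the underlying values.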

See also