Configuration
Configuring Granica Screen
Once you've installed the Granica platform with Granica Screen enabled, it's time to start monitoring and protecting your data. You manage this through the Granica CLI.
1. Identify data to be protected by Granica Screen
Granica Screen supports scanning existing data stored in data lakes such as Amazon S3 and Google Cloud Storage (GCS). The first step is to identify the buckets or data of interest, which might be all buckets within your organization. You can then configure that data for scanning in the Granica policy.
Currently, the following file types are supported. Unsupported files are skipped and do not affect the scanning process.
File Type | Extensions | Scan Method | Available Now |
---|---|---|---|
Big Data | .parquet, .snappy.parquet | Structured Parsing | Yes |
Comma/tab separated | .csv, .tsv | Structured Parsing | Yes |
Text | .json, .txt, .html, etc. | Intelligent Parsing | Yes |
Email | .eml | Intelligent Parsing | Yes |
Archived/Compressed | .gz, .zip | Decompress and Parse | Yes |
Image | .jpeg, .png, .tiff | OCR | In progress, contact us |
Document | .pdf, .doc, .xlsx, .pptx | Intelligent Parsing | In progress, contact us |
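
To make the table concrete, the following is a minimal sketch of how an object key could be mapped to a scan method by its extension. It is not Granica's implementation: the `SCAN_METHODS` mapping and the `scan_method_for` function are hypothetical names, and the `"skip"` return value simply mirrors the statement that unsupported files are skipped. Types marked "In progress" are treated as unsupported here.

```python
from pathlib import PurePosixPath

# Extensions and scan methods transcribed from the table above.
# Only rows marked "Yes" under "Available Now" are included; the names
# below are illustrative and are not part of any Granica API.
SCAN_METHODS = {
    ".parquet": "Structured Parsing",  # also matches .snappy.parquet
    ".csv": "Structured Parsing",
    ".tsv": "Structured Parsing",
    ".json": "Intelligent Parsing",
    ".txt": "Intelligent Parsing",
    ".html": "Intelligent Parsing",
    ".eml": "Intelligent Parsing",
    ".gz": "Decompress and Parse",
    ".zip": "Decompress and Parse",
}


def scan_method_for(object_key: str) -> str:
    """Return the scan method for an object key, or "skip" if unsupported."""
    # Only the final extension is inspected, compared case-insensitively.
    suffix = PurePosixPath(object_key).suffix.lower()
    return SCAN_METHODS.get(suffix, "skip")
```

Under these assumptions, a key such as `logs/events.snappy.parquet` resolves to structured parsing, while an image such as `scans/photo.png` is skipped.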
2. Specify types of sensitive data to identify
Within the Granica policy, the set of sensitive data to identify can be configured.
Currently, the following types of sensitive data are supported by standard classifiers. Custom classifiers can also be specified in addition to these, and Granica is continuously adding support for additional types of sensitive data. Note: If data can be interpreted as multiple PII types, we report the most likely type.
3. Specify report format and location
After the data is scanned, Granica Screen generates a report that lists each instance of sensitive data identified. The format and location of this report can be customized within the Granica policy as follows.
Configuration | Options |
---|---|
Output format | json, csv, Parquet |
Output compression | none, gzip, snappy (Parquet only) |
Output location | An Amazon S3 or GCS location. If unspecified, a bucket is created automatically. |
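
The one constraint in this table, that snappy compression applies to Parquet output only, can be checked before a policy is submitted. The sketch below is a hypothetical pre-flight check, not part of the Granica CLI or policy schema: the function name, the parameter names, and the lowercase spelling of the option values are all assumptions.

```python
# Option values taken from the table above, normalized to lowercase.
# How the Granica policy actually spells these keys is not shown in this
# document, so treat the names here as placeholders.
ALLOWED_FORMATS = {"json", "csv", "parquet"}
ALLOWED_COMPRESSION = {"none", "gzip", "snappy"}


def validate_report_config(output_format: str, compression: str = "none") -> None:
    """Raise ValueError if the format/compression pair is not allowed."""
    fmt = output_format.lower()
    comp = compression.lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    if comp not in ALLOWED_COMPRESSION:
        raise ValueError(f"unsupported output compression: {compression}")
    # The table lists snappy as "Parquet only".
    if comp == "snappy" and fmt != "parquet":
        raise ValueError("snappy compression is only available for Parquet output")
```

Calling `validate_report_config("Parquet", "snappy")` passes silently, while `validate_report_config("csv", "snappy")` raises a `ValueError`.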
The generated report includes the following information for each instance of sensitive data:
Column | Type | Description |
---|---|---|
n | bigint | Index of result within result file |
obj_key | string | The cloud object containing this instance of sensitive data |
classification_type | string | The type of sensitive data identified |
offset | bigint | The offset location within an unstructured file |
classified_size | bigint | The length of the result within an unstructured file |
row | bigint | The row number of a result within a tabular file |
col | bigint | The column number of a result within a tabular file |
column_name | string | The column name of a result within a tabular file, when available |
data | string | The sensitive data identified (optional via policy) |
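
As an illustration of how these columns might be consumed, here is a small sketch that summarizes a CSV report by `classification_type`. It assumes the CSV output carries a header row using the column names above, which this document does not confirm. The sample rows, bucket name, and the `PHONE` label are invented for the example.

```python
import csv
import io
from collections import Counter


def count_by_type(report_csv: str) -> Counter:
    """Count findings per classification_type in a CSV report string."""
    # Assumption: the first line of the report is a header row whose
    # names match the column table above.
    reader = csv.DictReader(io.StringIO(report_csv))
    return Counter(row["classification_type"] for row in reader)


# Hypothetical report content, for illustration only.
sample = """n,obj_key,classification_type,offset,classified_size
0,s3://example-bucket/a.txt,EMAIL,12,16
1,s3://example-bucket/a.txt,PHONE,40,12
2,s3://example-bucket/b.txt,EMAIL,3,18
"""

summary = count_by_type(sample)
```

A per-type summary like this is a quick way to see which kinds of sensitive data dominate a scan before reading individual rows.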
4. Specify the redacted output format
In addition to generating a detection report, Granica Screen can directly redact sensitive data from a file and create a sanitized copy of the data at a separately configured cloud location. Appropriately redacted data can then be used in broader contexts to enable additional use cases while managing privacy risk.
A variety of redaction formats are supported, along with additional customization options.
Transformation Type | Description |
---|---|
Redaction | Removal of sensitive data without replacement, e.g. "My name is John Smith" to "My name is" |
Replacement | Replacement of sensitive data with a fixed value, e.g. [REDACTED] |
Size-preserving replacement | Replacement of sensitive data with a value of equal length, e.g. XXXXX |
Named replacement | Replacement of sensitive data with a label identifying the type of sensitive data, e.g. [EMAIL] |
Numbered replacement | Replacement of sensitive data with a label identifying each unique instance of sensitive data, e.g. [EMAIL_1] and [EMAIL_2] |
Encrypted | Replacement of sensitive data with an encrypted value, e.g. [EMAIL_encryptedemailaddress] |
Format preserving encrypted | Replacement of sensitive data with an encrypted value, preserving the original format, e.g. john@granica.ai to siek@jtiwoei.qb |
Synthetic data replacement | Replacement of sensitive data with a similar synthetic value of the same type, e.g. replacing John with Evan |
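
The simpler transformation types above can be sketched in a few lines. This is an illustrative approximation, not Granica's redaction engine: the `transform` function, the mode names, and the `(start, end, label)` finding tuples are hypothetical, findings are assumed not to overlap, and numbered replacement is interpreted here as assigning the same number to repeated occurrences of the same value. The encrypted and synthetic types are omitted because they depend on key management and generators this document does not describe.

```python
from typing import Dict, List, Tuple

# (start offset, end offset, type label); a hypothetical shape for findings.
Finding = Tuple[int, int, str]


def transform(text: str, findings: List[Finding], mode: str) -> str:
    """Apply one transformation type to every finding in text."""
    out: List[str] = []
    cursor = 0
    numbers: Dict[Tuple[str, str], int] = {}  # (label, value) -> number
    counters: Dict[str, int] = {}             # label -> highest number used
    for start, end, label in sorted(findings):
        value = text[start:end]
        out.append(text[cursor:start])
        if mode == "redaction":
            repl = ""  # surrounding whitespace is left untouched
        elif mode == "replacement":
            repl = "[REDACTED]"
        elif mode == "size_preserving":
            repl = "X" * len(value)
        elif mode == "named":
            repl = f"[{label}]"
        elif mode == "numbered":
            key = (label, value)
            if key not in numbers:
                counters[label] = counters.get(label, 0) + 1
                numbers[key] = counters[label]
            repl = f"[{label}_{numbers[key]}]"
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(repl)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Keeping the findings separate from the transformation mode means the same detection results can be rendered in several redacted forms without rescanning the source text.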
If you need further assistance with redaction formats, contact us for details.