Dataproc with Granica Sidekick

Configure your Dataproc cluster to use Granica Sidekick

## Sidekick Overview

Sidekick is a sidecar process that lets your applications talk to a Granica cluster. It provides an S3/GCS-compatible endpoint through which your applications can read and write compressed data in cloud storage.


## Prerequisites

  • Granica is deployed by following the pilot guide from the start. To SSH back into the Granica Admin Server, simply run ./ --login
  • Granica has a custom domain set up and the metadata service is ready.
  • VPC peering has been set up between the Dataproc VPC and the Granica deployment VPC.

## Google ServiceAccount

  • Determine the Google ServiceAccount used by Dataproc VM instances. Unless specifically configured, this is the default Compute Engine service account.
    • You can find the service account by running the following command: gcloud dataproc clusters describe [CLUSTER_NAME] --region=[REGION]
    • You can also find it by clicking one of the Dataproc cluster's VMs and looking at the service account in the details section.
  • Once determined, share the ServiceAccount name with Granica.
  • Granica needs it for two purposes:
    • To allow the granica-sidekick debian package to be installed from the Granica artifact repository
    • To allow granica-sidekick to read the target buckets which reside in the Granica deployment project
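As a rough sketch of the lookup described above: when no service account is configured on the cluster, the `serviceAccount` field in the `describe` output is typically empty and the VMs run as the default Compute Engine service account, which has the form `PROJECT_NUMBER-compute@developer.gserviceaccount.com`. The helper name `effective_service_account` and the variables in the commented example are illustrative assumptions, not part of any Granica tooling.

```shell
#!/usr/bin/env bash
# Sketch: resolve the service account that Dataproc VM instances run as.

# Prints the configured service account, or the default Compute Engine
# service account when none is configured.
#   $1: value of config.gceClusterConfig.serviceAccount (may be empty)
#   $2: numeric project number (not the project ID)
effective_service_account() {
  local configured="$1" project_number="$2"
  if [ -n "$configured" ]; then
    printf '%s\n' "$configured"
  else
    printf '%s-compute@developer.gserviceaccount.com\n' "$project_number"
  fi
}

# Example wiring (needs gcloud and real values, so it is left commented out):
# sa="$(gcloud dataproc clusters describe "$CLUSTER_NAME" --region="$REGION" \
#        --format='value(config.gceClusterConfig.serviceAccount)')"
# pn="$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')"
# effective_service_account "$sa" "$pn"
```

The printed value is the name to share with Granica.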

## VM Instance Image

To ensure compatibility and access to the latest Granica Sidekick features and improvements, we highly recommend building your Dataproc custom image from Dataproc image version 2.2 or later (for example, 2.2-debian12). Currently, the Dataproc console suggests only image versions 2.1 and below. However, you can create a custom image from a newer image version by using the gcloud command-line tool. Here is an example of how to create a custom image based on 2.2-debian12:

python3 \
--image-name=CUSTOM_IMAGE_NAME \
--dataproc-version=2.2-debian12 \
--zone=ZONE \

Read detailed instructions here:
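Before building an image, it can help to confirm that the chosen `--dataproc-version` value meets the 2.2 minimum recommended above. The following is a minimal sketch of such a check; the function name `image_version_at_least` is hypothetical, and the comparison relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: check that a Dataproc image version string such as "2.2-debian12"
# is at least a required version such as "2.2".
image_version_at_least() {
  local image="$1" minimum="$2"
  # Strip the OS suffix: "2.2-debian12" -> "2.2"
  local version="${image%%-*}"
  # sort -V orders version strings; the check passes when the minimum
  # sorts first (or the two are equal).
  [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n1)" = "$minimum" ]
}

if image_version_at_least "2.2-debian12" "2.2"; then
  echo "image version is new enough"
fi
```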

## Custom initialization script

The granica-sidekick package is installed and configured using a custom initialization script. Here is a sample init script; modify the environment variables to match your deployment. Upload the script to a GCS bucket and provide its path during Dataproc cluster creation.

#!/bin/bash
set -ex
# This SAMPLE initscript is used to install and setup granica-sidekick on a GCP Dataproc cluster
# Please modify the env variables below
# Set env variables needed for granica-sidekick package (sample values)
## required env variables
# AWS Region or GCP Region
# example: export GRANICA_REGION="us-west2"
export GRANICA_REGION="<region>"
# Note that the :443 is needed in the metadata URL below
# example: export GRANICA_METADATA_URL=""
export GRANICA_METADATA_URL="<metadata-endpoint>:443"
# Select cloud platform: aws/gcp
## optional env variables
# set the size of granica-sidekick block cache. default is 1GB, '0' to disable
export GRANICA_BLK_CACHE="1073741824"
# Number of CPU cores for granica-sidekick
# Max memory for granica-sidekick
# Install Google Cloud Ops Agent to collect metrics and logs
# (If the agent is already installed, please modify the config file appropriately)
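create_prometheus_config() {
  mkdir -p /etc/google-cloud-ops-agent/
  cat <<EOF_PROM_CONFIG > /etc/google-cloud-ops-agent/config.yaml
metrics:
  receivers:
    prom_application:
      type: prometheus
      config:
        scrape_configs:
          - job_name: 'granica-sidekick-exporter'
            static_configs:
              - targets: ['']
                labels:
                  role: 'worker'
            metrics_path: '/prom-metrics'
            scheme: 'http'
            scrape_interval: '30s'
            scrape_timeout: '10s'
  service:
    pipelines:
      # pipeline name is an assumption; any unique name works
      prometheus_pipeline:
        receivers: ['prom_application']
EOF_PROM_CONFIG
}
curl -sSO
bash --also-install
# Assumed default: the Ops Agent's systemd unit name
CLOUDOPS_SERVICE_NAME="${CLOUDOPS_SERVICE_NAME:-google-cloud-ops-agent}"
create_prometheus_config
systemctl daemon-reload
service "$CLOUDOPS_SERVICE_NAME" restart
if systemctl is-active --quiet "$CLOUDOPS_SERVICE_NAME"; then
  echo "$CLOUDOPS_SERVICE_NAME is running"
else
  echo "$CLOUDOPS_SERVICE_NAME is not running!"
  exit 1
fi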
#### Install granica-sidekick
# Install apt transport for artifact-registry
# (with "set -e" a failing command aborts the script before a later "$?" check
# can run, so each check wraps its command directly)
if ! curl | apt-key add -; then
  echo "Error: Adding repo-signing-key for apt-transport failed"
  exit 1
fi
if ! curl | apt-key add -; then
  echo "Error: Adding package key for apt-transport failed"
  exit 1
fi
if ! echo 'deb apt-transport-artifact-registry-stable main' | sudo tee -a /etc/apt/sources.list.d/artifact-registry.list; then
  echo "Error: Adding apt-transport repo path failed"
  exit 1
fi
apt-get update
if ! apt-get install -y apt-transport-artifact-registry; then
  echo "Error: Installing apt-transport-artifact-registry failed"
  exit 1
fi
# Add granica's repo to apt sources list
echo "deb ar+ granica main" | tee -a /etc/apt/sources.list.d/artifact-registry.list
apt-get update
# Do install
if ! apt-get install -y granica-sidekick; then
  echo "Error: Installing granica-sidekick failed"
  exit 1
fi
# Finally, override Google Cloud Storage endpoint to point to granica-sidekick
sed -i '/<\/configuration>/i \
Google Cloud Storage root URL.\
</property>' /etc/hadoop/conf/core-site.xml
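Once the init script is ready, the upload and cluster-creation step could look roughly like the sketch below. It only prints the `gcloud` command so you can review it before running it. The helper name `build_create_cmd`, the local file name, the bucket, and the cluster name are all placeholders chosen for illustration; `--image` and `--initialization-actions` are standard `gcloud dataproc clusters create` flags.

```shell
#!/usr/bin/env bash
# Sketch: print the gcloud command that creates a Dataproc cluster from a
# custom image and runs the Sidekick init script on every node.
# All argument values are placeholders supplied by the caller.
build_create_cmd() {
  local cluster="$1" region="$2" image="$3" init_uri="$4"
  echo gcloud dataproc clusters create "$cluster" \
    "--region=$region" \
    "--image=$image" \
    "--initialization-actions=$init_uri"
}

# Upload the script first (hypothetical file and bucket names):
#   gsutil cp sidekick-init.sh gs://my-bucket/sidekick-init.sh
# Then print the create command, review it, and run it:
build_create_cmd my-cluster us-west2 CUSTOM_IMAGE_NAME gs://my-bucket/sidekick-init.sh
```

Your deployment will likely need further flags, such as network or service account settings.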

## Verify the installation

Once the cluster is created/restarted, you can verify the installation by running the following command from a Jupyter notebook:

! systemctl status granica-edge.service

To see logs from the granica-sidekick service on the master node, run the following command:

! journalctl -u granica-edge.service
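If you prefer to run the same check from a shell session on a node rather than from a notebook, a small wrapper such as the following sketch may be convenient. The function name `check_unit` and the overridable `SYSTEMCTL` variable are assumptions made for this example; the unit name is the one used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: report whether a systemd unit is active and, if it is not,
# point at the journalctl command to inspect.
# SYSTEMCTL can be overridden, for example with a stub when testing.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

check_unit() {
  local unit="$1"
  if "$SYSTEMCTL" is-active --quiet "$unit"; then
    echo "$unit is running"
  else
    echo "$unit is not running; inspect: journalctl -u $unit --no-pager -n 50"
    return 1
  fi
}

# Example (run on a cluster node):
# check_unit granica-edge.service
```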

## Get metrics and logs from granica-sidekick

The init script installs the Google Cloud Ops Agent on the Dataproc VM instances and configures it to scrape metrics from granica-sidekick. You can view the metrics in the Google Cloud Console under the Monitoring section. All metrics from granica-sidekick will be prefixed with n_.
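To confirm locally that the exporter is producing metrics before looking in the console, you could list the metric names carrying the `n_` prefix. The sketch below filters Prometheus text-format output read from stdin. The function name `sidekick_metric_names` is hypothetical, and the exporter address is left as a placeholder because it depends on your deployment; `/prom-metrics` is the metrics path used in the init script.

```shell
#!/usr/bin/env bash
# Sketch: read Prometheus text exposition format on stdin and print the
# unique metric names that start with a prefix (default: "n_").
sidekick_metric_names() {
  local prefix="${1:-n_}"
  # Drop comment lines (# HELP / # TYPE), keep names with the prefix,
  # strip any {label="..."} part, then de-duplicate.
  grep -v '^#' \
    | awk -v p="$prefix" 'index($1, p) == 1 { sub(/[{].*/, "", $1); print $1 }' \
    | sort -u
}

# Example (EXPORTER_ADDR is a placeholder for the exporter's host:port):
# curl -s "http://$EXPORTER_ADDR/prom-metrics" | sidekick_metric_names
```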

The logs from granica-sidekick are available in the Google Cloud Console -> Log Explorer. Search for logs with resource.type=gce_instance and


## Troubleshooting

  • Init script logs can be found at gs://<YOUR-STAGING-BUCKET>/google-cloud-dataproc-metainfo/<CLUSTER-UUID>/init-actions/<NODE-TYPE>/stdout
  • Connectivity
    • Ensure granica-sidekick can reach the Granica metadata service. You may need to install grpcurl on the master instance first. Then run: ! grpcurl -d '{}' -insecure GRANICA_METADATA_URL:443
    • Ensure that VPC peering is set up between the Dataproc VPC and the Granica deployment VPC.
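If grpcurl is not available, a plain TCP reachability test can at least tell a network problem (peering, firewall) apart from a service-level one. The sketch below uses bash's built-in `/dev/tcp` redirection together with `timeout` from coreutils. The function names and the example endpoint are hypothetical, and a successful connection only shows that the port is reachable, not that the metadata service is healthy.

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability check for a "host:port" endpoint.

# Split "host:port" into its parts using parameter expansion.
endpoint_host() { printf '%s\n' "${1%:*}"; }
endpoint_port() { printf '%s\n' "${1##*:}"; }

# Try to open a TCP connection within 5 seconds.
check_tcp() {
  local host port
  host="$(endpoint_host "$1")"
  port="$(endpoint_port "$1")"
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port (check VPC peering and firewall rules)"
    return 1
  fi
}

# Example (hypothetical endpoint; use your own metadata endpoint):
# check_tcp "metadata.example.internal:443"
```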