Dataproc with Granica Sidekick

Configure your Dataproc cluster to use Granica Sidekick

## Sidekick Overview

Sidekick is a sidecar process that lets your applications talk to a Granica cluster. It exposes an S3/GCS-compatible endpoint through which your applications can read and write compressed data in cloud storage.

Pre-requisites

  • Granica is deployed via the setup.sh from the start pilot guide. To SSH back into the Granica Admin Server, run ./setup.sh --login
  • Granica has a custom domain set up and the metadata service is ready.
  • VPC peering has been set up between the Dataproc VPC and the Granica deployment VPC.

Google ServiceAccount

  • Determine the Google ServiceAccount used by Dataproc VM instances. Unless specifically configured, this is the default Compute Engine service account.
    • You can find the service account by running the following command: gcloud dataproc clusters describe [CLUSTER_NAME] --region=[REGION]
    • This can also be found by clicking on one of the VMs of the Dataproc cluster and looking at the service account in the details section.
  • Once determined, share the ServiceAccount name with Granica.
  • This is needed for two purposes:
    • To allow the granica-sidekick debian package to be installed from the Granica artifact repository
    • To allow granica-sidekick to read the target buckets which reside in the Granica deployment project
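
The lookup described above can be scripted. The following is a minimal sketch, not Granica tooling: the function names are hypothetical, and it assumes the cluster's service account is exposed as config.gceClusterConfig.serviceAccount in the describe output and that an empty value means the default Compute Engine service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com) is in use.

```shell
#!/bin/bash
# Hypothetical helpers for finding the service account to share with Granica.

# Print the default Compute Engine service account for a project number.
default_compute_sa() {
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}

# Print the service account set on a Dataproc cluster.
# Assumption: the output is empty when the cluster uses the default account.
cluster_service_account() {
  gcloud dataproc clusters describe "$1" --region="$2" \
    --format='value(config.gceClusterConfig.serviceAccount)'
}

# Print the effective service account, falling back to the default one.
resolve_service_account() {
  local cluster="$1" region="$2" project_number="$3" sa
  sa="$(cluster_service_account "$cluster" "$region")"
  if [ -z "$sa" ]; then
    sa="$(default_compute_sa "$project_number")"
  fi
  printf '%s\n' "$sa"
}
```

For example, `resolve_service_account CLUSTER_NAME REGION PROJECT_NUMBER` prints the name to share with Granica.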

VM Instance Image

To ensure compatibility with Granica Sidekick and pick up its latest features and improvements, we highly recommend building your Dataproc custom image on Dataproc image version 2.2 or later (for example, 2.2-debian12, which is based on Debian 12). Currently, the Dataproc console suggests only images with versions 2.1 and below. However, you can build a custom image on a newer version with Google's generate_custom_image.py script, which relies on the gcloud command-line tool. Here is an example of how to create a custom image on Dataproc 2.2:

python3 generate_custom_image.py \
--image-name=CUSTOM_IMAGE_NAME \
--dataproc-version=2.2-debian12 \
--customization-script=LOCAL_PATH \
--zone=ZONE \
--gcs-bucket=gs://BUCKET_NAME

Read detailed instructions here: https://cloud.google.com/dataproc/docs/guides/dataproc-images#generate_a_custom_image
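
Since version 2.2 is the recommended minimum, it can help to check an image version string before relying on a cluster. This is a minimal sketch with a hypothetical helper name. It assumes version strings shaped like 2.2-debian12 or 2.1.45-debian11, and that a running cluster reports its version as config.softwareConfig.imageVersion in the describe output.

```shell
#!/bin/bash
# Hypothetical helper: return 0 when a Dataproc image version is 2.2 or newer.
# Accepts strings such as "2.2-debian12" or "2.1.45-debian11".
image_version_ok() {
  local version="${1%%-*}"    # drop the OS suffix, e.g. "-debian12"
  local major minor
  major="${version%%.*}"      # text before the first dot
  minor="${version#*.}"       # text after the first dot
  minor="${minor%%.*}"        # keep only the minor component
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 2 ]; }
}

# Example (run where gcloud is configured):
#   version="$(gcloud dataproc clusters describe CLUSTER_NAME --region=REGION \
#     --format='value(config.softwareConfig.imageVersion)')"
#   image_version_ok "$version" || echo "image ${version} is older than 2.2"
```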

Custom initialization script

The granica-sidekick package is installed and configured using a custom initialization script. Here is a sample init script; modify the environment variables to match your deployment. Upload the script to a GCS bucket and provide its path during Dataproc cluster creation.

#!/bin/bash
# Exit on the first error (including a failure inside a pipeline) and trace commands
set -exo pipefail
# This SAMPLE initscript is used to install and setup granica-sidekick on a GCP Dataproc cluster
# Please modify the env variables below
# Set env variables needed for granica-sidekick package (sample values)
## required env variables
# AWS Region or GCP Region
# example: export GRANICA_REGION="us-west2"
# (quoted so the shell does not parse the angle brackets as redirections)
export GRANICA_REGION="<region>"
# Note that the :443 is needed in the metadata URL below
# example: export GRANICA_METADATA_URL="metadata.region.my.custom.domain:443"
export GRANICA_METADATA_URL="<metadata-endpoint>:443"
# Select cloud platform: aws/gcp
export GRANICA_CLOUD_PLATFORM="gcp"
## optional env variables
# set the size of granica-sidekick block cache. default is 1GB, '0' to disable
export GRANICA_BLK_CACHE="1073741824"
# CPU limit for granica-sidekick in millicores (4000 = 4 cores)
export GRANICA_CPU_LIMIT_MILLICORE="4000"
# Max memory for granica-sidekick
export GRANICA_EDGE_MEMORY_MAX="8192"
# Install Google Cloud Ops Agent to collect metrics and logs
# (If the agent is already installed, please modify the config file appropriately)
create_prometheus_config() {
mkdir -p /etc/google-cloud-ops-agent/
cat <<EOF_PROM_CONFIG > /etc/google-cloud-ops-agent/config.yaml
metrics:
  receivers:
    prom_application:
      type: prometheus
      config:
        scrape_configs:
          - job_name: 'granica-sidekick-exporter'
            static_configs:
              - targets: ['127.0.0.1:9092']
                labels:
                  role: 'worker'
            metrics_path: '/prom-metrics'
            scheme: 'http'
            scrape_interval: '30s'
            scrape_timeout: '10s'
  service:
    pipelines:
      prometheus_pipeline:
        receivers: ['prom_application']
EOF_PROM_CONFIG
}
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install
rm add-google-cloud-ops-agent-repo.sh
create_prometheus_config
systemctl daemon-reload
CLOUDOPS_SERVICE_NAME="google-cloud-ops-agent"
systemctl restart "$CLOUDOPS_SERVICE_NAME"
if systemctl is-active --quiet "$CLOUDOPS_SERVICE_NAME"; then
    echo "$CLOUDOPS_SERVICE_NAME is running"
else
    echo "$CLOUDOPS_SERVICE_NAME is not running!"
    exit 1
fi
#### Install granica-sidekick
# Install apt transport for artifact-registry
# (each check uses "if !" because "set -e" would exit before a separate "$?" test could run)
if ! curl -fsSL https://us-west2-apt.pkg.dev/doc/repo-signing-key.gpg | apt-key add -; then
    echo "Error: Adding repo-signing-key for apt-transport failed"
    exit 1
fi
if ! curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -; then
    echo "Error: Adding package key for apt-transport failed"
    exit 1
fi
if ! echo 'deb http://packages.cloud.google.com/apt apt-transport-artifact-registry-stable main' | tee -a /etc/apt/sources.list.d/artifact-registry.list; then
    echo "Error: Adding apt-transport repo path failed"
    exit 1
fi
apt-get update
if ! apt-get install -y apt-transport-artifact-registry; then
    echo "Error: Installing apt-transport-artifact-registry failed"
    exit 1
fi
# Add granica's repo to apt sources list
echo "deb ar+https://us-west2-apt.pkg.dev/projects/stone-bounty-249217 granica main" | tee -a /etc/apt/sources.list.d/artifact-registry.list
apt-get update
# Do install
if ! apt-get install -y granica-sidekick; then
    echo "Error: Installing granica-sidekick failed"
    exit 1
fi
# Finally, override Google Cloud Storage endpoint to point to granica-sidekick
sed -i '/<\/configuration>/i \
<property>\
<name>fs.gs.storage.root.url</name>\
<value>http://localhost:7078</value>\
<description>\
Google Cloud Storage root URL.\
</description>\
</property>' /etc/hadoop/conf/core-site.xml
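
With the script above saved locally, the upload and cluster creation steps can be sketched as follows. The function names, the DRY_RUN switch, and the argument order are illustrative only. The sketch assumes the gcloud CLI is authenticated, that the custom image is passed with --image, and that the script path is passed with --initialization-actions. Add any other flags your cluster needs.

```shell
#!/bin/bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Upload the init script to a bucket, then create a cluster that runs it on every node.
# Arguments: cluster name, region, custom image name or URI, local script path, bucket name.
deploy_sidekick_cluster() {
  local cluster="$1" region="$2" image="$3" script="$4" bucket="$5"
  local init_uri
  init_uri="gs://${bucket}/$(basename "$script")"

  run gcloud storage cp "$script" "$init_uri"
  run gcloud dataproc clusters create "$cluster" \
    --region="$region" \
    --image="$image" \
    --initialization-actions="$init_uri"
}
```

Running `DRY_RUN=1 deploy_sidekick_cluster CLUSTER_NAME REGION CUSTOM_IMAGE_NAME ./init-script.sh BUCKET_NAME` prints both gcloud commands without executing them, which is a cheap way to review the arguments first.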

Verify the installation

Once the cluster is created/restarted, you can verify the installation by running the following command from a Jupyter notebook:

! systemctl status granica-edge.service

To see logs from the granica-sidekick service on the master node, run the following command:

! journalctl -u granica-edge.service
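
The two checks above can be combined with a look at the Hadoop configuration that the init script edits. This is a minimal sketch to run in a shell on a cluster node. The function names are hypothetical, and it assumes the name and value elements sit on adjacent lines in core-site.xml, as the sample init script writes them.

```shell
#!/bin/bash
# Print the value of fs.gs.storage.root.url from a Hadoop core-site.xml.
# Assumption: <name> and <value> are on adjacent lines.
gcs_root_url() {
  local conf="${1:-/etc/hadoop/conf/core-site.xml}"
  grep -A1 '<name>fs.gs.storage.root.url</name>' "$conf" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Check that the Sidekick service is active and that Hadoop points at it.
verify_sidekick() {
  local conf="${1:-/etc/hadoop/conf/core-site.xml}" url
  if ! systemctl is-active --quiet granica-edge.service; then
    echo "granica-edge.service is not active" >&2
    return 1
  fi
  url="$(gcs_root_url "$conf")"
  if [ "$url" != "http://localhost:7078" ]; then
    echo "fs.gs.storage.root.url is '${url}', expected http://localhost:7078" >&2
    return 1
  fi
  echo "granica-sidekick is active and configured"
}
```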

Get metrics and logs from granica-sidekick

The init script installs the Google Cloud Ops Agent on the Dataproc VM instances and configures it to scrape metrics from granica-sidekick. You can view the metrics in the Google Cloud Console under the Monitoring section. All metrics from granica-sidekick will be prefixed with n_.
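
Before looking in Cloud Monitoring, you can confirm on a node that the exporter is serving data. The following is a minimal sketch with hypothetical function names. It assumes the exporter listens on 127.0.0.1:9092 at /prom-metrics, as in the sample Ops Agent config, and that it returns the Prometheus text format.

```shell
#!/bin/bash
# Keep only sample lines whose metric name starts with "n_".
# Comment lines such as "# HELP" and "# TYPE" are dropped.
filter_sidekick_metrics() {
  grep -E '^n_' || true   # an empty result is not treated as an error
}

# Fetch the exporter output from the local node and show the Sidekick metrics.
scrape_sidekick_metrics() {
  curl -fsS "http://127.0.0.1:9092/prom-metrics" | filter_sidekick_metrics
}
```

If `scrape_sidekick_metrics` prints nothing, the exporter is reachable but is not yet reporting any n_ metrics. A curl error instead points at the exporter itself.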

The logs from granica-sidekick are available in the Google Cloud Console under Logs Explorer. Search for logs with resource.type=gce_instance and source.name=granica-sidekick.

Troubleshooting

  • Init script logs can be found at gs://<YOUR-STAGING-BUCKET>/google-cloud-dataproc-metainfo/<CLUSTER-UUID>/init-actions/<NODE-TYPE>/stdout
  • Connectivity
    • Ensure granica-sidekick is able to reach the Granica metadata service. You may need to install grpcurl on the master instance first. Then run the following, replacing <metadata-endpoint> with your metadata host: ! grpcurl -d '{}' -insecure <metadata-endpoint>:443 grpc.health.v1.Health/Check
    • Ensure that VPC peering is set up between the Dataproc VPC and the Granica deployment VPC.
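
When grpcurl is not yet installed, a plain TCP check can separate network problems (such as missing VPC peering) from service problems. This is a minimal sketch with hypothetical function names. It assumes bash with /dev/tcp support and the coreutils timeout command, and it only proves that the port accepts connections, not that the gRPC health check passes.

```shell
#!/bin/bash
# Split a "host:port" endpoint such as the value of GRANICA_METADATA_URL.
endpoint_host() { printf '%s\n' "${1%:*}"; }
endpoint_port() { printf '%s\n' "${1##*:}"; }

# Return 0 when the argument looks like host:port with a numeric port.
has_port() {
  local pattern='^[^:]+:[0-9]+$'
  [[ "$1" =~ $pattern ]]
}

# Try to open a TCP connection to the endpoint within 5 seconds.
# Returns 2 for a malformed endpoint, non-zero when the connection fails.
check_tcp() {
  local host port
  if ! has_port "$1"; then
    echo "endpoint must be host:port, got '$1'" >&2
    return 2
  fi
  host="$(endpoint_host "$1")"
  port="$(endpoint_port "$1")"
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}
```

For example, `check_tcp "$GRANICA_METADATA_URL" && echo reachable` reports whether the node can reach the metadata endpoint. A return value of 2 usually means the :443 suffix is missing.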