Monitoring

Learn how to monitor your Granica Crunch deployment.

Your Granica Crunch dashboard provides visibility into key operational metrics, and your deployment has built-in telemetry and analytics which we use internally to provide you with proactive, white-glove support. Optionally, you can use external monitoring tools to give you deeper visibility into your deployment.

External monitoring tools are not considered part of Granica's infrastructure.

Metrics and monitors

Metric definitions

All Granica-provided metrics are prefixed with g., and each is updated every 60 seconds:

| Type | Metric | Definition |
| --- | --- | --- |
| Alertable | g.system.up | Kubernetes pod uptime. The app label carries the pod name. Monitored pods are db, quicksilver, and data-cruncher-read-replica. |
| Alertable | g.bolt.api_usage | Counts the number of API requests. The type label is GetObject, HeadObject, etc. |
| Alertable | g.bolt.logical_bytes_read | Counts the bytes returned via the Granica endpoint. |
| Alertable | g.obj.logical_bytes_written | Counts the bytes ingested by Crunch and ultimately stored in reduced form. |
| Monitored | g.obj.srcs_written | Counts the number of objects ingested by Crunch and ultimately stored in reduced form. |
| Monitored | g.integrity.metadata_verified | Counts the number of objects scanned by the integrity scanner. Resets to 0 at the end of every scan. |
| Monitored | g.integrity.data_verified | Counts the number of objects verified for data integrity by reading the stored data. Up to 32 objects are read every 12 hours. |
| Alertable | g.integrity.failures | Counts the number of objects that fail the integrity check. |

Monitor definitions

Monitor priority levels, definitions, and thresholds built from the metrics above:

| Priority | Definition | Threshold |
| --- | --- | --- |
| P0 | Granica metrics are unavailable. | For 60 mins |
| P0 | sum:g.system.up{app:quicksilver} | < 1 for 60 mins |
| P0 | sum:g.system.up{app:db} | < 1 for 60 mins |
| P0 | sum:g.system.up{app:data-cruncher-read-replica} | < 1 for 60 mins |
| P0 | sum:g.integrity.failures{critical:true}.as_count() | > 0 (instantly) |
| P1 | sum:g.system.up{app:data-cruncher-read-replica,zone:use1-azX} | < 1 for 60 mins |
| P1 | Error rate: sum:g.bolt.api_usage{err:internal*} / sum:g.bolt.api_usage{*} | > 0.01 (instantly) |
| P1 | sum:g.bolt.logical_bytes_read{*}.rollup(sum, 14400) | == 0 (instantly) |
| P2 | sum:g.obj.logical_bytes_written{*}.rollup(sum, 14400) | == 0 (instantly) |

API errors

API errors are reported by each Read Replica node using the metric g.bolt.api_usage. The err label is empty on success and carries the error code on failure.

Expected error codes:

  • Forbidden: Unauthorized reads (HTTP 403)
  • NoSuchKey/NoSuchBucket: Crunch has not crunched this bucket or object (HTTP 404)

These error codes can happen at a high rate in normal operation and are not a cause for concern.

Unexpected error codes:

In normal operation, all other errors (Internal, for example) should not occur. The client will see these as HTTP 5xx errors. Clients may not see any impact because API requests are retried internally by the Granica SDK.
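The P1 error-rate monitor above boils down to a simple ratio check. As a minimal shell sketch of the same logic, with made-up counter values standing in for samples of g.bolt.api_usage:

```shell
# Hypothetical counter values sampled from g.bolt.api_usage
# (internal errors vs. all requests); the numbers are illustrative.
internal_errors=12
total_requests=4800

# Error rate = internal / total; the monitor alerts when it exceeds 0.01 (1%).
rate=$(awk -v e="$internal_errors" -v t="$total_requests" 'BEGIN { printf "%.4f", e / t }')
echo "error rate: $rate"

# awk exits 0 when the condition holds, so the shell can branch on it.
if awk -v r="$rate" 'BEGIN { exit !(r > 0.01) }'; then
  echo "ALERT: internal error rate above 1%"
else
  echo "OK"
fi
```

With the sample numbers the rate is 0.0025, below the 1% threshold, so the check reports OK.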

Integrity errors

Granica employs a rigorous, periodic integrity scan that validates reduced objects do not have any data corruption. This is done by saving metadata about the object before it is reduced and comparing the metadata later during the integrity scan.

The metric g.integrity.failures reports the number of objects that fail the integrity check. Such objects must be recovered from a separate backup, independent of Granica, if one is available.

Read throughput

Read throughput in bytes/s can be tracked by the g.bolt.logical_bytes_read metric, and read throughput in objects/s can be tracked by the g.bolt.api_usage metric. Throughput can drop to zero in normal operation if there are no client reads.

Reduction throughput

Reduction throughput in bytes/s can be tracked by the g.obj.logical_bytes_written metric. In normal operation, reduction throughput will vary based on the amount of data pending reduction and the number of compute spot instances available.

Monitor using Datadog

Configure Granica for export

Deploy (or update) Granica with the Datadog API key. Granica will begin exporting application health metrics to Datadog.

$ cat projectn.tfvars
datadog_api_key = "<KEY>"
<snip>
$ granica update --var-file projectn.tfvars

Configure Datadog monitors

Use the Monitors section in Datadog to add a monitor for each alert. Copy and paste the queries from the monitor definitions table above.
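Monitors can also be created programmatically through the Datadog Monitors API (v1). The sketch below builds the P1 error-rate monitor from the table above; the monitor name and message are placeholders, and the API call is skipped unless credentials are present:

```shell
# Monitor payload for the P1 error-rate alert; name and message are examples.
payload=$(cat <<'EOF'
{
  "name": "Granica P1: internal API error rate",
  "type": "query alert",
  "query": "sum:g.bolt.api_usage{err:internal*} / sum:g.bolt.api_usage{*} > 0.01",
  "message": "Internal error rate above 1% on the Granica endpoint."
}
EOF
)

# Sanity-check that the payload is valid JSON before sending it.
echo "$payload" | jq -e .query >/dev/null

# Create the monitor; requires DD_API_KEY and DD_APP_KEY in the environment.
if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ]; then
  curl -s -X POST "https://api.datadoghq.com/api/v1/monitor" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APP-KEY: ${DD_APP_KEY}" \
    -d "$payload"
else
  echo "DD_API_KEY/DD_APP_KEY not set; skipping API call"
fi
```

The same pattern works for the other rows of the monitor definitions table; only the query and name change.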

Configure Datadog dashboard

Create your dashboard using the Granica metrics to visualize key operational data.

Monitor using CloudWatch

AWS CloudWatch Container Insights can collect, aggregate, and summarize metrics and logs including CPU, memory, disk, and network performance.

Identify the Granica EKS EC2 instances

SSH into your Granica Admin Server:

./setup.sh --login

Run granica ls and note the last four characters of the cluster_id:

$ granica ls
--- AWS Deployments ---
[default]
{
  "project_id": "project-n-uat-projectn-78d7",
  "cluster_id": "project-n-uat-projectn-78d7",
  "deployment_id": "project-n-uat-projectn-78d7-1d80",
  ...
}
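If jq is available on the Admin Server, the four-character suffix can be extracted programmatically. The JSON below is a trimmed copy of the sample granica ls output above:

```shell
# Trimmed deployment record, mirroring the `granica ls` output above.
deployment='{"project_id":"project-n-uat-projectn-78d7","cluster_id":"project-n-uat-projectn-78d7"}'

# Extract cluster_id, then keep its last four characters --
# this is the suffix used to find the EC2 instances in the console.
cluster_id=$(echo "$deployment" | jq -r .cluster_id)
suffix=$(printf '%s' "$cluster_id" | tail -c 4)
echo "$suffix"
```

For the sample record above this prints 78d7.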

Attach the CloudWatch IAM policy

  1. Find the Granica EC2 instances in the AWS console by searching for the last four characters of the cluster_id
  2. Click any instance to find the common IAM role in Security Details
  3. Attach the CloudWatchAgentServerPolicy to that role
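Step 3 can also be done from the AWS CLI. The role name below is a placeholder for the common role found in step 2; the snippet composes and prints the command rather than executing it:

```shell
# Placeholder: substitute the common IAM role found in step 2.
ROLE_NAME="<granica-node-role>"
POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# Print the command for review; run it once ROLE_NAME is filled in.
cmd="aws iam attach-role-policy --role-name ${ROLE_NAME} --policy-arn ${POLICY_ARN}"
echo "$cmd"
```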

Install and configure the CloudWatch agent and Fluent Bit

Run the following from your Granica Admin Server:

ClusterName=<my-cluster-name>
RegionName=<my-cluster-region>
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off' || FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml \
  | sed 's/{{cluster_name}}/'${ClusterName}'/;s/{{region_name}}/'${RegionName}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' \
  | kubectl apply -f -

Replace <my-cluster-name> with your cluster_id and <my-cluster-region> with your AWS region.

Validate that the agents are running:

$ kubectl get pods -n amazon-cloudwatch
NAME                     READY   STATUS    RESTARTS   AGE
cloudwatch-agent-5w8tj   1/1     Running   0          41s
cloudwatch-agent-vmnct   1/1     Running   0          41s
fluent-bit-gc925         1/1     Running   0          41s
fluent-bit-n2zmh         1/1     Running   0          41s

Verify CloudWatch is working

The CloudWatch log groups should be created under the /aws/containerinsights/ prefix. In the Container Insights section of the CloudWatch console you can see all performance metrics for the Granica EKS cluster.
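One way to confirm the log groups from the CLI is aws logs describe-log-groups with the /aws/containerinsights/ prefix. The snippet below parses a sample response of that shape; the group names are illustrative:

```shell
# Sample response shape from:
#   aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/
# The cluster name and group names are illustrative.
response='{"logGroups":[
  {"logGroupName":"/aws/containerinsights/project-n-uat-projectn-78d7/performance"},
  {"logGroupName":"/aws/containerinsights/project-n-uat-projectn-78d7/application"}]}'

# Count the Container Insights log groups found.
count=$(echo "$response" | jq '.logGroups | length')
echo "found $count containerinsights log groups"
```

An empty list here usually means the agent pods are not running or the IAM policy from the earlier step is missing.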
