Monitoring

Your Granica Crunch dashboard provides visibility into key operational metrics. Your deployment also has built-in telemetry and analytics, which we use internally to provide you with proactive, white-glove support. Optionally, you can use external monitoring tools for deeper visibility into your deployment.

External monitoring tools are not considered part of Granica's infrastructure.

Metrics and monitors

Metric definitions

All Granica-provided metrics have a prefix of g., and each is updated every 60 seconds:

Type	Metric	Definition
Alertable	`n.system.up`	Kubernetes pod uptime. The `app` label carries the pod name. Monitored pods are `db`, `quicksilver`, and `data-cruncher-read-replica`.
Alertable	`n.bolt.api_usage`	Counts the number of API requests. The `type` label is `GetObject`, `HeadObject`, etc.
Alertable	`n.bolt.logical_bytes_read`	Counts the bytes returned via the Granica endpoint.
Alertable	`n.obj.logical_bytes_written`	Counts the bytes ingested by Crunch and ultimately stored in reduced form.
Monitored	`n.obj.srcs_written`	Counts the number of objects ingested by Crunch and ultimately stored in reduced form.
Monitored	`n.integrity.metadata_verified`	Counts the number of objects scanned by the integrity scanner. Resets to 0 at the end of every scan.
Monitored	`n.integrity.data_verified`	Counts the number of objects that have been verified for data integrity by reading the stored data. Up to 32 objects are read every 12h.
Alertable	`n.integrity.failures`	Counts the number of objects that fail the integrity check.

Monitor definitions

Priority levels, definitions and thresholds using the provided metrics:

Priority	Definition	Threshold
P0	Granica metrics are unavailable.	For 60 mins
P0	`sum:n.system.up{app:quicksilver}`	< 1 for 60 mins
P0	`sum:n.system.up{app:db}`	< 1 for 60 mins
P0	`sum:n.system.up{app:data-cruncher-read-replica}`	< 1 for 60 mins
P0	`sum:n.integrity.failures{critical:true}.as_count()`	> 0 (instantly)
P1	`sum:n.system.up{app:data-cruncher-read-replica,zone:use1-azX}`	< 1 for 60 mins
P1	Error rate: `sum:n.bolt.api_usage{err:internal}` / `sum:n.bolt.api_usage{}`	> 0.01 (instantly)
P1	`sum:n.bolt.logical_bytes_read{*}.rollup(sum, 14400)`	== 0 (instantly)
P2	`sum:n.obj.logical_bytes_written{*}.rollup(sum, 14400)`	== 0 (instantly)
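
The P1 error-rate threshold can be checked by hand from two counts. The following is a minimal sketch, assuming you have already read the internal-error count and the total request count for the same time window. The function name and the sample numbers are illustrative and are not part of Granica's tooling.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 when errors/total is above 0.01.
# Integer arithmetic: errors/total > 0.01 is the same as errors*100 > total.
error_rate_exceeds_threshold() {
  # $1 = internal-error count, $2 = total request count
  [ "$2" -gt 0 ] || return 1   # no requests, so there is no rate to evaluate
  [ $(( $1 * 100 )) -gt "$2" ]
}

# Made-up sample: 15 internal errors out of 1000 requests (1.5%).
if error_rate_exceeds_threshold 15 1000; then
  echo "P1: internal error rate above 1%"
else
  echo "OK: internal error rate at or below 1%"
fi
```

Comparing `errors * 100` against `total` avoids floating-point arithmetic, which plain `sh` does not support.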

API errors

Each Read Replica node reports API errors through the metric g.api_usage. The err label is empty on success and carries the error code on failure.

Expected error codes:

Forbidden: Unauthorized reads (HTTP 403)
NoSuchKey/NoSuchBucket: Crunch has not crunched this bucket or object (HTTP 404)

These error codes can happen at a high rate in normal operation and are not a cause for concern.

Unexpected error codes:

No other errors (Internal, for example) should occur in normal operation. Clients see these as HTTP 5xx errors, although they may not notice any impact because the Granica SDK retries API requests internally.
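
The split between expected and unexpected codes can be expressed as a small triage helper. This is a sketch under assumptions: the function name is hypothetical, and the input is lowercased before matching because the monitor table writes the label value as `internal` while the prose writes `Internal`.

```shell
#!/bin/sh
# Hypothetical helper: maps the value of the err label to a triage category.
classify_err() {
  # Lowercase the input so "Forbidden" and "forbidden" are treated alike.
  err_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$err_lc" in
    "") echo "success" ;;
    forbidden|nosuchkey|nosuchbucket) echo "expected" ;;
    *) echo "unexpected" ;;
  esac
}

for sample in "" Forbidden NoSuchKey Internal; do
  printf '%s -> %s\n' "${sample:-(empty)}" "$(classify_err "$sample")"
done
```

Only the `unexpected` category should feed an alert. The expected codes can occur at a high rate without indicating a problem.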

Integrity errors

Granica runs a rigorous, periodic integrity scan to validate that reduced objects are free of data corruption. Metadata about each object is saved before the object is reduced and is compared against the stored object later, during the integrity scan.

The metric g.integrity.failures reports the number of objects that fail the integrity check. Such objects must be recovered from a separate backup, independent of Granica, if one is available.

Read throughput

Read throughput in bytes/s can be tracked by the g.logical_bytes_read metric, and read throughput in objects/s can be tracked by the g.api_usage metric. Throughput can go down to zero in normal operation if there are no client reads.
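
As a worked example of the bytes/s figure, the sketch below divides a byte count by the length of the interval it covers. It assumes you already have the number of bytes counted during one 60-second update interval. How your monitoring tool exposes that count is not specified here, and the helper name and sample value are made up.

```shell
#!/bin/sh
# Hypothetical helper: average throughput in bytes/s for one interval.
# $1 = bytes counted during the interval, $2 = interval length in seconds
# Integer division, so any fractional part is dropped.
bytes_per_second() {
  echo $(( $1 / $2 ))
}

# Made-up sample: 1,200,000,000 bytes read during one 60 s interval.
bytes_per_second 1200000000 60   # prints 20000000
```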

Reduction throughput

Reduction throughput in bytes/s can be tracked by the g.crunch.obj.logical_bytes_written metric. In normal operation, reduction throughput will vary based on available data pending reduction and the number of compute spot instances available.

Monitor using Datadog

Configure Granica for export

Deploy (or update) Granica with the Datadog API key. Granica will begin exporting application health metrics to Datadog.

$ cat projectn.tfvars
datadog_api_key = <KEY>
<snip>
$ granica update --var-file projectn.tfvars

./setup.sh --login

Run granica ls and note the last four characters of the cluster_id:

$ granica ls
--- AWS Deployments ---
[default]
{
  "project_id": "project-n-uat-projectn-78d7",
  "cluster_id": "project-n-uat-projectn-78d7",
  "deployment_id": "project-n-uat-projectn-78d7-1d80",
  ...
}
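
If you prefer to extract the suffix in the shell rather than by eye, the following sketch uses standard parameter expansion. The cluster_id value is copied from the sample output above; substitute your own.

```shell
#!/bin/sh
# Sample value from the output above; replace with your own cluster_id.
cluster_id="project-n-uat-projectn-78d7"

# ${cluster_id%????} drops the last four characters. Removing that prefix
# from the full string leaves only the last four characters.
suffix=${cluster_id#"${cluster_id%????}"}
echo "$suffix"   # prints 78d7
```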

Attach the CloudWatch IAM policy

Find the Granica EC2 instances in the AWS console by searching for the last four characters of the cluster_id
Click any instance to find the common IAM role in Security Details
Attach the CloudWatchAgentServerPolicy to that role
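
The policy can also be attached from the AWS CLI instead of the console. The sketch below only prints the command so that you can review it first. The role name is a placeholder for the role you found in the instance's security details, and the helper name is hypothetical.

```shell
#!/bin/sh
# AWS managed policy ARN for CloudWatchAgentServerPolicy.
POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# Hypothetical helper: prints the attach command for a given role name.
attach_policy_cmd() {
  echo "aws iam attach-role-policy --role-name $1 --policy-arn $POLICY_ARN"
}

# "my-granica-node-role" is a placeholder, not a real role name.
attach_policy_cmd "my-granica-node-role"
```

Run the printed command with credentials that are allowed to modify IAM roles.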

Install and configure the CloudWatch agent and Fluent Bit

Run the following from your Granica Admin Server:

ClusterName=<my-cluster-name>
RegionName=<my-cluster-region>
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off' || FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${ClusterName}'/;s/{{region_name}}/'${RegionName}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' | kubectl apply -f -

Replace <my-cluster-name> with your cluster_id and <my-cluster-region> with your location (the AWS region where the cluster runs).

Validate that the agents are running:

$ kubectl get pods -n amazon-cloudwatch
NAME                     READY   STATUS    RESTARTS   AGE
cloudwatch-agent-5w8tj   1/1     Running   0          41s
cloudwatch-agent-vmnct   1/1     Running   0          41s
fluent-bit-gc925         1/1     Running   0          41s
fluent-bit-n2zmh         1/1     Running   0          41s
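
To check the same listing in a script, a small filter can print any pod that is not in the Running state. This is a sketch: the helper name is hypothetical, and the failing row in the demonstration is made-up sample data.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every pod whose STATUS column is not "Running".
pods_not_running() {
  awk '$3 != "Running" { print $1 }'
}

# Intended use (requires cluster access):
#   kubectl get pods -n amazon-cloudwatch --no-headers | pods_not_running

# Demonstration with sample rows; the second row is invented.
printf '%s\n' \
  'cloudwatch-agent-5w8tj   1/1   Running            0   41s' \
  'fluent-bit-gc925         0/1   CrashLoopBackOff   3   41s' | pods_not_running
```

Empty output means every listed pod is Running.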

Verify CloudWatch is working

CloudWatch log groups should now exist under the /aws/containerinsights/ prefix. The Container Insights section of the CloudWatch console shows the performance metrics for the Granica EKS cluster.
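
The sketch below prints the log group names that the Container Insights quickstart (CloudWatch agent plus Fluent Bit) normally creates for a cluster, so that you can compare them with what exists in your account. The helper name is hypothetical, and the cluster name is the placeholder used earlier.

```shell
#!/bin/sh
# Hypothetical helper: prints the log groups usually created for a cluster.
# $1 = cluster name as passed to the quickstart manifest
expected_log_groups() {
  for group in application dataplane host performance; do
    echo "/aws/containerinsights/$1/$group"
  done
}

# Placeholder: substitute your own cluster_id.
expected_log_groups "<my-cluster-name>"

# To list what actually exists (requires AWS credentials):
#   aws logs describe-log-groups \
#     --log-group-name-prefix "/aws/containerinsights/<my-cluster-name>"
```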