Monitoring
Learn how to monitor your Granica Crunch deployment.
Your Granica Crunch dashboard provides visibility into key operational metrics, and your deployment has built-in telemetry and analytics which we use internally to provide you with proactive, white-glove support. Optionally, you can use external monitoring tools to give you deeper visibility into your deployment.
External monitoring tools are not considered part of Granica's infrastructure.
Metrics and monitors
Metric definitions
All Granica-provided metrics have a prefix of g., and each is updated every 60 seconds:
| Type | Metric | Definition |
|---|---|---|
| Alertable | n.system.up | Kubernetes pod uptime. The app label carries the pod name. Monitored pods are db, quicksilver, and data-cruncher-read-replica. |
| Alertable | n.bolt.api_usage | Counts the number of API requests. The type label is GetObject, HeadObject, etc. |
| Alertable | n.bolt.logical_bytes_read | Counts the bytes returned via the Granica endpoint. |
| Alertable | n.obj.logical_bytes_written | Counts the bytes ingested by Crunch and ultimately stored in reduced form. |
| Monitored | n.obj.srcs_written | Counts the number of objects ingested by Crunch and ultimately stored in reduced form. |
| Monitored | n.integrity.metadata_verified | Counts the number of objects scanned by the integrity scanner. Resets to 0 at the end of every scan. |
| Monitored | n.integrity.data_verified | Counts the number of objects that have been verified for data integrity by reading the stored data. Up to 32 objects are read every 12h. |
| Alertable | n.integrity.failures | Counts the number of objects that fail the integrity check. |
Monitor definitions
Priority levels, definitions and thresholds using the provided metrics:
| Priority | Definition | Threshold |
|---|---|---|
| P0 | Granica metrics are unavailable. | For 60 mins |
| P0 | sum:n.system.up{app:quicksilver} | < 1 for 60 mins |
| P0 | sum:n.system.up{app:db} | < 1 for 60 mins |
| P0 | sum:n.system.up{app:data-cruncher-read-replica} | < 1 for 60 mins |
| P0 | sum:n.integrity.failures{critical:true}.as_count() | > 0 (instantly) |
| P1 | sum:n.system.up{app:data-cruncher-read-replica,zone:use1-azX} | < 1 for 60 mins |
| P1 | Error rate: sum:n.bolt.api_usage{err:internal*} / sum:n.bolt.api_usage{*} | > 0.01 (instantly) |
| P1 | sum:n.bolt.logical_bytes_read{*}.rollup(sum, 14400) | == 0 (instantly) |
| P2 | sum:n.bolt.logical_bytes_written{*}.rollup(sum, 14400) | == 0 (instantly) |
API errors
API errors are reported by each Read Replica node using the metric g.api_usage. The err label is empty on success and carries the error code on failure.
Expected error codes:
- Forbidden: Unauthorized reads (HTTP 403)
- NoSuchKey/NoSuchBucket: Crunch has not crunched this bucket or object (HTTP 404)
These error codes can happen at a high rate in normal operation and are not a cause for concern.
Unexpected error codes:
In normal operation, all other errors (Internal, for example) should not occur. The client will see these as HTTP 5xx errors. Clients may not see any impact because API requests are retried internally by the Granica SDK.
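The split between expected and unexpected error codes, and the error-rate ratio used by the P1 monitor, can be sketched in shell as follows. This is a minimal illustration, not Granica tooling: the function names, the "<err> <count>" input format, and the sample counts are all hypothetical.

```shell
# Classify an err label value from the api_usage metric.
classify_err() {
  case "$1" in
    "") echo "success" ;;
    Forbidden|NoSuchKey|NoSuchBucket) echo "expected" ;;
    *) echo "unexpected" ;;
  esac
}

# Read "<err> <count>" lines on stdin ("-" stands for an empty err label)
# and print the share of requests whose err label starts with "internal".
internal_error_rate() {
  awk '{ total += $2; if (tolower($1) ~ /^internal/) bad += $2 }
       END { if (total > 0) printf "%.4f\n", bad / total; else print "0.0000" }'
}

# Hypothetical counts for one evaluation window.
sample_counts='- 9700
NoSuchKey 250
Forbidden 30
Internal 20'

classify_err "NoSuchKey"   # prints: expected
classify_err "Internal"    # prints: unexpected
printf '%s\n' "$sample_counts" | internal_error_rate   # prints: 0.0020
```

With these sample counts, 20 of 10,000 requests carry an internal error, so the rate of 0.0020 stays below the 0.01 threshold even though expected errors are far more frequent.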
Integrity errors
Granica runs a rigorous, periodic integrity scan that validates that reduced objects have no data corruption. It does so by saving metadata about each object before the object is reduced, then comparing the object against that metadata later during the integrity scan.
The metric g.integrity.failures reports the number of objects that fail the integrity check. Such objects must be recovered from a separate backup, independent of Granica, if one is available.
Read throughput
Read throughput in bytes/s can be tracked by the g.logical_bytes_read metric, and read throughput in objects/s can be tracked by the g.api_usage metric. Throughput can go down to zero in normal operation if there are no client reads.
Reduction throughput
Reduction throughput in bytes/s can be tracked by the g.crunch.obj.logical_bytes_written metric. In normal operation, reduction throughput will vary based on available data pending reduction and the number of compute spot instances available.
Monitor using Datadog
Configure Granica for export
Deploy (or update) Granica with the Datadog API key. Granica will begin exporting application health metrics to Datadog.
$ cat projectn.tfvars
datadog_api_key = <KEY>
<snip>
$ granica update --var-file projectn.tfvars
Configure Datadog monitors
Use the Monitors section in Datadog to add a monitor for each alert. You can copy and paste the queries from the Monitor definitions table above.
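If you prefer to script monitor creation instead of using the Datadog UI, the sketch below builds a monitor definition for one row of the table. It rests on several assumptions that are not part of the Granica documentation: the monitor name and message are hypothetical, "< 1 for 60 mins" is expressed as max(last_1h) ... < 1 (one possible reading of that threshold), the API host shown is the Datadog US1 site, and DD_API_KEY and DD_APP_KEY are placeholder environment variable names.

```shell
# Build a Datadog monitor definition (JSON) from a name and a full query.
# The arguments must not contain double quotes or backslashes, because
# this sketch does no JSON escaping.
monitor_json() {
  printf '{"name": "%s", "type": "query alert", "query": "%s", "message": "%s"}\n' \
    "$1" "$2" "Granica monitor triggered: $1"
}

# Assumption: "< 1 for 60 mins" becomes max(last_1h) ... < 1.
payload=$(monitor_json "P0: quicksilver down" \
  'max(last_1h):sum:n.system.up{app:quicksilver} < 1')
echo "$payload"

# To create the monitor, POST the payload to the Datadog Monitors API
# (adjust the host if your account is on a different Datadog site):
# curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
#   -H "Content-Type: application/json" \
#   -H "DD-API-KEY: ${DD_API_KEY}" \
#   -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
#   -d "$payload"
```

The curl call is left commented out so that the snippet only prints the payload; review the JSON before sending it.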
Configure Datadog dashboard
Create your dashboard using the Granica metrics to visualize key operational data.
Monitor using CloudWatch
AWS CloudWatch Container Insights can collect, aggregate, and summarize metrics and logs including CPU, memory, disk, and network performance.
Identify the Granica EKS EC2 instances
SSH into your Granica Admin Server:
./setup.sh --login
Run granica ls and note the last four characters of the cluster_id:
$ granica ls
--- AWS Deployments ---
[default]
{
"project_id": "project-n-uat-projectn-78d7",
"cluster_id": "project-n-uat-projectn-78d7",
"deployment_id": "project-n-uat-projectn-78d7-1d80",
...
}
Attach the CloudWatch IAM policy
- Find the Granica EC2 instances in the AWS console by searching for the last four characters of the cluster_id
- Click any instance to find the common IAM role in Security Details
- Attach the CloudWatchAgentServerPolicy to that role
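The steps above can also be prepared from a shell. The sketch below extracts the four-character search term from the sample cluster_id shown earlier and prints (without running) the AWS CLI command that attaches the managed policy. The role name is a hypothetical placeholder; use the role shown under Security Details for your instances.

```shell
# Sample cluster_id copied from the granica ls output.
cluster_id="project-n-uat-projectn-78d7"

# Last four characters, used as the search term in the EC2 console.
suffix=$(printf '%s' "$cluster_id" | tail -c 4)
echo "search term: $suffix"   # prints: search term: 78d7

# Hypothetical placeholder: replace with the IAM role found in Security Details.
role_name="<granica-node-role>"

# Dry run: print the command instead of executing it.
attach_cmd="aws iam attach-role-policy --role-name $role_name --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
echo "$attach_cmd"
```

Printing the command first makes it easy to confirm the role name before changing IAM permissions.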
Install and configure the CloudWatch agent and Fluent Bit
Run the following from your Granica Admin Server:
ClusterName=<my-cluster-name>
RegionName=<my-cluster-region>
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off'|| FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${ClusterName}'/;s/{{region_name}}/'${RegionName}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' | kubectl apply -f -
Replace <my-cluster-name> with your cluster_id and <my-cluster-region> with your deployment's AWS region.
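The sed stage of the command above fills {{...}} placeholders in the downloaded manifest before it reaches kubectl. The sketch below shows the same substitution on a small, hypothetical two-line template (not the real manifest) and adds a check for placeholders that were left unreplaced. The function names and sample values are illustrative only.

```shell
# Substitute the cluster and region placeholders in a template read on stdin.
# $1 = cluster name, $2 = region. Values must not contain "/" in this sketch.
render_template() {
  sed "s/{{cluster_name}}/$1/;s/{{region_name}}/$2/"
}

# Succeeds if any {{placeholder}} is still present on stdin.
has_unreplaced() {
  grep -q '{{[a-z_]*}}'
}

# Hypothetical template standing in for the real manifest.
sample='cluster: {{cluster_name}}
region: {{region_name}}'

rendered=$(printf '%s\n' "$sample" | render_template my-cluster us-east-1)
echo "$rendered"

if printf '%s\n' "$rendered" | has_unreplaced; then
  echo "unreplaced placeholders remain"
else
  echo "template fully rendered"
fi
```

Running the real pipeline up to the sed stage and piping it through a check like has_unreplaced, before adding kubectl apply, is one way to catch an empty or misspelled variable.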
Validate that the agents are running:
$ kubectl get pods -n amazon-cloudwatch
NAME READY STATUS RESTARTS AGE
cloudwatch-agent-5w8tj 1/1 Running 0 41s
cloudwatch-agent-vmnct 1/1 Running 0 41s
fluent-bit-gc925 1/1 Running 0 41s
fluent-bit-n2zmh 1/1 Running 0 41s
Verify CloudWatch is working
The CloudWatch log groups should be created with the containerinsights prefix. In the Container Insights section you can see all performance metrics for the Granica EKS cluster.
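The pod validation step can be scripted so that it fails when any agent is not running. The sketch below is one way to do this, assuming the default column layout of kubectl get pods for a single namespace (STATUS in the third column); the function name and the sample listing are illustrative.

```shell
# Read `kubectl get pods` output on stdin; succeed only if at least one pod
# is listed and every pod has STATUS "Running".
all_running() {
  awk 'NR > 1 { n++; if ($3 != "Running") bad++ }
       END { if (n > 0 && bad == 0) exit 0; exit 1 }'
}

# Sample listing in the same shape as the output shown earlier.
sample_pods='NAME READY STATUS RESTARTS AGE
cloudwatch-agent-5w8tj 1/1 Running 0 41s
fluent-bit-gc925 1/1 Running 0 41s'

if printf '%s\n' "$sample_pods" | all_running; then
  echo "all agents running"
else
  echo "some agents are not running"
fi
```

Against a live cluster, the same check would be used as: kubectl get pods -n amazon-cloudwatch | all_running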