Deploy Screen in your Kubernetes Cluster (EKS)
You can run Granica Screen as part of your EKS cluster, allowing you to seamlessly integrate Screen into your AWS deployments.
System Requirements
We recommend deploying the pod on a g5.xlarge node running an amazon-eks-gpu-node-x.xx-vxxxxxxxx AMI (choose the version that matches your Kubernetes version), with at least 32GB RAM and 128GB disk (see: System Requirements).
Creating a Cluster
If you don't already have a Kubernetes cluster on EKS, you can use the following steps to bring up a cluster using `eksctl`. If you already have a cluster, you can skip to the next section.
If you haven't already, create a new IAM role and request access to the Screen image.
Copy the following config to `screen_cluster.yaml`:
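The original config is not reproduced here; the `eksctl` ClusterConfig below is a minimal sketch in which the cluster name, region, and capacity are placeholders, while the instance type and disk size follow the system requirements above.

```yaml
# Illustrative ClusterConfig -- name, region, and capacity are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: screen-cluster   # placeholder cluster name
  region: us-west-2      # placeholder region
managedNodeGroups:
  - name: screen-gpu     # placeholder nodegroup name
    instanceType: g5.xlarge   # recommended GPU instance type
    desiredCapacity: 1
    volumeSize: 128      # GB, per the recommended disk size
```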
- You can then deploy the cluster with `eksctl create cluster -f screen_cluster.yaml`.
Deploy Screen in Your Cluster
- Run the following command on the EC2 instance to log in to Granica's ECR repository:
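The exact command isn't shown here; a typical ECR login, treating `<account-id>` and `<region>` as placeholders for Granica's registry, is:

```bash
# Authenticate Docker with the ECR registry (account ID and region are placeholders).
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
```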
- Then, use the login credentials to create a Kubernetes Secret for pulling the image:
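One way to do this, reusing the placeholder registry values above (the secret name `granica-ecr-creds` is illustrative):

```bash
# Store ECR credentials as a docker-registry Secret for use in imagePullSecrets.
kubectl create secret docker-registry granica-ecr-creds \
  --docker-server=<account-id>.dkr.ecr.<region>.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region <region>)"
```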
- You also need to get a license file from Granica. Save it locally and create a Kubernetes Secret from the file using the following (change the path to your license path):
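A sketch, with a hypothetical secret name `screen-license` and a placeholder path:

```bash
# Package the license file as a Kubernetes Secret (path is a placeholder).
kubectl create secret generic screen-license \
  --from-file=license=/path/to/your/license
```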
- We recommend running Granica Screen as a Deployment:
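The original spec is not reproduced here; the sketch below shows the general shape under stated assumptions: the image URI and tag, the API port number, and the license mount path are placeholders, while the `app: screen` label, the `metrics` port name, and the 1-GPU requirement follow the metrics and GPU notes elsewhere on this page.

```yaml
# Illustrative Deployment -- image URI, API port, and license mount path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: screen
  labels:
    app: screen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: screen
  template:
    metadata:
      labels:
        app: screen
    spec:
      imagePullSecrets:
        - name: granica-ecr-creds   # pull secret created above
      containers:
        - name: screen
          image: <account-id>.dkr.ecr.<region>.amazonaws.com/screen:<tag>
          ports:
            - name: metrics
              containerPort: 8080   # Prometheus metrics, per the System Metrics section
            - name: api
              containerPort: 8000   # hypothetical Screen API port
          resources:
            limits:
              nvidia.com/gpu: 1     # each pod requires one GPU (see tip below)
          volumeMounts:
            - name: license
              mountPath: /etc/granica   # hypothetical license location
              readOnly: true
      volumes:
        - name: license
          secret:
            secretName: screen-license
```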
tip
Since each pod has a resource requirement of 1 GPU, you may encounter GPU-related resource constraints when rolling out Screen Kubernetes deployments. If this happens, consider adjusting the deployment's rollout strategy accordingly.
- Copy this into a file titled `screen.yaml` and create the deployment:
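Creating the deployment is then a standard apply:

```bash
kubectl apply -f screen.yaml
```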
- You probably want to expose the Screen API port as a service:
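A minimal sketch, assuming the hypothetical `api` port name from the Deployment sketch above:

```yaml
# Illustrative Service exposing the Screen API inside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: screen
spec:
  selector:
    app: screen
  ports:
    - name: api
      port: 80          # service port (placeholder)
      targetPort: api   # matches the named container port
```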
That's it! You should be all ready to make requests to the service on the `/screen` endpoint. Refer here to see how you can use the endpoint. To learn about logging, health checks, and versioning, refer here.
Updating
To update the running Screen image, update the Deployment spec and roll out your changes. For example, to change the image to `v1.29.1-gpu`:
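The original commands aren't reproduced here; one standard approach, assuming the container is named `screen` and using a placeholder registry URI:

```bash
# Point the Deployment at the new tag, then wait for the rollout to finish.
kubectl set image deployment/screen \
  screen=<account-id>.dkr.ecr.<region>.amazonaws.com/screen:v1.29.1-gpu
kubectl rollout status deployment/screen
```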
- Recall from the tip above that your rollout strategy might need to be adjusted if you don't have the headroom to perform the default `RollingUpdate` strategy, which in this case requires at least one extra GPU node to be available. You can update your strategy to `Recreate` before running the above commands to get around this. Note that you may incur some downtime while the new pod comes back up.
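A sketch of that strategy change (setting `rollingUpdate` to null is required when switching the type to `Recreate`):

```bash
# Switch the rollout strategy to Recreate, clearing the RollingUpdate parameters.
kubectl patch deployment screen \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'
```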
System Performance
Performance was measured using the recommended system specs with a client running inside the Kubernetes cluster.
- With a single client sending requests of 100 tokens, average latency was 69ms, with P50 at 70ms and P90 at 74ms.
- In steady state, one instance of the Screen container can sustain up to three concurrent clients sending 100-token requests with a P90 of 90ms.
- With over 500 concurrent clients sending 1,000-word inputs, one instance can reach a throughput of 20,175 words/s.
System Metrics
We expose Prometheus metrics on `:8080/prom-metrics`, documented here. We recommend using Prometheus Operator to scrape these metrics.
With Helm
- Copy the following snippet to `kube-prometheus-stack-values.yaml`:

```yaml
grafana:
  enabled: false
alertmanager:
  enabled: false
prometheus:
  prometheusSpec:
    scrapeInterval: 5s
    podMonitorSelector:
      matchLabels:
        prometheus: screen
  additionalPodMonitors:
    - name: screen
      additionalLabels:
        prometheus: screen
      namespaceSelector:
        matchNames:
          - default
      selector:
        matchLabels:
          app: screen
      podMetricsEndpoints:
        - path: /prom-metrics
          port: metrics
```

- Next, install the kube-prometheus-stack helm chart with these values:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install -f kube-prometheus-stack-values.yaml kube-prometheus-stack-release prometheus-community/kube-prometheus-stack
```
With kubectl
Follow the tutorial to set up Prometheus Operator if you don't have it set up already.
To create a PodMonitor for Screen metrics, apply the following YAML configurations to your Kubernetes cluster:
- This sets up a PodMonitor for the Screen metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: screen
  labels:
    app: screen
spec:
  selector:
    matchLabels:
      app: screen
  podMetricsEndpoints:
    - port: metrics
      path: /prom-metrics
```

note
Even though specifying ports on the Deployment spec is usually optional in Kubernetes, it's necessary here so that the PodMonitor can match the port by name.
- This sets up Prometheus Operator to use the PodMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  podMonitorSelector:
    matchLabels:
      app: screen
  resources:
    requests:
      memory: 400Mi
  scrapeInterval: 5s
  enableAdminAPI: false
```
Scaling Out
- If your application needs better throughput or latency than is possible with a single replica, you can increase the number of replicas in your Screen deployment.
- If your application has variable traffic and you want to be able to dynamically provision replicas, you can configure Kubernetes to autoscale Screen using Prometheus. For that you may find it useful to configure Prometheus Adapter and autoscale with K8s HPA.
- If you're using Helm, you can just install the Prometheus Adapter chart.
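If so, a typical install from the prometheus-community repo added earlier looks like the following; note you will likely still need to configure the adapter's rules so the Screen metrics are served through the custom metrics API:

```bash
helm install prometheus-adapter prometheus-community/prometheus-adapter
```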
Autoscaling Tuning Recommendations
- A good starting point is to autoscale with a target of `averageValue=3` for `n_screen_api_num_calls_outstanding` (see the HPA sketch after this list).
- If the data size of each request is roughly constant, you can also try a target of `averageValue=1k` for `n_screen_api_num_bytes_outstanding`.
- You can take the `avg_over_time` (say, over 1m) of these metrics if you don't want your autoscaling to be too sensitive.
- Note that you will need as many GPU-attached nodes in your NodeGroup as the `maxReplicas` you want to scale to.
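As referenced in the first recommendation above, here is a sketch of an HPA using that target. It assumes Prometheus Adapter is configured to expose `n_screen_api_num_calls_outstanding` as a per-pod custom metric; the `maxReplicas` value is a placeholder.

```yaml
# Illustrative HPA scaling the Screen Deployment on outstanding API calls.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: screen
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: screen
  minReplicas: 1
  maxReplicas: 4   # placeholder; keep within your GPU node headroom
  metrics:
    - type: Pods
      pods:
        metric:
          name: n_screen_api_num_calls_outstanding
        target:
          type: AverageValue
          averageValue: "3"
```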