Deploy Screen in your Kubernetes Cluster (EKS)

You can run Granica Screen as part of your EKS cluster, allowing you to seamlessly integrate Screen into your AWS deployments.

System Requirements

We recommend deploying the pod on a g5.xlarge node running an amazon-eks-gpu-node AMI that matches your cluster's Kubernetes version, with at least 32 GB RAM and 128 GB of disk (see: System Requirements).
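To find the GPU-optimized AMI ID that matches your Kubernetes version, one option is to query the SSM parameter AWS publishes for EKS-optimized AMIs (shown here for Kubernetes 1.29 in us-west-2; adjust both to your setup):

```shell
# Look up the recommended EKS GPU AMI for a given Kubernetes version and region
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2-gpu/recommended/image_id \
  --region us-west-2 \
  --query "Parameter.Value" \
  --output text
```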

Creating a Cluster

If you don't already have a Kubernetes cluster on EKS, you can use the following steps to bring up a cluster using eksctl. If you already have a cluster, you can skip to the next section.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: screen-cluster
  region: us-west-2
nodeGroups:
  - name: screen-ng-1
    ami: ami-043de4ad25ed718c1 # amazon-eks-gpu-node-1.29-v20240117, replace this as needed
    amiFamily: AmazonLinux2
    instanceType: g5.xlarge
    minSize: 1
    maxSize: 1
    desiredCapacity: 1
    volumeSize: 128
    iam:
      instanceRoleARN: # insert your granica-screen-docker-role here
    overrideBootstrapCommand: |
      #!/bin/bash
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --kubelet-extra-args "--node-labels=${NODE_LABELS}"
```
- You can then deploy the cluster with `eksctl create cluster -f screen_cluster.yaml`.
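Once the cluster is up, it's worth confirming that the node registered and that the NVIDIA device plugin on the GPU AMI is advertising the `nvidia.com/gpu` resource. A quick check:

```shell
# Confirm the node is Ready and advertises a GPU resource
kubectl get nodes
kubectl describe nodes | grep -A 2 "nvidia.com/gpu"
```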

Deploy Screen in Your Cluster

- Run the following command on the EC2 instance to log in to Granica's ECR repository:

```shell
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 809541265033.dkr.ecr.us-east-2.amazonaws.com
```

- Then, use the resulting Docker config to create a Kubernetes Secret that serves as credentials to pull the image:

```shell
kubectl create secret generic regcred \
  --from-file=.dockerconfigjson=<path/to/.docker/config.json> \
  --type=kubernetes.io/dockerconfigjson
```
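Alternatively (a sketch, not the documented flow), you can create the pull secret directly from a fresh ECR token with `kubectl create secret docker-registry`, skipping the intermediate Docker config file:

```shell
# Create the image pull secret straight from an ECR authorization token
kubectl create secret docker-registry regcred \
  --docker-server=809541265033.dkr.ecr.us-east-2.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region us-east-2)"
```

Note that ECR tokens expire after 12 hours, so a secret created this way must be refreshed before the next image pull that needs it.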
- We recommend running Granica Screen as a Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: screen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: screen
  template:
    metadata:
      labels:
        app: screen
    spec:
      containers:
        - name: screen
          image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/screen:latest
          imagePullPolicy: Always
          resources:
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
          ports:
            - name: screenapi
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9092
              protocol: TCP
      imagePullSecrets:
        - name: regcred
```
Tip: Since each pod requires one GPU, you may run into GPU resource constraints when rolling out Screen deployments. If this happens, consider adjusting the Deployment's rollout strategy accordingly.

- Copy this into a file titled screen.yaml and create the deployment:

```shell
kubectl apply -f screen.yaml
```

- You probably want to expose the Screen API port as a Service:

```shell
kubectl expose deployment screen --type=NodePort --port=8080 --target-port=8080
```
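To see where the Service landed, or to poke the API from your own machine without going through the NodePort, two quick checks:

```shell
# Show the NodePort that Kubernetes assigned to the Screen API
kubectl get svc screen -o jsonpath='{.spec.ports[0].nodePort}'

# Or forward the API port locally for a quick smoke test
kubectl port-forward deploy/screen 8080:8080
```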

That's it! You should be ready to make requests to the service on the /screen endpoint. Refer here to see how you can use the endpoint. To find out about logging, health checks, and versioning, refer here.

Updating

To update the running Screen image, you can just update the Deployment spec and roll out your changes. For example, to change the image to v1.29.1-gpu:

```shell
kubectl set image deploy/screen screen=809541265033.dkr.ecr.us-east-2.amazonaws.com/screen:v1.29.1-gpu
kubectl rollout restart deploy/screen
```

- Recall from the tip above that your rollout strategy may need adjusting if you don't have the headroom for the default RollingUpdate strategy, which in this case requires at least one spare GPU node. You can switch the strategy to Recreate before running the commands above to get around this, at the cost of some downtime while the new pod comes back up:

```shell
kubectl patch deployment screen -p '{"spec":{"strategy":{"type":"Recreate", "rollingUpdate": null}}}'
```
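Either way, you can watch the rollout complete and confirm the new image tag is live:

```shell
# Wait for the rollout to finish, then confirm the running image
kubectl rollout status deploy/screen
kubectl get deploy screen -o jsonpath='{.spec.template.spec.containers[0].image}'
```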

System Performance

Performance was measured using the recommended system specs with a client running inside the Kubernetes cluster.

- With a single client sending requests of 100 tokens, average latency was 69 ms, with P50 at 70 ms and P90 at 74 ms.
- In steady state, one instance of the Screen container can handle up to three concurrent clients sending sustained 100-token requests with a P90 of 90 ms.
- With over 500 concurrent clients sending 1,000-word inputs, one instance can reach a throughput of 20,175 words/s.
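As a back-of-envelope figure derived from the numbers above (our arithmetic, not an additional measurement), the 1,000-word throughput works out to roughly 20 requests per second per instance:

```python
# Derive requests/s from the measured throughput and input size above
words_per_second = 20175    # one instance, >500 concurrent clients
words_per_request = 1000    # input size used in the benchmark

requests_per_second = words_per_second / words_per_request
print(f"~{requests_per_second:.1f} requests/s per instance")
```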

System Metrics

We expose Prometheus metrics on the metrics port (:9092) at /prom-metrics, documented here. We recommend using Prometheus Operator to scrape these metrics.

With Helm

- Copy the following snippet to kube-prometheus-stack-values.yaml:

```yaml
grafana:
  enabled: false
alertmanager:
  enabled: false
prometheus:
  prometheusSpec:
    scrapeInterval: 5s
    podMonitorSelector:
      matchLabels:
        prometheus: screen
  additionalPodMonitors:
    - name: screen
      additionalLabels:
        prometheus: screen
      namespaceSelector:
        matchNames:
          - default
      selector:
        matchLabels:
          app: screen
      podMetricsEndpoints:
        - path: /prom-metrics
          port: metrics
```
- Next, install the kube-prometheus-stack Helm chart with these values:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install -f kube-prometheus-stack-values.yaml kube-prometheus-stack-release prometheus-community/kube-prometheus-stack
```
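To confirm that Prometheus is actually scraping the Screen pod, you can port-forward the Prometheus UI and check its targets page. The `prometheus-operated` headless Service below is created by Prometheus Operator for every Prometheus instance; adjust the name if your setup differs:

```shell
# Forward the Prometheus UI locally, then open http://localhost:9090/targets
kubectl port-forward svc/prometheus-operated 9090:9090
```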

With kubectl

  • Follow the tutorial to set up Prometheus Operator if you don't have it set up already.

- To create a PodMonitor for Screen metrics, apply the following YAML configurations to your Kubernetes cluster:

  - This sets up a PodMonitor for the Screen metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: screen
  labels:
    app: screen
spec:
  selector:
    matchLabels:
      app: screen
  podMetricsEndpoints:
    - port: metrics
      path: /prom-metrics
```

Note: Even though specifying ports on the Deployment spec is usually optional in Kubernetes, it's necessary here so that the PodMonitor can match the port by name.

  - This sets up Prometheus Operator to use the PodMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  podMonitorSelector:
    matchLabels:
      app: screen
  resources:
    requests:
      memory: 400Mi
  scrapeInterval: 5s
  enableAdminAPI: false
```
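Assuming you saved the two manifests above as screen-podmonitor.yaml and screen-prometheus.yaml (our filenames, pick your own), apply them with:

```shell
kubectl apply -f screen-podmonitor.yaml
kubectl apply -f screen-prometheus.yaml
```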

Scaling Out

- If your application needs better throughput or latency than a single replica can provide, you can increase the number of replicas in your Screen Deployment.
- If your application has variable traffic and you want to provision replicas dynamically, you can configure Kubernetes to autoscale Screen using Prometheus. For that you may find it useful to configure Prometheus Adapter and autoscale with the Kubernetes HPA.

Autoscaling Tuning Recommendations

- A good starting point is to autoscale with a target of averageValue=3 for `n_screen_api_num_calls_outstanding`.
- If the data size of each request is roughly constant, you can also try a target of averageValue=1k for `n_screen_api_num_bytes_outstanding`.
- You can take the `avg_over_time` (say, over 1m) of these metrics if you don't want your autoscaling to be too sensitive.
- Note that you will need as many GPU-attached nodes in your NodeGroup as the maxReplicas you want to scale to.
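Putting the first recommendation into practice, an HPA might look like the sketch below. It assumes Prometheus Adapter has been configured to expose `n_screen_api_num_calls_outstanding` as a per-pod metric under the same name, and `maxReplicas: 4` is an arbitrary example; both are our assumptions, not tested configuration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: screen
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: screen
  minReplicas: 1
  maxReplicas: 4   # must not exceed the number of GPU-attached nodes available
  metrics:
    - type: Pods
      pods:
        metric:
          name: n_screen_api_num_calls_outstanding
        target:
          type: AverageValue
          averageValue: "3"
```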