GCloud Run Deploy: Deploy ML models on Google Cloud Run

This blog post is part of our series on Serverless Inference for Machine Learning models accompanying our KubeCon 2020 talk: Serverless for ML Inference on Kubernetes: Panacea or Folly? - Manasi Vartak, Verta Inc. We'll be hosting a live replay on December 2nd at 2 PM ET for anyone who missed it.

As builders of an MLOps platform, we often get asked whether serverless is the right paradigm for deploying models. The cost savings touted by serverless seem as appealing for ML workloads as for traditional workloads. However, the special hardware and resource requirements of ML models can be an impediment to using serverless architectures. To provide the best solution to our customers, we ran extensive benchmarks comparing serverless to traditional computing for inference workloads. In particular, we evaluated inference workloads on AWS Lambda, Google Cloud Run, and Verta.

In this series of posts, we cover how to deploy ML models on each of the above platforms and summarize our results in our benchmarking blog post.

How to deploy ML models on AWS Lambda
How to deploy ML models on Google Cloud Run (this article)
How to deploy ML models on Verta
Serverless for ML Inference: a benchmark study

This post talks about how to get started with deploying models on Google Cloud Run, along with the pros and cons of using this system for inference.

What is Google Cloud Run?

Cloud Run is a serverless platform for running containerized applications. It can run them either on a fully managed container service or in a customer’s Kubernetes (k8s) cluster via Anthos. Cloud Run is powered by Knative, which bills itself as a “Kubernetes-based platform to deploy and manage modern serverless workloads.” Knative supports various service mesh frameworks, but in the case of Cloud Run it uses the Istio k8s service mesh. This gives containerized applications serverless features such as scale-to-zero and scaling based on request concurrency. In effect, Cloud Run is the paid, managed version of the Knative / Istio / k8s stack, allowing teams to use modern autoscaling container technology without the hassle of running the infrastructure.

How is Google Cloud Run appealing?

Pros

No infrastructure (in the case of fully managed)
Dashboards / logs / revision history (in the case of fully managed or on customer’s Anthos cluster)
Cost that scales linearly with usage

Cons

Limits (which cannot be changed) and quotas (for which increases can be requested) constrain what kinds of models can be run
Cold starts, as with any serverless platform
Cold starts occur even with steady traffic

Getting started with Google Cloud Run

Creating an endpoint with Google Cloud Run is very similar to creating an always-on, container-based deployment. But before we can deploy anything, we need to perform some Google Cloud Platform setup.

Platform Requirements

You will need to create a GCP account, attach billing information and create a project. If you are a new GCP user then you are in luck: you will receive $300 in credits to use for your first year!

Sign up here: https://cloud.google.com/gcp/
You should be prompted to create a project, but if you skip that step or want to create a separate project for this exercise, then go to https://console.cloud.google.com/projectcreate

Software Requirements

Install:

gcloud CLI https://cloud.google.com/sdk/gcloud

First, let’s log in and select the project to work in (using a dedicated project makes cleaning things up easier)

Log into gcloud:

$ gcloud auth login

A browser should open, prompting you to log into your Google account. If authentication is successful, you should see the Google Cloud Console landing page in your browser. Return to the terminal and set the current project, where PROJECT_ID is the project you created earlier.

Set default project for further commands:

$ gcloud config set project PROJECT_ID

Using a separate project is a good idea so that we can find everything we started in order to clean up our account and avoid perpetual charges.

At this time we can also set a default region for Cloud Run. Feel free to use a different region closer to you.

Set the run/region:

$ gcloud config set run/region us-central1

If you have a POSIX terminal (Linux / macOS / Windows WSL) you can save the project ID in a variable for later use:

$ export PROJECT_ID=$(gcloud config get-value project)

Finally, let’s create a directory for some files we’ll add soon.

$ mkdir cloudrun_example

$ cd cloudrun_example

NLP example (DistilBERT)

For this blog post we will set up DistilBERT and perform question and answer predictions.

There are two Python source files: the bootstrap script, which downloads and saves the model files, and the app, which serves predictions. Running a separate bootstrap script while the container image is built, before the endpoint is created, greatly improves the cold start time of the endpoint, because the model no longer has to be downloaded before the first request can be served.
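The gist itself is not reproduced here, so what follows is only a sketch of the bootstrap idea under stated assumptions. `ensure_model` and `fake_download` are hypothetical names, and a real bootstrap.py would call the model library's own download and save functions instead of writing a placeholder file. The point it illustrates is that the expensive download happens once, and later calls find the files already on disk.

```python
import tempfile
from pathlib import Path


def ensure_model(model_dir, download):
    """Download the model into model_dir unless it is already there.

    Returns True if a download happened, False if the cached copy was reused.
    """
    model_dir = Path(model_dir)
    marker = model_dir / ".complete"
    if marker.exists():
        return False  # cached copy is usable, skip the download
    model_dir.mkdir(parents=True, exist_ok=True)
    download(model_dir)  # expensive step, meant to run at image build time
    marker.write_text("ok")  # written last, so a partial download is retried
    return True


def fake_download(model_dir):
    """Hypothetical stand-in for the real model download."""
    (model_dir / "weights.bin").write_bytes(b"\x00" * 16)


# Demo: the first call downloads, the second reuses the files on disk.
demo_dir = Path(tempfile.mkdtemp()) / "model"
first_call = ensure_model(demo_dir, fake_download)
second_call = ensure_model(demo_dir, fake_download)
print(first_call, second_call)
```

Running the equivalent of the first call during the image build means a freshly started container only ever takes the cheap path.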

Create these files in your cloudrun_example folder: gist

There should be three files:

cloudrun_example$ ls
app.py bootstrap.py Dockerfile
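As a rough stand-in for the app.py in the gist (this is not the author's file), here is a minimal sketch of a prediction server using only the Python standard library. `load_model` is a hypothetical placeholder that returns a toy function rather than DistilBERT. The two things it is meant to show are that the model is loaded once per container instance rather than once per request, and that the server listens on the port Cloud Run supplies in the PORT environment variable.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model():
    """Hypothetical stand-in for loading the saved model from local disk."""
    def answer(context, question):
        # Placeholder logic only: return the last clause of the context.
        return context.split(",")[-1].strip()
    return answer


# Loaded once when the container instance starts, not once per request.
MODEL = load_model()


def predict(payload):
    """Turn a decoded JSON request body into a prediction."""
    return MODEL(payload["context"], payload["question"])


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server():
    """Build a server bound to the port passed in via PORT (8080 if unset)."""
    port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), PredictHandler)


# To serve requests: make_server().serve_forever()
```

A production app would more likely use a web framework and validate its input; the structure, not the details, is what carries over.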

This Dockerfile references the two Python scripts from earlier, bootstrap.py and app.py. Because we have gcloud installed, we can offload the Docker build and publish steps to Google Cloud Build:

cloudrun_example$ gcloud builds submit --tag=gcr.io/$PROJECT_ID/distilbert:latest

If the command completes with SUCCESS, then the Docker image has been built and published. If there is an error, searching for the error message should lead you quickly towards the answer.

We are now ready to deploy a Cloud Run endpoint:

cloudrun_example$ gcloud run deploy distilbert \
--image=gcr.io/$PROJECT_ID/distilbert:latest \
--platform=managed \
--concurrency=1 \
--set-env-vars=MKL_NUM_THREADS=2,OMP_NUM_THREADS=2,NUMEXPR_NUM_THREADS=2 \
--cpu=2 \
--memory=3G \
--allow-unauthenticated
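For scripted deployments, the same command can be assembled in Python. This is a sketch, assuming a hypothetical helper named `build_deploy_command`. It uses only the flags shown above and derives the three thread-count variables from the `--cpu` value so that they cannot drift apart.

```python
import shlex


def build_deploy_command(project_id, service="distilbert", cpu=2, memory="3G"):
    """Return the gcloud argument list for the deploy step shown above."""
    # Keep the math libraries' thread pools the same size as the CPU allocation.
    thread_vars = ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "NUMEXPR_NUM_THREADS")
    env = ",".join(f"{name}={cpu}" for name in thread_vars)
    return [
        "gcloud", "run", "deploy", service,
        f"--image=gcr.io/{project_id}/{service}:latest",
        "--platform=managed",
        "--concurrency=1",
        f"--set-env-vars={env}",
        f"--cpu={cpu}",
        f"--memory={memory}",
        "--allow-unauthenticated",
    ]


command = build_deploy_command("my-project")  # "my-project" is a placeholder
print(shlex.join(command))
# To run it: subprocess.run(command, check=True)
```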

If successful, the terminal will print the URL of the new endpoint, e.g. https://distilbert-3bz2zvlxbq-uc.a.run.app

We are now ready to make a prediction! Try it out:

curl -d '{"context":"Roses are red, Violets are blue, Sugar is sweet, And so are you.", "question":"What am I?"}' -H "Content-Type: application/json" -X POST https://distilbert-3bz2zvlxbq-uc.a.run.app/predict

Responds with:

"sugar is sweet , and so are you"

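The same request can be made from Python. This is a sketch using only the standard library. `build_predict_request` is a hypothetical helper name, and the URL is the example endpoint from above, which you would replace with your own.

```python
import json
import urllib.request


def build_predict_request(base_url, context, question):
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"context": context, "question": question}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "https://distilbert-3bz2zvlxbq-uc.a.run.app",  # replace with your own URL
    "Roses are red, Violets are blue, Sugar is sweet, And so are you.",
    "What am I?",
)
# To send it: urllib.request.urlopen(request).read().decode("utf-8")
```

Building the request separately from sending it makes it easy to reuse for timing or load tests later.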
Cloud Run limitations

Now that we’ve seen how easy it is to deploy a serverless ML application to Cloud Run, let’s talk frankly about some limitations. Cloud Run, being a platform that targets a wide variety of applications, comes with a set of defaults and constraints that preclude some ML workloads. For example, at the time of writing, the maximum number of CPUs that can be requested is 4 and the maximum memory is 4G. When deploying an ML workload, one will probably want to lower the container concurrency, which is the number of requests a single container handles at once. The maximum response time for Google Cloud Run is 60 minutes. While there is no container size limit, containers must be able to load and start within 5 minutes. When an instance is shut down, its container is given 10s to exit before being forcefully terminated.

Reference:

Default Quotas https://cloud.google.com/run/quotas
Container Contract: https://cloud.google.com/run/docs/reference/container-contract
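The limits quoted above can be turned into a quick check to run before deploying. This is a sketch: the constants are the numbers given in this post, not values read from Google, and `limit_violations` is a hypothetical helper name.

```python
# Limits as quoted in this post; they change over time, so treat them as
# assumptions and check the quotas page for current values.
MAX_CPUS = 4
MAX_MEMORY_GB = 4
MAX_STARTUP_SECONDS = 5 * 60


def limit_violations(cpus, memory_gb, startup_seconds):
    """Return the reasons a workload would not fit, or an empty list."""
    problems = []
    if cpus > MAX_CPUS:
        problems.append(f"needs {cpus} CPUs, limit is {MAX_CPUS}")
    if memory_gb > MAX_MEMORY_GB:
        problems.append(f"needs {memory_gb}G memory, limit is {MAX_MEMORY_GB}G")
    if startup_seconds > MAX_STARTUP_SECONDS:
        problems.append(
            f"takes {startup_seconds}s to start, limit is {MAX_STARTUP_SECONDS}s"
        )
    return problems


# The DistilBERT example in this post (the startup time is a made-up figure).
print(limit_violations(cpus=2, memory_gb=3, startup_seconds=30))
# A larger, made-up model that would not fit.
print(limit_violations(cpus=8, memory_gb=16, startup_seconds=30))
```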

When building this example, long cold starts made it necessary to add the bootstrap.py file, which preloads the model into the container image. While running benchmarks, we noted that Cloud Run seemed to suffer from “continuous cold starts”: even with very steady throughput, new instances would continuously spin up, and the unlucky request routed to a new instance would wait ~10s for a response. Because of this, Cloud Run never truly reaches a steady state. It would appear that pods are continuously recycled. Unlike a self-hosted Knative installation, Cloud Run offers no configuration to customize this behavior.
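One simple way to spot this pattern in your own measurements is to flag requests that take far longer than the typical one. This is a sketch, not the method used in our benchmark: `find_cold_starts` is a hypothetical helper, the factor of 5 is an arbitrary threshold, and the sample latencies are made up for illustration.

```python
import statistics


def find_cold_starts(latencies_s, factor=5.0):
    """Return the indexes of requests slower than factor x the median latency."""
    typical = statistics.median(latencies_s)
    return [i for i, latency in enumerate(latencies_s) if latency > factor * typical]


# Made-up latencies in seconds, for illustration only (not benchmark data).
sample = [0.4, 0.5, 0.4, 10.2, 0.5, 0.4, 9.8, 0.5]
print(find_cold_starts(sample))  # prints [3, 6]
```

If the flagged indexes keep appearing throughout a steady-rate run rather than only at the start, instances are being recycled underneath you.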

On Rolling one’s own Knative vs Cloud Run

Since Cloud Run is built almost entirely on open source technologies, one could choose to install and configure their own k8s cluster, ingress controllers, Istio and Knative to run their own “serverless” workloads directly. Of course, calling such an installation “serverless” shows how ridiculous the term really is. Of course there are servers… but with a managed serverless platform like Cloud Run, what you as the customer give up is the responsibility of directly managing server infrastructure.

However, this author can confirm that teams looking to go this route are in for quite the ride. Configuring and running k8s, Istio and Knative are all non-trivial tasks. When you go with a managed platform like Cloud Run, the intricacies of these platforms are mostly hidden from the user. Self-hosting opens up a world of knobs and switches that can be set. While this can be a huge advantage to teams that need to finely tune specialized workloads, for most use cases it is a deep rabbit hole of testing, tuning and more testing.

Not yet afraid? Well, we haven’t even covered logs, metrics, revision control and dashboards. With Cloud Run you get web and command line tools for viewing and managing your workloads. These tools are the part of the stack that is not open source, so with a self-hosted k8s / Istio / Knative installation you are on the hook to add all of them yourself in order to manage your workloads and have visibility into them.

If one insists on setting up their own infrastructure, then consider taking a look at Anthos. Anthos is yet another installation on a k8s cluster, one that provides a rich set of integrations with the Google Cloud Console. An Anthos-enabled cluster can be deployed to from the Cloud Run dashboard in the Google Cloud Console, and it connects with all of Google Cloud’s existing monitoring tools. With Anthos, customers can control and customize their Knative installation while also having access to all of Google’s proprietary console tools.

Conclusion

Google Cloud Run is great for the use cases that fit within the container contract. Because Google Cloud Run is built on top of open source technologies, there is also a path towards self-hosting. In addition, since Cloud Run can scale to zero, you pay as you go. However, you will pay more for that compute than you would by simply renting a Google Compute Engine instance and hosting the service directly. In that case, though, you would also create a floor on expense, because you would not be scaling to zero.
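The trade-off can be made concrete with a little arithmetic. This is a sketch with made-up prices (neither figure is a real Google Cloud price). It assumes the serverless platform bills only for the time an instance is busy, while the always-on instance is billed for every hour of the month.

```python
HOURS_PER_MONTH = 730


def monthly_costs(busy_fraction, serverless_per_busy_hour, vm_per_hour):
    """Return (serverless, always_on) monthly cost for one instance's worth of load."""
    serverless = busy_fraction * HOURS_PER_MONTH * serverless_per_busy_hour
    always_on = HOURS_PER_MONTH * vm_per_hour  # the floor: billed even when idle
    return serverless, always_on


def break_even_busy_fraction(serverless_per_busy_hour, vm_per_hour):
    """Busy fraction above which the always-on instance becomes cheaper."""
    return vm_per_hour / serverless_per_busy_hour


# Hypothetical prices in dollars per hour, chosen only for illustration.
SERVERLESS_PRICE = 0.20
VM_PRICE = 0.05

print(break_even_busy_fraction(SERVERLESS_PRICE, VM_PRICE))
print(monthly_costs(0.10, SERVERLESS_PRICE, VM_PRICE))  # mostly idle service
print(monthly_costs(0.50, SERVERLESS_PRICE, VM_PRICE))  # busy half the time
```

With these illustrative numbers, the mostly idle service is cheaper on the serverless platform and the busy one is cheaper on the always-on instance.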

Google Cloud Run, with its default settings, might be better suited to small HTTP workloads than to machine learning. There are hard limits on CPU and memory that prevent running larger ML workloads. The default request concurrency (80) makes sense for lightweight, parallelizable workloads but will create serious issues for ML workloads, where each request is CPU-intensive. Care should be taken when creating the deployable container to ensure that it is able to start, run and stop within the constraints of Cloud Run.
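To see why the default concurrency hurts, consider a deliberately crude model. It assumes that every inference request is CPU-bound and saturates the instance on its own, so requests in flight on the same instance simply share the CPU equally. The 0.5s service time is a made-up figure, and `estimated_latency_s` is a hypothetical helper.

```python
def estimated_latency_s(service_s, in_flight):
    """Crude estimate of per-request latency on one instance.

    Assumes CPU-bound requests that each saturate the instance and share it
    equally, so latency grows linearly with the number of requests in flight.
    """
    return service_s * max(1, in_flight)


# Made-up service time of 0.5s per request when the instance is otherwise idle.
for concurrency in (1, 8, 80):
    print(concurrency, estimated_latency_s(0.5, concurrency))
```

Under these assumptions a request that takes half a second alone takes tens of seconds when 80 are admitted at once, which is why the example above deploys with `--concurrency=1`.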

Stay tuned for our next blog in this series!

Author biography

John spent over ten years in the enterprise software/embedded/mobile development tools space before transitioning to fullstack web development for the past ten years. John currently leads the frontend team at Verta.

About Verta

Verta provides AI/ML model management and operations software that helps enterprise data science teams to manage inherently complex model-based products. Verta’s production-ready systems help data science and IT operations teams to focus on their strengths and rapidly bring AI/ML advances to market. Based in Palo Alto, Verta is backed by Intel Capital and General Catalyst. For more information, go to www.verta.ai or follow @VertaAI.
