Machine Learning at Scale with OCI and Kubeflow

Sanjay Basu | Head of Technology Strategy, Oracle Cloud Engineering 

Seshadri Dehalisan 
Master Principal Cloud Architect, Oracle Cloud Engineering

Note: Our original blog was published on the Oracle Cloud Infrastructure blog site. I have republished it here with permission.

Official Disclaimer: The views and opinions expressed in this blog are those of the authors and do not necessarily reflect the official policy or position of Oracle Corporation.

Setting the context

Enterprises increasingly rely on machine learning (ML) to further their organizational goals. While machine learning can provide competitive advantage and intelligence, enterprises need a framework to harvest the benefits. This multi-part blog series discusses the challenges of machine learning at scale and how you can use the combined power of Oracle Cloud Infrastructure (OCI) offerings and the open source Kubeflow platform to achieve your ML outcomes.

Challenges 

Machine learning at scale introduces multiple challenges, as outlined in the diagram below.

Key challenges with managing ML at scale

 

OCI and Kubeflow to the rescue

Oracle Cloud Infrastructure (OCI) offers multiple services for enterprise ML needs: data science services; a compute service with multiple shapes, including highly performant GPU, bare metal, HPC and general-purpose compute shapes; and managed Kubernetes, known as Oracle Container Engine for Kubernetes (OKE). OCI also offers the underlying foundational components for networking, storage and security.

Kubeflow is an open source project that contains a curated set of compatible tools and frameworks specific for ML. Kubeflow runs on Kubernetes. Deploying Kubeflow on OKE enables deployment of machine learning workflows that are composable, scalable, secure and portable.

Implementing Kubeflow on OKE

OCI lets you create OKE clusters in several ways: the Console, Terraform, Oracle Resource Manager, or the OCI SDKs. This blog does not go through the steps to create an OKE cluster; readers can follow the link referred to here. OKE clusters are fully managed: Oracle manages the control plane, and the customer has the flexibility to choose different shapes for their worker nodes. The workers can be further grouped into distinct pools, called node pools, that serve different purposes.

Training for machine learning is resource intensive and slow, while model serving is typically lightweight and has stringent performance SLAs. So, consider separate node pools for training and for serving. Note that all nodes within a node pool use the same OCI shape.
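One way to steer training and serving workloads to different node pools is a Kubernetes nodeSelector. The sketch below builds two minimal Pod manifests as Python dictionaries. The label key `workload-type`, its values and the image names are hypothetical and not taken from OKE; substitute the labels your node pools actually carry.

```python
import json


def pod_manifest(name, image, pool_label_value, label_key="workload-type"):
    """Build a minimal Pod manifest pinned to nodes carrying a given label.

    label_key and pool_label_value are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # nodeSelector restricts scheduling to nodes whose labels match.
            "nodeSelector": {label_key: pool_label_value},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Hypothetical images: one pod for the training pool, one for the serving pool.
    training = pod_manifest("train-job", "example.com/trainer:latest", "training")
    serving = pod_manifest("model-server", "example.com/server:latest", "serving")
    print(json.dumps(training, indent=2))
    print(json.dumps(serving, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl once the matching labels exist on the nodes.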

  • Create an OKE Kubernetes cluster. For the purposes of illustration, we have created a 3-node cluster on the VM.Standard.E3.Flex shape.

    $ kubectl get nodes -o wide
    NAME          STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION                    CONTAINER-RUNTIME
    10.0.24.218   Ready    node    2d15h   v1.20.8   10.0.24.218           Oracle Linux Server 7.9   5.4.17-2102.203.6.el7uek.x86_64   cri-o://1.20.2
    10.0.39.106   Ready    node    2d15h   v1.20.8   10.0.39.106           Oracle Linux Server 7.9   5.4.17-2102.203.6.el7uek.x86_64   cri-o://1.20.2
    10.0.40.137   Ready    node    2d15h   v1.20.8   10.0.40.137           Oracle Linux Server 7.9   5.4.17-2102.203.6.el7uek.x86_64   cri-o://1.20.2
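The same readiness check can be scripted. The following is a minimal sketch that inspects the parsed output of `kubectl get nodes -o json` and lists any node whose Ready condition is not "True". The function name and the inline sample are illustrative only.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


if __name__ == "__main__":
    # Illustrative sample shaped like `kubectl get nodes -o json` output.
    sample = {
        "items": [
            {
                "metadata": {"name": "10.0.24.218"},
                "status": {"conditions": [{"type": "Ready", "status": "True"}]},
            }
        ]
    }
    pending = not_ready_nodes(sample)
    print("All nodes Ready" if not pending else f"Not ready: {pending}")
```

In practice, the dictionary would come from `json.loads` applied to the output of `kubectl get nodes -o json`.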

 

Kustomize

Kubeflow uses Kustomize (a Kubernetes-native application configuration management tool) to install its components. Kubeflow offers two installation options: a single-command installation of all components, and a multi-command installation of individual components. For this blog, we illustrate the single-command installation. As of this writing, Kubeflow is not compatible with the latest Kustomize 4.x releases, so Kustomize 3.2 should be used. It can be downloaded from here.

./kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}
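Since the installation depends on the 3.2 line of Kustomize, it can help to check the version before deploying. The sketch below parses the output format shown above. It assumes the `KustomizeVersion:<major>.<minor>.<patch>` field is present; other Kustomize releases print a different format, which this helper rejects with an error.

```python
import re


def kustomize_version(version_output):
    """Extract (major, minor, patch) from `kustomize version` output.

    Assumes the output contains `KustomizeVersion:<major>.<minor>.<patch>`.
    """
    match = re.search(r"KustomizeVersion:v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find KustomizeVersion in output")
    return tuple(int(part) for part in match.groups())


def is_supported(version_output):
    """True when the reported version is in the 3.2.x line used in this post."""
    major, minor, _patch = kustomize_version(version_output)
    return (major, minor) == (3, 2)


if __name__ == "__main__":
    sample = (
        "Version: {KustomizeVersion:3.2.0 "
        "GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 "
        "BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}"
    )
    print(kustomize_version(sample), is_supported(sample))
```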

Make sure to add Kustomize to your path, or install Kustomize in a common directory such as /usr/local/lib.

Get the Kubeflow Repo

 

git clone https://github.com/kubeflow/manifests.git

 cd manifests

 


Pre-deploy Customizations

The default out-of-the-box login credentials for Kubeflow are user@example.com with the password 12341234.

Let us change the password to something more secure before deployment, as follows. This assumes you have Python 3 installed in your client environment. The command prompts you for the desired password and prints its bcrypt hash.

pip3 install passlib
pip3 install bcrypt

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

Take the hash value and use it to replace the existing hash in config-map.yaml in the manifests/common/dex/base directory. We will take up changing the default username and pointing it to an external SSO in subsequent blog posts.
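The replacement can be done by hand in an editor, or scripted. The sketch below swaps the hash in the text of the config file. It assumes the static password is stored on a single line of the form `hash: <value>`, which should be checked against your copy of config-map.yaml; the function name and the placeholder values are illustrative only.

```python
import re


def replace_dex_hash(config_text, new_hash):
    """Replace the value of the first `hash:` line in the config text.

    Assumes a single-line `hash: <bcrypt hash>` entry.
    """
    pattern = re.compile(r"^([ \t]*hash:[ \t]*).*$", re.MULTILINE)
    # A function replacement keeps the hash text literal (no escape handling).
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_hash, config_text, count=1
    )
    if count == 0:
        raise ValueError("no 'hash:' line found")
    return updated


if __name__ == "__main__":
    # Placeholder content and hash, not real values.
    sample = (
        "staticPasswords:\n"
        "- email: user@example.com\n"
        "  hash: OLD_HASH_PLACEHOLDER\n"
        "  username: user\n"
    )
    print(replace_dex_hash(sample, "$2y$12$REPLACE_WITH_YOUR_OWN_HASH"))
```

To update the real file, read its contents, pass them through `replace_dex_hash` with the hash printed by the command above, and write the result back.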

Deploy Kubeflow

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 30; done
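The loop above retries because the first pass can fail: custom resources cannot be created until their custom resource definitions are ready. The same idea can be sketched in Python with a bounded number of attempts. The function names and the attempt limit are assumptions made for illustration. Unlike the shell pipeline, `apply_once` also treats a failed `kustomize build` as a failure.

```python
import subprocess
import time


def apply_once(build_dir="example"):
    """Run `kustomize build <dir> | kubectl apply -f -` once; True on success."""
    build = subprocess.run(["kustomize", "build", build_dir], capture_output=True)
    if build.returncode != 0:
        return False
    # Feed the rendered manifests to kubectl on standard input.
    apply = subprocess.run(["kubectl", "apply", "-f", "-"], input=build.stdout)
    return apply.returncode == 0


def retry_until_success(action, delay_seconds=30, max_attempts=10, sleep=time.sleep):
    """Call `action` until it returns True; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if action():
            return attempt
        if attempt < max_attempts:
            print("Retrying to apply resources")
            sleep(delay_seconds)
    raise RuntimeError(f"still failing after {max_attempts} attempts")
```

Calling `retry_until_success(apply_once)` from the manifests directory mirrors the shell loop, except that it gives up after ten attempts instead of looping forever.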

Kubeflow Components

Once Kubeflow installs successfully, it will have created multiple components and namespaces. Details of the key components are given below.

Namespace      Purpose
Kubeflow       Primary namespace for Kubeflow components
KFServing      Components for serverless Kubernetes inferencing
cert-manager   Components for managing mutual TLS and admission webhooks; Kubeflow follows a zero-trust model and uses mutual TLS
Istio          Components that secure traffic and enforce network authorization and routing policies
Dex            Components for OpenID Connect identity

 

At this point, istio-ingressgateway is exposed as a NodePort service. This can be verified as follows:

kubectl describe svc istio-ingressgateway -n istio-system

Name:                     istio-ingressgateway
Namespace:                istio-system
Labels:                   app=istio-ingressgateway
                          install.operator.istio.io/owning-resource=unknown
                          istio=ingressgateway
                          istio.io/rev=default
                          operator.istio.io/component=IngressGateways
                          release=istio
Annotations:              
Selector:                 app=istio-ingressgateway,istio=ingressgateway
Type:                     NodePort
IP:                       10.233.254.172
Port:                     status-port  15021/TCP
TargetPort:               15021/TCP
NodePort:                 status-port  32723/TCP
Endpoints:                10.234.0.7:15021
Port:                     http2  80/TCP
TargetPort:               8080/TCP
NodePort:                 http2  31323/TCP
Endpoints:                10.234.0.7:8080
Port:                     https  443/TCP
TargetPort:               8443/TCP
NodePort:                 https  31547/TCP
Endpoints:                10.234.0.7:8443
Port:                     tcp  31400/TCP
TargetPort:               31400/TCP
NodePort:                 tcp  32426/TCP
Endpoints:                10.234.0.7:31400
Port:                     tls  15443/TCP
TargetPort:               15443/TCP
NodePort:                 tls  30051/TCP
Endpoints:                10.234.0.7:15443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
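If the node port needs to be picked up by a script, the JSON form of the service is easier to work with than the describe output. The sketch below looks up a port by name in the parsed output of `kubectl get svc istio-ingressgateway -n istio-system -o json`. The function name is illustrative, and the inline sample reproduces only the http2 and https entries shown above.

```python
def node_port(service, port_name):
    """Return the nodePort of the named port in a parsed Service object."""
    for port in service.get("spec", {}).get("ports", []):
        if port.get("name") == port_name:
            return port.get("nodePort")
    raise KeyError(f"no port named {port_name!r}")


if __name__ == "__main__":
    # Trimmed sample using the values from the describe output above.
    sample = {
        "spec": {
            "type": "NodePort",
            "ports": [
                {"name": "http2", "port": 80, "targetPort": 8080, "nodePort": 31323},
                {"name": "https", "port": 443, "targetPort": 8443, "nodePort": 31547},
            ],
        }
    }
    print(node_port(sample, "http2"))
```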

Because the gateway is exposed only as a NodePort service, a simple port forward is enough to test access to Kubeflow. This can be done as shown below:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Now you can point your browser to localhost:8080, and it will open the Kubeflow UI as shown below.

Kubeflow UI logon screen

Now you can enter user@example.com and the password you created earlier. The UI will look as below:

 

Kubeflow UI

 

Conclusion

The objective of this blog series is to help MLOps teams overcome ML pipeline implementation issues by automating model deployment into core software applications and/or standing up an as-a-service, API-based software delivery component.

So far, we have identified why Kubeflow is needed and how OCI and Kubeflow complement each other. We have also walked through the basic process of installing Kubeflow. To keep the blog to a manageable size, we will cover the key cornerstones of Kubeflow industrialization in subsequent blogs. The plan for the series is shown below.


ML Blog Series Plan

