Kubernetes Volume Controller (KVC): Data Management Tailored for Machine Learning Workloads in Kubernetes

In this blog post, we describe the Kubernetes Volume Controller (KVC), an open source project we developed that provides basic volume and data management in Kubernetes tailored toward machine learning (ML) workloads and pipelines.

 

Why Should I Care?

Data is an important component in ML workloads and pipelines. Typically, data scientists and ML practitioners handle data using the existing primitives available through a scheduling system such as Kubernetes. However, users still need to keep track of the data as well as the relationship between the data and the primitives used. Data from multiple sources might be required to run their workloads and pipelines. In some cases (e.g., hyperparameter tuning), the data might need to be replicated or made available on a subset of the compute nodes in a cluster. Making sure the data is available on the nodes on which the ML task is scheduled is also cumbersome. Moreover, the user experience with any software that alleviates these issues should be seamless: it should integrate well with existing ML workflows and should not hinder the progress of users.

For example, to enable the execution of an ML workload for a user group, cluster operators might download a frequently used dataset manually to a subset of compute nodes in the cluster and label the nodes appropriately to indicate the presence of that dataset. A data scientist who needs to use that dataset in their workload will have to keep track of these labels and make sure their ML workload lands on a compute node where the required dataset is available in order to consume the data. The same holds true for frequently used models. If the data scientist wants to explore a new dataset or a model, it might cause further difficulties and delays. Spending time in the orchestration of such a workflow is a drain on the productivity of a data scientist.

We believe data scientists, ML practitioners, and cluster operators would be happy to offload these systems issues in ML to the scheduling substrate when possible. The goal of the Kubernetes Volume Controller (KVC) is to solve these issues for ML workloads and pipelines on the Kubernetes container orchestration system.

What is Kubernetes Volume Controller?

KVC provides a single interface to manage data from different data sources in a Kubernetes cluster using existing primitives such as API extension capabilities and volumes. It establishes a relationship between data and volumes and provides a way to abstract the details away from the user. When using KVC, users are expected to only interact with a single resource type in Kubernetes without having to worry about other underlying complexities.

Hmm… What About Volume Support in Kubernetes?

Kubernetes natively supports a variety of volumes backed by different sources. However, data management for ML workloads on Kubernetes gives rise to several challenges with respect to user experience and system software. We describe some of the challenges below:

  1. The life cycle for deploying a model in production is already complex, and a software system developed for data management should not add obstacles that hinder the velocity of the user. Instead, the data management system should integrate well with Kubernetes and with ML framework operators and custom resource definitions (CRDs) (e.g., tf-operator) in order to reduce the barrier to entry. It should leverage and provide abstractions over existing Kubernetes primitives related to data management.
  2. In the life cycle for deploying a model in production, the data may come from multiple sources. Any software developed should ease the use of data from different sources in any stage of model deployment. It should also enable the user to track and manage the data, volumes, and their relationship.
  3. A diverse group of users work in different stages of the life cycle for deploying a model in production. These users fall under the category of data scientists and ML practitioners, software developers, and devops engineers. A software system for data management should cater to the needs of all these user groups.

Use Cases

  • As a data scientist, I want the ability to use, track, and manage data and models from different sources in my experiments to optimize and deploy in production.
  • As a devops engineer, I want the ability to pre-seed frequently used data and models from different sources in a set of nodes in the cluster so that the users in my group need not do it themselves. In addition, I have the following requirements:
    1. I want to surface the pre-seeded data and models as volumes so that they can be consumed easily.
    2. I want to provide scheduling constraints so that a task (either single-node or distributed) using these volumes lands on a node where the data is already available.
    3. I want it to be easily trackable and manageable.

Using Kubernetes Volume Controller

KVC leverages the operator pattern in Kubernetes to satisfy the requirements specified above for data management for ML workloads. It consists of a custom resource definition (CRD) and a custom controller that drives the current state [i.e., the spec of a KVC custom resource (CR)] to its desired state (i.e., the status of a KVC CR). An example CR is shown in Figure 1. Each CR can contain one or more VolumeConfigs from different data sources, along with the metadata required to establish a relationship between a volume and its data and the information required to track and manage it. Each VolumeConfig contains an ID, the number of replicas required of that particular data, a data source type, options specific to the data source type, and labels to annotate the data. These labels can also be used to retrieve and search for a specific dataset in a cluster. The full schema for each data source type is described here.
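Based on this description, a CR spec might look like the following sketch. Note that the `apiVersion`, `kind`, and field names here are illustrative assumptions inferred from the prose, not the authoritative schema; consult the project repository for the real one.

```yaml
# Hypothetical KVC custom resource spec; apiVersion, kind, and
# field names are assumptions based on the description, not the exact schema.
apiVersion: kvc.kubeflow.org/v1
kind: VolumeManager
metadata:
  name: kvc-example
spec:
  volumeConfigs:
    - id: vol1                 # ID used to match this config to its status
      replicas: 2              # number of nodes on which the data is placed
      sourceType: "S3"         # data source type
      labels:
        dataset: mnist         # labels to annotate and later search for the data
      options:                 # options specific to the data source type
        sourceURL: "s3://my-bucket/my-dataset"   # assumed option name
```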

Figure 1. Example KVC CR spec.

When created, the CR goes into a Pending state. The custom controller drives the execution of this CR to the desired state (i.e., the Running state) by creating the appropriate sub-resources and managing the data transfer when required. Depending on the data source type, the custom controller either creates persistent volumes (PVs) and persistent volume claims (PVCs), or creates a host path volume and exposes the path along with node affinity details to guide the scheduling of the pods for data gravity. An example status of a KVC CR can be seen in Figure 2.

The status of a CR gives the details of the current state of the resource. If everything executed successfully, the controller updates the status with a Running state and an array of volume statuses that has a 1-to-1 mapping with the array of VolumeConfigs. If there were any errors, the error is bubbled up in the CR status along with a corresponding verbose message guiding the user in debugging the CR.
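A successful status, as described above, might be sketched as follows; the field names, host path, and node names are placeholders rather than the exact schema.

```yaml
# Hypothetical status for a KVC CR with one S3 VolumeConfig;
# field names, paths, and node names are illustrative placeholders.
status:
  state: Running
  message: "successfully deployed all sub-resources"
  volumes:
    - id: vol1                      # matches the VolumeConfig ID in the spec
      message: "success"
      volumeSource:
        hostPath:
          path: /var/datasets/kvc-resource-vol1   # where the data was placed
      nodeAffinity:                 # constrains pods to nodes holding the data
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values: ["node-1", "node-2"]
```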

Figure 2. Example KVC CR status.


The node affinity information provided in the CR status can be used as-is in a pod spec, along with the host path, to access the data. For example, to use the CR status shown in Figure 2, add the node affinity details to the node affinity field and the host path details to the volumes field of a pod spec, respectively.
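Assuming a CR status like the one in Figure 2, a pod consuming the data might be sketched as follows; the image name, host path, and node names are placeholders, not values from the original post.

```yaml
# Sketch of a pod consuming a KVC-managed host path volume.
# The node affinity and hostPath blocks would be copied from the CR status;
# the values shown here are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-pod
spec:
  affinity:
    nodeAffinity:                   # copied as-is from the CR status
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["node-1", "node-2"]
  containers:
    - name: trainer
      image: my-training-image:latest   # placeholder image
      volumeMounts:
        - name: dataset
          mountPath: /var/data          # where the container reads the data
  volumes:
    - name: dataset
      hostPath:
        path: /var/datasets/kvc-resource-vol1   # path from the CR status
```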

Supported Data Source Types

S3 Data Source Type

If the data is in S3, a user can create a KVC CR with the S3 data source type and the additional metadata required to access the data. KVC then provisions the data on a number of nodes equal to the number of replicas and provides the node affinity details along with the host path on those nodes.

Figure 3. Example KVC Workflow for S3 Data Source Type

Figure 3 shows an example flow of how the KVC custom controller drives the execution for the S3 data source type. When a CR of the S3 data source type is created with the location and the expected number of replicas, the controller chooses a set of nodes, deploys a pod to each of those nodes, and downloads the data from the S3 location provided in the CR. If the download is successful, the CR is updated with the appropriate volume source and the node affinity required to guide tasks to land on nodes where the data is available. Otherwise, an appropriate error is propagated to the CR status.

NFS Data Source Type

If the data is located in an NFS share, the user can create a KVC CR with the SourceType set to NFS and provide the NFS server IP and an exported path. The PV and PVC pair provided in the status can then be used in a pod spec to mount the data.
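A CR for an NFS share might be sketched as follows; as before, the `apiVersion`, `kind`, and option names are assumptions mirroring the S3 sketch rather than the exact schema.

```yaml
# Hypothetical KVC CR for an NFS data source; field and option
# names are assumptions, and the server/path values are placeholders.
apiVersion: kvc.kubeflow.org/v1
kind: VolumeManager
metadata:
  name: kvc-nfs-example
spec:
  volumeConfigs:
    - id: nfsvol
      replicas: 1
      sourceType: "NFS"
      options:
        server: "10.0.0.2"          # NFS server IP
        path: "/exported/share"     # exported path on the server
```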

Figure 4. Example KVC workflow for NFS Source Type

Figure 4 illustrates an example flow for the NFS source type. When a CR is created with NFS as the source type, the KVC controller creates an NFS PV and a PVC using the server endpoint and exported path provided in the CR and exposes them via the status of the CR. If there is any error in the process, the appropriate error is updated in the CR status.

Adding Additional Data Source Types

Every data source type implements the DataHandler interface, so additional source types can be added by implementing the same interface. An example of how to implement the interface can be seen in this pull request.

Contributing to Kubernetes Volume Controller

Read the developer manual in the KVC repository for more information on contributing. Provide feedback and ideas, and report bugs, by opening and commenting on issues. You can get in touch with us in the Kubeflow community Slack channel or by emailing the kubeflow-discuss mailing list. We look forward to hearing from you!

Acknowledgments

We thank Elson Rodriguez, Jeremy Lewi, Nan Liu, Jose Aguirre, Scott Leishman and Jason Knight for providing feedback on this project.