Version: 1.0.0

End-to-End Katonic Walkthrough


To help you get familiar with the Katonic platform, we have curated a set of tutorials that take you through every functionality. The complete data science journey, from extracting data from a source, to feature engineering, to modelling, and finally to making the model available for production, can be done through the Katonic platform. These tutorials are designed around a customer churn business use case.

Let's understand what problem we need to solve in this customer churn business use case.

Problem Definition: Customer churn rate is the rate at which customers cease their service with an entity. It can be described as the ratio of subscribers who discontinue their subscriptions within a particular period. For an organization to run smoothly, it needs to retain its customers, so every business needs to learn the reasons behind customer churn through the insights and patterns hidden inside its customer data. Through customer churn prediction, we need to identify which customers are likely to discontinue their service.

  • Selecting the required Connector

  • Create Docker Images

  • Working with the Notebook

  • Capturing the instance of features and models

  • Managing experiments and pipelines

  • Models

  • Scheduling the Pipeline

  • Dashboard

To move forward, go through every tutorial one by one:

Selecting the required Connector

Connectors#

  • Connectors are used to extract the data from various sources.

  • We have more than 70 in-built connectors, integrated securely into the platform.

  • By using connectors, you can extract the data from the source and load it into the destination.

  • Once you have extracted the data from the source, you can create a connection and see exactly what is happening.

  • As you can see in the image below, we have a connection that extracts data from HTTP and loads it into S3.

Components of the Connectors:#

  • Connections

To create a new connection, click on “New connection”.

Name: Give a new name to the connection.

Source Type: Select a source type from the drop-down, which shows the available prebuilt connectors.

Click on “Set up source” to create the connection.

  • Sources

Click on “New source” to create a new source connection.

Provide the name and source type, then select “Set up source” to complete the setup.

  • Destinations

The “New destination” button is used to create a new destination.

Set up the destination by giving the name and destination type, then click “Set up destination”.

Create Docker Images#

How to create and build a Dockerfile

What is a Dockerfile#

  1. A Dockerfile is a simple text file with instructions to build an image; it automates Docker image creation.

  2. When you run docker build against a Dockerfile, the image gets created.

  3. The basic instructions that you use in a Dockerfile are described below.

How to create a Dockerfile#

Step 1: Create a file named Dockerfile from the terminal

  • By default, when you run the docker build command, Docker searches for a file named Dockerfile. However, this is not compulsory; you can give the file a different name and then tell Docker (via the -f flag) that this particular file is the Dockerfile.

  • Create a folder with this command in the terminal:

mkdir directory_name

  • Go inside that folder:

cd directory_name

  • Create a file called Dockerfile:

touch Dockerfile

  • Now, to edit that Dockerfile, run:

vim Dockerfile

  • You are now in the vim editor; press “i” on your keyboard to enter insert mode.

Step 2: Instructions inside the Dockerfile

  • The very first instruction that a Dockerfile starts with is FROM, where you specify a base image.

  • Ideally, we always use some base image; if you do not want to use any base image, you can also write FROM scratch. On Docker Hub there is an image called scratch, which is essentially an empty image used for building images from scratch.

  • The second instruction that a Dockerfile contains is USER, which sets the user name or UID to use when running the image and for any RUN, CMD and ENTRYPOINT instructions that follow it in the Dockerfile.

  • The third instruction is RUN. If you want to run something like apt-get update, or install something such as pip install pandas, use RUN; it is executed while the image is being built.

  • The fourth instruction is COPY, which lets you copy files from a specific location into the Docker image.

  • The last instruction is again USER. A minimal example Dockerfile is sketched below.
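
Putting these instructions together, a minimal Dockerfile could look like the following sketch. The base image, packages and file names here are illustrative assumptions, not values required by the platform.

# Base image to build on (illustrative choice)
FROM python:3.9-slim
# Switch to root so the build steps can install packages
USER root
# RUN is executed at build time
RUN apt-get update && pip install pandas
# COPY files from the build context into the image (file name is hypothetical)
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
# Switch back to a non-root user for running the image
USER 1000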

Now press “Esc” on your keyboard and then type “:wq!” to save and exit; you are now out of the Dockerfile.

If you want to see the content of the Dockerfile again, you can run

cat Dockerfile

Step 3: Build the Dockerfile to create an image

  • To build the Docker image, we run the command
docker build -t ImageName:tagName .
  • -t is for tagging your image

  • ImageName:tagName names your image and, if you want, gives it a tag, so it will be easy to find this image later on. The trailing “.” is the build context (the directory containing the Dockerfile).

  • To see all the images on your system, you can run the command

docker images

Step 4: Run the image to create a container

  • To run this image, you can run the command
docker run imageID

Basic Commands

For more commands, see the Docker Cheat Sheet: wsargent/docker-cheat-sheet (github.com)

How to Tag a Docker Image

  • First log in to Docker Hub; only then can you upload your image to Docker Hub. The command is
docker login
  • Enter your username and password to log in and push the Docker image

  • To upload an image to Docker Hub, it must be tagged with a repository name and an image name. The command for this is

docker tag ImageName:tagName RepositoryName/ImageName:version

Note: if no version is specified, it will take latest by default

  • Now if you run “docker images”, you can see that where previously there was only ImageName, there is now “RepositoryName/ImageName”

How to Push Docker Images to Docker Hub

The command to push a Docker image is docker push followed by the name of the image, with the version if any

docker push RepositoryName/ImageName:version

The image appears in the list on Docker Hub once it has been uploaded

How to pull a Docker image

The command to pull an image from Docker Hub is

docker pull ImageName

How to delete an image from your system

The command to delete an image from your system is

docker rmi ImageName

Working with the Notebook#

Welcome to the Working with the Notebook section. We will be covering the following topics in this documentation:

  1. Create Notebook

  2. Launcher

  3. Upload Notebook/ dataset/file from device

  4. Working with GitHub

  5. Understanding the Workflow using an example: Convert notebook to pipeline and Kale pipeline deployment

Create a Notebook#

Once you log in to the platform, in the left-hand corner you will find various options, one of which is Notebook.

Click on the Notebook option; your screen will look similar to the image below.

Click on the Create Notebook option on the screen. You will come across the interface below.

Name: Enter the name of the notebook you want to create. Note that the name should start with lowercase letters.

Environment: Select the environment of your choice. It can be JupyterLab, RStudio, etc.

Image: Select the type of image you want. This will be based on the project you are working on, be it TensorFlow, PyTorch, scikit-learn or Spark.

CPU: Under the CPU section, select the amount of CPU you want for your project.

Memory: Allocate the amount of memory of your choice from the drop-down list.

Click on the Create button and your notebook will be ready. You will see something like the image below.

Status: Shows the current status, which is Pending at first.

Age: Indicates how long ago this notebook was created.

Delete: Click on the symbol to delete your notebook if it is no longer needed.

Start/Stop: Use the Start option to start your notebook. Once you do this, the Pending symbol turns green and says Running. This will enable your Connect button.

Click on the Connect button and you will be redirected to the environment you selected. In this case it is JupyterLab.

Launcher#

On the launcher you can find various options like notebook, terminal, text file etc. You can choose them according to your need.

Select the notebook for your data science experiments.

Note: For security reasons, the platform is designed to log out after every 15 minutes; go back to the platform screen, re-login and reconnect to your JupyterLab. Make sure you save your work continuously.

When you get the message below, try refreshing the notebook and you will get the screen shown in the next image. Just close the window and reconnect from the platform.

Upload your notebook or dataset#

To upload your notebook, dataset or any other files from your system, select the highlighted up-arrow icon; it will open a window to choose files from your device.

Working with GitHub#

Click on the Terminal option. A terminal window will open in one of the tabs.

If your project resides on GitHub, you need to follow these commands to update your project.

First clone your project into the working directory.

git clone <your project directory>

Note: if the git command is not present, then install it using

sudo apt-get install git-all

Change your branch:

git checkout <branch-name>
git pull

Start updating your code. Then push your changes back to GitHub using the following process:

git status
git add filenames
git status
git commit -m "commit message"
git push

Understanding the Workflow using an example#

Customer Churn Prediction Use case

  • Set up the necessary prerequisites for this particular dataset.

  • Load the customer churn dataset.

  • Transform raw data into meaningful information through data preprocessing.

  • Split the preprocessed data into training and testing data.

  • Train different models using the training data.

  • Evaluate and pick the best model among the different models.

  • For prediction, just pass the features as an array and you get the results. (A minimal sketch of these steps is shown after this list.)
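
As a rough sketch, the steps above could look like the Python code below. This is not the exact notebook code; the file name, the "Churn" column name and the choice of scikit-learn's LogisticRegression are illustrative assumptions.

# Illustrative sketch of the churn workflow, not the exact notebook code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the customer churn dataset (file name is hypothetical)
df = pd.read_csv("customer_churn.csv").dropna()
# Preprocessing: separate the target, one-hot encode categorical features
y = df["Churn"]  # "Churn" column name is an assumption
X = pd.get_dummies(df.drop(columns=["Churn"]), drop_first=True)
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a model and evaluate it on the test data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
# For prediction, pass the features as an array (here, the first test row)
print(model.predict(X_test.iloc[[0]]))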

We are using the customer churn dataset, and the notebook is already available in our GitHub repository.

Upload Customer Churn notebook file

  • In the Terminal window, run this command to download the notebook and the data that you will use for the remainder of the lab.

git clone https://github.com/katonic-dev/Examples.git

  • This repository contains a series of curated examples with data and annotated Notebooks. Navigate to the folder in the sidebar and open the notebook customer_churn.ipynb inside Examples/customerchurn/.

Explore the ML code of the Customer Churn use case

  • Run the notebook step-by-step. Note that the code fails because a library is missing.


  • You can install the required libraries either through the Terminal or directly in the cell in the notebook.

  • Run the cell right above to install the missing libraries (an example is shown after this list):

  • Restart the notebook kernel by clicking on the Refresh icon.
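
For example, a missing library can be installed directly from a notebook cell as follows; the package name is only illustrative, so install whichever library the error message reports.

# Install a missing library from a notebook cell (package name is an example)
!pip install scikit-learn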

Convert your notebook to a Katonic Pipeline

  • Enable Kale by clicking on the Kale slider in the Kale Deployment Panel (left pane of the notebook).
  • The moment you enable Kale, you will get an edit symbol on every cell.

  • This will help you tag your cells. For example, in the cell where you have imports, select the import tag for it.

  • There are various options available for tagging, but when you select Pipeline Step as the option you have the flexibility to customize the tag names for the cells.

  • You also need to take care of the flow of your code, so that you maintain the proper dependencies of one cell on another.

  • Explore per-cell dependencies.

  • Multiple notebook cells can be part of a single pipeline step, as indicated by color bars on the left of the cells, and a pipeline step may depend on previous ones, as indicated by the “depends on” labels above the cells. For example, the image below shows multiple cells that are part of the same pipeline step: they have the same brown color and they depend on a previous pipeline step named "load_data". The colors indicate which previous step the current step depends on.

  • Normally, you should create a new Docker image that includes the newly installed libraries in order to be able to run this notebook as a Katonic pipeline. Docker image creation is explained in another part of the documentation.

  • Click Advanced Settings and add the Docker image.

  • Click the Volume access mode and select the mode.

ReadOnlyMany - Read-only by many nodes

ReadWriteOnce - Read-write by a single node

ReadWriteMany - Read-write by many nodes


  • Select or type a new experiment name, a pipeline name and its description, then click the Compile and Run button.

  • Monitor the progress of compiling the notebook.


  • Observe the progress of the running pipeline.


  • Click the link to go to the Katonic Pipelines UI and view the run. You will be redirected to the pipeline section of the Katonic platform.

Katonic Pipeline Dashboard

  • Go to the Experiments section; you will find the name of the experiment that you specified.

  • Expand the experiment drop-down and select the latest pipeline that was created.

  • Select it and check the progress of your pipeline execution through the pipeline graph.

  • Wait till your final pipeline step has executed successfully, that is, until all blocks are in green color.

Pipeline components execution

  • If any of your blocks fails to execute, that block will be in red color. You can always go to the logs section by clicking on the block and check for the exact issue. Shown below are the images of the successfully executed pipeline blocks and their respective information.

  • Visualization of Customer Churn Load Data Components

  • Visualization of Customer Churn Data Preprocessing Components

  • Visualization of Customer Churn Decision tree model Components

  • Visualization of Customer Churn Model Evaluation Components

  • Similarly, you can see the visualizations and logs for other containers as well.

Congratulations! You just ran an end-to-end Katonic Pipeline starting from your notebook!

Capturing the instance of Features and Models#

In this section we will explore:

  1. Time Travel or Data Versioning

  2. Feature Meta Data

  3. Model Meta Data

Having the metadata of features and models stored in the meta store at every run enables time travel on both features and models, to validate what happened in the past.

Time Travel or Data Versioning:

It allows changes to any table to be audited, reproduced, or even rolled back if needed in the event of unintentional changes made due to user error.

For data scientists, one of the most useful features is the ability to go back in time using data versioning. We maintain an ordered transaction log of every operation that is performed on any data or model, so if you want to revert to an earlier version of a dataset or model, undo an unintended operation, or just see what your data looked like at a certain point in time, you can.

It’s easy to use time travel to select data from an earlier version of a table. Users can view the history of a table and see what the data looked like at a given point by using the experiment ID and run ID.

Feature Meta Data:

Feature metadata holds all the information about the features, such as the time the features were created, the location at which the features are stored, the feature data itself, the time they were stored in the feature store, and the date of streaming data.

Model Meta Data:

Model metadata holds all the information about the model that is stored in the meta store, such as the model parameters, the model run start and end times, the model location URL, the features given to the model, and the performance metrics of the model.

Managing experiments and pipelines#

In this section you will explore pipelines and experiments: how to create new experiments or pipelines, and all the other operations on pipelines such as delete, terminate, restore and archive.

Open running pipeline#

Go to Experiments on the left panel. Click on the experiment where you have the pipeline that you want to open.

Once the experiment is open, it will show a pop-up with all the pipelines in that experiment.

Click on any one of them and open it in a new tab for a better view.

The pipeline will be opened as shown in the figure below. To view a simple graph, you can enable Simple Graph on the left side.

Graph: Shows the complete pipeline in a graph format.

Run Output: Contains the outputs of every container in one place.

Config: Contains the “Run Details” and “Run Parameters” configured for this pipeline.

Pipeline components#

Click on any of the containers and you can see all the details about that container in the right-side window.

Input/Output: Contains the input parameters, output parameters, and links to download logs and view the complete logs.

Visualization: Contains all the outputs, print statements and graphs that the container has produced. You can also create manual visualizations.

ML Metadata: For an ML operation container, this populates the ML metadata that is saved for serving the model.

Details: Contains details about the task such as name, status, start time, end time and duration of the task.

Volumes: Contains the volume mounts for the container.

Logs: Shows all the logs of the container from the time it starts till it ends. If any error occurs, the error message will also be provided.

Pod: Contains all the meta information about the pod of the container.

Events: Kubeflow events and the version of the event are shown in this section.

Terminating pipe#

Any running pipeline that is stuck at any stage and needs to be stopped can be terminated.

  • Open the running pipeline in a new tab; in the top right you can see the “Terminate” button highlighted.

  • Click on the Terminate button to terminate the pipeline.

Note: Once a pipeline is terminated, it cannot be restarted to rerun from where it stopped.

Restore pipe#

Any pipeline that you want to keep for future use but do not need now can be moved to the archive.

When it is required in the future, you can restore it from the archive.

Note: If a running pipeline is archived, it will still be running in the background. So, if the pipe is not going to be used, terminate the pipe and then archive it.

  • Select any pipe or experiment that has to be archived.

  • Click on the Archive button at the bottom right. The pipe will then be moved from the Active section to the Archived section.

  • If complete experiments have to be archived, then click on the Archive button at the top right.

  • Go to Archived section.

  • Select the pipe or experiment that has to be restored.

  • Click on the Restore button at the top right to restore the selected pipe or experiment.

Deleting pipe#

A pipeline which is archived and no longer needed can be deleted from the Archived section.

  • Go to pipelines > Runs > Archived section

  • Select a pipeline which has to be deleted.

  • Click on Delete at the top right to delete the pipeline.

Create a new experiment#

This helps in creating a new experiment from an existing pipeline.


Note: The experiment name cannot contain uppercase letters or spaces. An experiment name which already exists can be used to create a new experiment.

  • Give the new experiment a name and description and click on “Next”.

Pipeline: Choose the existing pipeline from the drop-down.

Run Name: Give a new run name, which you will see as the name in Pipelines.

Run Type: Select One-off if you want to run the pipe only once; if you want to run the pipe multiple times, select Recurring and set all the parameters.

After filling in all the run details, click on the Start button to run the pipe.

The pipeline in the new experiment starts running at the appropriate time.

Models#

Model Registry:#

Let's go back to our customer churn use case notebook. Here we can see that when we ran our experiment with multiple models, each model got logged in the registry, which is a store that holds model information. You can see your models, their parameters, metrics, etc.

Once we are ready with our best-performing model, we need to register it for production.

Productionizing a model involves two steps:

  1. Register the best model of your choice.

Here we can see that when we registered the model, it returned some information about the model, such as the status, model version, etc. We can also see the current stage of the model; right now it is set to None.

  2. Once the model is registered, we can change its stage to Production.

Voila! You have successfully transitioned your model to Production. A rough sketch of these two steps is shown below.
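
The sketch below shows what these two steps can look like in code, assuming the registry is reached through an MLflow-compatible client; the model name and the run ID placeholder are illustrative, not values defined by this walkthrough.

# Sketch only: assumes an MLflow-compatible tracking client is configured
import mlflow
from mlflow.tracking import MlflowClient
# Step 1: register the model logged in a run ("<run_id>" is a placeholder)
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="telecom_customer_churn_logistic_regression",
)
print(result.status, result.version)  # the stage starts out as None
# Step 2: transition the registered version to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="telecom_customer_churn_logistic_regression",
    version=result.version,
    stage="Production",
)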

Now you can view your model in the Katonic platform registry with all its versions, including the version which is in Production.

First go to the Models tab in the sidebar and from there select Registries.

In the Model Registry, scroll down to find your model. Here our model is ‘telecom_customer_churn_logistic_regression’.

When you click on your model, you will get a new window containing all your model versions and the version which is in Production.

Here you can see that model version 3 is in Production.

Model Deployment:#

When it comes to deploying your model, the Katonic platform makes the task child’s play.

From the model registry you can see which version is available for production, so when you hit the Deploy button, the platform takes you to a new window.

Here you can see the status and the model endpoint URL of the deployment. This endpoint can be integrated with your application to get the model’s predictions; a sketch of such a call is shown below.
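
For example, an application could call the endpoint with an HTTP POST request. The URL and the JSON payload shape below are illustrative assumptions; use the endpoint URL shown in the deployment window and the payload format your model expects.

# Illustrative client call; the endpoint URL and payload format are assumptions
import requests
endpoint_url = "https://<your-katonic-domain>/models/churn/predict"  # placeholder URL
payload = {"data": [[12, 70.35, 845.5, 1, 0]]}  # example feature array
response = requests.post(endpoint_url, json=payload, timeout=30)
print(response.status_code, response.json())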

Scheduling the pipeline#

Schedule a pipe: There are four major steps to schedule a pipe.

  1. Create Run

  2. Create recurring run

  3. Check scheduled process information

  4. Compare run

Create Run: The Create Run option is used to schedule single or multiple pipelines/processes at a specific time, or multiple times, according to your requirements.

There are 4 steps involved in the create run process:

  1. Select Pipeline

  2. Select Run

  3. Filling Details

  4. Create Run

Follow the steps below to create a run:#

  1. Select the Pipelines option (1), then select the Runs option (2), which will navigate you to the screen below.

  2. To schedule a pipe (process), select the Create run option, which will navigate you to the screen below.

  3. A pipeline/process can be scheduled by filling in the run details:

Pipeline: Choose the pipeline which you want to run.

Note: if you don't have a pipeline already uploaded to choose from, click on “Upload pipeline” to upload a new pipeline.

Pipeline version: This is automatically populated when you select the pipeline.

Run Name: Give a new run name to the pipeline.

Description: Describe the pipeline.

Experiment: Choose the experiment in which you want to run the pipeline.

  4. After filling in all the details, you can click on the Start tab to schedule your job.

Create Recurring Run: The Create Recurring Run option is used to execute/run the current process multiple times.

  1. To execute a process multiple times, select the Recurring option and specify the details below.

Pipeline: Choose the pipeline which you want to run.

Note: if you don't have a pipeline already uploaded to choose from, click on “Upload pipeline” to upload a new pipeline.

Pipeline version: This is automatically populated when you select the pipeline.

Run Name: Give a new run name to the pipeline.

Description: Describe the pipeline.

Experiment: Choose the experiment in which you want to run the pipeline.

Recurring: Select this option to run the pipe multiple times.

Trigger Type: Choose a periodic or cron trigger for the run.

Maximum Concurrent Runs: The maximum number of runs that can execute in parallel.

Check scheduled process information: This option is used to check the details of a scheduled process.

  1. To get the details of a scheduled job, click on the Experiments tab.

  2. Select the Manage option on the Recurring run config tab to check the details of the scheduled job, as below.

Compare Run:#

The Compare Run option is used to compare the results of two or more processes.


  1. To compare the results of two or more active processes, click on the Experiments tab and select the processes to compare.

  2. Select the Compare processes option to compare the selected processes, as below.

Dashboard#

When you log in to the platform, you are presented with the dashboard on screen. The dashboard is one of the main components of the Katonic platform and gives a summarized view of the recently opened notebooks, pipelines and model servers present. It is divided into 4 sections, namely:

a. Model Servers

b. Recent Notebooks

c. Recent Pipelines

d. Recent Pipeline Runs

  1. Model Servers – This section lists all the available model servers in order from latest to oldest. ‘Status’ shows the current availability of the server. ‘Age’ shows how long ago the server was created. ‘Model Endpoint’ contains the location at which the server is hosted. At the extreme right, clicking on the gauge meter icon gives you access to performance monitoring statistics of the respective server, provided by Grafana.

  2. Recent Notebooks – This section lists the recently opened notebooks with the latest listing shown at the top. It also shows the date each was last modified. You can select the required one and open the notebook from here.

  3. Recent Pipelines – This section lists all the pipelines that were created recently along with their date of creation. You can access them by clicking on the respective listing.

  4. Recent Pipeline Runs – This section lists the pipelines that were run recently along with the date they finished running. You can see the status of the run as a green tick for a successful run or a red cross for a failed execution. To check the result of these runs, click on the respective pipeline.