16 - 04 - 2021

Our package is live! Meet tortoise, your starting point for building Machine Learning models in Python

Machine learning in Python made easy in one simple package. Check the source code here.

Starting to build machine learning models in Python can be overwhelming because of the endless possibilities this open source tool offers. Therefore, we built a Python package that guides (junior) data analysts and scientists through all the steps involved in building machine learning models, using easy-to-use functions.

What to expect from this package

The package consists of three classes (bundles of data and functions), which are based on the CRISP-DM (cross-industry standard process for data mining) model. These are:

  1. Loading & Understanding Data
  2. Preprocessing Data
  3. Training Model

For each class, we have written functions that we think are crucial to properly execute that step of data modelling. For now, the package supports building classification models only. However, we are open source, so feel free to help us add other types of algorithms! A fourth class, containing functionality for model evaluation, is also still on our wish list, so feel free to help us there as well.
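To make the three steps listed above more concrete, here is a minimal sketch of how such a class-per-step layout could look. This is not tortoise's actual API: the class names DataLoader, Preprocessor and ModelTrainer, the toy dataset and the nearest-centroid classifier are all assumptions made purely for illustration, and the sketch uses only the Python standard library.

```python
# Illustrative sketch only: hypothetical class names, NOT the tortoise API.
import csv
import io
import random
from statistics import mean

# Tiny made-up dataset: two numeric features and a binary label.
RAW_CSV = """x1,x2,label
1.0,1.1,a
1.2,0.9,a
0.8,1.0,a
1.1,1.3,a
0.9,0.7,a
1.3,1.0,a
5.0,5.2,b
4.8,5.1,b
5.3,4.9,b
5.1,4.7,b
4.9,5.0,b
5.2,5.3,b
"""


class DataLoader:
    """Step 1: load the data and give a first summary."""

    def __init__(self, csv_text):
        rows = list(csv.DictReader(io.StringIO(csv_text)))
        self.features = [[float(r["x1"]), float(r["x2"])] for r in rows]
        self.labels = [r["label"] for r in rows]

    def summary(self):
        return {"rows": len(self.labels), "classes": sorted(set(self.labels))}


class Preprocessor:
    """Step 2: split into train/test sets and rescale the features."""

    def __init__(self, features, labels, test_fraction=0.25, seed=0):
        idx = list(range(len(labels)))
        random.Random(seed).shuffle(idx)
        n_test = int(len(idx) * test_fraction)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        train_rows = [features[i] for i in train_idx]
        # Scaling bounds come from the training rows only, to avoid leakage.
        self.lo = [min(col) for col in zip(*train_rows)]
        self.hi = [max(col) for col in zip(*train_rows)]
        self.X_train = [self.scale(r) for r in train_rows]
        self.y_train = [labels[i] for i in train_idx]
        self.X_test = [self.scale(features[i]) for i in test_idx]
        self.y_test = [labels[i] for i in test_idx]

    def scale(self, row):
        # Min-max scaling per feature, based on the training bounds.
        return [(v - lo) / (hi - lo) for v, lo, hi in zip(row, self.lo, self.hi)]


class ModelTrainer:
    """Step 3: fit a nearest-centroid classifier (a deliberately simple stand-in)."""

    def fit(self, X, y):
        # One centroid per class: the mean of each feature over that class.
        self.centroids = {
            label: [mean(col) for col in zip(*(x for x, lab in zip(X, y) if lab == label))]
            for label in set(y)
        }
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab])) for x in X]

    def accuracy(self, X, y):
        return sum(p == t for p, t in zip(self.predict(X), y)) / len(y)


loader = DataLoader(RAW_CSV)
prep = Preprocessor(loader.features, loader.labels)
model = ModelTrainer().fit(prep.X_train, prep.y_train)
print(loader.summary())
print("hold-out accuracy:", model.accuracy(prep.X_test, prep.y_test))
```

Each class keeps the data it produces as attributes, so the next step can pick it up without the analyst having to pass loose variables around. A real project would typically swap the hand-rolled pieces for pandas and scikit-learn.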

Now that the application of algorithms is becoming more common within organizations, the role of Data Scientists is changing as well. More focus is often placed on the added value of the model and not necessarily on understanding the nitty-gritty details of why the model is so good at predicting. Considering the huge amounts of data that an algorithm is able to process, it might not even be possible for humans to understand the details anymore. The trick is to focus on accuracy without losing explainability or the ability to put the model into production.

So whereas Data Scientists used to be experts in statistics first and foremost, the added value of a model is now becoming more important. And that is exactly why we want to help analysts with this package: to be able to add value to customers and their organization with as little effort as possible.

Getting Started

To get started, we recommend heading to our GitLab Repository. For information on installing the package, start with the README. For information on working with the package, we created a Tutorial that should get you acquainted with the package functions within 10 minutes.

Why we made this package

Last year, our former colleague Jeroen wrote an article comparing different statistical analysis tools. He concluded that open source tooling, as opposed to commercial tooling, is the way to go. The strength of these tools is that the communities behind them enable continuous development, making them more powerful, innovative and useful than existing commercial tools. Also, many commercial parties now support open-source languages in their tools. By now, we observe that the use of Python in particular is becoming common practice within organizations. Looking to the near future, we dare to state that every data scientist needs to know at least R or Python.

The growing community also brings a growing number of packages, which can be overwhelming for a starting Data Scientist: where to start? Your primary goal as a Data Scientist is to add value to your customers and organization, so a tutorial on how to apply a particular algorithm to the Titanic dataset won’t do. We therefore decided to create a package that can be used as a starting point for building a model. Its focus is not only on the algorithms but on the whole process, which starts by determining where to add value!

Moreover, data science and analytics departments are becoming more important within organizations. Given that Data Analysts or Scientists from different teams might work on similar challenges using the same data, this raises a need for consistency and efficiency across different models. Therefore, it is convenient to have all the logic behind a model integrated within one package. By doing so, all data analysts can rely on the same rules, which means that models become less dependent on the person who initially built them. So, if you get confused by the endless number of available packages or don’t feel like building a Python package for your organization yourself, we would strongly recommend using ours!

And obviously, as big fans of Python, we would be hypocrites to praise the strength of the communities behind it while only free riding on the efforts of others.

Curious to dip your toes in the water? Apart from creating this package, we teach Python (and R) courses for Data Analysts and Scientists who want to learn to work with data and build models in Python or R.

This article is written by:
Wouter van Gils
Siri de Ruiter