Posts


A “model in one afternoon” in practice

Sound familiar? You'd simply like to know whether the data you have is suitable for predicting certain customer behaviour, but you don't have the tools at your disposal and you're not quite sure where to start…

One of our clients noticed that revenue potential was being left on the table because the potential customer value of prospects was estimated incorrectly. Potential customer value was determined by looking at a few generic prospect characteristics and making an estimate on that basis. The account managers noticed that this estimate wasn't good (enough), because they were (partly) approaching the wrong prospects. In addition, evaluations of campaigns aimed at low-value prospects showed that quite a few high-value customers were hidden among them. High time for improvement, then. A business case with multiple scenarios was drawn up for this problem; its potential ranged between five hundred thousand and 2.5 million euros.

Besides helping the client set up this business case, we also took a number of qualitative aspects into account: we walked one of their employees through several data analytics methods and techniques and taught him the fundamentals of R. With that, we immediately provided input for a larger business case for phasing out SPSS in favour of R. After getting to work on this together with the client, we were able to deliver a first model within five days; its added value is currently being tested in a pilot project.

Now I can hear you thinking: “you offer a model in one afternoon, so why did this take five days?”. A fair question, of course. The extra time we put in was mainly spent walking the employee through the basics of data analytics and R. That way he got a first taste of the field and could (better) decide whether he would enjoy developing himself further in that direction. In our view not just an interesting and fun project, but above all a very useful one, and well worth repeating!

Interested, or have you been wrestling with a similar question for a while? Don't hesitate to get in touch with us!

 


R Experience @ University of Groningen

This Monday was the last day of the R Experience course at the University of Groningen. As of this year, Marketing (Intelligence) students are required to use R for their assignments, whereas they had only worked with SPSS before. To kickstart their learning curve in R, we taught the R Experience in cooperation with the MARUG (the Marketing Association of the University of Groningen).

The course is intended for analysts who already know the statistical and modelling theories and principles. The goal of the course is to learn how to apply these in R, instead of using tools like SPSS or SAS. In five sessions, we showed how to work with RStudio and how to write code.

“They should make this course mandatory for the Marketing Intelligence programme”

The first module started with an entry-level introduction to writing code. In the following module we touched on how RStudio works, how to import data and how to use functions. In the next modules, data manipulation and preparation (such as missing values, outliers, correlations and near-zero variance) were discussed. In the last sessions we showed how to use the caret package to train lots of different kinds of models with only minor changes in code.
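To give an impression of what that looks like, here is a minimal caret sketch; the data set and algorithms are illustrative and not the actual course material, and switching models is mostly a matter of changing the method argument.

# minimal caret sketch (illustrative only): same set-up, different models
library(caret)
data(iris)

# split the data into a training and a test set
set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing <- iris[-in_train, ]

# reuse the same 10-fold cross-validation set-up for every model
ctrl <- trainControl(method = "cv", number = 10)

# train two different models; only the 'method' argument changes
# (method = "rf" assumes the randomForest package is installed)
rf_fit <- train(Species ~ ., data = training, method = "rf", trControl = ctrl)
knn_fit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl)

# compare both models on the hold-out set
confusionMatrix(predict(rf_fit, testing), testing$Species)
confusionMatrix(predict(knn_fit, testing), testing$Species)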

“Very clear explanation and definitely added value for every Marketing Intelligence student”

We were amazed by the enthusiasm of the group: students were on Facebook, but instead of scrolling through their timelines they used it to send code to each other via Messenger. The reactions in the feedback forms were very positive as well; with an average grade of 8.5/10 and not a single student who wouldn't recommend the course to fellow students, we consider the course a great success. Next year, we'll definitely do this again!


Python & R vs. SPSS & SAS

When we're working for clients, we mostly come across the statistical programming languages SAS, SPSS, R and Python. Of these, SAS and SPSS are probably the most widely used. However, interest in the open-source languages R and Python is increasing. In recent years, some of our clients have migrated from SAS or SPSS to R and/or Python. And even if they haven't (yet), most commercial software packages (including SAS and SPSS) nowadays make it possible to connect to R and Python.

SAS was developed at North Carolina State University, primarily to analyse large quantities of agricultural data. The abbreviation SAS stands for Statistical Analysis System. In 1976 the company SAS was founded as the demand for such software increased. The Statistical Package for the Social Sciences (SPSS) was developed for the social sciences and was the first statistical programming language for the PC. It was developed in 1968 at Stanford University; eight years later the company SPSS Inc. was founded, which was bought by IBM in 2009.

In 2000 the University of Auckland released the first version of R, a programming language primarily focused on statistical modelling, open-sourced under the GNU licence. Python is the only one of the four that was not developed at a university. Python was created by Guido van Rossum, a Dutchman and a big fan of Monty Python (which is where the name comes from). He needed a project over Christmas and created this language, basing it on ABC, a language he had also worked on, designed to teach non-programmers how to program. Python is a multi-purpose language, like C++ and Java, with the big difference and advantage that Python is far easier to learn. Programmers have since built lots of modules on top of Python, so nowadays it offers a wide range of statistical modelling capabilities. That's why Python definitely belongs in this list.

In this article, we compare the four languages on statistical methods and techniques, ease of learning, visualisation, support and costs. We explicitly focus on the languages themselves; the user interfaces SAS Enterprise Miner and SPSS Modeler are out of scope.


Statistical methods and techniques

My vision on Data Analysis is that there is a continuum between explanatory models on one side and predictive models on the other. The decisions you make during the modelling process depend on your goal. Let's take customer churn as an example: you can ask yourself why customers are leaving, or you can ask yourself which customers are leaving. The first question has explaining churn as its primary goal, while the second has predicting churn as its primary goal. These are two fundamentally different questions, and this has implications for the decisions you make along the way. The predictive side of Data Analysis is closely related to terms like Data Mining and Machine Learning.

Both SPSS and SAS originate from the explanatory side of Data Analysis. They were developed in an academic environment, where hypothesis testing plays a major role. As a result, they offer significantly fewer methods and techniques than R and Python. Nowadays, SAS and SPSS both have data mining tools (SAS Enterprise Miner and SPSS Modeler), but these are separate tools for which you'll need extra licenses.

One of the major advantages of open-source tooling is that the community continuously improves and extends its functionality. R was created by academics who wanted their algorithms to spread as easily as possible. Ergo, R has the widest range of algorithms, which makes it strong on both the explanatory and the predictive side of Data Analysis.

Python was developed with a strong focus on (business) applications, not from an academic or statistical standpoint. This makes Python very powerful when algorithms are used directly in applications. Hence, we see that its statistical capabilities focus primarily on the predictive side. Python is mostly used in Data Mining or Machine Learning applications where a data analyst doesn't need to intervene. Python is also strong in analysing images and videos; for example, we used Python this summer to build our own autonomous driving RC car. Python is also the easiest language to use with Big Data frameworks like Spark.

Ease of learning

Both SPSS and SAS have a comprehensive user interface, with the consequence that a user doesn't necessarily need to code. Furthermore, SPSS has a paste function which generates syntax from the steps executed in the user interface, and SAS has Proc SQL, which makes SAS coding a lot easier for people who know the SQL query language. SAS and SPSS code are syntactically far from similar to each other and also very different from other relevant programming languages, so if you need to learn one of these from scratch, good luck with it!

Although there are GUI alternatives for R, like Rattle, these don't come close to SAS or SPSS in terms of functionality. R is easy to learn for programmers, but a lot of analysts don't have a background in programming. Of the four, R has the steepest learning curve and is the most difficult one to start with, although it quickly gets easier once you master the basics. For this specific reason we've created an R course, called Experience R, which kickstarts (aspiring) data analysts/scientists in learning R. Python is based on ABC, a language developed with the sole purpose of teaching non-programmers how to program, and readability is one of Python's key features. This makes Python the easiest of the four languages to learn. As Python is so broad, there are no GUIs for Python.

To conclude: in terms of ease of learning, SPSS and SAS are the best options for starting analysts, as they provide tools in which the user doesn't need to program.

Support

Both SAS and SPSS are commercial products and therefore come with official support. This motivates some companies to choose these languages: if something goes wrong, they've got support.

There is a misconception around the support for open-source tooling. It's true that there is no official support from the creators or owners; nonetheless, there is a large community for both languages that is more than willing to help you solve your problem. And 99 out of 100 times (if not more often), your question has already been asked and answered on sites like Stack Overflow. On top of that, there are numerous companies that do provide professional support for R and Python. So, although there's no official support for R and Python, in practice we see that if you've got a question, you'll likely have your answer sooner if it's about R or Python than if it's SAS or SPSS related.

Visualisation

The graphical capabilities of SAS and SPSS are purely functional; although it is possible to make minor changes to graphs, fully customizing your plots and visualizations in SAS and SPSS can be very cumbersome or even impossible. R and Python offer many more opportunities to customize and optimize your graphs, thanks to the wide range of modules that are available. The most widely used package for R is ggplot2, which offers a wide set of graph types in which you can adjust practically everything. These graphs are also easily made interactive, which allows users to play with the data through applications like Shiny.
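To give an impression of how much you can adjust, here is a minimal ggplot2 sketch on the built-in mtcars data set (purely illustrative): colours, labels, legend position and theme elements are all changed in a few lines of code.

# minimal ggplot2 sketch on the built-in mtcars data (purely illustrative)
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_brewer(palette = "Set1", name = "Cylinders") +
  labs(title = "Fuel efficiency by weight",
       x = "Weight (x 1000 lbs)", y = "Miles per gallon") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")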

Python and R learned (and still learn) a lot from each other. One of the best examples of this is that Python also has a ggplot module, which has practically the same functionality and syntax as it does in R. Another widely used module for visualisation in Python is Matplotlib.

Costs

R and Python are open source, which makes them freely available to everybody. The downside is that, as discussed before, these languages are harder to learn than simply starting with the SAS or SPSS GUI. As a result, analysts with R and/or Python in their skill set command higher salaries than analysts who don't. Educating employees who are not yet familiar with R and/or Python costs money as well. So in practice the open-source programming languages aren't completely free of costs either, but when you compare them with the license fees for SAS or SPSS, the business case is very easily made: R and Python are way cheaper!

My choice

“Software is like sex, it’s better when it’s free” – Linus Torvalds (creator of Linux)

My go-to tools are R and Python: I can use these languages everywhere without having to buy licenses, and I don't have to wait for licenses to be arranged either; time is key in my job as a consultant. Aside from licenses, probably the main reason is the wide range of statistical methods: I can use any algorithm out there and choose the one that best suits the challenge at hand.

Which of the two languages I use depends on the goal, as mentioned above. Python is a multi-purpose language and was developed with a strong focus on applications. Python is therefore strong in Machine Learning applications; hence I use Python, for example, for face or object recognition or Deep Learning applications. I use R for goals that have to do with customer behaviour, where the explanatory side also plays a major role: if I know which customers are about to churn, I would also like to know why.

These two languages are to a large extent complementary. There are libraries for R that allow you to run Python code (reticulate, rPython), and there are Python modules that allow you to run R code (rpy2). This makes the combination of the two languages even stronger.
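As a small sketch of what that interplay can look like from the R side (assuming Python and the reticulate package are installed; the numpy call is only an example):

# minimal reticulate sketch: calling Python from R (illustrative only)
library(reticulate)

# import a Python module and use it like an R object
np <- import("numpy")
np$mean(c(1, 2, 3, 4))   # 2.5, computed by numpy

# run a snippet of Python code and read the result back into R
py_run_string("squares = [i ** 2 for i in range(5)]")
py$squares               # 0 1 4 9 16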


Jeroen Kromme, Senior Consultant Data Scientist

 


Happy pi day!

Just something fun because it's pi day. Enjoy!

# clear your environment
rm(list = ls())
# load the necessary libraries
library(png)
library(plotrix)
# The Analytics Lab colours
oranje <- rgb(228/255, 86/255, 65/255)
donkergrijs <- rgb(75/255, 75/255, 74/255)
lichtblauw <- rgb(123/255, 176/255, 231/255)
# read the image of pi
img <- readPNG("C:/Users/j.schoonemann/Desktop/pi.png")
# read the logo of The Analytics Lab
logo <- readPNG("C:/Users/j.schoonemann/Desktop/Lab.png")
# define the x-position of the pie charts
x_position <- c(2, 4, 8, 14, 22)
# define the y-position of the pie charts
y_position <- c(4, 6, 8, 10, 12)
# define the size of the pie charts
pie_size <- c(0.5,1.0,1.5,2.0,2.5)
# define the slice proportions (mouth and body) of the PacMan pie-charts
pacman <- list(c(20,80), c(20,80), c(20,80), c(20,80), c(20,80))
# calculate the chart limits for the x-axis
x_axis <- c(min(x_position - pie_size), max(x_position + pie_size))
# calculate the chart limits for the y-axis
y_axis <- c(min(y_position - pie_size),max(y_position + pie_size))
# define the colors of the PacMan pie-charts 
sector_col<- c("black", "yellow")
# define the startposition of the first slice of the pie in the charts
start_position <- c(-0.1, -0.2, -0.3, -0.4, -0.5)
# create the canvas for the plot
plot(0, xlim = x_axis, ylim = y_axis, type = "n", axes = F, xlab = "", ylab = "")
# add a title and subtitle to the plot, adjust size and color
title(main = "Eating Pi makes PacMan grow!\nHappy pi(e) day!", col.main = lichtblauw, cex.main = 2, 
 sub = "Powered by: The Analytics Lab", col.sub = oranje, cex.sub = 1)
# plot all the PacMan pie-charts
for(bubble in 1:length(x_position)){ 
 floating.pie(xpos = x_position[bubble], ypos = y_position[bubble], x = pacman[[bubble]], radius = pie_size[bubble], col = sector_col, startpos = start_position[bubble]) 
}
# add the logo of The Analytics Lab to the plot
rasterImage(image = logo, xleft = 0, ybottom = 12, xright = 5, ytop = 16)
# add pi multiple times to the plot
# pi between 1st and 2nd
rasterImage(image = img, xleft = 2.5, ybottom = 4.5, xright = 3.5, ytop = 5)
# pi between 2nd and 3d
rasterImage(image = img, xleft = 5, ybottom = 6.5, xright = 6, ytop = 7)
rasterImage(image = img, xleft = 5.8, ybottom = 7, xright = 6.8, ytop = 7.5)
# pi between 3d and 4th
rasterImage(image = img, xleft = 10, ybottom = 8.5, xright = 11, ytop = 9)
rasterImage(image = img, xleft = 11, ybottom = 9, xright = 12, ytop = 9.5)
# pi between 4th and 5th
rasterImage(image = img, xleft = 16.2, ybottom = 10, xright = 17.2, ytop = 10.5)
rasterImage(image = img, xleft = 17, ybottom = 10.5, xright = 18, ytop = 11)
rasterImage(image = img, xleft = 18, ybottom = 11, xright = 19, ytop = 11.5)


 


Learn R with the Experience R. Sign up now!

As a (data) analyst you're always thinking about how to help your organisation with better insights and models. Your innovativeness is limited by the possibilities of the tool you work with. We'd like to help you and introduce you to R, a flexible open-source environment where the community makes sure state-of-the-art techniques are available before you've even thought of them.

Would you like to develop yourself further and work innovatively in the field of Data Science? Then sign up now for our Experience R!

The Experience R consists of six modules of four hours each. The first module gives an introduction to R and RStudio; after that you learn how to import data from different sources, how to manipulate and transform data, how to prepare it for model building, and finally the modelling itself. The theme of the last day is 'Putting it all together', where you receive the 'R Template': a standard R script designed to build a classification model with minimal adjustments.

Practical information:

  • When: weekly, from Wednesday 15 March up to and including Wednesday 19 April 2017
  • Time: from 13:00 to 17:00
  • Where: The Analytics Lab, Atoomweg 50, 3542 AB Utrecht
  • Register before: Wednesday 1 March 2017
  • Costs: 1,950 euros per person (for six modules)
  • Need fewer modules? Contact us

 


 


Fireworks (in R)

New Year – a new chapter, new verse, or just the same old story? Ultimately we write it. The choice is ours. ― Alex Morritt

The Analytics Lab and Cmotions wish everybody a happy year. A year full of challenges, new experiences and new knowledge.




Christmas Tree with ggplot


rm(list = ls())
library(ggplot2)

# create data
x <- c(8,7,6,7,6,5,6,5,4,5,4,3,4,3,2,3,2,1,0.5,0.1)

dat1 <- data.frame(x1 = 1:length(x), x2 = x)
dat2 <- data.frame(x1 = 1:length(x), x2 = -x)
dat1$xvar <- dat2$xvar <- NA
dat1$yvar <- dat2$yvar <- NA
dat1$siz <- dat2$siz <- NA
dat1$col <- dat2$col <- NA

# set threshold for christmas balls
dec_threshold = -0.5

# create random places, sizes and colors for christmas balls
set.seed(2512)
for (row in 1:nrow(dat1)){

if (rnorm(1) > dec_threshold){

dat1$xvar[row] <- row
dat1$yvar[row] <- sample(1:dat1$x2[row]-1,1)
dat1$siz[row] <- runif(1,0.5,1.5)
dat1$col[row] <- sample(1:5, 1)
}

if (rnorm(1) > dec_threshold){

dat2$xvar[row] <- row
dat2$yvar[row] <- sample(1:dat2$x2[row],1)
dat2$siz[row] <- runif(1,0.5,1.5)
dat2$col[row] <- sample(1:5, 1)
}
}

# plot the christmas tree
ggplot() +
geom_bar(data = dat1, aes(x=x1, y=x2),stat = "identity", fill = '#31a354') +
geom_bar(data = dat2, aes(x=x1, y=x2),stat = "identity", fill = '#31a354') +
geom_point(data = dat1,aes(x = xvar, y = yvar, size = siz, colour = as.factor(col)) ) +
geom_point(data = dat2,aes(x = xvar, y = yvar, size = siz, colour = as.factor(col)) ) +
coord_flip() + theme_minimal()+ theme(legend.position="none",
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
ggtitle('The Analytics Lab wishes you a Merry Christmas')




How to remove all user installed packages in R

A little while ago I ran into an issue with R and RStudio. In order to solve it, I was forced to remove all user-installed packages. For a lot of practical reasons it was not an option for me to simply uninstall R and start with a clean slate and a fresh installation.

Therefore I had to find another solution to remove all user-installed packages. I found a few useful tips and tricks, but nothing that immediately did the job for me. That's why I've decided to share my solution; I hope it can help you in your hour of need.

Here is the script I've created to remove all user-installed packages, without removing any base packages for R or MRO.

# create a list of all installed packages
ip <- as.data.frame(installed.packages())
head(ip)
# if you use MRO, make sure that no packages in this library will be removed
ip <- subset(ip, !grepl("MRO", ip$LibPath))
# we don't want to remove base or recommended packages either
ip <- ip[!(ip[,"Priority"] %in% c("base", "recommended")),]
# determine the library where the packages are installed
path.lib <- unique(ip$LibPath)
# create a vector with all the names of the packages you want to remove
pkgs.to.remove <- ip[,1]
head(pkgs.to.remove)
# remove the packages
sapply(pkgs.to.remove, remove.packages, lib = path.lib)

How to combine Google Maps with a Choropleth shapefile of Holland in R: Amsterdam neighbourhoods (postal codes) by number of customers

Sometimes the best way to visualize your information is by plotting it on a map. But what if the information you want to show isn't a point on a map but a shape, like a neighbourhood? Well, that's when you use a choropleth, which you can create using a shapefile (look for ESRI shapefiles on Google). Usually the map is still recognizable when you simply use the shapefile. For example, the shapefile of Holland looks like this:
[Image: shapefile of Holland]

But when you zoom in on this map and, like I did, only want to visualize the postal code areas of Amsterdam, the map becomes unrecognizable and you have absolutely no clue what you're looking at:
[Image: shapefile of the Amsterdam postal code areas]

For this reason I wanted to combine this choropleth with a Google Maps plot of the same area. That way you can still recognize the city of Amsterdam and its neighbourhoods.
What I wanted to end up with was a heatmap showing the “intensity” for each neighbourhood of Amsterdam. In this case, the “intensity” shown was the conversion percentage of a campaign that was run in Amsterdam.

For this purpose we’ll need the following packages: ggmap, RgoogleMaps, maptools, rgdal, ggplot2, rgeos. So we’ll start by loading these packages.

 # install and load packages 
library(ggmap) 
library(RgoogleMaps) 
library(maptools) 
library(rgdal) 
library(ggplot2) 
library(rgeos)

Now that we have loaded all the necessary packages, we'll move on to gathering the Google Maps view we need to plot the choropleth on. There are two ways to use the geocode function (ggmap package) to get the coordinates we want to use as the center of the map: either you use the name of the location or you enter the coordinates yourself:

# get the coordinates of the center of the map you want to show
 CenterOfMap <- geocode("Amsterdam")
 CenterOfMap <- geocode("52.374,4.618")

You can use these coordinates to tell the get_map function (ggmap package) which part of the world you're looking for. With the zoom argument you can define the area you want to see.

# get the map from google
Amsterdam <- get_map(c(lon = CenterOfMap$lon, lat = CenterOfMap$lat),zoom = 12, maptype = "terrain", source = "google")

Make sure you’ve downloaded the correct area of the world with the right zoom level.

# create and plot the map
AmsterdamMap <- ggmap(Amsterdam)
AmsterdamMap

Now that we've got the map, we can start loading the data we want to show on it.

# load the data you want to show in the choropleth
geo_data <- read.csv("Data.csv", header = T, sep = ";", dec = ",")

This dataset contains two columns (a few example rows are shown below):

  1. id, which contains the four digits of the Dutch postal codes;
  2. value, which contains the value I want to show on the map, in this case this is the conversion rate for each area.
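For illustration, a few rows of such a file could look like this (values made up), matching the semicolon separator and decimal comma used in the read.csv call above:

id;value
1011;0,052
1012;0,047
1013;0,031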

 

Finally, we need to load the shapefile of Holland with the four-digit postal code areas. In order for Google Maps to recognize the latitude and longitude of the shapefile, we need to transform it.

# load the shapefile
shapedata <- readOGR('.', "ESRI-PC4-2015R1")
 # transform the shapefile to something Google Maps can recognize
shapedata_trans <- spTransform(shapedata, CRS("+proj=longlat +datum=WGS84"))

After having done this, we’re ready to combine all our files to create one stunning plot! We start by merging the shapefile with the data we want to plot.

# add an id to each row
shapedata_trans@data$id <- rownames(shapedata_trans@data)
# merge the data and the shapefile
shape_and_data <- merge(x = shapedata_trans, y = geo_data, by.x = "PC4", by.y = "id")

When merging these files, self-intersections can arise; we want to make sure to remove them, since they'll cause errors.

# repair self-intersections
shape_and_data_wsi <- gBuffer(shape_and_data, width = 0, byid = TRUE)

Lastly, we fortify the combination of the data and the shapefile and combine the resulting data with the original shapefile.

# fortify the shapefile to make it usable to plot on Google Maps
fort <- fortify(shape_and_data_wsi)
# merge the fortified shapefile with the original shapefile data
PC4 <- merge(fort,shape_and_data_wsi@data, by = "id")

We end up with a file we can combine with the Google Maps plot we created earlier.

# create the final map with the overlaying choropleth
AmsterdamMap_final <- AmsterdamMap + 
  geom_polygon(aes(x = long, y = lat, group = group, fill = value), size = .2, color = 'black', data = PC4, alpha = 0.8) +
  coord_map() +
  scale_fill_gradient(low = "red", high = "green") + 
  theme(legend.title = element_blank(), legend.position = "bottom")
AmsterdamMap_final

Which in the end leads us to this plot:
[Image: Amsterdam conversion heatmap overlaid on Google Maps]

I hope this blog can help and inspire you! If you have a better way to achieve the same, or maybe an even better combination of a choropleth and Google Maps, please let me know! I'm always looking for better ways to visualize my data on a map.