Friday, January 16, 2015

First Steps to Market Basket Analysis with R

In this article I'd like to provide a walkthrough guide on how to perform a simple market basket analysis using R. First of all, I suggest a brief reading of the "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" book written by Gordon S. Linoff and Michael J. A. Berry, and more specifically its chapter 15.

The Market Basket Analysis is a marketing application of the data mining technique known as "affinity analysis". It is often used in the retail industry as instrument to infer the customer purchase behavior, in order to improve the sales process by means of activities such as loyalty programs, discounts, and cross-selling.

A well known application of market basket analysis: the Amazon "customers who bought this item also bought" cross-selling functionality.

We suppose to have access to the historical user search data of a car retail portal. In order to perform a Market Basket Analysis, data must to be first fully pivoted - i.e. every user search must correspond to a row, and every search criteria must be corresponding to a column.

Fully pivoted user search data. Source: e "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management" by Gordon S. Linoff and Michael J. A. Berry, p. 295.

We now need our pivoted data to be imported in our R environment. Through ODBC, by querying a table or executing a stored procedure, we can direct import them - thus generate a R data frame object:

channel <- odbcConnect("dwh", uid = "<your_username>", pwd = "<your_password>")
db <- sqlQuery(channel, "select * from <your_table>")
df <- db[,c('column1', 'column2', 'column3', 'column4')]

Our data are now in the "df" data frame object.

In order to perform a Market Basket Analysis, we need to load the "arules" and the "arulesViz" R packages:


Are all our data column of "factor" type? This is required by the arules package.

sapply(df, class)

If not, we can "correct" the "wrong" column by using a cast operation:

df$ProductX <- as.factor(df$ProductX)
We might not be want to manually cast every single column of our dataframe. If this is the case, we can use the colwise function of the plyr package:

tofactor <- function(x) { x <- as.factor(x) }
df <- colwise(tofactor)(df)

The Market Basket Analysis algorithm ignores every NA value (the R equivalent of the database NULL). In case our DWH dataset is still not qualitative good enough, we might want to to manually clean it:

df$WithDiscount <- replace(df$WithDiscount, df$WithDiscount==0, NA)

Our data are now ready to generate useful information. We begin with a standard support of 0.1% and a confidence of 80%:

rules <- apriori(df, parameter = list(supp = 0.001, conf = 0.8, maxlen=5))

Let's give a look at the top 7 rules, ordered by lift:

rules <- sort(rules, by="lift", decreasing=TRUE)

We obtain something like this:

   lhs   rhs      support     confidence     lift
1  {WithDiscount=1} => {Model=Golf}  0.0070  0.92   1.3
2  {WithDiscount=1} => {EngineType=Diesel} 0.0064  0.85  1.1
3  {Model=Golf} => {Color=Green}   0.0300      0.97 1.4
4  {Model=Golf} => {Color=Red}   0.4138      0.90   1.3
5  {Color=Red} => {EngineType=Gasoline}  0.4138   0.90   1.2
6  {Model=Golf} => {EngineType=Diesel}  0.6343  0.89  1.2
7  {EngineType=Electric} => {Color=White}  0.6343      0.

For better reading we can export our rules to a csv file:
write(rules, file="rules.csv", quote=TRUE, sep=";")

As usual, most of the Data Mining results appears to be trivial - i.e. they are already known by anyone familiar with the business.

In this example, it may appear non trivial and therefore worth a further analysis the rule 7 (a customer choosing an electric vehicle is more likely to purchase it of white color) and the rule 5 (future owners of red painted cars tend to prefer traditional gasoline engines over diesel or electric ones).

Finding out a sound and useful interpretation of these derived rules is, of course, part of the analysis itself.

No comments:

Post a Comment