Monday, October 13, 2014

Distinct Count of Values in R

Multidimensional datasets often contain rows with duplicate values. In SQL we can easily handle this problem thanks to the COUNT(DISTINCT ...) aggregate function. But what about R?

According to a couple of websites and blogs I've quickly checked, the fastest and most efficient way to get a distinct count of values in R seems to be R's unique function:

unique(dataset$column)

where "column" is the name of the column in the "dataset" data frame whose distinct values we'd like to count.
The function returns a vector containing the unique values of the specified column - i.e. a vector without duplicate elements.
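
To make this concrete, here is a minimal sketch on made-up data. The data frame "orders" and its columns are hypothetical names chosen only for illustration; stringsAsFactors = FALSE is set so the column stays a character vector on older R versions too.

```r
# Hypothetical sample data; "orders", "customer" and "amount" are illustrative names.
orders <- data.frame(
  customer = c("ann", "bob", "ann", "cid", "bob"),
  amount   = c(10, 20, 10, 5, 7),
  stringsAsFactors = FALSE
)

# unique() drops repeated values and keeps the order of first appearance.
distinct_customers <- unique(orders$customer)
print(distinct_customers)
```

The result is a plain character vector with one entry per distinct customer, not a data frame.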

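Thus what we need now is a simple count of the elements of this vector. Since unique() returns a vector here rather than a data frame, we count with length(); nrow() would return NULL for a vector:

length(newdataset)

where "newdataset" holds the result of the unique() call above. Wrapping it all in one single scalar-returning statement:

length(unique(dataset$column))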
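
One difference from SQL is worth a quick sketch: unique() keeps NA as a value of its own, whereas COUNT(DISTINCT ...) skips NULLs. The vector "x" below is made up for illustration, and dropping NA first is just one possible way to mirror the SQL behaviour.

```r
# Hypothetical vector with duplicates and missing values.
x <- c(3, 1, 3, NA, 1, NA)

# unique() treats NA as a value of its own, so it is counted here.
n_with_na <- length(unique(x))

# Dropping NA first mirrors SQL's COUNT(DISTINCT ...), which skips NULLs.
n_without_na <- length(unique(x[!is.na(x)]))

print(c(with_na = n_with_na, without_na = n_without_na))
```

Which of the two counts is the right one depends on whether a missing value should count as a category in your data.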

If we want to apply the same logic to every column of the dataset rather than to a single one, we can use sapply() with an anonymous (lambda) function:

sapply(dataset, function(x) length(unique(x)))
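
As a rough sketch on hypothetical data, the per-column count looks like this. The example also shows nrow(unique(dataset)), which answers a slightly different question: how many distinct rows the data frame has. The column names "id" and "color" are assumptions made for the example.

```r
# Hypothetical data frame; the second and third rows are exact duplicates.
dataset <- data.frame(
  id    = c(1, 2, 2, 3),
  color = c("red", "blue", "blue", "red"),
  stringsAsFactors = FALSE
)

# One distinct count per column, returned as a named vector.
per_column <- sapply(dataset, function(x) length(unique(x)))
print(per_column)

# unique() on a data frame removes duplicate rows, so nrow() applies here.
distinct_rows <- nrow(unique(dataset))
print(distinct_rows)
```

Note that nrow() is the right tool only in the second case, because unique() returns a data frame when it is given one.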
