06 Grouping and Aggregation in R

For illustration we are going we ggplot2::diamonds dataset.

Load diamonds data and look at the data structure

> diamonds = ggplot2::diamonds

> str(diamonds)

Find average carat (weight) for each cut

> aggregate(carat ~ cut, diamonds, mean)

diamonds$cut diamonds$carat

1 Fair 1.0461366

2 Good 0.8491847

3 Very Good 0.8063814

4 Premium 0.8919549

5 Ideal 0.7028370

Find average of carat and mean price for each cut.

> aggregate(cbind(carat, price) ~ cut, diamonds, mean)

cut carat price

1 Fair 1.0461366 4358.758

2 Good 0.8491847 3928.864

3 Very Good 0.8063814 3981.760

4 Premium 0.8919549 4584.258

5 Ideal 0.7028370 3457.542

You can also find mean of two columns that are grouped by 2 columns.

> aggregate(cbind(carat, price) ~ cut + color, diamonds, mean)

cut color carat price

1 Fair D 0.9201227 4291.061

2 Good D 0.7445166 3405.382

3 Very Good D 0.6964243 3470.467

4 Premium D 0.7215471 3631.293

5 Ideal D 0.5657657 2629.095

6 Fair E 0.8566071 3682.312

7 Good E 0.7451340 3423.644

...

For doing multiple aggregation at the same time plyr package can useful

> require(plyr)

> ddply(diamonds, .(cut, color), function(w) c(count = nrow(w), avgPrice = mean(w$price), maxCarat = max(w$carat)))

cut color count avgPrice maxCarat

1 Fair D 163 4291.061 3.40

2 Fair E 224 3682.312 2.04

3 Fair F 312 3827.003 2.58

4 Fair G 314 4239.255 2.60

5 Fair H 303 5135.683 4.13

6 Fair I 175 4685.446 3.02

7 Fair J 119 4975.655 5.01

8 Good D 662 3405.382 2.04

9 Good E 933 3423.644 3.00

...

If you are already familiar with sql statement, sqldf package will be useful

require(sqldf)

> sqldf("select cut, color, count(*), avg(price) avgPrice, max(carat) maxCarat from diamonds group by cut, color")

cut color count(*) avgPrice maxCarat