06 Grouping and Aggregation in R
For illustration we are going we ggplot2::diamonds dataset.
Load diamonds data and look at the data structure
> diamonds = ggplot2::diamonds
> str(diamonds)
Find average carat (weight) for each cut
> aggregate(carat ~ cut, diamonds, mean)
diamonds$cut diamonds$carat
1 Fair 1.0461366
2 Good 0.8491847
3 Very Good 0.8063814
4 Premium 0.8919549
5 Ideal 0.7028370
Find average of carat and mean price for each cut.
> aggregate(cbind(carat, price) ~ cut, diamonds, mean)
cut carat price
1 Fair 1.0461366 4358.758
2 Good 0.8491847 3928.864
3 Very Good 0.8063814 3981.760
4 Premium 0.8919549 4584.258
5 Ideal 0.7028370 3457.542
You can also find mean of two columns that are grouped by 2 columns.
> aggregate(cbind(carat, price) ~ cut + color, diamonds, mean)
cut color carat price
1 Fair D 0.9201227 4291.061
2 Good D 0.7445166 3405.382
3 Very Good D 0.6964243 3470.467
4 Premium D 0.7215471 3631.293
5 Ideal D 0.5657657 2629.095
6 Fair E 0.8566071 3682.312
7 Good E 0.7451340 3423.644
...
For doing multiple aggregation at the same time plyr package can useful
> require(plyr)
> ddply(diamonds, .(cut, color), function(w) c(count = nrow(w), avgPrice = mean(w$price), maxCarat = max(w$carat)))
cut color count avgPrice maxCarat
1 Fair D 163 4291.061 3.40
2 Fair E 224 3682.312 2.04
3 Fair F 312 3827.003 2.58
4 Fair G 314 4239.255 2.60
5 Fair H 303 5135.683 4.13
6 Fair I 175 4685.446 3.02
7 Fair J 119 4975.655 5.01
8 Good D 662 3405.382 2.04
9 Good E 933 3423.644 3.00
...
If you are already familiar with sql statement, sqldf package will be useful
require(sqldf)
> sqldf("select cut, color, count(*), avg(price) avgPrice, max(carat) maxCarat from diamonds group by cut, color")
cut color count(*) avgPrice maxCarat
1 Fair D 163 4291.061 3.40
2 Fair E 224 3682.312 2.04
3 Fair F 312 3827.003 2.58
4 Fair G 314 4239.255 2.60
5 Fair H 303 5135.683 4.13
6 Fair I 175 4685.446 3.02
...
Order diamonds dataset based on price
> diamonds = diamonds[order(diamonds$price, decreasing = TRUE),]