06 Grouping and Aggregation in R

For illustration we are going we ggplot2::diamonds dataset.

Load diamonds data and look at the data structure

> diamonds = ggplot2::diamonds
> str(diamonds)

Find average carat (weight) for each cut

> aggregate(carat ~ cut, diamonds, mean)
  diamonds$cut diamonds$carat
1         Fair      1.0461366
2         Good      0.8491847
3    Very Good      0.8063814
4      Premium      0.8919549
5        Ideal      0.7028370

Find average of carat and mean price for each cut.

> aggregate(cbind(carat, price) ~ cut, diamonds, mean)
        cut     carat    price
1      Fair 1.0461366 4358.758
2      Good 0.8491847 3928.864
3 Very Good 0.8063814 3981.760
4   Premium 0.8919549 4584.258
5     Ideal 0.7028370 3457.542

You can also find mean of two columns that are grouped by 2 columns.

> aggregate(cbind(carat, price) ~ cut + color, diamonds, mean)
         cut color     carat    price
1       Fair     D 0.9201227 4291.061
2       Good     D 0.7445166 3405.382
3  Very Good     D 0.6964243 3470.467
4    Premium     D 0.7215471 3631.293
5      Ideal     D 0.5657657 2629.095
6       Fair     E 0.8566071 3682.312
7       Good     E 0.7451340 3423.644
...

For doing multiple aggregation at the same time plyr package can useful

> require(plyr)
> ddply(diamonds, .(cut, color), function(w) c(count = nrow(w), avgPrice = mean(w$price), maxCarat = max(w$carat)))
         cut color count avgPrice maxCarat
1       Fair     D   163 4291.061     3.40
2       Fair     E   224 3682.312     2.04
3       Fair     F   312 3827.003     2.58
4       Fair     G   314 4239.255     2.60
5       Fair     H   303 5135.683     4.13
6       Fair     I   175 4685.446     3.02
7       Fair     J   119 4975.655     5.01
8       Good     D   662 3405.382     2.04
9       Good     E   933 3423.644     3.00
...

If you are already familiar with sql statement, sqldf package will be useful

require(sqldf)
> sqldf("select cut, color, count(*), avg(price) avgPrice, max(carat) maxCarat from diamonds group by cut, color")
         cut color count(*) avgPrice maxCarat
1       Fair     D      163 4291.061     3.40
2       Fair     E      224 3682.312     2.04
3       Fair     F      312 3827.003     2.58
4       Fair     G      314 4239.255     2.60
5       Fair     H      303 5135.683     4.13
6       Fair     I      175 4685.446     3.02
...

Order diamonds dataset based on price

> diamonds = diamonds[order(diamonds$price, decreasing = TRUE),]