Window Functions in Spark SQL

Add a unique id (Long) to a each row in a DF.

import sqlContext.implicits._ import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructType, StructField, LongType} val df = sc.parallelize(Seq( ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar") val schema = df.schema val rows = df.rdd.zipWithUniqueId.map{ case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)} val dfWithPK = sqlContext.createDataFrame( rows, StructType(StructField("id", LongType, false) +: schema.fields))

Another easier option

df.withColumn("id", monotonicallyIncreasingId())

+---+----+-----------+

|foo| bar| id|

+---+----+-----------+

| a|-1.0| 8589934592|

| b|-2.0|17179869184|

| c|-3.0|25769803776|

+---+----+-----------+

Window Functions

Download yelp data http://www.yelp.com/dataset_challenge

$ import org.apache.spark.sql.functions._

$ val business = sqlContext.read.json("/Volumes/SONY/Data/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json")

$ business.registerTempTable("business")

$ sql("SELECT name, city, review_count, row_number() OVER (PARTITION BY city) as city_review_total FROM business").show

+--------------------+--------------+------------+-----------------+

| name| city|review_count|city_review_total|

+--------------------+--------------+------------+-----------------+

|A B-Z Auto Care, ...|110. Las Vegas| 15| 1|

| Wild Rice| Allentown| 16| 1|

| Popmart| Allentown| 3| 2|

| Desert Rose Bar| Arlington| 3| 1|

| Co-Op Grill| Arlington| 7| 2|

|West Valley Pedia...| Avondale| 18| 1|

|Dermatology for A...| Avondale| 3| 2|

| Earnhardt Hyundai| Avondale| 33| 3|

|Waterworks Water ...| Avondale| 5| 4|

| Domino's Pizza| Avondale| 8| 5|

|Universal Technic...| Avondale| 4| 6|

|Crystal Springs A...| Avondale| 11| 7|

| Wendy's| Avondale| 9| 8|

| Fire Department| Avondale| 3| 9|

| Gamestop| Avondale| 7| 10|

|Phoenix Plasterin...| Avondale| 5| 11|

| Nails of the World| Avondale| 14| 12|

|St Thomas Aquinas...| Avondale| 3| 13|

| Palermos Pizza| Avondale| 28| 14|

| Dollar Tree| Avondale| 5| 15|

+--------------------+--------------+------------+-----------------+

only showing top 20 rows

Get the top Yelp businesses based on the number reviews in each city and the row number of the business.

$ sql("SELECT name, city, review_count, rank() OVER (PARTITION BY city ORDER BY review_count DESC) as rank FROM business").show

+--------------------+--------------+------------+----+

| name| city|review_count|rank|

+--------------------+--------------+------------+----+

|A B-Z Auto Care, ...|110. Las Vegas| 15| 1|

| Wild Rice| Allentown| 16| 1|

| Popmart| Allentown| 3| 2|

| Co-Op Grill| Arlington| 7| 1|

| Desert Rose Bar| Arlington| 3| 2|

|Flavors of Louisiana| Avondale| 352| 1|

| Pita Kitchen| Avondale| 326| 2|

|TJ's Homestyle Re...| Avondale| 178| 3|

| Smashburger| Avondale| 158| 4|

| India Garden| Avondale| 151| 5|

|Don Pancho Mexica...| Avondale| 145| 6|

| Claim Jumper| Avondale| 138| 7|

|Thai Orchid Garde...| Avondale| 128| 8|

| Greek Gyro Express| Avondale| 119| 9|

|Colados Coffee & ...| Avondale| 118| 10|

| Thai Gourmet| Avondale| 117| 11|

| Fiesta Mexicana| Avondale| 108| 12|

|Kreative Kupcakes...| Avondale| 106| 13|

| NYPD Pizza| Avondale| 103| 14|

| Islands Restaurant| Avondale| 102| 15|

+--------------------+--------------+------------+----+

only showing top 20 rows

Check the number reviews for each business compared to the average number of reviews across all business in the city.

$ sql("SELECT name, city, review_count, AVG(review_count) OVER (PARTITION BY city) as city_review_avg FROM business").show

+--------------------+--------------+------------+------------------+

| name| city|review_count| city_review_avg|

+--------------------+--------------+------------+------------------+

|A B-Z Auto Care, ...|110. Las Vegas| 15| 15.0|

| Wild Rice| Allentown| 16| 9.5|

| Popmart| Allentown| 3| 9.5|

| Desert Rose Bar| Arlington| 3| 5.0|

| Co-Op Grill| Arlington| 7| 5.0|

|West Valley Pedia...| Avondale| 18|21.067357512953368|

|Dermatology for A...| Avondale| 3|21.067357512953368|

| Earnhardt Hyundai| Avondale| 33|21.067357512953368|

|Waterworks Water ...| Avondale| 5|21.067357512953368|

| Domino's Pizza| Avondale| 8|21.067357512953368|

|Universal Technic...| Avondale| 4|21.067357512953368|

|Crystal Springs A...| Avondale| 11|21.067357512953368|

| Wendy's| Avondale| 9|21.067357512953368|

| Fire Department| Avondale| 3|21.067357512953368|

| Gamestop| Avondale| 7|21.067357512953368|

|Phoenix Plasterin...| Avondale| 5|21.067357512953368|

| Nails of the World| Avondale| 14|21.067357512953368|

|St Thomas Aquinas...| Avondale| 3|21.067357512953368|

| Palermos Pizza| Avondale| 28|21.067357512953368|

| Dollar Tree| Avondale| 5|21.067357512953368|

+--------------------+--------------+------------+------------------+

Check how the number of reviews for each business contribute to the total number of reviews for all businesses in the city.

$ sql("SELECT name, city, review_count, SUM(review_count) OVER (PARTITION BY city) as city_review_total FROM business").show

+--------------------+--------------+------------+-----------------+

| name| city|review_count|city_review_total|

+--------------------+--------------+------------+-----------------+

|A B-Z Auto Care, ...|110. Las Vegas| 15| 15|

| Wild Rice| Allentown| 16| 19|

| Popmart| Allentown| 3| 19|

| Desert Rose Bar| Arlington| 3| 10|

| Co-Op Grill| Arlington| 7| 10|

|West Valley Pedia...| Avondale| 18| 8132|

|Dermatology for A...| Avondale| 3| 8132|

| Earnhardt Hyundai| Avondale| 33| 8132|

|Waterworks Water ...| Avondale| 5| 8132|

| Domino's Pizza| Avondale| 8| 8132|

|Universal Technic...| Avondale| 4| 8132|

|Crystal Springs A...| Avondale| 11| 8132|

| Wendy's| Avondale| 9| 8132|

| Fire Department| Avondale| 3| 8132|

| Gamestop| Avondale| 7| 8132|

|Phoenix Plasterin...| Avondale| 5| 8132|

| Nails of the World| Avondale| 14| 8132|

|St Thomas Aquinas...| Avondale| 3| 8132|

| Palermos Pizza| Avondale| 28| 8132|

| Dollar Tree| Avondale| 5| 8132|

+--------------------+--------------+------------+-----------------+

only showing top 20 rows

List Top 10 cities and their highest ranked businesses in terms of number of reviews.

$ sql("SELECT * FROM (SELECT name, city, review_count, rank() OVER (PARTITION BY city ORDER BY review_count DESC) as rank FROM business) as business_ranked WHERE rank = 1 ORDER BY review_count DESC").show

+--------------------+-----------------+------------+----+

| name| city|review_count|rank|

+--------------------+-----------------+------------+----+

| Mon Ami Gabi| Las Vegas| 5642| 1|

|Phoenix Sky Harbo...| Phoenix| 1786| 1|

| Studio B| Henderson| 1629| 1|

|Four Peaks Brewin...| Tempe| 1551| 1|

| Schwartz's| Montréal| 1224| 1|

|Charlotte Douglas...| Charlotte| 1205| 1|

|Citizen Public House| Scottsdale| 1171| 1|

| Joe's Farm Grill| Gilbert| 1165| 1|

| The Old Fashioned| Madison| 1011| 1|

| Texas de Brazil| Dallas| 990| 1|

| Meat & Potatoes| Pittsburgh| 980| 1|

|SanTan Brewing Co...| Chandler| 843| 1|

|Cornish Pasty Com...| Mesa| 768| 1|

|Black Dog Smoke &...| Urbana| 597| 1|

| Yard House| Glendale| 521| 1|

| Old School Pizzeria| North Las Vegas| 396| 1|

|Rock Springs Cafe...|Black Canyon City| 371| 1|

| El Encanto| Cave Creek| 361| 1|

|Flavors of Louisiana| Avondale| 352| 1|

|Queen Creek Olive...| Queen Creek| 351| 1|

+--------------------+-----------------+------------+----+

only showing top 20 rows

Compare the number of reviews for each business with the top and bottom review counts in the city.

$ sql("SELECT name, city, review_count, FIRST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count DESC) as city_level_top, LAST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count DESC) as city_level_bottom FROM business").show

+--------------------+--------------+------------+--------------+-----------------+

| name| city|review_count|city_level_top|city_level_bottom|

+--------------------+--------------+------------+--------------+-----------------+

|A B-Z Auto Care, ...|110. Las Vegas| 15| 15| 15|

| Wild Rice| Allentown| 16| 16| 16|

| Popmart| Allentown| 3| 16| 3|

| Co-Op Grill| Arlington| 7| 7| 7|

| Desert Rose Bar| Arlington| 3| 7| 3|

|Flavors of Louisiana| Avondale| 352| 352| 352|

| Pita Kitchen| Avondale| 326| 352| 326|

|TJ's Homestyle Re...| Avondale| 178| 352| 178|

| Smashburger| Avondale| 158| 352| 158|

| India Garden| Avondale| 151| 352| 151|

|Don Pancho Mexica...| Avondale| 145| 352| 145|

| Claim Jumper| Avondale| 138| 352| 138|

|Thai Orchid Garde...| Avondale| 128| 352| 128|

| Greek Gyro Express| Avondale| 119| 352| 119|

|Colados Coffee & ...| Avondale| 118| 352| 118|

| Thai Gourmet| Avondale| 117| 352| 117|

| Fiesta Mexicana| Avondale| 108| 352| 108|

|Kreative Kupcakes...| Avondale| 106| 352| 106|

| NYPD Pizza| Avondale| 103| 352| 103|

| Islands Restaurant| Avondale| 102| 352| 102|

+--------------------+--------------+------------+--------------+-----------------+

only showing top 20 rows

$ sql("SELECT name, city, review_count, FIRST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count DESC) as city_level_top, FIRST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count ASC) as city_level_bottom FROM business").show

+--------------------+--------------+------------+--------------+-----------------+

| name| city|review_count|city_level_top|city_level_bottom|

+--------------------+--------------+------------+--------------+-----------------+

|A B-Z Auto Care, ...|110. Las Vegas| 15| 15| 15|

| Popmart| Allentown| 3| 16| 3|

| Wild Rice| Allentown| 16| 16| 3|

| Desert Rose Bar| Arlington| 3| 7| 3|

| Co-Op Grill| Arlington| 7| 7| 3|

|Dermatology for A...| Avondale| 3| 352| 3|

| Fire Department| Avondale| 3| 352| 3|

|St Thomas Aquinas...| Avondale| 3| 352| 3|

| Officemax| Avondale| 3| 352| 3|

| Chase Bank| Avondale| 3| 352| 3|

| Subway| Avondale| 3| 352| 3|

| Zymages| Avondale| 3| 352| 3|

|A & C Party Supplies| Avondale| 3| 352| 3|

|Bamboo Palace Buffet| Avondale| 3| 352| 3|

| Ann Nails| Avondale| 3| 352| 3|

|Augusta's House &...| Avondale| 3| 352| 3|

| Bank of America| Avondale| 3| 352| 3|

| T-Mobile| Avondale| 3| 352| 3|

| Wells Fargo Bank| Avondale| 3| 352| 3|

|Fry's & Fry's Mar...| Avondale| 3| 352| 3|

+--------------------+--------------+------------+--------------+-----------------+

only showing top 20 rows

Compare the number of reviews with the number of reviews for the previous and following businesses.

$ sql("SELECT name, city, review_count, LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) AS preceding_count, LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) AS leading_count FROM business").show

+--------------------+--------------+------------+---------------+-------------+

| name| city|review_count|preceding_count|leading_count|

+--------------------+--------------+------------+---------------+-------------+

|A B-Z Auto Care, ...|110. Las Vegas| 15| null| null|

| Wild Rice| Allentown| 16| null| 3|

| Popmart| Allentown| 3| 16| null|

| Co-Op Grill| Arlington| 7| null| 3|

| Desert Rose Bar| Arlington| 3| 7| null|

|Flavors of Louisiana| Avondale| 352| null| 326|

| Pita Kitchen| Avondale| 326| 352| 178|

|TJ's Homestyle Re...| Avondale| 178| 326| 158|

| Smashburger| Avondale| 158| 178| 151|

| India Garden| Avondale| 151| 158| 145|

|Don Pancho Mexica...| Avondale| 145| 151| 138|

| Claim Jumper| Avondale| 138| 145| 128|

|Thai Orchid Garde...| Avondale| 128| 138| 119|

| Greek Gyro Express| Avondale| 119| 128| 118|

|Colados Coffee & ...| Avondale| 118| 119| 117|

| Thai Gourmet| Avondale| 117| 118| 108|

| Fiesta Mexicana| Avondale| 108| 117| 106|

|Kreative Kupcakes...| Avondale| 106| 108| 103|

| NYPD Pizza| Avondale| 103| 106| 102|

| Islands Restaurant| Avondale| 102| 103| 99|

+--------------------+--------------+------------+---------------+-------------+

only showing top 20 rows

Find correlation between number of businesses and review counts with a city

$ sql("select count(*) number_of_business, sum(review_count) counts from business group by city").stat.corr("number_of_business","counts")

res59: Double = 0.9516694713907098