Window Functions in Spark SQL
Add a unique id (Long) to a each row in a DF.
import sqlContext.implicits._ import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructType, StructField, LongType} val df = sc.parallelize(Seq( ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar") val schema = df.schema val rows = df.rdd.zipWithUniqueId.map{ case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)} val dfWithPK = sqlContext.createDataFrame( rows, StructType(StructField("id", LongType, false) +: schema.fields))
Another easier option
df.withColumn("id", monotonicallyIncreasingId())
+---+----+-----------+
|foo| bar| id|
+---+----+-----------+
| a|-1.0| 8589934592|
| b|-2.0|17179869184|
| c|-3.0|25769803776|
+---+----+-----------+
Window Functions
Download yelp data http://www.yelp.com/dataset_challenge
$ import org.apache.spark.sql.functions._
$ val business = sqlContext.read.json("/Volumes/SONY/Data/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json")
$ business.registerTempTable("business")
$ sql("SELECT name, city, review_count, row_number() OVER (PARTITION BY city) as city_review_total FROM business").show
+--------------------+--------------+------------+-----------------+
| name| city|review_count|city_review_total|
+--------------------+--------------+------------+-----------------+
|A B-Z Auto Care, ...|110. Las Vegas| 15| 1|
| Wild Rice| Allentown| 16| 1|
| Popmart| Allentown| 3| 2|
| Desert Rose Bar| Arlington| 3| 1|
| Co-Op Grill| Arlington| 7| 2|
|West Valley Pedia...| Avondale| 18| 1|
|Dermatology for A...| Avondale| 3| 2|
| Earnhardt Hyundai| Avondale| 33| 3|
|Waterworks Water ...| Avondale| 5| 4|
| Domino's Pizza| Avondale| 8| 5|
|Universal Technic...| Avondale| 4| 6|
|Crystal Springs A...| Avondale| 11| 7|
| Wendy's| Avondale| 9| 8|
| Fire Department| Avondale| 3| 9|
| Gamestop| Avondale| 7| 10|
|Phoenix Plasterin...| Avondale| 5| 11|
| Nails of the World| Avondale| 14| 12|
|St Thomas Aquinas...| Avondale| 3| 13|
| Palermos Pizza| Avondale| 28| 14|
| Dollar Tree| Avondale| 5| 15|
+--------------------+--------------+------------+-----------------+
only showing top 20 rows
Get the top Yelp businesses based on the number reviews in each city and the row number of the business.
$ sql("SELECT name, city, review_count, rank() OVER (PARTITION BY city ORDER BY review_count DESC) as rank FROM business").show
+--------------------+--------------+------------+----+
| name| city|review_count|rank|
+--------------------+--------------+------------+----+
|A B-Z Auto Care, ...|110. Las Vegas| 15| 1|
| Wild Rice| Allentown| 16| 1|
| Popmart| Allentown| 3| 2|
| Co-Op Grill| Arlington| 7| 1|
| Desert Rose Bar| Arlington| 3| 2|
|Flavors of Louisiana| Avondale| 352| 1|
| Pita Kitchen| Avondale| 326| 2|
|TJ's Homestyle Re...| Avondale| 178| 3|
| Smashburger| Avondale| 158| 4|
| India Garden| Avondale| 151| 5|
|Don Pancho Mexica...| Avondale| 145| 6|
| Claim Jumper| Avondale| 138| 7|
|Thai Orchid Garde...| Avondale| 128| 8|
| Greek Gyro Express| Avondale| 119| 9|
|Colados Coffee & ...| Avondale| 118| 10|
| Thai Gourmet| Avondale| 117| 11|
| Fiesta Mexicana| Avondale| 108| 12|
|Kreative Kupcakes...| Avondale| 106| 13|
| NYPD Pizza| Avondale| 103| 14|
| Islands Restaurant| Avondale| 102| 15|
+--------------------+--------------+------------+----+
only showing top 20 rows
Check the number reviews for each business compared to the average number of reviews across all business in the city.
$ sql("SELECT name, city, review_count, AVG(review_count) OVER (PARTITION BY city) as city_review_avg FROM business").show
+--------------------+--------------+------------+------------------+
| name| city|review_count| city_review_avg|
+--------------------+--------------+------------+------------------+
|A B-Z Auto Care, ...|110. Las Vegas| 15| 15.0|
| Wild Rice| Allentown| 16| 9.5|
| Popmart| Allentown| 3| 9.5|
| Desert Rose Bar| Arlington| 3| 5.0|
| Co-Op Grill| Arlington| 7| 5.0|
|West Valley Pedia...| Avondale| 18|21.067357512953368|
|Dermatology for A...| Avondale| 3|21.067357512953368|
| Earnhardt Hyundai| Avondale| 33|21.067357512953368|
|Waterworks Water ...| Avondale| 5|21.067357512953368|
| Domino's Pizza| Avondale| 8|21.067357512953368|
|Universal Technic...| Avondale| 4|21.067357512953368|
|Crystal Springs A...| Avondale| 11|21.067357512953368|
| Wendy's| Avondale| 9|21.067357512953368|
| Fire Department| Avondale| 3|21.067357512953368|
| Gamestop| Avondale| 7|21.067357512953368|
|Phoenix Plasterin...| Avondale| 5|21.067357512953368|
| Nails of the World| Avondale| 14|21.067357512953368|
|St Thomas Aquinas...| Avondale| 3|21.067357512953368|
| Palermos Pizza| Avondale| 28|21.067357512953368|
| Dollar Tree| Avondale| 5|21.067357512953368|
+--------------------+--------------+------------+------------------+
Check how the number of reviews for each business contribute to the total number of reviews for all businesses in the city.
$ sql("SELECT name, city, review_count, SUM(review_count) OVER (PARTITION BY city) as city_review_total FROM business").show
+--------------------+--------------+------------+-----------------+
| name| city|review_count|city_review_total|
+--------------------+--------------+------------+-----------------+
|A B-Z Auto Care, ...|110. Las Vegas| 15| 15|
| Wild Rice| Allentown| 16| 19|
| Popmart| Allentown| 3| 19|
| Desert Rose Bar| Arlington| 3| 10|
| Co-Op Grill| Arlington| 7| 10|
|West Valley Pedia...| Avondale| 18| 8132|
|Dermatology for A...| Avondale| 3| 8132|
| Earnhardt Hyundai| Avondale| 33| 8132|
|Waterworks Water ...| Avondale| 5| 8132|
| Domino's Pizza| Avondale| 8| 8132|
|Universal Technic...| Avondale| 4| 8132|
|Crystal Springs A...| Avondale| 11| 8132|
| Wendy's| Avondale| 9| 8132|
| Fire Department| Avondale| 3| 8132|
| Gamestop| Avondale| 7| 8132|
|Phoenix Plasterin...| Avondale| 5| 8132|
| Nails of the World| Avondale| 14| 8132|
|St Thomas Aquinas...| Avondale| 3| 8132|
| Palermos Pizza| Avondale| 28| 8132|
| Dollar Tree| Avondale| 5| 8132|
+--------------------+--------------+------------+-----------------+
only showing top 20 rows
List Top 10 cities and their highest ranked businesses in terms of number of reviews.
$ sql("SELECT * FROM (SELECT name, city, review_count, rank() OVER (PARTITION BY city ORDER BY review_count DESC) as rank FROM business) as business_ranked WHERE rank = 1 ORDER BY review_count DESC").show
+--------------------+-----------------+------------+----+
| name| city|review_count|rank|
+--------------------+-----------------+------------+----+
| Mon Ami Gabi| Las Vegas| 5642| 1|
|Phoenix Sky Harbo...| Phoenix| 1786| 1|
| Studio B| Henderson| 1629| 1|
|Four Peaks Brewin...| Tempe| 1551| 1|
| Schwartz's| Montréal| 1224| 1|
|Charlotte Douglas...| Charlotte| 1205| 1|
|Citizen Public House| Scottsdale| 1171| 1|
| Joe's Farm Grill| Gilbert| 1165| 1|
| The Old Fashioned| Madison| 1011| 1|
| Texas de Brazil| Dallas| 990| 1|
| Meat & Potatoes| Pittsburgh| 980| 1|
|SanTan Brewing Co...| Chandler| 843| 1|
|Cornish Pasty Com...| Mesa| 768| 1|
|Black Dog Smoke &...| Urbana| 597| 1|
| Yard House| Glendale| 521| 1|
| Old School Pizzeria| North Las Vegas| 396| 1|
|Rock Springs Cafe...|Black Canyon City| 371| 1|
| El Encanto| Cave Creek| 361| 1|
|Flavors of Louisiana| Avondale| 352| 1|
|Queen Creek Olive...| Queen Creek| 351| 1|
+--------------------+-----------------+------------+----+
only showing top 20 rows
Compare the number of reviews for each business with the top and bottom review counts in the city.
$ sql("SELECT name, city, review_count, FIRST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count DESC) as city_level_top, LAST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count DESC) as city_level_bottom FROM business").show
+--------------------+--------------+------------+--------------+-----------------+
| name| city|review_count|city_level_top|city_level_bottom|
+--------------------+--------------+------------+--------------+-----------------+
|A B-Z Auto Care, ...|110. Las Vegas| 15| 15| 15|
| Wild Rice| Allentown| 16| 16| 16|
| Popmart| Allentown| 3| 16| 3|
| Co-Op Grill| Arlington| 7| 7| 7|
| Desert Rose Bar| Arlington| 3| 7| 3|
|Flavors of Louisiana| Avondale| 352| 352| 352|
| Pita Kitchen| Avondale| 326| 352| 326|
|TJ's Homestyle Re...| Avondale| 178| 352| 178|
| Smashburger| Avondale| 158| 352| 158|
| India Garden| Avondale| 151| 352| 151|
|Don Pancho Mexica...| Avondale| 145| 352| 145|
| Claim Jumper| Avondale| 138| 352| 138|
|Thai Orchid Garde...| Avondale| 128| 352| 128|
| Greek Gyro Express| Avondale| 119| 352| 119|
|Colados Coffee & ...| Avondale| 118| 352| 118|
| Thai Gourmet| Avondale| 117| 352| 117|
| Fiesta Mexicana| Avondale| 108| 352| 108|
|Kreative Kupcakes...| Avondale| 106| 352| 106|
| NYPD Pizza| Avondale| 103| 352| 103|
| Islands Restaurant| Avondale| 102| 352| 102|
+--------------------+--------------+------------+--------------+-----------------+
only showing top 20 rows
$ sql("SELECT name, city, review_count, FIRST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count DESC) as city_level_top, FIRST_VALUE(review_count) OVER (PARTITION BY city ORDER BY review_count ASC) as city_level_bottom FROM business").show
+--------------------+--------------+------------+--------------+-----------------+
| name| city|review_count|city_level_top|city_level_bottom|
+--------------------+--------------+------------+--------------+-----------------+
|A B-Z Auto Care, ...|110. Las Vegas| 15| 15| 15|
| Popmart| Allentown| 3| 16| 3|
| Wild Rice| Allentown| 16| 16| 3|
| Desert Rose Bar| Arlington| 3| 7| 3|
| Co-Op Grill| Arlington| 7| 7| 3|
|Dermatology for A...| Avondale| 3| 352| 3|
| Fire Department| Avondale| 3| 352| 3|
|St Thomas Aquinas...| Avondale| 3| 352| 3|
| Officemax| Avondale| 3| 352| 3|
| Chase Bank| Avondale| 3| 352| 3|
| Subway| Avondale| 3| 352| 3|
| Zymages| Avondale| 3| 352| 3|
|A & C Party Supplies| Avondale| 3| 352| 3|
|Bamboo Palace Buffet| Avondale| 3| 352| 3|
| Ann Nails| Avondale| 3| 352| 3|
|Augusta's House &...| Avondale| 3| 352| 3|
| Bank of America| Avondale| 3| 352| 3|
| T-Mobile| Avondale| 3| 352| 3|
| Wells Fargo Bank| Avondale| 3| 352| 3|
|Fry's & Fry's Mar...| Avondale| 3| 352| 3|
+--------------------+--------------+------------+--------------+-----------------+
only showing top 20 rows
Compare the number of reviews with the number of reviews for the previous and following businesses.
$ sql("SELECT name, city, review_count, LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) AS preceding_count, LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC) AS leading_count FROM business").show
+--------------------+--------------+------------+---------------+-------------+
| name| city|review_count|preceding_count|leading_count|
+--------------------+--------------+------------+---------------+-------------+
|A B-Z Auto Care, ...|110. Las Vegas| 15| null| null|
| Wild Rice| Allentown| 16| null| 3|
| Popmart| Allentown| 3| 16| null|
| Co-Op Grill| Arlington| 7| null| 3|
| Desert Rose Bar| Arlington| 3| 7| null|
|Flavors of Louisiana| Avondale| 352| null| 326|
| Pita Kitchen| Avondale| 326| 352| 178|
|TJ's Homestyle Re...| Avondale| 178| 326| 158|
| Smashburger| Avondale| 158| 178| 151|
| India Garden| Avondale| 151| 158| 145|
|Don Pancho Mexica...| Avondale| 145| 151| 138|
| Claim Jumper| Avondale| 138| 145| 128|
|Thai Orchid Garde...| Avondale| 128| 138| 119|
| Greek Gyro Express| Avondale| 119| 128| 118|
|Colados Coffee & ...| Avondale| 118| 119| 117|
| Thai Gourmet| Avondale| 117| 118| 108|
| Fiesta Mexicana| Avondale| 108| 117| 106|
|Kreative Kupcakes...| Avondale| 106| 108| 103|
| NYPD Pizza| Avondale| 103| 106| 102|
| Islands Restaurant| Avondale| 102| 103| 99|
+--------------------+--------------+------------+---------------+-------------+
only showing top 20 rows
Find correlation between number of businesses and review counts with a city
$ sql("select count(*) number_of_business, sum(review_count) counts from business group by city").stat.corr("number_of_business","counts")
res59: Double = 0.9516694713907098