Hive Table Using Regex Serde

Connect to hive using either hive or beeline

$ beeline -u jdbc:hive2://localhost:10000/default

0: jdbc:hive2://localhost:10000/default> set hive.server2.logging.operation.level=none; --turn off logging;

You can get sample access logs from this site http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

Load access log file to HDFS location /user/cloudera/weblogs. You can use hue to load the file.

Create an external table using hive

hive> CREATE EXTERNAL TABLE `weblogs`(
  `host` string COMMENT 'Host', 
  `identity` string COMMENT 'User Identity', 
  `user` string COMMENT 'User identifier', 
  `time` string COMMENT 'Date time of access', 
  `request` string COMMENT 'Http request', 
  `status` string COMMENT 'Http status', 
  `size` string COMMENT 'Http response size', 
  `referrer` string COMMENT 'Referrer url', 
  `useragent` string COMMENT 'Web client agent')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})? (\\S+) (\\S+) (\\[.+?\\]) \"(.*?)\" (\\d{3}) (\\S+) \"(.*?)\" \"(.*?)\"', 
  'output.format.string'='%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/cloudera/weblogs';

Query the table using HQL:

0: jdbc:hive2://localhost:10000/default> select host, request, size from weblogs limit 10;
+----------------+-------------------------------+--------+--+
|      host      |             time              |  size  |
+----------------+-------------------------------+--------+--+
| 66.249.67.3    | [20/Jul/2009:20:12:22 -0700]  | 5      |
| 66.249.67.3    | [20/Jul/2009:20:12:25 -0700]  | 8068   |
| 64.233.172.17  | [20/Jul/2009:20:12:26 -0700]  | 2969   |
| 74.125.74.193  | [20/Jul/2009:20:13:01 -0700]  | 2969   |
| 192.168.1.198  | [20/Jul/2009:20:13:18 -0700]  | 17935  |
| 192.168.1.198  | [20/Jul/2009:20:13:18 -0700]  | 1504   |
| 192.168.1.198  | [20/Jul/2009:20:13:19 -0700]  | 146    |
| 66.249.67.3    | [20/Jul/2009:20:13:21 -0700]  | 5      |
| 66.249.67.3    | [20/Jul/2009:20:13:24 -0700]  | 8196   |
| 66.249.67.3    | [20/Jul/2009:20:13:29 -0700]  | 6360   |
+----------------+-------------------------------+--------+--+

To test a query, find the average average size of the access logs

0: jdbc:hive2://localhost:10000/default> select avg(size) from weblogs;

Find histogram of size of request.

0: jdbc:hive2://localhost:10000/default> SELECT inline(histogram_numeric(CAST(size AS INT), 10)) FROM weblogs;
+---------------------+-----------+--+
|          x          |     y     |
+---------------------+-----------+--+
| 8260.706523990037   | 236619.0  |
| 298536.65322580637  | 124.0     |
| 633294.855172414    | 145.0     |
| 893082.5142857141   | 35.0      |
| 1057042.1818181819  | 11.0      |
| 1307906.2857142857  | 21.0      |
| 1569395.3103448276  | 29.0      |
| 6461779.999999999   | 8.0       |
| 7171896.0           | 7.0       |
| 8512949.333333334   | 9.0       |
+---------------------+-----------+--+

Hive UDF

Hive has a rich set of UDF for analytics. See the document below for the list.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Information:

  • IP address of the user
  • Date and Time of access
  • HTTP method
  • Page of access
  • Page Request params
  • HTTP response code
  • HTTP response time
  • OS
  • Browser

What questions we can ask to the data

  • Most popular web pages
  • Which pages we can embed ads?
  • Which browser is the fastest
  • Which is the most popular browser
  • Which is the most popular OS
  • Which geography requests are coming from
  • How is a traffic at at a given time and prior value last week