Hive Table Using Regex Serde

Connect to hive using either hive or beeline

$ beeline -u jdbc:hive2://localhost:10000/default

0: jdbc:hive2://localhost:10000/default> set hive.server2.logging.operation.level=none; --turn off logging;

You can get sample access logs from this site http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

Load access log file to HDFS location /user/cloudera/weblogs. You can use hue to load the file.

Create an external table using hive

hive> CREATE EXTERNAL TABLE `weblogs`(

`host` string COMMENT 'Host',

`identity` string COMMENT 'User Identity',

`user` string COMMENT 'User identifier',

`time` string COMMENT 'Date time of access',

`request` string COMMENT 'Http request',

`status` string COMMENT 'Http status',

`size` string COMMENT 'Http response size',

`referrer` string COMMENT 'Referrer url',

`useragent` string COMMENT 'Web client agent')

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

'input.regex'='(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})? (\\S+) (\\S+) (\\[.+?\\]) \"(.*?)\" (\\d{3}) (\\S+) \"(.*?)\" \"(.*?)\"',

'output.format.string'='%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s')

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'/user/cloudera/weblogs';

Query the table using HQL:

0: jdbc:hive2://localhost:10000/default> select host, request, size from weblogs limit 10;

+----------------+-------------------------------+--------+--+

| host | time | size |

+----------------+-------------------------------+--------+--+

| 66.249.67.3 | [20/Jul/2009:20:12:22 -0700] | 5 |

| 66.249.67.3 | [20/Jul/2009:20:12:25 -0700] | 8068 |

| 64.233.172.17 | [20/Jul/2009:20:12:26 -0700] | 2969 |

| 74.125.74.193 | [20/Jul/2009:20:13:01 -0700] | 2969 |

| 192.168.1.198 | [20/Jul/2009:20:13:18 -0700] | 17935 |

| 192.168.1.198 | [20/Jul/2009:20:13:18 -0700] | 1504 |

| 192.168.1.198 | [20/Jul/2009:20:13:19 -0700] | 146 |

| 66.249.67.3 | [20/Jul/2009:20:13:21 -0700] | 5 |

| 66.249.67.3 | [20/Jul/2009:20:13:24 -0700] | 8196 |

| 66.249.67.3 | [20/Jul/2009:20:13:29 -0700] | 6360 |

+----------------+-------------------------------+--------+--+

To test a query, find the average average size of the access logs

0: jdbc:hive2://localhost:10000/default> select avg(size) from weblogs;

Find histogram of size of request.

0: jdbc:hive2://localhost:10000/default> SELECT inline(histogram_numeric(CAST(size AS INT), 10)) FROM weblogs;

+---------------------+-----------+--+

| x | y |

+---------------------+-----------+--+

| 8260.706523990037 | 236619.0 |

| 298536.65322580637 | 124.0 |

| 633294.855172414 | 145.0 |

| 893082.5142857141 | 35.0 |

| 1057042.1818181819 | 11.0 |

| 1307906.2857142857 | 21.0 |

| 1569395.3103448276 | 29.0 |

| 6461779.999999999 | 8.0 |

| 7171896.0 | 7.0 |

| 8512949.333333334 | 9.0 |

+---------------------+-----------+--+

Hive UDF

Hive has a rich set of UDF for analytics. See the document below for the list.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Information:

IP address of the user
Date and Time of access
HTTP method
Page of access
Page Request params
HTTP response code
HTTP response time
OS
Browser

What questions we can ask to the data

Most popular web pages
Which pages we can embed ads?
Which browser is the fastest
Which is the most popular browser
Which is the most popular OS
Which geography requests are coming from
How is a traffic at at a given time and prior value last week