Hive Table Using Regex Serde

Connect to hive using either hive or beeline

$ beeline -u jdbc:hive2://localhost:10000/default

0: jdbc:hive2://localhost:10000/default> set hive.server2.logging.operation.level=none; --turn off logging;

You can get sample access logs from this site http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

Load access log file to HDFS location /user/cloudera/weblogs. You can use hue to load the file.

Create an external table using hive

hive> CREATE EXTERNAL TABLE `weblogs`(

`host` string COMMENT 'Host',

`identity` string COMMENT 'User Identity',

`user` string COMMENT 'User identifier',

`time` string COMMENT 'Date time of access',

`request` string COMMENT 'Http request',

`status` string COMMENT 'Http status',

`size` string COMMENT 'Http response size',

`referrer` string COMMENT 'Referrer url',

`useragent` string COMMENT 'Web client agent')

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

'input.regex'='(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})? (\\S+) (\\S+) (\\[.+?\\]) \"(.*?)\" (\\d{3}) (\\S+) \"(.*?)\" \"(.*?)\"',

'output.format.string'='%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s')

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'/user/cloudera/weblogs';

Query the table using HQL:

0: jdbc:hive2://localhost:10000/default> select host, request, size from weblogs limit 10;

+----------------+-------------------------------+--------+--+

| host | time | size |

+----------------+-------------------------------+--------+--+

| 66.249.67.3 | [20/Jul/2009:20:12:22 -0700] | 5 |

| 66.249.67.3 | [20/Jul/2009:20:12:25 -0700] | 8068 |

| 64.233.172.17 | [20/Jul/2009:20:12:26 -0700] | 2969 |

| 74.125.74.193 | [20/Jul/2009:20:13:01 -0700] | 2969 |

| 192.168.1.198 | [20/Jul/2009:20:13:18 -0700] | 17935 |

| 192.168.1.198 | [20/Jul/2009:20:13:18 -0700] | 1504 |

| 192.168.1.198 | [20/Jul/2009:20:13:19 -0700] | 146 |

| 66.249.67.3 | [20/Jul/2009:20:13:21 -0700] | 5 |

| 66.249.67.3 | [20/Jul/2009:20:13:24 -0700] | 8196 |

| 66.249.67.3 | [20/Jul/2009:20:13:29 -0700] | 6360 |

+----------------+-------------------------------+--------+--+

To test a query, find the average average size of the access logs

0: jdbc:hive2://localhost:10000/default> select avg(size) from weblogs;

Find histogram of size of request.

0: jdbc:hive2://localhost:10000/default> SELECT inline(histogram_numeric(CAST(size AS INT), 10)) FROM weblogs;

+---------------------+-----------+--+

| x | y |

+---------------------+-----------+--+

| 8260.706523990037 | 236619.0 |

| 298536.65322580637 | 124.0 |

| 633294.855172414 | 145.0 |

| 893082.5142857141 | 35.0 |

| 1057042.1818181819 | 11.0 |

| 1307906.2857142857 | 21.0 |

| 1569395.3103448276 | 29.0 |

| 6461779.999999999 | 8.0 |

| 7171896.0 | 7.0 |

| 8512949.333333334 | 9.0 |

+---------------------+-----------+--+

Hive UDF

Hive has a rich set of UDF for analytics. See the document below for the list.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Information:

  • IP address of the user

  • Date and Time of access

  • HTTP method

  • Page of access

  • Page Request params

  • HTTP response code

  • HTTP response time

  • OS

  • Browser

What questions we can ask to the data

  • Most popular web pages

  • Which pages we can embed ads?

  • Which browser is the fastest

  • Which is the most popular browser

  • Which is the most popular OS

  • Which geography requests are coming from

  • How is a traffic at at a given time and prior value last week