Tuesday, March 11, 2014

Hive date format

Recently I had an issue with parsing a timestamp string into Hive's timestamp format.
Timestamp string: Tue Oct 01 18:30:05 PDT 2013

The Java SimpleDateFormat pattern rules were very useful in this case. The table below shows sample date and time patterns and their results.

Sample Hive query:
FROM_UNIXTIME(unix_timestamp('Tue Oct 01 18:30:05 PDT 2013', 'EEE MMM dd HH:mm:ss z yyyy'))

Output for this query: 2013-10-01 18:30:05
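The same conversion can be sketched directly in Java, since Hive's unix_timestamp() applies SimpleDateFormat pattern rules. Note that FROM_UNIXTIME formats in the session's time zone; the sketch below assumes US Pacific so the output matches the Hive result:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class HiveDateParse {
    public static void main(String[] args) throws Exception {
        // Same pattern as in the Hive query above
        SimpleDateFormat in = new SimpleDateFormat("EEE MMM dd HH:mm:ss z yyyy", Locale.US);
        Date d = in.parse("Tue Oct 01 18:30:05 PDT 2013");

        // Render the way FROM_UNIXTIME does, assuming a US Pacific session time zone
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
        out.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
        System.out.println(out.format(d));  // 2013-10-01 18:30:05
    }
}
```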

The table below is taken from the Oracle SimpleDateFormat documentation – thanks, Oracle!

The following examples show how date and time patterns are interpreted in the U.S. locale. The given date and time are 2001-07-04 12:08:56 local time in the U.S. Pacific Time time zone.
Date and Time Pattern                  Result
"yyyy.MM.dd G 'at' HH:mm:ss z"         2001.07.04 AD at 12:08:56 PDT
"EEE, MMM d, ''yy"                     Wed, Jul 4, '01
"h:mm a"                               12:08 PM
"hh 'o''clock' a, zzzz"                12 o'clock PM, Pacific Daylight Time
"K:mm a, z"                            0:08 PM, PDT
"yyyyy.MMMMM.dd GGG hh:mm aaa"         02001.July.04 AD 12:08 PM
"EEE, d MMM yyyy HH:mm:ss Z"           Wed, 4 Jul 2001 12:08:56 -0700
"yyMMddHHmmssZ"                        010704120856-0700
"yyyy-MM-dd'T'HH:mm:ss.SSSZ"           2001-07-04T12:08:56.235-0700
"yyyy-MM-dd'T'HH:mm:ss.SSSXXX"         2001-07-04T12:08:56.235-07:00
"YYYY-'W'ww-u"                         2001-W27-3
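One of the table's rows can be reproduced in a few lines of Java; the sketch below builds the table's sample instant (2001-07-04 12:08:56 US Pacific) and applies the "EEE, MMM d, ''yy" pattern:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

public class PatternDemo {
    public static void main(String[] args) {
        // The sample instant used throughout the table
        TimeZone pacific = TimeZone.getTimeZone("America/Los_Angeles");
        Calendar cal = Calendar.getInstance(pacific, Locale.US);
        cal.clear();
        cal.set(2001, Calendar.JULY, 4, 12, 8, 56);

        // Pattern from the second row; '' renders a literal apostrophe
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, MMM d, ''yy", Locale.US);
        fmt.setTimeZone(pacific);
        System.out.println(fmt.format(cal.getTime()));  // Wed, Jul 4, '01
    }
}
```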

Thursday, February 6, 2014

Hive indexes

This is how to create, show, and drop an index in Hive – just a quick reminder.

CREATE INDEX table05_index ON TABLE table05 (column6) AS 'COMPACT' STORED AS RCFILE;
SHOW INDEX ON table01;
DROP INDEX table01_index ON table01;


Index creation recipe:

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]



Source: https://cwiki.apache.org/confluence/display/Hive/IndexDev

LZO, Snappy

Below is a short overview of Snappy and LZO. I am still trying to configure both of them to work locally. I will try to compare the usability of their command-line tools and the data loading possibilities later.

1. Additional software installation required:
   a. LZO (lzop command-line tool) – yes
   b. Snappy (e.g. snzip command-line tool) – yes
2. Hadoop configuration changes required (core-site.xml, mapred-site.xml)
3. Common input formats:

Compression format   Tool    Algorithm   File extension   Splittable
gzip                 gzip    DEFLATE     .gz              No
bzip2                bzip2   bzip2       .bz2             Yes
LZO                  lzop    LZO         .lzo             Yes, if indexed
Snappy               N/A     Snappy      .snappy          No

4. LZO/Snappy – overview

The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries.  Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds.  It doesn’t compress quite as well as gzip — expect files that are on the order of 50% larger than their gzipped version.  But that is still 20-50% of the size of the files without any compression at all, which means that IO-bound jobs complete the map phase about four times faster.
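For comparison, the gzip (DEFLATE) baseline from the table above can be exercised with the JDK's built-in java.util.zip classes. LZO and Snappy need third-party libraries, so a minimal gzip round-trip sketch on repetitive data has to stand in here (Java 11+ for String.repeat):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress a byte array with gzip (DEFLATE algorithm, .gz format)
    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress a gzip byte array back to the original bytes
    static byte[] gunzip(byte[] compressed) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "line of repetitive log data\n".repeat(1000)
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(original);
        System.out.println(packed.length < original.length);        // compression shrank the data
        System.out.println(Arrays.equals(original, gunzip(packed))); // round trip restores it
    }
}
```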

Snappy: 
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems.

5. Indexing an LZO input file:
   a. hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo

6. Sources