Tuesday, March 11, 2014

Hive date format

Recently I had an issue with parsing a timestamp string into Hive's timestamp format.
Timestamp string: Tue Oct 01 18:30:05 PDT 2013

The Java SimpleDateFormat pattern rules were very useful in this case. The table below describes the date and time transformation patterns.

Sample Hive query:
FROM_UNIXTIME(unix_timestamp('Tue Oct 01 18:30:05 PDT 2013', 'EEE MMMM dd HH:mm:ss z yyyy'))

Output of this query: 2013-10-01 18:30:05
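As a quick sanity check outside Hive, the same conversion can be reproduced with java.text.SimpleDateFormat, whose pattern rules unix_timestamp() and FROM_UNIXTIME() follow. This is just a minimal sketch: the class name is made up, and the output time zone is pinned to Pacific time to match the result above.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class HiveDatePatternCheck {
    public static void main(String[] args) throws Exception {
        // MMM matches the abbreviated month name; for parsing, SimpleDateFormat
        // also accepts it with MMMM, which is why the Hive pattern above works too.
        SimpleDateFormat in = new SimpleDateFormat("EEE MMM dd HH:mm:ss z yyyy", Locale.US);
        Date parsed = in.parse("Tue Oct 01 18:30:05 PDT 2013");

        // FROM_UNIXTIME renders epoch seconds as yyyy-MM-dd HH:mm:ss in the server's
        // time zone; Pacific time is assumed here to reproduce the output above.
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        out.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
        System.out.println(out.format(parsed)); // 2013-10-01 18:30:05
    }
}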

The table below is taken from the Oracle site – thanks, Oracle!

The following examples show how date and time patterns are interpreted in the U.S. locale. The given date and time are 2001-07-04 12:08:56 local time in the U.S. Pacific Time time zone.
Date and Time Pattern                  Result
"yyyy.MM.dd G 'at' HH:mm:ss z"         2001.07.04 AD at 12:08:56 PDT
"EEE, MMM d, ''yy"                     Wed, Jul 4, '01
"h:mm a"                               12:08 PM
"hh 'o''clock' a, zzzz"                12 o'clock PM, Pacific Daylight Time
"K:mm a, z"                            0:08 PM, PDT
"yyyyy.MMMMM.dd GGG hh:mm aaa"         02001.July.04 AD 12:08 PM
"EEE, d MMM yyyy HH:mm:ss Z"           Wed, 4 Jul 2001 12:08:56 -0700
"yyMMddHHmmssZ"                        010704120856-0700
"yyyy-MM-dd'T'HH:mm:ss.SSSZ"           2001-07-04T12:08:56.235-0700
"yyyy-MM-dd'T'HH:mm:ss.SSSXXX"         2001-07-04T12:08:56.235-07:00
"YYYY-'W'ww-u"                         2001-W27-3

Thursday, February 6, 2014

Hive indexes

Here is how to create, show, and drop an index in Hive; just a quick reminder.

CREATE INDEX table05_index ON TABLE table05 (column6) AS 'COMPACT' STORED AS RCFILE;
SHOW INDEX ON table01;
DROP INDEX table01_index ON table01;


Index creation recipe:

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]



Source: https://cwiki.apache.org/confluence/display/Hive/IndexDev
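For completeness, a hedged sketch of the deferred-rebuild variant of the recipe: the index is declared WITH DEFERRED REBUILD and then populated explicitly with ALTER INDEX ... REBUILD. Here it is driven through the standard Hive JDBC driver; the HiveServer2 host, port, and credentials are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DeferredIndexExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host/port and empty credentials are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://vm1234:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // Declare the index but do not build it yet
            stmt.execute("CREATE INDEX table05_index ON TABLE table05 (column6) "
                       + "AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE");

            // Populate (or refresh) the index data
            stmt.execute("ALTER INDEX table05_index ON table05 REBUILD");
        }
    }
}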

LZO, Snappy

Below is a short overview of Snappy and LZO. I am still trying to configure both of them to work locally. Later I will try to compare the usability of their command-line tools and the options for loading data.

1. Additional software installation required
   a. LZO (lzop command-line tool) – yes
   b. Snappy (e.g. the snzip command-line tool) – yes
2. Hadoop configuration changes required (core-site.xml, mapred-site.xml)
3. Common compression formats

Compression format   Tool    Algorithm   File extension   Splittable
gzip                 gzip    DEFLATE     .gz              No
bzip2                bzip2   bzip2       .bz2             Yes
LZO                  lzop    LZO         .lzo             Yes, if indexed
Snappy               N/A     Snappy      .snappy          No

4. LZO/Snappy – overview (see the short Java sketch after this list for writing a Snappy-compressed file with Hadoop's codec API)

The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries.  Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds.  It doesn’t compress quite as well as gzip — expect files that are on the order of 50% larger than their gzipped version.  But that is still 20-50% of the size of the files without any compression at all, which means that IO-bound jobs complete the map phase about four times faster.

Snappy: 
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems.

5. Indexing an LZO input file
   a. hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo

6. Sources
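As promised under item 4, here is a minimal Java sketch of writing a Snappy-compressed file through Hadoop's SnappyCodec. The class name and the input path argument are made up, and the native Snappy library has to be available on the machine for the codec to work.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.InputStream;
import java.io.OutputStream;

public class SnappyWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the codec the Hadoop way, so it picks up the configuration
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        Path src = new Path(args[0]);                                // uncompressed input
        Path dst = new Path(args[0] + codec.getDefaultExtension());  // same name + ".snappy"

        // Copy the input through the codec's compressing output stream
        try (InputStream in = fs.open(src);
             OutputStream out = codec.createOutputStream(fs.create(dst))) {
            IOUtils.copyBytes(in, out, 4096);
        }
    }
}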

Thursday, November 7, 2013

Hadoop compression

What happens to compressed files that are kept on Hadoop HDFS? Are they going to be split into chunks like typical files stored on the cluster? That is what I was trying to figure out.

I found some answers right here - SOURCE.
Below are just a few interesting parts of this document.

…Some compression formats are splittable, which can enhance performance when reading and processing large compressed files. Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them…

… When you submit a MapReduce job against compressed data in HDFS, Hadoop will determine whether the source file is compressed by checking the file name extension, and if the file name has an appropriate extension, Hadoop will decompress it automatically using the appropriate codec. Therefore, users do not need to explicitly specify a codec in the MapReduce job. However, if the file name extension does not follow naming conventions, Hadoop will not recognize the format and will not automatically decompress the file….
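The extension-based detection described above is exposed in the Java API through CompressionCodecFactory, which maps a file name to a codec class (or to null when no codec matches). A small sketch follows; the file names are made up, and the exact codec set depends on the cluster's io.compression.codecs setting.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecDetectExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The same lookup MapReduce performs when deciding how to read an input file
        for (String name : new String[] {"logs.gz", "logs.bz2", "logs.snappy", "logs.txt"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            System.out.println(name + " -> "
                    + (codec == null ? "no codec (treated as plain text)" : codec.getClass().getName()));
        }
    }
}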


Format    Codec                                         Extension   Splittable   Hadoop   HDInsight
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec    .deflate    N            Y        Y
Gzip      org.apache.hadoop.io.compress.GzipCodec       .gz         N            Y        Y
Bzip2     org.apache.hadoop.io.compress.BZip2Codec      .bz2        Y            Y        Y
LZO       com.hadoop.compression.lzo.LzopCodec          .lzo        N            Y        N
LZ4       org.apache.hadoop.io.compress.Lz4Codec        .lz4        N            Y        N
Snappy    org.apache.hadoop.io.compress.SnappyCodec     .snappy     N            Y        N

Tuesday, November 5, 2013

Oozie Client API

Have you ever tried to build a Java app responsible for running Oozie flows? It does not seem to be a big deal, especially since the Oozie Client API is available. The concept I was working on involved running a flow only under certain conditions, which depended on the application's business logic.

I had to choose an Oozie client class that supports authentication mechanisms. AuthOozieClient supports Kerberos HTTP SPNEGO and simple authentication. The code below should be enough to submit and start an Oozie job.


import java.util.Properties;
import org.apache.oozie.client.AuthOozieClient;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Client that supports Kerberos HTTP SPNEGO and simple authentication
AuthOozieClient wc = new AuthOozieClient("http://vm1234:11000/oozie");

// Workflow properties (the equivalent of job.properties)
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://vm1234.hostname.net:8020/user/hue/oozie/workspaces/_bdp_-oozie-16");
conf.setProperty("nameNode", "hdfs://vm1234.hostname.net:8020");
conf.setProperty("jobTracker", "vm1234.hostname.net:8021");
conf.setProperty("oozie.use.system.libpath", "true");
conf.setProperty("oozie.libpath", "hdfs://vm1234.hostname.net:8020/user/oozie/share/lib");

// Submit and start the workflow, then poll until it leaves the RUNNING state
String jobId = wc.run(conf);
System.out.println("Workflow job submitted");

while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    Thread.sleep(10 * 1000);
    System.out.println("Workflow job running ...");
}
System.out.println("Workflow job finished");

To be sure that your flow has started properly, you can use the following command:
oozie jobs -oozie http://vm1234:11000/oozie -auth SIMPLE

When you need more details about the submitted job:
oozie job -oozie http://vm1234:11000/oozie -info 0000386-131010095736347-oozie-oozi-W

The -info parameter takes the Oozie job ID as its value.

Hello World!



“Every second of every day, our senses bring in way too much data than we can possibly process in our brains.” – Peter Diamandis, Chairman/CEO, X-Prize Foundation

I hope this blog will be a good place for sharing knowledge about Big Data techniques.