Thursday, November 7, 2013

Hadoop compression

What happens to compressed files stored on Hadoop HDFS? Will they be split into chunks like regular files stored on the cluster? That is what I was trying to figure out.

I found some answers right here - SOURCE.
Below are just a few interesting parts of this document.

…Some compression formats are splittable, which can enhance performance when reading and processing large compressed files. Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them…
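In practice this matters most when the compressed data will be read again by MapReduce. Below is a minimal sketch, not taken from the quoted document, of asking a job to write its output with a splittable codec such as bzip2; the job name is a made-up placeholder and the mapper/input setup is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hedged sketch: only the lines that switch the job output to a splittable
// compression format are shown; everything else about the job is omitted.
Job job = Job.getInstance(new Configuration(), "compressed-output-example");
FileOutputFormat.setCompressOutput(job, true);                    // compress the job output
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class); // bzip2 output stays splittable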

… When you submit a MapReduce job against compressed data in HDFS, Hadoop will determine whether the source file is compressed by checking the file name extension, and if the file name has an appropriate extension, Hadoop will decompress it automatically using the appropriate codec. Therefore, users do not need to explicitly specify a codec in the MapReduce job. However, if the file name extension does not follow naming conventions, Hadoop will not recognize the format and will not automatically decompress the file….


Format  | Codec                                       | Extension | Splittable | Hadoop | HDInsight
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec | .deflate  | N          | Y      | Y
Gzip    | org.apache.hadoop.io.compress.GzipCodec    | .gz       | N          | Y      | Y
Bzip2   | org.apache.hadoop.io.compress.BZip2Codec   | .bz2      | Y          | Y      | Y
LZO     | com.hadoop.compression.lzo.LzopCodec       | .lzo      | N          | Y      | N
LZ4     | org.apache.hadoop.io.compress.Lz4Codec     | .lz4      | N          | Y      | N
Snappy  | org.apache.hadoop.io.compress.SnappyCodec  | .snappy   | N          | Y      | N
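The extension column above is exactly what Hadoop looks at when it picks a codec. Here is a rough sketch of doing the same extension-based lookup by hand with CompressionCodecFactory; the HDFS path is a made-up example:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hedged sketch: the file name below is hypothetical; the ".gz" suffix alone
// decides which codec (if any) CompressionCodecFactory hands back.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path input = new Path("/user/hue/data/events.log.gz");

CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(input); // null when the extension is not recognized

InputStream in = (codec == null)
        ? fs.open(input)                           // unknown extension: read the bytes as-is
        : codec.createInputStream(fs.open(input)); // known extension: decompress on the fly
IOUtils.copyBytes(in, System.out, 4096, false);
in.close();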

Tuesday, November 5, 2013

Oozie Client API

Have you ever tried to build a Java app responsible for running Oozie flows? It is not a big deal, especially since the Oozie Client API is available. The concept I was working on involved running a flow only under certain conditions, and those conditions depended on the application's business logic.

I had to choose an Oozie client class that supports authentication mechanisms. AuthOozieClient supports Kerberos HTTP SPNEGO and simple authentication. The code below should be enough to start an Oozie job.


import java.util.Properties;
import org.apache.oozie.client.AuthOozieClient;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Connect to the Oozie server; AuthOozieClient handles Kerberos HTTP SPNEGO or simple auth.
AuthOozieClient wc = new AuthOozieClient("http://vm1234:11000/oozie");

// Build the workflow properties (the programmatic equivalent of job.properties).
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://vm1234.hostname.net:8020/user/hue/oozie/workspaces/_bdp_-oozie-16");
conf.setProperty("nameNode", "hdfs://vm1234.hostname.net:8020");
conf.setProperty("jobTracker", "vm1234.hostname.net:8021");
conf.setProperty("oozie.use.system.libpath", "true");
conf.setProperty("oozie.libpath", "hdfs://vm1234.hostname.net:8020/user/oozie/share/lib");

// Submit and start the workflow, then poll until it leaves the RUNNING state.
String jobId = wc.run(conf);
System.out.println("Workflow job submitted");

while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    Thread.sleep(10 * 1000);
    System.out.println("Workflow job running ...");
}
System.out.println("Workflow job finished");

To make sure that your flow has started properly, you can use the following command:
oozie jobs -oozie http://vm1234:11000/oozie -auth SIMPLE

When you need more details about a submitted job:
oozie job -oozie http://vm1234:11000/oozie -info 0000386-131010095736347-oozie-oozi-W

The -info parameter takes the Oozie job ID as its value.
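The same details are also available through the Java client. A rough sketch, reusing the AuthOozieClient instance and the job ID from the snippets above:

import org.apache.oozie.client.WorkflowAction;
import org.apache.oozie.client.WorkflowJob;

// Hedged sketch: pull job details through the client API instead of the CLI.
// "wc" is the AuthOozieClient created earlier; the job ID is the one from the CLI example.
WorkflowJob job = wc.getJobInfo("0000386-131010095736347-oozie-oozi-W");
System.out.println("App name: " + job.getAppName());
System.out.println("Status:   " + job.getStatus());
for (WorkflowAction action : job.getActions()) {
    System.out.println(action.getName() + " -> " + action.getStatus());
}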

Hello World!



“Every second of every day, our senses bring in way too much data than we can possibly process in our brains.” – Peter Diamandis, Chairman/CEO, X-Prize Foundation

I hope this blog will be a good place for sharing knowledge about Big Data techniques.