Below is a short overview of Snappy and LZO. I am still trying to configure both of them to work locally; I will compare the usability of their command-line tools and the options for data loading later.
1. Additional software installation required
a. LZO (lzop command-line tool) – yes
b. Snappy (e.g. snzip command-line tool) – yes
2. Hadoop configuration changes (core-site.xml, mapred-site.xml) required
a. LZO – yes (recipe: http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/)
b. Snappy – yes (recipe: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_23_3.html)
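As a sketch of what those recipes end up configuring (property names follow the CDH4-era recipes linked above; the exact codec list and the choice of Snappy for map output are assumptions you would adjust for your cluster):

```xml
<!-- core-site.xml: register the compression codecs -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

For LZO the native libraries and the hadoop-lzo jar must also be on the cluster's classpath and java.library.path, which is the bulk of what the linked recipes walk through.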
3. Common input formats

| Compression format | Tool  | Algorithm | File extension | Splittable     |
|--------------------|-------|-----------|----------------|----------------|
| gzip               | gzip  | DEFLATE   | .gz            | No             |
| bzip2              | bzip2 | bzip2     | .bz2           | Yes            |
| LZO                | lzop  | LZO       | .lzo           | Yes if indexed |
| Snappy             | N/A   | Snappy    | .snappy        | No             |
4. LZO/Snappy – overview
The LZO compression format is composed of many smaller (~256 KB) blocks of compressed data, allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, fast enough to keep up with hard-drive read speeds. It doesn't compress quite as well as gzip – expect files on the order of 50% larger than their gzipped versions. But that is still 20–50% of the size of the uncompressed files, which means that IO-bound jobs complete the map phase about four times faster.
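To make the size claim concrete: the Python standard library has no LZO bindings, but its gzip module can show the baseline ratio the comparison above is made against (the log-style sample text is an assumption):

```python
import gzip

# Build roughly 1 MB of mildly repetitive, log-style text.
data = "".join(f"record {i} status=OK bytes={i * 7}\n" for i in range(30000)).encode()

compressed = gzip.compress(data)
ratio = len(compressed) / len(data)

# gzip shrinks text like this to a small fraction of its size; an LZO
# file would be roughly 50% larger than this gzip output, yet still
# far smaller than the uncompressed input.
print(f"original={len(data)} gzip={len(compressed)} ratio={ratio:.2f}")
```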
Snappy (from the Google project's own description):
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems.
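Since the quote benchmarks Snappy against zlib's fastest mode, the same speed-versus-ratio trade can be seen with zlib alone; testing Snappy itself would need the third-party python-snappy package, so this sketch (with assumed sample data) only contrasts zlib level 1 with level 9:

```python
import time
import zlib

# Varied, log-style sample data (~1.5 MB).
data = "".join(f"event {i} user={i % 97} payload={i * i}\n" for i in range(50000)).encode()

for level in (1, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level} size={len(out)} time={elapsed * 1000:.1f} ms")

# Level 1 finishes faster but leaves a larger file - the same trade-off
# Snappy pushes even further in favour of speed.
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)
```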
5. Indexing an LZO input file
a. hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo
6. Sources