Below is a short overview of Snappy and LZO. I am still trying to configure both of them to work locally; I will compare the usability of their command-line tools and the options for data loading later.
1. Additional software installation required
a. LZO (lzop command-line tool) – yes
b. Snappy (e.g. snzip command-line tool) – yes
2. Hadoop configuration changes (core-site.xml, mapred-site.xml) required
a. LZO – yes (recipe: http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/)
b. Snappy – yes (recipe: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_23_3.html)
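As a sketch of what those recipes end up configuring (property names follow the CDH4-era recipes linked above; the exact codec list and the choice of Snappy for map output are assumptions you would adjust for your cluster):

```xml
<!-- core-site.xml: register the compression codecs -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

For LZO the native libraries and the hadoop-lzo jar must also be on the cluster's classpath and java.library.path, which is the bulk of what the linked recipes walk through.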
3. Common input formats

| Compression format | Tool  | Algorithm | File extension | Splittable     |
|--------------------|-------|-----------|----------------|----------------|
| gzip               | gzip  | DEFLATE   | .gz            | No             |
| bzip2              | bzip2 | bzip2     | .bz2           | Yes            |
| LZO                | lzop  | LZO       | .lzo           | Yes if indexed |
| Snappy             | N/A   | Snappy    | .snappy        | No             |
4. LZO/Snappy – overview
The LZO compression format is composed of many smaller (~256 KB) blocks of compressed data, allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, fast enough to keep up with hard-drive read speeds. It doesn't compress quite as well as gzip – expect files on the order of 50% larger than their gzipped versions. But that is still 20–50% of the size of the uncompressed files, which means that IO-bound jobs complete the map phase about four times faster.
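To make the size claim concrete: the Python standard library has no LZO bindings, but its gzip module can show the baseline ratio the comparison above is made against (the log-style sample text is an assumption):

```python
import gzip

# Build roughly 1 MB of mildly repetitive, log-style text.
data = "".join(f"record {i} status=OK bytes={i * 7}\n" for i in range(30000)).encode()

compressed = gzip.compress(data)
ratio = len(compressed) / len(data)

# gzip shrinks text like this to a small fraction of its size; an LZO
# file would be roughly 50% larger than this gzip output, yet still
# far smaller than the uncompressed input.
print(f"original={len(data)} gzip={len(compressed)} ratio={ratio:.2f}")
```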
Snappy (from the Google project's own description):
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems.
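Since the quote benchmarks Snappy against zlib's fastest mode, the same speed-versus-ratio trade can be seen with zlib alone; testing Snappy itself would need the third-party python-snappy package, so this sketch (with assumed sample data) only contrasts zlib level 1 with level 9:

```python
import time
import zlib

# Varied, log-style sample data (~1.5 MB).
data = "".join(f"event {i} user={i % 97} payload={i * i}\n" for i in range(50000)).encode()

for level in (1, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level={level} size={len(out)} time={elapsed * 1000:.1f} ms")

# Level 1 finishes faster but leaves a larger file - the same trade-off
# Snappy pushes even further in favour of speed.
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)
```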
5. Indexing an LZO input file
a. hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo
6. Sources