I found some answers right here - SOURCE.
Below are just a few interesting parts of this document.
…Some compression formats are splittable, which can enhance performance when reading and processing large compressed files. Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them…
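As a rough illustration of the splittability distinction (my own sketch, not from the source document): Hadoop marks a codec as splittable by having it implement the SplittableCompressionCodec interface, which BZip2Codec does and GzipCodec does not.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Instantiate the codecs through Hadoop's reflection helper,
        // the same way MapReduce does internally.
        CompressionCodec gzip  = ReflectionUtils.newInstance(GzipCodec.class, conf);
        CompressionCodec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);

        // Only codecs implementing SplittableCompressionCodec can be
        // decompressed starting from an arbitrary block boundary, so only
        // their files can be processed by multiple tasks in parallel.
        System.out.println("gzip splittable?  " + (gzip  instanceof SplittableCompressionCodec)); // false
        System.out.println("bzip2 splittable? " + (bzip2 instanceof SplittableCompressionCodec)); // true
    }
}
```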
… When you submit a MapReduce job against compressed data in HDFS, Hadoop will determine whether the source file is compressed by checking the file name extension, and if the file name has an appropriate extension, Hadoop will decompress it automatically using the appropriate codec. Therefore, users do not need to explicitly specify a codec in the MapReduce job. However, if the file name extension does not follow naming conventions, Hadoop will not recognize the format and will not automatically decompress the file….
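The extension-based detection described above is done through Hadoop's CompressionCodecFactory. The sketch below shows the lookup directly; the HDFS path is a hypothetical placeholder, and a file with an unrecognized extension simply comes back with a null codec and is read as raw bytes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecByExtension {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory maps file name extensions to codecs (.gz -> GzipCodec, .bz2 -> BZip2Codec, ...).
        Path input = new Path("/data/sample.gz");          // hypothetical HDFS path
        CompressionCodec codec = factory.getCodec(input);  // null if the extension is not recognized

        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                codec != null
                        ? codec.createInputStream(fs.open(input))  // decompress transparently
                        : fs.open(input)))) {                      // unknown extension: read raw bytes
            System.out.println(reader.readLine());
        }
    }
}
```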
| Format | Codec | Extension | Splittable | Hadoop | HDInsight |
|---|---|---|---|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y | Y |
| Gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y | Y |
| Bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y | Y |
| LZO | com.hadoop.compression.lzo.LzopCodec | .lzo | N | Y | N |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | Y | N |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | Y | N |
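For completeness, here is a hedged sketch of how one of the codecs in the table would typically be selected for MapReduce output. The helper methods are the standard FileOutputFormat ones; the job name is a placeholder, and the mapper/reducer/path setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compressed-output");  // job name is a placeholder

        // Compress reducer output with bzip2, the only codec in the table
        // above that is both splittable and available on HDInsight.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        // ... set mapper/reducer classes and input/output paths, then submit as usual.
    }
}
```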