I found some answers right here - SOURCE.
Below are just a few interesting parts of this document.
…Some compression formats are splittable, which can enhance performance when reading and processing large compressed files. Typically, a single large compressed file is stored in HDFS with many data blocks that are distributed across many data nodes. When the file has been compressed by using one of the splittable algorithms, the data blocks can be decompressed in parallel by multiple MapReduce tasks. However, if a file has been compressed by a non-splittable algorithm, Hadoop must pull all the blocks together and use a single MapReduce task to decompress them…
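As a rough illustration of the splittability distinction (my own sketch, not from the source document): Hadoop marks a codec as splittable by having it implement the SplittableCompressionCodec interface, which BZip2Codec does and GzipCodec does not.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Instantiate the codecs through Hadoop's reflection helper,
        // the same way MapReduce does internally.
        CompressionCodec gzip  = ReflectionUtils.newInstance(GzipCodec.class, conf);
        CompressionCodec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);

        // Only codecs implementing SplittableCompressionCodec can be
        // decompressed starting from an arbitrary block boundary, so only
        // their files can be processed by multiple tasks in parallel.
        System.out.println("gzip splittable?  " + (gzip  instanceof SplittableCompressionCodec)); // false
        System.out.println("bzip2 splittable? " + (bzip2 instanceof SplittableCompressionCodec)); // true
    }
}
```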
… When you submit a MapReduce job against compressed data in HDFS, Hadoop will determine whether the source file is compressed by checking the file name extension, and if the file name has an appropriate extension, Hadoop will decompress it automatically using the appropriate codec. Therefore, users do not need to explicitly specify a codec in the MapReduce job. However, if the file name extension does not follow naming conventions, Hadoop will not recognize the format and will not automatically decompress the file….
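The extension-based detection described above is done through Hadoop's CompressionCodecFactory. The sketch below shows the lookup directly; the HDFS path is a hypothetical placeholder, and a file with an unrecognized extension simply comes back with a null codec and is read as raw bytes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecByExtension {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory maps file name extensions to codecs (.gz -> GzipCodec, .bz2 -> BZip2Codec, ...).
        Path input = new Path("/data/sample.gz");          // hypothetical HDFS path
        CompressionCodec codec = factory.getCodec(input);  // null if the extension is not recognized

        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                codec != null
                        ? codec.createInputStream(fs.open(input))  // decompress transparently
                        : fs.open(input)))) {                      // unknown extension: read raw bytes
            System.out.println(reader.readLine());
        }
    }
}
```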
| Format | Codec | Extension | Splittable | Hadoop | HDInsight |
|---|---|---|---|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y | Y |
| Gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y | Y |
| Bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y | Y |
| LZO | com.hadoop.compression.lzo.LzopCodec | .lzo | N | Y | N |
| LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | Y | N |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | Y | N |
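For completeness, here is a hedged sketch of how one of the codecs in the table would typically be selected for MapReduce output. The helper methods are the standard FileOutputFormat ones; the job name is a placeholder, and the mapper/reducer/path setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compressed-output");  // job name is a placeholder

        // Compress reducer output with bzip2, the only codec in the table
        // above that is both splittable and available on HDInsight.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        // ... set mapper/reducer classes and input/output paths, then submit as usual.
    }
}
```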