如何强制 spark/hadoop 忽略文件上的.gz扩展名并将其读取为未压缩的纯文本?

我的代码如下：

val lines: RDD[String] = sparkSession.sparkContext.textFile("s3://mybucket/file.gz")

URL 以.gz结尾，但这是旧代码的结果。该文件是纯文本，不涉及压缩。然而，火花坚持将其读取为GZIP文件，这显然失败了。如何让它忽略扩展名并简单地将文件读取为文本？

根据本文，我尝试在不包括GZIP编解码器的各个地方设置配置，例如：

sparkContext.getConf.set("spark.hadoop.io.compression.codecs", classOf[DefaultCodec].getCanonicalName)

这似乎没有任何效果。

由于文件位于 S3 上，因此我不能在不复制整个文件的情况下简单地重命名它们。

第一个解决方案：着色 GzipCodec

这个想法是通过在您自己的源代码中包含此 java 文件并替换以下行来阴影/着色包org.apache.hadoop.io.compress中定义的GzipCodec：

public String getDefaultExtension() {
return ".gz";
}

跟：

public String getDefaultExtension() {
return ".whatever";
}

在构建项目时，这将有效地使用GzipCodec的定义，而不是依赖项提供的定义(这是 GzipCodec 的阴影)。

这样，在解析文件时，textFile()将被迫应用默认编解码器，因为 gzip 的编解码器不再适合文件的命名。

此解决方案的不便之处在于您将无法在同一应用程序中处理真正的gzip文件。

第二种解决方案：将newAPIHadoopFile与自定义/修改的TextInputFormat一起使用

您可以将newAPIHadoopFile(而不是textFile)与自定义/修改的TextInputFormat一起使用，从而强制使用DefaultCodec(纯文本)。

我们将根据默认的 (TextInputFormat) 编写自己的行读取器。这个想法是删除TextInputFormat中发现它被命名为.gz的部分，从而在读取文件之前解压缩文件。

而不是打电话给sparkContext.textFile，

// plain text file with a .gz extension:
sparkContext.textFile("s3://mybucket/file.gz")

我们可以使用底层sparkContext.newAPIHadoopFile它允许我们指定如何读取输入：

import org.apache.hadoop.mapreduce.lib.input.FakeGzInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
sparkContext
.newAPIHadoopFile(
"s3://mybucket/file.gz",
classOf[FakeGzInputFormat], // This is our custom reader
classOf[LongWritable],
classOf[Text],
new Configuration(sparkContext.hadoopConfiguration)
)
.map { case (_, text) => text.toString }

通常的呼叫newAPIHadoopFile方式是使用TextInputFormat.这是包装文件读取方式以及根据文件扩展名选择压缩编解码器的部分。

让我们称它为FakeGzInputFormat，并按如下方式实现它作为TextInputFormat的扩展(这是一个Java文件，让我们把它放在包src/main/java/org/apache/hadoop/mapreduce/lib/input中)：

package org.apache.hadoop.mapreduce.lib.input;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import com.google.common.base.Charsets;
public class FakeGzInputFormat extends TextInputFormat {
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context
) {
String delimiter =
context.getConfiguration().get("textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
// Here we use our custom `FakeGzLineRecordReader` instead of
// `LineRecordReader`:
return new FakeGzLineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
return true; // plain text is splittable (as opposed to gzip)
}
}

实际上，我们必须更深入地进行更深入的访问，并将默认LineRecordReader(Java)替换为我们自己的(我们称之为FakeGzLineRecordReader)。

由于从LineRecordReader继承是相当困难的，我们可以复制LineRecordReader(在src/main/java/org/apache/hadoop/mapreduce/lib/input中)，并通过强制使用默认编解码器(纯文本)来稍微修改(和简化)initialize(InputSplit genericSplit, TaskAttemptContext context)方法：

(与原始LineRecordReader相比，唯一的更改已给出注释，解释正在发生的事情)

package org.apache.hadoop.mapreduce.lib.input;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@InterfaceAudience.LimitedPrivate({"MapReduce", "Pig"})
@InterfaceStability.Evolving
public class FakeGzLineRecordReader extends RecordReader<LongWritable, Text> {
private static final Logger LOG =
LoggerFactory.getLogger(FakeGzLineRecordReader.class);
public static final String MAX_LINE_LENGTH =
"mapreduce.input.linerecordreader.line.maxlength";
private long start;
private long pos;
private long end;
private SplitLineReader in;
private FSDataInputStream fileIn;
private Seekable filePosition;
private int maxLineLength;
private LongWritable key;
private Text value;
private byte[] recordDelimiterBytes;
public FakeGzLineRecordReader(byte[] recordDelimiter) {
this.recordDelimiterBytes = recordDelimiter;
}
// This has been simplified a lot since we don't need to handle compression
// codecs.
public void initialize(
InputSplit genericSplit,
TaskAttemptContext context
) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
fileIn.seek(start);
in = new UncompressedSplitLineReader(
fileIn, job, this.recordDelimiterBytes, split.getLength()
);
filePosition = fileIn;
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
// Simplified as input is not compressed:
private int maxBytesToConsume(long pos) {
return (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
}
// Simplified as input is not compressed:
private long getFilePosition() {
return pos;
}
private int skipUtfByteOrderMark() throws IOException {
int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
Integer.MAX_VALUE);
int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
pos += newSize;
int textLength = value.getLength();
byte[] textBytes = value.getBytes();
if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
(textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
LOG.info("Found UTF-8 BOM and skipped it");
textLength -= 3;
newSize -= 3;
if (textLength > 0) {
textBytes = value.copyBytes();
value.set(textBytes, 3, textLength);
} else {
value.clear();
}
}
return newSize;
}
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
if (pos == 0) {
newSize = skipUtfByteOrderMark();
} else {
newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
pos += newSize;
}
if ((newSize == 0) || (newSize < maxLineLength)) {
break;
}
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
try {
if (in != null) {
in.close();
}
} finally {}
}
}

相关内容

最新更新

热门标签：