Know the MIME type of a compressed file downloaded from S3 in Java



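Clients are supposed to upload a compressed file to an S3 folder. The compressed file is then downloaded and decompressed so that various operations can be performed on the files it contains. Initially we told our clients to compress their files into a ZIP file, but that turned out to be too hard for one of them. Instead, they submitted a RAR file with a ZIP extension... how clever. For obvious reasons, a RAR file cannot be decompressed with a ZIP decompression algorithm.

So I am looking for a way to find out the file type of a file downloaded from S3, given that this is a Java project using Amazon's SDK on a Linux OS. Based on the file type I get, I will decide how to decompress the file.

I have already looked at many Stack Overflow questions, such as this one, but judging from the answers (and their comments), none of them seems to work 100% of the time.

What is the best way to find out the type of a compressed file?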

TL;DR

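When uploading a file to Amazon S3 programmatically, you can specify the object's Content-Type. If none is specified, as @Michael-bot clarified, the value assigned by default is binary/octet-stream. If you instead upload the file through Amazon S3's GUI, the file gets its Content-Type from its file extension (sadly, not from its content). If you can trust whoever uploads the file to set the Content-Type correctly, go ahead and look at ObjectMetadata; if you cannot (like me), you need another solution.

So, if you are looking for a solution that works for the most common compression formats, Files.probeContentType, Apache Tika and SimpleMagic all seem acceptable.

In the end I chose Files.probeContentType because it requires no additional library and works well on a Linux machine, as long as the file does not have a wrong extension. There is a workaround for that case: remove the file extension and let it work its magic.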
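
One practical wrinkle is that the detectors compared in this answer do not agree on names: a RAR archive is reported as application/vnd.rar, application/x-rar-compressed or application/x-rar depending on the tool. A minimal sketch of normalizing those strings before picking a decompressor follows. The ArchiveKinds class, the ArchiveKind enum and the fromMimeType method are hypothetical names made up for illustration, and the mapping only covers the MIME strings that appear in the results listed in this answer.

```java
import java.util.Locale;

public class ArchiveKinds {

    /** Hypothetical enum for illustration; not part of any library. */
    enum ArchiveKind { ZIP, RAR, SEVEN_ZIP, XZ, GZIP, UNKNOWN }

    /** Maps the MIME strings seen in this answer's results to a single kind. */
    static ArchiveKind fromMimeType(String mimeType) {
        if (mimeType == null) {
            return ArchiveKind.UNKNOWN; // some detectors return null
        }
        switch (mimeType.trim().toLowerCase(Locale.ROOT)) {
            case "application/zip":
                return ArchiveKind.ZIP;
            case "application/vnd.rar":
            case "application/x-rar-compressed":
            case "application/x-rar":
                return ArchiveKind.RAR;
            case "application/x-7z-compressed":
                return ArchiveKind.SEVEN_ZIP;
            case "application/x-xz":
            case "application/x-xz-compressed-tar":
                return ArchiveKind.XZ;
            case "application/gzip":
            case "application/x-gzip":
            case "application/x-compressed-tar":
                return ArchiveKind.GZIP;
            default:
                // application/octet-stream, content/unknown and anything else
                return ArchiveKind.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromMimeType("application/x-rar"));
        System.out.println(fromMimeType("application/octet-stream"));
    }
}
```

Note that XZ and GZIP only describe the outer compression layer; whether a tar archive sits inside still has to be checked after decompressing.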


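Test setup

At first, one would think that the response object you get when downloading a file from Amazon's S3 includes the file type. It does contain this information, but the problem shows up when the file's extension does not match its content.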

import com.amazonaws.services.s3.model.S3Object;
final S3Object s3Object = ...;
final String contentType = s3Object.getObjectMetadata().getContentType();

This code returns application/zip even when the file's content is a Rar file. So this solution does not work for me.
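
To make "detect by content, not by extension" concrete, here is a minimal sketch that compares the first bytes of a stream against the well-known leading signatures of the formats tested in this answer. It is not how any of the libraries discussed here is implemented; the MagicSniffer class, the sniff methods and the returned labels are assumptions made up for illustration. It ignores edge cases such as empty ZIP archives, which start with a different signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicSniffer {

    // Leading signatures ("magic numbers") of the tested formats.
    private static final int[] ZIP   = {0x50, 0x4B, 0x03, 0x04};             // "PK\3\4"
    private static final int[] RAR   = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07}; // "Rar!\x1A\7"
    private static final int[] SEVEN = {0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C};
    private static final int[] XZ    = {0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00};
    private static final int[] GZIP  = {0x1F, 0x8B};

    /** Returns a label for the first bytes of a file (hypothetical labels). */
    static String sniff(byte[] head) {
        if (startsWith(head, ZIP))   return "zip";
        if (startsWith(head, RAR))   return "rar";
        if (startsWith(head, SEVEN)) return "7z";
        if (startsWith(head, XZ))    return "xz";
        if (startsWith(head, GZIP))  return "gzip";
        return "unknown";
    }

    /** Reads up to 6 bytes from the stream; those bytes are consumed. */
    static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[6];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // end of stream before 6 bytes were available
            }
            read += n;
        }
        return sniff(Arrays.copyOf(head, read));
    }

    private static boolean startsWith(byte[] head, int[] magic) {
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((head[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] rarHeader = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
        System.out.println(sniff(new ByteArrayInputStream(rarHeader)));
    }
}
```

Since s3Object.getObjectContent() is an InputStream, it could presumably be passed to sniff directly, but the bytes read are consumed, so the object would have to be fetched again or the stream wrapped in something that can rewind before decompressing.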

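For that reason, I took the time to build a sample project that tests various scenarios with the different methods and libraries available. By the way, I am using Java 8.

The file types tested are:

  • Zip file with Zip extension and without extension
  • Rar file with Rar extension, Zip extension and without extension
  • 7z file with 7z extension, Zip extension and without extension
  • Tar.xz file with Tar.xz extension, Zip extension and without extension
  • Tar.gz file with Tar.gz extension, Zip extension and without extension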

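Please note that the implementations presented here are for testing purposes only. They are in no way endorsed for production code, since they do not consider file-locking issues and other things my imagination fails to think of. =)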


MimetypesFileTypeMap

Implementation

import java.io.File;
import javax.activation.MimetypesFileTypeMap;
final File file = new File(basePath + "/" + fileName);
try {
return MimetypesFileTypeMap.getDefaultFileTypeMap().getContentType(file);
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/octet-stream
Rar with Zip extension is:       application/octet-stream
Zip with Zip extension is:       application/octet-stream
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/octet-stream
Rar without extension is:        application/octet-stream
Zip without extension is:        application/octet-stream
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/octet-stream

Conclusion

The value this method returns when the file type is not recognized is application/octet-stream. It fails in every scenario, so we should discard this approach.


URLConnection.guessContentTypeFromStream

Implementation

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.BufferedInputStream;
import java.net.URLConnection;
final File file = new File(basePath + "/" + fileName);
// try-with-resources closes the stream; BufferedInputStream supports mark/reset,
// which guessContentTypeFromStream needs in order to peek at the first bytes
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
return URLConnection.guessContentTypeFromStream(inputStream);
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       null
Rar with Zip extension is:       null
Zip with Zip extension is:       null
7z with 7z extension is:         null
7z with Zip extension is:        null
Tar.xz with Tar.xz extension is: null
Tar.xz with Zip extension is:    null
Tar.gz with Tar.gz extension is: null
Tar.gz with Zip extension is:    null
Rar without extension is:        null
Zip without extension is:        null
7z without extension is:         null
Tar.xz without extension is:     null
Tar.gz without extension is:     null

Conclusion

Again, this method fails in every scenario. Its support seems to be very limited.
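
To illustrate what "limited" means here, the following sketch feeds two in-memory headers to the same JDK method: a GIF header, which is one of the signatures the JDK checks for, and a ZIP header. The GuessDemo class and its guess helper are hypothetical names; the results above suggest the ZIP header is not recognized, but that may differ between JDK versions, so the sketch simply prints whatever comes back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class GuessDemo {

    /** Asks the JDK to guess a content type from the given leading bytes. */
    static String guess(byte[] header) {
        // ByteArrayInputStream supports mark/reset, which the method requires.
        try (InputStream in = new ByteArrayInputStream(header)) {
            return URLConnection.guessContentTypeFromStream(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        byte[] zip = {0x50, 0x4B, 0x03, 0x04};
        System.out.println("GIF header: " + guess(gif));
        System.out.println("ZIP header: " + guess(zip));
    }
}
```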


Files.probeContentType

Implementation

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
try {
final Path path = Paths.get(basePath + "/" + fileName);
return Files.probeContentType(path);
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/vnd.rar
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: application/x-xz-compressed-tar
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/x-compressed-tar
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        application/vnd.rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

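Conclusion

This method works surprisingly well, but do not be fooled: there is one scenario in which it always fails. If the file has a wrong extension (one that does not match its content), it reports the file type of the extension. That should not happen often, but if you need to be strict, do not use this method.

Also, some people warn that this method does not work well on Windows.

Workaround: if you manage to remove the extension from the file name, it returns the correct value in all of the given scenarios.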
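
The workaround could be sketched as follows: copy the file to a temporary path whose name has no extension, probe the copy, and delete it afterwards. The ExtensionlessProbe class and probeIgnoringExtension method are hypothetical names. Whether Files.probeContentType inspects the content at all depends on the JDK and the platform, so the method may still return null.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ExtensionlessProbe {

    /** Probes a copy of the file whose name carries no extension. */
    static String probeIgnoringExtension(Path source) {
        Path copy = null;
        try {
            // An empty suffix yields a temp file name without any extension.
            copy = Files.createTempFile("probe-", "");
            Files.copy(source, copy, StandardCopyOption.REPLACE_EXISTING);
            return Files.probeContentType(copy); // null when the type is unknown
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (copy != null) {
                try {
                    Files.deleteIfExists(copy);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temporary copy
                }
            }
        }
    }

    public static void main(String[] args) {
        // Usage: java ExtensionlessProbe /path/to/archive.zip
        System.out.println(probeIgnoringExtension(Paths.get(args[0])));
    }
}
```

Copying doubles the disk I/O for large archives; renaming the downloaded file, or simply saving it without an extension in the first place, would avoid that.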


Apache Tika (tika-eval 1.18)

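This library seems to come in many flavors (app, server, eval, etc.), but many people on the web complain that it is a bit "dependency heavy".

Implementation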

import java.io.File;
import org.apache.tika.Tika;
try {
return new Tika().detect(new File(basePath + "/" + fileName));
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar-compressed
Rar with Zip extension is:       application/x-rar-compressed
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar-compressed
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

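Conclusion

All files were identified correctly, but just as it has its pros, it also has its cons.

Pros:

  • Maintained by Apache.
  • Is not fooled by the extension.

Cons:

  • Really heavy, especially if you only want to detect the file type. The tika-eval jar weighs over 40 MB.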


URLConnection

Implementation

import java.net.URL;
import java.net.URLConnection;
try {
final URL url = new URL("file://" + basePath + "/" + fileName);
final URLConnection urlConnection = url.openConnection();
return urlConnection.getContentType();
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       content/unknown
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         content/unknown
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: content/unknown
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        content/unknown
Zip without extension is:        content/unknown
7z without extension is:         content/unknown
Tar.xz without extension is:     content/unknown
Tar.gz without extension is:     content/unknown

Conclusion

It recognizes almost none of the compression formats and goes by the extension rather than the content.


SimpleMagic 1.14

The project seems to be updated at least once a year.

Implementation

import com.j256.simplemagic.ContentInfo;
import com.j256.simplemagic.ContentInfoUtil;
try {
final ContentInfoUtil util = new ContentInfoUtil();
final ContentInfo info = util.findMatch(basePath + "/" + fileName);
return info.getMimeType();
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: <EXCEPTION: null>
Tar.xz with Zip extension is:    <EXCEPTION: null>
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     <EXCEPTION: null>
Tar.gz without extension is:     application/x-gzip

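Conclusion

It works for almost all of our scenarios, but it fails to detect the more "obscure" compression formats such as Tar.xz (and throws an exception in the process).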


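MimeUtil 2.1.3

The project has not been modified since 2010, so do not expect support or updates. It is listed here only for the sake of completeness.

Implementation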

import eu.medsea.mimeutil.MimeUtil2;
try {
final MimeUtil2 mimeUtil = new MimeUtil2();
mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
return MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(basePath + "/" + fileName)).toString();
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/x-gzip

Conclusion

It recognizes some of the most popular file types, but fails on Tar.xz and 7z.


File - Command line

Not the prettiest solution, but it had to be tried: the Ubuntu file command.

Implementation

import java.io.BufferedReader;
import java.io.InputStreamReader;
try {
final Process process = Runtime.getRuntime().exec("file --mime-type " + basePath + "/" + fileName);
final BufferedReader stdInput = new BufferedReader(new InputStreamReader(process.getInputStream()));
String text = "";
String s;
while ((s = stdInput.readLine()) != null) {
text += s;
}
return text.split(": ")[1];
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

Conclusion

It works for all scenarios but, again, this relies on the file command being present on the system that runs the code.
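
If one did want to lean on the file command, a slightly more defensive variant might look like the sketch below. It passes the arguments as a list (so paths with spaces survive), uses --brief so there is no "name: type" output to split, and returns an empty Optional when the command is missing or fails. The FileCommandProbe class, the mimeType method and the five-second timeout are assumptions made up for illustration, not part of the test project.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

public class FileCommandProbe {

    /** Runs "file --brief --mime-type -- <path>" and returns its first output line. */
    static Optional<String> mimeType(Path path) {
        if (!Files.isRegularFile(path)) {
            return Optional.empty(); // avoid parsing the command's error text
        }
        // "--" marks the end of options, so a path starting with '-' is safe.
        ProcessBuilder builder = new ProcessBuilder(
                "file", "--brief", "--mime-type", "--", path.toString());
        builder.redirectErrorStream(true);
        try {
            Process process = builder.start();
            String firstLine;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                firstLine = reader.readLine();
            }
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return Optional.empty();
            }
            if (process.exitValue() != 0 || firstLine == null) {
                return Optional.empty();
            }
            return Optional.of(firstLine.trim());
        } catch (IOException e) {
            return Optional.empty(); // e.g. the file command is not installed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Usage: java FileCommandProbe /path/to/archive
        System.out.println(mimeType(Paths.get(args[0])).orElse("<unknown>"));
    }
}
```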
