file_get_contents on Word Doc

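I am using the following code to try to find a "term" in a Word document with PHP. Of course, this is not the right way to open a binary file such as a Word document, but the malformed string in "$fileContent" is good enough for me. However, the "stripos" function does not work as expected when searching for a term that is present in the document.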

$fileContent = file_get_contents($filePath);
$posFileContent = stripos($fileContent,$term);
if ($posFileContent !== false) {
    echo "Found!!";
    $value += $FACTOR_SEC;
}

Observation: a var_dump of $fileContent shows the correct content of the document, with malformed formatting of course, but the term is there.

More information:

var_dump($term)

string(10) "innovative"

var_dump($fileContent)

string(10240) " ࡱ ;�� Root Entry ��!�� FMicrosoft Word-Dokument MSWordDocWord.Document.8�9�q [Z��ZNormal1$*$3B*OJQJCJmH sH KHPJnHtH^JaJ_H9BA@ BAbsatz-StandardschriftartF FHeading x$OJQJCJPJ^JaJ.B.Text body x/List^J@"@Caption x x$CJ6^JaJ]& 2&Index$^Jd ddPG Times New Roman5 Symbol3 & ArialG Times New Roman5 SimSun5 MangalG Microsoft Yahei5 MangalB h " 5_ 5_' 0 0 Oh +' 0|8 @ LXd p 0@@@ { @ M 0 Caolan80 $d b Lambda Development About us Lambda develops innovative software products that lead our customers on the path to success. We focus on mobile applications, web tools, and management systems. Our team is involved in the whole process, starting where the idea is born, through the product specification, until its implementation in the appropriate technology. &*:> j l CJ>*5aJ\OJQJ/:;B*ph""CJ@ 6>*5aJ\OJQJCJ$>*5aJ$\CJ8>5aJ8\ (<> $a$"/=! n" n# n$ n3P(20 ՜.+, D ՜.+, \ Root Entry F CompObj jOle 1Table SummaryInformation( WordDocument $DocumentSummaryInformation 8 t"

After two days of struggling, here is the answer:

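Microsoft Word's encoding adds a "\0" character after every "real character", so the word "hello" is actually stored as "h\0e\0l\0l\0o\0". This matches how UTF-16LE encodes ASCII characters. var_dump prints the "\0" bytes invisibly, which is why the term looks present in the dump while stripos cannot find it as a contiguous sequence of bytes.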
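To make the byte layout visible, here is a minimal sketch. It assumes the text is stored as UTF-16LE and that the mbstring extension is loaded; the variable names $plain and $stored are only for illustration and are not taken from the code above.

```php
<?php
// Assumption: the document stores its text as UTF-16LE, and mbstring is available.
$plain  = 'hello';
$stored = mb_convert_encoding($plain, 'UTF-16LE', 'UTF-8');

// Five characters occupy ten bytes: each letter is followed by a NUL byte.
echo strlen($stored), "\n";
echo bin2hex($stored), "\n"; // 680065006c006c006f00

// The plain term is not a contiguous byte sequence inside the stored text,
// so a direct search fails even though the letters are all there.
var_dump(stripos($stored, $plain)); // bool(false)
```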

The way to search the document is:

$fileContent = file_get_contents($filePath);
// Build the search term with a "\0" after every character,
// matching the way the text is stored in the document.
$termArray = str_split($term);
$newTerm = '';
foreach ($termArray as $charTerm) {
    $newTerm = $newTerm.$charTerm;
    $newTerm = $newTerm."\0";
}
if (stripos($fileContent,$newTerm) !== false) {
    // Term found in doc
}
