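Bash/Perl: Unicode handling problem during string replacement in .htm files

I have a bash script that uses Perl's substitution operator to replace a string in all .htm files in a given directory.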

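# The function has to be defined before the loop that calls it.
function ReplaceString {
    perl -pi -e 's/string1/string2/g' "$1"
    # perl -i without a suffix keeps no backup, so this is only a safeguard.
    rm -f "$1.bak"
}

find "$files_dir" -name '*.htm' | while IFS= read -r line; do
    ReplaceString "$line"
done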

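The problem is that some of the files contain Unicode characters (for example, '). When a file contains any Unicode character, it is not processed and no string replacement takes place. When I remove the Unicode characters from the file, the replacement works.

I am looking for a way to make my script Unicode-aware, so that it can process any file whether or not it contains Unicode.

I have also tried using sed instead of Perl: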

sed -i 's/string1/string2/g' "$1"

This gave me the same problem.

Example of a non-working file (trimmed):

<html>
<head><meta http-equiv=Content-Type content="text/html; charset=unicode"></head>
<style>
     <!-- 
     /* Font definitions (generated by MS Word) */
     @list l0:level3
     {mso-level-text:;}
      -->
</style>
<body>
     <p>string1</p>
</body>
</html>

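As ikegami and n.m pointed out, the .htm files (generated with Microsoft Word) are encoded in UTF-16LE. The Perl substitution does not understand this encoding: it works on the raw bytes, and in UTF-16LE every ASCII character is followed by a NUL byte, so the byte sequence string1 never appears in the file and the pattern never matches.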
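
A quick way to check whether a particular file is affected is to look at its first two bytes. The sketch below assumes that the Word-generated files begin with the UTF-16LE byte order mark (FF FE); that is an assumption, since a UTF-16LE file can also be written without one. is_utf16le is a hypothetical helper name, not part of the original script.

```shell
# Succeeds (exit status 0) if the file starts with the UTF-16LE
# byte order mark FF FE. Assumption: the files carry a BOM at all.
is_utf16le() {
    # Dump the first two bytes as hex and strip od's padding.
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "fffe" ]
}
```

Inside the loop this could be used as: if is_utf16le "$line"; then echo "$line is UTF-16LE"; fi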

I solved the problem by re-saving the non-working files from MS Word with UTF-8 encoding.
