Delete all files in a directory that contain non-UTF-8 characters

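I have a set of data files, but I only want to work with the ones that are valid UTF-8, so I need to delete every file that contains non-UTF-8 characters.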

When I try to use these files, I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined>

and

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte

My code:

import io
import os

class Corpus:
    def __init__(self, path_to_dir=None):
        self.path_to_dir = path_to_dir if path_to_dir else []

    def emails_as_string(self):
        for file_name in os.listdir(self.path_to_dir):
            if not file_name.startswith("!"):
                # Strict UTF-8 decoding: this is where UnicodeDecodeError is raised
                with io.open(self.add_slash(self.path_to_dir) + file_name, 'r', encoding='utf-8') as body:
                    yield [file_name, body.read()]

    def add_slash(self, path):
        if path.endswith("/"):
            return path
        return path + "/"

I get the error here, at yield [file_name, body.read()], and at list_of_emails = mailsrch.findall(text), but when I work with UTF-8 files everything works fine.

I suspect you want to use the errors='ignore' argument of bytes.decode. See http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode.
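To make the effect of the errors argument concrete, here is a minimal sketch. The byte string is made up for illustration; it contains one byte (0xe9) that is not valid UTF-8 in that position. It compares the default strict behaviour with errors='ignore' and errors='replace'.

```python
# Hypothetical sample data: 0xe9 followed by a space is not valid UTF-8.
raw = b"caf\xe9 au lait"

# Strict decoding is the default and raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" silently drops the bytes that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD (the replacement character) instead.
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok, repr(ignored), repr(replaced))
```

Note that 'ignore' loses data without any warning, so it probably only suits cases where the undecodable bytes really do not matter.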

Edit:

Here is an example showing one good way to do it:

for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        with open(fullpath, 'r', encoding='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]

By using os.path.join, you can get rid of your add_slash method and make sure the code works across platforms.
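If the goal really is to delete the non-UTF-8 files, as the title says, rather than to read past the bad bytes, one possible approach is to try a strict decode of each file and remove the ones that fail. The following is only a sketch: the helper names is_valid_utf8 and remove_non_utf8_files and the dry_run flag are made up for this example, and it assumes each file is small enough to read into memory.

```python
import os

def is_valid_utf8(path):
    """Return True if the entire file decodes as strict UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True

def remove_non_utf8_files(path_to_dir, dry_run=True):
    """Return the names of files that are not valid UTF-8.

    With dry_run=True (the default) nothing is deleted; the files are
    only reported. Pass dry_run=False to actually remove them.
    """
    rejected = []
    for file_name in sorted(os.listdir(path_to_dir)):
        full_path = os.path.join(path_to_dir, file_name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories and other non-file entries
        if not is_valid_utf8(full_path):
            rejected.append(file_name)
            if not dry_run:
                os.remove(full_path)
    return rejected
```

Running it once with the default dry_run=True and checking the returned list before calling it again with dry_run=False is a cheap safeguard, since the deletion cannot be undone.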