Mapping the articles in a folder to a list



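I have a folder containing several articles, and I want to map the text of each article into one common list so that I can use that list for a tf-idf transformation. For example:

folder = [article 1, article 2, article 3]

into a list

list = ['text_of_article1', 'text_of_article2', 'text_of_article3']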

def multiple_file(arg):     #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno
    path = arg
    files = glob.glob(path)
    list = []               #list where file string would be appended
    for item in files:    
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list

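When I set the path to the folder that contains the articles, I get an empty list. However, when I point it directly at a single article, that article shows up in the list. How can I get all of them mapped into my list? :S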

Here is the code; I'm not sure if this is what you had in mind:

def multiple_files(arg):     #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno, os
    path = arg
    files = os.listdir(path)
    list = []               #list where file string would be appended
    for item in files:    
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list

Here is the error:

Traceback (most recent call last):
  File "<ipython-input-7-13e1457699ff>", line 1, in <module>
    x = multiple_files(path)
  File "<ipython-input-5-6a8fab5c295f>", line 10, in multiple_files
    with open(item) as f: # No need to specify 'r': this is the default.
IOError: [Errno 2] No such file or directory: 'u02.txt'

u02.txt is actually the first entry in the newly created list of files.

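Suppose path == "/home/docs/guzdeh". If you just call glob.glob(path), you only get [path] back, because nothing else matches that pattern. You want glob.glob(path + "/*") to get everything in that directory, or glob.glob(path + "/*.txt") to get all the .txt files.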

Or you could use import os; os.listdir(path), which I think makes more sense.
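To make the difference concrete, here is a small sketch that compares the three glob patterns with os.listdir. The temporary folder and the file names (a.txt, b.txt, notes.md) are made up for the demonstration and are not from the question.

```python
import glob
import os
import tempfile

# Hypothetical demo folder: two .txt files and one other file.
folder = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "notes.md"):
    with open(os.path.join(folder, name), "w") as f:
        f.write("text of " + name)

# The bare path matches only the folder itself, not its contents.
bare = glob.glob(folder)

# A wildcard after the folder matches every entry inside it.
everything = sorted(glob.glob(os.path.join(folder, "*")))

# Restrict the pattern to pick up only the .txt files.
txt_only = sorted(glob.glob(os.path.join(folder, "*.txt")))

# os.listdir returns bare file names, not full paths.
names = sorted(os.listdir(folder))
```

Note that glob returns paths that already include the folder, so they can be passed to open() directly, while the names from os.listdir cannot.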

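Update:

Regarding the new code: the problem is that os.listdir only returns names relative to the directory being listed. So you need to join the directory and the name so that Python knows which file you mean. Add: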

item = os.path.join(path, item)

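before the call to open(item). You might also want to name your variables better; for example, list shadows the built-in list type.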
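Putting it together, one possible version of the function could look like the sketch below. The name read_articles, the sorting by file name, and the demo folder are my own choices for illustration. It skips subdirectories with os.path.isfile instead of catching EISDIR, which is a swapped-in approach rather than what the question's code does.

```python
import os
import tempfile

def read_articles(folder):
    """Return the text of every regular file in folder, ordered by file name."""
    texts = []
    for name in sorted(os.listdir(folder)):
        # os.listdir gives bare names, so join them with the folder first.
        full_path = os.path.join(folder, name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories and anything that is not a regular file
        with open(full_path) as f:
            texts.append(f.read())
    return texts

# Hypothetical demo: two articles plus a subdirectory that should be ignored.
demo_folder = tempfile.mkdtemp()
with open(os.path.join(demo_folder, "u01.txt"), "w") as f:
    f.write("text one")
with open(os.path.join(demo_folder, "u02.txt"), "w") as f:
    f.write("text two")
os.mkdir(os.path.join(demo_folder, "drafts"))

articles = read_articles(demo_folder)
```

The resulting list of strings is the kind of input a tf-idf step would take, one document per entry.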
