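I have a folder containing several articles, and I want to map the text of each article into one common list so that I can use that list for a tf-idf transformation. For example:
folder = [article1, article2, article3]
into the list
list = ['text_of_article1', 'text_of_article2', 'text_of_article3']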
def multiple_file(arg): #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno
    path = arg
    files = glob.glob(path)
    list = [] #list where file string would be appended
    for item in files:
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list
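When I set the path to the folder that contains the articles, I get an empty list. However, when I set it directly to a single article, that article shows up in the list. How can I map all of them into my list? :S
Here is the code, not sure if this is what you had in mind: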
def multiple_files(arg): #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno, os
    path = arg
    files = os.listdir(path)
    list = [] #list where file string would be appended
    for item in files:
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list
This is the error:
Traceback (most recent call last):
  File "<ipython-input-7-13e1457699ff>", line 1, in <module>
    x = multiple_files(path)
  File "<ipython-input-5-6a8fab5c295f>", line 10, in multiple_files
    with open(item) as f: # No need to specify 'r': this is the default.
IOError: [Errno 2] No such file or directory: 'u02.txt'
Article 2 is actually the first item in the newly created list.
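Suppose path == "/home/docs/guzdeh". If you just say glob.glob(path), you will only get [path], because nothing else matches the pattern. You want glob.glob(path + "/*") to get everything in that directory, or glob.glob(path + "/*.txt") to get all the txt files.
Or you could use import os; os.listdir(path), which I think makes more sense.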
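To make the difference between the two patterns concrete, here is a minimal sketch. The temporary folder and the file names u01.txt and u02.txt are stand-ins invented for the demonstration, not the real article folder. Because the bare folder path contains no wildcard, glob returns just the folder itself, and opening a folder fails with EISDIR, which the original function silently ignores. That would explain the empty list.

```python
import glob
import os
import tempfile

# Hypothetical folder holding two small "articles", created only for this demo.
folder = tempfile.mkdtemp()
for name in ("u01.txt", "u02.txt"):
    with open(os.path.join(folder, name), "w") as f:
        f.write("text of " + name)

# No wildcard in the pattern: the only possible match is the folder itself.
only_folder = glob.glob(folder)

# With a wildcard, glob matches the files inside the folder.
# sorted() is used because glob does not guarantee any particular order.
txt_files = sorted(glob.glob(os.path.join(folder, "*.txt")))

print(only_folder == [folder])
print([os.path.basename(p) for p in txt_files])
```

Note that glob returns paths that already include the folder, so they can be passed straight to open().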
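Update:
Regarding the new code, the problem is that os.listdir only returns names relative to the directory being listed. So you need to join the two so that Python knows which file you are talking about. Add:
item = os.path.join(path, item)
before trying open(item). You may also want to choose better variable names; list, for instance, shadows the built-in list type.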
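Putting the pieces together, one possible version of the function might look like the sketch below. The names read_articles, texts, and demo_folder are my own choices for illustration, and the temporary folder with its two files only exists so the example can run on its own. It checks os.path.isfile up front rather than catching EISDIR, which is a slightly different approach from the original code.

```python
import os
import tempfile

def read_articles(folder):
    """Return the text of every regular file in `folder` as a list of strings."""
    texts = []
    # sorted() gives a predictable order; os.listdir itself promises none.
    for name in sorted(os.listdir(folder)):
        # os.listdir returns bare names, so join them with the folder first.
        full_path = os.path.join(folder, name)
        if os.path.isfile(full_path):  # skip subdirectories instead of catching EISDIR
            with open(full_path) as f:
                texts.append(f.read())
    return texts

# Hypothetical demo data: two articles plus one subdirectory that should be skipped.
demo_folder = tempfile.mkdtemp()
with open(os.path.join(demo_folder, "u01.txt"), "w") as f:
    f.write("first article")
with open(os.path.join(demo_folder, "u02.txt"), "w") as f:
    f.write("second article")
os.mkdir(os.path.join(demo_folder, "drafts"))

articles = read_articles(demo_folder)
print(articles)
```

The resulting list of strings should be usable as the input documents for a tf-idf step.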