How do I read multiple .ann files (brat annotations) from one folder into a single pandas DataFrame?

I can read a single .ann file into a pandas DataFrame like this:

import pandas as pd
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()

But I don't know how to read multiple .ann files into one pandas DataFrame. I tried using concat, but the result was not what I expected.

How can I read multiple .ann files into one pandas DataFrame?

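It sounds like you need to use glob to pull in all the .ann files from a folder and append each one to a list of DataFrames. After that, you can join, merge, or concatenate them as required.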
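Since the question mentions that concat gave an unexpected result, here is a minimal sketch of how the `axis` argument changes what `pd.concat` returns. The two frames and their column names (`id`, `annotation`) are made up for illustration and only stand in for two parsed .ann files.

```python
import pandas as pd

# Two made-up frames standing in for two parsed .ann files (illustrative data only).
a = pd.DataFrame({"id": ["T1", "T2"],
                  "annotation": ["Person 0 5\tAlice", "Person 10 13\tBob"]})
b = pd.DataFrame({"id": ["T1"],
                  "annotation": ["Location 0 6\tLondon"]})

# Stack the rows into one long table; ignore_index renumbers the rows 0..n-1.
stacked = pd.concat([a, b], ignore_index=True)

# Place the frames side by side instead; rows are aligned on the index and
# the shorter frame is padded with NaN.
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape)       # (3, 2)
print(side_by_side.shape)  # (2, 4)
```

If the goal is one row per annotation across all files, the row-wise form (the default `axis=0`) is probably the one you want.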

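I don't know your exact requirements, but the code below should get you close. As written, the script assumes there is a subfolder named files in the directory where you run the Python script, and it pulls in every .ann file in that subfolder (it does not look at any other files). Every line is commented, so review and modify it as needed.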

import glob

import pandas as pd

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create an empty list to hold the DataFrames read from the files found
dfs = []
# for each file in the path above ending in .ann
for file in all_files:
    # read the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new frame (temporary, one per loop iteration) to the end of the list
    dfs.append(df)

# at this point you have a list of frames, one list item per .ann file,
# like [annFile1, annFile2, ...] (just not with those names)
# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or there is only one frame, so take it out of the list
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

# check what you've got
print(df.head())
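One thing the concatenated frame no longer tells you is which file each row came from. A minimal sketch of one way to keep that information follows, assuming every file is non-empty and parses into the same two columns. The helper name `read_ann_folder` and the column names `id`, `annotation`, and `source_file` are hypothetical, chosen only for this example.

```python
from pathlib import Path

import pandas as pd


def read_ann_folder(folder):
    """Hypothetical helper: load every .ann file in `folder` into one DataFrame."""
    frames = []
    # sorted() makes the file order, and so the row order, reproducible
    for path in sorted(Path(folder).glob("*.ann")):
        # the regex separator splits off the leading ID; column 0 is the
        # empty text before the match, so it is dropped
        df = pd.read_csv(path, sep=r"^([^\s]*)\s", engine="python",
                         header=None).drop(0, axis=1)
        # column names are assumptions made for this sketch
        df.columns = ["id", "annotation"]
        # remember which file each row was read from
        df["source_file"] = path.name
        frames.append(df)
    if not frames:
        # no .ann files found: return an empty frame with the same columns
        return pd.DataFrame(columns=["id", "annotation", "source_file"])
    return pd.concat(frames, ignore_index=True)
```

You would call it as `df = read_ann_folder("./files")`. Note that `pd.read_csv` raises `EmptyDataError` on a completely empty file, so you may want to skip or catch those if your folder contains any.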
