PDF MCQ to pandas DataFrame

Is there a way to convert text like this from a PDF into a pandas DataFrame? Text:

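The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade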

Expected df:

The theory of comparative cost advantage theory was Introduced by-----                  Alfred Marshall     David Ricardo     Taussig     Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption     Common Market       Equal cost        Monopoly    Free Trade

Split the text into rows on the newline character, then split each row into columns with a regular expression.

import pandas as pd

rawtxt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""
# One row per line of text
df = pd.DataFrame({"rawtxt": rawtxt.split("\n")})
# Split each row on the option labels "a)", "b)", ... into separate columns
df.rawtxt.str.split(r"[a-z]\)").apply(pd.Series)

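Output:

	0	1	2	3	4
0	The theory of comparative cost advantage theory was Introduced by-----	Alfred Marshall	David Ricardo	Taussig	Heberler
1	The Ricardo’s comparative cost theory is based on which of the following assumption	Common Market	Equal cost	Monopoly	Free Trade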
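
The split above leaves the whitespace around each label inside the cells and keeps the default integer column names. A minimal sketch of one way to tidy that up, assuming every question has exactly four options labelled a) to d) and that each label is preceded by whitespace (the column names and the variable names `lines` and `parts` are my own choice, not part of the question):

```python
import pandas as pd

raw_txt = (
    "The theory of comparative cost advantage theory was Introduced by----- "
    "a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler\n"
    "The Ricardo's comparative cost theory is based on which of the following assumption "
    "a) Common Market b) Equal cost c) Monopoly d) Free Trade"
)

# One element per line of text
lines = pd.Series(raw_txt.split("\n"))

# Split on " a) ", " b) ", ... and swallow the whitespace around each label.
# Assumption: labels are a single letter a-d followed by ")".
# expand=True returns a DataFrame instead of a Series of lists.
parts = lines.str.split(r"\s+[a-d]\)\s*", expand=True)

# Illustrative column names (assumed, not given in the question)
parts.columns = ["Question", "A", "B", "C", "D"]
print(parts)
```

If a question itself contains something like " a)" this pattern will split in the wrong place, so it is only as reliable as the text extracted from the PDF.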

Assuming you can extract the text from the PDF, with each sentence/question on its own line, you can use a regex as follows:

import re
import pandas as pd

# Group 1 is the question; groups 2-5 are the options, each still prefixed with its label
regex = r"(.+)(a\).+).+(b\).+).+(c\).+).+(d\).+)"
pdf_txt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler\n
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade\n"""
matches = re.finditer(regex, pdf_txt, re.MULTILINE)
data = {1: [], 2: [], 3: [], 4: [], 5: []}
for match_num, match in enumerate(matches, start=1):
    # Append each captured group to its own column list
    for group_num in range(0, len(match.groups())):
        data[group_num + 1].append(match.group(group_num + 1))

df = pd.DataFrame(data)
df.columns = ['Question', 'A', "B", "C", "D"]
print(df.head())
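
With the regex above, each option still carries its label (for example "b) David Ricardo"), whereas the expected df has bare option text. A rough sketch of a variant that drops the labels, using only the standard `re` module; the pattern `MCQ_PATTERN` and the helper `parse_mcq_lines` are hypothetical names of mine, and the sketch assumes each question sits on one line with options labelled a) to d):

```python
import re

# Hypothetical pattern: lazy named groups capture the question and each option,
# while the "a)" ... "d)" labels and the surrounding whitespace are left out.
MCQ_PATTERN = re.compile(
    r"^\s*(?P<Question>.+?)\s+a\)\s*(?P<A>.+?)\s+b\)\s*(?P<B>.+?)"
    r"\s+c\)\s*(?P<C>.+?)\s+d\)\s*(?P<D>.+?)\s*$"
)


def parse_mcq_lines(text):
    """Return one dict per line that looks like a four-option question."""
    rows = []
    for line in text.splitlines():
        match = MCQ_PATTERN.match(line)
        if match:  # blank or malformed lines are skipped
            rows.append(match.groupdict())
    return rows


pdf_txt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler

The Ricardo's comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade
"""

rows = parse_mcq_lines(pdf_txt)
for row in rows:
    print(row)
```

The list of dicts can then be passed to `pd.DataFrame(rows)`, which gives the columns Question, A, B, C and D directly.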
