PDF MCQ to pandas DataFrame



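Is there a way to convert text like this from a PDF into a pandas DataFrame? Text:

  1. The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
  2. The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade

Expected df: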

The theory of comparative cost advantage theory was Introduced by-----                  Alfred Marshall     David Ricardo     Taussig     Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption     Common Market       Equal cost        Monopoly    Free Trade
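  • Split into rows on the newline character
  • Split each row into columns with a regular expression
import pandas as pd

rawtxt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""
# One row per question: split the raw text on newlines
df = pd.DataFrame({"rawtxt": rawtxt.split("\n")})
# Split each row on the option markers "a)", "b)", ... and spread the pieces into columns
df.rawtxt.str.split(r"[a-z]\)").apply(pd.Series)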

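Output

                                                                                       0                  1                2           3            4
0               The theory of comparative cost advantage theory was Introduced by-----    Alfred Marshall    David Ricardo     Taussig     Heberler
1  The Ricardo’s comparative cost theory is based on which of the following assumption      Common Market       Equal cost    Monopoly   Free Trade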
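
The split above leaves stray spaces around each cell and numeric column labels. Here is a minimal sketch of one way to tidy that up, assuming every line contains exactly four options marked a) to d). The column names and the `parts` variable are illustrative and not part of the original answer.

```python
import pandas as pd

rawtxt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""

# One Series element per question line
lines = pd.Series(rawtxt.split("\n"))

# Split on "a)" .. "d)" and swallow the whitespace around each marker;
# expand=True returns a DataFrame instead of a Series of lists.
parts = lines.str.split(r"\s*[a-d]\)\s*", expand=True)

# Assumed column names, chosen to mirror the expected layout
parts.columns = ["Question", "A", "B", "C", "D"]

# Remove any whitespace left at the start or end of each cell
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

If a question itself contains text such as "a)", this split would produce extra columns, so it only suits input as regular as the sample.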

Assuming you are able to extract the text from the PDF, with each sentence/question separated by a newline, you can use a regex as follows:

import re
import pandas as pd

# Group 1 is the question, groups 2-5 are the options;
# the markers "a)" .. "d)" are matched but not captured.
regex = r"(.+?)\s*a\)\s*(.+?)\s*b\)\s*(.+?)\s*c\)\s*(.+?)\s*d\)\s*(.+)"
pdf_txt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade
"""
matches = re.finditer(regex, pdf_txt, re.MULTILINE)
data = {1: [], 2: [], 3: [], 4: [], 5: []}
for match in matches:
    # Collect each captured group into its own column
    for group_num in range(1, len(match.groups()) + 1):
        data[group_num].append(match.group(group_num).strip())

df = pd.DataFrame(data)
df.columns = ['Question', 'A', 'B', 'C', 'D']
print(df.head())
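
The same regex approach can be sketched more compactly with `Series.str.extract` and named groups, which become the column names directly. This is only an illustration under assumptions: the optional leading number ("1.", "2.") is a guess based on the numbered sample in the question, and lines that do not match the pattern are simply dropped.

```python
import pandas as pd

pdf_txt = """1. The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
2. The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""

pattern = (
    r"^\s*(?:\d+\.\s*)?"              # optional question number such as "1." (assumed)
    r"(?P<Question>.+?)\s*a\)\s*"     # question text up to the "a)" marker
    r"(?P<A>.+?)\s*b\)\s*"
    r"(?P<B>.+?)\s*c\)\s*"
    r"(?P<C>.+?)\s*d\)\s*"
    r"(?P<D>.+?)\s*$"
)

# One Series element per line of extracted PDF text
lines = pd.Series(pdf_txt.splitlines())

# Named groups become column names; non-matching lines yield NaN rows
df = lines.str.extract(pattern).dropna().reset_index(drop=True)

print(df)
```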
