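Following the post "Python module for converting PDF to text", I scraped a PDF file and extracted its data. During scraping, the data gets split across separate items (for example, the first question below is broken into two strings). How can I merge this data and extract it as a list of dictionaries?
For example: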
content = ['Sample Questions Set 1 ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '01 Which function among the following can’t be accessed outside ', 'the class in java in same package? ', 'A. public void show(). ', 'B. void show(). ', 'C. protected show(). ', 'D. static void show(). ', '02 How many private member functions are allowed in a class ? ', 'A. Only 1 ', 'B. Only 7 ', 'C. Only 255 ', 'D. As many as required ', '03 Can main() function be made private? ', 'A. Yes, always. ', 'B. Yes, if program doesn’t contain any classes. ', 'C. No, because main function is user defined. ', 'D. No, never. ', '04 If private member functions are to be declared in C++ then_________. ', 'A. private: ', 'B. private ', 'C. private(private member list) ', 'D. private :- <private members> ', '05 If a function in java is declared private then it _________. ', 'A. Can’t access the standard output ', 'B. Can access the standard output. ', 'C. Can’t access any output stream. ', 'D. Can access only the output streams. ']
Expected output:
questions = [{'Qid':01,'Qtext':'Which function among the following can’t be accessed outside the class in java in same package?','A.':'public void show().','B.':' void show().','C.':'protected show().','D.':'static void show()'},{'Qid':02,....},{...},{...},{...}]
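The following:
questions = []
for s in content:
    s = s.lstrip()  # drop leading whitespace; blank padding lines become ''
    if s:
        if s[0].isdigit():
            # a leading number starts a new question
            questions.append({'Qid': len(questions) + 1, 'Qtext': s.split(maxsplit=1)[1]})
        elif s[0].isalpha() and s[1:2] == '.':
            # 'A.', 'B.', ... is an option of the current question
            questions[-1][s[:2]] = s.split(maxsplit=1)[1]
        elif questions:
            # anything else continues the current question's text
            questions[-1]['Qtext'] += s
questions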
gives:
[{'Qid': 1, 'Qtext': 'Which function among the following can’t be accessed outside the class in java in same package? ', 'A.': 'public void show(). ', 'B.': 'void show(). ', 'C.': 'protected show(). ', 'D.': 'static void show(). '}, {'Qid': 2, 'Qtext': 'How many private member functions are allowed in a class ? ', 'A.': 'Only 1 ', 'B.': 'Only 7 ', 'C.': 'Only 255 ', 'D.': 'As many as required '}, {'Qid': 3, 'Qtext': 'Can main() function be made private? ', 'A.': 'Yes, always. ', 'B.': 'Yes, if program doesn’t contain any classes. ', 'C.': 'No, because main function is user defined. ', 'D.': 'No, never. '}, {'Qid': 4, 'Qtext': 'If private member functions are to be declared in C++ then_________. ', 'A.': 'private: ', 'B.': 'private ', 'C.': 'private(private member list) ', 'D.': 'private :- <private members> '}, {'Qid': 5, 'Qtext': 'If a function in java is declared private then it _________. ', 'A.': 'Can’t access the standard output ', 'B.': 'Can access the standard output. ', 'C.': 'Can’t access any output stream. ', 'D.': 'Can access only the output streams. '}]
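The result above keeps each line's trailing space, whereas the expected output in the question has the values trimmed. As a minimal sketch of one way to get trimmed values, the variant below strips every line and uses two regular expressions to tell question lines from option lines. The names `parse_questions`, `QUESTION_RE` and `OPTION_RE` are hypothetical, and it makes two assumptions that differ from the code above: `Qid` is taken from the number printed in the text rather than from a running count, and a wrapped line is appended to whichever field was filled last (question text or option).

```python
import re

# Hypothetical patterns: a question starts with digits, an option with "A." to "D."
QUESTION_RE = re.compile(r"^(\d+)\s+(.*)$")
OPTION_RE = re.compile(r"^([A-D]\.)\s*(.*)$")


def parse_questions(lines):
    """Group scraped lines into one dict per question, trimming whitespace."""
    questions = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank padding lines
        question = QUESTION_RE.match(line)
        option = OPTION_RE.match(line)
        if question:
            # assumption: use the printed number as the id
            questions.append({"Qid": int(question.group(1)), "Qtext": question.group(2)})
        elif option and questions:
            questions[-1][option.group(1)] = option.group(2)
        elif questions:
            # assumption: a wrapped line belongs to the most recently added field
            last_key = list(questions[-1])[-1]
            questions[-1][last_key] += " " + line
    return questions


# Shortened excerpt of `content` from the question
sample = [
    "Sample Questions Set 1 ", " ",
    "01 Which function among the following can’t be accessed outside ",
    "the class in java in same package? ",
    "02 How many private member functions are allowed in a class ? ",
    "A. Only 1 ", "B. Only 7 ", "C. Only 255 ", "D. As many as required ",
]
print(parse_questions(sample))
```

Lines that appear before the first question, such as the title, are ignored because there is nothing to attach them to yet.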
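This merges them into a list of questions:
import re

questions = []
loc = 0  # index of the question currently being filled
for i in range(len(content)):
    res = content[i]
    prefix = res[0]
    if prefix.isalpha() and res[1] == '.':
        # option line: strip the leading "A. " and store the rest under "A."
        questions[loc][prefix + "."] = re.sub(r"^[ABCD]\.\s*", '', res)
        if prefix == "D":
            loc += 1  # "D" is the last option, so move on to the next question
    elif prefix.isdigit():
        # question line: strip the leading number
        questions.append({'Qid': loc + 1, 'Qtext': re.sub(r"^\d+\s+", '', res)})
    elif len(questions) != 0:
        questions[loc]['Qtext'] += res  # continuation of a question split across lines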
Result:
[{'Qid': 1, 'Qtext': 'Which function among the following can’t be accessed outside the class in java in same package? ', 'A.': 'public void show(). ', 'B.': 'void show(). ', 'C.': 'protected show(). ', 'D.': 'static void show(). '}, {'Qid': 2, 'Qtext': 'How many private member functions are allowed in a class ? ', 'A.': 'Only 1 ', 'B.': 'Only 7 ', 'C.': 'Only 255 ', 'D.': 'As many as required '}, {'Qid': 3, 'Qtext': 'Can main() function be made private? ', 'A.': 'Yes, always. ', 'B.': 'Yes, if program doesn’t contain any classes. ', 'C.': 'No, because main function is user defined. ', 'D.': 'No, never. '}, {'Qid': 4, 'Qtext': 'If private member functions are to be declared in C++ then_________. ', 'A.': 'private: ', 'B.': 'private ', 'C.': 'private(private member list) ', 'D.': 'private :- <private members> '}, {'Qid': 5, 'Qtext': 'If a function in java is declared private then it _________. ', 'A.': 'Can’t access the standard output ', 'B.': 'Can access the standard output. ', 'C.': 'Can’t access any output stream. ', 'D.': 'Can access only the output streams. '}]
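The prefix stripping in this approach can be checked in isolation. The sketch below assumes the intended patterns are `^[ABCD]\.\s*` for option lines and `^\d+\s+` for question lines (with an escaped dot and the `\s` and `\d` character classes), and applies them to two strings taken from `content`. The names `OPTION_PREFIX` and `NUMBER_PREFIX` are hypothetical.

```python
import re

# Assumed patterns: remove only a leading "A. ".."D. " or a leading "02 " style number
OPTION_PREFIX = re.compile(r"^[ABCD]\.\s*")
NUMBER_PREFIX = re.compile(r"^\d+\s+")

option_text = OPTION_PREFIX.sub("", "C. Only 255 ")
question_text = NUMBER_PREFIX.sub(
    "", "02 How many private member functions are allowed in a class ? "
)

print(repr(option_text))    # 'Only 255 '
print(repr(question_text))  # 'How many private member functions are allowed in a class ? '
```

Anchoring both patterns with `^` limits the substitution to the start of the line, so digits or capital letters later in the text are left alone.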