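I want to use NLP in Python to extract the educational institution, degree, year of passing, and grades (CGPA/GPA/percentage) from text. For example, given this input: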
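NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE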
I want this output:
[{
"Institute": "NBN Sinhgad School Of Engineering",
"Degree": "Bachelor of Engineering Computer Science",
"Grades": "8.78",
"Year of Passing": "2020"
}, {
"Institute": "Vidya Bharati Chinmaya Vidyalaya",
"Degree": "Intermediate-PCM,Economics",
"Grades": "88.8",
"Year of Passing": "2016"
}, {
"Institute": "Vidya Bharati Chinmaya Vidyalaya",
"Degree": "Matriculation,CBSE",
"Grades": "8.6",
"Year of Passing": "2014"
}]
Can this be done without training a custom NER model? Is there a pre-trained NER model that can do this?
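Yes, you can parse this data without training a custom NER model. You have to build custom rules to parse it instead.
For your example, you can extract the data with regular expressions and pattern recognition, for example by relying on the fact that the institute always comes before the year range. If the fields do not appear in a fixed order, you have to fall back on keywords such as school, institute, college, and so on. Which approach fits depends on your data.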
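To illustrate the keyword fallback mentioned above, here is a minimal sketch. It assumes institute names are written as a run of capitalized words and contain at least one keyword. The names `INSTITUTE_KEYWORDS` and `find_institutes` are hypothetical, and the keyword list (including "vidyalaya", added only to cover the sample data) is an assumption you would extend for your own input.

```python
import re

# Hypothetical keyword list; extend it to match your own data.
INSTITUTE_KEYWORDS = {"school", "institute", "college", "university", "vidyalaya"}

# A run of capitalized words, optionally joined by a lowercase "of" / "and".
CAPITALIZED_RUN = re.compile(
    r"\b[A-Z][A-Za-z]*(?:\s+(?:(?:of|and)\s+)?[A-Z][A-Za-z]*)*"
)

def find_institutes(text):
    """Return capitalized phrases that contain at least one institute keyword."""
    institutes = []
    for match in CAPITALIZED_RUN.finditer(text):
        phrase = match.group(0)
        # Keep the phrase only if one of its words is a known keyword.
        if any(word.lower() in INSTITUTE_KEYWORDS for word in phrase.split()):
            institutes.append(phrase)
    return institutes

line = ("NBN Sinhgad School Of Engineering,Pune 2016 - 2020 "
        "Bachelor of Engineering Computer Science CGPA: 8.78")
print(find_institutes(line))
# ['NBN Sinhgad School Of Engineering']
```

This does not depend on where the institute sits relative to the years, but it will miss names that contain none of the keywords, so treat the list as something to tune.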
import re

txt = '''NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE'''

# extract grades such as 8.78 or 88.8
grade_regex = r'\d{1,2}\.\d{1,2}'
grades = re.findall(grade_regex, txt)

# extract year ranges such as "2016 - 2020"
year_regex = r'\d{4}\s?-\s?\d{4}'
years = re.findall(year_regex, txt)

# replace every value in noise_list with ":" so it can serve as a delimiter
def replacer(string, noise_list):
    for v in noise_list:
        string = string.replace(v, ":")
    return string

# extract college and degree
data = replacer(txt, years)
# "Pune :" and "CGPA:" both become "**", leaving "institute,** degree ** grade"
cleaned_text = re.sub(r"\w+\s?:", "**", data).split('\n')
college = []
degree = []
for i in cleaned_text:
    split_data = i.split("**")
    college.append(split_data[0].replace(',', '').strip())
    degree.append(split_data[1].strip())

parsed_output = []
for i in range(len(grades)):
    parsed_data = {
        "Institute": college[i],
        "Degree": degree[i],
        "Grades": grades[i],
        # the end of the range is the year of passing; strip the leading space
        "Year of Passing": years[i].split('-')[1].strip()
    }
    parsed_output.append(parsed_data)
print(parsed_output)
>>>> [{'Institute': 'NBN Sinhgad School Of Engineering', 'Degree': 'Bachelor of Engineering Computer Science', 'Grades': '8.78', 'Year of Passing': '2020'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Intermediate-PCM,Economics CBSE', 'Grades': '88.8', 'Year of Passing': '2016'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Matriculation,CBSE', 'Grades': '8.6', 'Year of Passing': '2014'}]
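The code above expects one record per line. If the text arrives as a single line, as in the question, one alternative is a single pattern with named groups that captures each record in turn. This is only a sketch under the assumption that every record follows the layout "institute,city start - end degree label: grade", where the label is CGPA, GPA, or Percentage. `RECORD_PATTERN` and `parse_education` are hypothetical names, not part of any library.

```python
import re

# Assumed layout per record: "<institute>,<city> <start> - <end> <degree> <label>: <grade>"
RECORD_PATTERN = re.compile(
    r"(?P<institute>[^,]+),\s*(?P<city>\w+)\s+"
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})\s+"
    r"(?P<degree>.+?)\s+"
    r"(?:CGPA|GPA|Percentage)\s*:\s*(?P<grade>\d{1,2}(?:\.\d{1,2})?)"
)

def parse_education(text):
    """Return one dict per education record found in text."""
    records = []
    for m in RECORD_PATTERN.finditer(text):
        records.append({
            "Institute": m.group("institute").strip(),
            "Degree": m.group("degree").strip(),
            "Grades": m.group("grade"),
            # The end of the year range is treated as the year of passing.
            "Year of Passing": m.group("end"),
        })
    return records

one_line = (
    "NBN Sinhgad School Of Engineering,Pune 2016 - 2020 "
    "Bachelor of Engineering Computer Science CGPA: 8.78 "
    "Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 "
    "Intermediate-PCM,Economics CBSE Percentage: 88.8 "
    "Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 "
    "Matriculation,CBSE CGPA: 8.6 EXPERIENCE"
)
for record in parse_education(one_line):
    print(record)
```

Because the fields are captured together, a record that lacks a grade or a year range is skipped as a whole instead of shifting the other lists out of alignment. As with the line-based version, the board name stays in the degree ("Intermediate-PCM,Economics CBSE"), so removing it would need an extra rule.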