How to tokenize text with NLTK in Python



I have text like this:

Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() 
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

I tokenized this text with word_tokenize in Python, and the output is:

Exception
org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid
cause
'org.hibernate.exception.SQLGrammarException
could
extract
ResultSet'
Caused
java.sql.SQLSyntaxErrorException
ORA-00942
table
view
exist

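But as you can see, the second line of the output contains several words joined by dots as a single token. How can I split them into separate words?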

I am using this Python code:

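>>> from nltk.tokenize import word_tokenize
>>> f = open('001.txt')
>>> # stopwords is a collection of stop words defined elsewhere
>>> text = [w for w in word_tokenize(f.read()) if w not in stopwords]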

In fact, I want all the words separated like this:

Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
'org
hibernate
exception
SQLGrammarException
could
extract
ResultSet'
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist

You can replace the dots with spaces and then split on whitespace:

f = """Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() 
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet' 
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist'"""
s = ''
# Turn every dot into a space, then split on any run of whitespace
f_list = f.replace('.', ' ').split()
for item in f_list:
    # print(item)
    s = s + item + '\n'
print(s)

Output:

Exception
in
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid()
with
cause
=
'org
hibernate
exception
SQLGrammarException:
could
not
extract
ResultSet'
Caused
by:
java
sql
SQLSyntaxErrorException:
ORA-00942:
table
or
view
does
not
exist'
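
The replace-and-split approach leaves punctuation such as `()`, `:` and `'` attached to some tokens. As a minimal sketch, assuming only the standard-library `re` module (the helper name `split_tokens` is hypothetical and not part of NLTK), one could split on dots and whitespace and then trim punctuation from each piece:

```python
import re

def split_tokens(text):
    """Split on dots and whitespace, then trim punctuation from each piece.

    `split_tokens` is a hypothetical helper name, not an NLTK function.
    """
    pieces = re.split(r"[.\s]+", text)
    # Strip common punctuation from both ends of each piece.
    cleaned = (p.strip("()':=") for p in pieces)
    # Drop pieces that were pure punctuation (e.g. "=") or empty.
    return [p for p in cleaned if p]

line = "Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist"
print(split_tokens(line))
```

The set of characters passed to `strip` is an assumption based on the sample text; other logs may need a different set.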

I found a simple way, using RegexpTokenizer from nltk.tokenize, as follows:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')

After removing the stop words, the output is as follows:

Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
org
hibernate
exception
SQLGrammarException
could
extract
ResultSet
Caused
java
sql
SQLSyntaxErrorException
ORA
00942
table
view
exist
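
One caveat with the pattern `\w+`: it does not match `-`, so a code like `ORA-00942` is split into `ORA` and `00942`. The sketch below is one possible way around that, using the standard-library `re.findall` with a pattern that keeps hyphen-joined runs together. The names `STOP_WORDS`, `TOKEN_PATTERN` and `tokenize` are illustrative assumptions, and the tiny stop-word set only stands in for a real list. The same pattern should also work when passed to `RegexpTokenizer` in place of `r'\w+'`, though that is not verified here.

```python
import re

# Illustrative stand-in for a real stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"in", "with", "by", "or", "does", "not"}

# Runs of word characters, optionally joined by single hyphens (e.g. ORA-00942).
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")

def tokenize(text, stop_words=STOP_WORDS):
    """Return the tokens matched by TOKEN_PATTERN, with stop words removed."""
    return [t for t in TOKEN_PATTERN.findall(text) if t.lower() not in stop_words]

line = "Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist"
print(tokenize(line))
```

The group is written as non-capturing, `(?:...)`, so that `findall` returns whole matches rather than the contents of the group.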
