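Problem:

I have pairs of sentences with the period and the capital letter between them missing, and I need to split them apart. I'm looking for help choosing good features to improve the model.

Background:

I am using pycrfsuite to perform sequence classification and find the end of the first sentence, as follows:

From the Brown corpus, I join every two sentences together and get their POS tags. Then I label each token in the sentence with 'S' if a space follows it, or 'P' if a period follows it. Finally, I remove the period between the sentences and lowercase the token that follows it. I get something like this:

Input: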

data = ['I love Harry Potter.', 'It is my favorite book.']

Output:

sent = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'), ('Potter', 'NNP'), ('it', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'), ('book', 'NN')]
labels = ['S', 'S', 'S', 'P', 'S', 'S', 'S', 'S', 'S']
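
The labelling steps described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: `join_pair` is a hypothetical helper, and it assumes each sentence is already a list of `(token, pos_tag)` pairs whose last token is the period.

```python
def join_pair(first, second):
    """Join two tagged sentences and label the boundary token with 'P'.

    Assumption: each sentence is a list of (token, pos_tag) pairs
    that ends with a ('.', '.') token.
    """
    def strip_final_period(tagged):
        # Drop the trailing period token if there is one
        if tagged and tagged[-1][0] == '.':
            return tagged[:-1]
        return tagged

    a = strip_final_period(first)
    b = strip_final_period(second)
    if b:
        # Lowercase the first token of the second sentence
        word, tag = b[0]
        b = [(word.lower(), tag)] + b[1:]

    sent = a + b
    labels = ['S'] * len(sent)
    if a:
        # The last token of the first sentence was followed by a period
        labels[len(a) - 1] = 'P'
    return sent, labels


first = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'),
         ('Potter', 'NNP'), ('.', '.')]
second = [('It', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'),
          ('favorite', 'JJ'), ('book', 'NN'), ('.', '.')]
sent, labels = join_pair(first, second)
```

For the example pair, this reproduces the `sent` and `labels` lists shown above.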

Currently, I extract these general features:

def word2features2(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]
    # Features for words that are not
    # at the beginning of a document
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a sentence'
        features.append('BOS')
    # Features for words that are not
    # at the end of a document
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a sentence'
        features.append('EOS')
    return features
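
For context, a per-token feature function like the one above is typically applied to every position of a sequence to build the training inputs. The sketch below shows one way to do that. The names `sent2features` and `toy_features` are illustrative assumptions, and `toy_features` is only a shortened stand-in for `word2features2` so that the block runs on its own.

```python
def sent2features(sent, feature_fn):
    # One feature list per token, in sequence order
    return [feature_fn(sent, i) for i in range(len(sent))]


def toy_features(sent, i):
    # Hypothetical, shortened stand-in for word2features2
    word, postag = sent[i]
    return ['bias', 'word.lower=' + word.lower(), 'postag=' + postag]


sents = [[('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'), ('Potter', 'NNP'),
          ('it', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'),
          ('book', 'NN')]]
all_labels = [['S', 'S', 'S', 'P', 'S', 'S', 'S', 'S', 'S']]

# Each training sequence is a list of per-token feature lists,
# paired with a label sequence of the same length
X_train = [sent2features(s, toy_features) for s in sents]
y_train = all_labels
```

In practice `word2features2` would be passed in place of `toy_features`.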

and train the CRF with the following parameters:

import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=True)
# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,
    # coefficient for L2 penalty
    'c2': 0.01,
    # maximum number of iterations
    'max_iterations': 200,
    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})
trainer.train('crf.model')

Results:

The classification report shows:

              precision    recall  f1-score   support

           S       0.99      1.00      0.99    214627
           P       0.81      0.57      0.67      5734

   micro avg       0.99      0.99      0.99    220361
   macro avg       0.90      0.79      0.83    220361
weighted avg       0.98      0.99      0.98    220361

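In what ways could I edit word2features2() (or any other part) to improve the model?

Here is a link to the full code as it stands today.

Also, I'm just a beginner in NLP, so I would greatly appreciate any overall feedback, links to relevant or useful resources, and fairly simple explanations. Thank you very much!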

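Because of the nature of the problem, your classes are highly imbalanced, so I would suggest using a weighted loss in which errors on the P label cost more than errors on the S class. I think the issue may be that, with both classes weighted equally, the classifier does not pay enough attention to the P labels, since they contribute very little to the loss.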
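
As a rough illustration of how such weights might be chosen, the sketch below uses inverse class frequency, with the support counts taken from the report in the question. The weighting formula is a common heuristic assumed here, not something the answer specifies, and `inverse_frequency_weights` is a hypothetical helper. How the weights are then passed to training depends on the library; this sketch does not assume that pycrfsuite offers a per-label weight option.

```python
def inverse_frequency_weights(counts):
    """Weight each label by total / (n_labels * count).

    Assumption: a rare label should count proportionally more, so
    that every label contributes about equally to the total loss.
    """
    total = sum(counts.values())
    n_labels = len(counts)
    return {label: total / (n_labels * count)
            for label, count in counts.items()}


# Support counts from the classification report in the question
support = {'S': 214627, 'P': 5734}
weights = inverse_frequency_weights(support)
for label, weight in weights.items():
    print(label, round(weight, 2))
```

With these counts, an error on P ends up weighted roughly 37 times as heavily as an error on S.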

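Another thing you can try is hyperparameter tuning. Make sure you optimize for the macro F1 score, since it gives both classes equal weight regardless of how many support instances each one has.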
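
To make this concrete, here is a minimal pure-Python sketch of macro F1 and of a grid search that selects the penalties by that score. The `evaluate` callable is a hypothetical placeholder for "train with these `c1`/`c2` values and return macro F1 on held-out data", and the function names are assumptions made for illustration.

```python
from itertools import product


def f1_for_label(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: F1 is taken to be 0 for this label
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=('S', 'P')):
    # Unweighted mean of per-label F1: rare labels count as much as common ones
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


def grid_search(evaluate, c1_values, c2_values):
    # evaluate(c1, c2) is a hypothetical callable returning a macro F1 score
    best = None
    for c1, c2 in product(c1_values, c2_values):
        score = evaluate(c1, c2)
        if best is None or score > best[0]:
            best = (score, c1, c2)
    return best


# A tagger that never predicts 'P' still looks good on accuracy
y_true = ['S'] * 9 + ['P']
y_pred = ['S'] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy example accuracy is 0.9 while macro F1 stays below 0.5, because the P label is never predicted. That gap is the reason to tune against macro F1 rather than accuracy.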
