Natural language corpus: string to int



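Draw a sample of sentences from each of corpus 1, corpus 2, and corpus 3, and display the average length (measured as the number of characters in each sentence).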

So I have 3 corpora, and sample_raw_sents is a predefined function that returns random sentences:

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50
for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))

With this code, all the lengths get printed, but how can I sum these lengths?

Using zip, you can pull one sentence from each corpus at a time.

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50
zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))
for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed / 3
    print(summed, average)
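
One detail worth keeping in mind with the zip approach: zip stops as soon as its shortest input is exhausted, so if one corpus returns fewer sentences than the others, the surplus sentences are silently dropped. Here is a minimal sketch using hard-coded stand-in lists (hypothetical data, not taken from any of the corpora above) in place of the sample_raw_sents() calls:

```python
# Stand-in samples; in the real code these would come from sample_raw_sents().
sample_a = ["one two", "three"]
sample_b = ["four", "five six", "seven"]  # one sentence more than the others
sample_c = ["eight nine", "ten"]

row_averages = []
for s1, s2, s3 in zip(sample_a, sample_b, sample_c):
    summed = len(s1) + len(s2) + len(s3)
    row_averages.append(summed / 3)

# Only two rows are produced: "seven" has no partners and is skipped.
print(len(row_averages))
print(row_averages)
```

If the three samples are always the same size this makes no difference; otherwise, summing each corpus separately avoids losing data.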

You can store the lengths of all the sentences in a list and then sum them.

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50
lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
print(sum(lengths) / len(lengths))

Alternatively, you can keep a running total instead of building a list:

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50
s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
# 3 corpora * sample_size sentences each (150 in total here)
average = s / (3 * sample_size)
print('average: {}'.format(average))
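
If the goal is one average per corpus rather than a single pooled figure, the same idea can be wrapped in a small helper. The sketch below is only an illustration: StubCorpus is a hypothetical stand-in for whatever corpus1(), corpus2(), and corpus3() return, and it assumes sample_raw_sents(n) returns n raw sentence strings. average_length is likewise a name introduced here for the example.

```python
import random
from statistics import mean


class StubCorpus:
    """Hypothetical stand-in for the objects returned by corpus1() etc."""

    def __init__(self, sentences):
        self.sentences = sentences

    def sample_raw_sents(self, n):
        # Assumed behaviour: return n randomly chosen raw sentence strings.
        return random.choices(self.sentences, k=n)


def average_length(corpus, sample_size):
    """Mean number of characters per sampled sentence."""
    return mean(len(s) for s in corpus.sample_raw_sents(sample_size))


corpora = {
    "corpus1": StubCorpus(["abcd", "abcd"]),
    "corpus2": StubCorpus(["ab", "abcdef"]),
    "corpus3": StubCorpus(["a"]),
}

sample_size = 50
averages = {name: average_length(c, sample_size) for name, c in corpora.items()}
for name, avg in averages.items():
    print("{}: {:.2f}".format(name, avg))
```

With real corpora, the dictionary values would simply be tcr, rcr, and mcr.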
