I am having a hard time building a network from this dataset:
Node Sentence
Mary I am here to help. What would you like to talk about?
Mary What's up? I hope everything is going well in NY. I have always loved NY, the Big Apple!
John There is the football match, tonight. Let's go to the pub!
Christopher It is a great news! I am so happy for y'all
Catherine Do not do that! It is extremely dangerous
Matt I read that news. I was so happy and grateful it was not you.
Matt Yes, I didn't know it. It is such a surprising news! Congratulations!
Sarah Nothing to add...
Catherine Finally a beautiful sunny day!!!
Mary Jane I do not think it will rain. There is the sun. It is a hot day. Very hot!
The names should be the nodes of the network. For each node I should create links based on the frequent words in its sentences (excluding stop words), to get more meaningful relationships. To remove the stop words from the sentences I used nltk (it does not work well yet, but it should):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
df['Sentences'] = df['Sentences'].str.lower().str.split()
# The result of apply() must be assigned back, otherwise the column is unchanged
df['Sentences'] = df['Sentences'].apply(lambda x: [item for item in x if item not in stop_words])
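For reference, here is a minimal runnable sketch of that cleaning step. It uses a small hard-coded stop-word set instead of the NLTK list (so it runs without nltk.download) and sample rows from the dataset above; the key point is that the result of apply() has to be assigned back to the column:

```python
import pandas as pd

# Small illustrative stop-word set; with NLTK you would use
# set(stopwords.words('english')) instead.
stop_words = {'i', 'am', 'to', 'the', 'it', 'is', 'a', 'so', 'for'}

df = pd.DataFrame({'Node': ['Mary', 'Catherine'],
                   'Sentence': ['I am here to help', 'It is a hot day']})

# Lower-case, split into words, and drop the stop words; the result
# must be assigned back to the column to take effect.
df['Sentence'] = df['Sentence'].str.lower().str.split()
df['Sentence'] = df['Sentence'].apply(
    lambda words: [w for w in words if w not in stop_words])

print(df['Sentence'].tolist())  # [['here', 'help'], ['hot', 'day']]
```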
Then, for the word frequencies, I would first create a vocabulary with all the terms and their corresponding frequencies, and then go back over the sentences to create (word, freq) pairs, where the word is the "target" node and freq should determine the size of the target node. This is where my difficulty shows up, because this:
import nltk

# word_tokenize expects a string, so the sentences are joined first
word = ' '.join(df['Sentence'].tolist())
words = nltk.tokenize.word_tokenize(word)
word_dist = nltk.FreqDist(words)
result = pd.DataFrame(word_dist, columns=['Word', 'Frequency'])
does not show the words and their frequencies (here I am creating a new dataframe to display them, rather than adding two more columns with this information to my original dataframe; the latter would be preferable). To build the network, once I have Node, Target and Weight, I will use networkx.
Example of the word_dist result (unsorted):
Word Frequency
help 8
like 12
news 21
day 8
sunny 17
sun 23
football 12
pub 3
home 14
congratulations 3
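Once the Node, Target and Weight columns exist, the networkx step mentioned above can be sketched roughly as follows; the sample edges here are made up for illustration, and the frequencies would drive the node sizes when drawing:

```python
import pandas as pd
import networkx as nx

# Hypothetical edge list: person -> frequent word, weighted by frequency.
edges = pd.DataFrame({'Node':   ['Mary', 'Mary', 'Catherine'],
                      'Target': ['help', 'like', 'sunny'],
                      'Weight': [8, 12, 17]})

# Build a weighted, undirected graph straight from the dataframe.
G = nx.from_pandas_edgelist(edges, source='Node', target='Target',
                            edge_attr='Weight')

print(G.number_of_nodes(), G.number_of_edges())  # 5 3
print(G['Mary']['like']['Weight'])               # 12
```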
The nltk.FreqDist() class returns a collections.Counter object, which is basically a dictionary. When pandas builds a dataframe and the first argument is a dictionary, every key is treated as a column and every value is expected to be a list of column values. That is why, as in the example below, result is an empty dataframe with two columns. To build a dataframe from a dictionary where each key becomes a row, you can simply split the dictionary into its keys and values, as done in the construction of result2. The line after that renames the index, if you want that.
import pandas as pd

word_dict = {'help': '8',
             'like': '12',
             'news': '21',
             'day': '8',
             'sunny': '17',
             'sun': '23',
             'football': '12',
             'pub': '3',
             'home': '14',
             'congratulations': '3'}

# Each key is treated as a column name, so result is an empty
# dataframe with the two columns 'a' and 'b'
result = pd.DataFrame(word_dict, columns=('a', 'b'))

# One row per word: the keys become the index, the values the column
result2 = pd.DataFrame(word_dict.values(), index=word_dict.keys(), columns=('Frequency',))
result2.index.rename('Word', inplace=True)
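As a side note, if the counts come straight from nltk.FreqDist (i.e. a Counter), two shorter ways to get one row per word, avoiding the empty-frame pitfall above, are pd.DataFrame.from_dict with orient='index', or passing the (word, frequency) pairs directly:

```python
import pandas as pd

# A plain dict standing in for the FreqDist/Counter object.
word_dict = {'help': 8, 'like': 12, 'news': 21}

# orient='index' turns each key into a row label instead of a column.
result3 = pd.DataFrame.from_dict(word_dict, orient='index',
                                 columns=['Frequency'])
result3.index.rename('Word', inplace=True)

# Equivalent: build the frame from (word, frequency) pairs.
result4 = pd.DataFrame(list(word_dict.items()),
                       columns=['Word', 'Frequency'])
```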