Build a network with users as nodes and the users' sentences as targets



I am having trouble building a network from this dataset:

Node                Sentence      
Mary              I am here to help. What would you like to talk about?
Mary              What's up? I hope everything is going well in NY. I have always loved NY, the Big Apple!
John              There is the football match, tonight. Let's go to the pub!
Christopher       It is a great news! I am so happy for y'all
Catherine         Do not do that! It is extremely dangerous
Matt              I read that news. I was so happy and grateful it was not you. 
Matt              Yes, I didn't know it. It is such a surprising news! Congratulations!
Sarah             Nothing to add...
Catherine         Finally a beautiful sunny day!!!
Mary Jane         I do not think it will rain. There is the sun. It is a hot day. Very hot!

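The names should be the nodes of the network. For each node, I want to create links to the frequent words in that user's sentences (excluding stop words), so that the relationships are more meaningful. To remove the stop words from the sentences I used nltk (it does not work very well, but it should do):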

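from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
# Tokenize into a new column so the original text in 'Sentence' is preserved
df['Tokens'] = df['Sentence'].str.lower().str.split()
# apply() returns a new Series, so the result has to be assigned back
df['Tokens'] = df['Tokens'].apply(lambda x: [item for item in x if item not in stop_words])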
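
One likely reason the filtering "does not work well" is that `str.split()` leaves punctuation attached to the tokens, so `about?` never matches the stop word `about`. Below is a minimal sketch of a regex-based tokenizer that strips punctuation before filtering. The `tokenize` helper and the tiny `STOP_WORDS` set are assumptions for illustration only; in practice the set would come from `stopwords.words('english')`.

```python
import re

# Hypothetical, deliberately tiny stop-word set used only for this sketch.
# In practice, use set(nltk.corpus.stopwords.words('english')) instead.
STOP_WORDS = {"i", "am", "here", "to", "what", "would", "you",
              "about", "the", "is", "it", "a"}

def tokenize(sentence):
    """Lower-case a sentence and return its tokens without stop words."""
    # [a-z']+ keeps contractions such as "let's" together and drops punctuation.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("I am here to help. What would you like to talk about?")
print(tokens)
```

With a pandas column, the same helper could be applied as `df['Sentence'].apply(tokenize)`.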

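Then, for the word frequencies, I would first create a vocabulary containing all terms and their corresponding frequencies, and then go back to the sentences to create pairs (word, freq), where the word is the "target" node and freq should be the size of the target node. This is where my difficulty shows up, because this: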

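import nltk
import pandas as pd

# word_tokenize expects a single string, so join the sentences first
text = ' '.join(df['Sentence'].tolist())
words = nltk.tokenize.word_tokenize(text)
word_dist = nltk.FreqDist(words)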
result = pd.DataFrame(word_dist, columns=['Word', 'Frequency'])

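does not display the words and their frequencies (I am creating a new DataFrame to show them, rather than adding two more columns with this information to my original DataFrame; the latter would be preferable). To build the network, once I have Node, Target and Weight, I would use networkx.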
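
The Node, Target, Weight idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the edge weight is taken to be the number of times a user wrote a word, the corpus-wide frequency is stored in a node attribute named `size`, and `STOP_WORDS`, `tokenize`, `total` and `per_user` are hypothetical names. `collections.Counter` is used here for the counting.

```python
from collections import Counter
import re

import networkx as nx

# Three (Node, Sentence) rows copied from the table in the question.
rows = [
    ("Catherine", "Do not do that! It is extremely dangerous"),
    ("Catherine", "Finally a beautiful sunny day!!!"),
    ("Mary Jane", "I do not think it will rain. There is the sun. "
                  "It is a hot day. Very hot!"),
]

# Hypothetical small stop-word set; replace with NLTK's list in practice.
STOP_WORDS = {"do", "not", "that", "it", "is", "a", "i",
              "will", "there", "the", "very"}

def tokenize(sentence):
    """Lower-case, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

total = Counter()   # word -> frequency over all sentences
per_user = {}       # user -> Counter of that user's words
for user, sentence in rows:
    words = tokenize(sentence)
    total.update(words)
    per_user.setdefault(user, Counter()).update(words)

G = nx.Graph()
for user, counts in per_user.items():
    G.add_node(user, kind="user")
    for word, n in counts.items():
        # Assumption: node size = corpus-wide frequency, edge weight = per-user count.
        G.add_node(word, kind="word", size=total[word])
        G.add_edge(user, word, weight=n)

print(G["Mary Jane"]["hot"]["weight"], G.nodes["day"]["size"])
```

Note that words are lower-cased while user names keep their capitalization, which keeps a user such as "Matt" from colliding with a word node in this sketch. A real dataset may need an explicit prefix or a bipartite node key instead.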

Example of the word_dist result (unsorted):

Word             Frequency
help             8
like             12
news             21
day              8
sunny            17
sun              23
football         12
pub              3
home             14
congratulations  3

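nltk.FreqDist() returns a collections.Counter object (FreqDist is a subclass of Counter), which is basically a dictionary. When pandas builds a DataFrame and the first argument is a dictionary, each key is treated as a column and each value is expected to be a list of column values. None of the word keys match the requested column names, so, as in the example below, result will be an empty DataFrame with two columns.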

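To build a DataFrame from a dictionary in which each key becomes a row, you can simply split the dictionary into its keys and values, as in the construction of result2. The line after that names the index, if you want that.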

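import pandas as pd

# Frequencies stored as integers so they can be sorted and used as weights
word_dict = {'help': 8,
             'like': 12,
             'news': 21,
             'day': 8,
             'sunny': 17,
             'sun': 23,
             'football': 12,
             'pub': 3,
             'home': 14,
             'congratulations': 3}
# Keys are looked up as columns; none is named 'a' or 'b', so this frame is empty
result = pd.DataFrame(word_dict, columns=['a', 'b'])
# One row per key: values become the data, keys become the index
result2 = pd.DataFrame(list(word_dict.values()), index=list(word_dict.keys()), columns=['Frequency'])
result2.index.rename('Word', inplace=True)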
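
If two ordinary columns named Word and Frequency are wanted, as in the question, rather than an index, one possible variant is to pass the dictionary's `(key, value)` pairs as rows. The sketch below uses a plain `collections.Counter` with three of the sample counts as a stand-in for the `FreqDist` object; that substitution is an assumption made so the example runs without NLTK data.

```python
from collections import Counter

import pandas as pd

# Counter stands in for nltk.FreqDist here (FreqDist subclasses Counter).
word_dist = Counter({"help": 8, "like": 12, "news": 21})

# Each (word, count) pair becomes one row with two named columns.
result = pd.DataFrame(list(word_dist.items()), columns=["Word", "Frequency"])

# Optional: most frequent words first, with a clean 0..n-1 index.
result = result.sort_values("Frequency", ascending=False).reset_index(drop=True)
print(result)
```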
