Tweepy: filter the tweet stream by keywords only



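I am interested in retrieving a stream of tweets about the upcoming Nigerian general election. I want all tweets from Nigeria that only contain information about the four main presidential candidates ("atiku-abubakar-rabiu-kwankwaso-peter-obi-bola-tinubu-inec").

However, what I am retrieving at the moment is simply all tweets, most of which are unrelated to the keywords (the rule), or even to politics or elections.

My code: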

import tweepy
import json
import sqlite3
import time

BEARER = "my-bearer-key"

try:
    connection = sqlite3.connect('inec-2023-tweets.db')
    cursor = connection.cursor()
    print(f"Database connection successful! \n")
except sqlite3.Error as error:
    print(f'Error while connecting to sqlite {error}')


class MyListener(tweepy.StreamingClient):
    def on_data(self, data):
        new_data = str(data) #
        data_obj = json.loads(data.decode('utf8'))
        data_obj = json.dumps(data_obj, indent=2)
        print('\nTweet data received, sending to db...\n')
        u_timestamp = int(time.time())
        query = "INSERT INTO raw_data(timestamp, payload) VALUES(?,?)"
        data = (u_timestamp, data_obj)
        try:
            cursor.execute(query, data)
            connection.commit()
            print('\nData saved.')
        except sqlite3.Error as error:
            print(f"Error while working with SQLite: {error}")
        return True

    def on_connect(self):
        print('\n Connected..!')

    def on_error(self, status):
        print(status)
        return True


stream = MyListener(BEARER)
stream.add_rules(tweepy.StreamRule('place_country:NG has:geo', tag="atiku-abubakar-rabiu-kwankwaso-peter-obi-bola-tinubu-inec"))
stream.filter(tweet_fields=["geo","created_at","author_id","context_annotations"],
              place_fields=["id","geo","name","country_code","place_type","full_name","country"],
              expansions=["geo.place_id","referenced_tweets.id"])

Example of a retrieved tweet:

1|1666603722|{
"data": {
"author_id": "3301724376",
"context_annotations": [
{
"domain": {
"id": "46",
"name": "Business Taxonomy",
"description": "Categories within Brand Verticals that narrow down the scope of Brands"
},
"entity": {
"id": "1557193940978135808",
"name": "Gaming Business",
"description": "Brands, companies, advertisers and every non-person handle with the profit intent related to offline and online games such as gaming consoles, tabletop games, video game publishers"
}
},
{
"domain": {
"id": "47",
"name": "Brand",
"description": "Brands and Companies"
},
"entity": {
"id": "1502374025170882561",
"name": "WhatsApp",
"description": "WhatsApp Messenger, or simply WhatsApp, is an internationally available American freeware, cross-platform centralized instant messaging and voice-over-IP service owned by Meta Platforms."
}
}
],
"created_at": "2022-10-24T09:28:36.000Z",
"edit_history_tweet_ids": [
"1584476943512059905"
],
"geo": {
"place_id": "13e62ac32ad46001"
},
"id": "1584476943512059905",
"text": "Good morning \ud83e\udd70\nThis is a great week to shop for new sheets\ud83d\ude4f\ud83c\udffc\u2764\ufe0f\n\nBedsheets and pillowcases only \n6/6 - NGN 6000\n6/7 - NGN 6500\n7/7 - NGN 7500\n \nKindly DM or WhatsApp 08062407473 to order \nLocation is Lagos\nNationwide delivery guaranteed \ud83d\udcaf\n@_DammyB_  @yay_tunes @unclemidetush "
},
"includes": {
"places": [
{
"country": "Nigeria",
"country_code": "NG",
"full_name": "Lagos University Teaching Hospital",
"geo": {
"type": "Feature",
"bbox": [
3.354450897360182,
6.519118684124127,
3.354450897360182,
6.519118684124127
],
"properties": {}
},
"id": "13e62ac32ad46001",
"name": "Lagos University Teaching Hospital",
"place_type": "poi"
}
],
"tweets": [
{
"author_id": "3301724376",
"context_annotations": [
{
"domain": {
"id": "46",
"name": "Business Taxonomy",
"description": "Categories within Brand Verticals that narrow down the scope of Brands"
},
"entity": {
"id": "1557696940178935808",
"name": "Gaming Business",
"description": "Brands, companies, advertisers and every non-person handle with the profit intent related to offline and online games such as gaming consoles, tabletop games, video game publishers"
}
},
{
"domain": {
"id": "47",
"name": "Brand",
"description": "Brands and Companies"
},
"entity": {
"id": "1502374025170882561",
"name": "WhatsApp",
"description": "WhatsApp Messenger, or simply WhatsApp, is an internationally available American freeware, cross-platform centralized instant messaging and voice-over-IP service owned by Meta Platforms."
}
}
],
"created_at": "2022-10-24T09:28:36.000Z",
"edit_history_tweet_ids": [
"1584476943512059905"
],
"geo": {
"place_id": "13e62ac32ad46001"
},
"id": "1584476943512059905",
"text": "Good morning \ud83e\udd70\nThis is a great week to shop for new sheets\ud83d\ude4f\ud83c\udffc\u2764\ufe0f\n\nBedsheets and pillowcases only \n6/6 - NGN 6000\n6/7 - NGN 6500\n7/7 - NGN 7500\n \nKindly DM or WhatsApp 08062407473 to order \nLocation is Lagos\nNationwide delivery guaranteed \ud83d\udcaf\n@_DammyB_  @yay_tunes @unclemidetush "
}
]
},
"matching_rules": [
{
"id": "1575129079472443401",
"tag": "atiku-abubakar-rabiu-kwankwaso-peter-obi-bola-tinubu-inec"
}
]
}

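  • How can I filter the tweets so that they only include those keywords "atiku-abubakar-rabiu-kwankwaso-peter-obi-bola-tinubu-inec" (or tags)?

The value of tag in the API call you are making is not a search term. The search term is the first value in the rule.

The tag is only there so that you can label your stream filter rules. When a Tweet matches a rule, it is marked with that rule's tag. So, you might want Tweets about cats with images, and also Tweets about dogs with videos: you can receive both over a single connection but label them separately, and you can name the tags however you like: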

stream.add_rules(tweepy.StreamRule('cats has:images', tag="cat-images"))
stream.add_rules(tweepy.StreamRule('dogs has:videos', tag="dogs"))

With this pattern, when your code receives a Tweet, it can check which rule it matched by looking at the tag.
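As a rough sketch of that check, the snippet below pulls the tags out of one raw streamed payload, such as the bytes passed to on_data in the question. It uses only the standard library; the helper name matching_tags and the sample payload are made up for illustration, while the matching_rules field is the one visible in the example output in the question.

```python
import json


def matching_tags(raw):
    """Return the tag of every rule that matched one streamed payload.

    `raw` is the JSON document for a single Tweet (bytes or str).
    Hypothetical helper, not part of Tweepy.
    """
    payload = json.loads(raw)
    return [rule.get("tag") for rule in payload.get("matching_rules", [])]


# Made-up payload shaped like the stream output shown in the question.
sample = b'{"data": {"id": "1", "text": "my cat"}, "matching_rules": [{"id": "10", "tag": "cat-images"}]}'

tags = matching_tags(sample)
if "cat-images" in tags:
    print("handle as a cat Tweet")
elif "dogs" in tags:
    print("handle as a dog Tweet")
```

A Tweet can match more than one rule, which is why the helper returns a list rather than a single tag.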

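Digging into your specific example, you have a few options here. I think you want to include the names of one or more politicians, but only look for Tweets from within Nigeria (an important point: only a small fraction of Tweets are geo-tagged by the people who post them, so you may only get a small share of Tweets tagged as coming from inside the country).

I am not entirely sure which of these names belong together, and I have shortened the query, so forgive my lack of knowledge, but you could do something like this: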

stream.add_rules(tweepy.StreamRule('("atiku abubakar" OR "rabiu kwankwaso")  place_country:NG has:geo', tag="ng-election"))

This will try to match Tweets that

contain the text "atiku abubakar" OR the text "rabiu kwankwaso", AND include a geo location, AND come from within Nigeria

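Returned Tweets will be tagged ng-election in matching_rules.tag. You can add a rule for each politician (using differently named tags), or put them all in the same rule under one tag.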
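If you go with a single rule, one way to avoid hand-writing the OR group is to build the rule string from a list of names. This is only a sketch: build_rule is a hypothetical helper, the way the names are split is my guess from the tag in the question, and the 512-character default is an assumption about the rule length limit, so check the limit that applies to your access level.

```python
def build_rule(names, country="NG", max_len=512):
    """Build one filter rule matching any of `names` as an exact phrase.

    Hypothetical helper. `max_len` is an assumed limit on rule length;
    adjust it to whatever your access level allows.
    """
    phrases = " OR ".join(f'"{name}"' for name in names)
    rule = f"({phrases}) place_country:{country} has:geo"
    if len(rule) > max_len:
        raise ValueError(f"rule is {len(rule)} characters, limit is {max_len}")
    return rule


# Assumed split of the names taken from the tag in the question.
candidates = ["atiku abubakar", "rabiu kwankwaso", "peter obi", "bola tinubu"]
rule = build_rule(candidates)
print(rule)
```

The resulting string can then be passed as the first argument to tweepy.StreamRule, with tag="ng-election" as in the example above.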

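Another thing you could do is see whether Twitter has context annotations defined for these individuals or parties (there may be a hashtag or something similar, for example). If you find a Tweet that is definitely about one of these people, it may have a context_annotation that identifies that specific individual. In the example you shared, you have a Tweet about domain 46 (business) and entity 1557696940178935808 (gaming), so you could match a rule on context:46.1557696940178935808, which would pick out Tweets in the same category. Replace the [domain].[entity] values with ones that match the topics you are interested in.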
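To find candidate values, you could list the context:[domain].[entity] operators carried by a Tweet you already stored. The following is a sketch under the assumption that the payload has the same shape as the example in the question; context_operators is a hypothetical helper, and the sample below is trimmed down to the two annotations from that example.

```python
def context_operators(payload):
    """List the context:<domain>.<entity> operators found on one Tweet.

    `payload` is the parsed JSON of a single streamed Tweet.
    Hypothetical helper, not part of Tweepy.
    """
    operators = []
    for annotation in payload.get("data", {}).get("context_annotations", []):
        operator = f'context:{annotation["domain"]["id"]}.{annotation["entity"]["id"]}'
        if operator not in operators:  # skip repeated annotations
            operators.append(operator)
    return operators


# Trimmed-down payload using the ids from the example Tweet in the question.
sample_payload = {
    "data": {
        "context_annotations": [
            {"domain": {"id": "46"}, "entity": {"id": "1557696940178935808"}},
            {"domain": {"id": "47"}, "entity": {"id": "1502374025170882561"}},
        ]
    }
}

for operator in context_operators(sample_payload):
    print(operator)
```

Running this against a Tweet that is clearly about one of the candidates would show whether an annotation exists for that person that you could use in a rule.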
