Looking for an efficient way to build a matrix from the Yelp review dataset in Python



I'm currently looking for an efficient way to build a rating matrix for a recommender system in Python.

The matrix should look like this:

4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|

Specifically, the columns are business_ids and the rows are user_ids:

|bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|

I'm currently using the Yelp review dataset stored in MongoDB:

_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"

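My approach is to build lists of the unique business_id and user_id values from the review table, and then query the review table again for each of those values.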

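I've included my code here. As you can see, because of the brute-force approach it takes a very long time to build even a small matrix like the one shown above. Here are some snippets of my code: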

def makeBisnisArray(cityNameParam):
    arrayBisnis = []
    # Append business ids filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        # if the business id is not in the array, append it to the array
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser
    # find reviews filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        # if the user id is not already in the array, append it to the array
        if review['user_id'] not in arrayUser:
            arrayUser.append(review['user_id'])

def writeRatingMatrix(arrayBisnis, arrayUser):
    with open("file.txt", "w") as f:
        for user in arrayUser:
            for bisnis in arrayBisnis:
                # find one instance from the database by business_id and user_id
                x = colReview.find_one({"business_id": bisnis, "user_id": user})
                # if there's none, then just write the rating as 0
                if x is None:
                    f.write('0|')
                # if found, write the star value
                else:
                    f.write(str(x['stars']) + "|")
            print()
            f.write('\n')

def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser)

arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)

Can anyone suggest a more efficient way to build the rating matrix?

There are a few general approaches you can take to speed this up.

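  1. Use sets or dictionaries to build the collections of unique businesses and users; set/dict lookups are much faster than searching a list.
  2. Process the Yelp data one entry at a time, in a single pass.
  3. Use something like numpy or pandas to build your matrix.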

Something like this:


import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = []     # a list, so .append() works; a bare {} would create a dict
for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))
# Zero-filled users x businesses matrix, then fill in the known ratings
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]
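
Since step 3 mentions pandas, the same matrix could also be sketched with a pivot table. This is only an illustration: the three sample reviews are made up, and `yelp_entries` is assumed to be any iterable of dicts with user_id, business_id and stars fields like the documents shown in the question.

```python
import pandas as pd

# Hypothetical sample data shaped like the review documents in the question.
yelp_entries = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b2", "stars": 2},
]

df = pd.DataFrame(yelp_entries, columns=["user_id", "business_id", "stars"])

# Rows are users, columns are businesses; pairs without a review become 0.
# aggfunc="last" keeps the last review seen when a user rated a business twice.
rating_table = df.pivot_table(
    index="user_id",
    columns="business_id",
    values="stars",
    aggfunc="last",
    fill_value=0,
)

# Plain 2-D array if the labels are no longer needed.
matrix = rating_table.to_numpy()
print(rating_table)
```

The row and column labels stay attached to `rating_table`, which makes it easy to look up which user or business a given cell belongs to.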

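I modified @sirlark's code to fit my needs, but for some reason I couldn't use append on ratings or iterate over it with "for r in ratings", so I had to change the code like this: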

import numpy

users = {}
businesses = {}
ratings = {}  # dict keyed by a running index: 0, 1, 2, ...
# Query the yelp_entries for all reviews matching business_id and store it in businesses first
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))
matrix = numpy.tile(0, (len(users), len(businesses)))
for ind in range(len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
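
Both snippets leave open where `yelp_entries` comes from. One possible way to produce it is sketched below, assuming that `colBisnis` and `colReview` are the pymongo collections from the question and that reviews reference a business by its `_id`, as the question's code does. It fetches every review for the city with a single `$in` query instead of one `find_one` call per user/business pair. The helper name `iter_city_reviews` is made up for illustration.

```python
def iter_city_reviews(col_business, col_review, city_name):
    """Return an iterable of the reviews for all businesses in one city.

    Uses two queries in total, regardless of how many users or businesses exist.
    """
    # Collect the ids of the businesses in the city (projection keeps only _id).
    business_ids = [
        b["_id"] for b in col_business.find({"city": city_name}, {"_id": 1})
    ]
    # One query for every matching review; only the fields the matrix needs.
    projection = {"_id": 0, "user_id": 1, "business_id": 1, "stars": 1}
    return col_review.find({"business_id": {"$in": business_ids}}, projection)


# Assumed usage with the collections from the question:
# yelp_entries = iter_city_reviews(colBisnis, colReview, 'Pointe-Aux-Trembles')
```

If the review collection is large, an index on business_id would probably help this query, but that depends on how the database is set up.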

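Later I found that, besides the tile approach, we can also use SciPy's coo_matrix, which is slightly faster than the approach above, but the code needs to be modified a little: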

from scipy.sparse import coo_matrix

users = {}
businesses = {}
row = []   # row index (user) of each rating
col = []   # column index (business) of each rating
data = []  # star value of each rating
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))
matrix = coo_matrix((data, (row, col))).toarray()
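
Two details of the coo_matrix version may be worth knowing. Calling .toarray() materializes the full dense users x businesses array, which can get large for bigger cities. Also, coo_matrix sums the values of duplicate (row, col) pairs when it is converted, so a user who reviewed the same business twice would end up with the sum of both ratings. The sketch below uses made-up sample data, keeps only the last rating per pair, and leaves the result sparse in CSR form. Whether "last rating wins" is the right policy is an assumption.

```python
from scipy.sparse import coo_matrix

# Hypothetical sample data; u1 reviewed b1 twice.
yelp_entries = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b2", "stars": 3},
    {"user_id": "u1", "business_id": "b1", "stars": 5},
]

users = {}
businesses = {}
last_rating = {}  # (row, col) -> stars, so a repeat overwrites instead of adding up
for entry in yelp_entries:
    # setdefault returns the existing index or assigns the next free one
    r = users.setdefault(entry["user_id"], len(users))
    c = businesses.setdefault(entry["business_id"], len(businesses))
    last_rating[(r, c)] = int(entry["stars"])

rows = [r for r, _ in last_rating]
cols = [c for _, c in last_rating]
data = list(last_rating.values())

# An explicit shape keeps the dimensions tied to the id dictionaries.
sparse = coo_matrix(
    (data, (rows, cols)), shape=(len(users), len(businesses))
).tocsr()

print(sparse.nnz)        # number of stored ratings
print(sparse.toarray())  # dense view, only sensible for small matrices
```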

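Note: I later found that the reason I couldn't call .append(( or .add(( on the ratings variable is that

ratings = {}

creates a dict. To declare a set, which supports .add(, you should use this instead (a list, written [], is what supports .append():

ratings = set()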
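
A quick check of the three literals illustrates the difference. The variable names are only for this demonstration.

```python
as_dict = {}    # empty dict: has neither .append() nor .add()
as_set = set()  # empty set: has .add(), drops duplicates, no fixed order
as_list = []    # empty list: has .append(), keeps order and duplicates

as_set.add((0, 1, 4))
as_set.add((0, 1, 4))      # same tuple again, the set still holds one element
as_list.append((0, 1, 4))
as_list.append((0, 1, 4))  # the list keeps both copies

print(type(as_dict).__name__, type(as_set).__name__, type(as_list).__name__)
# dict set list
print(len(as_set), len(as_list))
# 1 2
print(hasattr(as_dict, "append"), hasattr(as_dict, "add"))
# False False
```

For the rating tuples a list is usually the closer fit, since it preserves the order in which reviews were read.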
