I get the following error as soon as I update sklearn to a newer version, and I don't know why.
Traceback (most recent call last):
File "/Users/X/Courses/Project/SupportVectorMachine/main.py", line 95, in <module>
y, x = dmatrices(formula, data=finalDataFrame, return_type='matrix')
File "/Library/Python/2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
NA_action, return_type)
File "/Library/Python/2.7/site-packages/patsy/highlevel.py", line 156, in _do_highlevel_design
return_type=return_type)
File "/Library/Python/2.7/site-packages/patsy/build.py", line 947, in build_design_matrices
value, is_NA = evaluator.eval(data, NA_action)
File "/Library/Python/2.7/site-packages/patsy/build.py", line 85, in eval
return result, NA_action.is_numerical_NA(result)
File "/Library/Python/2.7/site-packages/patsy/missing.py", line 135, in is_numerical_NA
mask |= np.isnan(arr)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe'
This is the corresponding code. I reinstalled everything from NumPy to SciPy, but nothing helped.
# Merging the two dataframes - user and the tweets
finalDataFrame = pandas.merge(twitterDataFrame.reset_index(),twitterUserDataFrame.reset_index(),on=['UserID'],how='inner')
finalDataFrame = finalDataFrame.drop_duplicates()
finalDataFrame['FrequencyOfTweets'] = numpy.all(numpy.isfinite(finalDataFrame['FrequencyOfTweets']))
# model formula, ~ means = and C() lets the classifier know its categorical data.
formula = 'Classifier ~ InReplyToStatusID + InReplyToUserID + RetweetCount + FavouriteCount + Hashtags + UserMentionID + URL + MediaURL + C(MediaType) + UserMentionID + C(PossiblySensitive) + C(Language) + TweetLength + Location + Description + UserAccountURL + Protected + FollowersCount + FriendsCount + ListedCount + UserAccountCreatedAt + FavouritesCount + GeoEnabled + StatusesCount + ProfileBackgroundImageURL + ProfileUseBackgroundImage + DefaultProfile + FrequencyOfTweets'
### create a regression friendly data frame y gives the classifiers, x gives the features and gives different columns for Categorical data depending on variables.
y, x = dmatrices(formula, data=finalDataFrame, return_type='matrix')
## select which features we would like to analyze
X = numpy.asarray(x)
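I've found that this error sometimes shows up when np.isnan is called on an array that contains strings or other non-float values. Try converting your np.array with arr.astype(float) before passing it to dmatrices.
Also, your FrequencyOfTweets column is being set to all False or all True, because np.all returns a single boolean for the whole column rather than one value per row.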
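Both points above can be reproduced with a few lines of NumPy. This is only a minimal sketch: the arrays `values` and `freq` are made-up stand-ins, not the asker's actual columns, and it assumes the offending column reached patsy with object dtype.

```python
import numpy as np

# Hypothetical stand-in for a column stored with object dtype
values = np.array([1.5, np.nan, 3.0], dtype=object)

# np.isnan has no implementation for object arrays, so it raises TypeError
try:
    np.isnan(values)
    isnan_failed = False
except TypeError:
    isnan_failed = True

# Casting to float first gives np.isnan a dtype it supports
nan_mask = np.isnan(values.astype(float))

# Hypothetical stand-in for the FrequencyOfTweets column
freq = np.array([0.5, 2.0, np.inf])

# np.all collapses the whole column to one boolean; assigning that
# to a DataFrame column would broadcast the same value to every row
single_flag = np.all(np.isfinite(freq))

# Keeping the elementwise result gives one boolean per row instead
per_row_mask = np.isfinite(freq)
```

If the aim was to drop rows with non-finite frequencies, a per-row mask like `per_row_mask` is probably what should be used to filter the DataFrame, rather than overwriting the column.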
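After a lot of digging through the code, I found that the problem was in the formula I was passing, which asks the program to use all of the features below. The UserAccountCreatedAt column has dtype datetime64[ns]. I have removed it from the formula for now and the error is gone, but I would like to know the best way to convert it to numeric data so that I can actually pass it in. The error arises because categorical data is handled by the C() wrapper in front of some columns, as shown below, whereas patsy treats a datetime column as numeric.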
formula = 'Classifier ~ UserAccountCreatedAt + InReplyToStatusID + InReplyToUserID + RetweetCount + FavouriteCount + Hashtags + UserMentionID + URL + MediaURL + C(MediaType) + UserMentionID + C(PossiblySensitive) + C(Language) + TweetLength + Location + Description + UserAccountURL + Protected + FollowersCount + FriendsCount + ListedCount + FavouritesCount + GeoEnabled + StatusesCount + ProfileBackgroundImageURL + ProfileUseBackgroundImage + DefaultProfile + FrequencyOfTweets'
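One possible way to turn the datetime column into something numeric is to express it as an account age in days relative to a fixed reference date. The sketch below is an assumption-laden illustration, not a confirmed fix: the tiny `df`, the `reference` date, and the `AccountAgeDays` column name are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature of finalDataFrame with only the datetime column
df = pd.DataFrame(
    {"UserAccountCreatedAt": pd.to_datetime(["2010-01-01", "2012-06-15"])}
)

# Assumed reference date; the collection date of the tweets would be a natural choice
reference = pd.Timestamp("2015-01-01")

# Subtracting gives timedeltas; .dt.days turns them into plain integers
df["AccountAgeDays"] = (reference - df["UserAccountCreatedAt"]).dt.days
```

The formula could then use AccountAgeDays in place of UserAccountCreatedAt, so that patsy only ever sees an integer column for this feature.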