Redshift Python UDF runs on its own, but raises an error when used with count or as part of another query

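EDIT/UPDATE BELOW. I have stored, and can successfully run, a Python UDF in AWS Redshift. The UDF takes a latitude/longitude point and returns a boolean indicating whether that point is within a given distance of another given point.

When I run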

SELECT dist_in_range(5000.0, latitude, longitude, 38.897957, -77.036560) as in_range 
from test_2;

it returns a column of true or false values, as expected.

When I run

SELECT a.in_range from (SELECT dist_in_range(5000.0, latitude, longitude, 38.897957, -77.036560) as in_range 
from test_2) as a
where a.in_range = false;

to filter for false, it again runs fine.

If I add a count() function to the query, for example:

SELECT count(a.in_range) from (SELECT dist_in_range(5000.0, latitude, longitude, 38.897957, -77.036560) as in_range 
from test_2) as a
where a.in_range = false;

it returns the error:

[Amazon](500310) Invalid operation: TypeError: a float is required. Please look at svl_udf_log for more information Details: ----------------------------------------------- error: TypeError: a float is required. Please look at svl_udf_log for more information code: 10000 context: UDF query: 1766 location: udf_client.cpp:369 process: query1_995_1766 [pid=50711] -----------------------------------------------;

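This error seems to indicate a problem with the UDF and its inputs, but as shown above, the UDF works fine on its own. I thought that using count() on the result was just a SQL query that counts the items returned as false. Why does an error occur when trying to count the results of the UDF?

UPDATE/EDIT: I am starting to believe this is some kind of precision error happening in Python 2.7 (the version the Redshift docs state it uses). Here is the UDF I am running (credit to https://skipperkongen.dk/category/spatial/ for the code; I just made additions):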

CREATE OR REPLACE FUNCTION dist_in_range (radius float, lat1 float, lon1 float, lat2 float, lon2 float)
RETURNS bool IMMUTABLE AS
$$
    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np
    earth_radius_miles = 3956.0
    def dist_in_range(radius, lat1, lon1, lat2, lon2):
        """checks if a point is within int number of miles of second set of points.
        """
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = float(lat2 - lat1), float(lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        if float(earth_radius_miles * great_circle_distance) < float(radius):
            return True
        else:
            return False
    return dist_in_range(radius, lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;

On the dataset I am testing, if I run this query:

SELECT dist_in_range(40, latitude, longitude, 20.652975, -87.102572) as in_range from test_2
where in_range = true;

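it returns results with no error. If I lower the radius below 40, I start getting the "a float is required" error, unless I set WHERE in_range = false, in which case it again returns results without error.

I have been checking smaller radii in a Jupyter notebook, and in some cases, when printing the calculation steps, I get very small numbers such as 1.0134428420666964e-13. So I am wondering whether this is a precision issue in Python 2.7, and whether there is anything I can do to adjust for it.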
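One way to check the precision theory locally is to print the intermediate haversine term for two points that are very close together. The sketch below reuses the formula from the UDF; the function name haversine_terms and the second coordinate pair are made up for illustration. It suggests that a tiny intermediate value is still an ordinary float that sqrt and asin accept, so small numbers alone would not be expected to produce a TypeError.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3956.0  # same constant as in the UDF

def haversine_terms(lat1, lon1, lat2, lon2):
    """Return (a, distance_miles) using the same haversine steps as the UDF."""
    lat1, lon1 = radians(lat1), radians(lon1)
    lat2, lon2 = radians(lat2), radians(lon2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    distance = EARTH_RADIUS_MILES * 2 * asin(min(1.0, sqrt(a)))
    return a, distance

# Hypothetical pair of points differing only in the sixth decimal place.
a, miles = haversine_terms(20.652975, -87.102572, 20.652976, -87.102573)
print(type(a).__name__, a, miles)
```

If the type printed here is float and the distance is a small positive number, the arithmetic itself is behaving normally, and the type of the values passed into the function is the next thing worth checking.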

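Finally, the log referenced by the AWS error gives no further information: it just parrots the "TypeError: a float is required" message and points to lines 11 and 21 of the UDF, but line 11 is a comment and line 21 is the else: return False line.

Redshift now supports a GEOMETRY data type for spatial data, with 40+ high-performance native functions.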

https://docs.aws.amazon.com/redshift/latest/dg/geospatial-overview.html
https://docs.aws.amazon.com/redshift/latest/dg/geospatial-functions.html
https://docs.aws.amazon.com/redshift/latest/dg/spatial-limitations.html

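I originally created and loaded the table in Redshift with the latitude/longitude columns typed as NUMERIC with precision (9,6) (I had seen this recommended for handling latitude/longitude values). I reloaded the table with the data type changed to FLOAT8, and now it works fine.

I had wrongly assumed that numbers with 6 digits after the decimal point would be treated as floats, but that is not the case.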
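The distinction can be reproduced in plain Python: a fixed-precision decimal value is not a float, however many digits follow the decimal point. The sketch below is a defensive variant of the distance check, assuming the inputs might arrive as decimal.Decimal or as None for NULL columns. How Redshift actually hands NUMERIC or NULL values to this UDF is an assumption here, not something confirmed in this thread, and to_float is a hypothetical helper.

```python
from decimal import Decimal
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3956.0  # same constant as in the UDF

def to_float(value):
    """Hypothetical helper: coerce Decimal/int/float to float, pass None through."""
    return None if value is None else float(value)

def dist_in_range(radius, lat1, lon1, lat2, lon2):
    """Return True if the two points are within `radius` miles, None on missing input."""
    args = [to_float(v) for v in (radius, lat1, lon1, lat2, lon2)]
    if any(v is None for v in args):
        return None  # assumption: a missing input should give a missing result
    radius, lat1, lon1, lat2, lon2 = args
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2.0) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2.0) ** 2)
    return EARTH_RADIUS_MILES * 2 * asin(min(1.0, sqrt(a))) < radius

print(isinstance(Decimal("38.897957"), float))  # False
print(dist_in_range(40, Decimal("38.897957"), Decimal("-77.036560"),
                    38.897957, -77.036560))      # True (same point)
```

Converting at the top of the function keeps the rest of the arithmetic in floats regardless of the column type, which is roughly what changing the columns to FLOAT8 achieves at the table level.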
