好吧,这件事已经困扰了我一整天了。我有两个表(例如original_table和new_table)。新表是空的,我需要在给定以下条件下用original_table中的记录填充它:
- 行程持续时间必须至少30秒
- 仅包括从那里开始至少有100次行程的车站
- 仅包括至少有100次行程结束的车站
持续时间部分很容易,但我发现很难过滤其他两个条件。我试着像这样创建两个临时表:
CREATE TEMP TABLE start_stations AS(
SELECT ARRAY(SELECT DISTINCT start_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY start_station_id
HAVING COUNT(start_station_id)>=100
AND COUNT(end_station_id)>=100) as arr
);
CREATE TEMP TABLE end_stations AS(
SELECT ARRAY(SELECT DISTINCT end_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY end_station_id
HAVING COUNT(end_station_id)>=100
AND COUNT(start_station_id)>=100) as arr
);
然后尝试像这样插入new_table:
INSERT INTO `dataset.new_table`
SELECT a.* FROM `dataset.original_table` as a, start_stations as ss,
end_stations as es
WHERE a.start_station_id IN UNNEST(ss.arr)
AND a.end_station_id IN UNNEST(es.arr)
然而,这并没有给我提供正确的答案。我试图做一个临时的函数来清理数据,但我没有走远。(
下面是表格的示例:
trip_id|duration_sec|start_date|start_station_id| end_date|end_station_id|
--------------------------------------------------------------------------|
afad333| 231|2017-12-20| 210|2017-12-20| 355|
sffde56| 35|2017-12-12| 355|2017-12-12| 210|
af33445| 333|2018-10-27| 650|2018-10-27| 650|
dd1238d| 456|2017-09-15| 123|2017-09-15| 210|
dsa2223| 500|2017-09-15| 210|2017-09-15| 123|
...
如果你能帮助我,我将非常感激。提前感谢!方法应为
with major_stations as(
select start_station_id station_id
from trips
group by start_station_id
having count(*) > 100
union
select end_station_id station_id
from trips
group by end_station_id
having count(*) > 100
)
select *
from trips
where start_station_id in (select station_id from major_stations)
and trip_duration > 30
可能有一些简单的方法,但这是我想到的第一种方法。
所以我发现我的问题是什么。因为我必须过滤出100次旅行开始和结束的站点,所以我以前的方法是错误的。
我现在的答案是:
INSERT INTO dataset.new_table
WITH stations AS (
SELECT start_station_id, end_station_id FROM dataset.original_table
GROUP BY start_station_id, end_station_id
HAVING count(start_station_id)>=100
AND count(end_station_id)>=100
)
SELECT a.* FROM dataset.original_table AS a, stations as s
WHERE a.start_station_id = s.start_station_id
AND a.end_station_id = s.end_station_id
AND a.duration_sec >= 30
这样我只创建了一个WITH子句,它只过滤开始和端站,根据给定的标准。虽然看起来很简单,但显然我的大脑有时需要休息一下,从一个新的角度开始。