将表中的数据插入到一个新的表中



好吧,这件事已经困扰了我一整天了。我有两个表(例如original_table和new_table)。新表是空的,我需要在给定以下条件下用original_table中的记录填充它:

  • 行程持续时间必须至少30秒
  • 仅包括从那里开始至少有100次行程的车站
  • 仅包括至少有100次行程结束的车站

持续时间部分很容易,但我发现很难过滤其他两个条件。我试着像这样创建两个临时表:

CREATE TEMP TABLE start_stations AS(
SELECT ARRAY(SELECT DISTINCT start_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY start_station_id 
HAVING COUNT(start_station_id)>=100
AND COUNT(end_station_id)>=100) as arr
);
CREATE TEMP TABLE end_stations AS(
SELECT ARRAY(SELECT DISTINCT end_station_id FROM `dataset.original_table`
WHERE duration_sec >= 30
GROUP BY end_station_id 
HAVING COUNT(end_station_id)>=100
AND COUNT(start_station_id)>=100) as arr
);

然后尝试像这样插入new_table:

INSERT INTO `dataset.new_table` 
SELECT a.* FROM `dataset.original_table` as a, start_stations as ss,
end_stations as es
WHERE a.start_station_id IN UNNEST(ss.arr)
AND a.end_station_id IN UNNEST(es.arr)

然而,这并没有给我提供正确的答案。我试图做一个临时的函数来清理数据,但我没有走远。(

下面是表格的示例:

trip_id|duration_sec|start_date|start_station_id|  end_date|end_station_id|
--------------------------------------------------------------------------|
afad333|         231|2017-12-20|             210|2017-12-20|           355|
sffde56|          35|2017-12-12|             355|2017-12-12|           210|
af33445|         333|2018-10-27|             650|2018-10-27|           650|
dd1238d|         456|2017-09-15|             123|2017-09-15|           210|
dsa2223|         500|2017-09-15|             210|2017-09-15|           123|
  ...
如果你能帮助我,我将非常感激。提前感谢!

方法应为

with major_stations as(
select start_station_id station_id
from trips
group by start_station_id
having count(*) > 100
union 
select end_station_id station_id
from trips
group by end_station_id
having count(*) > 100
)
select *
from trips
where start_station_id in (select station_id from major_stations)
and trip_duration > 30

可能有一些简单的方法,但这是我想到的第一种方法。

所以我发现我的问题是什么。因为我必须过滤出100次旅行开始和结束的站点,所以我以前的方法是错误的。

我现在的答案是:

INSERT INTO dataset.new_table 
WITH stations AS (
SELECT start_station_id, end_station_id FROM dataset.original_table
GROUP BY start_station_id, end_station_id
HAVING count(start_station_id)>=100
AND count(end_station_id)>=100
)

SELECT a.* FROM dataset.original_table AS a, stations as s
WHERE a.start_station_id = s.start_station_id
AND a.end_station_id = s.end_station_id
AND a.duration_sec >= 30

这样我只创建了一个WITH子句,它只过滤开始端站,根据给定的标准。虽然看起来很简单,但显然我的大脑有时需要休息一下,从一个新的角度开始。

相关内容