PostgreSQL,R:将表的所有行相乘以创建面板数据(时间序列)



我有一个包含 320 万行的表buildings。我需要将此表扩展到 11 个不同的周期,以将其作为(平衡的(面板数据进行处理。这意味着每个天体都有11个不同的年份(从2000年到2010年(需要观测。句点应称为:

2000
2001
...
2009
2010

表定义

CREATE TABLE public.buildings
(
  gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
  osm_id character varying(11),
  name character varying(48),
  type character varying(16),
  geom geometry(MultiPolygon,4326),
  centroid geometry(Point,4326),
  gembez character varying(50),
  gemname character varying(50),
  krsbez character varying(50),
  krsname character varying(50),
  pv boolean,
  gr smallint,
  capac double precision,
  instdate date,
  pvid integer,
  dist double precision,
  gemewz integer,
  n500 integer,
  ibase double precision,
  popden integer,
  instp smallint,
  b2000 double precision,
  b2001 double precision,
  b2002 double precision,
  b2003 double precision,
  b2004 double precision,
  b2005 double precision,
  b2006 double precision,
  b2007 double precision,
  b2008 double precision,
  b2009 double precision,
  b2010 double precision,
  ibase_id integer[],
  ibase_dist integer[],
  CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE public.buildings
  OWNER TO postgres;
CREATE INDEX build_centroid_gix
  ON public.buildings
  USING gist
  (st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
  ON public.buildings
  USING gist
  (geom);

我想将数据用于 R 中的回归分析。

ibase_id是一组gidibase_dist是一个相关的数组,其距离为gid到obejct的距离。两个数组的长度始终相同。

数组中的gid属于buildings的记录,这些记录位于centroid周围500m的半径内,是obeject的中心,并且有pv=TRUE(这意味着distinstdateinstpcapacpvidNOT NULL(。

SELECT a.gid AS buildid, array_agg(b.gid) AS ibase_id, array_agg(round(ST_Distance(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467))::integer)) AS ibase_dist
  FROM buildings a
  LEFT JOIN (SELECT * FROM buildings WHERE pv=TRUE) AS b ON ST_DWithin(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467), 500.0)
      AND a.gid <> b.gid
  GROUP BY a.gid

例:

ibase_id: {3075528,409073,322311,226643,833798,322344,226609} ;

ibase_dist {290,293,398,494,411,381,384}

UPDATE buildings
SET ibase=SUM(1/s)
FROM unnest(SELECT ibasedist FROM buildings WHERE (SELECT instp 
       FROM buildings 
       WHERE gid IN unnest(ibase_id))<year) s

对于每个周期,仅应考虑阵列的条目,其年份在面板数据的观察周期之前。(上面的查询还不起作用,因为我需要先对数组进行排序(现在,这两个阵列保存了所有年份的信息。这就是为什么我认为应该将它们添加到每个时间段,以便在扩展到面板数据后,我计算每条记录的ibase(11x 3,2百万(。

我不需要所有列进行回归分析。如果它会显着提高乘法的性能,我们可以坚持使用行(基本上省略几何列(:

   gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
      gembez character varying(50),
      gemname character varying(50),
      krsbez character varying(50),
      krsname character varying(50),
      pv boolean,
      gr smallint,
      capac double precision,
      dist double precision,
      gemewz integer,
      n500 integer,
      ibase double precision,
      popden integer,
      instp smallint,
      b2000 double precision,
      b2001 double precision,
      b2002 double precision,
      b2003 double precision,
      b2004 double precision,
      b2005 double precision,
      b2006 double precision,
      b2007 double precision,
      b2008 double precision,
      b2009 double precision,
      b2010 double precision,
      ibase_id integer[],
      ibase_dist integer[],
      CONSTRAINT buildings_pkey PRIMARY KEY (gid)
    )
    WITH (
      OIDS=FALSE

解决方案方法

我的基本想法是创建一个包含 11 个不同周期的第二个表periods并将该表与表buildings相乘。不知道如何实现这一点。不幸的是,我对 R 没有太多经验,并且还没有使用 R 的数据库接口。

使用 PostgreSQL 9.5beta2,由 Visual C++ build 1800、64 位和 R x64 3.2.1 编译

本质上,面板数据集是格式的数据,每条记录的重复年份作为时间列。您当前的结构是格式的。虽然 R 可以转换这个非常大的数据集,但 PostGreSQL 可以使用其引擎将所有年份堆叠在一个联合查询中,并将结果集传递给 R。请注意,某些数据类型(如几何对象和数组(可能无法正确转换为 R 数据类型,因此请将其删除或转换为字符串/数字类型。

下面是这样一个具有堆叠年份的 SQL UNION 查询。我不太确定您对ibase_idibase_dist或"乘法"方面是什么意思,但添加了Year列和相应的b列。让 R 脚本通过 RPostGreSQL 模块调用它。

import("RPostgreSQL")
# CREATE CONNECTION     
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "postgres",
                 host = "localhost", port = ####,
                 user = "username", password = "password")
strSQL <- "SELECT '2000' As year,  gid, gembez, gemname, krsbez,
                 krsname, pv, gr, capac, dist, gemewz, n500
                 popden, instp, b2000 As b, (1/ibase_dist) As ibase
           FROM public.buildings
           INNER JOIN
                (SELECT a.gid AS buildid, 
                        SUM(round(ST_Distance(
                                              ST_Transform(a.centroid, 31467),  
                                              ST_Transform(b.centroid, 31467)
                                  )::integer)) AS ibase_dist
               FROM buildings a
               LEFT JOIN buildings b 
                      ON ST_DWithin(ST_Transform(a.centroid, 31467), 
                                    ST_Transform(b.centroid, 31467), 500.0)
                    AND a.gid <> b.gid
               WHERE b.pv=True AND b.instp < a.instp
               GROUP BY a.gid) AS distSum
           ON public.buildings.gid = distSum.buildid
           WHERE public.buildings.instp = 2000
           UNION
           ...other SELECT statements for years 2001-2010..."              
# IMPORT QUERY RESULTSET INTO DATAFRAME
df <- dbGetQuery(con, strSQL)
# CLOSE CONNECTION
dbDisconnect(con)

但请确保您拥有操作大数据集所需的 RAM。您可能需要相应地分配内存。或者,您可以迭代地将每年的 SELECT 语句追加到不断增长的数据帧对象中,而不是一次加载所有数据帧对象。

# ...SAME CONNECTION SETUP AS ABOVE...
years = c('2000', '2001', '2002', '2003', '2004', '2005', 
          '2006', '2007', '2008', '2009', '2010')
# CREATES LIST OF YEAR DATA FRAME
dfList = lapply(years, 
                function(y) {
                # NOTICE CONCATENATION OF Y IN SELECT STATEMENT 
                strSQL <- paste0("SELECT '", y, "' As year,  gid, gembez, gemname, krsbez,
                                         krsname, pv, gr, capac, dist, gemewz, n500, 
                                         popden, instp, b", y, ", As b, (1/ibase_dist) As ibase, 
                                  FROM public.buildings
                                  INNER JOIN
                                    (SELECT a.gid AS buildid, 
                                          SUM(round(ST_Distance(
                                              ST_Transform(a.centroid, 31467),  
                                              ST_Transform(b.centroid, 31467)
                                          )::integer)) AS ibase_dist
                                     FROM buildings a
                                     LEFT JOIN buildings b 
                                     ON ST_DWithin(ST_Transform(a.centroid, 31467), 
                                                   ST_Transform(b.centroid, 31467), 500.0)
                                     AND a.gid <> b.gid
                                     WHERE b.pv=True AND b.instp < a.instp
                                     GROUP BY a.gid) AS distSum
                                  ON public.buildings.gid = distSum.buildid
                                  WHERE public.buildings.instp =", y)
                dbGetQuery(con, strSQL)                               
                })
# APPEND LIST OF DATA FRAMES INTO ONE LARGE DATA FRAME              
df <- do.call(rbind, dfList)
# REMOVE PREVIOUS LIST FOR MEMORY RESOURCES
rm(dfList)
# CLOSE CONNECTION
dbDisconnect(con)

我通过使用交叉连接和一个包含句点的临时表 t1 创建了 Paneldata 表。

CREATE TABLE public.t1
(
  period smallint
)
WITH (
  OIDS=FALSE
);

CREATE TABLE paneldata AS
(SELECT * 
FROM t1 CROSS JOIN 
    (SELECT gid, 
    gemname, 
    gembez, 
    krsname,
    krsbez,
    pv,
    gr,
    capac,
    dist,
    gemewz,
    n500,
    popden,
    instp
    FROM buildings) AS test
ORDER BY gid)

最新更新