R: NaN appears after as.numeric() when using dplyr on a sparklyr connection

I have started using sparklyr to work with a large dataset, so I have to do everything through pipelines.

But I ran into trouble when working with the data frame. I start by peeking at it:

csj %>% head()

Below is what my data looks like (screenshot omitted).

What I want to do is, first, create a new column, length_of_review, that counts the number of characters in reviewText, and another new column that shows the category.

So I wrote the code like this:

csj<- csj %>% mutate(length_of_review = nchar(csj$reviewText,keppNA=TRUE),
category ="Clothes_shoes_jewelry") %>% 
select(c('_c0',reviewerID,asin,helpful,length_of_review,overall,unixReviewTime,category))
csj %>% head()
Error: Invalid number of args to SQL LENGTH. Expecting 1

The category column works, but the length_of_review part does not. So I tried again with as.numeric:

csj<- csj %>% mutate(length_of_review = as.numeric(nchar(csj$reviewText)),
category ="Clothes_shoes_jewelry") %>% 
select(c('_c0',reviewerID,asin,helpful,length_of_review,overall,unixReviewTime,category))
csj %>% head()
#Source:   lazy query [?? x 8]
# Database: spark_connection
`_c0` reviewerID     asin  helpful length_of_review overall unixReviewTime category             
<int> <chr>          <chr> <chr>              <dbl> <chr>   <chr>          <chr>                
1     0 A1KLRMWW2FWPL4 31887 [0, 0]               NaN 5       1297468800     Clothes_shoes_jewelry
2     1 A2G5TCU2WDFZ65 31887 [0, 0]               NaN 5       1358553600     Clothes_shoes_jewelry
3     2 A1RLQXYNCMWRWN 31887 [0, 0]               NaN 5       1357257600     Clothes_shoes_jewelry
4     3 A8U3FAMSJVHS5  31887 [0, 0]               NaN 5       1398556800     Clothes_shoes_jewelry
5     4 A3GEOILWLK86XM 31887 [0, 0]               NaN 5       1394841600     Clothes_shoes_jewelry
6     5 A27UF1MSF3DB2  31887 [0, 0]               NaN 4       1396224000     Clothes_shoes_jewelry

It all turns into NaN :(

Also, a similar but separate problem concerns the helpful column. I want to make a new column called help = first number of helpful / second number of helpful. I asked about this on this site before, and I was given this code:

> csj %>%
+   mutate(col1 = as.numeric(stringi::stri_extract_first_regex(csj$helpful, pattern = "[0-9]")), # extract first number
+          col2 = as.numeric(stringi::stri_extract_last_regex(csj$helpful, pattern = "[0-9]")),  # extract second number
+          col3 = ifelse(col2 == 0, 1, col2), # change 0s to 1
+          help = col1/col3) # divide col1 by col3

# Source:   lazy query [?? x 12]
# Database: spark_connection
`_c0` reviewerID     asin  helpful length_of_review overall unixReviewTime category     col1  col2  col3  help
<int> <chr>          <chr> <chr>              <dbl> <chr>   <chr>          <chr>       <dbl> <dbl> <dbl> <dbl>
1     0 A1KLRMWW2FWPL4 31887 [0, 0]               NaN 5       1297468800     Clothes_sh~   NaN   NaN   NaN   NaN
2     1 A2G5TCU2WDFZ65 31887 [0, 0]               NaN 5       1358553600     Clothes_sh~   NaN   NaN   NaN   NaN
3     2 A1RLQXYNCMWRWN 31887 [0, 0]               NaN 5       1357257600     Clothes_sh~   NaN   NaN   NaN   NaN
4     3 A8U3FAMSJVHS5  31887 [0, 0]               NaN 5       1398556800     Clothes_sh~   NaN   NaN   NaN   NaN
5     4 A3GEOILWLK86XM 31887 [0, 0]               NaN 5       1394841600     Clothes_sh~   NaN   NaN   NaN   NaN
6     5 A27UF1MSF3DB2  31887 [0, 0]               NaN 4       1396224000     Clothes_sh~   NaN   NaN   NaN   NaN
7     6 A16GFPNVF4Y816 31887 [0, 0]               NaN 5       1399075200     Clothes_sh~   NaN   NaN   NaN   NaN
8     7 A2M2APVYIB2U6K 31887 [0, 0]               NaN 5       1356220800     Clothes_sh~   NaN   NaN   NaN   NaN
9     8 A1NJ71X3YPQNQ9 31887 [0, 0]               NaN 4       1384041600     Clothes_sh~   NaN   NaN   NaN   NaN
10     9 A3EERSWHAI6SO  31887 [7, 8]               NaN 5       1349568000     Clothes_sh~   NaN   NaN   NaN   NaN
# ... with more rows

Both of these look like they should work, but neither does. I cannot even begin my analysis because I have been stuck here for so long :(

Does anyone know why, and is there a solution? I would be really happy if I could fix these issues. Any help would be greatly appreciated!

Here is str(csj):

> str(csj)
List of 2
$ src:List of 1
..$ con:List of 10
.. ..$ master       : chr "local[4]"
.. ..$ method       : chr "shell"
.. ..$ app_name     : chr "sparklyr"
.. ..$ config       :List of 4
.. .. ..$ spark.env.SPARK_LOCAL_IP.local    : chr "127.0.0.1"
.. .. ..$ sparklyr.csv.embedded             : chr "^1.*"
.. .. ..$ sparklyr.cores.local              : int 4
.. .. ..$ spark.sql.shuffle.partitions.local: int 4
.. .. ..- attr(*, "config")= chr "default"
.. .. ..- attr(*, "file")= chr "C:\Users\ms\Documents\R\win-library\3.5\sparklyr\conf\config-template.yml"
.. ..$ spark_home   : chr "C:\spark"
.. ..$ backend      : 'sockconn' int 4
.. .. ..- attr(*, "conn_id")=<externalptr> 
.. ..$ monitor      : 'sockconn' int 3
.. .. ..- attr(*, "conn_id")=<externalptr> 
.. ..$ output_file  : chr "C:\Users\ms\AppData\Local\Temp\RtmpygTIca\file371068ce6a02_spark.log"
.. ..$ spark_context:Classes 'spark_jobj', 'shell_jobj' <environment: 0x00000000daa77a50> 
.. ..$ java_context :Classes 'spark_jobj', 'shell_jobj' <environment: 0x00000000daa365b8> 
.. ..- attr(*, "class")= chr [1:3] "spark_connection" "spark_shell_connection" "DBIConnection"
..- attr(*, "class")= chr [1:3] "src_spark" "src_sql" "src"
$ ops:List of 4
..$ name: chr "select"
..$ x   :List of 4
.. ..$ name: chr "mutate"
.. ..$ x   :List of 2
.. .. ..$ x   : 'ident' chr "review_csj"
.. .. ..$ vars: chr [1:7] "_c0" "reviewerID" "asin" "helpful" ...
.. .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
.. ..$ dots:List of 2
.. .. ..$ length_of_review: language ~as.numeric(nchar(NULL))
.. .. .. ..- attr(*, ".Environment")=<environment: 0x00000000dd5f2548> 
.. .. ..$ category        : language ~"Clothes_shoes_jewelry"
.. .. .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
.. ..$ args: list()
.. ..- attr(*, "class")= chr [1:3] "op_mutate" "op_single" "op"
..$ dots:List of 1
.. ..$ : language ~c("_c0", reviewerID, asin, helpful, length_of_review, overall, unixReviewTime, category)
.. .. ..- attr(*, ".Environment")=<environment: 0x00000000dd6190f0> 
.. ..- attr(*, "class")= chr "quosures"
..$ args: list()
..- attr(*, "class")= chr [1:3] "op_select" "op_single" "op"
- attr(*, "class")= chr [1:4] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
> 

Here is my session_info():

Session info ----------------------------------------------------------------------------------------------------
setting  value                       
version  R version 3.5.0 (2018-04-23)
system   x86_64, mingw32             
ui       RStudio (1.1.453)           
language (EN)                        
collate  English_United States.1252  
tz       Europe/Berlin               
date     2018-05-21                  
Packages --------------------------------------------------------------------------------------------------------
package      * version date       source        
assertthat     0.2.0   2017-04-11 CRAN (R 3.5.0)
backports      1.1.2   2017-12-13 CRAN (R 3.5.0)
base         * 3.5.0   2018-04-23 local         
base64enc      0.1-3   2015-07-28 CRAN (R 3.5.0)
bindr          0.1.1   2018-03-13 CRAN (R 3.5.0)
bindrcpp       0.2.2   2018-03-29 CRAN (R 3.5.0)
broom          0.4.4   2018-03-29 CRAN (R 3.5.0)
cli            1.0.0   2017-11-05 CRAN (R 3.5.0)
colorspace     1.3-2   2016-12-14 CRAN (R 3.5.0)
compiler       3.5.0   2018-04-23 local         
config         0.3     2018-03-27 CRAN (R 3.5.0)
crayon         1.3.4   2017-09-16 CRAN (R 3.5.0)
datasets     * 3.5.0   2018-04-23 local         
DBI            1.0.0   2018-05-02 CRAN (R 3.5.0)
dbplyr         1.2.1   2018-02-19 CRAN (R 3.5.0)
devtools     * 1.13.5  2018-02-18 CRAN (R 3.5.0)
digest       * 0.6.15  2018-01-28 CRAN (R 3.5.0)
dplyr        * 0.7.5   2018-05-19 CRAN (R 3.5.0)
foreign        0.8-70  2017-11-28 CRAN (R 3.5.0)
ggplot2      * 2.2.1   2016-12-30 CRAN (R 3.5.0)
glue           1.2.0   2017-10-29 CRAN (R 3.5.0)
graphics     * 3.5.0   2018-04-23 local         
grDevices    * 3.5.0   2018-04-23 local         
grid         * 3.5.0   2018-04-23 local         
gtable         0.2.0   2016-02-26 CRAN (R 3.5.0)
hms            0.4.2   2018-03-10 CRAN (R 3.5.0)
htmltools      0.3.6   2017-04-28 CRAN (R 3.5.0)
httpuv         1.4.3   2018-05-10 CRAN (R 3.5.0)
httr           1.3.1   2017-08-20 CRAN (R 3.5.0)
jsonlite       1.5     2017-06-01 CRAN (R 3.5.0)
later          0.7.2   2018-05-01 CRAN (R 3.5.0)
lattice        0.20-35 2017-03-25 CRAN (R 3.5.0)
lazyeval       0.2.1   2017-10-29 CRAN (R 3.5.0)
magrittr       1.5     2014-11-22 CRAN (R 3.5.0)
memoise        1.1.0   2017-04-21 CRAN (R 3.5.0)
methods      * 3.5.0   2018-04-23 local         
mime           0.5     2016-07-07 CRAN (R 3.5.0)
mnormt         1.5-5   2016-10-15 CRAN (R 3.5.0)
munsell        0.4.3   2016-02-13 CRAN (R 3.5.0)
nlme           3.1-137 2018-04-07 CRAN (R 3.5.0)
openssl        1.0.1   2018-03-03 CRAN (R 3.5.0)
parallel       3.5.0   2018-04-23 local         
pillar         1.2.2   2018-04-26 CRAN (R 3.5.0)
pkgconfig      2.0.1   2017-03-21 CRAN (R 3.5.0)
plyr           1.8.4   2016-06-08 CRAN (R 3.5.0)
promises       1.0.1   2018-04-13 CRAN (R 3.5.0)
psych          1.8.4   2018-05-06 CRAN (R 3.5.0)
purrr          0.2.4   2017-10-18 CRAN (R 3.5.0)
R6             2.2.2   2017-06-17 CRAN (R 3.5.0)
RColorBrewer * 1.1-2   2014-12-07 CRAN (R 3.5.0)
Rcpp           0.12.17 2018-05-18 CRAN (R 3.5.0)
readr        * 1.1.1   2017-05-16 CRAN (R 3.5.0)
reshape2       1.4.3   2017-12-11 CRAN (R 3.5.0)
rlang          0.2.0   2018-02-20 CRAN (R 3.5.0)
rprojroot      1.3-2   2018-01-03 CRAN (R 3.5.0)
rstudioapi     0.7     2017-09-07 CRAN (R 3.5.0)
scales       * 0.5.0   2017-08-24 CRAN (R 3.5.0)
shiny          1.1.0   2018-05-17 CRAN (R 3.5.0)
sparklyr     * 0.8.3   2018-05-12 CRAN (R 3.5.0)
stats        * 3.5.0   2018-04-23 local         
stringi      * 1.1.7   2018-03-12 CRAN (R 3.5.0)
stringr      * 1.3.1   2018-05-10 CRAN (R 3.5.0)
tibble         1.4.2   2018-01-22 CRAN (R 3.5.0)
tidyr        * 0.8.1   2018-05-18 CRAN (R 3.5.0)
tidyselect     0.2.4   2018-02-26 CRAN (R 3.5.0)
tools          3.5.0   2018-04-23 local         
utf8           1.1.3   2018-01-03 CRAN (R 3.5.0)
utils        * 3.5.0   2018-04-23 local         
withr          2.1.2   2018-03-15 CRAN (R 3.5.0)
xtable         1.8-2   2016-02-05 CRAN (R 3.5.0)
yaml           2.1.19  2018-05-01 CRAN (R 3.5.0)
> 

You are making two basic mistakes here, and the first one actually masks the second. Let's trace it step by step with a small reproducible example:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

df <- copy_to(
  sc, tibble(id = 1, helpful = "[0, 0]", reviewText = "Here goes some text")
)
  • The first mistake is trying to access columns with $. tbl_spark objects (and, as far as I know, other database-backed objects such as tbl_sql) do not support $ access:

    > df$a
    NULL
    

    In general you should never reference columns like this with dplyr. In your case csj$reviewText silently evaluated to NULL, so the captured expression became ~as.numeric(nchar(NULL)) (you can see it in the str(csj) output above), which is why you got NaN instead of an error.

  • sparklyr does not support executing arbitrary R code directly (except through functions such as spark_apply; these are, however, very inefficient, as illustrated in the sketch after this list).

    Therefore, if you dropped the $, the query would actually fail:

    > df %>% mutate(length_of_review = nchar(reviewText, keppNA=TRUE))
    Error: Invalid number of args to SQL LENGTH. Expecting 1
    

    You can make it work by omitting the R-specific argument:

    > df %>% mutate(length_of_review = nchar(reviewText))
    # Source:   lazy query [?? x 4]
    # Database: spark_connection
    id helpful reviewText          length_of_review
    <dbl> <chr>   <chr>                          <int>
    1     1 [0, 0]  Here goes some text               19
    

    This is translated into a call to Spark's native length function (note that it is not base::length):

    > df %>% mutate(length_of_review = length(reviewText))
    # Source:   lazy query [?? x 4]
    # Database: spark_connection
    id helpful reviewText          length_of_review
    <dbl> <chr>   <chr>                          <int>
    1     1 [0, 0]  Here goes some text               19
    
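For completeness, here is a minimal sketch of the spark_apply() fallback mentioned in the second point above. The closure receives each partition as a plain R data.frame, so ordinary R functions such as nchar() work, but every partition has to be shipped through an external R process, which is why it is slow. The columns argument is spelled out here because the closure adds a column to the toy df:

df %>%
  spark_apply(
    function(pdf) {
      # plain R, evaluated once per partition
      pdf$length_of_review <- nchar(pdf$reviewText)
      pdf
    },
    columns = c("id", "helpful", "reviewText", "length_of_review")
  )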

In general, operations applied to a connection are translated into SQL queries, so you should use functions that are available on the particular platform (or aliased by dplyr). For the list of available functions and their arguments, you can check the official Scala documentation.
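If you are not sure how a given pipeline will be translated, you can print the generated SQL with show_query() before running anything. A quick check on the toy df prints something like the following (the temporary table name is generated per session, so it is elided here):

df %>%
  mutate(length_of_review = nchar(reviewText)) %>%
  show_query()
# <SQL>
# SELECT `id`, `helpful`, `reviewText`, LENGTH(`reviewText`) AS `length_of_review`
# FROM `sparklyr_tmp_...`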

Regarding your questions:

  • To compute the length, use the length function (see above).
  • To process strings with regular expressions, use the regexp_extract function, which takes Java regular expressions:

    > df %>% mutate(col1 = as.numeric(regexp_extract(helpful, "^\\[([0-9]+), ([0-9]+)\\]$", 1)))
    # Source:   lazy query [?? x 4]
    # Database: spark_connection
    id helpful reviewText           col1
    <dbl> <chr>   <chr>               <dbl>
    1     1 [0, 0]  Here goes some text     0
    

    You could also use spark_apply here but, as stated above, it is not an efficient approach. A sketch of the complete help ratio follows below.
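Putting the pieces together for the help column, a sketch building on the snippet above (it assumes helpful always has the [m, n] layout shown in your data; ifelse() is translated to a SQL CASE WHEN, which guards the division against zeros):

df %>%
  mutate(
    col1 = as.numeric(regexp_extract(helpful, "^\\[([0-9]+), ([0-9]+)\\]$", 1)),
    col2 = as.numeric(regexp_extract(helpful, "^\\[([0-9]+), ([0-9]+)\\]$", 2)),
    help = col1 / ifelse(col2 == 0, 1, col2)
  )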
