How to filter a DataFrame column by the regex value of another column in the same DataFrame in PySpark



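I am trying to keep only the rows of a DataFrame where one column matches the regex pattern given in another column of the same row.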

df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],  
['query','question_type','regex_patt'])
+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what is the movie that features Tom Cruise|artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+

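I want to trim the DataFrame so that only the rows whose query matches the value in the regex_patt column are kept.
The final result should look like this: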

+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+

I was thinking of something like:

df.filter(column('query').rlike('regex_patt'))

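But rlike only accepts a regex string, so the call above treats 'regex_patt' as a literal pattern rather than as a reference to the column.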

So the question is: how can I filter the "query" column based on the regex value in the "regex_patt" column?

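You can try this. Writing the call inside expr lets you pass columns as both the string and the pattern. regexp_extract returns an empty string when the pattern does not match, so filtering out the empty results keeps only the matching rows.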

from pyspark.sql import functions as F
df.withColumn("query1", F.expr("""regexp_extract(query, regex_patt)""")).filter(F.col("query1")!='').drop("query1").show(truncate=False)
+------------------------------------------+-------------+---------------------------------------------+
|query                                     |question_type|regex_patt                                   |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia?    |artist_song  |(who|what).*(sing|sang|perform)              |
+------------------------------------------+-------------+---------------------------------------------+
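
One caveat worth checking before relying on this: when no group index is given, regexp_extract returns capture group 1, and that group can be empty even though the pattern as a whole matched (for example when the match goes through the second branch of an alternation). The sketch below illustrates the difference in plain Python with the re module rather than Spark. The fourth row and the helper names group1_nonempty and matches are hypothetical, added only for illustration, and Spark uses Java regular expressions, which I am assuming behave the same way for these simple patterns.

```python
import re

# The three rows from the question, plus one hypothetical row (not in the
# original data) whose match goes through the second branch of the alternation.
rows = [
    ("what is the movie that features Tom Cruise", "actor_movies",
     r"(movie|film).*(feature)|(in|on).*(movie|film)"),
    ("what is the movie that features Tom Cruise", "artist_song",
     r"(who|what).*(sing|sang|perform)"),
    ("who is the singer for hotel califonia?", "artist_song",
     r"(who|what).*(sing|sang|perform)"),
    ("songs in the film Top Gun", "actor_movies",  # hypothetical extra row
     r"(movie|film).*(feature)|(in|on).*(movie|film)"),
]


def group1_nonempty(query, pattern):
    """Mimic the 'extract group 1 and keep non-empty results' filter."""
    m = re.search(pattern, query)
    return bool(m and m.group(1))


def matches(query, pattern):
    """Plain boolean test: does the pattern match anywhere in the query?"""
    return re.search(pattern, query) is not None


kept_by_group1 = [r for r in rows if group1_nonempty(r[0], r[2])]
kept_by_match = [r for r in rows if matches(r[0], r[2])]

print(len(kept_by_group1))  # the hypothetical row is dropped here
print(len(kept_by_match))   # the hypothetical row is kept here
```

For the three rows in the question both approaches agree, so the answer above gives the expected result. If your patterns can match without filling group 1, a boolean test is safer. In Spark SQL that would probably be the rlike operator written inside expr, along the lines of df.filter(F.expr("query rlike regex_patt")), but I have not run it against this data, so verify it on your Spark version.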
