I have a question... how can I turn a for loop into a UDF function?
import cld3

ind_err = []
cnt = 0      # English rows below the confidence threshold
cnt_NOT = 0  # non-English rows

for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if lan != 'en':
        cnt_NOT += 1
        ind_err.append(index)
    elif lan == 'en' and probability < 0.85:
        cnt += 1
        ind_err.append(index)

pandasDF = pandasDF.drop(labels=ind_err, axis=0)
This loop goes over every row of the pandas DataFrame and uses cld3 to check which rows are English and which are not, so the data can be cleaned. It saves the indices in a list so the rows can be removed with .drop(labels=ind_err, axis=0).
This is the data I have:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
These are the rows I want to remove:
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
And this is the DataFrame I want to end up with:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
The problem with this loop is that it is very slow, because there are 1,119,778 rows. I know PySpark's withColumn is much faster, but honestly I have no idea how to select the rows to remove and then drop them.
How can I turn the for loop into a function and make the language detection faster?
My environment is Google Colab.
Thanks a lot!
You could probably do something like this:
import cld3
from pyspark.sql import functions as F, types as T

# assuming df is your DataFrame
@F.udf(T.BooleanType())
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.where(is_english(F.col("content")))
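One caveat: cld3.get_language can return None for input it cannot analyze (an empty string, for instance), which would make the tuple unpacking above raise an error. A more defensive predicate might look like the sketch below; fake_get_language is a hypothetical stub standing in for cld3 so the snippet runs without the library installed:

```python
from collections import namedtuple

# Mimics the shape of a cld3 prediction; stands in for the real library here.
Prediction = namedtuple("Prediction", "language probability is_reliable proportion")

def fake_get_language(text):
    # cld3.get_language returns None for input it cannot analyze
    if not text:
        return None
    return Prediction("en", 0.99, True, 1.0)

def is_english(content, get_language=fake_get_language):
    pred = get_language(content)
    if pred is None:
        return False  # treat undetectable text as non-English
    return pred.language == "en" and pred.probability >= 0.85
```

Injecting the detector as a parameter also makes the predicate easy to unit-test before registering it as a UDF; in production you would pass cld3.get_language.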
Actually, I don't really see why you want to go through Spark at all. Using pandas properly should solve your problem:
# I used your example, so I only have partial text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.loc[df["content"].apply(is_english)]
content
8 unable to update ...
# That's the only line from your truncated example that matches your criteria
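To show the mechanics of the mask-based .loc filter end to end, here is a self-contained sketch; looks_english is a toy keyword-matching stand-in for cld3, not real language detection:

```python
import pandas as pd

# Hypothetical stand-in for cld3 detection: keep rows containing an English keyword.
def looks_english(text):
    english_words = {"right", "excellent", "thank", "whatsapp", "unable"}
    return any(w in text.lower() for w in english_words)

df = pd.DataFrame({
    "content": ["right", "ciao mamma", "excellent thank you"],
    "score": [5, 1, 5],
})

# apply() builds a boolean mask; .loc keeps only the rows where it is True,
# so no index bookkeeping or .drop call is needed.
kept = df.loc[df["content"].apply(looks_english)]
print(kept["content"].tolist())  # ['right', 'excellent thank you']
```

The same one-liner scales to the full DataFrame: swap looks_english for a cld3-backed predicate and the row indices stay aligned automatically.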