How to drop rows using a UDF and pandas instead of a Python for loop



I have a problem: how can I turn this for loop into a UDF?

import cld3

ind_err = []   # indices of the rows to drop
cnt = 0        # rows detected as English but with probability below 0.85
cnt_NOT = 0    # rows detected as non-English
for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if lan != 'en':
        cnt_NOT += 1
        ind_err.append(index)
    elif lan == 'en' and probability < 0.85:
        cnt += 1
        ind_err.append(index)
pandasDF = pandasDF.drop(labels=ind_err, axis=0)

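This loop goes over every row of the pandas DataFrame and uses cld3 to check which rows are English and which are not, so that the data can be cleaned. The indices of the rows to remove are collected in a list and then dropped with .drop(labels=ind_err, axis=0).
This is the data I have: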

+--------------------+-----+
|             content|score|
+--------------------+-----+
|           what sapp|    1|
|               right|    5|
|ciao mamma mi pia...|    1|
|bounjourn whatsa ...|    5|
|hola amigos te qu...|    5|
|excellent thank y...|    5|
|            whatsapp|    1|
|so frustrating i ...|    1|
|unable to update ...|    1|
|            whatsapp|    1|
+--------------------+-----+

This is the data I want to remove:

|ciao mamma mi pia...|    1|
|bounjourn whatsa ...|    5|
|hola amigos te qu...|    5|

And this is the DataFrame I want to end up with:

+--------------------+-----+
|             content|score|
+--------------------+-----+
|           what sapp|    1|
|               right|    5|
|excellent thank y...|    5|
|            whatsapp|    1|
|so frustrating i ...|    1|
|unable to update ...|    1|
|            whatsapp|    1|
+--------------------+-----+

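The problem with this loop is that it is very slow, because there are 1,119,778 rows. I know that PySpark's withColumn is much faster, but honestly I don't know how to select the rows I want to delete and then remove them.
How can I turn the for loop into a function and make the language detection faster?
My environment is Google Colab.
Thanks a lot.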

You could probably do something like this:

import cld3
from pyspark.sql import functions as F, types as T

# assuming df is your (Spark) dataframe

@F.udf(T.BooleanType())
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

# where() returns a new dataframe containing only the rows where the UDF is True
df.where(is_english(F.col("content")))

Actually, I don't really understand why you want to go through Spark. Using pandas properly should solve your problem:

# I used your example, so I only have the truncated text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.loc[df["content"].apply(is_english)]
                content
8  unable to update ...
# That's the only row from your truncated example that matches your criteria
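
If many reviews are repeated (like "whatsapp" in your sample), one thing you could try is to run the detector only once per distinct text and then map the result back onto every row. The following is only a sketch under some assumptions: keep_english and fake_detect are names I made up, the column is assumed to contain only strings, and fake_detect is a toy stand-in for cld3.get_language so the example runs without cld3 installed. I have not benchmarked it; how much it helps depends on how many duplicates your data has.

```python
import pandas as pd


def keep_english(df, detect, column="content", min_prob=0.85):
    """Keep rows whose text is detected as English with enough confidence.

    `detect` is any callable shaped like cld3.get_language: it returns a
    tuple starting with (language, probability), or None if it cannot decide.
    """
    verdict = {}
    # Call the detector once per distinct text; duplicates reuse the result.
    for text in df[column].drop_duplicates():
        result = detect(text)
        verdict[text] = (
            result is not None and result[0] == "en" and result[1] >= min_prob
        )
    # Boolean mask aligned with df's index, True for rows to keep.
    mask = df[column].map(verdict).astype(bool)
    return df[mask]


seen = []  # records every text the stand-in detector is called with


def fake_detect(text):
    # Hypothetical stand-in for cld3.get_language (NOT the real detector).
    seen.append(text)
    if text.startswith(("ciao", "hola")):
        return ("xx", 0.99, True, 1.0)
    return ("en", 0.99, True, 1.0)


# Toy data loosely based on the sample in the question.
df = pd.DataFrame({
    "content": ["what sapp", "right", "ciao mamma", "hola amigos",
                "whatsapp", "whatsapp"],
    "score": [1, 5, 1, 5, 1, 1],
})
clean = keep_english(df, fake_detect)
```

With the real library you would pass cld3.get_language instead of fake_detect, i.e. keep_english(pandasDF, cld3.get_language). The original index labels are preserved, so you can still see which rows were dropped.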
