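I want to avoid looping over every element of a PyArrow column and applying the unidecode function from the unidecode package, just to build a list of transliterated elements that is then converted back into a PyArrow column. Does PyArrow have a function that does this more efficiently? My current approach takes a long time for columns with more than 1 million elements. I looked through PyArrow's compute module but did not find anything useful. This is what I am doing now: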
from unidecode import unidecode
import pyarrow
import pyarrow.compute as pc

# Count each distinct value first, so unidecode runs once per unique value.
pc_value_counts = pc.value_counts(tmp_column)
value_counts = dict()
for record in pc_value_counts:
    record_py = record['values'].as_py()
    # Only transliterate non-numeric strings; leave everything else untouched.
    if isinstance(record_py, str) and not record_py.isdigit():
        unique_value = unidecode(record_py)
    else:
        unique_value = record_py
    # Distinct inputs can collapse to the same output, so merge their counts.
    value_counts[unique_value] = value_counts.get(unique_value, 0) + record['counts'].as_py()
table = pyarrow.table([pyarrow.array(list(value_counts.keys())), pyarrow.array(list(value_counts.values()))],
                      schema=pyarrow.schema([pyarrow.field('unique_values', pyarrow.string()),
                                             pyarrow.field('value_counts', pyarrow.int32())]))
where tmp_column is a PyArrow column.
Can you express what you want to do using one of the kernels listed here: https://arrow.apache.org/docs/cpp/compute.html?#string-transforms ?