Sort a pandas string column by the first few characters of the string



I have a column in a dataframe that holds a uuid with some other file information appended:

ff8738hjgdj792__somevar1.txt
9jldh93k4043ik__some3var.txt

I want to sort the dataframe by the leading uuid field (everything up to the double underscore), ignoring the rest of the attached string.

Currently I do:

df.sort_values(by='df_column_name')

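But this does not produce the desired result, because pandas considers the entire string.

How can I achieve this with pandas?

pandas 1.1.0+ has a key parameter for sort_values. It works much like the key argument of Python's built-in sort, except that the function receives the whole column as a Series and must return a Series of the same length.

Sample df: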

col1
0  ff8738hjgdj792__somevar1.txt
1  9jldh93k4043ik__some3var.txt
df['col1'].sort_values(key=lambda x: x.str.split('__').str[0])
Out[809]:
1    9jldh93k4043ik__some3var.txt
0    ff8738hjgdj792__somevar1.txt
Name: col1, dtype: object

df_final = df.sort_values(by='col1', key=lambda x: x.str.split('__').str[0])
df_final
Out[812]:
col1
1  9jldh93k4043ik__some3var.txt
0  ff8738hjgdj792__somevar1.txt
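
For reference, here is a self-contained sketch of the same idea that can be pasted and run directly. The third row and the names by_key, prefix and by_helper are made up for illustration and are not from the question. It also shows one possible fallback for pandas versions older than 1.1.0, where key is not available: compute the prefix separately and reorder the rows by its sorted positions.

```python
import pandas as pd

# Sample data; the third row is a hypothetical addition so that two rows
# share the same uuid prefix.
df = pd.DataFrame({
    "col1": [
        "ff8738hjgdj792__somevar1.txt",
        "9jldh93k4043ik__some3var.txt",
        "9jldh93k4043ik__another.txt",
    ]
})

# pandas >= 1.1.0: key receives the column as a Series and returns a Series.
# n=1 splits only on the first '__'; mergesort is stable, so rows with the
# same prefix keep their original relative order.
by_key = df.sort_values(
    by="col1",
    key=lambda s: s.str.split("__", n=1).str[0],
    kind="mergesort",
)

# Older pandas (no key parameter): sort the prefix on its own and use the
# resulting positions to reorder the original rows.
prefix = df["col1"].str.split("__", n=1).str[0]
by_helper = df.iloc[prefix.argsort(kind="mergesort").to_numpy()]

print(by_key)
print(by_helper)
```

Both variants leave the original column untouched, so no temporary helper column has to be dropped afterwards.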

Since you are already using pandas, I would suggest adding pandasql. It makes it easy to do what you want.

import pandas as pd
import pandasql as ps
# Recreating the data you provided
df = pd.DataFrame(['ff8738hjgdj792__somevar1.txt', '9jldh93k4043ik__some3var.txt'], columns = ['something']) 
# Selecting and sorting by the substring before the double underscore.
# SQLite's substr() is 1-indexed; instr() returns the position of '__'.
df_res = ps.sqldf("""
select something 
from df 
order by substr(something, 1, instr(something, '__') - 1) """, locals())

print(df_res)

Returns

something
0  9jldh93k4043ik__some3var.txt
1  ff8738hjgdj792__somevar1.txt
