I have a column in a dataframe where each uuid has some other file information appended to it:
ff8738hjgdj792__somevar1.txt
9jldh93k4043ik__some3var.txt
I want to sort the dataframe by the leading uuid field only (everything up to the double underscore) and ignore the rest of the attached string.
Currently I do:
df.sort_values(by='df_column_name')
But this does not produce the desired result, because pandas compares the whole string.
How can I achieve this with pandas?
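pandas 1.1.0+ has a key parameter for sort_values. Use it much like the key of Python's built-in sort, except that the function receives the whole Series and must return a Series of the same length.
Sample df: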
col1
0 ff8738hjgdj792__somevar1.txt
1 9jldh93k4043ik__some3var.txt
df['col1'].sort_values(key=lambda x: x.str.split('__').str[0])
Out[809]:
1 9jldh93k4043ik__some3var.txt
0 ff8738hjgdj792__somevar1.txt
Name: col1, dtype: object
Or
df.sort_values(by='col1', key=lambda x: x.str.split('__').str[0])
Out[812]:
col1
1 9jldh93k4043ik__some3var.txt
0 ff8738hjgdj792__somevar1.txt
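For pandas older than 1.1.0, where sort_values has no key argument, the same ordering can probably be reached by sorting a derived Series and reordering the rows by its index. This is only a minimal sketch: it assumes the column is called col1 as in the sample above and that the dataframe index has no duplicates; the names prefix and df_sorted are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["ff8738hjgdj792__somevar1.txt",
                            "9jldh93k4043ik__some3var.txt"]})

# Text before the first "__" (the uuid part); n=1 splits only once.
prefix = df["col1"].str.split("__", n=1).str[0]

# Sort the prefixes, then reorder the original rows by the sorted index.
# Assumption: the index is unique, otherwise .loc would repeat rows.
df_sorted = df.loc[prefix.sort_values().index]
print(df_sorted)
```

The original column is left untouched, so no helper column has to be added and dropped afterwards.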
Since you are already using pandas, I would suggest adding pandasql. It makes it easy to do what you want.
import pandas as pd
import pandasql as ps
# Recreating the data you provided
df = pd.DataFrame(['ff8738hjgdj792__somevar1.txt', '9jldh93k4043ik__some3var.txt'], columns = ['something'])
# Select and sort by the substring that precedes the double underscore.
# SQLite's substr is 1-indexed; instr returns the position of '__'.
df_res = ps.sqldf("""
select something
from df
order by substr(something, 1, instr(something, '__') - 1) """, locals())
print(df_res)
Returns
something
0 9jldh93k4043ik__some3var.txt
1 ff8738hjgdj792__somevar1.txt
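pandasql runs its queries on SQLite by default, so a prefix expression built from substr and instr can be checked with the standard-library sqlite3 module alone, without installing pandasql. A small sketch, not a definitive recipe: the table name t and the alias prefix are made up for illustration.

```python
import sqlite3

rows = [("ff8738hjgdj792__somevar1.txt",),
        ("9jldh93k4043ik__some3var.txt",)]

con = sqlite3.connect(":memory:")
con.execute("create table t (something text)")
con.executemany("insert into t values (?)", rows)

# instr() gives the 1-based position of '__';
# substr() starting at 1 keeps everything before it.
query = """
select something,
       substr(something, 1, instr(something, '__') - 1) as prefix
from t
order by prefix
"""
result = con.execute(query).fetchall()
con.close()

for something, prefix in result:
    print(prefix, something)
```

Because SQLite counts string positions from 1, starting substr at 0 would return one character fewer than the requested length, which is easy to miss when the sort order happens to come out right anyway.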