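I am trying to add columns to a DataFrame by splitting an existing column. With two rows everything works, and the empty column gets the value "None". The problem appears when I have only one row: the split cannot expand to two columns, and I would like the missing column to be assigned "None" in that case as well.
Working example: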
>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname_user','dbname']})
>>> df
          auth
0  dbname_user
1       dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
>>> df
          auth      db login
0  dbname_user  dbname  user
1       dbname  dbname  None   <--- as expected, 'None' value is assigned
Problem example:
>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname']})
>>> df
     auth
0  dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
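I want the same result as in the working example, where the second column has the value "None". Unfortunately, I cannot use a list comprehension to expand the number of columns dynamically. The number of columns must be fixed.
This should do it: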
import pandas as pd
import numpy as np

df_example = pd.DataFrame({'auth': ['dbname']})
df_example[["db", "login"]] = (
    # This np.where appends a "_" to each string that doesn't contain one.
    pd.Series(
        np.where(
            df_example["auth"].str.contains("_"),
            df_example["auth"],
            df_example["auth"] + "_"
        ),
        # Keep the original index so the assignment aligns row by row.
        index=df_example.index,
    )
    # Normal `.str.split` like you were doing.
    .str.split("_", n=1, expand=True)
    # `.applymap` converts empty strings ("") into None.
    # (pandas 2.1+ deprecates `.applymap` in favor of `DataFrame.map`.)
    .applymap(lambda value: None if value == "" else value)
)
print(df_example)
# Prints:
#      auth      db login
# 0  dbname  dbname  None
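Explanation
The error you get (ValueError: Columns must be same length as key) occurs because, after splitting the column "auth", every result has length 1, so the split produces a single column while you are assigning to two. expand=True does not help here, because all values in the split have length 1. Your first example works because when pandas splits the first value, dbname_user, the result has length 2, so the remaining values are expanded to the same length. In other words, expand=True pads every returned value to the length of the longest one: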
# Series with first value containing no "_",
# second value containing one "_",
# and third value containing two "_".
example_2 = pd.Series(['dbname', 'dbname_user', 'dbname_user_2'])
# Applying `.str.split`, without setting parameter `n`
split = example_2.str.split("_", expand=True)
# Get the maximum length of the split (max_len = 3)
max_len = example_2.str.split("_").str.len().max()
print('Max Length:', max_len)
print('Number of columns:', split.shape[1])
print('max_len == split.shape[1]:', max_len == split.shape[1])
print('\n', split, sep='')
# Prints:
# Max Length: 3
# Number of columns: 3
# max_len == split.shape[1]: True
#
#         0     1     2
# 0  dbname  None  None
# 1  dbname  user  None
# 2  dbname  user     2
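Since the number of columns follows the longest split, one way to pin it to exactly two is to reindex the result before assigning it. The following is only a sketch under that assumption: the helper name `split_auth` is made up for illustration, and the column added by `reindex` is filled with NaN rather than None.

```python
import pandas as pd


def split_auth(df):
    """Return a copy of df with 'db' and 'login' columns (hypothetical helper)."""
    parts = df["auth"].str.split("_", n=1, expand=True)
    # If no value contained "_", only column 0 exists here.
    # reindex adds the missing column label 1, filled with NaN.
    parts = parts.reindex(columns=[0, 1])
    out = df.copy()
    out[["db", "login"]] = parts
    return out


one_row = split_auth(pd.DataFrame({"auth": ["dbname"]}))
two_rows = split_auth(pd.DataFrame({"auth": ["dbname_user", "dbname"]}))
print(one_row)
print(two_rows)
```

If a literal None is required instead of NaN, the missing values would still need to be converted afterwards.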
You can try this code, which should work:
def underscore_split(auth):
    if "_" in auth:
        return auth.split("_", 1)
    else:
        return [auth, None]

df[['db', 'login']] = [underscore_split(x) for x in df["auth"]]
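Another option that keeps the column count fixed, offered here as a sketch and not as part of the answers above, is `Series.str.partition`. It splits at the first separator and always returns three columns (before, separator, after), using empty strings when the separator is absent. The helper name `add_db_login` is hypothetical, and the missing login comes out as NaN rather than None.

```python
import pandas as pd


def add_db_login(df):
    """Return a copy of df with 'db' and 'login' columns (hypothetical helper)."""
    # partition always yields columns 0 (before), 1 (separator), 2 (after).
    parts = df["auth"].str.partition("_")
    out = df.copy()
    out["db"] = parts[0]
    # An empty "after" part means there was nothing after "_" (or no "_" at all);
    # mask() replaces those entries with a missing value.
    out["login"] = parts[2].mask(parts[2] == "")
    return out


one_row = add_db_login(pd.DataFrame({"auth": ["dbname"]}))
two_rows = add_db_login(pd.DataFrame({"auth": ["dbname_user", "dbname"]}))
print(one_row)
print(two_rows)
```

Like the `np.where` approach, this treats a trailing underscore ("dbname_") the same as no underscore at all.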