ValueError: Columns must be same length as key when processing only one row



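I'm trying to add columns to a DataFrame based on splitting one column. With two rows everything works, and the empty column gets the value None. The problem occurs when I have only one row: the split cannot expand to two columns, yet I want None to be assigned there as well.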

Working example:

>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname_user','dbname']})
>>> df
          auth
0  dbname_user
1       dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
>>> df
          auth      db login
0  dbname_user  dbname  user
1       dbname  dbname  None   <--- as expected, 'None' value is assigned

Problem example:

>>> import pandas as pd
>>> df = pd.DataFrame({'auth':['dbname']})
>>> df
     auth
0  dbname
>>> df[['db','login']] = df['auth'].str.split('_', n=1, expand=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

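I want the same behavior as in the working example, where the second column gets the value None. Unfortunately, I can't use a list comprehension to expand the number of columns dynamically. The number of columns must be fixed.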

This should do it:

import pandas as pd
import numpy as np

df_example = pd.DataFrame({'auth': ['dbname']})
df_example[["db", "login"]] = (
    # This np.where appends a "_" to each string that doesn't contain one.
    pd.Series(
        np.where(
            df_example["auth"].str.contains("_"),
            df_example["auth"],
            df_example["auth"] + "_"
        ),
        # Keep the original index so the assignment aligns row by row.
        index=df_example.index,
    )
    # Normal `.str.split` like you were doing.
    .str.split("_", n=1, expand=True)
    # `.applymap` converts empty strings ("") into None
    # (in pandas 2.1+, `DataFrame.map` is the non-deprecated name).
    .applymap(lambda value: None if value == "" else value)
)
print(df_example)
# Prints:
#      auth      db login
# 0  dbname  dbname  None

Explanation

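The error you get (ValueError: Columns must be same length as key) occurs because, after splitting the "auth" column, every value has length 1, so the result has only one column while you are assigning to two. expand=True doesn't help you here, because all values from the split have length 1. Your first example works because when pandas splits the first value, dbname_user, it gets length 2, so the remaining values are expanded to the same length. In other words, expand=True makes all returned values have the same length as the value with the maximum length: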


# Series with first value containing no "_",
# second value containing one "_",
# and third value containing two "_".
example_2 = pd.Series(['dbname', 'dbname_user', 'dbname_user_2'])
# Applying `.str.split`, without setting parameter `n`
split = example_2.str.split("_", expand=True)
# Get the maximum length of the split (max_len = 3)
max_len = example_2.str.split("_").str.len().max()
print('Max Length:', max_len)
print('Number of columns:', split.shape[1])
print('max_len == split.shape[1]:', max_len == split.shape[1])
print('\n', split)
# Prints:
# Max Length: 3
# Number of columns: 3
# max_len == split.shape[1]: True
# 
#         0     1     2
# 0  dbname  None  None
# 1  dbname  user  None
# 2  dbname  user     2
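
Since expand=True only produces as many columns as the longest split, one way to get a fixed number of columns is to reindex the result to the width you need. The following is a minimal sketch, assuming two target columns. The names `df_one` and `parts` are only illustrative, and the missing value comes out as NaN rather than None:

```python
import pandas as pd

df_one = pd.DataFrame({"auth": ["dbname"]})

# With a single row and no "_", the split yields only one column (0).
parts = df_one["auth"].str.split("_", n=1, expand=True)

# Force exactly two columns; columns that do not exist yet are filled with NaN.
parts = parts.reindex(columns=range(2))
parts.columns = ["db", "login"]

# join aligns on the index, so each row keeps its own split values.
df_one = df_one.join(parts)
print(df_one)
```

If None is required instead of NaN, the missing entries would still need to be converted afterwards.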

You can try this code, which should work:

def underscore_split(auth):
    if "_" in auth:
        return auth.split("_", 1)
    else:
        return [auth, None]

df[['db', 'login']] = [underscore_split(x) for x in df["auth"]]
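
Another option along the same lines may be `Series.str.partition`, which always returns three columns (the text before the separator, the separator itself, and the text after it) no matter how many rows contain "_". The sketch below is only an illustration of that idea; `split_auth`, `df_one` and `df_two` are hypothetical names, and an empty part after the separator is treated as missing:

```python
import pandas as pd

df_one = pd.DataFrame({"auth": ["dbname"]})
df_two = pd.DataFrame({"auth": ["dbname_user", "dbname"]})

def split_auth(frame):
    """Return a copy of `frame` with fixed 'db' and 'login' columns (hypothetical helper)."""
    # partition returns columns 0 (before), 1 (separator) and 2 (after).
    parts = frame["auth"].str.partition("_")
    out = frame.copy()
    out["db"] = parts[0]
    # An empty "after" part means nothing followed a "_"; store None instead of "".
    out["login"] = [value if value != "" else None for value in parts[2]]
    return out

print(split_auth(df_one))
print(split_auth(df_two))
```

Because the number of columns does not depend on the data, the single-row case and the two-row case go through the same code path.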
