I have a DataFrame that looks like this…
Variable
0 Religion - Buddhism
1 Source: Clickerz
2 Religion - Islam
3 Source: SRZ FREE
4 Ethnicity - Mixed - White & Black African
I want to manipulate the Variable column to create a new column like this…
   Variable                                    New Column
0  Religion - Buddhism                         Buddhism
1  Source: Clickerz                            Clickerz
2  Religion - Islam                            Islam
3  Source: SRZ FREE                            SRZ FREE
4  Ethnicity - Mixed - White & Black African   Mixed - White and Black African
So that I can eventually end up with a DataFrame like this…
   Variable    New Column
0  Religion    Buddhism
1  Source      Clickerz
2  Religion    Islam
3  Source      SRZ FREE
4  Ethnicity   Mixed - White and Black African
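I want to iterate over the Variable column and manipulate the data to create New Column. I planned to use multiple if statements to look for specific words, such as 'Ethnicity' or 'Religion', and then apply a manipulation.
For example…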
for row in df['Variable']:
    if 'Religion' in row:
        df['New Column'] = ...
    elif 'Ethnicity' in row:
        df['New Column'] = ...
    elif 'Source' in row:
        df['New Column'] = ...
    else:
        df['New Column'] = 'Not Applicable'
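Even though type(row) returns 'str', meaning each value is a string, this code keeps returning the new column filled entirely with 'Not Applicable'. In other words, it does not seem to detect any of the strings in any row of the DataFrame, even though I can see they are there.
I'm sure there is a simple way to do this… please help!
I have also tried the following…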
for row in df['Variable']:
    if row.find('Religion') != -1:
        df['New Column'] = ...
    elif row.find('Ethnicity') != -1:
        df['New Column'] = ...
    elif row.find('Source') != -1:
        df['New Column'] = ...
    else:
        df['New Column'] = 'Not Applicable'
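I keep getting 'Not Applicable' for every entry in the new column. It still isn't finding the strings in the existing column.
Is it a data type problem, or something else?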
You can use a nested for loop. Note that the assignment goes through .loc with the row label, so only the current row is changed:
# For each (index label, value) pair in the column
for idx, row in df['column_variable'].items():
    # Set boolean to indicate if a substring was found
    substr_found = False
    # For each substring
    for sub_str in ["substring1", "substring2"]:
        # If the substring is in the row
        if sub_str in row:
            # Execute code... (assign to this row only)
            df.loc[idx, 'new_column'] = ...
            # Substring was found!
            substr_found = True
    # If substring was not found
    if not substr_found:
        # Set the fallback value for this row only
        df.loc[idx, 'new_column'] = 'Not Applicable'
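One likely explanation for the all-'Not Applicable' column in the question is that `df['New Column'] = value` assigns to the whole column on every pass, so only the branch taken for the last row survives. The sketch below illustrates this on a two-row frame; the column names `whole_column` and `per_row` are made up for the illustration and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({"Variable": ["Religion - Buddhism", "Source: Clickerz"]})

# Assigning a scalar to df[...] fills the ENTIRE column on each iteration,
# so whatever the last row produces overwrites every earlier result.
for row in df["Variable"]:
    df["whole_column"] = "Religion" if "Religion" in row else "Not Applicable"

# Assigning through .loc with the row label changes one cell at a time.
for idx, row in df["Variable"].items():
    df.loc[idx, "per_row"] = "Religion" if "Religion" in row else "Not Applicable"

whole_column = df["whole_column"].tolist()
per_row = df["per_row"].tolist()
print(df)
```

So the string matching itself is probably fine; it is the assignment that discards the per-row result.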
Updated to match your DataFrame!

import pandas as pd

# Your DataFrame
lst = []
for i in ['Religion - Buddhism','Source: Clickerz','Religion - Islam','Source: SRZ FREE','Ethnicity - Mixed - White & Black African']:
    item = [i]
    lst.append(item)
df = pd.DataFrame.from_records(lst)
df.columns = ['variable']
print(df)
variable
0 Religion - Buddhism
1 Source: Clickerz
2 Religion - Islam
3 Source: SRZ FREE
4 Ethnicity - Mixed - White & Black African
Use a for loop together with .loc and partial string matching to set the new values:
# Series.items() yields (index label, value) pairs
# (iteritems() was removed in pandas 2.0)
for x, y in df['variable'].items():
    if 'religion' in y.lower():
        z = y.split('-')
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = ''.join(z[1:]).strip()
    if 'source' in y.lower():
        z = y.split(':')
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = ''.join(z[1:]).strip()
    if 'ethnicity' in y.lower():
        z = y.split('-')
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = ''.join(z[1:]).strip()
print(df)
   variable    value
0  Religion    Buddhism
1  Source      Clickerz
2  Religion    Islam
3  Source      SRZ FREE
4  Ethnicity   Mixed White & Black African
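For comparison, the same split can be sketched without an explicit loop by using the pandas string accessor. This assumes every label has the form "name - value" or "name: value" and that the name itself contains no hyphen or colon; the regular expression is my own choice, not something from the original post.

```python
import pandas as pd

df = pd.DataFrame({"variable": [
    "Religion - Buddhism",
    "Source: Clickerz",
    "Religion - Islam",
    "Source: SRZ FREE",
    "Ethnicity - Mixed - White & Black African",
]})

# Split once (n=1) on the first "-" or ":" and swallow surrounding spaces.
# expand=True returns a DataFrame with one column per piece.
parts = df["variable"].str.split(r"\s*[-:]\s*", n=1, expand=True)
df["variable"] = parts[0]
df["value"] = parts[1]
print(df)
```

Because the split happens only once, the second hyphen in the Ethnicity row stays inside the value ("Mixed - White & Black African").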
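When working with a DataFrame, you should avoid iterating over rows whenever possible. This article explains which alternatives are more efficient.
You are basically trying to translate strings according to some fixed mapping. Naturally, a dict comes to mind: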
substring_map = {
    "at": "pseudo-cat",
    "dog": "true dog",
    "bre": "something else",
    "na": "not applicable"
}
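This mapping could be read from a file, for example a JSON file, if you have to handle a large number of substrings.
The substring-matching logic can now be decoupled from the definition of the mapping: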
def translate_substring(x):
    # Return the replacement for the first substring found in x
    for substring, new_string in substring_map.items():
        if substring in x:
            return new_string
    # No substring matched
    return "not applicable"
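As a rough sketch of the JSON idea mentioned above: the mapping can be parsed with the standard json module and passed to the matching function. The JSON text is inlined here to keep the example self-contained; in practice it might come from a file (for instance a hypothetical mapping.json), and the extra `mapping` and `default` parameters are my additions.

```python
import json

# Hypothetical JSON content; with a real file this would be
# json.load(open("mapping.json")) or similar.
raw = '{"at": "pseudo-cat", "dog": "true dog", "bre": "something else"}'
substring_map = json.loads(raw)  # key order of the JSON object is preserved

def translate_substring(x, mapping=substring_map, default="not applicable"):
    # Return the replacement for the first key that occurs in x.
    for substring, new_string in mapping.items():
        if substring in x:
            return new_string
    return default

labels = [translate_substring(name) for name in ["cat", "dogg", "misty"]]
print(labels)
```

Keeping the mapping in data rather than in code means new substrings can be added without touching the function.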
Use apply with the 'mapping' function to generate your target column:
import pandas as pd

df = pd.DataFrame({"name":
                   ["cat", "dogg", "breeze", "bred", "hat", "misty"]})
df["new_column"] = df["name"].apply(translate_substring)
# df:
# name new_column
# 0 cat pseudo-cat
# 1 dogg true dog
# 2 breeze something else
# 3 bred something else
# 4 hat pseudo-cat
# 5 misty not applicable
Applied to pd.concat([df] * 10000) (60,000 rows), this code runs in 42 ms in a Colab notebook. By comparison, a version using iterrows took 3.67 s, so apply is roughly 87 times faster.
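If apply is still too slow, one further option is to build one boolean mask per substring and let numpy.select pick the first match. This is only a sketch of that alternative, not something measured in the benchmark above, so no timing is claimed for it.

```python
import numpy as np
import pandas as pd

substring_map = {"at": "pseudo-cat", "dog": "true dog", "bre": "something else"}
df = pd.DataFrame({"name": ["cat", "dogg", "breeze", "bred", "hat", "misty"]})

# One boolean array per substring; regex=False means plain substring matching.
conditions = [
    df["name"].str.contains(sub, regex=False).to_numpy(dtype=bool)
    for sub in substring_map
]
choices = list(substring_map.values())

# np.select takes the choice of the FIRST condition that is True for each row,
# which mirrors the "first match wins" behaviour of the loop version.
df["new_column"] = np.select(conditions, choices, default="not applicable")
result = df["new_column"].tolist()
print(df)
```

The trade-off is one pass over the column per substring, so this suits a short list of substrings better than a very long one.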
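You can create an empty list, append the new values to it, and create the new column as the last step:
import pandas as pd

# Sample data matching the output shown below
df = pd.DataFrame({"column_variable": [
    "this is substring1", "this is substring2",
    "this is substring1", "this is substring3"]})

all_data = []
for row in df["column_variable"]:
    if "substring1" in row:
        all_data.append("Found 1")
    elif "substring2" in row:
        all_data.append("Found 2")
    elif "substring3" in row:
        all_data.append("Found 3")
    else:
        all_data.append("Not Applicable")
# Assign the whole list at once, one value per row
df["new column"] = all_data
print(df)
Prints: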
      column_variable  new column
0  this is substring1     Found 1
1  this is substring2     Found 2
2  this is substring1     Found 1
3  this is substring3     Found 3
Perhaps the shortest way I can think of:
import pandas as pd

# Dummy DataFrame
df = pd.DataFrame([[1, "substr1"], [3, "bla"], [5, "bla"]], columns=["abc", "col_to_check"])
substrings = ["substr1", "substr2", "substr3"]
content = df["col_to_check"].unique().tolist()  # Unique content of column
for subs in substrings:  # Go through all your substrings
    # Check if the substring occurs inside any value of the column
    # (a plain `subs in content` would only find exact matches)
    if any(subs in value for value in content):
        df[subs] = 0  # Fill your new column with whatever you want
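I created a function, string_splitter, and applied it with a lambda, which solved the problem.
The function splits the string in different ways depending on which substring the cell contains.
def string_splitter(cell):
    word_list1 = ['Age', 'Disability', 'Religion', 'Gender']
    word_list2 = ['Number shortlisted', 'Number Hired', 'Number Interviewed']
    if any(word in cell for word in word_list1):
        # "Religion - Buddhism" -> "Buddhism"
        result = cell.split("-")[1]
        result = result.strip()
    elif 'Source' in cell:
        # "Source: Clickerz" -> "Clickerz"
        result = cell.split(":")[1]
        result = result.strip()
    elif 'Ethnicity' in cell:
        # Keep the two parts after the first hyphen
        result_list = cell.split("-")[1:3]
        result = "-".join(result_list)
        result = result.strip()
    elif any(word in cell for word in word_list2):
        # "Number Hired" -> "Hired"
        result = cell.split(" ")[1]
        result = result.strip()
    elif 'Number of Applicants' in cell:
        result = cell
    else:
        # Fallback added so result is always defined (assumed value)
        result = 'Not Applicable'
    return result
Then call string_splitter inside a lambda passed to apply. The function is applied to each cell individually as apply walks through every row of the specified DataFrame column, as shown below: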
df['Answer'] = df['Visual Type'].apply(lambda x: string_splitter(x))
string_splitter let me create the New Column.
I then created a second function, column_formatter, to rework the Variable column once New Column had been created. The second function looks like this:
def column_formatter(cell):
    word_list1 = ['Age', 'Gender', 'Ethnicity', 'Religion']
    word_list2 = ['Number of Applicants', 'Number Hired', 'Number shortlisted', 'Number Interviewed']
    if any(word in cell for word in word_list1):
        # Keep the part before the first hyphen
        result = cell.split("-")[0]
        result = result.strip()
    elif 'Source' in cell:
        # Keep the part before the colon
        result = cell.split(":")[0]
        result = result.strip()
    elif 'Disability' in cell:
        # Keep the first word
        result = cell.split(" ")[0]
        result = result.strip()
    elif any(word in cell for word in word_list2):
        result = 'Number of Applicants'
    else:
        result = 'Something wrong here'
    return result
Then call the function as follows:
df['Visual Type'] = df['Visual Type'].apply(lambda x: column_formatter(x))
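The two functions above each make their own pass over the column. As an alternative sketch, a single helper can return both pieces at once. The helper name split_label, the 'Not Applicable' fallback, and the rule "split at the first colon if there is one, otherwise at the first hyphen" are my assumptions and cover only the label shapes shown in the question, not the 'Number ...' cases.

```python
import pandas as pd

def split_label(cell):
    """Return (variable, value) for one label (hypothetical helper)."""
    for sep in (":", "-"):
        # str.partition splits at the FIRST occurrence of sep only
        head, found, tail = cell.partition(sep)
        if found:
            return head.strip(), tail.strip()
    # No separator present: keep the text and flag the value
    return cell.strip(), "Not Applicable"

df = pd.DataFrame({"Variable": [
    "Religion - Buddhism",
    "Source: SRZ FREE",
    "Ethnicity - Mixed - White & Black African",
    "Number of Applicants",
]})

pairs = [split_label(cell) for cell in df["Variable"]]
parts = pd.DataFrame(pairs, columns=["Variable", "New Column"], index=df.index)
df["Variable"] = parts["Variable"]
df["New Column"] = parts["New Column"]
print(df)
```

Computing both columns from the original text in one go also avoids the ordering constraint of the two-step version, where the value has to be extracted before the Variable column is overwritten.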