Python - Using an if statement to find a substring in a string while iterating over a pandas DataFrame with a for loop



I have a DataFrame that looks like this...

Variable
0                         Religion - Buddhism
1                            Source: Clickerz
2                            Religion - Islam
3                            Source: SRZ FREE
4   Ethnicity - Mixed - White & Black African

I want to manipulate the Variable column to create a new column like this:

Variable           New Column
0                         Religion - Buddhism           Buddhism
1                            Source: Clickerz           Clickerz
2                            Religion - Islam            Islam
3                            Source: SRZ FREE            SRZ FREE
4   Ethnicity - Mixed - White & Black African         Mixed - White and Black African

So that I can eventually end up with a DataFrame like this...

Variable                      New Column
0                       Religion                        Buddhism
1                         Source                        Clickerz
2                       Religion                           Islam
3                         Source                        SRZ FREE
4                      Ethnicity         Mixed - White and Black African

I want to iterate over the Variable column and manipulate the data to create New Column. I plan to use multiple if statements to look for specific words, e.g. 'Ethnicity' or 'Religion', and then apply the manipulation.

For example...

for row in df['Variable']:
    if 'Religion' in row:
        df['New Column'] = ...
    elif 'Ethnicity' in row:
        df['New Column'] = ...
    elif 'Source' in row:
        df['New Column'] = ...
    else:
        df['New Column'] = 'Not Applicable'

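Even though type(row) returns 'str', meaning each value is a string, this code keeps filling the new column with 'Not Applicable' everywhere. In other words, it does not seem to detect any of the strings in any row of the DataFrame, even though I can see they are there.

I'm sure there is an easy way to do this... Please help!

I have also tried the following...

for row in df['Variable']:
    if row.find('Religion') != -1:
        df['New Column'] = ...
    elif row.find('Ethnicity') != -1:
        df['New Column'] = ...
    elif row.find('Source') != -1:
        df['New Column'] = ...
    else:
        df['New Column'] = 'Not Applicable'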

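I keep getting 'Not Applicable' for every entry in the new column. It still does not find the strings in the existing column.

Is it a data type problem or something else?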

You can use nested for loops:

# For each row in the DataFrame (idx is the row label)
for idx, row in df['column_variable'].items():
    # Flag indicating whether a substring was found
    substr_found = False
    # For each substring
    for sub_str in ["substring1", "substring2"]:
        # If the substring is in the row
        if sub_str in row:
            # Execute code for this row only...
            df.loc[idx, 'new_column'] = ...
            # Substring was found!
            substr_found = True
    # If no substring was found
    if not substr_found:
        # Set the fallback value for this row only
        df.loc[idx, 'new_column'] = 'Not Applicable'
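A likely explanation for the behaviour described in the question is that `df['New Column'] = value` assigns the value to every row of the column, so each loop iteration overwrites the previous one and only the last row's result survives. The sketch below illustrates this with made-up sample rows and labels (the third row and the label strings are assumptions, not the asker's data), and contrasts it with writing one cell at a time through `.loc`.

```python
import pandas as pd

# Hypothetical sample data; the last row matches none of the keywords
df = pd.DataFrame({"Variable": ["Religion - Buddhism", "Source: Clickerz", "Something else"]})

# Pattern from the question: each assignment targets the WHOLE column
for row in df["Variable"]:
    if "Religion" in row:
        df["New Column"] = "religion row"
    elif "Source" in row:
        df["New Column"] = "source row"
    else:
        df["New Column"] = "Not Applicable"

# Every row now holds the label produced by the last iteration
print(df["New Column"].tolist())

# Row-wise alternative: .loc[idx, column] writes a single cell
for idx, row in df["Variable"].items():
    if "Religion" in row:
        label = "religion row"
    elif "Source" in row:
        label = "source row"
    else:
        label = "Not Applicable"
    df.loc[idx, "Fixed"] = label

print(df["Fixed"].tolist())
```

So the substring test itself works; it is the assignment that discards the per-row result.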

Updated to match your DataFrame!

Your DataFrame:

import pandas as pd

lst = []
for i in ['Religion - Buddhism', 'Source: Clickerz', 'Religion - Islam', 'Source: SRZ FREE', 'Ethnicity - Mixed - White & Black African']:
    item = [i]
    lst.append(item)
df = pd.DataFrame.from_records(lst)
df.columns = ['variable']
print(df)
                                    variable
0                        Religion - Buddhism
1                           Source: Clickerz
2                           Religion - Islam
3                           Source: SRZ FREE
4  Ethnicity - Mixed - White & Black African

Using .loc together with a for loop and partial string matching to set the new values:

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for x, y in df['variable'].items():
    if 'religion' in y.lower():
        z = y.split('-')
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = ''.join(z[1:]).strip()
    if 'source' in y.lower():
        z = y.split(':')
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = ''.join(z[1:]).strip()
    if 'ethnicity' in y.lower():
        z = y.split('-')
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = ''.join(z[1:]).strip()
print(df)
variable                         value
0   Religion                      Buddhism
1     Source                      Clickerz
2   Religion                         Islam
3     Source                      SRZ FREE
4  Ethnicity  Mixed  White & Black African
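For this particular data, the same split can probably be done without a Python loop. The sketch below assumes every value has the shape "name, then a hyphen or colon, then the rest" and uses `Series.str.split` with a regular expression and `n=1`, so only the first separator is used and the hyphen inside the Ethnicity value is preserved. The regular expression is an illustrative choice, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"variable": [
    "Religion - Buddhism",
    "Source: Clickerz",
    "Religion - Islam",
    "Source: SRZ FREE",
    "Ethnicity - Mixed - White & Black African",
]})

# Split once, at the first "-" or ":" (swallowing surrounding whitespace)
parts = df["variable"].str.split(r"\s*[-:]\s*", n=1, expand=True)

df["variable"] = parts[0]  # text before the first separator
df["value"] = parts[1]     # everything after it
print(df)
```

Rows that contain no separator at all would end up with a missing value in `value`, so real data may need an extra `fillna` step.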

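When manipulating a DataFrame, you should avoid iterating over rows wherever possible. This post explains which alternatives are more efficient.

You are basically trying to translate strings based on some fixed mapping. Naturally, a dict comes to mind:

substring_map = {
    "at": "pseudo-cat",
    "dog": "true dog",
    "bre": "something else",
    "na": "not applicable"
}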

This mapping could be read from a file, for example a JSON file, in scenarios where you have to deal with a large number of substrings.
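As a minimal sketch of that idea, the snippet below writes the mapping to a temporary JSON file and reads it back with the standard `json` module. The file name `substring_map.json` and the temporary directory are placeholders for illustration only; in practice the file would already exist somewhere on disk.

```python
import json
import os
import tempfile

substring_map = {
    "at": "pseudo-cat",
    "dog": "true dog",
    "bre": "something else",
    "na": "not applicable",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, standing in for a real mapping file
    path = os.path.join(tmp_dir, "substring_map.json")

    # Write the mapping once...
    with open(path, "w", encoding="utf-8") as f:
        json.dump(substring_map, f, indent=2)

    # ...and load it back the way an application would at start-up
    with open(path, encoding="utf-8") as f:
        loaded_map = json.load(f)

print(loaded_map)
```

`json.load` keeps the key order of the file, which matters if the lookup returns the first matching substring.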

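The substring matching logic can now be decoupled from the definition of the mapping:

def translate_substring(x):
    # Return the replacement for the first substring found in x
    for substring, new_string in substring_map.items():
        if substring in x:
            return new_string
    return "not applicable"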

Use apply with the 'mapping' function to generate your target column:

df = pd.DataFrame({"name":
                   ["cat", "dogg", "breeze", "bred", "hat", "misty"]})
df["new_column"] = df["name"].apply(translate_substring)
# df:
#      name      new_column
# 0     cat      pseudo-cat
# 1    dogg        true dog
# 2  breeze  something else
# 3    bred  something else
# 4     hat      pseudo-cat
# 5   misty  not applicable

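Applied to pd.concat([df] * 10000) (60,000 rows), this code runs in 42 ms in a Colab notebook. By comparison, a version using iterrows takes 3.67 s, so the apply version is roughly 87 times faster.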
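A comparison of this kind can be reproduced roughly as follows. This is only a sketch: the iterrows variant is an assumed stand-in (the answer does not show the one that was timed), the frame is smaller than the 60,000 rows quoted so it runs quickly, and the absolute numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

substring_map = {
    "at": "pseudo-cat",
    "dog": "true dog",
    "bre": "something else",
    "na": "not applicable",
}

def translate_substring(x):
    # Return the replacement for the first substring found in x
    for substring, new_string in substring_map.items():
        if substring in x:
            return new_string
    return "not applicable"

base = pd.DataFrame({"name": ["cat", "dogg", "breeze", "bred", "hat", "misty"]})
big = pd.concat([base] * 1000, ignore_index=True)  # 6,000 rows

def with_apply(frame):
    return frame["name"].apply(translate_substring).tolist()

def with_iterrows(frame):
    # Assumed row-by-row variant, for comparison only
    return [translate_substring(row["name"]) for _, row in frame.iterrows()]

apply_seconds = timeit.timeit(lambda: with_apply(big), number=3)
iterrows_seconds = timeit.timeit(lambda: with_iterrows(big), number=3)
print(f"apply:    {apply_seconds:.3f} s for 3 runs")
print(f"iterrows: {iterrows_seconds:.3f} s for 3 runs")
```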

You can create an empty list, append the new values to it, and create the new column as the final step:

all_data = []
for row in df["column_variable"]:
    if "substring1" in row:
        all_data.append("Found 1")
    elif "substring2" in row:
        all_data.append("Found 2")
    elif "substring3" in row:
        all_data.append("Found 3")
    else:
        all_data.append("Not Applicable")
df["new column"] = all_data
print(df)

Prints:

column_variable new column
0  this is substring1    Found 1
1  this is substring2    Found 2
2  this is substring1    Found 1
3  this is substring3    Found 3
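The same if/elif chain can also be written without an explicit loop. The sketch below is one possible variant, not part of the answer above: it builds one boolean mask per substring with `Series.str.contains` and lets `numpy.select` pick the first matching label. The fifth row is an extra, made-up value added to show the fallback.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_variable": [
    "this is substring1",
    "this is substring2",
    "this is substring1",
    "this is substring3",
    "no match here",  # extra row (assumption) to exercise the default
]})

col = df["column_variable"]

# One boolean mask per substring; regex=False means plain substring matching
conditions = [
    col.str.contains("substring1", regex=False),
    col.str.contains("substring2", regex=False),
    col.str.contains("substring3", regex=False),
]
choices = ["Found 1", "Found 2", "Found 3"]

# The first condition that is True wins, like an if/elif chain
df["new column"] = np.select(conditions, choices, default="Not Applicable")
print(df)
```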

Perhaps the shortest way I can think of:

import pandas as pd

# Dummy DataFrame
df = pd.DataFrame([[1, "substr1"], [3, "bla"], [5, "bla"]], columns=["abc", "col_to_check"])
substrings = ["substr1", "substr2", "substr3"]
content = df["col_to_check"].unique().tolist()  # Unique content of column
for subs in substrings:  # Go through all your substrings
    if subs in content:  # True only if a cell equals subs exactly
        df[subs] = 0  # Fill your new column with whatever you want
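Note that testing membership in a list of unique values only matches cells that are exactly equal to the search string. If a true substring test is wanted, a variant along these lines might work; the middle row is changed to a made-up value so that the difference is visible.

```python
import pandas as pd

# Dummy DataFrame; the second row embeds "substr2" inside a longer string
df = pd.DataFrame(
    [[1, "substr1"], [3, "bla substr2 bla"], [5, "bla"]],
    columns=["abc", "col_to_check"],
)
substrings = ["substr1", "substr2", "substr3"]

for subs in substrings:
    # True if at least one cell contains subs anywhere in its text
    if df["col_to_check"].str.contains(subs, regex=False).any():
        df[subs] = 0  # placeholder fill value, as in the snippet above

print(df.columns.tolist())
```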

I created a function, string_splitter, and applied it with a lambda function, which solved the problem.

The function below splits the string in different ways depending on which substring the cell contains.

def string_splitter(cell):
    word_list1 = ['Age', 'Disability', 'Religion', 'Gender']
    word_list2 = ['Number shortlisted', 'Number Hired', 'Number Interviewed']
    if any(word in cell for word in word_list1):
        # e.g. 'Religion - Buddhism' -> 'Buddhism'
        result = cell.split("-")[1]
        result = result.strip()
    elif 'Source' in cell:
        # e.g. 'Source: Clickerz' -> 'Clickerz'
        result = cell.split(":")[1]
        result = result.strip()
    elif 'Ethnicity' in cell:
        # Keep the two parts after the first hyphen
        result_list = cell.split("-")[1:3]
        result = "-".join(result_list)
        result = result.strip()
    elif any(word in cell for word in word_list2):
        result = cell.split(" ")[1]
        result = result.strip()
    elif 'Number of Applicants' in cell:
        result = cell
    else:
        # Fallback so that result is always defined
        result = 'Not Applicable'
    return result

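string_splitter is then called through a lambda passed to apply. The function is applied to each cell individually as the code goes through every row of the specified DataFrame column, as shown below: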

df['Answer'] = df['Visual Type'].apply(lambda x: string_splitter(x))

string_splitter allowed me to create New Column.

I then created another function, column_formatter, to manipulate the Variable column after New Column had been created. The second function is shown below:

def column_formatter(cell):
    word_list1 = ['Age', 'Gender', 'Ethnicity', 'Religion']
    word_list2 = ['Number of Applicants', 'Number Hired', 'Number shortlisted', 'Number Interviewed']
    if any(word in cell for word in word_list1):
        # Keep the part before the first hyphen
        result = cell.split("-")[0]
        result = result.strip()
    elif 'Source' in cell:
        # Keep the part before the colon
        result = cell.split(":")[0]
        result = result.strip()
    elif 'Disability' in cell:
        result = cell.split(" ")[0]
        result = result.strip()
    elif any(word in cell for word in word_list2):
        result = 'Number of Applicants'
    else:
        result = 'Something wrong here'
    return result

Then call the function as follows:

df['Visual Type'] = df['Visual Type'].apply(lambda x: column_formatter(x))
