How to convert range strings (bins) to numeric values for use with Seaborn visualizations

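I'm using Python 3.7 in Jupyter Notebooks. I'm currently working with some survey data imported from a .CSV file into a pandas DataFrame. I'd like to explore it further with some Seaborn visualizations; however, the numerical data was collected as bins and is stored as string values.

Is there a way to convert these columns (Age and Approximate Household Income) to numeric values that can then be used with Seaborn? I've tried searching, but my phrasing only seems to return ways of creating age bins for columns that already hold numeric values. What I'm really looking for is how to convert the string values into numeric age bin values.

Also, does anyone have suggestions on how I could improve my search approach? What would be the ideal wording for finding a solution to something like this?

Here is a sample from the DataFrame, produced with df.head(5).to_dict(), with values changed for anonymity.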

'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'},
'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'Approximate Household Income': {0: '$175,000 - $199,999',
1: '$75,000 - $99,999',
2: '$25,000 - $49,999',
3: '$50,000 - $74,999',
4: nan},
'Highest Level of Education Completed': {0: 'Four Year College Degree',
1: 'Four Year College Degree',
2: 'Jr College/Associates Degree',
3: 'Jr College/Associates Degree',
4: 'Four Year College Degree'},
'2020 Candidate Choice': {0: 'Joe Biden',
1: 'Joe Biden',
2: 'Donald Trump',
3: 'Joe Biden',
4: 'Donald Trump'},
'2016 Candidate Choice': {0: 'Hillary Clinton',
1: 'Third Party',
2: 'Donald Trump',
3: 'Hillary Clinton',
4: 'Third Party'},
'Party Registration 2020': {0: 'Independent',
1: 'No Party',
2: 'No Party',
3: 'Independent',
4: 'Independent'},
'Registered State for Voting': {0: 'Colorado',
1: 'Virginia',
2: 'California',
3: 'North Carolina',
4: 'Oregon'}

You can use some of the pandas Series.str methods.

A smaller sample dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
        "Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
        "Approximate Household Income": {
            0: "$175,000 - $199,999",
            1: "$75,000 - $99,999",
            2: "$25,000 - $49,999",
            3: "$50,000 - $74,999",
            4: np.nan,
        },
    }
)
#      Age Ethnicity Approximate Household Income
# 0  45-54     White          $175,000 - $199,999
# 1  35-44     White            $75,000 - $99,999
# 2  45-54     White            $25,000 - $49,999
# 3  45-54     White            $50,000 - $74,999
# 4  55-64     White                          NaN

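We can loop over a list of columns and chain these methods to parse all the ranges in the pandas.DataFrame.

The methods we will use, in order:

Series.str.replace - replaces commas with nothing
Series.str.extract - extracts the numbers from the Series, using a regex with two capture groups (the lower and upper bound)
Series.astype - converts the extracted numbers to floats
DataFrame.rename - renames the new columns
DataFrame.join - adds the extracted numbers back onto the original DataFrame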

for col in ["Age", "Approximate Household Income"]:
    df = df.join(
        df[col]
        .str.replace(",", "", regex=False)  # drop thousands separators
        .str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")  # capture lower and upper bound
        .astype("float")
        .rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
    )
#      Age Ethnicity Approximate Household Income  Age_lower  Age_upper  
# 0  45-54     White          $175,000 - $199,999       45.0       54.0   
# 1  35-44     White            $75,000 - $99,999       35.0       44.0   
# 2  45-54     White            $25,000 - $49,999       45.0       54.0   
# 3  45-54     White            $50,000 - $74,999       45.0       54.0   
# 4  55-64     White                          NaN       55.0       64.0   
# 
#    Approximate Household Income_lower  Approximate Household Income_upper  
# 0                            175000.0                            199999.0  
# 1                             75000.0                             99999.0  
# 2                             25000.0                             49999.0  
# 3                             50000.0                             74999.0  
# 4                                 NaN                                 NaN
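
Once the bounds are numeric, one option for plotting is to collapse each range into a single representative value. The sketch below is an assumption about what might be useful downstream, not part of the answer above: it builds a small hypothetical frame named parsed (with shortened Income_lower / Income_upper column names) and takes the midpoint of each range.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that already holds parsed lower/upper bounds
parsed = pd.DataFrame(
    {
        "Age_lower": [45.0, 35.0, 55.0],
        "Age_upper": [54.0, 44.0, 64.0],
        "Income_lower": [175000.0, 75000.0, np.nan],
        "Income_upper": [199999.0, 99999.0, np.nan],
    }
)

for prefix in ["Age", "Income"]:
    # Row-wise mean of the two bounds; rows where both bounds are NaN stay NaN
    parsed[f"{prefix}_mid"] = parsed[
        [f"{prefix}_lower", f"{prefix}_upper"]
    ].mean(axis=1)
```

The resulting Age_mid and Income_mid columns are plain floats, so they can be passed to Seaborn like any other numeric column, for example sns.scatterplot(data=parsed, x="Age_mid", y="Income_mid"). Whether a midpoint is a fair summary of a bin depends on the survey, so treat it as an approximation.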

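In this case, I would suggest setting up a "manual" conversion for each type of category, based on the format of its strings. For example, for the age bins: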

age = {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'}
age_bins = {key: [int(age[key].split('-')[0]), int(age[key].split('-')[1])] for key in age}

{0: [45, 54], 1: [35, 44], 2: [45, 54], 3: [45, 54], 4: [55, 64]}
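
The same split can be applied directly to a pandas column rather than a dict. This is a minimal sketch that assumes the bins always have the form "lower-upper"; the Series named age and the column names Age_lower / Age_upper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column, including one missing answer
age = pd.Series(["45-54", "35-44", np.nan, "55-64"], name="Age")

# Split on the hyphen into two columns, then convert the pieces to floats;
# the missing answer stays NaN in both columns
bounds = age.str.split("-", expand=True).astype("float")
bounds.columns = ["Age_lower", "Age_upper"]
```

Using floats rather than ints keeps missing answers as NaN, which the int() calls in the dict version would reject.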
