Task: I am trying to create a pandas DataFrame from a list of dictionaries. Problem: this creates a DataFrame for each dictionary item

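I am trying to create a DataFrame from three lists that I generate from web-scraped data. However, when I convert these lists into a dictionary and then use it to build a pandas DataFrame, it outputs one DataFrame per dictionary item (row), rather than a single DataFrame that contains all of these items as rows.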

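I believe the problem lies in the for loop that I use to scrape the data. I know similar questions have been asked before, including here (a pandas DataFrame being created for each row) and here (putting multiple lists into a DataFrame), but I tried those solutions without any joy. I believe the web-scraping loop adds a subtlety that makes this trickier.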

Below is a step-by-step walkthrough of my code and its output. For reference, I have imported pandas as pd and bs4.

# Step 1 create a webscraper which takes three sets of data (price, bedrooms and bathrooms) from a website and populate into three separate lists
for container in containers:
    try:
        price_container=container.find("a",{"class":"listing-price text-price"})
        price_strip=price_container.text.strip()
        price_list=[]
        price_list.append(price_strip)
    except TypeError:
        continue
    try:
        bedroom_container = container.find("span",{"class":"icon num-beds"})
        bedroom_strip=(bedroom_container["title"])
        bedroom_list=[]
        bedroom_list.append(bedroom_strip)

    except TypeError:
        continue
    try:
        bathroom_container=container.find("span", {"class":"icon num-baths"})
        bathroom_strip=(bathroom_container["title"])
        bathroom_list=[]
        bathroom_list.append(bathroom_strip)

    except TypeError:
        continue
    # Step 2 create a dictionary
    data = {'price':price_list, 'bedrooms':bedroom_list, 'bathrooms':bathrooms_list}

    # Step 3 turn it into a pandas dataframe and print the output
    d=pd.DataFrame(data)
    print(d)

This gives me one DataFrame per dictionary, as shown below.

price               bedrooms          bathrooms                                   
0  £200,000            3                 2
[1 rows x 3 columns]

price               bedrooms          bathrooms                                   
0  £400,000            5                 3
[1 rows x 3 columns]

price               bedrooms          bathrooms
0  £900,000            6                 4
[1 rows x 3 columns]
and so on.....

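I have also tried dictionary comprehensions and list comprehensions, which still give one DataFrame per dictionary item rather than a single DataFrame: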

data = [({'price':price, 'bedrooms':bedrooms, 'bathrooms':bathrooms}) for item in container]
df = pd.DataFrame(data)
print(df)

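No matter how I write the list comprehension, this produces an even stranger output. It gives one DataFrame for each item in the dictionary, with the same row of information repeated several times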

price               bedrooms          bathrooms                                  
0  £200,000            3                 2
0  £200,000            3                 2
0  £200,000            3                 2
[3 rows x 3 columns]

price               bedrooms          bathrooms                                   
0  £400,000            5                 3
0  £400,000            5                 3
0  £400,000            5                 3
[3 rows x 3 columns]

price               bedrooms          bathrooms                                   
0  £900,000            6                 4
0  £900,000            6                 4
0  £900,000            6                 4
[3 rows x 3 columns]
and so on...

How can I fix this and get all of my data into a single pandas DataFrame?

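First, you should do price_list=[], bedroom_list=[] and bathroom_list=[] before the for loop; otherwise they will be at most 1 element long, because on every iteration they are reset to [] and then a single element is appended. Second, if you want a single DataFrame, you should create it outside the for loop, i.e. dedent data = {'price':price_list, 'bedrooms':bedroom_list, 'bathrooms':bathrooms_list} and the lines that follow it. Finally, in case of missing data you should record that fact: if any continue other than the first one is executed, your price_list, bedroom_list and bathroom_list will end up with different lengths. I suggest replacing the first continue with price_list.append(None), the second with bedroom_list.append(None) and the third with bathroom_list.append(None), so that the DataFrame clearly shows where data is missing.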
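The three changes described above can be sketched as follows. This is a minimal illustration, not the asker's actual scraper: the `containers` list of dicts is a hypothetical stand-in for the scraped bs4 containers, and `dict.get` stands in for a lookup that may find nothing. In the real loop, each `append` would receive the value extracted with `container.find(...)`, or `None` when the tag is absent.

```python
import pandas as pd

# Hypothetical stand-in for the scraped containers: each dict plays the role
# of one listing, and a missing key plays the role of a tag that was not found.
containers = [
    {"price": "£200,000", "bedrooms": "3", "bathrooms": "2"},
    {"price": "£400,000", "bedrooms": "5"},  # no bathroom information
    {"price": "£900,000", "bedrooms": "6", "bathrooms": "4"},
]

# Create the lists once, before the loop, so that they accumulate values.
price_list = []
bedroom_list = []
bathroom_list = []

for container in containers:
    # dict.get returns None for an absent key, so every list receives exactly
    # one entry per container and all three lists keep the same length.
    price_list.append(container.get("price"))
    bedroom_list.append(container.get("bedrooms"))
    bathroom_list.append(container.get("bathrooms"))

# Build the dictionary and the DataFrame once, after the loop has finished.
data = {"price": price_list, "bedrooms": bedroom_list, "bathrooms": bathroom_list}
df = pd.DataFrame(data)
print(df)
```

One detail to check in the real scraper, stated here as an observation rather than a confirmed cause: `find` returns `None` when nothing matches, and calling `.text` on `None` raises `AttributeError`, which `except TypeError` would not catch. Testing the result with `is not None` before reading it avoids relying on the exception type.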

The part of the code you are testing here is fine: a dictionary of lists will always give you a single DataFrame. So this part:

pd.DataFrame(data)

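cannot be the cause of the problem. Rather, the issue is that it is buried inside a loop, so it runs on every iteration. The same goes for your lists, which get redefined again and again.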

Take these parts out of the loop and you should be fine.
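Since the question title mentions a list of dictionaries, a related pattern may also be worth sketching: collect one dict per listing inside the loop and call `pd.DataFrame` once afterwards. The `scraped` tuples below are hypothetical stand-ins for the values extracted from each container (they reuse the figures from the question's sample output), and `rows` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the values extracted from each scraped container.
scraped = [
    ("£200,000", "3", "2"),
    ("£400,000", "5", "3"),
    ("£900,000", "6", "4"),
]

# The list is created once, outside the loop.
rows = []
for price, bedrooms, bathrooms in scraped:
    # Only the append happens inside the loop: one dict per listing.
    rows.append({"price": price, "bedrooms": bedrooms, "bathrooms": bathrooms})

# The DataFrame is built a single time, from the complete list of dicts.
df = pd.DataFrame(rows)
print(df)
```

With this layout each dict becomes one row, so a listing with a missing value cannot shift the other columns out of alignment.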

You have to merge the three lists:

df = pd.DataFrame(data["price"] + data["bedrooms"] + data["bathrooms"] )

If you want something more generic:

list_ = [item for i in data for item in data[i]]
df = pd.DataFrame(list_)
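To make the effect of this concatenation concrete, the sketch below compares its shape with that of a DataFrame built directly from the dictionary. The `data` values are taken from the question's sample output; the names `flat`, `stacked` and `table` are introduced here for illustration only.

```python
import pandas as pd

# Sample values from the question, already gathered into equal-length lists.
data = {
    "price": ["£200,000", "£400,000", "£900,000"],
    "bedrooms": ["3", "5", "6"],
    "bathrooms": ["2", "3", "4"],
}

# Concatenating the lists gives one flat list, hence a single column.
flat = [item for key in data for item in data[key]]
stacked = pd.DataFrame(flat)
print(stacked.shape)  # (9, 1)

# Passing the dictionary itself keeps one column per key.
table = pd.DataFrame(data)
print(table.shape)  # (3, 3)
```

Which of the two shapes is wanted depends on the goal; for one row per listing with price, bedrooms and bathrooms as columns, the dictionary form is the one that matches the output described in the question.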
