熊猫内部合并产生新的重复行



我有一个最佳女主角数据框,如下所示。

Year    Ceremony    Award       Winner  Name            Film        Ceremony_Date
1927/1928   1   Best Actress    NaN Louise Dresser  A Ship Comes In 1929-05-16
1927/1928   1   Best Actress    1.0 Janet Gaynor    7th Heaven      1929-05-16
1927/1928   1   Best Actress    NaN Gloria Swanson  Sadie Thompson  1929-05-16
1928/1929   2   Best Actress    NaN Ruth Chatterton Madame X        1930-04-03
1928/1929   2   Best Actress    NaN Betty Compson   The Barker      1930-04-03

我在姓名列上与以下最佳女演员的出生日期数据帧进行了内部联接(合并(,因为我希望上面的数据帧也包含出生日期信息。

Name           DOB
0   Janet Gaynor    1906-10-06
1   Louise Dresser  1878-10-17
2   Gloria Swanson  1899-03-27
3   Mary Pickford   1892-04-08
4   Ruth Chatterton 1892-12-24
5   Betty Compson   1897-03-19 

编辑 -->ba_dob = pd.merge(ba, df_birthdays, how='inner', on='Name')

结果是一个包含重复行的数据帧。例如(见下文(,梅丽尔·斯特里普(Meryl Streep(曾获得过一次电影提名,而该记录(加入后(神秘地重复了无数次。我认为内部联接只是将出生日期与名称相关联,只要两个数据帧之间的名称列上存在匹配项,而不是重新复制整个记录。我尝试了将最佳女演员表作为左表的左联接,并得到了类似的重复记录。对正在发生的事情的任何见解将不胜感激。

Year    Ceremony    Award   Winner  Name    Film       Ceremony_Date    DOB
1102    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
1103    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
1104    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
1105    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
1106    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
1107    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22
1108    1981    54  Best Actress    NaN Meryl Streep    The French Lieutenant's Woman   1982-03-29  1949-06-22

编辑以下是上述字典形式的数据帧头(根据请求(:

最佳女主角

{'Award': {2: 'Best Actress',
3: 'Best Actress',
4: 'Best Actress',
40: 'Best Actress',
41: 'Best Actress'},
'Ceremony': {2: 1, 3: 1, 4: 1, 40: 2, 41: 2},
'Ceremony_Date': {2: Timestamp('1929-05-16 00:00:00'),
3: Timestamp('1929-05-16 00:00:00'),
4: Timestamp('1929-05-16 00:00:00'),
40: Timestamp('1930-04-03 00:00:00'),
41: Timestamp('1930-04-03 00:00:00')},
'Film': {2: 'A Ship Comes In',
3: '7th Heaven',
4: 'Sadie Thompson',
40: 'Madame X',
41: 'The Barker'},
'Name': {2: 'Louise Dresser',
3: 'Janet Gaynor',
4: 'Gloria Swanson',
40: 'Ruth Chatterton',
41: 'Betty Compson'},
'Winner': {2: nan, 3: 1.0, 4: nan, 40: nan, 41: nan},
'Year': {2: '1927/1928',
3: '1927/1928',
4: '1927/1928',
40: '1928/1929',
41: '1928/1929'}}

出生日期

{'DOB': {0: Timestamp('1906-10-06 00:00:00'),
1: Timestamp('1878-10-17 00:00:00'),
2: Timestamp('1899-03-27 00:00:00'),
3: Timestamp('1892-04-08 00:00:00'),
4: Timestamp('1892-12-24 00:00:00')},
'Name': {0: 'Janet Gaynor',
1: 'Louise Dresser',
2: 'Gloria Swanson',
3: 'Mary Pickford',
4: 'Ruth Chatterton'}}

合并(内部联接(数据帧

{'Award': {0: 'Best Actress',
1: 'Best Actress',
2: 'Best Actress',
3: 'Best Actress',
4: 'Best Actress'},
'Ceremony': {0: 1, 1: 1, 2: 1, 3: 10, 4: 10},
'Ceremony_Date': {0: Timestamp('1929-05-16 00:00:00'),
1: Timestamp('1929-05-16 00:00:00'),
2: Timestamp('1929-05-16 00:00:00'),
3: Timestamp('1938-03-10 00:00:00'),
4: Timestamp('1938-03-10 00:00:00')},
'DOB': {0: Timestamp('1878-10-17 00:00:00'),
1: Timestamp('1906-10-06 00:00:00'),
2: Timestamp('1906-10-06 00:00:00'),
3: Timestamp('1906-10-06 00:00:00'),
4: Timestamp('1906-10-06 00:00:00')},
'Film': {0: 'A Ship Comes In',
1: '7th Heaven',
2: '7th Heaven',
3: 'A Star Is Born',
4: 'A Star Is Born'},
'Name': {0: 'Louise Dresser',
1: 'Janet Gaynor',
2: 'Janet Gaynor',
3: 'Janet Gaynor',
4: 'Janet Gaynor'},
'Winner': {0: nan, 1: 1.0, 2: 1.0, 3: nan, 4: nan},
'Year': {0: '1927/1928',
1: '1927/1928',
2: '1927/1928',
3: '1937',
4: '1937'}}

编辑

梅丽尔·斯特里普(Meryl Streep(全部来自最佳女主角数据帧

{'Award': {5957: 'Best Actress',
6061: 'Best Actress',
6172: 'Best Actress',
6389: 'Best Actress',
6606: 'Best Actress',
6708: 'Best Actress',
6922: 'Best Actress',
7483: 'Best Actress',
7835: 'Best Actress',
7950: 'Best Actress',
8748: 'Best Actress',
8983: 'Best Actress',
9098: 'Best Actress',
9347: 'Best Actress',
9599: 'Best Actress'},
'Ceremony': {5957: 54,
6061: 55,
6172: 56,
6389: 58,
6606: 60,
6708: 61,
6922: 63,
7483: 68,
7835: 71,
7950: 72,
8748: 79,
8983: 81,
9098: 82,
9347: 84,
9599: 86},
'Ceremony_Date': {5957: Timestamp('1982-03-29 00:00:00'),
6061: Timestamp('1983-04-11 00:00:00'),
6172: Timestamp('1984-04-09 00:00:00'),
6389: Timestamp('1986-03-24 00:00:00'),
6606: Timestamp('1988-04-11 00:00:00'),
6708: Timestamp('1989-03-29 00:00:00'),
6922: Timestamp('1991-03-25 00:00:00'),
7483: Timestamp('1996-03-25 00:00:00'),
7835: Timestamp('1999-03-21 00:00:00'),
7950: Timestamp('2000-03-26 00:00:00'),
8748: Timestamp('2007-02-25 00:00:00'),
8983: Timestamp('2009-02-22 00:00:00'),
9098: Timestamp('2010-03-07 00:00:00'),
9347: Timestamp('2012-02-26 00:00:00'),
9599: Timestamp('2014-03-02 00:00:00')},
'Film': {5957: "The French Lieutenant's Woman",
6061: "Sophie's Choice",
6172: 'Silkwood',
6389: 'Out of Africa',
6606: 'Ironweed',
6708: 'A Cry in the Dark',
6922: 'Postcards from the Edge',
7483: 'The Bridges of Madison County',
7835: 'One True Thing',
7950: 'Music of the Heart',
8748: 'The Devil Wears Prada',
8983: 'Doubt',
9098: 'Julie & Julia',
9347: 'The Iron Lady',
9599: 'August: Osage County'},
'Name': {5957: 'Meryl Streep',
6061: 'Meryl Streep',
6172: 'Meryl Streep',
6389: 'Meryl Streep',
6606: 'Meryl Streep',
6708: 'Meryl Streep',
6922: 'Meryl Streep',
7483: 'Meryl Streep',
7835: 'Meryl Streep',
7950: 'Meryl Streep',
8748: 'Meryl Streep',
8983: 'Meryl Streep',
9098: 'Meryl Streep',
9347: 'Meryl Streep',
9599: 'Meryl Streep'},
'Winner': {5957: nan,
6061: 1.0,
6172: nan,
6389: nan,
6606: nan,
6708: nan,
6922: nan,
7483: nan,
7835: nan,
7950: nan,
8748: nan,
8983: nan,
9098: nan,
9347: 1.0,
9599: nan},
'Year': {5957: '1981',
6061: '1982',
6172: '1983',
6389: '1985',
6606: '1987',
6708: '1988',
6922: '1990',
7483: '1995',
7835: '1998',
7950: '1999',
8748: '2006',
8983: '2008',
9098: '2009',
9347: '2011',
9599: '2013'}}

为了回答我明显问到的一个非常小的问题,请允许我查明问题所在并确定解决方法。

如上所述,最佳女演员数据帧ba很好。事实上,所有数据帧都井井有条。最初的问题是关于内部联接是如何执行的,以及可能出现什么问题(即,提示合并过程创建重复记录(。

梅丽尔·斯特里普(Meryl Streep(将是我们在这里的指南,如上所述。在数据集中,她获得了16项最佳女主角提名(对于任何保持分数的人来说,数据不包括她最近的提名(。当在ba和出生日期、DOB、数据帧之间执行内部连接时,她被提名的每部电影都重复了 16 次,这不是我想要的结果(见上面的错误结果(。碰巧的是,她的名字和出生日期在DOB数据帧中出现了 16 次。这与我编写的抓取代码一致,根本不是意外的结果或错误。

当我在两帧之间进行内部连接时,我(错误地(认为熊猫会看到她的提名,比如说,"朱莉和朱莉娅",匹配她的生日一次,然后就完成了。显然,内部连接意味着如果两个表中的连接列上都有匹配项,则每一行都将尽可能多地匹配。因此,对于每部电影,合并表都有16条记录(她的16项最佳女主角提名中一条,这等于她的生日出现在成为数据框的提名生日网络抓取中的次数(。我不确定这是否正确,但它描述了我在我面前看到的东西。我欢迎对此作出澄清。

解决方法只是从DOB数据帧中删除重复的名称并重新合并。下面是代码和输出,使用 Meryl 作为我们的示例。

ba_dob_revised = df_birthdays.drop_duplicates('Name')
ba_dob = pd.merge(ba, ba_dob_revised, on='Name')
ba_dob[ba_dob.Name=="Meryl Streep"] 
{'Award': {282: 'Best Actress',
283: 'Best Actress',
284: 'Best Actress',
285: 'Best Actress',
286: 'Best Actress',
287: 'Best Actress',
288: 'Best Actress',
289: 'Best Actress',
290: 'Best Actress',
291: 'Best Actress',
292: 'Best Actress',
293: 'Best Actress',
294: 'Best Actress',
295: 'Best Actress',
296: 'Best Actress'},
'Ceremony': {282: 54,
283: 55,
284: 56,
285: 58,
286: 60,
287: 61,
288: 63,
289: 68,
290: 71,
291: 72,
292: 79,
293: 81,
294: 82,
295: 84,
296: 86},
'Ceremony_Date': {282: Timestamp('1982-03-29 00:00:00'),
283: Timestamp('1983-04-11 00:00:00'),
284: Timestamp('1984-04-09 00:00:00'),
285: Timestamp('1986-03-24 00:00:00'),
286: Timestamp('1988-04-11 00:00:00'),
287: Timestamp('1989-03-29 00:00:00'),
288: Timestamp('1991-03-25 00:00:00'),
289: Timestamp('1996-03-25 00:00:00'),
290: Timestamp('1999-03-21 00:00:00'),
291: Timestamp('2000-03-26 00:00:00'),
292: Timestamp('2007-02-25 00:00:00'),
293: Timestamp('2009-02-22 00:00:00'),
294: Timestamp('2010-03-07 00:00:00'),
295: Timestamp('2012-02-26 00:00:00'),
296: Timestamp('2014-03-02 00:00:00')},
'DOB': {282: Timestamp('1949-06-22 00:00:00'),
283: Timestamp('1949-06-22 00:00:00'),
284: Timestamp('1949-06-22 00:00:00'),
285: Timestamp('1949-06-22 00:00:00'),
286: Timestamp('1949-06-22 00:00:00'),
287: Timestamp('1949-06-22 00:00:00'),
288: Timestamp('1949-06-22 00:00:00'),
289: Timestamp('1949-06-22 00:00:00'),
290: Timestamp('1949-06-22 00:00:00'),
291: Timestamp('1949-06-22 00:00:00'),
292: Timestamp('1949-06-22 00:00:00'),
293: Timestamp('1949-06-22 00:00:00'),
294: Timestamp('1949-06-22 00:00:00'),
295: Timestamp('1949-06-22 00:00:00'),
296: Timestamp('1949-06-22 00:00:00')},
'Film': {282: "The French Lieutenant's Woman",
283: "Sophie's Choice",
284: 'Silkwood',
285: 'Out of Africa',
286: 'Ironweed',
287: 'A Cry in the Dark',
288: 'Postcards from the Edge',
289: 'The Bridges of Madison County',
290: 'One True Thing',
291: 'Music of the Heart',
292: 'The Devil Wears Prada',
293: 'Doubt',
294: 'Julie & Julia',
295: 'The Iron Lady',
296: 'August: Osage County'},
'Name': {282: 'Meryl Streep',
283: 'Meryl Streep',
284: 'Meryl Streep',
285: 'Meryl Streep',
286: 'Meryl Streep',
287: 'Meryl Streep',
288: 'Meryl Streep',
289: 'Meryl Streep',
290: 'Meryl Streep',
291: 'Meryl Streep',
292: 'Meryl Streep',
293: 'Meryl Streep',
294: 'Meryl Streep',
295: 'Meryl Streep',
296: 'Meryl Streep'},
'Winner': {282: nan,
283: 1.0,
284: nan,
285: nan,
286: nan,
287: nan,
288: nan,
289: nan,
290: nan,
291: nan,
292: nan,
293: nan,
294: nan,
295: 1.0,
296: nan},
'Year': {282: '1981',
283: '1982',
284: '1983',
285: '1985',
286: '1987',
287: '1988',
288: '1990',
289: '1995',
290: '1998',
291: '1999',
292: '2006',
293: '2008',
294: '2009',
295: '2011',
296: '2013'}}

要点:虽然内部连接是合适的(切换连接类型肯定不能解决问题(,但我没有完全考虑机器/Pandas 如何考虑内部连接。最终,识别错误结果中的模式并在其中一个数据帧中找到类似的模式,这两个数据帧都已经过错误检查,事实证明这是最有帮助的。

当我运行时,我没有重复项:

df = pd.merge(df1, df2, how='inner', on='Name')

打印方式:

print(json.dumps(json.loads(df.to_json()),indent=4))

我得到:

{
"Winner": {
"0": null,
"1": 1.0,
"3": null,
"2": null
},
"Ceremony": {
"0": 1,
"1": 1,
"3": 2,
"2": 1
},
"Year": {
"0": "1927/1928",
"1": "1927/1928",
"3": "1928/1929",
"2": "1927/1928"
},
"Film": {
"0": "A Ship Comes In",
"1": "7th Heaven",
"3": "Madame X",
"2": "Sadie Thompson"
},
"Name": {
"0": "Louise Dresser",
"1": "Janet Gaynor",
"3": "Ruth Chatterton",
"2": "Gloria Swanson"
},
"Award": {
"0": "Best Actress",
"1": "Best Actress",
"3": "Best Actress",
"2": "Best Actress"
},
"DOB": {
"0": -2878243200000,
"1": -1995667200000,
"3": -2430518400000,
"2": -2233180800000
},
"Ceremony_Date": {
"0": -1282176000000,
"1": -1282176000000,
"3": -1254355200000,
"2": -1282176000000
}
}

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Award</th>      <th>Ceremony</th>      <th>Ceremony_Date</th>      <th>Film</th>      <th>Name</th>      <th>Winner</th>      <th>Year</th>      <th>DOB</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Best Actress</td>      <td>1</td>      <td>1929-05-16</td>      <td>A Ship Comes In</td>      <td>Louise Dresser</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1878-10-17</td>    </tr>    <tr>      <th>1</th>      <td>Best Actress</td>      <td>1</td>      <td>1929-05-16</td>      <td>7th Heaven</td>      <td>Janet Gaynor</td>      <td>1.0</td>      <td>1927/1928</td>      <td>1906-10-06</td>    </tr>    <tr>      <th>2</th>      <td>Best Actress</td>      <td>1</td>      <td>1929-05-16</td>      <td>Sadie Thompson</td>      <td>Gloria Swanson</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1899-03-27</td>    </tr>    <tr>      <th>3</th>      <td>Best Actress</td>      <td>2</td>      <td>1930-04-03</td>      <td>Madame X</td>      <td>Ruth Chatterton</td>      <td>NaN</td>      <td>1928/1929</td>      <td>1892-12-24</td>    </tr>  </tbody></table>

df = pd.merge(df1, df2, how='outer', on='Name')

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Award</th>      <th>Ceremony</th>      <th>Ceremony_Date</th>      <th>Film</th>      <th>Name</th>      <th>Winner</th>      <th>Year</th>      <th>DOB</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Best Actress</td>      <td>1.0</td>      <td>1929-05-16</td>      <td>A Ship Comes In</td>      <td>Louise Dresser</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1878-10-17</td>    </tr>    <tr>      <th>1</th>      <td>Best Actress</td>      <td>1.0</td>      <td>1929-05-16</td>      <td>7th Heaven</td>      <td>Janet Gaynor</td>      <td>1.0</td>      <td>1927/1928</td>      <td>1906-10-06</td>    </tr>    <tr>      <th>2</th>      <td>Best Actress</td>      <td>1.0</td>      <td>1929-05-16</td>      <td>Sadie Thompson</td>      <td>Gloria Swanson</td>      <td>NaN</td>      <td>1927/1928</td>      <td>1899-03-27</td>    </tr>    <tr>      <th>3</th>      <td>Best Actress</td>      <td>2.0</td>      <td>1930-04-03</td>      <td>Madame X</td>      <td>Ruth Chatterton</td>      <td>NaN</td>      <td>1928/1929</td>      <td>1892-12-24</td>    </tr>    <tr>      <th>4</th>      <td>Best Actress</td>      <td>2.0</td>      <td>1930-04-03</td>      <td>The Barker</td>      <td>Betty Compson</td>      <td>NaN</td>      <td>1928/1929</td>      <td>NaT</td>    </tr>    <tr>      <th>5</th>      <td>NaN</td>      <td>NaN</td>      <td>NaT</td>      <td>NaN</td>      <td>Mary Pickford</td>      <td>NaN</td>      <td>NaN</td>      <td>1892-04-08</td>    </tr>  </tbody></table>

使用drop_duplicates()方法。例如:

merged_df = pd.merge(left_df, right_df, on='common_key', how='inner').drop_duplicates()

最新更新