I have the following data:
idx A B C D
0 0.0 0.0 0.0 apple
1 0.5 0.5 0.6 car
2 0.7 0.7 0.2 vegetables
3 0.8 0.9 0.4 fruits
4 0.9 1.0 0.8 metal
idx E
0 0.000006
idx A B C D
0 1.0 1.1 0.1 computer
1 0.8 1.6 1.0 books
2 0.9 1.9 1.1 textile
idx E
0 1.000009
idx A B C D
0 0.7 2.5 2 mouse
1 0.6 2.9 3 animals
2 0.5 3.0 2 birds
3 0.9 3.3 4 flower
4 1.0 3.4 5 garden
5 1.0 3.8 1 desk
6 0.85 3.9 8 tea
7 0.2 4.2 9 bread
8 0.1 4.9 3 paper
9 0.7 7.6 6 butter
idx E
0 0.9
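Wherever an idx E block appears, I want to remove the repeated headers, repeat the last row above it with a dot in place of the column D value, and turn E into a column holding its value (repeated over the whole corresponding block). I want to change it conditionally with Python, as shown below: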
idx A B C D E
0 0.0 0.0 0.0 apple 0.000006
1 0.5 0.5 0.6 car 0.000006
2 0.7 0.7 0.2 vegetables 0.000006
3 0.8 0.9 0.4 fruits 0.000006
4 0.9 1.0 0.8 metal 0.000006
5 0.9 1.0 0.0 . 0.000006
6 1.0 1.1 0.1 computer 1.000009
7 0.8 1.6 1.0 books 1.000009
8 0.9 1.9 1.1 textile 1.000009
9 0.9 1.9 0.0 . 1.000009
10 0.7 2.5 2 mouse 0.9
11 0.6 2.9 3 animals 0.9
12 0.5 3.0 2 birds 0.9
13 0.9 3.3 4 flower 0.9
14 1.0 3.4 5 garden 0.9
15 1.0 3.8 1 desk 0.9
16 0.85 3.9 8 tea 0.9
17 0.2 4.2 9 bread 0.9
18 0.1 4.9 3 paper 0.9
19 0.7 7.6 6 butter 0.9
20 0.7 7.6 0.0 . 0.9
Is it possible to do this with a conditional loop? Or with a DataFrame like this?
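First remove the repeated header rows (the rows whose column A holds the string 'A') by inverting a Series.isin mask with ~ in boolean indexing, and create a default index. The idx E rows must stay for now, because the masks in the next step look for 'E' in column A:
df = df[~df['A'].isin(['A'])].reset_index(drop=True)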
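The line above assumes df already contains every input line as strings, including the repeated headers and the idx E rows. The answer does not show how the data was loaded; the following is only a sketch of one way to get there, assuming whitespace-separated text. The sample string raw is a shortened, hypothetical copy of the question's input.

```python
import io
import pandas as pd

# Hypothetical sample: a shortened copy of the question's input.
raw = """idx A B C D
0 0.0 0.0 0.0 apple
1 0.5 0.5 0.6 car
idx E
0 0.000006
idx A B C D
0 1.0 1.1 0.1 computer
idx E
0 1.000009
"""

# Assumptions: fields are separated by whitespace, and every value is read
# as a string so that header-like rows can later be matched by name.
# The first header line is skipped and replaced by explicit column names;
# short rows such as 'idx E' are padded with NaN.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    dtype=str,
    names=["idx", "A", "B", "C", "D"],
    skiprows=1,
)
print(df)
```

After loading, the repeated headers show up as rows with 'A' in column A and the E markers as rows with 'E' in column A, which is the layout the masks in this answer rely on.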
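Then build two masks on column A: m1 marks the idx E rows and m marks the row right after each of them, which holds the E value. Create column E with Series.where by keeping the values of A only where m is true, and back-fill the missing values. Next set A, B and C to missing values in the rows of both masks with DataFrame.mask and forward-fill them, so that each E-value row repeats the last row above it. Finally set . in column D and 0 in column C for those rows, and drop the idx E rows: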
m = df['A'].shift().eq('E')   # rows right after an 'idx E' row (they hold the E value)
m1 = df['A'].eq('E')          # the 'idx E' rows themselves
df['E'] = df['A'].where(m).bfill()
df[['A','B', 'C']] = df[['A','B', 'C']].mask(m | m1).ffill()
df.loc[m, 'D'] = '.'
df.loc[m, 'C'] = 0
df = df[~m1].reset_index(drop=True)
print(df)
A B C D E
0 0.0 0.0 0.0 apple 0.000006
1 0.5 0.5 0.6 car 0.000006
2 0.7 0.7 0.2 vegetables 0.000006
3 0.8 0.9 0.4 fruits 0.000006
4 0.9 1.0 0.8 metal 0.000006
5 0.9 1.0 0 . 0.000006
6 1.0 1.1 0.1 computer 1.000009
7 0.8 1.6 1.0 books 1.000009
8 0.9 1.9 1.1 textile 1.000009
9 0.9 1.9 0 . 1.000009
10 0.7 2.5 2 mouse 0.9
11 0.6 2.9 3 animals 0.9
12 0.5 3.0 2 birds 0.9
13 0.9 3.3 4 flower 0.9
14 1.0 3.4 5 garden 0.9
15 1.0 3.8 1 desk 0.9
16 0.85 3.9 8 tea 0.9
17 0.2 4.2 9 bread 0.9
18 0.1 4.9 3 paper 0.9
19 0.7 7.6 6 butter 0.9
20 0.7 7.6 0 . 0.9
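To see why the where/bfill and mask/ffill pairs produce this result, here is a toy sketch on a single column. The Series s is made up for illustration and stands in for column A of one block followed by the start of the next.

```python
import pandas as pd

# Made-up column: two data rows, the 'E' marker, the E value, next block.
s = pd.Series(["0.8", "0.9", "E", "0.000006", "1.0"])

m1 = s.eq("E")           # the marker row
m = s.shift().eq("E")    # the row right after the marker (holds the E value)

# Keep only the E value, then fill it upwards over the block it belongs to.
e = s.where(m).bfill()

# Blank the marker and value rows, then fill downwards from the last data row.
a = s.mask(m | m1).ffill()

print(pd.DataFrame({"s": s, "E": e, "A": a}))
```

Back-filling spreads each E value over the rows that precede it, while forward-filling copies the last real row into the rows that originally held the marker and the E value.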
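I would not use pandas here. I would fall back to the good old csv module, which I think is more versatile and can handle files that are not really in CSV format: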
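import csv

input_csv = 'input.txt'    # placeholder: put here the actual input path
output_csv = 'output.txt'  # placeholder: put here the actual output path
delimiter = ' '            # put here the actual delimiter
with open(input_csv, newline='') as infile, open(output_csv, 'w', newline='') as outfile:
    rd = csv.reader(infile, delimiter=delimiter)
    wr = csv.writer(outfile, delimiter=delimiter)
    wr.writerow(['idx', 'A', 'B', 'C', 'D', 'E'])  # write a header
    nb = 0        # record number
    pool = []     # rows of the current block
    flag = False  # True when the next row holds the E value
    for row in rd:
        if flag:
            e = row[-1]         # use last value
            for r in pool:      # copy it for the whole block
                r.append(e)
            wr.writerows(pool)  # write the block
            flag = False
        elif row[0] == 'idx':
            if row[1] == 'E':
                pool.append(pool[-1][:])  # repeat the last row of the block
                pool[-1][-1] = '.'        # replace its D value with a dot
                pool[-1][0] = nb
                nb += 1
                flag = True
            else:
                pool = []                 # a new block starts
        else:
            row[0] = nb
            pool.append(row)
            nb += 1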
With your input, it gives:
idx A B C D E
0 0.0 0.0 0.0 apple 0.000006
1 0.5 0.5 0.6 car 0.000006
2 0.7 0.7 0.2 vegetables 0.000006
3 0.8 0.9 0.4 fruits 0.000006
4 0.9 1.0 0.8 metal 0.000006
5 0.9 1.0 0.8 . 0.000006
6 1.0 1.1 0.1 computer 1.000009
7 0.8 1.6 1.0 books 1.000009
8 0.9 1.9 1.1 textile 1.000009
9 0.9 1.9 1.1 . 1.000009
10 0.7 2.5 2 mouse 0.9
11 0.6 2.9 3 animals 0.9
12 0.5 3.0 2 birds 0.9
13 0.9 3.3 4 flower 0.9
14 1.0 3.4 5 garden 0.9
15 1.0 3.8 1 desk 0.9
16 0.85 3.9 8 tea 0.9
17 0.2 4.2 9 bread 0.9
18 0.1 4.9 3 paper 0.9
19 0.7 7.6 6 butter 0.9
20 0.7 7.6 6 . 0.9
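Note that this output keeps the last C value in the repeated rows, whereas the question asks for 0.0 there. The following is a sketch of the same block-buffering idea written as a generator over in-memory lines, with the C value of the repeated row made configurable. The names merge_blocks and c_fill are hypothetical, and skipping blank lines is an addition that the code above does not have.

```python
import csv

def merge_blocks(lines, delimiter=" ", c_fill="0.0"):
    """Yield rows [idx, A, B, C, D, E] from block-structured lines."""
    pool = []         # rows of the block being collected
    nb = 0            # running record number
    expect_e = False  # True when the next row carries the E value
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines (an addition, see lead-in)
        if expect_e:
            for r in pool:
                yield r + [row[-1]]  # append E to every row of the block
            pool, expect_e = [], False
        elif row[0] == "idx":
            if len(row) > 1 and row[1] == "E" and pool:
                extra = pool[-1][:]  # repeat the last row of the block
                extra[0], extra[3], extra[4] = str(nb), c_fill, "."
                pool.append(extra)
                nb += 1
                expect_e = True
            # any other 'idx ...' line is a repeated header and is dropped
        else:
            pool.append([str(nb)] + row[1:])  # renumber the data row
            nb += 1

# Shortened, hypothetical sample of the question's input.
sample = [
    "idx A B C D",
    "0 0.0 0.0 0.0 apple",
    "1 0.5 0.5 0.6 car",
    "idx E",
    "0 0.000006",
    "idx A B C D",
    "0 1.0 1.1 0.1 computer",
    "idx E",
    "0 1.000009",
]
rows = list(merge_blocks(sample))
for r in rows:
    print(" ".join(r))
```

Because csv.reader accepts any iterable of strings, the same function can be fed an open file object instead of a list, and the yielded rows can be passed to csv.writer.writerows.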