Remove the first occurrence of repeated values in an array. Python, NumPy, Pandas, Arrays



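So I have this NumPy array result (final) and I want to reduce it: if a value is repeated, I want to delete its first occurrence and keep the second, third, and any later repeats.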

import hmac
import hashlib
import time
from argparse import _MutuallyExclusiveGroup
from tkinter import *
import pandas as pd
import base64
import matplotlib.pyplot as plt
import numpy as np

key="800070FF00FF08012"
key=bytes(key,'utf-8')
collision=[]
for x in range(1,1000001):
    msg=bytes(f'{x}','utf-8')
    digest = hmac.new(key, msg,"sha256").digest()
    code = base64.b64encode(digest).decode('utf-8')
    code=code[:6]
    key=key.replace(key,digest)
    collision.append(code)
df=pd.DataFrame(collision)
df=df[df.duplicated(keep=False)]
df_index=df.index.to_numpy()
df=df.values.flatten()
final=np.stack((df_index,df),axis=1)
Results of the variable "final":
I HAVE:
[[14093 'JRp1kX']
[43985 'KGlW7X']
[59212 'pU97Tr']
[90668 'ecTjTB']
[140615 'JRp1kX']
[218480 '25gtjT']
[344174 'dtXg6E']
[380467 'DdHQ3M']
[395699 'vnFw/c']
[503504 'dtXg6E']
[531073 'KGlW7X']
[633091 'ecTjTB']
[671091 'vnFw/c']
[672111 '25gtjT']
[785568 'pU97Tr']
[991540 'DdHQ3M']
[991548 'JRp1kX']]

And I WANT TO HAVE:
[[140615 'JRp1kX']
[503504 'dtXg6E']
[531073 'KGlW7X']
[633091 'ecTjTB']
[671091 'vnFw/c']
[672111 '25gtjT']
[785568 'pU97Tr']
[991540 'DdHQ3M']
[991548 'JRp1kX']]

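I want to delete the first occurrence of each repeated value in the array. Can someone help me with some code?

In simpler terms: if you have the list [1,2,3,4,5,1,3,5,5], I want to get [2,4,1,3,5,5].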

df = pd.DataFrame([1, 2, 3, 4, 5, 1, 3, 5, 5])
# keep the unique rows
unique_mask = ~df.duplicated(keep=False)
# keep the repeated rows (skipping the first for each non-unique)
repeated_mask = df.duplicated()
df.loc[unique_mask | repeated_mask]
0
1  2
3  4
5  1
6  3
7  5
8  5
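As a rough sketch of how the same two masks could be applied to the question's data while keeping the original row numbers, the example below uses five rows copied from the "I HAVE" output. The names `codes` and `kept` are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

# A few (index, code) rows copied from the question's output.
codes = pd.Series(
    ["JRp1kX", "KGlW7X", "JRp1kX", "KGlW7X", "JRp1kX"],
    index=[14093, 43985, 140615, 531073, 991548],
)

# Rows whose value occurs exactly once (none in this sample).
unique_mask = ~codes.duplicated(keep=False)
# Rows that repeat an earlier value, i.e. everything except first occurrences.
repeated_mask = codes.duplicated()

kept = codes[unique_mask | repeated_mask]

# Rebuild a two-column array in the same layout as `final`.
final = np.stack((kept.index.to_numpy(), kept.to_numpy()), axis=1)
print(final)
```

Because the masks are boolean Series aligned on the index, the surviving rows keep their original positions (140615, 531073, 991548 here).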

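final is a NumPy array, so you can use np.unique on the second column to get the index of each value's first occurrence together with its count; the counts let you avoid deleting values that appear only once.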

_, idx, counts = np.unique(final[:, 1], return_index=True, return_counts=True)
idx = idx[counts > 1]
final = np.delete(final, idx, axis=0)

This works for the 2-D ndarray; for your second example, the 1-D array, use

_, idx, counts = np.unique(final, return_index=True, return_counts=True)
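For the 1-D case, the full sequence of steps might look like the sketch below, using the list from the question. The variable names `values` and `reduced` are assumptions made for this example.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 1, 3, 5, 5])

# return_index gives the position of each unique value's first occurrence,
# return_counts gives how many times each unique value appears.
_, idx, counts = np.unique(values, return_index=True, return_counts=True)

# Keep only the first-occurrence positions of values that are repeated.
idx = idx[counts > 1]

# Delete those positions; for a 1-D array no axis argument is needed.
reduced = np.delete(values, idx)
print(reduced)  # [2 4 1 3 5 5]
```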

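Maybe you could write a for loop

to_remove = list()
for i in range(len(your_list)):
    # mark only the first occurrence of a value that appears again later
    if your_list[i] in your_list[i + 1:] and your_list[i] not in your_list[:i]:
        to_remove.append(i)
removed_count = 0
for i in to_remove:
    del your_list[i - removed_count]
    removed_count += 1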

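You cannot del inside the first loop right away, because i keeps advancing to the next index, so an element would be skipped every time one is deleted.

[i - removed_count] is used because each deletion at a lower index immediately shifts every higher index down by 1.

I think this could be written more efficiently, but it should work, perhaps with a small change.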
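One possible more efficient variant, offered only as a sketch: count every value once with `collections.Counter`, then walk the list a single time and skip the first occurrence of anything that appears more than once. The function name `drop_first_of_repeated` is hypothetical, and the approach assumes the list items are hashable.

```python
from collections import Counter


def drop_first_of_repeated(values):
    """Return a new list without the first occurrence of each repeated value."""
    counts = Counter(values)   # how often each value appears
    already_skipped = set()    # repeated values whose first occurrence was dropped
    result = []
    for value in values:
        if counts[value] > 1 and value not in already_skipped:
            # First time we meet a repeated value: drop it and remember that.
            already_skipped.add(value)
            continue
        result.append(value)
    return result


print(drop_first_of_repeated([1, 2, 3, 4, 5, 1, 3, 5, 5]))  # [2, 4, 1, 3, 5, 5]
```

This builds a new list instead of deleting in place, so no index adjustment is needed.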

After generating df, add the following lines:

df=pd.DataFrame(collision)
# ... your code ends here
removed_already=[]
for idx in df[df.duplicated(keep=False)].index:
    if df.loc[idx][0] not in removed_already:
        removed_already.append(df.loc[idx][0])
        df.drop(index=idx, inplace=True)
# your code continues
df_index=df.index.to_numpy()
df=df.values.flatten()
final=np.stack((df_index,df),axis=1)
