I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Some documents in it belong to more than one newsgroup. I want to treat such a document as two distinct entities, each belonging to a single newsgroup. To do this, I put the document IDs and the group labels into a DataFrame.
import os

import pandas as pd
from sklearn import datasets

data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)

# Use the file name (the last path component) as the document ID
keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])

groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
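For reference, a minimal sketch (using the groups DataFrame built above) of how to list the Document_IDs that occur more than once:
counts = groups['Document_ID'].value_counts()
# Keep only the IDs that appear in more than one row
print(counts[counts > 1].head())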
For each Document_ID that has more than one group number assigned to it, I want to change its value. For example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want it to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that does not already appear in the keys list.
How can I do this?
You can find all the rows that contain a duplicated Document_ID after its first occurrence using the duplicated method. Then create a list of new IDs that start just above the current maximum ID, and use the loc indexer to overwrite the duplicated keys with the new IDs.
# Keys came from file names, so convert them to integers first
groups['Document_ID'] = groups['Document_ID'].astype(int)
# Mark every occurrence after the first of each Document_ID
dupes = groups.Document_ID.duplicated(keep='first')
# Fresh IDs start just above the current maximum
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
# Overwrite the duplicated rows with the new IDs
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Make sure no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
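If your pandas version does not accept assigning a range object directly in the last step, converting it to a list first is a safe fallback (a small variation on the snippet above):
# Same assignment as before, but with an explicit list of new IDs
new_id = list(range(max_id, max_id + dupes.sum()))
groups.loc[dupes, 'Document_ID'] = new_id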
A bit hacky, but why not!
import numpy as np
import pandas as pd

data = {"Document_ID": [102994, 51861, 51879, 38242, 60880, 76139, 76139],
        "Group": [7, 1, 3, 4, 4, 6, 17],
        }
groups = pd.DataFrame(data)

# Create a list of unique IDs
DocList = groups['Document_ID'].unique().tolist()

# Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
    DocDict[x] = []
for index, row in groups.iterrows():
    DocDict[row['Document_ID']].append(row['Group'])

# For all doc IDs with multiple entries, create a new id with the group id as a decimal part
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1,
                                 groups["Document_ID"] + groups["Group"] / 10,
                                 groups["Document_ID"])
Hope this helps...