Python: random seed problem when drawing random samples



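I have a dataset whose first column is the text, the second column is called author, and the third column is called title. I want to split the dataset into 3 subsamples based on the title. Note that many different texts share the same title.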

import random

# Find the unique titles
random.seed(42)
mylist = list(set(list(dt_chunks['title'])))
print(len(mylist))
# Draw a random sample of titles and select all texts that have those titles
random.seed(42)
trainlist = random.sample(mylist, k = int(len(mylist)*0.7))
pattern = '|'.join(trainlist)
train_idx = dt_chunks['title'].str.contains(pattern)
train_df = dt_chunks[train_idx]
# New list containing the titles that are not in the training list
random.seed(42)
extralist = list(set(mylist)^set(trainlist))
# Same logic for the validation set
random.seed(42)
validlist = random.sample(extralist, k = int(len(extralist)*0.5))
pattern = '|'.join(validlist)
valid_idx = dt_chunks['title'].str.contains(pattern)
valid_df = dt_chunks[valid_idx]
# Same logic: the remaining titles form the test set
random.seed(42)
testlist = list(set(validlist)^set(extralist))
pattern = '|'.join(testlist)
test_idx = dt_chunks['title'].str.contains(pattern)
test_df = dt_chunks[test_idx]

The problem is that I set a random seed, but if I restart Google Colab, the output is not the same. I would be grateful for any help.

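It may be that dt_chunks['title'] is not the same on every run. If so, len(mylist) changes as well, and random.sample(mylist, k = int(len(mylist)*0.7)) ends up calling the sampling function internally a different number of times from one run to the next.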
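Separately from the explanation above, there is a second possible cause worth ruling out: by default Python randomizes string hashing per process, so list(set(...)) can return the same titles in a different order after a restart, and random.sample with the same seed then picks different elements. The sketch below is one way to remove that dependence by sorting the unique titles before sampling. It is only an illustration, not the asker's method: the helper name split_titles, its parameters, and the toy titles list are all made up for the example.

```python
import random


def split_titles(titles, train_frac=0.7, valid_frac=0.5, seed=42):
    """Split unique titles into train/valid/test lists (hypothetical helper)."""
    # Sorting removes the dependence on set iteration order, which can
    # differ between interpreter sessions for strings.
    unique_titles = sorted(set(titles))
    rng = random.Random(seed)  # local generator, independent of global state

    train = rng.sample(unique_titles, k=int(len(unique_titles) * train_frac))
    train_set = set(train)
    rest = [t for t in unique_titles if t not in train_set]

    valid = rng.sample(rest, k=int(len(rest) * valid_frac))
    valid_set = set(valid)
    test = [t for t in rest if t not in valid_set]
    return train, valid, test


# Toy data standing in for dt_chunks['title'] (assumption for illustration).
titles = ["a", "b", "a", "c", "d", "e", "f", "g", "h", "i", "j", "b"]
train_titles, valid_titles, test_titles = split_titles(titles)

# With a pandas DataFrame such as dt_chunks, exact matching could then be:
# train_df = dt_chunks[dt_chunks["title"].isin(train_titles)]
```

The commented isin line is suggested instead of str.contains with a '|'-joined pattern because str.contains interprets the pattern as a regular expression and matches substrings, so a title that contains another title, or one with special characters, could land in more than one subset.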
