Check whether a variable's values belong to a set (bootstrap)

I have an array of integers, say

theIndex = [ 1 2 6 7 17 2]

I also have a DataFrame in which the column dataset[:id] contains integers, for example

dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])

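I want to select all observations in dataset whose id belongs to theIndex. If an id appears twice (or more) in theIndex, I want to select its observations twice (or more).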

At the moment I am doing this in a dumb way:

theIndex = [ 1 2 6 7 17 2]
dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])
dataset2 = DataFrame(id=Int64[])
for ii1=1:size(theIndex,2)
    for ii2=1:size(dataset[:id],1)
        any(i->i.==dataset[ii2,:id],theIndex[ii1]) ?
            push!(dataset2,dataset[ii2,:id]) : nothing
    end
end

Is there a more elegant solution?

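Essentially, the question asks for a SQL JOIN between theIndex and dataset. Unfortunately, this functionality is not fully implemented inside DataFrames. So here is a quick (and efficient) emulation of a JOIN for this purpose: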

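using DataStructures   # provides SortedDict
using StatsBase        # provides countmap
sort!(dataset, cols=:id)
j = 1
newvec = Vector{Int}()
# walk the sorted ids once, visiting the distinct values of theIndex in ascending order
for (val,cnt) in SortedDict(countmap(theIndex))
    while j<=nrow(dataset)
        dataset[j,:id] > val && break
        # record row j once per occurrence of val in theIndex
        dataset[j,:id] == val && append!(newvec,fill(j,cnt))
        j += 1
    end
end
dataset2 = dataset[newvec,:]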

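The DataStructures package is used for SortedDict. Because it makes a single pass over the sorted ids, this implementation should be more efficient than the nested-loop approaches.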
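For comparison, the same counting idea can be sketched without sorting and without extra packages. This is only a minimal sketch and not the method of the answer above: it swaps the sorted merge for a plain Dict lookup, and the names ids, counts and rows are made up for illustration, with ids standing in for the dataset's id column.

```julia
theIndex = [1 2 6 7 17 2]
ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]   # stand-in for the id column (assumption)

# Count how many times each value occurs in theIndex.
counts = Dict{Int,Int}()
for v in theIndex
    counts[v] = get(counts, v, 0) + 1
end

# Repeat each row number once per occurrence of its id in theIndex.
rows = Int[]
for (j, id) in enumerate(ids)
    append!(rows, fill(j, get(counts, id, 0)))
end

selected = ids[rows]
```

The rows vector plays the role of newvec, so it could be used to index the DataFrame in the same way. Unlike the sorted version, the rows keep their original order in the dataset.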

Following up on my earlier comment, you are looking for the findin function.

julia> Ind = findin( dataset[:id], theIndex); # return indices of elements in
# dataset[:id] that occur in
# theIndex
julia> dataset[:id][Ind]
4-element DataArrays.DataArray{Int64,1}:
1
1
2
2

(Alternatively, if you want the result as a sub-dataframe, i.e. a view into dataset, you can do something like SubDataFrame(dataset, Ind).)

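Edit: following the comments, to make sure that duplicates in theIndex are taken into account, here is an example that appends the matches for each element separately: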

Ind = []; for i in theIndex; append!(Ind, findin(dataset[:id], i)); end

You can then use Ind to create an array or a sub-dataframe, as above.
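Note that findin was removed in Julia 1.0; on newer versions, findall with a predicate covers the same ground. A minimal sketch of the per-element loop in that style, assuming ids is a plain vector standing in for dataset[:id]:

```julia
theIndex = [1 2 6 7 17 2]
ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]   # stand-in for dataset[:id] (assumption)

Ind = Int[]                             # typed, unlike the untyped [] above
for i in theIndex
    # positions in ids equal to i; a repeated i contributes its matches again
    append!(Ind, findall(==(i), ids))
end

picked = ids[Ind]
```

Here the value 2 occurs twice in theIndex, so its two rows end up in Ind twice.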

Edit 2:

julia> @time dataset2 = DataFrame(id=Int64[])
for ii1=1:size(theIndex,2)
for ii2=1:size(dataset[:id],1)
any(i->i.==dataset[ii2,:id],theIndex[ii1]) && 
push!(dataset2,dataset[ii2,:id])
end
end
0.000016 seconds (24 allocations: 1.594 KiB)
julia> @time Ind = []; for i in theIndex; append!(Ind, findin(dataset[:id], i)); end
0.000002 seconds (5 allocations: 240 bytes)

(The usual cautionary rant about benchmarking in global scope applies.)
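One way to sidestep the global-scope caveat is to wrap the code in a function, call it once so that compilation is not part of the measurement, and only then time it. The following is a rough sketch under those assumptions; select_rows is a hypothetical helper name, ids again stands in for the id column, and the predicate form of findall requires Julia 1.0 or later.

```julia
# Hypothetical helper: row numbers of ids matching each element of theIndex,
# with repeated elements of theIndex contributing their matches again.
function select_rows(theIndex, ids)
    out = Int[]
    for i in theIndex
        append!(out, findall(==(i), ids))
    end
    return out
end

theIndex = [1 2 6 7 17 2]
ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]

select_rows(theIndex, ids)                  # warm-up call, triggers compilation
t = @elapsed select_rows(theIndex, ids)     # elapsed seconds for one call
println("elapsed: ", t, " s")
```

A single call this short is still noisy, so repeated measurements would give a more trustworthy comparison.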
