Splitting a dataset of JPG and XML files into train and test sets



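I have a dataset for an object detection algorithm. It consists of images (.jpg) and corresponding .xml files that contain the bounding boxes.

I want to write a script that randomly splits the dataset into a train set and a test set, which means I have to make sure that each JPG ends up in the same directory as its corresponding XML.

How should I edit the following code to achieve this?

Also, is this the "best" way to do it, or would it be better to split the dataset after the XML-to-CSV conversion, once the CSV has been generated?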

import shutil, os, glob, random
# List all files in a directory using os.listdir
basepath = '/home/bis/hans/bis/workspace/images/Synced_dataset'
filenames = []
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        #print(entry)
        filenames.append(entry)
filenames.sort()  # make sure that the filenames have a fixed order before shuffling
random.seed(230)
random.shuffle(filenames) # shuffles the ordering of filenames (deterministic given the chosen seed)
split = int(0.8 * len(filenames))
train_filenames = filenames[:split]
test_filenames = filenames[split:]

My best suggestion is to create two file lists in matching order (filenames for the .jpg files and xmlnames for the .xml files), plus a list of indices, indices=[i for i in range(len(filenames))].
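One way to get the two lists into matching order is to build both from the base names that the image and label directories have in common. This is only a sketch: it assumes every image name.jpg is labelled by a file name.xml with the same base name, and the helper name build_paired_lists is made up for illustration.

```python
import os

def build_paired_lists(image_dir, label_dir):
    """Return (filenames, xmlnames); position i in both lists shares a base name."""
    # Map base name -> actual file name for every .jpg in image_dir
    jpgs = {os.path.splitext(f)[0]: f
            for f in os.listdir(image_dir) if f.lower().endswith('.jpg')}
    # Map base name -> actual file name for every .xml in label_dir
    xmls = {os.path.splitext(f)[0]: f
            for f in os.listdir(label_dir) if f.lower().endswith('.xml')}
    # Keep only base names present in both directories, in sorted order
    common = sorted(jpgs.keys() & xmls.keys())
    filenames = [jpgs[stem] for stem in common]
    xmlnames = [xmls[stem] for stem in common]
    return filenames, xmlnames
```

Images without a label (and labels without an image) are silently dropped here; depending on the dataset it may be preferable to report them instead.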

Then you can shuffle the list of indices:

random.seed(230)
random.shuffle(indices)

Finally, you create the train and test sets for the jpg and xml files:

split = int(0.8 * len(filenames))
file_train = [filenames[idx] for idx in indices[:split]]
file_test = [filenames[idx] for idx in indices[split:]]
xml_train = [xmlnames[idx] for idx in indices[:split]]
xml_test = [xmlnames[idx] for idx in indices[split:]]
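
The steps above can also be wrapped in a small function that returns (jpg, xml) pairs, so an image can never be separated from its label. This is a minimal sketch, not the original author's code: split_pairs, train_fraction and seed are illustrative names, and it assumes both input lists are already in matching order.

```python
import random

def split_pairs(filenames, xmlnames, train_fraction=0.8, seed=230):
    """Split two parallel lists into train and test lists of (jpg, xml) pairs."""
    if len(filenames) != len(xmlnames):
        raise ValueError("image and label lists must have the same length")
    indices = list(range(len(filenames)))
    # Use a local generator so the global random state is left untouched
    random.Random(seed).shuffle(indices)
    split = int(train_fraction * len(indices))
    train = [(filenames[i], xmlnames[i]) for i in indices[:split]]
    test = [(filenames[i], xmlnames[i]) for i in indices[split:]]
    return train, test
```

Because the same shuffled index selects both the image and the label, every pair stays together, and a fixed seed makes the split reproducible.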

import os, random

# List all files in the image and label directories using os.listdir
basepath = 'images/'
labelpath = 'label/'
filenames = []
xmlnames = []
for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        filenames.append(entry)

for entry in os.listdir(labelpath):
    if os.path.isfile(os.path.join(labelpath, entry)):
        xmlnames.append(entry)

# Sort both lists so that image i and label i belong together before shuffling
filenames.sort()
xmlnames.sort()
# The index-based pairing only works if every image has exactly one label file
assert len(filenames) == len(xmlnames), "every image needs a matching XML file"
indices = list(range(len(filenames)))
random.seed(230)
random.shuffle(indices)  # shuffles the indices (deterministic given the chosen seed)
split = int(0.8 * len(filenames))
file_train = [filenames[idx] for idx in indices[:split]]
file_test = [filenames[idx] for idx in indices[split:]]
xml_train = [xmlnames[idx] for idx in indices[:split]]
xml_test = [xmlnames[idx] for idx in indices[split:]]
print(file_test)
print(xml_test)

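So I followed the suggestion above (from Joseph) and added the indices. When the train and test variables are built, each image and its matching label end up in the corresponding variables. Hope this helps.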
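
To finish what the question asks for (each JPG ending up in the same directory as its XML), the split lists still have to be copied into place. A minimal sketch using shutil.copy2 follows; the function name copy_split and the dest_dir parameter are hypothetical, and it assumes the two lists passed in are in matching order.

```python
import os
import shutil

def copy_split(filenames, xmlnames, image_dir, label_dir, dest_dir):
    """Copy each image and its label side by side into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for jpg, xml in zip(filenames, xmlnames):
        # Image and label are copied in the same loop step, so a pair
        # always lands in the same destination directory
        shutil.copy2(os.path.join(image_dir, jpg), os.path.join(dest_dir, jpg))
        shutil.copy2(os.path.join(label_dir, xml), os.path.join(dest_dir, xml))
```

With the variables from the code above, this could be called as copy_split(file_train, xml_train, basepath, labelpath, 'train') and copy_split(file_test, xml_test, basepath, labelpath, 'test'), where 'train' and 'test' are placeholder output directories. Copying rather than moving keeps the original dataset intact, so the split can be redone with a different seed or ratio.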
