Trouble reading and writing Unicode/non-ASCII characters in files with Python



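I have a directory structure that contains many directories with non-ASCII characters in their names, mostly Sanskrit. I am indexing these directories/files in a script, but I can't figure out how best to handle these cases. This is my process: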

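  • Hash all files, recursively writing each file's path, filename, and hash to a .tsv file
  • Go through this file and sort each line according to whether a duplicate hash exists. This produces a dictionary with entries of the form {'path': columns[0], 'filename': columns[1], 'status': True}, where status determines whether an action is later performed on the file
  • Go through this dictionary and move duplicates from their original location into an offset root path (e.g. ./duplicates instead of ./)
  • For each file moved, write to a file the command that would reverse the move if necessary (just mv a b); this isn't critical, but I thought I'd include it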

Here is some sample data and what I have written so far:

Example generated TSV (path/name/hash):

./Personal Research/Ramnad 9"14"10  DSC_0004.JPG    850cd9dcb0075febd4c0dcd549dd7860
./Personal Research/Ramnad 9"14"10  DSC_0010.JPG    9db2219fc4c9423016fb9e295452f1ad
./Personal Research/Ramnad 9"14"10  DSC_0006.JPG    ef7d13b88bbaabc029390bcef1319bb1

The " is actually a Unicode character:

Block: Private Use Area
Unicode: U+F019
UTF-8: 0xEF 0x80 0x99
JavaScript: 0xF019

Code: writing the above to a file (fulltsv):

for root, dirs, files in os.walk(SRC_DIR, topdown=True):
    files[:] = [f for f in files if any(ext in f for ext in EXT_LIST) if not f.startswith('.')]
    for file in files:
        with open(os.path.join(root,file),'r') as f:
            with open(SAVE_DIR+re.sub(r'\W+', '', os.path.basename(root).lower())+'.tsv', 'a') as fout:
                writer = csv.writer(fout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                checksums = []
                with open(os.path.join(root, file), 'rb') as _file:
                    checksums.append([root, file, hashlib.md5(_file.read()).hexdigest()])
                    writer.writerows(checksums)

Reading that file:

#       generate list of all tsv
for (dir, subs, files) in os.walk(ROOT):
    #   remove the new-root from the search
    subs = [s for s in subs if NROOT not in s]
    for f in files:
        fpath = os.path.join(dir,f)
        if ".tsv" in fpath:
            TSVLIST.append(fpath)
#       open/append all TSV content to a single new TSV
with open(FULL,'w') as wfd:
    for f in TSVLIST:
        with open(f,'r') as fd:
            wfd.write(fd.read())
            lines = sum(1 for line in f)
#   add all entries to a dictionary keyed to their hash
entrydict = {}
ec = 0
with open(FULL, 'r') as fulltsv:
    for line in fulltsv:
        columns = line.strip().split('\t')
        if not columns[2].startswith('.'):
            if columns[2] not in entrydict.keys():
                entrydict[str(columns[2])] = []
            entrydict[str(columns[2])].append({'path': columns[0], 'filename': columns[1], 'status': True})
            if len(entrydict[str(columns[2])]) > 1:
                ec += 1
ed = {k:v for k,v in entrydict.items() if len(v)>=2}

Moving the duplicates:

for e in f:
    if len(f)-mvcnt > 1:
        if e['status'] is True:
            p = e['path']    #   path
            n = e['filename']   #   name
            n0,n0ext = os.path.splitext(n)
            n1 = n
            #   directory structure for new file
            FROOT = p.replace(p.split('/')[0],NROOT,1)
            n1 = n
            rebk = 'mv {0}/{1} {2}/{3}'.format(FROOT,n,p,n)
            shutil.move('{0}/{1}'.format(p,n),'{0}/{1}'.format(FROOT,n))
            dupelist.write('{0} #{1}\n'.format(rebk,str(h)))
            mvcnt += 1

The error I get:

Traceback (most recent call last):
  File "/usr/lib/python3.6/shutil.py", line 550, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF' -> './duplicateRoot/Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dCompare.py", line 164, in <module>
    shutil.move('{0}/{1}'.format(p,n),'{0}/{1}'.format(FROOT,n))
  File "/usr/lib/python3.6/shutil.py", line 564, in move
    copy_function(src, real_dst)
  File "/usr/lib/python3.6/shutil.py", line 263, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'

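Obviously this has something to do with how I am handling the Unicode characters, but I have never worked with them before and am not sure at which point, or how, I should be handling the filenames. Running Ubuntu 10 under the Windows Subsystem for Linux, with Python 3.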

One problem I see when reading the stack trace is that, given the OP's sample TSV, the Unicode characters are wrong (they are not there):

FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF' -> './duplicateRoot/Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'

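Both the source and destination paths contain some quote escaping that I don't think should be there: extra and doubled quotation marks, as if the path had been broken apart and joined back together (or something like that):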

'"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'
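
One mechanism that would produce exactly this pattern, assuming the directory name reaches csv.writer with literal " characters in it: with quoting=csv.QUOTE_MINIMAL, a field that contains the quote character is wrapped in quotes and its inner quotes are doubled. Splitting the line on tabs by hand then returns the still-quoted text, whereas csv.reader with the same delimiter and quotechar undoes the quoting. This is only a sketch of that guess; the sample path is taken from the traceback, and whether this is what happens in the OP's script is not confirmed.

```python
import csv
import io

# Assumption: the path contains literal double quotes, as the traceback suggests.
original = './Personal Research/Ramnad 9"14"10'

# Write one row with the same writer settings as the OP's script.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow([original, 'DSC_0003.NEF', 'hash'])
line = buf.getvalue()

# Splitting by hand keeps the CSV quoting in the first column.
naive_path = line.strip().split('\t')[0]

# csv.reader with the same delimiter/quotechar restores the original text.
parsed_path = next(csv.reader(io.StringIO(line), delimiter='\t', quotechar='"'))[0]

print(naive_path)
print(parsed_path)
```

If this is the cause, reading the TSV back with csv.reader instead of line.strip().split('\t') would be the matching change on the reading side.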

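I tried to recreate the OP's error but could not. When I worked through the example below I did initially get a FileNotFoundError (because I was missing the destination folder, which is why my example calls os.makedirs()), but the path was encoded correctly: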

FileNotFoundError: [Errno 2] No such file or directory: 'foo/Personal Research/Ramnad 9\uf01914\uf01910/DSC_0006.JPG'

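All I can offer is a guess that the encoding is getting mangled either in the TSV file or in entrydict. OP, have you inspected that file or the dict in the interpreter and verified that you see \uf019 in the paths where you expect it? You could make sure those code points are present with something like: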

>>> print(path.encode('unicode_escape'))
b'./Personal Research/Ramnad 9\\uf01914\\uf01910'
>>> # or, look for 61465
>>> [ord(char) for char in path]
[46, 47, 80, 101, 114, 115, 111, 110, 97, 108, 32, 82, 101, 
115, 101, 97, 114, 99, 104, 47, 82, 97, 109, 110, 97, 100, 
32, 57, 61465, 49, 52, 61465, 49, 48]
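
The same inspection can be wrapped in a small helper that lists only the non-ASCII characters, which is easier to scan than a full list of ordinals. This is a sketch; non_ascii_report is a hypothetical name, not something from the OP's script.

```python
import unicodedata


def non_ascii_report(text):
    """Return (index, 'U+XXXX', category, name) for every non-ASCII character."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            report.append((
                index,
                'U+{:04X}'.format(ord(char)),
                unicodedata.category(char),           # 'Co' marks a private-use code point
                unicodedata.name(char, '<unnamed>'),  # private-use characters have no name
            ))
    return report


path = './Personal Research/Ramnad 9\uf01914\uf01910'
for entry in non_ascii_report(path):
    print(entry)
```

An empty report for a path read back from the TSV would indicate that the code points were lost somewhere between writing and reading.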

Here is my attempt, which may help...

I created a sample TSV file and the corresponding directory structure:

>>> p='./Personal Research/Ramnad 9\uf01914\uf01910'
>>> os.makedirs(p)
>>> checksums=[[p, 'DSC_0006.JPG', 'hash']]
>>> with open('full.tsv', 'a') as fout:
...     writer = csv.writer(fout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
...     writer.writerows(checksums)

And touched the file in the shell:

$ touch $'Personal Research/Ramnad 9\uf01914\uf01910/DSC_0006.JPG'

Checked full.tsv to make sure it was written correctly:

$ cat full.tsv
./Personal Research/Ramnad 91410  DSC_0006.JPG    hash

The empty boxes are the correctly UTF-8-encoded code point, based on the Unicode description of " that the OP included.

Ran hexdump -C full.tsv to confirm the UTF-8 encoding (look for two groups of ef 80 99):

00000010  72 63 68 2f 52 61 6d 6e  61 64 20 39 ef 80 99 31  |rch/Ramnad 9...1|
00000020  34 ef 80 99 31 30 09 44  53 43 5f 30 30 30 36 2e  |4...10.DSC_0006.|
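
The same byte-level check can be done from Python when hexdump is not at hand. A small sketch, with count_code_point_in_file being a hypothetical helper name: it encodes the character to UTF-8 and counts that byte sequence in the raw file contents.

```python
def count_code_point_in_file(file_path, char):
    """Count occurrences of char's UTF-8 byte sequence in a file read as raw bytes."""
    needle = char.encode('utf-8')
    with open(file_path, 'rb') as handle:
        return handle.read().count(needle)


# U+F019 encodes to the three bytes seen in the hexdump.
print('\uf019'.encode('utf-8').hex())  # prints ef8099
```

For the file above, count_code_point_in_file('full.tsv', '\uf019') should return 2 if both code points were written intact.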

Then I ran:

>>> entrydict = {}
>>> ec = 0
>>> with open('full.tsv', 'r') as fulltsv:
...     for line in fulltsv:
...         columns = line.strip().split('\t')
...         if not columns[2].startswith('.'):
...             if columns[2] not in entrydict.keys():
...                 entrydict[str(columns[2])] = []
...             entrydict[str(columns[2])].append({'path': columns[0], 'filename': columns[1], 'status': True})
...             if len(entrydict[str(columns[2])]) > 1:
...                 ec += 1
...
>>> entrydict
{'hash': [{'path': './Personal Research/Ramnad 9\uf01914\uf01910', 'filename': 'DSC_0006.JPG', 'status': True}]}

Finally:

>>> e = entrydict['hash'][0]
>>> e
{'path': './Personal Research/Ramnad 9\uf01914\uf01910', 'filename': 'DSC_0006.JPG', 'status': True}
>>> NROOT='foo'
>>> if e['status'] is True:
...     p = e['path']    #   path
...     n = e['filename']   #   name
...     n0,n0ext = os.path.splitext(n)
...     n1 = n
...     #   directory structure for new file
...     FROOT = p.replace(p.split('/')[0],NROOT,1)
...     rebk = 'mv {0}/{1} {2}/{3}'.format(FROOT,n,p,n)
...     print(rebk)
...     src='{0}/{1}'.format(p,n)
...     dst='{0}/{1}'.format(FROOT,n)
...     os.makedirs(FROOT)
...     shutil.move(src,dst)

It worked. Bummer.
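
As a side note on the move step, the paths here contain spaces and unusual characters, so the generated mv command is safer if each path is shell-quoted. The following is a minimal sketch under my own assumptions: move_with_undo is a hypothetical helper, the paths are relative (as in the sample TSV), and mirroring the original layout below the new root is my reading of what FROOT is meant to be.

```python
import os
import shlex
import shutil


def move_with_undo(path, name, new_root):
    """Move path/name under new_root and return a shell command that reverses the move."""
    src = os.path.join(path, name)
    # Assumption: path is relative, so joining mirrors the original layout below new_root.
    dst_dir = os.path.normpath(os.path.join(new_root, path))
    dst = os.path.join(dst_dir, name)
    os.makedirs(dst_dir, exist_ok=True)  # no error if the folder already exists
    shutil.move(src, dst)
    # shlex.quote keeps spaces, quotes and other special characters intact in the shell.
    return 'mv {0} {1}'.format(shlex.quote(dst), shlex.quote(src))
```

Calling move_with_undo(e['path'], e['filename'], NROOT) would then replace the manual string formatting, and the returned line can be written to the undo file as before.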
