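I am dynamically populating a nested dictionary with data from MongoDB. I am not very experienced with dictionaries, so please bear with me. I have gone over this again and again and tried different approaches, but I keep getting the same wrong result.
The data I am trying to put into the dictionary is not in a tuple, as in the questions I have looked at, but in a MongoDB collection.
This is what the fields in my collection look like: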
new_crawl_130422_data.insert_one(
    {
        "database_url": proj_database_url,
        "database_project_id": proj_database_id,
        "projectname": proj_database_name,
        "version": version,
        "boost": boost,
        "content": content,
        "digest": digest,
        "title": title,
        "timestamp": timestamp,
        "url": website,
        "language": language
    }
)
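For a given project_id, the language field here can hold various languages. So essentially each project_id has many records, some of them written in different languages. What I want to do is create a nested dictionary keyed by project_id, whose inner keys are the different languages. So it should look like this: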
{Project_id1: {'it': "text here in Italian if it exists in the collection", 'en': "text here in English if it exists", 'de': "text here in German if it exists"}}
{Project_id2: {'en': "text here in English if it exists in the collection", 'fr': "text here in French if it exists", 'de': "text here in German if it exists"}}
and so on.
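So, while iterating over the records, it should take the language as the key and 'content' as the value. In addition, if the language key already exists in the dictionary, it should append the text with the matching language to that value. I don't know whether this is too much to ask of a dictionary?
So far I have made the following feeble attempts and got the same result each time: only the last record and language read ends up in the dictionary (it overwrites instead of appending), and the text is not concatenated.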
project_details = {}
for row in results:
    idProject = row[0]
    documents = mongo_db.new_collection_Eus.find(
        {"database_project_id": idProject},
        no_cursor_timeout=True).batch_size(100)
    for doc in documents:
        project_details[doc['database_project_id']] = {}
        [project_details[doc['database_project_id']][doc['language']]] = [doc['content']]
        for k,v in project_details[doc['database_project_id']].items():
            if k in [project_details[doc['database_project_id']]]:
                k[v] = project_details[doc['database_project_id']][doc['language']].append([doc['content']])
            else:
                [project_details[doc['database_project_id']][doc['language']]] = [doc['content']]
I also tried this:
for row in results:
    idProject = row[0]
    documents = mongo_db.new_collection_Eus.find(
        {"database_project_id": idProject},
        no_cursor_timeout=True).batch_size(100)
    for doc in documents:
        project_details[doc['database_project_id']] = {}
        if doc['language'] not in project_details[doc['database_project_id']].keys():
            project_details[doc['database_project_id']][doc['language']] = doc['content']
        else:
            project_details[doc['database_project_id']][doc['language']] = project_details[doc['database_project_id']][doc['language']] + ' ' + doc['content']
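Both give the same result: only one language ends up in the dictionary, even though the records contain many languages, and the text is not concatenated per language.
I have already looked at these questions:
- Adding input to an existing dictionary without overwriting
- Updating dictionary values without overwriting
- Adding multiple key-value updates
- How to add new values to a dictionary without overwriting.
Any help would be greatly appreciated, as I am stuck on this.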
I think this is a good job for defaultdict:
# simple setup for example
test = [
    (12, 'it', 'Buongiorno'),
    (12, 'it', 'tutti'),
    (12, 'fr', 'Salut'),
    (12, 'fr', 'tout le monde'),
    (12, 'en', 'Hello'),
    (12, 'en', 'world'),
    (13, 'en', 'and now'),
    (13, 'en', 'for something completely different'),
]
# shuffle into a nested default dict: d[proj_id][lang]: list
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
for proj_id, lang, text in test:
    d[proj_id][lang].append(text)
>>> d
defaultdict(<function __main__.<lambda>()>,
{12: defaultdict(list,
{'it': ['Buongiorno', 'tutti'],
'fr': ['Salut', 'tout le monde'],
'en': ['Hello', 'world']}),
13: defaultdict(list,
{'en': ['and now',
'for something completely different']})})
>>> list(d[12])
['it', 'fr', 'en']
>>> d[12]['fr']
['Salut', 'tout le monde']
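The same pattern should carry over to documents shaped like the ones in the question. Below is a minimal sketch, assuming each document behaves like a dict with the `database_project_id`, `language`, and `content` fields shown in the question. The `docs` list is made-up sample data standing in for a cursor returned by `find()`, and `group_by_project_and_language` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict


def group_by_project_and_language(docs):
    """Collect each document's content under grouped[project_id][language]."""
    grouped = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        # Missing project ids and languages are created on first access,
        # so nothing that was stored earlier gets overwritten.
        grouped[doc["database_project_id"]][doc["language"]].append(doc["content"])
    return grouped


# Made-up sample documents (assumption): stand-ins for MongoDB records.
docs = [
    {"database_project_id": 12, "language": "it", "content": "Buongiorno"},
    {"database_project_id": 12, "language": "it", "content": "tutti"},
    {"database_project_id": 12, "language": "en", "content": "Hello"},
    {"database_project_id": 13, "language": "en", "content": "and now"},
]

grouped = group_by_project_and_language(docs)
```

Since the helper only iterates and indexes by field name, it could presumably be handed the result of a single query over the collection instead of one query per project id.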
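Addendum: turning it into a plain dict and joining multi-part content
To convert d above into a plain dict, while joining any multi-part content into a single string (using sep as the separator):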
sep = ' '
d2 = {
    proj_id: {
        lang: sep.join(parts) for lang, parts in proj.items()
    } for proj_id, proj in d.items()
}
>>> d2
{12: {'it': 'Buongiorno tutti',
'fr': 'Salut tout le monde',
'en': 'Hello world'},
13: {'en': 'and now for something completely different'}}
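For comparison, the second attempt in the question appears to lose data because `project_details[doc['database_project_id']] = {}` runs for every document, which throws away whatever was already stored for that project. Below is a minimal sketch of the same loop with plain dicts, where the inner dict is created only once via `dict.setdefault`. The `docs` list is again made-up sample data, and `merge_contents` is a hypothetical helper name.

```python
def merge_contents(docs, sep=" "):
    """Build {project_id: {language: joined_text}} without resetting entries."""
    project_details = {}
    for doc in docs:
        # setdefault returns the existing inner dict, or stores a new empty
        # one the first time this project id is seen.
        by_lang = project_details.setdefault(doc["database_project_id"], {})
        lang = doc["language"]
        if lang in by_lang:
            by_lang[lang] += sep + doc["content"]
        else:
            by_lang[lang] = doc["content"]
    return project_details


# Made-up sample documents (assumption): stand-ins for MongoDB records.
docs = [
    {"database_project_id": 12, "language": "fr", "content": "Salut"},
    {"database_project_id": 12, "language": "fr", "content": "tout le monde"},
    {"database_project_id": 12, "language": "en", "content": "Hello"},
    {"database_project_id": 13, "language": "en", "content": "and now"},
]

merged = merge_contents(docs)
```

Repeated string concatenation is fine for a handful of records. For many records per language, collecting into lists and joining once at the end, as in the addendum, is likely the cheaper choice.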