Converting a raw JSON column of a pandas DataFrame into more columns



In my pandas DataFrame, I have a column whose entries follow a simple pattern:

{'author_position': 'first',
'author': {'id': 'https://openalex.org/A3003121718',
'display_name': 'Chaolin Huang',
'orcid': None},
'institutions': [{'id': None,
'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
'ror': None,
'country_code': None,
'type': None}],
'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'},

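This structure is repeated for every author of a given paper. For example, the first paper in my database has several authors, and its authorships column looks like this: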

df['authorships'][0]
### Output:
[{'author_position': 'first',
'author': {'id': 'https://openalex.org/A3003121718',
'display_name': 'Chaolin Huang',
'orcid': None},
'institutions': [{'id': None,
'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
'ror': None,
'country_code': None,
'type': None}],
'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A3006261277',
'display_name': 'Yeming Wang',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I2801051648',
'display_name': 'China-Japan Friendship Hospital',
'ror': 'https://ror.org/037cjxp13',
'country_code': 'CN',
'type': 'healthcare'}],
'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, China-Japan Friendship Hospital, Beijing, China.'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A2620960243',
'display_name': 'Xingwang Li',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I4210150338',
'display_name': 'Beijing Ditan Hospital',
'ror': 'https://ror.org/05kkkes98',
'country_code': 'CN',
'type': 'healthcare'}],
'raw_affiliation_string': 'Clinical and Research Center of Infectious Diseases Beijing Ditan Hospital Capital Medical University Beijing China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A2103212470',
'display_name': 'Lili Ren',
'orcid': 'https://orcid.org/0000-0002-6645-8183'},
'institutions': [{'id': None,
'display_name': 'NHC Key Laboratory of Systems Biology of Pathogens and Christophe Mérieux Laboratory, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.',
'ror': None,
'country_code': None,
'type': None}],
'raw_affiliation_string': 'NHC Key Laboratory of Systems Biology of Pathogens and Christophe Mérieux Laboratory, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A2582133136',
'display_name': 'Jianping Zhao',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I79431787',
'display_name': 'Tongji Medical College',
'ror': None,
'country_code': 'CN',
'type': None}],
'raw_affiliation_string': 'Tongji Hospital, Tongji medical college, Huazhong university of Science and Technology, Wuhan, China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A2550526349',
'display_name': 'Yi Hu',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I47720641',
'display_name': 'Huazhong University of Science and Technology',
'ror': 'https://ror.org/00p991c53',
'country_code': 'CN',
'type': 'education'}],
'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, The Central Hospital of Wuhan, Tongji Medical College, Huazhong University of Science and Technology , Wuhan, China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A3197971936',
'display_name': 'Li Zhang',
'orcid': 'https://orcid.org/0000-0002-7615-4976'},
'institutions': [{'id': None,
'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
'ror': None,
'country_code': None,
'type': None}],
'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A2911488157',
'display_name': 'Guohui Fan',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I2801051648',
'display_name': 'China-Japan Friendship Hospital',
'ror': 'https://ror.org/037cjxp13',
'country_code': 'CN',
'type': 'healthcare'}],
'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, China-Japan Friendship Hospital, Beijing, China.'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A3001214061',
'display_name': 'Jiuyang Xu',
'orcid': 'https://orcid.org/0000-0002-1906-5918'},
'institutions': [{'id': 'https://openalex.org/I99065089',
'display_name': 'Tsinghua University',
'ror': 'https://ror.org/03cve4549',
'country_code': 'CN',
'type': 'education'}],
'raw_affiliation_string': 'Tsinghua University,School of Medicine,Beijing,China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A3006530843',
'display_name': 'Xiaoying Gu',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I2801051648',
'display_name': 'China-Japan Friendship Hospital',
'ror': 'https://ror.org/037cjxp13',
'country_code': 'CN',
'type': 'healthcare'}],
'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, China-Japan Friendship Hospital, Beijing, China.'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A3205428521',
'display_name': 'Zhenshun Cheng',
'orcid': None},
'institutions': [{'id': 'https://openalex.org/I4210120234',
'display_name': 'Zhongnan Hospital of Wuhan University',
'ror': 'https://ror.org/01v5mqw79',
'country_code': 'CN',
'type': 'healthcare'}],
'raw_affiliation_string': 'Department of Respiratory Medicine, Zhongnan Hospital of Wuhan University, Wuhan, China'},
{'author_position': 'middle',
'author': {'id': 'https://openalex.org/A2498193827',
'display_name': 'Ting Yu',
'orcid': None},
'institutions': [{'id': None,
'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
'ror': None,
'country_code': None,
'type': None}],
'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'}]

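Now, my goal is to extract only part of the information above, namely the unique author and institution names, and to create two columns that hold the list of author names and the list of institution names. Concretely, for the data above the result should be two new columns, "authors" and "institutions", which look like this for the first paper: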

df['authors'][0]
['Chaolin Huang','Yeming Wang','Xingwang Li','Lili Ren','Jianping Zhao','Yi Hu','Li Zhang','Guohui Fan','Jiuyang Xu','Xiaoying Gu','Zhenshun Cheng','Ting Yu']
df['institutions'][0]
['Jin Yin-tan Hospital, Wuhan, China','China-Japan Friendship Hospital','Beijing Ditan Hospital','Tsinghua University','Zhongnan Hospital of Wuhan University','Jin Yin-tan Hospital, Wuhan, China']

Note that duplicates (for example, "China-Japan Friendship Hospital") should not appear more than once in the list.

Thanks.

You can check whether the code below works for you:

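import pandas as pd

# j is assumed to be the list of authorship dicts for one paper,
# e.g. j = df['authorships'][0], taken before df is reassigned below.
df = pd.DataFrame()
df['authors'] = pd.json_normalize(j)['author.display_name']
df['institutions'] = pd.json_normalize(j, record_path=['institutions'])['display_name']
df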
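
Note that the snippet above works on a single paper's list and produces one row per author, so it does not by itself give the deduplicated list per paper that the question asks for. A minimal sketch of one way to get there is shown below. It assumes each cell of df['authorships'] is already a Python list of dicts (not a JSON string). The helper names unique_in_order, author_names and institution_names are hypothetical, and the three-author sample is a trimmed copy of the question's data.

```python
import pandas as pd


def unique_in_order(names):
    """Drop None values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(n for n in names if n is not None))


def author_names(authorships):
    """Collect each author's display_name from one paper's authorship list."""
    return unique_in_order(
        (a.get("author") or {}).get("display_name") for a in authorships
    )


def institution_names(authorships):
    """Collect institution display_names across all authors of one paper."""
    return unique_in_order(
        inst.get("display_name")
        for a in authorships
        for inst in (a.get("institutions") or [])
    )


# Trimmed sample in the same shape as the question's data (assumption:
# only the keys used by the helpers are kept).
sample = [
    {"author": {"display_name": "Chaolin Huang"},
     "institutions": [{"display_name": "Jin Yin-tan Hospital, Wuhan, China"}]},
    {"author": {"display_name": "Yeming Wang"},
     "institutions": [{"display_name": "China-Japan Friendship Hospital"}]},
    {"author": {"display_name": "Guohui Fan"},
     "institutions": [{"display_name": "China-Japan Friendship Hospital"}]},
]

# One row per paper; the cell holds that paper's list of authorship dicts.
df = pd.DataFrame({"authorships": [sample]})

# Build the two list-valued columns row by row.
df["authors"] = df["authorships"].apply(author_names)
df["institutions"] = df["authorships"].apply(institution_names)

print(df["authors"][0])
print(df["institutions"][0])
```

dict.fromkeys is used here instead of set() so that the names stay in the order in which they first appear, which matches the ordering in the expected output of the question.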
