Remove unwanted sentences from a JSON file



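I am trying to remove the lines that contain [CLS] and [SEP] from the JSON file below. Is there a way to do this in Python? How can I remove these lines from the given text?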

"Tirukkollampudur Vilvavaneswarar -. Temple  - Shivastalam.txt": {
"context": "    may be reproduced or used in any form without permission.          This Shivastalam is located 5 km south of Kodavasal          and Koradacheri on the Tiruvarur Thanjavur railroad. Koovilamputhur the original name          became Kollampudur. Koovilam stands for Vilvam, hence Vilvavanam. This shrine is regarded          as a Muktistalam. This shrine is regarded as the 113rd in the series of            Tevara Stalams in the Chola Region south of the river Kaveri. Legends  The Vilva trees are said to represent          splashes of the celestial nectar Amritam, and this stalam is considered on par with          Banares. Sundarar is believed to have floated across the river to this temple in a          boatmanless raft in a river in spate singing a Patikam . This event is celebrated in a          festival in the monsoon month of Libra. The Avimukteswarar temple nearby is also          associated with this legend as is the Shivastalam at Kodavasal.          Shiva is said to have blessed Durvasa Muni with a vision of the Cosmic Dance here.            Legend also has it that Arjuna worshipped Shiva at this shrine.  The Temple  There are several inscriptions here, and the Cholas have          made immense contributions here.to this temple which was built during the time of          Kulottunga Chola I. This temple occupies an area of over 2 acres, and its second prakaram          has a 5 tiered rajagopuram.  The Vinayakar in this temple is also of great  Festivals  Six worship services are offered each day. Kartikai Deepam,          Arudra Darisanam, Sivaratri, Skanda Sashti are some of the festivals celebrated here. ",
"answers": [
[
"5 km south of kodavasal and koradacheri on the tiruvarur thanjavur railroad"
],
[
"5 km south of kodavasal and koradacheri on the tiruvarur thanjavur railroad"
],
[
" "
],
[
" "
],
[
"during the time of kulottunga chola i"
],
[
"[CLS] what are the darshan hours ? [SEP] may be reproduced or used in any form without permission . this shivastalam is located 5 km south of kodavasal and koradacheri on the tiruvarur thanjavur railroad . koovilamputhur the original name became kollampudur . koovilam stands for vilvam , hence vilvavanam . this shrine is regarded as a muktistalam . this shrine is regarded as the 113rd in the series of tevara stalams in the chola region south of the river kaveri . legends the vilva trees are said to represent splashes of the celestial nectar amritam , and this stalam is considered on par with banares . sundarar is believed to have floated across the river to this temple in a boatmanless raft in a river in spate singing a patikam . this event is celebrated in a festival in the monsoon month of libra . the avimukteswarar temple nearby is also associated with this legend as is the shivastalam at kodavasal . shiva is said to have blessed durvasa muni with a vision of the cosmic dance here . legend also has it that arjuna worshipped shiva at this shrine . the temple there are several inscriptions here , and the cholas have made immense contributions here . to this temple which was built during the time of kulottunga chola i . this temple occupies an area of over 2 acres , and its second prakaram has a 5 tiered rajagopuram . the vinayakar in this temple is also of great festivals six worship services are offered each day . kartikai deepam , arudra darisanam"
],
[
"[CLS] what is the average darshan duration ? [SEP]"
],
[
" "
],
[
" "
],
[
" "
],
[
" "
],
[
" "
],
[
" "
]
]

},

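You can try the following approach. Since "answers" is a list of sublists, we can walk it recursively and remove every string that contains the marker: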

import json

def remove_from_sublists(the_list, to_be_removed):
    # Iterate over a copy so that removing items does not skip elements.
    for each_item in list(the_list):
        if isinstance(each_item, list):
            # Recurse into nested sublists.
            remove_from_sublists(each_item, to_be_removed)
        elif to_be_removed in each_item:
            the_list.remove(each_item)
    return the_list

dic = {}
with open('WebTempleCorpus.json') as json_file:
    data = json.load(json_file)
    for (i, v) in data.items():
        sub_dict = v
        if v.get("answers"):
            sub_dict["answers"] = remove_from_sublists(v["answers"], "CLS")
            sub_dict["answers"] = remove_from_sublists(v["answers"], "SEP")
        dic[i] = sub_dict

# Write the cleaned data to a new file.
with open('result.json', 'w') as fp:
    json.dump(dic, fp)
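
Note that the recursive function above removes the matching strings but leaves the now-empty sublists ([]) in "answers". If you would rather drop those as well, here is a minimal sketch of a non-mutating variant. The names MARKERS, strip_marked and clean_corpus are hypothetical, and it assumes every entry's "answers" is a list of lists of strings, as in the sample from the question.

```python
import json

# Marker tokens taken from the question; adjust if your data uses others.
MARKERS = ("[CLS]", "[SEP]")

def strip_marked(answers, markers=MARKERS):
    """Return a new list of sublists without strings containing any marker.

    Sublists that end up empty are dropped (an assumption about the
    desired output, not something the question states).
    """
    cleaned = []
    for sublist in answers:
        kept = [s for s in sublist if not any(m in s for m in markers)]
        if kept:
            cleaned.append(kept)
    return cleaned

def clean_corpus(data):
    """Return a copy of the corpus dict with every "answers" list filtered."""
    result = {}
    for name, entry in data.items():
        new_entry = dict(entry)  # shallow copy; the input is left untouched
        if "answers" in new_entry:
            new_entry["answers"] = strip_marked(new_entry["answers"])
        result[name] = new_entry
    return result

# Small in-memory example shaped like the data in the question.
sample = {
    "doc.txt": {
        "context": "some text",
        "answers": [["5 km south"], [" "], ["[CLS] question ? [SEP]"]],
    }
}
print(json.dumps(clean_corpus(sample), indent=2))
```

To apply it to the real file, load the data with json.load, pass it to clean_corpus, and write the result with json.dump, exactly as in the answer above.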
