如何将文本数据读取到部分采用键值格式的pandas数据帧中



文件test:的文件内容

costCenter: LL63238012
mail: shiva.gowni@LLp.com
LLpResponsible: cn=LLf58420,ou=Personal,ou=People,ou=LLDI,o=LLP
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=economy,NisMap=llc1002:/proj/llc1002_ziz1/q,Quota=10621
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=scratKG,NisMap=llc1002:/proj/llc1002_ziz1_scratKG/q,Quota=12000,Id=scratKG
fullName: Tulip project ziz1
costCenter: MX61FRK604
mail: ali.pina@LLp.com
LLpResponsible: cn=LLa11826,ou=Personal,ou=People,ou=LLDI,o=LLP
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_AMEC_MX-GDL01,ou=Locations,ou=LLDI,o=LLP#0#Quality=reference,NisMap=llc0156:/proj/llc0156_zmx28home_3/q,Quota=100,Id=3
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_AMEC_MX-GDL01,ou=Locations,ou=LLDI,o=LLP#0#Quality=reference,NisMap=llc0156:/proj/llc0156_zmx28home/q,Quota=300
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_AMEC_MX-GDL01,ou=Locations,ou=LLDI,o=LLP#0#Quality=reference,NisMap=llc0156:/proj/llc0156_zmx28home_2/q,Quota=100,Id=2
fullName: xFSL to LLDI migration

costCenter: RU61FPD561
mail: udi.landen@LLp.com
LLpResponsible: cn=LLa09278,ou=Personal,ou=People,ou=LLDI,o=LLP
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quota=1,Quality=EconomyHP,NisMap=llc2002:/proj/llc2002_zru12/q
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quota=2800,Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_analog/q,Id=analog
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quota=1100,Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_home/q,Id=home
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_libddk/q,Quota=2162,Id=libddk
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_AMEC_LL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,NisMap=llc1002:/proj/llc1002_zru12_proj/q,Quota=1102,Id=proj
fullName: zru12
costCenter: KG63010285
mail: adam.smith@LLp.com
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_EMEA_NL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=BLLinessCriticalHP,Quota=60,NisMap=llc4008:/proj/llc4008_zuriKG/q
fullName: Container to store ZuriKG vault

costCenter: KG63010285
mail: adam.smith@LLp.com
LLpHomeDirectory: nisMapName=auto.home,ou=RDC_EMEA_NL-CDC01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,Quota=1,NisMap=llc3008:/proj/llc3008_zuriKG_rme/q
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_EMEA_NL-RME01,ou=Locations,ou=LLDI,o=LLP#0#Quality=EconomyHP,Quota=30,NisMap=llc4014:/proj/llc4014_zuriKG_rme/q
LLpHomeDirectory: nisMapName=auto.home,ou=SAT_EMEA_NL-RME01,ou=Locations,ou=LLDI,o=LLP#0#Quality=ScratKGHP,Quota=400,NisMap=llc4014:/proj/llc4014_zuriKG_rme_scratKG/q,Id=scratKG
fullName: Project to restore  project data on the RME work on HP-UX

我正在使用的代码:

def generate(data):
for record in data.split("nn"):              # Split records based on two newlines (unix)
result = {}
for line in record.split("n"):            # Split properties based on single newlines (unix)
if line:                               # Skip empty lines happening for extra or trailing newlines
key, *value = line.split(": ")     # Tolerant to lines with more than a single ´: ´ (*values)
value = ": ".join(value)           # Recover original value if more than a single (`: `)
if key in result:
result[key] += ";" + value
else:
result[key] = value
if result:                                 # Don't yield empty results
yield result

frame = pd.DataFrame(generate("test"))
print(frame)
#frame["responsible"] = frame["LLpResponsible"].str.extract("cn=([w]*)")
frame["location"] = frame["LLpHomeDirectory"].str.extract("ou=([w_-]*)")
frame["directory"] = frame["LLpHomeDirectory"].str.findall("NisMap=w+:([w_-/]*)")
df1 = frame[['costCenter',  'mail', 'responsible', 'location', 'directory']]
#df2 = df.explode("directory")[["costCenter", "responsible", "directory", "mail", "location"]]
print(df1)

错误:

当我运行这个程序时,我会低于。。。。

$ python ldapDataParse1
test
0
Traceback (most recent call last):
File "/home/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'LLpHomeDirectory'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "ldapDataParse1", line 30, in <module>
frame["location"] = frame["LLpHomeDirectory"].str.extract("ou=([w_-]*)")
File "/home/lib64/python3.6/site-packages/pandas/core/frame.py", line 2906, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
raise KeyError(key) from err
KeyError: 'LLpHomeDirectory'

您只需要将这两个函数复制粘贴在一起,原始的autor甚至为您提供了代码的注释。您需要上下文管理器(带有open(…(…(来打开文件,然后在for record in handler.read().split("nn")之后,您只需使用答案的第二个版本来处理额外的换行符。

def generate(file):
with open(file) as handler:
for record in handler.read().split("nn"):
result = {}
for line in record.split("n"):
if line:
key, *value = line.split(": ")
value = ": ".join(value)
if key in result:
result[key] += ";" + value
else:
result[key] = value
if result:
yield result
frame = pd.DataFrame(generate("test.txt"))
frame["responsible"] = frame["LLpResponsible"].str.extract("cn=([w]*)")
frame["location"] = frame["LLpHomeDirectory"].str.extract("ou=([w_-]*)")
frame["directory"] = frame["LLpHomeDirectory"].str.findall("NisMap=w+:([w_-/]*)")
print(frame[["costCenter", "responsible", "directory"]])

输出:

costCenter responsible                                          directory
0  MX61FRK604    LLa11826  [/proj/llc0156_zmx28home_3/q, /proj/llc0156_zm...
1  RU61FPD561    LLa09278  [/proj/llc2002_zru12/q, /proj/llc1002_zru12_an...
2  KG63010285         NaN                           [/proj/llc4008_zuriKG/q]
3  KG63010285         NaN  [/proj/llc3008_zuriKG_rme/q, /proj/llc4014_zur...

最新更新