How to group the strings in one column by the strings in another column that contains NaN values



Background
I scraped a table from a web page (see Mordred Molecular Descriptors) for a machine learning project.
The code I used to get the table is shown below:

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup
# Fetch the HTML content of the webpage
url = "https://mordred-descriptor.github.io/documentation/master/descriptors.html"
html = requests.get(url).content
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Find the first table element in the HTML
table = soup.find('table')
# Convert the table into a Pandas dataframe (wrapping the markup in StringIO
# avoids the warning newer pandas versions emit for literal HTML strings)
df = pd.read_html(StringIO(str(table)))[0]
# Drop the columns that are not needed and keep the result
df = df.drop(['#', 'constructor', 'dim', 'description'], axis=1)
# Print the resulting dataframe
print(df)

Running the code above in Python 3 gives me this dataframe.

Now I want to group the names in the "name" column by their corresponding module in the "module" column.

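The problem is that the extracted table is already pivoted: each module is written only once, so the rest of the "module" column is filled with NaN values. Ideally, I would like to build a dictionary whose keys are the modules and whose values are lists of the names grouped under each module.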

Example:
dict_df = {'ABCIndex': ['ABC','ABCGG'], 'AcidBase': ['nAcid', 'nBase'], ..., 'ZagrebIndex': ['Zagreb1', 'Zagreb2', 'mZagreb1', 'mZagreb2']}

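I tried using .groupby() in Pandas to group the names by module, but the rows with NaN are dropped, so each dictionary value ends up as a list with a single name: the name from the one row where the module is not NaN.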
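
The behaviour described here can be reproduced on a small scale. The following is a minimal sketch on a made-up frame: the name `toy` and its four rows are assumptions for illustration, borrowed from the example dictionary in the question rather than scraped from the page.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the scraped table: the module label appears
# only on the first row of each block, and the remaining rows are NaN.
toy = pd.DataFrame({
    "module": ["ABCIndex", np.nan, "AcidBase", np.nan],
    "name": ["ABC", "ABCGG", "nAcid", "nBase"],
})

# groupby() drops NaN keys by default (dropna=True), so the rows
# without a module label never reach any group.
naive = toy.groupby("module")["name"].agg(list).to_dict()
print(naive)  # {'ABCIndex': ['ABC'], 'AcidBase': ['nAcid']}
```

Each module keeps only the name that sits on its own row, which matches the symptom in the question.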

Thank you for your time and help.

IIUC, something like this? Forward-fill the module column with ffill, then use groupby and agg(list):

df.groupby(df['module'].ffill())['name'].agg(list)

Output:

module
ABCIndex                                                           [ABC, ABCGG]
AcidBase                                                         [nAcid, nBase]
AdjacencyMatrix               [SpAbs_A, SpMax_A, SpDiam_A, SpAD_A, SpMAD_A, ...
Aromatic                                                 [nAromAtom, nAromBond]
AtomCount                     [nAtom, nHeavyAtom, nSpiro, nBridgehead, nHete...
Autocorrelation               [ATS0dv, ATS1dv, ATS2dv, ATS3dv, ATS4dv, ATS5d...
BCUT                          [BCUTc-1h, BCUTc-1l, BCUTdv-1h, BCUTdv-1l, BCU...
BalabanJ                                                             [BalabanJ]
BaryszMatrix                  [SpAbs_DzZ, SpMax_DzZ, SpDiam_DzZ, SpAD_DzZ, S...
BertzCT                                                               [BertzCT]
BondCount                     [nBonds, nBondsO, nBondsS, nBondsD, nBondsT, n...
CPSA                          [PNSA1, PNSA2, PNSA3, PNSA4, PNSA5, PPSA1, PPS...
CarbonTypes                   [C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2S...
Chi                           [Xch-3d, Xch-4d, Xch-5d, Xch-6d, Xch-7d, Xch-3...
Constitutional                [SZ, Sm, Sv, Sse, Spe, Sare, Sp, Si, MZ, Mm, M...
DetourMatrix                  [SpAbs_Dt, SpMax_Dt, SpDiam_Dt, SpAD_Dt, SpMAD...
DistanceMatrix                [SpAbs_D, SpMax_D, SpDiam_D, SpAD_D, SpMAD_D, ...
EState                        [NsLi, NssBe, NssssBe, NssBH, NsssB, NssssB, N...
EccentricConnectivityIndex                                            [ECIndex]
ExtendedTopochemicalAtom      [ETA_alpha, AETA_alpha, ETA_shape_p, ETA_shape...
FragmentComplexity                                                    [fragCpx]
Framework                                                                 [fMF]
GeometricalIndex              [GeomDiameter, GeomRadius, GeomShapeIndex, Geo...
GravitationalIndex                                 [GRAV, GRAVH, GRAVp, GRAVHp]
HydrogenBond                                                   [nHBAcc, nHBDon]
InformationContent            [IC0, IC1, IC2, IC3, IC4, IC5, TIC0, TIC1, TIC...
KappaShapeIndex                                           [Kier1, Kier2, Kier3]
Lipinski                                                [Lipinski, GhoseFilter]
LogS                                                             [FilterItLogS]
McGowanVolume                                                        [VMcGowan]
MoRSE                         [Mor01, Mor02, Mor03, Mor04, Mor05, Mor06, Mor...
MoeType                       [LabuteASA, PEOE_VSA1, PEOE_VSA2, PEOE_VSA3, P...
MolecularDistanceEdge         [MDEC-11, MDEC-12, MDEC-13, MDEC-14, MDEC-22, ...
MolecularId                   [MID, AMID, MID_h, AMID_h, MID_C, AMID_C, MID_...
MomentOfInertia                                        [MOMI-X, MOMI-Y, MOMI-Z]
PBF                                                                       [PBF]
PathCount                     [MPC2, MPC3, MPC4, MPC5, MPC6, MPC7, MPC8, MPC...
Polarizability                                                     [apol, bpol]
RingCount                     [nRing, n3Ring, n4Ring, n5Ring, n6Ring, n7Ring...
RotatableBond                                                  [nRot, RotRatio]
SLogP                                                              [SLogP, SMR]
TopoPSA                                                  [TopoPSA(NO), TopoPSA]
TopologicalCharge             [GGI1, GGI2, GGI3, GGI4, GGI5, GGI6, GGI7, GGI...
TopologicalIndex              [Diameter, Radius, TopoShapeIndex, PetitjeanIn...
VdwVolumeABC                                                             [Vabc]
VertexAdjacencyInformation                                            [VAdjMat]
WalkCount                     [MWC01, MWC02, MWC03, MWC04, MWC05, MWC06, MWC...
Weight                                                                [MW, AMW]
WienerIndex                                                       [WPath, WPol]
ZagrebIndex                              [Zagreb1, Zagreb2, mZagreb1, mZagreb2]
Name: name, dtype: object
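
Since the question asks for a dictionary rather than a Series, the result can probably be converted with .to_dict(). Here is a minimal sketch on a made-up frame (again, `toy` and `filled_module` are hypothetical names, and the rows are taken from the example in the question). It assumes the rows are still in their original order and that the first row carries a module label, because ffill has nothing to copy from otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the scraped table
toy = pd.DataFrame({
    "module": ["ABCIndex", np.nan, "AcidBase", np.nan,
               "ZagrebIndex", np.nan, np.nan, np.nan],
    "name": ["ABC", "ABCGG", "nAcid", "nBase",
             "Zagreb1", "Zagreb2", "mZagreb1", "mZagreb2"],
})

# Forward-fill so every row carries the module label of its block
filled_module = toy["module"].ffill()

# Group the names by the filled labels and convert the Series to a dict
dict_df = toy.groupby(filled_module)["name"].agg(list).to_dict()
print(dict_df)
```

Passing the filled Series directly to groupby leaves the original "module" column untouched; assigning it back to the column first would work as well.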
