I'm trying to extract some information from an FDA web page. I'm using this code:
import pandas as pd
#Get CEDI html tables
CEDI_inv_url = "https://www.accessdata.fda.gov/scripts/sda/sdNavigation.cfm?sd=edisrev&displayAll=true"
CEDI_HTML_tables = pd.read_html(CEDI_inv_url)
# STEP 2: Extract information from HTML tables (i.e., scraping of information)
CEDI_table_data = CEDI_HTML_tables[0]
CEDI_df = pd.DataFrame(CEDI_table_data, columns=['MAINTERM','CAS NO','CUM DC (ppb)','CEDI','REGNUM'])
CEDI_df['CAS NO'].to_string()
CEDI_df['CAS NO'] = CEDI_df['CAS NO'].str.extract(r'([0-9]+[\u2011|-][0-9]{2}[\u2011|-][0-9](?![0-9]))')
CEDI_df.head()
I get a "Can only use .str accessor with string values!" error. I have tried many ways of converting the DataFrame to strings. Any ideas?
This error occurs because the column you are accessing does not contain strings. Casting it with .astype(str) should fix it:
import pandas as pd
CEDI_inv_url = "https://www.accessdata.fda.gov/scripts/sda/sdNavigation.cfm?sd=edisrev&displayAll=true"
CEDI_HTML_tables = pd.read_html(CEDI_inv_url)
CEDI_df = pd.DataFrame(CEDI_HTML_tables[0], columns = ['MAINTERM','CAS/ID NO','CUM DC (ppb)','CEDI (mg/kg bw/d)','REGNUM'])
CEDI_df['CAS/ID NO'] = CEDI_df['CAS/ID NO'].astype(str).str.extract(r'([0-9]+[\u2011|-][0-9]{2}[\u2011|-][0-9](?![0-9]))')
print(CEDI_df.head())
Output:
                                            MAINTERM  CAS/ID NO  CUM DC (ppb)  CEDI             REGNUM
0  (1,1,4,4- TETRAMETHYLTETRAMETHYLENE)BIS(TERT-B...        NaN           0.2   NaN  177.2600 177.1520
1  (2,4,4-TRIMETHYLPENT-2-YL)-N-PHENYL-1-NAPHTHYL...        NaN          50.0   NaN                NaN
2  (2- (METHACRYLOYLOXY)ETHYL)TRIMETHYLAMMONIUM M...        NaN           0.4   NaN   178.3520 176.170
3              (2-ALKENYL(C15-21))SUCCINIC ANHYDRIDE        NaN           5.0   NaN            176.170
4  (N-OCTYL)TIN S,S'S" TRIS(ISOOCTYLMERCAPTOACETATE)        NaN           7.7   NaN           178.2650
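If the extracted column still comes back as NaN, it may be worth checking the header names that pd.read_html actually returned. The sketch below is only an illustration and does not touch the FDA page: the small table, its substance names and its CAS-style values are made up, and CAS_PATTERN is a lightly edited copy of the regex above (the "|" is dropped because inside a character class it matches a literal pipe, not alternation). It shows that asking pd.DataFrame for a column name that is not in the table produces an all-NaN column on which .str raises the error from the question, while casting an existing column with .astype(str) lets the extraction run.

```python
import pandas as pd

# Same idea as the pattern above; "\u2011" is the non-breaking hyphen.
CAS_PATTERN = r"([0-9]+[\u2011-][0-9]{2}[\u2011-][0-9](?![0-9]))"

# Hypothetical stand-in for CEDI_HTML_tables[0]; names and values are made up.
table = pd.DataFrame({
    "MAINTERM": ["SUBSTANCE A", "SUBSTANCE B", "SUBSTANCE C"],
    "CAS/ID NO": ["50-00-0", "64\u201117\u20115", "NOT AVAILABLE"],
})

# Requesting a name that is not in the table ("CAS NO") yields an all-NaN column.
subset = pd.DataFrame(table, columns=["MAINTERM", "CAS NO"])

try:
    subset["CAS NO"].str.extract(CAS_PATTERN)
    raised = False
except AttributeError as err:
    # The .str accessor rejects the non-string (all-NaN) column.
    raised = True
    print(err)

# Inspect the real header names first, then cast to str and extract.
print(list(table.columns))
cas = table["CAS/ID NO"].astype(str).str.extract(CAS_PATTERN, expand=False)
print(cas.tolist())
```

Passing expand=False makes str.extract return a Series for a single capture group, which is convenient when assigning the result back to one column. Rows that do not match the pattern come back as NaN.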