Problem handling pandas' ValueError: No tables found



I want to read all the tables living in a list of links and build a single DataFrame from them. For example, I have:

list_links = ['url1.com', 'url2.com', 'url3.com',...,'urln.com']

and then:

for url in lis:
    try:
        df = pd.read_html(url,index_col=None, header=0)
        lis.append(df)
        frame = pd.concat(url, ignore_index=True)
    except:
        pass

However, I never get the DataFrame; nothing happens:

In: frame
Out:
In: print(frame)
Out: 

What is the correct way to append all the tables from every link into a single table? Note that some links have no tables at all… that is why I tried pass. I also tried this:

import multiprocessing
def process_url(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=True) 
    return df_url
pool = multiprocessing.Pool(processes=4)
pool.map(process_url, lis)

and I got:

ValueError                                Traceback (most recent call last)
<ipython-input-3-46e04cfd0bfe> in <module>()
      7 
      8 pool = multiprocessing.Pool(processes=4)
----> 9 pool.map(process_url, lis)
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):
ValueError: No tables found

I also tried this:

import multiprocessing
def process_url(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=True) 
    return df_url
pool = multiprocessing.Pool(processes=4)
try:
    dfs_ = pool.map(process_url, lis)
except: 
    pass

and nothing happened.

You aren't actually concatenating your DataFrames. Note also that pd.read_html returns a list of DataFrames (one per table on the page), so you need to collect all of them first and concatenate once at the end. What if you do something like this:

import pandas as pd

df_list = []
for url in list_links:
    try:
        # read_html returns a *list* of DataFrames, one per table on the page
        tables = pd.read_html(url, index_col=None, header=0)
        df_list.extend(tables)
    except ValueError:
        # raised when the page has no tables -- just skip that URL
        pass
df = pd.concat(df_list, ignore_index=True)
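
As for the multiprocessing attempt: the ValueError raised for a table-less page propagates out of the worker and aborts the entire pool.map call, and the outer try/except around pool.map then swallows the whole result. Handling the exception inside the worker avoids that. Below is a minimal sketch under the same assumptions; it expects list_links to be defined as in the question, and the helper name fetch_tables and the pool size of 4 are illustrative, not from the original post:

import multiprocessing

import pandas as pd

def fetch_tables(url):
    # Return one concatenated DataFrame per URL, or None if the page has no tables.
    try:
        tables = pd.read_html(url, index_col=None, header=0)
        return pd.concat(tables, ignore_index=True)
    except ValueError:
        # "No tables found" -- report an empty result instead of raising
        return None

if __name__ == '__main__':
    # list_links is the list of URLs from the question
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch_tables, list_links)
    # drop the URLs that had no tables before the final concat
    frame = pd.concat([r for r in results if r is not None], ignore_index=True)

Because the error is handled per URL, a single page without tables no longer brings down the whole map call. The if __name__ == '__main__' guard matters on platforms that start worker processes by re-importing the module.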
