Extracting bookmarks and folder hierarchy from Google Chrome with BeautifulSoup



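I have a collection of bookmarks in Google Chrome: links, subfolders between the links, and in some subfolders even more subfolders.
Now I want to extract the URLs and some other information as plain text for further processing.
To do this, I exported all my bookmarks from the Google Chrome bookmark manager to an HTML file named bookmarks_8_2_21.html.

I will use a sample section of that file in what follows: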

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3 ADD_DATE="1606927410" LAST_MODIFIED="1620226362" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
<DL><p>
<DT><A HREF="javascript:location.href='org-protocol://capture?template=l&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)+'&body='+encodeURIComponent(window.getSelection())" ADD_DATE="1607739285">org-capture-bookmark</A>
<DT><A HREF="https://www.google.de/" ADD_DATE="1554935207" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACIklEQVQ4jYWSS0iUURTHf/fe8RvHooE2VlT2FNqUGWmNEYUR9lhEEVJhUIsoXOQuap1Rq6KHNQt3LaPAIOxhlNTChUwLMU3NR1CklUzg6xvPd1ro2KhTHjjcA/e8/uf/hzmmqsUiEheRLhHxp/2TiDxQ1aK5+ZmFeSJSrwuYiMRVNZKuMxnFz51zu9T3GX/6iPGmRqS/F5WAUMEawuUVRI5UYjwPEWl2zlUYY8YMgIjUW2vPBkPfSV6uYbKvJ+uW3rZSojfuABAEQdw5d96oajHQqr7P8IUqpL8X43lEjp3EK4mBtfgt75l4+4po7U3cytWZPbcyjUlTidv642ipDu7foX7bh2zgs92jDhHpUlWdbNmuEw15OvqweqE7ZjboCAEFADrSjs1LkRM7NAt3+bWRebfYudFx9XguwFqbwePs9z/mT/6NLdAHMBpex28W0/C1Y1Zy05VFM75nUwiAZVGT/v5sgdcA3UurOPUrxvXOFhJD7fOmdn4LeNc5NbpkfWimv5mWZ8KXFKdfXqInOYBnc6gsPEjZ8mKssbQOtvEkMczYl0oK8z3un4lgppbYkhZS3Fp7bnD0Jxeba+lODmTFviFcxq29NeRHDUEQ1DnnqtNSjohIo3Nutx+keNz9gmf9zfQkB0ChYMkK9q2KcaLwMJFQGFV9Y4w5YIwZzyBBI2lRLcD9PVXN/SdFqlokInUi0iEiE9P+UUTuqurmufl/AKTzsFGmvUNUAAAAAElFTkSuQmCC"></A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_1</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABXklEQVQ4jbWQsUsCYRjGn/fuSu/Sk3ALmlzNtoagKRqSaHMKGkKhEOV0KWispSXPQaglAnNobOgfaCyIcgicmxO9zFPv/N5WwTs5gt7x+5739/2eDwgw/bK67HcnBQG4Ag3L0LJ/BoBFDuDzTiGUCAywDC3bNbRtANCrwxaBziRZanAGcjADwR8AX1uGesEZyFGzXwO43VsKn07GaJa5lY/GMefUAYooEvaELDnCEW9M2I1V7GdPg04hlLAM7dYqqut67ftLNwdpMB5dgRfXdVMgHIFpx9egfbwYk0eDA2LKAWJMkK6cUOhOGdkpZmoQiy29OmwFq1AKb5CgQyakAXqQJKpELn/eJzPK1JKhPhHjk4EmMzUVmU/coVLkeXff672pk155YXUsxikCJQFeYVCSgCiAV920N311b+r37FslH413S+qaV86rggfIBbG38RRAN+2ZHzsTMKvGv80vvziHGAusG84AAAAASUVORK5CYII=">Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</A>
<DT><A HREF="https://stackexchange.com/" ADD_DATE="1605695914" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABIUlEQVQ4ja2SvUoDQRSFvztZDSIKWwbxB2zs7eyCWGljYWUlCKKm0kfwEQRR8A0kaGEfW+2VgD+JGxMlELOsWxhJdizMrmtM1hVyYGDm3Pm4c5grtJXLaaM0UV8HTKJVH7fM43RamgCG75Yn7bOkUot/wADkU7V5YAVA+eaAyEIc+LXRpPbRitUolsTf7F64Ouqi9dbi0fGC89WqKRCK8B+4rwoirB3d/YpQqDa4r753BUv7s9ERouC+6jvC3vmTSqixQsXmoWxHQhrcYUOnbk623SCCiCTjwO2uQw7JQYCEb47OLOWLFXsaeAGev5Z2QEYIjTxwKyI7Vnbj8keEXppaPgj/zmbxdOswXI81SICjIdMJ0/G0nvKUN2dlM9fdap8MMGR5HOUBZgAAAABJRU5ErkJggg==">Hot Questions - Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAm0lEQVQ4je2SsQ3CMBBF30XMglKyBB7DEiNATSpnADYAeQtgEcIyR2EiHONgU9DxSv/736fTFyKMv+1BHEW0u9i2B2imZnZlM4C45zyL+DFNz/HaUhzQN+nAJ3NOfwv4ln/ALwLGgsyR6lGRtBsLYvxQrLOg28kGoSDalZcOn51tewhBFRg/KIAim6tdnmKt+og5czXr4301pz0AqgIzDZOACvcAAAAASUVORK5CYII=">Meta Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Another Subfolder</H3>
<DL><p>
<DT><A HREF="https://en.wikipedia.org/wiki/Main_Page" ADD_DATE="1605696025" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia, the free encyclopedia</A>
</DL><p>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia</A>
<DT><A HREF="https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)" ADD_DATE="1605696102" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Beautiful Soup (HTML parser) - Wikipedia</A>
</DL><p>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_2</H3>
<DL><p>
<DT><A HREF="https://www.reddit.com/" ADD_DATE="1605696212" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACdUlEQVQ4jWXTTYjVdRTG8c/5/e+945hNMhLEgFRUA1lgNYHojJjpDXduQkSqjRiKJNSu2rRJV0VQi14IKohCW8UEIRJZ05uUGESrBBOsJocayRznvvxPizup0dmcxeF5zoHzfAOSCBKybRP2SpO4SUjMSjOKV+Ooz67VhKXKdUZc75CwWzGk/neAQBXUuSi9oevp+NRFiCRMGDHqiIa2rkSNgqu3Xc5aUQwLfR9bsCO+8FcjyLzBIQ1tPR2hIVWD7ZEeforMsPLGyh331g7u6rr4xzbDnseBkltNqezW0xfRIIqqpDrSytHQGg7Tb6XZM6mzWNDQUwuP5xbrI9veU+xU61tUBHpCB+vWs3qcd9+mhYbUxDK1UKm905AmpdQTbh2n2wnj97F5F7fcRd1j7WaOv89Pp0JzKP36c2hKTMmHymJulHlgQ51/X8i8MJdZ9/N/1e9lzp3LnD+fuW+izo0y26VTXPljDvrIKl5+gn330+/R67B3gjefYdUYzSFKdUVWZD1reaRvv0rffJT6fe6epP0YVYNGi8nt3LxmYHjyWDp1Ii2Xsj5bpM+VCEX6+kNZVTy4i3s2Mf878+fZsJ32o7JqcGJ6EK8SIcxUz91mVp2PaEZx+gcREcYn0ty5cPp7fjvD8HVpZFQceSEcfjG1hDo7evYH5BavGLJfR8dlDXeuLR7YkcZuH8T4l9Ph+Af8eLLW0tfS1PVSHPPkIMqTVhh2WNM2/UgLWesITQNWulJLGo6iytA1rdjpqEtXYVpjhTEHhT2qsoygXiKqlAFVvXoBr/nTs/GdS5Y4+y/OW01hD6awesn/rDCj7/X4xJfXav4BhnocQyGrEocAAAAASUVORK5CYII=">reddit: the front page of the internet</A>
<DT><A HREF="https://www.youtube.com/" ADD_DATE="1574152707" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABx0lEQVQ4jZ2TQWtTQRDHfzO7yUsNKSG0heJJ0YKnCvVSkHrV7+BBeu7Vk9+lH8CLN6EXk4Lo1V48lGClFBELGmmSmr73djy8fS8vll4c+LO7s7OzM///LhQmBmrg+uDtZrgIBQQAKyf/YQbiBexFt9t9niSP7ji3umrW7og0UfWL0ZaOzGYjmBxn2c/BdPpJxuNz+vDwFxxnqmaqZs4V8L5AuS6haqmqXcDpO3jCEN4YmEFmkMcxrSCSGqSh8GcGIY52Am91Ce5HJ5EYBRwbG45222HmUHVS+DX2DhASuKu3oAeolGSqCiDs7gqHh8L2thCCxOQSbxFAO9BToClzNQSJokwmsLUFgwHs78P6eklnlQho6c0axUKbTVhZgUYjHpeqCwE8kMWFxYNF9uVlGA5hbw8ODuqJzajqnPENPtfYDxU2N4OtrRVzkVDfC0V8/h2G/g98AR7ESkJF8tFRcYdzkOf15iRW6y/hlD48voCz+BbmELFrvhrG8OMjPBOAV3D7qXM791R73Var00qSJRoNTyifB5Dn+eVsNv19dTX+mmWj93n+4SWciM1LKkmqy7RoImFBqPqPfH39K7t/4A18f74nAH8Bjm35s3ZkOjEAAAAASUVORK5CYII=">YouTube</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">stuff</H3>
<DL><p>
<DT><A HREF="https://www.pgadmin.org/" ADD_DATE="1566393697" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC4ElEQVQ4jU2TS2hdZRSFv7XPf26eNbV52jRp0tYYiVVDfUBBlEpHTuxIOhEzMHUi6EjswIEOBQV1YsFWyMCJCI6sGhBRJ1JDrVpMGnuJSXprTFqpjXmc8//bwb0R93zvvdb+9tL4qfeetBDOxFgMmeMOBiAhAHdHEu44gCC5kIXKQiz8VJA4CwwGKbowASbh7sSUGs2OEA44mHtyT3FY5ueCuw9sb27Gje3SKsEoo1PGqKY8o6UpR4AkortM8vrATMQYHfpDEcs40LPbHhrZq9mltdS5q8UO7u30mSvXdKlawx2KoqS1uclvb24Jl+fBMJOFzFJY3yjsscNDevXkExRlVB6yhlU4/cF5vvtlgVeeeZzDw336dfFPr9Zu6shIP1NfzvDZ97MKkrwsk2JKrG9u89qH0/TtaefFp4+ycnOd9186wehgD9dv/M3YUK+OH7kbgG9+rnoRnYAnudcPNz3zG1PTMwx072a47052tTUxOtjD5YU/mHjzY279s8WZl09wdGw/G1uFJAhIyOqHuv9AH2P7exkd6Karow1PTh0lxOjEmADIzGhQJngDG8DIvi59/fYkmRkA585foFq7wdhQL5++8Szg9Hd14O6Y1fEGAWVMxJT49qcFv7K8qkP9nVycrzG3tMrzb33C5FOP8PA9+7hUvc7iyl88eu8gRRkbChwPmSkz4/bGFqfPfk7XHW2UMSLBsfGDzF9b46OvfqS7o43XnzvukjS7tOZ5CIQQTEtrt/hhbtnnllfp6Wj31uZcABtbBSePPcD4oX7+V5r6YoYLc0tqb8nR+AvvFEWZzJMbUqoEM5CDK7nT1lzhvqFeDty1x4sy6vLvK35xvkYlD5jhenDy3WQiIdlOeMC1s66Mic3tkpgSIK+EjNbmHJyUIAtm2SKZBr0sI40k7igAkYeMpjyA5Dg4TkzJLcszeVo20ITQVRcSpP+MSvXsAcnrPxBTwpMnSUhW9eQT/wJc4GRalsmdmQAAAABJRU5ErkJggg==">pgAdmin - PostgreSQL Tools</A>
</DL><p>
<DT><A HREF="https://www.gnu.org/software/emacs/" ADD_DATE="1605696341" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs - GNU Project</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">other stuff</H3>
<DL><p>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">emacs</H3>
<DL><p>
<DT><A HREF="https://www.gnu.org/software/emacs/download.html" ADD_DATE="1605696357" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs download - GNU Project</A>
<DT><A HREF="https://www.gnu.org/software/emacs/documentation.html" ADD_DATE="1605696393" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs documentation - GNU Project</A>
</DL><p>
</DL><p>
<DT><A HREF="https://orgmode.org/" ADD_DATE="1605696413" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAClElEQVQ4jZ1TzU8TcRSc/f36Ddt2aWkLKbi0FCSUCCSiRUTExCBGYyJKotGLHv03jJ41Xjx48aQXPRA1xujBGJRoJCIYAhQo1NKvbZfdbbct2/WkIQIhOqdJ3ps5vJlHsT9qAOgAqrsN6V6qjqGhcQeloeELp29bOJe0sbI6/y8Gnu5TA89bONvNoxWh3dQaOOkd6DMvTn6JApC2LzLbOHsu5D7R6HZ3NVD9hnJxPFh2e0BKKlS1jHq+CbGZH6vxleV7qiDFqtDLm8mNZQYAejxs5Erk0OP2poYgCjLKiTXMRYah9EUAWYYsiLCxNtjsLKpaFVJOhMPNYeHz1xgJ2s2t1473vuoPtwWh69jMZaGAAZn6iFJsDQylcHpcMFktIJTAaDGBUAoxm0d6PfnCcL63/c7hUIs9KxehCBlUJAnR5hAyfh7leALFLQ2s046ipCATi7/TCSNpqtaQTSZL0+9f3zX467j+LR1Q8wK2ZBESMUI62AlHOAxhaRkr07PfNtP5h7KYi868/fDy72sbNB1qQUijJArQ1BIS/laAUmiyAr1aRUGS30w+m3iwV9wklU7PavkMNFlG1N0IJRAC4/XBQBgYTCawdY4RH8/zexlQO0NFr4u9muI8SAfaYA53wcAAhBAYjEYQSus9LU3XWadzOb4QndthsCjKC2WrNWs6Mzpay/NgCANCKXRdh8lqgdlmRbVSsbJu7pLFap//ubj0fUcTo8nclNXnzXsDB0ac9S4YTEYwhKAgSqioKkpFFZYaG2rr7GOlMp4K6+uZ3ZqIjsGewc4jA5eNFqNDSmXnYrG1T0o8keK8PgfX6DrmavbfknJifuL+o26GYXZ9rv3ARsbOPmmL9Jz6H/Ef8Dzv/M1/AdxXB/z0rsGnAAAAAElFTkSuQmCC">Org mode for Emacs</A>
</DL><p>
</DL><p>

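I want to extract the following information from this file:

  1. URL
  2. Description
  3. Add date
  4. Folder hierarchy / path of the URL

I got the first three requirements working with BeautifulSoup, but I can't seem to get the fourth one to work, so I will try to explain it further.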

Let's assume the following folder hierarchy:

Bookmarks
_Bookmarks bar
  _Folder_1
    _Subfolder
      _Another Subfolder
  _Folder_2
    _stuff
    _other stuff
      _emacs

Ideally, for the URL in "Another Subfolder", I would like the following sample output:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Bookmarks bar/Folder_1/Subfolder/Another Subfolder

But this output would already be very helpful:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Another Subfolder

My code so far is:

from bs4 import BeautifulSoup

def read_in_file(filename):
    f = open(filename, 'r')
    soup = BeautifulSoup(f.read(), 'html.parser')
    f.close()
    return soup

soup = read_in_file('bookmarks_8_2_21.html')

for line in soup.find_all('a'):
    print(line.get('href'))     # 1) URL:         works
    print(line.get_text())      # 2) Description: works
    print(line.get('add_date')) # 3) Add Date:    works
    dir = soup.find('h3') # 4) Folder hierarchy / path: not working
    print(dir.contents)   # only prints ['Bookmarks bar']
    print()

Output for that entry so far:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
['Bookmarks bar']

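I also tried code based on siblings and found out how to print the folder hierarchy, but I can't get it to work together with the rest of the code:

Code snippet:

for dir in soup.find_all('h3', recursive=True):
    print(dir.text)

Output: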

Bookmarks bar
Folder_1
Subfolder
Another Subfolder
Folder_2
stuff
other stuff
emacs

Thanks for your help and suggestions!

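The problem probably has to do with how your bookmark file is imported, or rather with how BS reads it; more specifically, with how it reads the description term (<DT>) elements. These tags are not closed in the exported file, so the parser cannot tell where each one should end and closes it at some arbitrary point.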
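
To make the effect concrete, here is a minimal sketch of what the unclosed tags do to the parse tree. The two-entry snippet and its example URLs are made up for illustration and are not taken from the exported file; the behaviour shown is that of the built-in 'html.parser' backend, and other backends may build a different tree.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the export: two <DT> entries, neither one closed.
SNIPPET = (
    '<DL><DT><A HREF="https://a.example/">A</A>'
    '<DT><A HREF="https://b.example/">B</A></DL>'
)

soup = BeautifulSoup(SNIPPET, "html.parser")
first_dt, second_dt = soup.find_all("dt")

# html.parser does not close tags implicitly, so the second <dt> ends up
# nested inside the first one instead of next to it.
print(second_dt.parent is first_dt)

# As a consequence, the first <dt> appears to contain both links.
print(len(first_dt.find_all("a")))
```

This is why a search that starts from one dt element can run into headings and links that really belong to later entries.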

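So I closed those tags on the same line they are opened on. After that, you should be able to extract the data easily.

from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed.
with open('bookmarks.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

dt = soup.find_all('dt')

folder_name = ''
for i in dt:
    n = i.find_next()  # first element after the <dt>: an <h3> (folder) or an <a> (link)
    if n.name == 'h3':
        # Remember the most recent folder heading for the links that follow it.
        folder_name = n.text
        continue
    else:
        print(f'url = {n.get("href")}')
        print(f'website name = {n.text}')
        print(f'add date = {n.get("add_date")}')
        print(f'folder name = {folder_name}')
        print()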

I hope this small portion of the output helps:

url = https://stackoverflow.com/
website name = Stack Overflow - Where Developers Learn, Share, & Build Careers
add date = 1605695883
folder name = Folder_1

url = https://stackexchange.com/
website name = Hot Questions - Stack Exchange
add date = 1605695914
folder name = Folder_1

url = https://meta.stackexchange.com/
website name = Meta Stack Exchange
add date = 1605695986
folder name = Subfolder

url = https://en.wikipedia.org/wiki/Main_Page
website name = Wikipedia, the free encyclopedia
add date = 1605696025
folder name = Another Subfolder

url = https://www.wikipedia.org/
website name = Wikipedia
add date = 1605696017
folder name = Another Subfolder

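Here I assume that any link following a folder name belongs to that folder, but that does not always hold, for the reason given below. In the output above, for example, https://www.wikipedia.org/ is reported under Another Subfolder even though it sits directly in Folder_1.

If you want more accurate results, you should also consider closing the p tags, because they are left open as well and may get closed anywhere.

The way to do this is to find the dl tags and traverse each of them separately, to work out exactly which dt tag belongs to which folder (dl element).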
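
One possible way to sketch this idea without repairing the markup first: every dl element that encloses a link is one of its folders, and the h3 that comes directly before that dl in document order holds the folder's name. The helper name folder_path and the shortened SAMPLE string are my own for illustration; the sketch also assumes the export always writes a folder's H3 immediately before its DL, as in the file shown in the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the exported file (icons and most attributes dropped).
SAMPLE = """
<DL><p>
<DT><H3>Bookmarks bar</H3>
<DL><p>
<DT><H3>Folder_1</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883">Stack Overflow</A>
<DT><H3>Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986">Meta Stack Exchange</A>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017">Wikipedia</A>
</DL><p>
</DL><p>
</DL><p>
"""


def folder_path(a_tag):
    """Return the folder path of a link, built from its enclosing <dl> elements."""
    names = []
    # find_parents yields the innermost <dl> first, so names are collected leaf-first.
    for dl in a_tag.find_parents("dl"):
        heading = dl.find_previous("h3")  # nearest <h3> before this <dl>
        if heading is not None:           # the outermost <dl> has no folder heading
            names.append(heading.get_text(strip=True))
    return "/".join(reversed(names))


soup = BeautifulSoup(SAMPLE, "html.parser")
paths = {a.get("href"): folder_path(a) for a in soup.find_all("a")}

for a in soup.find_all("a"):
    print(a.get("href"))
    print(a.get_text(strip=True))
    print(a.get("add_date"))
    print(paths[a.get("href")])
    print()
```

Because the DL elements are closed explicitly in the export, their nesting survives parsing even though the DT and p tags are left open, which is why this sketch relies on dl ancestors rather than on dt elements.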

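This is a very specific problem, because not everyone organizes their bookmarks the same way. Also note that the HTML changes with how the folders are organized: for example, whether links or subfolders come first within a folder changes the corresponding HTML.

I was working on this problem too, and I wanted the complete set of folders, i.e. the parent folders as well as the subfolders.

I wrote a simple function that finds the parent directories of a link element passed to it: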

import bs4

def find_parent_dir(l):
    # Walk upwards from an element and collect folder names, outermost first.
    if l is None:
        return None

    if l.h3 and l.name == "dt":
        current_folder = l.h3.getText()
        parents = find_parent_dir(l.find_parent("dl"))
        if parents is None:
            return [current_folder]
        else:
            return parents + [current_folder]

    return find_parent_dir(l.parent)

with open("bookmarks_8_29_21.html") as fh:
    html_obj = bs4.BeautifulSoup(fh.read(), 'html.parser')
    links = html_obj.find_all("a")

# Look up the folders of the first link in the file.
folders_path = find_parent_dir(links[0])
print(folders_path)
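
For comparison with the recursive function above, here is a sketch that builds no tree at all. It is a different technique from either answer: it uses only the standard library's html.parser.HTMLParser and keeps a stack of open folders while streaming through the file. The class name BookmarkParser, its attribute names and the shortened SAMPLE string are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Shortened stand-in for the exported file (icons and most attributes dropped).
SAMPLE = """
<DL><p>
<DT><H3>Bookmarks bar</H3>
<DL><p>
<DT><H3>Folder_1</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883">Stack Overflow</A>
<DT><H3>Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986">Meta Stack Exchange</A>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017">Wikipedia</A>
</DL><p>
</DL><p>
</DL><p>
"""


class BookmarkParser(HTMLParser):
    """Collect (url, title, add_date, folder path) tuples in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one entry per open <dl>; None for the unnamed root list
        self.pending = None  # name from the latest <h3>, waiting for its <dl>
        self.capture = None  # "h3" or "a" while text is being collected
        self.text = []
        self.link = {}
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            # A new list opens: the heading seen just before it names the folder.
            self.stack.append(self.pending)
            self.pending = None
        elif tag in ("h3", "a"):
            self.capture, self.text = tag, []
            if tag == "a":
                self.link = dict(attrs)  # attribute names arrive lowercased

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl" and self.stack:
            self.stack.pop()  # leaving a folder
        elif tag == self.capture:
            text = "".join(self.text).strip()
            if tag == "h3":
                self.pending = text
            else:
                path = "/".join(name for name in self.stack if name)
                self.bookmarks.append(
                    (self.link.get("href"), text, self.link.get("add_date"), path)
                )
            self.capture = None


parser = BookmarkParser()
parser.feed(SAMPLE)
parser.close()

for url, title, added, path in parser.bookmarks:
    print(url, title, added, path, sep="\n", end="\n\n")
```

Since the stack only reacts to opening and closing DL tags, the unclosed DT and p tags do not affect it, and it should not matter whether links or subfolders come first within a folder.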
