小贝子编程

How to get coordinates of characters in an HTML document?

Tags: python python-3.x web-scraping beautifulsoup python-tesseract
Updated: 2023-09-17

<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>

How can I use Python to extract only the values 369 429 301 123 from the code above?

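The simplest way to solve this is most likely to split the title attribute on the semicolon and keep everything before it. You can then split that part again on whitespace and keep only the numeric tokens.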

from bs4 import BeautifulSoup

tag = "<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>"
soup = BeautifulSoup(tag, 'html.parser')
for span in soup.find_all('span'):
    # Keep the part before ';' ("bbox 369 429 301 123"), then drop the non-numeric "bbox" token
    print([x for x in span.attrs['title'].split(';')[0].split() if x.isdigit()])

from bs4 import BeautifulSoup
import re

data = """<span class = 'ocrx_word' id = 'word_1_45' title = 'bbox 369 429 301 123;x_wconf 96'>refrence</span>
"""
soup = BeautifulSoup(data, 'html.parser')
new = soup.find("span", {'class': 'ocrx_word'}).get("title")
# Match four space-separated integers that directly follow "bbox "
print(re.findall(r"(?<=bbox )(?:\d+ ){3}\d+", new))
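
Both answers above yield the coordinates as text (a list of digit strings, or one space-separated string). If integers are needed for further processing, a minimal sketch along the following lines may help. The helper name parse_bbox is hypothetical and not part of either answer, and it assumes the title always contains "bbox" followed by exactly four space-separated integers, as in the sample.

```python
import re

# Hypothetical helper, not from the original answers.
# Assumes the title holds "bbox" followed by four space-separated integers.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

def parse_bbox(title):
    """Return the four bbox numbers as a tuple of ints, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(n) for n in match.groups())

title = "bbox 369 429 301 123;x_wconf 96"
print(parse_bbox(title))  # (369, 429, 301, 123)
```

The string passed in would be whatever span.get("title") returns in the snippets above; returning None for a title without a bbox entry lets the caller skip such spans instead of raising.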
