(Python) Is there a way to extract a substring/number from a whole string?



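I have a list of tuples containing utility data: consumption (in cubic feet), gallons of water, and an approximate charge. There are 13 tuples, one for each month of the year plus one for the year-end total consumption. My goal is to extract these three pieces of information, store them in a DataFrame, and eventually export them to an Excel sheet.

This is what the list of tuples looks like after I converted the entries to strings. (I iterated over them and converted them to strings because they were originally in Soup (BeautifulSoup) format, which was hard to organize into a list.)

A single tuple looks like this: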

['<area alt="" coords="151,115,181,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 49,094.00 CF (367,223.12 Gallons)  &lt;br /&gt; Approximate Charge = $5,073.42\');" shape="rect"/>']'

Below is the entire list of tuples. The only exception is that the last (13th) tuple says "Total Consumption" rather than just "Consumption".

['['<area alt="" coords="113,88,143,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'**Consumption = 54,070.00 CF (404,443.60 Gallons)**  &lt;br /&gt; **Approximate Charge = $5,587.65**\');" shape="rect"/>']', '['<area alt="" coords="151,115,181,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 49,094.00 CF (367,223.12 Gallons)  &lt;br /&gt; Approximate Charge = $5,073.42\');" shape="rect"/>']', '['<area alt="" coords="188,99,218,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 51,921.00 CF (388,369.08 Gallons)  &lt;br /&gt; Approximate Charge = $5,365.57\');" shape="rect"/>']', '['<area alt="" coords="226,125,256,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 47,122.00 CF (352,472.56 Gallons)  &lt;br /&gt; Approximate Charge = $4,869.63\');" shape="rect"/>']', '['<area alt="" coords="263,101,294,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 51,687.00 CF (386,618.76 Gallons)  &lt;br /&gt; Approximate Charge = $5,341.39\');" shape="rect"/>']', '['<area alt="" coords="301,139,331,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 44,643.00 CF (333,929.64 Gallons)  &lt;br /&gt; Approximate Charge = $4,613.45\');" shape="rect"/>']', '['<area alt="" coords="339,176,369,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 37,770.00 CF (282,519.60 Gallons)  &lt;br /&gt; Approximate Charge = $4,010.80\');" shape="rect"/>']', '['<area alt="" coords="376,382,407,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons)  &lt;br /&gt; Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="414,382,444,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons)  &lt;br /&gt; Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" 
coords="452,382,482,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons)  &lt;br /&gt; Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="489,382,519,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons)  &lt;br /&gt; Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="527,382,557,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons)  &lt;br /&gt; Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="653,68,733,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Total Consumption = 336,307 CF (2,515,576 Gallons) &lt;br /&gt; Approximate Charge = $34,861.91\');" shape="rect"/>']']

I wrote this regular expression to extract the gallons:

gallons = re.search('CF((.*)Gallons)', test_line)
print(gallons)

The output is:

<re.Match object; span=(128, 150), match='CF (404,443.60 Gallons'>

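This doesn't really make things easier, because now I have to find a way to extract '404,443.60'.

If someone could recommend a way to extract these three pieces of data from the list of tuples (I assume I will most likely have to iterate over the list in some form) and store them in a DataFrame, that would be very helpful. The end goal is to store these numbers in a DataFrame and eventually export them to an Excel sheet.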

This may be what you want:

gallons = re.search(r'(?<=CF\s\()[\d,.]*(?= Gallons)', test_line)
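A minimal sketch of how the lookaround pattern behaves, assuming `test_line` holds tooltip text like the first entry in the question (the sample string and the `gallons_text` name are illustrative). Because the lookbehind and lookahead do not consume any characters, the whole match is just the number, which can then be converted to a float once the thousands separators are stripped.

```python
import re

# Sample tooltip text based on the first entry in the question.
test_line = "Consumption = 54,070.00 CF (404,443.60 Gallons)  <br /> Approximate Charge = $5,587.65"

# (?<=CF\s\() requires "CF (" just before the match,
# (?= Gallons) requires " Gallons" just after it; neither is part of the match.
match = re.search(r"(?<=CF\s\()[\d,.]*(?= Gallons)", test_line)

# re.search returns None when nothing matches, so guard before calling .group().
gallons_text = match.group() if match else None
gallons = float(gallons_text.replace(",", "")) if gallons_text else None

print(gallons_text, gallons)
```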

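You can get the matched group:

import re
# The escaped parentheses match literally; the unescaped pair is capture group 1.
re_gallons = re.compile(r'CF \((.*?) Gallons\)')
print(re_gallons.search(test_line).group(1))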
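To show what the `re.Match` object printed in the question offers, here is a small sketch, again assuming a sample `test_line` taken from the question's data. `group(0)` is the entire matched text, `group(1)` is only the first capture group, and `span(1)` gives that group's position in the original string.

```python
import re

# Sample text based on the question's first entry.
test_line = "Consumption = 54,070.00 CF (404,443.60 Gallons)"

m = re.search(r"CF \((.*?) Gallons\)", test_line)

whole = m.group(0)       # entire matched text, including "CF (" and " Gallons)"
number = m.group(1)      # only the text captured by the first group
start, end = m.span(1)   # indices of group 1 within test_line

print(whole)
print(number)
print(test_line[start:end])
```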

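You can use a capture group and make the pattern a bit more specific by matching the parentheses, capturing in group 1 the number that follows the opening parenthesis, with an optional decimal part.

\bCF\s\((\d+(?:\.\d+)*(?:,\d+(?:\.\d+)*)*)\sGallons\)
  • \bCF\s A word boundary to prevent a partial match, then match CF and a whitespace character
  • \( Match (
  • ( Capture group 1
    • \d+(?:\.\d+)* Match 1+ digits with an optional decimal part
    • (?:,\d+(?:\.\d+)*)* Optionally repeat matching , and 1+ digits with an optional decimal part
  • ) Close group 1
  • \sGallons\) Match a whitespace character and Gallons)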
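The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows comments inside the pattern. This is only a sketch: the name `GALLONS_RE` and the sample strings (taken from the question's data) are illustrative.

```python
import re

GALLONS_RE = re.compile(
    r"""
    \bCF\s                  # word boundary, then CF and one whitespace character
    \(                      # literal opening parenthesis
    (                       # start of capture group 1
      \d+(?:\.\d+)*         # 1+ digits with an optional decimal part
      (?:,\d+(?:\.\d+)*)*   # optional repeats of a comma and 1+ digits
    )                       # end of capture group 1
    \sGallons\)             # whitespace, Gallons, literal closing parenthesis
    """,
    re.VERBOSE,
)

samples = [
    "Consumption = 49,094.00 CF (367,223.12 Gallons)",
    "Total Consumption = 336,307 CF (2,515,576 Gallons)",
    "Consumption = 0.00 CF (0.00 Gallons)",
]

# Collect group 1 for every sample that matches.
results = [m.group(1) for m in map(GALLONS_RE.search, samples) if m]
print(results)
```

The second sample has no decimal part, which is why the decimal portion of the pattern is optional.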


For example:

import re

# Group 1 captures the number between "CF (" and " Gallons)".
pattern = r"\bCF\s\((\d+(?:\.\d+)*(?:,\d+(?:\.\d+)*)*)\sGallons\)"

strings = [r'Consumption = 49,094.00 CF (367,223.12 Gallons)']

for s in strings:
    m = re.search(pattern, s)
    if m:
        gallons = m.group(1)
        print(gallons)

Output
367,223.12
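The question also asks for all three values per entry and for a DataFrame that can be exported to Excel. The following is one possible sketch, not taken from the answers: the pattern `TOOLTIP_RE`, the helper names, the column names, and the shortened sample strings are all assumptions made for illustration. The export helper relies on `pandas` (plus an Excel writer such as `openpyxl`) being installed, so it is defined but not called here.

```python
import re

# Hypothetical pattern with one named group per value. re.search finds
# "Consumption = " inside "Total Consumption = " too, so the year-end entry matches.
TOOLTIP_RE = re.compile(
    r"Consumption = (?P<cf>[\d,.]+) CF \((?P<gallons>[\d,.]+) Gallons\)"
    r".*?Approximate Charge = \$(?P<charge>[\d,.]+)"
)


def to_number(text):
    """Convert '367,223.12' to 367223.12 by dropping thousands separators."""
    return float(text.replace(",", ""))


def parse_tooltip(text):
    """Return a dict with the three values, or None if the text does not match."""
    m = TOOLTIP_RE.search(text)
    if m is None:
        return None
    return {
        "is_total": "Total Consumption" in text,
        "consumption_cf": to_number(m.group("cf")),
        "gallons": to_number(m.group("gallons")),
        "charge_usd": to_number(m.group("charge")),
    }


def export_to_excel(rows, path):
    """Assumes pandas and an Excel writer (e.g. openpyxl) are installed."""
    import pandas as pd

    pd.DataFrame(rows).to_excel(path, index=False)


# Shortened versions of two entries from the question.
lines = [
    "DisplayTooltip('Consumption = 49,094.00 CF (367,223.12 Gallons)  &lt;br /&gt; Approximate Charge = $5,073.42');",
    "DisplayTooltip('Total Consumption = 336,307 CF (2,515,576 Gallons) &lt;br /&gt; Approximate Charge = $34,861.91');",
]

# Keep only the entries that matched.
rows = [row for row in map(parse_tooltip, lines) if row is not None]
print(rows)
```

Calling `export_to_excel(rows, "water_usage.xlsx")` (a hypothetical file name) would then write one row per entry.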
