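I have a list of tuples containing unique utility data: consumption in cubic feet, gallons of water, and the estimated charge. There are 13 tuples: one for each month of the year, plus one for the total consumption at year end. My goal is to extract these three pieces of information, store them in a DataFrame, and eventually export them to an Excel sheet.
This is what the list of tuples looks like after I converted the items to strings. (I iterated over them and converted them to strings because they were originally BeautifulSoup objects, which were hard to organize into a list.)
One tuple looks like this: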
['<area alt="" coords="151,115,181,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 49,094.00 CF (367,223.12 Gallons) <br /> Approximate Charge = $5,073.42\');" shape="rect"/>']'
Below is the entire list of tuples. The only exception is that the last (13th) tuple reads "Total Consumption" rather than just "Consumption".
['['<area alt="" coords="113,88,143,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'**Consumption = 54,070.00 CF (404,443.60 Gallons)** <br /> **Approximate Charge = $5,587.65**\');" shape="rect"/>']', '['<area alt="" coords="151,115,181,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 49,094.00 CF (367,223.12 Gallons) <br /> Approximate Charge = $5,073.42\');" shape="rect"/>']', '['<area alt="" coords="188,99,218,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 51,921.00 CF (388,369.08 Gallons) <br /> Approximate Charge = $5,365.57\');" shape="rect"/>']', '['<area alt="" coords="226,125,256,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 47,122.00 CF (352,472.56 Gallons) <br /> Approximate Charge = $4,869.63\');" shape="rect"/>']', '['<area alt="" coords="263,101,294,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 51,687.00 CF (386,618.76 Gallons) <br /> Approximate Charge = $5,341.39\');" shape="rect"/>']', '['<area alt="" coords="301,139,331,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 44,643.00 CF (333,929.64 Gallons) <br /> Approximate Charge = $4,613.45\');" shape="rect"/>']', '['<area alt="" coords="339,176,369,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 37,770.00 CF (282,519.60 Gallons) <br /> Approximate Charge = $4,010.80\');" shape="rect"/>']', '['<area alt="" coords="376,382,407,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons) <br /> Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="414,382,444,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons) <br /> Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="452,382,482,383" onmouseout="DisplayTooltip(\'\');" 
onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons) <br /> Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="489,382,519,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons) <br /> Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="527,382,557,383" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Consumption = 0.00 CF (0.00 Gallons) <br /> Approximate Charge = $0.00\');" shape="rect"/>']', '['<area alt="" coords="653,68,733,382" onmouseout="DisplayTooltip(\'\');" onmouseover="DisplayTooltip(\'Total Consumption = 336,307 CF (2,515,576 Gallons) <br /> Approximate Charge = $34,861.91\');" shape="rect"/>']']
I wrote this regex to extract the gallons:
gallons = re.search('CF((.*)Gallons)', test_line)
print(gallons)
The output is:
<re.Match object; span=(128, 150), match='CF (404,443.60 Gallons'>
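This doesn't really make things easier, because now I have to find a way to extract '404,443.60' from the match object.
If anyone can recommend a way to extract these three pieces of data from the list of tuples (I assume I will most likely have to iterate over the list in some form) and store them in a DataFrame, that would be very helpful. The end goal is to store these numbers in a DataFrame and eventually export them to an Excel sheet.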
This may be what you want:
gallons = re.search(r'(?<=CF\s\()[\d,.]*(?= Gallons)', test_line)
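To get the text itself rather than the match object, call group() on the result. The following is a minimal sketch; the value of test_line is an assumption (the tooltip text of the first item in the list), since the question does not show how it was defined.

```python
import re

# Assumed sample: the tooltip text from the first item in the question's list.
test_line = (
    "DisplayTooltip('Consumption = 54,070.00 CF (404,443.60 Gallons) "
    "<br /> Approximate Charge = $5,587.65');"
)

# The lookbehind requires "CF (" before the number and the lookahead requires
# " Gallons" after it, so only the digits, commas and dot are part of the match.
match = re.search(r'(?<=CF\s\()[\d,.]*(?= Gallons)', test_line)

# re.search returns None when nothing matches, so guard before calling group().
gallons_text = match.group(0) if match else None
print(gallons_text)
```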
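You can get the matched group:
import re
# Escape the literal parentheses so that only the inner (.*) is a capture group.
re_gallons = re.compile(r'CF \((.*)Gallons\)')
# group(1) is the text between "(" and "Gallons", including the trailing space.
print(re_gallons.search(test_line).group(1))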
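The captured group is still a string with thousands separators, so it needs converting before it can be stored as a number. A small sketch of one way to do that; the helper name to_number is hypothetical and not part of the original answer.

```python
def to_number(text):
    """Convert a string such as '404,443.60' to a float.

    Hypothetical helper: strips surrounding whitespace and removes the
    thousands separators before handing the text to float().
    """
    return float(text.strip().replace(",", ""))


gallons = to_number("404,443.60 ")
print(gallons)
```

The strip() call is there because a pattern such as (.*) followed by Gallons leaves a trailing space in the captured text.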
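You can use a capture group and make the pattern a bit more specific by matching the parentheses, capturing in group 1 the number (with an optional decimal part) that follows the opening parenthesis.
\bCF\s\((\d+(?:\.\d+)*(?:,\d+(?:\.\d+)*)*)\sGallons\)
The pattern, part by part:
\bCF\s   A word boundary to prevent a partial match, then CF and a whitespace character
\(   Match (
(   Start capture group 1
\d+(?:\.\d+)*   Match 1+ digits and an optional decimal part
(?:,\d+(?:\.\d+)*)*   Optionally repeat matching , and 1+ digits with an optional decimal part
)   Close group 1
\sGallons\)   Match a whitespace character and Gallons)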
For example:
import re
pattern = r"\bCF\s\((\d+(?:\.\d+)*(?:,\d+(?:\.\d+)*)*)\sGallons\)"
strings = [r'Consumption = 49,094.00 CF (367,223.12 Gallons)']
for s in strings:
    m = re.search(pattern, s)
    if m:
        # Group 1 holds the number between "(" and " Gallons)".
        gallons = m.group(1)
        print(gallons)
Output:
367,223.12
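The same idea can be extended to pull all three values out of every item in one pass. The following is only a sketch under stated assumptions: the items list holds shortened stand-ins for the strings in the question, and the names TOOLTIP, parse_item, and rows are hypothetical. "Total " is made optional so the 13th item matches, and the decimal part is optional because the totals are printed without one.

```python
import re

# A number with optional thousands separators and an optional decimal part.
NUMBER = r"[\d,]+(?:\.\d+)?"

# Hypothetical combined pattern with one named group per value.
TOOLTIP = re.compile(
    r"(?:Total )?Consumption = (?P<cf>" + NUMBER + r") CF "
    r"\((?P<gallons>" + NUMBER + r") Gallons\)"
    r".*?Approximate Charge = \$(?P<charge>" + NUMBER + r")"
)

# Shortened stand-ins for the strings in the question (assumed shape).
items = [
    "DisplayTooltip('Consumption = 54,070.00 CF (404,443.60 Gallons) <br /> Approximate Charge = $5,587.65');",
    "DisplayTooltip('Consumption = 0.00 CF (0.00 Gallons) <br /> Approximate Charge = $0.00');",
    "DisplayTooltip('Total Consumption = 336,307 CF (2,515,576 Gallons) <br /> Approximate Charge = $34,861.91');",
]


def parse_item(text):
    """Return a dict of floats for one item, or None if the tooltip is missing."""
    m = TOOLTIP.search(text)
    if m is None:
        return None
    # Drop the thousands separators before converting each captured string.
    return {name: float(value.replace(",", "")) for name, value in m.groupdict().items()}


# Keep only the items that actually contained a tooltip.
rows = [row for row in map(parse_item, items) if row is not None]
for row in rows:
    print(row)
```

If pandas is installed, a list of dicts like rows can be passed to pandas.DataFrame(rows), and the resulting frame can be written with DataFrame.to_excel, which needs an Excel engine such as openpyxl for .xlsx files.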