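I know this is yet another very newbie question, but I have been stumbling around the internet for days and cannot solve my problem. I have downloaded the data dump from discogs, which is an XML file of about 35 GB. As far as I can tell, I will have to use a SAX parser, since I obviously cannot load this file into my RAM, and Ox seems to have the best runtime in Ruby. But I simply don't understand how to use this parser. Even with small IO objects or other things used only for testing, it remains a magical thing that throws stuff back at me that I don't understand. This is what the XML looks like: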
<releases>
<release id="1" status="Accepted"><images><image height="600" type="primary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/></images><artists><artist><id>1</id><name>The Persuader</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><title>Stockholm</title><labels><label catno="SK032" id="5" name="Svek"/></labels><extraartists><artist><id>239</id><name>Jesper Dahlbäck</name><anv></anv><join></join><role>Music By [All Tracks By]</role><tracks></tracks></artist></extraartists><formats><format name="Vinyl" qty="2" text=""><descriptions><description>12"</description><description>33 ⅓ RPM</description></descriptions></format></formats><genres><genre>Electronic</genre></genres><styles><style>Deep House</style></styles><country>Sweden</country><released>1999-03-00</released><notes>The song titles are the names of six of Stockholm's 82 districts.
Title on label: - Stockholm -
Recorded at the Globe Studio, Stockholm
FAX: +46 8 679 64 53
</notes><data_quality>Needs Vote</data_quality><tracklist><track><position>A</position><title>Östermalm</title><duration>4:45</duration></track><track><position>B1</position><title>Vasastaden</title><duration>6:11</duration></track><track><position>B2</position><title>Kungsholmen</title><duration>2:49</duration></track><track><position>C1</position><title>Södermalm</title><duration>5:38</duration></track><track><position>C2</position><title>Norrmalm</title><duration>4:52</duration></track><track><position>D</position><title>Gamla Stan</title><duration>5:16</duration></track></tracklist><identifiers><identifier description="A-Side Runout" type="Matrix / Runout" value="MPO SK 032 A1"/><identifier description="B-Side Runout" type="Matrix / Runout" value="MPO SK 032 B1"/><identifier description="C-Side Runout" type="Matrix / Runout" value="MPO SK 032 C1"/><identifier description="D-Side Runout" type="Matrix / Runout" value="MPO SK 032 D1"/><identifier description="Only On A-Side Runout" type="Matrix / Runout" value="G PHRUPMASTERGENERAL T27 LONDON"/></identifiers><videos><video duration="326" embed="true" src="https://www.youtube.com/watch?v=afMHNll9EVM"><title>The Persuader - Gamla Stan</title><description>The Persuader - Gamla Stan</description></video><video duration="301" embed="true" src="https://www.youtube.com/watch?v=EBBHR3EMN50"><title>The Persuader - Norrmalm</title><description>The Persuader - Norrmalm</description></video><video duration="341" embed="true" src="https://www.youtube.com/watch?v=WDZqiENap_U"><title>The Persuader - Södermalm</title><description>The Persuader - Södermalm</description></video><video duration="176" embed="true" src="https://www.youtube.com/watch?v=XExCZfMCXdo"><title>The Persuader - Kungsholmen</title><description>The Persuader - Kungsholmen</description></video><video duration="376" embed="true" src="https://www.youtube.com/watch?v=Cawyll0pOI4"><title>The Persuader - Vasastaden</title><description>The Persuader - 
Vasastaden</description></video><video duration="296" embed="true" src="https://www.youtube.com/watch?v=MpmbntGDyNE"><title>The Persuader - Östermalm</title><description>The Persuader - Östermalm</description></video></videos><companies><company><id>271046</id><name>The Globe Studios</name><catno></catno><entity_type>23</entity_type><entity_type_name>Recorded At</entity_type_name><resource_url>https://api.discogs.com/labels/271046</resource_url></company><company><id>56025</id><name>MPO</name><catno></catno><entity_type>17</entity_type><entity_type_name>Pressed By</entity_type_name><resource_url>https://api.discogs.com/labels/56025</resource_url></company></companies></release>
<release id="2" status="Accepted"><images><image height="394" type="primary" uri="" uri150="" width="400"/><image height="600" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/></images><artists><artist><id>2</id><name>Mr. James Barth & A.D.</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><title>Knockin' Boots Vol 2 Of 2</title><labels><label catno="SK 026" id="5" name="Svek"/><label catno="SK026" id="5" name="Svek"/></labels><extraartists><artist><id>26</id><name>Alexi Delano</name><anv></anv><join></join><role>Producer, Recorded By</role><tracks></tracks></artist><artist><id>27</id><name>Cari Lekebusch</name><anv></anv><join></join><role>Producer, Recorded By</role><tracks></tracks></artist><artist><id>26</id><name>Alexi Delano</name><anv>A. Delano</anv><join></join><role>Written-By</role><tracks></tracks></artist><artist><id>27</id><name>Cari Lekebusch</name><anv>C. Lekebusch</anv><join></join><role>Written-By</role><tracks></tracks></artist></extraartists><formats><format name="Vinyl" qty="1" text=""><descriptions><description>12"</description><description>33 ⅓ RPM</description></descriptions></format></formats><genres><genre>Electronic</genre></genres><styles><style>Broken Beat</style><style>Techno</style><style>Tech House</style></styles><country>Sweden</country><released>1998-06-00</released><notes>All joints recorded in NYC (Dec.97).</notes><data_quality>Correct</data_quality><master_id is_main_release="true">713738</master_id><tracklist><track><position>A1</position><title>A Sea Apart</title><duration>5:08</duration></track><track><position>A2</position><title>Dutchmaster</title><duration>4:21</duration></track><track><position>B1</position><title>Inner City Lullaby</title><duration>4:22</duration></track><track><position>B2</position><title>Yeah Kid!</title><duration>4:46</duration></track></tracklist><identifiers><identifier description="Side A Runout Etching" 
type="Matrix / Runout" value="MPO SK026-A -J.T.S.-"/><identifier description="Side B Runout Etching" type="Matrix / Runout" value="MPO SK026-B -J.T.S.-"/></identifiers><videos><video duration="268" embed="true" src="https://www.youtube.com/watch?v=LgLchSRehhc"><title>Mr. James Barth & A.D. - Dutchmaster</title><description>Mr. James Barth & A.D. - Dutchmaster</description></video><video duration="297" embed="true" src="https://www.youtube.com/watch?v=x_Os7b-iWKs"><title>Mr. James Barth & A.D. - Yeah Kid!</title><description>Mr. James Barth & A.D. - Yeah Kid!</description></video><video duration="314" embed="true" src="https://www.youtube.com/watch?v=MIgQNVhYILA"><title>Mr. James Barth & A.D. - A Sea Apart</title><description>Mr. James Barth & A.D. - A Sea Apart</description></video><video duration="267" embed="true" src="https://www.youtube.com/watch?v=iaqHaULlqqg"><title>Mr. James Barth & A.D. - Inner City Lullaby</title><description>Mr. James Barth & A.D. - Inner City Lullaby</description></video></videos><companies><company><id>266169</id><name>JTS Studios</name><catno></catno><entity_type>29</entity_type><entity_type_name>Mastered At</entity_type_name><resource_url>https://api.discogs.com/labels/266169</resource_url></company><company><id>56025</id><name>MPO</name><catno></catno><entity_type>17</entity_type><entity_type_name>Pressed By</entity_type_name><resource_url>https://api.discogs.com/labels/56025</resource_url></company></companies></release>
<release id="3" status="Accepted"><images><image height="595" type="primary" uri="" uri150="" width="600"/><image height="472" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="599"/><image height="470" type="secondary" uri="" uri150="" width="600"/></images><artists><artist><id>3</id><name>Josh Wink</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><title>Profound Sounds Vol. 1</title><labels><label catno="CK 63628" id="6" name="Ruffhouse Records"/></labels><extraartists><artist><id>3</id><name>Josh Wink</name><anv></anv><join></join><role>DJ Mix</role><tracks></tracks></artist></extraartists><formats><format name="CD" qty="1" text=""><descriptions><description>Compilation</description><description>Mixed</description></descriptions></format></formats><genres><genre>Electronic</genre></genres><styles><style>Techno</style><style>Tech House</style></styles><country>US</country><released>1999-07-13</released><notes>1: Track title is given as "D2" (which is the side of record on the vinyl version of i220-010 release). This was also released on CD where this track is listed on 8th position. On both version no titles are given (only writing/producing credits). Both versions of i220-010 can be seen on the master release page [m27265]. Additionally this track contains female vocals that aren't present on original i220-010 release.
4: Credited as J. Dahlbäck.
5: Track title wrongly given as "Vol. 1".
6: Credited as Gez Varley presents Tony Montana.
12: Track exclusive to Profound Sounds Vol. 1.</notes><data_quality>Correct</data_quality><master_id is_main_release="false">66526</master_id><tracklist><track><position>1</position><title>Untitled 8</title><duration>7:00</duration><artists><artist><id>5</id><name>Heiko Laux</name><anv></anv><join>&</join><role></role><tracks></tracks></artist><artist><id>4</id><name>Johannes Heil</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>2</position><title>Anjua (Sneaky 3)</title><duration>5:28</duration><artists><artist><id>15525</id><name>Karl Axel Bissler</name><anv>K.A.B.</anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>3</position><title>When The Funk Hits The Fan (Mood II Swing When The Dub Hits The Fan)</title><duration>5:25</duration><artists><artist><id>7</id><name>Sylk 130</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>8</id><name>Mood II Swing</name><anv></anv><join></join><role>Remix</role><tracks></tracks></artist></extraartists></track><track><position>4</position><title>What's The Time, Mr. Templar</title><duration>4:27</duration><artists><artist><id>1</id><name>The Persuader</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>5</position><title>Vol. 
2</title><duration>5:36</duration><artists><artist><id>267132</id><name>Care Company (2)</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>6</position><title>Political Prisoner</title><duration>3:37</duration><artists><artist><id>6981</id><name>Gez Varley</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>7</position><title>Pop Kulture</title><duration>5:03</duration><artists><artist><id>11</id><name>DJ Dozia</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>8</position><title>K-Mart Shopping (Hi-Fi Mix)</title><duration>5:42</duration><artists><artist><id>10702</id><name>Nerio's Dubwork</name><anv></anv><join>Meets</join><role></role><tracks></tracks></artist><artist><id>233190</id><name>Kathy Lee</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>23</id><name>Alex Hi-Fi</name><anv></anv><join></join><role>Remix</role><tracks></tracks></artist></extraartists></track><track><position>9</position><title>Lovelee Dae (Eight Miles High Mix)</title><duration>5:47</duration><artists><artist><id>13</id><name>Blaze</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>14</id><name>Eight Miles High</name><anv></anv><join></join><role>Remix</role><tracks></tracks></artist></extraartists></track><track><position>10</position><title>Sweat</title><duration>6:06</duration><artists><artist><id>67226</id><name>Stacey Pullen</name><anv></anv><join>Presents</join><role></role><tracks></tracks></artist><artist><id>7554</id><name>Black Odyssey</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>67226</id><name>Stacey 
Pullen</name><anv></anv><join></join><role>Presenter</role><tracks></tracks></artist></extraartists></track><track><position>11</position><title>Silver</title><duration>3:16</duration><artists><artist><id>3906</id><name>Christian Smith & John Selway</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>12</position><title>Untitled</title><duration>2:46</duration><artists><artist><id>3</id><name>Josh Wink</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>13</position><title>Boom Box</title><duration>3:41</duration><artists><artist><id>19</id><name>Sound Associates</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>14</position><title>Track 2</title><duration>3:39</duration><artists><artist><id>20</id><name>Percy X</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track></tracklist><identifiers><identifier type="Barcode" value="074646362822"/></identifiers>
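Sorry, pasting it in as a snippet was simply the easiest way. What I want to do now is look up specific release IDs, check whether they have a barcode, and if so, retrieve it. Can anyone point me in the right direction? Regards and thanks in advance, RTUZ第二名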
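SAX is "evented" XML parsing. The handler has methods that are called when:
- an element is entered (an opening element occurs, i.e. <child>)
- an element is exited (a closing element occurs, i.e. </child>)
- an attribute is found
- element text/body is found
The handler needs to keep track of where it currently is in the XML and of the values it is interested in. That way, when it encounters an element it cares about, it can decide what to do.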
Your example XML is rather large, so I made my own small example:
xml = <<EOS
<root>
  <child id="1">
    <barcode value="1111"/>
  </child>
  <child id="2">
  </child>
  <child id="1">
    <barcode value="2222"/>
  </child>
  <child id="4">
    <barcode value="3333"/>
  </child>
</root>
EOS
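I'm trying to find child elements that have an odd ID and an even barcode value. For this simple example, I keep track of all tags and attributes on a stack and discard the state when exiting an element (@stack.pop). Depending on the depth of the XML document and the number of tags/attributes, this can be "expensive".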
require "ox"
require "stringio"

class Handler < ::Ox::Sax
  def initialize
    # Each entry is [element_name, attributes] for an element that is still open.
    @stack = []
  end

  def start_element(element_name)
    @stack << [element_name, {}]
  end

  def end_element(element_name)
    # When an element closes, inspect it together with its parent.
    parent_name, parent_attributes = @stack[-2]
    if parent_name == :child && parent_attributes[:id].to_i.odd?
      name, attributes = @stack[-1]
      if name == :barcode && attributes[:value].to_i.even?
        puts "Here is one record that seems interesting: Child: #{parent_attributes[:id]}, Barcode: #{attributes[:value]}"
      end
    end
    @stack.pop
  end

  def attr(attribute_name, attribute_value)
    # Skip attributes seen outside any element (such as on an XML declaration).
    return if @stack.empty?

    # Attributes belong to the most recently opened element.
    _name, attributes = @stack.last
    attributes[attribute_name] = attribute_value
  end
end

handler = Handler.new
Ox.sax_parse(handler, StringIO.new(xml))
This prints:
Here is one record that seems interesting: Child: 1, Barcode: 2222