为什么运行管道不发送所有数据？

下面是我正在解析的一些xml：

<?xml version="1.0" encoding="utf-8"?>
<data>
<row ows_Document='Weekly Report 10.21.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Daily Update 10.20.2020'
ows_Category='Daily Update'/>
<row ows_Document='Weekly Report 10.14.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Weekly Report 10.07.2020'
ows_Category='Weekly Report'/>
<row ows_Document='Spanish: Reporte Semanal 07.10.2020' 
ows_Category='Weekly Report'/>
</data>

我一直在想如何让管道解析器拒绝记录，除非ows_Category是Weekly Report，而ows_Document不包含Spanish。起初，我在解析后使用了一个伪值(在下面的parseDoc'中(来过滤它们，但后来我意识到我应该能够使用Maybe(在下面完全相同的parseDoc中(和join来折叠我的Maybe层，tag'事件解析器使用的层基于名称或属性匹配而失败。它进行编译，但行为怪异，显然甚至没有尝试将某些元素发送到解析器！这怎么可能呢？

{-# LANGUAGE OverloadedStrings #-}
import           Conduit
import           Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import           Data.Foldable
import           Data.String
import qualified Data.Text                  as T
import           Data.XML.Types
import           Text.XML.Stream.Parse
newtype Doc = Doc
{ name :: String
} deriving (Show)
main :: IO ()
main = do
r <- L8.readFile "oha.xml"
let doc = Doc . T.unpack
check (x,y) a b = if y == "Weekly Report" && not (T.isInfixOf "Spanish" x) then a else b
t :: (MonadThrow m, MonadIO m) => ((T.Text, T.Text) -> ConduitT Event o m c)
-> ConduitT Event o m (Maybe c)
t f = tag' "row" ((,) <$> requireAttr "ows_Document" <*> requireAttr "ows_Category") $ x -> do
liftIO $ print x
f x
parseDoc, parseDoc' :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
parseDoc  = (join <$>) . t $ z@(x,_) -> return $       check z (Just $ doc x)  Nothing -- this version doesn't get sent all of the data! why!?!?
parseDoc' =              t $ z@(x,_) -> return $ doc $ check z             x $ T.pack bad -- dummy value
parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
-> ConduitT Event o m [Doc]
parseDocs = f tagNoAttr "data" . many'
f g n = force (n <> " required") . g (fromString n)
go p = runConduit $ parseLBS def r .| parseDocs p
bad = "no good"
traverse_ print =<<                              go parseDoc
putStrLn ""
traverse_ print =<< filter ((/= bad) . name) <$> go parseDoc'

output——注意parseDoc甚至没有发送一条记录(从10.14开始应该成功的一条(，而parseDoc'的行为与预期一致：

("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.07.2020"}
("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.14.2020","Weekly Report")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.14.2020"}
Doc {name = "Weekly Report 10.07.2020"}

当我试图通过删除与ows_Category有关的所有内容来进一步简化时，突然间parseDoc工作得很好，确立了这个想法的合理性？当我删除了与ows_Document有关的所有内容时，问题仍然存在。

我怀疑我应该用requireAttrRaw来做这件事，但我没能理解它，也找不到文档/示例。

这与Applicative有关吗？现在我想一想，它应该不会因为检查值而失败，对吧？

更新

我从该库的前一版本的作者那里找到了这个答案，其中包括类似情况下有趣的force "fail msg" $ return Nothing，但它放弃了所有解析，而不仅仅是当前解析失败。

这个注释建议我需要抛出一个异常，在源代码中，他们使用类似lift $ throwM $ XmlException "failed check" $ Just event的东西，但与force ... return Nothing一样，这会杀死所有解析，而不仅仅是当前的解析器。我也不知道如何拿到event。

这里有一个合并的pull请求，声称已经解决了这个问题，但它没有讨论如何使用它，只是说它是"；琐碎的"：(

回答

明确答案：

parseAttributes :: AttrParser (T.Text, T.Text)
parseAttributes = do
d <- requireAttr "ows_Document"
c <- requireAttr "ows_Category"
ignoreAttrs
guard $ not (T.isInfixOf "Spanish" d) && c == "Weekly Report"
return d
parseDoc :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
parseDoc = tag' "row" parseAttributes $ return . doc

或者，因为在这种情况下，属性值可以独立检查：

parseAttributes = requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
<* requireAttrRaw' "ows_Category" ("Weekly Report" ==)
<* ignoreAttrs
where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ (n',as) ->
asum $ ((ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as

但后者留下了关于CCD_ 22:的这些问题

如果我们负责验证Name，难道不需要知道命名空间吗
为什么requireAttrRaw向我们发送[Content]而不是两个Maybe Content，ContentText和ContentEntity各一个
我们应该如何处理CCD_ 29"；对于直通解析">

tl；dr在tag' "row" parseAttributes parseContent中，check函数属于parseAttributes，而不属于parseContent。

为什么它没有按预期运行

xml管道(特别(是围绕以下不变量设计的：

当解析器类型为ConduitT Event o m (Maybe a)时，Maybe层对Event是否已被消耗进行编码
当且仅当parseName和parseAttributes都成功时，tag' parseName parseAttributes parseContent消耗Events
当且仅当parseName和parseAttributes都成功时，tag' parseName parseAttributes parseContent运行parseContent

在parseDoc:中

在parseContent部分调用check函数；在这个阶段，根据不变量2，tag'已经承诺消耗Events
将2个CCD_ 50层的堆叠体CCD_
- check函数的输出，用于编码当前<row/>元素是否相关
- "；标准"；tag'签名的Maybe层，根据不变量1对Events是否已被消耗进行编码

这本质上打破了不变量1：当check返回Nothing时，parseDoc返回Nothing，尽管消耗了整个<row/>元素的Events。这导致xml管道的所有组合子的未定义行为，特别是many'(下面分析(

为什么它会这样

many'组合子依赖于不变量1来完成它的工作。定义为many' consumer = manyIgnore consumer ignoreAnyTreeContent，即：

尝试consumer
如果consumer返回Nothing，则使用ignoreAnyTreeContent跳过元素或内容，假设它尚未被consumer消耗，然后递归返回步骤(1(

在您的情况下，consumer为Daily Update 10.20.2020项返回Nothing，即使完整的<row/>元素已被消耗。因此，运行ignoreAnyTreeContent是为了跳过特定的<row/>，但实际上最终会跳过下一个(Weekly Report 10.14.2020(。

如何实现预期行为

将check逻辑移到parseAttributes部分，使Event的消耗与check是否通过相耦合。

为什么它没有按预期运行

为什么它会这样

如何实现预期行为

相关内容

最新更新

热门标签：