在不列出每个块的情况下解析可迭代项

假设我想实现Python可迭代的拆分，而不列出每个块，类似于itertools.groupby，它的块是惰性的。但我想在一个比钥匙平等更复杂的条件下做这件事。所以更像是一个解析器。

例如，假设我想在可迭代的整数中使用奇数作为分隔符。像more_itertools.split_at(lambda x: x % 2 == 1, xs)。(但是more_itertools.split_at列出了每个块。)

在解析器组合子语言中，这可能被称为sepBy1(odd, many(even))。Haskell中有Parsec、pipes-parse和pipes-group库来解决这类问题。例如，从Pipes编写一个类似itertools.groupby的groupsBy'版本就足够了，也很有趣。组(请参见此处)。

itertools.groupby可能会有一些聪明的柔术，也许会应用itertools.pairwise，然后应用itertools.groupby，然后再回到单一元素。

我想我可以自己把它作为一个生成器来编写，但用Python(下面)编写itertools.groupby已经相当麻烦了。也不容易概括。

似乎应该有更普遍的方法，比如为任何类型的流编写解析器和组合子的相对无痛的方法。

# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
def __init__(self, iterable, key=None):
if key is None:
key = lambda x: x
self.keyfunc = key
self.it = iter(iterable)
self.tgtkey = self.currkey = self.currvalue = object()
def __iter__(self):
return self
def __next__(self):
self.id = object()
while self.currkey == self.tgtkey:
self.currvalue = next(self.it)    # Exit on StopIteration
self.currkey = self.keyfunc(self.currvalue)
self.tgtkey = self.currkey
return (self.currkey, self._grouper(self.tgtkey, self.id))
def _grouper(self, tgtkey, id):
while self.id is id and self.currkey == tgtkey:
yield self.currvalue
try:
self.currvalue = next(self.it)
except StopIteration:
return
self.currkey = self.keyfunc(self.currvalue)

这里有几个简单的迭代器拆分器，我在无聊的时候写的。我不认为它们特别深刻，但也许它们会在某种程度上有所帮助。

我没有花太多时间思考有用的接口、优化或实现多个交互子功能。如果需要的话，所有这些东西都可以添加。

这些基本上是以itertools.groupby为模型的，其接口可能会被认为有点奇怪。这是Python实际上不是一种函数式编程语言的结果。Python的生成器(以及实现迭代器协议的其他对象)是有状态的，并且没有保存和恢复生成器状态的功能。因此，函数确实返回了一个迭代器，迭代器依次生成迭代器。迭代器从原始迭代器中生成值。但是返回的迭代器共享底层的可迭代性，即传递给原始调用的可迭代值，这意味着当您推进外部迭代器时，当前内部迭代器中任何未使用的值都会被丢弃，而不会发出通知。

有一些(相当昂贵的)方法可以避免丢弃这些值，但由于从一开始就排除了最明显的方法——列表化，尽管准确记录行为很尴尬，但我还是选择了groupby接口。可以用itertools.tee包装内部迭代器，以使原始迭代器独立，但这需要类似于(或可能略高于)列表化的代价。它仍然要求在启动下一个子迭代器之前完全生成每个子迭代器，但不要求在开始使用值之前完全生成子迭代程序。

为了简单起见(据我说：-)，我将这些函数实现为生成器，而不是对象，就像itertools和more_itertools一样。外部发生器产生每个连续的亚畸胎，然后在产生下一个亚畸胎之前收集并丢弃其中的任何剩余值[注1]。我想，大多数时候，在外循环尝试刷新子畸胎瘤之前，子畸胎瘤会完全耗尽，所以额外的调用会有点浪费，但它比您引用的itertools.groupby代码更简单。

仍然有必要从子迭代器返回原始迭代器已用尽的事实，因为这不是你可以询问迭代器的事情。我使用nonlocal声明在外部生成器和内部生成器之间共享状态。在某些方面，像itertools.groupby那样维护对象中的状态可能更灵活，甚至可能被认为更Python，但nonlocal对我有用

我实现了more_itertools.split_at(没有maxsplits和keep_separator选项)，我认为它相当于Pipes.Groups.groupBy'，重命名为split_between，表示如果两个连续元素满足某些条件，它会在它们之间进行拆分。

请注意，split_between总是在通过运行第一个子迭代器请求第一个值之前，强制提供迭代器中的第一个值。其余的值是延迟生成的。我尝试了几种方法来推迟第一个对象，但最终我还是选择了这个设计，因为它要简单得多。其结果是，即使提供的参数为空，不执行初始力的split_at也总是返回至少一个子谓词，而split_between则不返回。对于一些真正的问题，我必须同时尝试这两种方法，以决定我更喜欢哪个界面；如果你有偏好，一定要表达出来(但不能保证会有变化)。

from collections import deque
def split_at(iterable, pred=lambda x:x is None):
'''Produces an iterator which returns successive sub-iterations of 
`iterable`, delimited by values for which `pred` returns
truthiness. The default predicate returns True only for the
value None.
The sub-iterations share the underlying iterable, so they are not 
independent of each other. Advancing the outer iterator will discard
the rest of the current sub-iteration.
The delimiting values are discarded.
'''
done = False
iterable = iter(iterable)
def subiter():
nonlocal done
for value in iterable:
if pred(value): return
yield value
done = True
while not done:
yield (g := subiter())
deque(g, maxlen=0)
def split_between(iterable, pred=lambda before,after:before + 1 != after):
'''Produces an iterator which returns successive sub-iterations of 
`iterable`, delimited at points where calling `pred` on two
consecutive values produces truthiness. The default predicate
returns True when the two values are not consecutive, making it
possible to split a sequence of integers into contiguous ranges.
The sub-iterations share the underlying iterable, so they are not 
independent of each other. Advancing the outer iterator will discard
the rest of the current sub-iteration.
'''
iterable = iter(iterable)
try:
before = next(iterable)
except StopIteration:
return
done = False
def subiter():
nonlocal done, before
for after in iterable:
yield before
prev, before = before, after
if pred(prev, before):
return
yield before
done = True
while not done:
yield (g := subiter())
deque(g, maxlen=0)

备注

我相信，

collections.deque(g, maxlen=0)是目前丢弃迭代器剩余值的最有效方法，尽管它看起来有点神秘。感谢more_itertools为我指出了该解决方案，以及计算生成器生成的对象数的相关表达式：
```
cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0
```
尽管我并不是想把上述怪事归咎于more_itertools。(他们是用if语句来做的，而不是海象。)

备注

相关内容

最新更新

热门标签：