Spark parallelization on an iterator with a function

I have an iterator that operates on a sequence of WARC documents and yields a modified list of tokens for each document:

    import re

    from bs4 import BeautifulSoup


    class MyCorpus(object):
        def __init__(self, warc_file_instance):
            self.warc_file = warc_file_instance

        def clean_text(self, html):
            soup = BeautifulSoup(html)  # create a new bs4 object from the loaded HTML
            for script in soup(["script", "style"]):  # remove all JavaScript and stylesheet code
                script.extract()
            # get text
            text = soup.get_text()
            # break into lines and remove leading and trailing space on each
            lines = (line.strip() for line in text.splitlines())
            # break multi-headlines into a line each
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            # drop blank lines
            text = '\n'.join(chunk for chunk in chunks if chunk)
            return text

        def __iter__(self):
            for r in self.warc_file:
                try:
                    w_trec_id = r['WARC-TREC-ID']
                    print(w_trec_id)
                except KeyError:
                    pass
                try:
                    # keep only what follows the Content-Length header
                    text = self.clean_text(re.compile(r'Content-Length: \d+').split(r.payload)[1])
                    alnum_text = re.sub('[^A-Za-z0-9 ]+', ' ', text)
                    yield list(set(alnum_text.encode('utf-8').lower().split()))
                except Exception:
                    print('An error occurred')

Now I use Apache Spark's parallelize so that I can then apply the map functions I need:

    from operator import add

    import warc

    warc_file = warc.open('/Users/akshanshgupta/Workspace/00.warc')
    documents = MyCorpus(warc_file)
    x = sc.parallelize(documents, 20)  # sc is an existing SparkContext
    data_flat_map = x.flatMap(lambda xs: [(x, 1) for x in xs])
    sorted_map = data_flat_map.sortByKey()
    counts = sorted_map.reduceByKey(add)
    print(counts.max(lambda x: x[1]))

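I have the following questions:

Is this the best way to achieve this, or is there a simpler way?
When I parallelize the iterator, does the actual processing happen in parallel, or is it still sequential?
What if I have multiple files? How can I scale this to a very large corpus, say terabytes?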

I come more from a Scala background, but:

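One doubt I have is the sortByKey before the reduceByKey. reduceByKey does not require sorted input, so the sort looks like unnecessary work.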
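To illustrate the point about sortByKey, here is a minimal sketch of the same count without the sort. The helper names `to_pairs` and `most_frequent_token` are made up for this example, and `token_lists_rdd` is assumed to be an RDD whose elements are lists of tokens, like the `x` in the question.

```python
from operator import add


def to_pairs(tokens):
    """Turn one document's tokens into (token, 1) pairs."""
    return [(token, 1) for token in tokens]


def most_frequent_token(token_lists_rdd):
    """Return the (token, count) pair with the highest count.

    token_lists_rdd is assumed to be an RDD of token lists.
    No sortByKey is involved: reduceByKey groups equal keys on its own.
    """
    counts = token_lists_rdd.flatMap(to_pairs).reduceByKey(add)
    return counts.max(key=lambda pair: pair[1])
```

With the question's variables this would be called as `most_frequent_token(x)`. Because each document contributes a token at most once, the count is the number of documents containing that token.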
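Processing runs in parallel if you use map, foreachPartition, the DataFrame writer and so on, or if you read the data through sc or the SparkSession. The Spark paradigm is generally suited to algorithms without sequential dependencies. mapPartitions and similar APIs are generally used to improve performance. I think that function should be part of mapPartitions, or be used in conjunction with map or inside a map closure. Watch out for serialization issues; see:
- "Calling function inside RDD map function in Spark cluster" and
- https://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/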
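A rough sketch of what moving the function into mapPartitions could look like follows. As far as I know, sc.parallelize turns a Python iterable into a list on the driver before distributing it, so any cleaning done inside `__iter__` would run sequentially there. The sketch assumes instead an RDD of raw text strings (`raw_texts_rdd`, a hypothetical name) and keeps the work in plain module-level functions, which avoids having to serialize an object that holds an open file. It also returns sorted token lists, unlike the question's `list(set(...))`, purely to make the output deterministic.

```python
import re

NON_ALNUM = re.compile(r"[^A-Za-z0-9 ]+")


def unique_tokens(text):
    """Lower-case the text and return its distinct alphanumeric tokens, sorted."""
    return sorted(set(NON_ALNUM.sub(" ", text).lower().split()))


def tokenize_partition(texts):
    """Process one partition: takes an iterator of texts, yields token lists."""
    for text in texts:
        yield unique_tokens(text)


def tokenize_rdd(raw_texts_rdd):
    """Apply the per-partition tokenizer; Spark runs it on the executors."""
    return raw_texts_rdd.mapPartitions(tokenize_partition)
```

Since `tokenize_partition` only needs an iterator, it can be tried locally without a cluster, for example `list(tokenize_partition(iter(["Hello, hello world!"])))`.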
More compute resources allow more scaling, with better performance and throughput.
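For the multiple-files case, one possible approach (not necessarily the best one) is to distribute the list of file paths rather than the documents, and to open each file inside the task. In the sketch below `documents_from_paths`, `corpus_rdd` and the `read_documents` parameter are all hypothetical names; `read_documents` stands for any callable that maps a path to an iterable of documents, for instance a function that opens a WARC file and cleans each record. It is assumed that every path is readable from the executors, for example on a shared or distributed file system.

```python
def documents_from_paths(paths, read_documents):
    """Yield every document from every path in one partition.

    read_documents is a hypothetical callable: path -> iterable of documents.
    """
    for path in paths:
        for document in read_documents(path):
            yield document


def corpus_rdd(sc, paths, read_documents, num_slices=20):
    """Build an RDD of documents from many files.

    Only the list of paths is parallelized, so the files themselves are
    opened and parsed inside the Spark tasks, not on the driver.
    """
    path_rdd = sc.parallelize(paths, num_slices)
    return path_rdd.mapPartitions(
        lambda part: documents_from_paths(part, read_documents)
    )
```

With this layout the parallelism is bounded by the number of files, so a corpus split into many moderately sized files should spread across more executors than a few very large ones.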
