Python: print/get the first sentence of each paragraph



This is my code, but it prints the whole paragraph. How can I print only the first sentence, up to the first period?

from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

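This code prints:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.

But I only want it to print:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.

Thanks for your help.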

Split the text on that period. For a single split, str.partition() is faster than str.split() with a limit:

text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
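To make the behaviour of str.partition() concrete, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the article. It also shows one way to avoid appending a stray '.' when the text contains no period at all, by reusing the separator that partition() returns instead of hard-coding it.

```python
# Illustrative strings only; they stand in for the result of get_text().
sample = "First sentence. Second sentence. Third."

# partition() always returns a 3-tuple: (before, separator, after).
head, sep, tail = sample.partition('.')
first = head + sep  # text up to and including the first period

# When no period is found, sep and tail are empty strings,
# so head + sep gives back the original text unchanged.
no_period = "No period here"
head2, sep2, tail2 = no_period.partition('.')
unchanged = head2 + sep2

print(first)
print(unchanged)
```

Using head + sep rather than head + '.' is a small design choice: the output only ends with a period if the input actually contained one.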

If you only need to process the first <p> element, use soup.find() instead:

text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)

However, for the given URL, the sample text is found in the second paragraph:

>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        paragraph = soup.find_all('p')[0].get_text()
        phrase_list = paragraph.split('.')
        print(phrase_list[0])

Split the paragraph on the first period. The argument 1 specifies maxsplit, which saves time by avoiding unnecessary extra splits.

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        my_paragraph = soup.find_all('p')[0].get_text()
        my_list = my_paragraph.split('.', 1)
        print(my_list[0])
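As a rough illustration of what the maxsplit argument changes, the sketch below compares the two calls on a made-up string (not the article text). Note that split() drops the separator, so the period has to be added back if it is wanted in the output.

```python
# Illustrative string only; it stands in for a paragraph's text.
sample = "One. Two. Three."

# Without a limit, every period produces a split (note the trailing '').
all_parts = sample.split('.')

# With maxsplit=1, only the first period is used; the rest stays intact.
limited = sample.split('.', 1)

# split() removes the separator, so add it back for a complete sentence.
first_sentence = limited[0] + '.'

print(all_parts)
print(limited)
print(first_sentence)
```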
You can use find('.'), which returns the index of the first occurrence of what you are looking for.

So, if the paragraph is stored in a variable named paragraph:

sentence_index = paragraph.find('.')
# add 1 so that the slice includes the '.'
sentence_index += 1
print(paragraph[0:sentence_index])

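Obviously, the control logic is missing here, such as checking whether the string in the variable paragraph contains a "." at all. In any case, find() returns -1 if it cannot find the substring you are looking for.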
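The missing check could look something like the sketch below. The function name first_sentence is hypothetical, chosen here for illustration, and returning the whole paragraph when no period is found is an assumption about the desired fallback rather than something the question specifies.

```python
def first_sentence(paragraph):
    """Return the text up to and including the first '.'.

    Hypothetical helper: if there is no period, the whole
    paragraph is returned unchanged (an assumed fallback).
    """
    index = paragraph.find('.')
    if index == -1:
        # find() signals "not found" with -1, so skip the slicing.
        return paragraph
    # index + 1 makes the slice include the period itself.
    return paragraph[:index + 1]


print(first_sentence("Hello world. Goodbye."))
print(first_sentence("No period here"))
```

Without the -1 check, adding 1 to the index would give 0 and the slice would be an empty string.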
