Python: Replace "dumb quotation marks" with "curly ones" in a string



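I have a string like this:

"But that gentleman," looking at Darcy, "seemed to think the country was nothing at all."

I want this output:

“But that gentleman,” looking at Darcy, “seemed to think the country was nothing at all.”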

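Similarly, dumb single quotes should be converted to their curly equivalents. If you are interested, read about the typographic rules here.

My guess is that this has been solved before, but I can't find a library or script that does it. SmartyPants (Perl) is the mother of all libraries for this, and there is a Python port. But its output is HTML entities: &#8220;But that gentleman,&#8221; I just want a plain string with curly quotes. Any ideas?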

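Update:

I solved it as suggested by Padraig Cunningham:

  1. Use smartypants for the typographic corrections
  2. Use HTMLParser().unescape to convert the HTML entities back to Unicode

This approach may be problematic if your input text contains HTML entities that you do not want converted, but in my case that is fine.

End of update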
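
For readers on Python 3, step 2 can be sketched with the standard library's html.unescape (HTMLParser().unescape was removed in Python 3.9). This is only a minimal sketch: the smartypants step is left out, the input is a hard-coded string standing in for its output, and the helper name entities_to_unicode is made up for illustration. The second call shows the caveat above: entities that were already in the input are converted as well.

```python
import html


def entities_to_unicode(text):
    """Convert HTML entities (numeric or named) to Unicode characters.

    Hypothetical helper name; it only wraps the stdlib html.unescape.
    """
    return html.unescape(text)


# Stand-in for smartypants output (assumed, not produced here).
converted = entities_to_unicode("&#8220;But that gentleman,&#8221;")
print(converted)

# Caveat: entities that were already present in the input are unescaped too.
print(entities_to_unicode("AT&amp;T said &#8220;no&#8221;"))
```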

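Can the input be trusted?

The input can be trusted only up to a point. The string may contain an unclosed double quote: "But be that gentleman, looking at Darcy. It may also contain an unclosed single quote: 'But be that gentleman, looking at Darcy. Finally, it may contain a single quote that is meant as an apostrophe: Don't go there.

I have already implemented an algorithm that tries to close these missing quotation marks correctly, so that is not part of the question. For completeness, here is the code that closes the missing quotes: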

# Assumes assumedSentence is a dict with a 'sentence' key, defined elsewhere.
quotationMarkDictionary = [{
    'start': '"',
    'end': '"',
    },{
    'start': '“',
    'end': '”',
    },{
    'start': "'",
    'end': "'",
    },{
    'start': '‘',
    'end': '’'
    },{
    'start': '(',
    'end': ')'
    },{
    'start': '{',
    'end': '}'
    },{
    'start': '[',
    'end': ']'
    }]
# If assumedSentence has quotation marks (single, double, …) and the
# number of opening quotation marks is larger than the number of closing
# quotation marks, append a closing quotation mark at the end of the
# sentence. Likewise, add opening quotation marks to the beginning of the
# sentence if there are more closing marks than opening marks.
for quotationMark in quotationMarkDictionary:
    numberOpenings = assumedSentence['sentence'].count(quotationMark['start'])
    numberClosings = assumedSentence['sentence'].count(quotationMark['end'])
    # Are the opening and closing marks the same? ('Wrong' marks.) Then just make sure there is an even number of them
    if quotationMark['start'] == quotationMark['end'] and numberOpenings % 2 != 0:
        # If sentence starts with this quotation mark, put the new one at the end
        if assumedSentence['sentence'].startswith(quotationMark['start']):
            assumedSentence['sentence'] += quotationMark['end']
        else:
            assumedSentence['sentence'] = quotationMark['end'] + assumedSentence['sentence']
    elif numberOpenings > numberClosings:
        assumedSentence['sentence'] += quotationMark['end']
    elif numberOpenings < numberClosings:
        assumedSentence['sentence'] = quotationMark['start'] + assumedSentence['sentence']

You can use HTMLParser to unescape the HTML entities returned by smartypants:

In [32]: from HTMLParser import HTMLParser
In [33]: s = "&#x201C;But that gentleman,&#x201D;"
In [34]: print HTMLParser().unescape(s)
“But that gentleman,”
In [35]: HTMLParser().unescape(s)
Out[35]: u'\u201cBut that gentleman,\u201d'

To avoid encoding errors, you should use io.open when opening the file and specify encoding="the_encoding", or decode the string to unicode:

In [11]: s
Out[11]: '&#x201C;But that gentleman,&#x201D;\xe2'
In [12]: print  HTMLParser().unescape(s.decode("latin-1"))
“But that gentleman,”â
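
The io.open advice can be sketched in Python 3 as follows. This is an illustration under assumptions rather than the answerer's code: the helper name read_and_unescape, the temporary file, and its Latin-1 content are all made up for the demo, and html.unescape stands in for HTMLParser().unescape.

```python
import html
import io
import os
import tempfile


def read_and_unescape(path, encoding="utf-8"):
    """Read a text file with an explicit encoding, then unescape HTML entities.

    Hypothetical helper; the encoding must match how the file was written.
    """
    with io.open(path, encoding=encoding) as handle:
        return html.unescape(handle.read())


# Demo data (assumed): the entity string from above plus one Latin-1 byte.
raw = "&#x201C;But that gentleman,&#x201D;\xe2".encode("latin-1")
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
try:
    text = read_and_unescape(tmp.name, encoding="latin-1")
finally:
    os.remove(tmp.name)

print(text)
```

Passing the wrong encoding here would either raise a UnicodeDecodeError or silently produce the wrong characters, which is why the encoding is given explicitly instead of relying on the platform default.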

For the simplest use cases, no regular expressions are needed:

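# Straight quote characters to convert; all of them become double curly quotes.
quote_chars = ('"', "'", "`")

def to_smart_quotes(s):
    # Count occurrences per call so that repeated calls start from a clean state.
    quote_chars_counts = dict.fromkeys(quote_chars, 0)
    output = []
    for c in s:
        if c in quote_chars_counts:
            # Even occurrences (0, 2, ...) open a quotation, odd ones close it.
            new_ch = '“' if quote_chars_counts[c] % 2 == 0 else '”'
            quote_chars_counts[c] += 1
        else:
            new_ch = c
        output.append(new_ch)
    return ''.join(output)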

If needed, it is trivial to modify this to pull the replacements from a replacement map instead of using literals.
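
One way the replacement-map variant might look is sketched below. The mapping REPLACEMENTS and the function name to_smart_quotes_mapped are hypothetical, chosen only for illustration. As with the version above, quotes are simply toggled between opening and closing, so an apostrophe such as the one in "Don't" would be counted as a quote and throw the pairing off.

```python
# Hypothetical mapping: straight quote -> (opening, closing) curly forms.
REPLACEMENTS = {
    '"': ("\u201c", "\u201d"),  # left/right double quotation marks
    "'": ("\u2018", "\u2019"),  # left/right single quotation marks
}


def to_smart_quotes_mapped(s, replacements=REPLACEMENTS):
    """Alternate between the opening and closing form of each mapped quote."""
    counts = dict.fromkeys(replacements, 0)  # fresh counters on every call
    output = []
    for c in s:
        if c in replacements:
            opening, closing = replacements[c]
            # Even occurrences open a quotation, odd occurrences close it.
            output.append(opening if counts[c] % 2 == 0 else closing)
            counts[c] += 1
        else:
            output.append(c)
    return "".join(output)


print(to_smart_quotes_mapped('"smarty" \'pants\''))
```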

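Since this question was originally asked, Python smartypants has gained an option to output the replacement characters directly in Unicode:

u = 256

Output Unicode characters instead of numeric character references, for example, from &#8220; to left double quotation mark (“) (U+201C).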

Skimming the documentation, it looks like you're stuck with .replace on top of smartypants:

smartypants(r'"smarty" "pants"').replace('&#x201C;', '“').replace('&#x201D;', '”')

Though it might read better if you alias the magic strings:

html_open_quote = '&#x201C;'
html_close_quote = '&#x201D;'
smart_open_quote = '“'
smart_close_quote = '”'
# Parentheses let the chained calls span several lines.
result = (smartypants(r'"smarty" "pants"')
          .replace(html_open_quote, smart_open_quote)
          .replace(html_close_quote, smart_close_quote))

Assuming well-formed input, this can be done with regular expressions:

# coding=utf8
import re
sample = """'Sample Text' - "But that gentleman," looking at Darcy, "seemed to think the 'country' was nothing at all." 'Don't convert here.'"""
# Inner call curls the double quotes; outer call curls whitespace-delimited single quotes.
print(re.sub(r"(\s|^)'(.*?)'(\s|$)", r"\1‘\2’\3", re.sub(r"\"(.*?)\"", r"“\1”", sample)))

Output:

‘Sample Text’ - “But that gentleman,” looking at Darcy, “seemed to think the ‘country’ was nothing at all.” ‘Don't convert here.’

Here I delimit single quotes by assuming they sit at the start/end of a line or are surrounded by whitespace.
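
The same whitespace-delimiter assumption can also be written with lookarounds, sketched below. This is a variation for illustration, not the code above: the function name curl_quotes is made up, and the lookarounds (?<!\S) and (?!\S) stand in for the (\s|^) and (\s|$) groups. Because lookarounds do not consume the surrounding whitespace, two quoted words separated by a single space are both converted.

```python
import re


def curl_quotes(text):
    """Curl paired double quotes, then whitespace-delimited single quotes.

    Hypothetical helper; assumes well-formed (balanced) input.
    """
    # Double quotes: pair them up non-greedily.
    text = re.sub(r'"(.*?)"', "\u201c\\1\u201d", text)
    # Single quotes: the opening quote must not follow a non-space character
    # and the closing quote must not precede one, so apostrophes are skipped.
    text = re.sub(r"(?<!\S)'(.*?)'(?!\S)", "\u2018\\1\u2019", text)
    return text


print(curl_quotes("""'Sample Text' - "But that gentleman," 'Don't convert here.'"""))
```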
