如何使用python在文本中添加标点符号



我正在玩IBM Watson Speech To Text Service API。对于那些不知道这个服务被用来转录音频的人。您将音频文件上传到该服务,它将返回文本。到目前为止,该服务运行良好,但问题是返回的文本不包含标点符号。我试着用nltk来解决这个问题,但是没有结果。

一些nltk代码我已经尝试过了。

#string is the text
string = """Hey guys today I'm gonna show you how to make bulletproof coffee so free guys and never heard a blooper coffee it's been around since like two thousand two been around for awhile but i think it's been a lot more popular now probably within the last maybe year two years so for me I recently just started doing bulletproof coffee say about a couple so basically regular coffee you know you get your regular coffee and then you put cream and you put sugar mix it up to get this nice little creamy sweet taste so bulletproof coffee the differences as they're putting cream and sugar your you play coffee you put butter grass fed butter and coconut oil so the difference is instead of using sugar you're using and healthy fats and the point of that is it's really good for if you're asked about the cato jenneke dieter basically low carb diet last you won't do is load up your coffee with how much sugar and roads are low carb diet and also a lot of people actually use bulletproof coffee as a way to replace their practice now i'm not saying to do that because albion is as of right now I still you might break I drink my coffee and at all times a wreck this but again though I do eat a lot so home a lot more than a lot of normal people so keep this in mind okay so this is how the rescue workers the book per copy you want to go ahead and put two cups of coffee so it's a here is we have one cup put it into our blender the trick here is the blending park you don't blend it doesn't come out right I use it not blended and then that coconut oil and the butter cut just floats on top when you got this oily coffee taste not the not the best now two cups of coffee you put two teaspoons of butter you wanna go and grass fed butter so let's say this is one teaspoon here yeah I just gonna estimated this will be two teaspoons every go two teaspoons of butter and how it I guess and crazy I remember doing this as a man sir put butter in there but it actually doesn't and it's really really good then you put about one to two teaspoons of cooking oil so for me i'm going to go with so here's about probably about one tablespoon and here's to get there you go throw on the cat and   so not only is a good to be honest i think on a local wow it works out really well sugar is my car but when it comes to drinking the i can't really put any pre sugar sugar so what i've basically been doing a lot and you know black take the paper personally i can enjoy the cream and sugar so i thought this to be a really good replacement it makes is reading the taste and i think it's leann when i used even though i'm not sugar at all so it tastes good good for you and it's a low carb no take a look got a little foam on the top uh man fiasco we smell this now the butter coconut so check this out you get this nice little cream uh this is a taste test so smooth so creamy and i never put any sugar and a lot of people come in and play like go you know so we can love or something the v. f. or something the person for me albeit like this is perfect you know it's creamy and it's not bitter like black coffee plus it fits nicely with a low carb diet so a lot people drink this morning they replace directors because this does have a lot of calories so here's something to look out for them he noticed two teaspoons of butter want to tease bozo coconut oil a lot of calories just give an example one tablespoon of coconut oil has but a hundred and thirty calories which means two of them it's like two hundred sixty calories the butter is likely to borrow from that so you're looking at sort of ballpark a four hundred calories plus so that's what a lot of people drink this as replacement for their breakfast and with the good fats they give them the energy that they need to go throughout the day and also because that you watched another video talk about the difference we fat and carbs are sugar that exhort slower and so the energy lasts longer it doesn't get any sort of super fast and gets burnt off versus when you let's say drink sugary coffee you know we put sugar and cream in your coffee that sugar gets exhort super fast so you get this spike in energy up in the morning and then it starts crashing down but i noticed drinking this you don't gain topic crashes because it's back they said sugar so exhort slowly and kinda sustains so this is how i make my coffee i love it i suggest give it a try guys especially if you're one of coffee drinkers replacing your cream in your sugar with some bulletproof coffee you try some coconut oil and grass fed butter so i tried our guys see i like more workouts mortician how to work out the ads the most is that you want sixpack shortcuts dot com piece"""
import nltk
n = nltk.tokenize.punkt.PunktSentenceTokenizer()
n.sentences_from_text(string)
有什么办法可以解决我的问题吗?如果你是我,你会怎么做?

现在是2022年了,huggingface有一些易于安装的机器学习解决方案,可以在成绩单中添加标点符号:

https://huggingface.co/felflare/bert-restore-punctuation

的例子:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts(use_cuda=False)
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")

该模型经过56万个yelp句子的训练,准确率达到90%。

安装注意

截至2022年9月,rpunct包有一个没有gpu就无法运行的错误,但是在我的示例中有一个补丁分叉支持use_cuda=False kwarg(较慢但适用于所有处理器)。要安装这个分支,执行以下命令:

pip3 install git+https://github.com/ernie-mlg/rpunct.git

选项2(更好)

这是另一个具有类似精度的拥抱脸模型,并且第一次安装正确

https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large

pip install deepmultilingualpunctuation

用法:

>>> from deepmultilingualpunctuation import PunctuationModel
>>> model = PunctuationModel()
Downloading config.json: 100%|█████████████████████████████████████████████████████████████████| 892/892 [00:00<00:00, 335kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████| 2.08G/2.08G [04:54<00:00, 7.60MB/s]
Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████| 406/406 [00:00<00:00, 216kB/s]
Downloading sentencepiece.bpe.model: 100%|████████████████████████████████████████████████| 4.83M/4.83M [00:00<00:00, 8.08MB/s]
Downloading special_tokens_map.json: 100%|█████████████████████████████████████████████████████| 239/239 [00:00<00:00, 158kB/s]
/opt/anaconda3/envs/punct/lib/python3.9/site-packages/transformers/pipelines/token_classification.py:135: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
  warnings.warn(
>>> text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
>>> result = model.restore_punctuation(text)
>>> print(result)
My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

我在上面一个新的conda环境中验证了这实际上是开箱即用的。第二个模型也自动检测并支持4种语言。

除非你有一些严肃的机器学习技能,否则你将不得不编写自己的"规则"来处理这个问题。对于这个用例来说,规则必须更容易开始。

句子必须遵守特定的"语法规则"。这些规则非常基本,因此可以在小学教授。问题是大多数人在写作或说话时不遵守语法规则。首先,您需要关注符合规则的文本。基于这个前提,下面是我要做的

  1. 在字符串上运行NLTK POS标记器(一旦我们知道词性,我们就可以编写规则)
  2. 识别词性违反句子语法规则的标记(句子不能以介词结尾)
  3. 根据语法规则识别句子结尾的标记(名词是一个很好的开始)
  4. 查找大量标点符号语法正确的文本语料库并删除标点符号。
  5. 在语料库中添加标点符号,试试你的"新规则"。调整你的规则。冲洗并重复

您可以利用punctuation 2 Python库来训练您自己的模型来检测标点符号,或者使用预训练的模型。

最新更新