I have some text, such as "google+", "facebook+", ".facebook", "H&M", "google 😃"...
I want to build a keyword filter that removes meaningless special characters from these strings.
Examples:
"google+" => "google+" => meaningful, because there is a service called "Google+"
"H&M" => "H&M" => meaningful, because there is a brand called "H&M"
"facebook+" => "facebook" => the "+" is meaningless here, so remove it.
".facebook" => "facebook" => the "." is meaningless here, so remove it.
"google 😃" => "google" => the emoji is meaningless here, so remove it.
Any suggestions?
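This is hard to do algorithmically, because without a reference, what counts as meaningful or meaningless is subjective.
Your best bet is to find one or more datasets that contain the kind of tags you are referring to. You can then check whether the dataset contains the full string. If it does not, strip the special characters and check again.
One dataset you could use is: https://www.kaggle.com/stackoverflow/stacklite
You could also make an assumption about your tags and always strip special characters from both ends.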
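import re
words = ["google+", "H&M", "facebook+", ".facebook", "google 😃"]
# Remove runs of non-alphanumeric characters (and underscores) from both ends.
# A fixed str.strip() character list would leave the emoji in place.
keywords = [re.sub(r'^[\W_]+|[\W_]+$', '', word) for word in words]
print(keywords)
Output: ['google', 'H&M', 'facebook', 'facebook', 'google']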
You could probably combine the two solutions to get "google+" as well.
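A minimal sketch of that combination, assuming a small in-memory reference set stands in for a real tag dataset. The names KNOWN_KEYWORDS, EDGE_SPECIALS and clean_keyword are made up for illustration, and the entries in the set are placeholders rather than data taken from the Kaggle dataset. The idea is to keep a string untouched when the reference set contains it, and otherwise fall back to stripping special characters from both ends.

```python
import re

# Hypothetical reference set (lowercase); in practice, load it from a tag dataset.
KNOWN_KEYWORDS = {"google+", "h&m", "c++", "c#"}

# Runs of non-alphanumeric characters (or underscores) at either end of a string.
EDGE_SPECIALS = re.compile(r"^[\W_]+|[\W_]+$")


def clean_keyword(word, known=KNOWN_KEYWORDS):
    """Keep known keywords as-is; otherwise strip special characters from the ends."""
    candidate = word.strip()
    if candidate.lower() in known:
        # The reference set vouches for the special characters, so keep them.
        return candidate
    return EDGE_SPECIALS.sub("", candidate)


words = ["google+", "H&M", "facebook+", ".facebook", "google 😃"]
print([clean_keyword(w) for w in words])
# ['google+', 'H&M', 'facebook', 'facebook', 'google']
```

The quality of the result depends entirely on the reference set: anything missing from it is treated as noise and stripped, so a keyword like "C++" only survives if the set includes it.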