Why does Google Natural Language return an incorrect beginOffset for the analyzed string?



I am using the google-cloud/language API to make #annotate calls, analyzing the entities and sentiment of comments I have collected from various online sources.

First, the string I want to analyze includes the comment IDs, so I reformat this:

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

so that it no longer includes any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

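I then send a request to google-cloud/language to #annotate the text. I receive a response that includes the sentiment score and magnitude for the various sentences. Each sentence also comes with a beginOffset value, which is supposed to be the index of that sentence within the original string (the string sent in the request).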

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

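My goal is to locate each original comment within the original string, which should be simple enough. Something like (originalString[beginOffset]) .....

But this value is incorrect!

I assume the offsets do not count certain characters, but I have tried a variety of expressions and nothing seems to work. Does anyone know what might be causing the problem?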

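I know this is an old question, but the problem still seems to exist today. I recently ran into the same issue and solved it by interpreting Google's offsets as byte offsets rather than as string offsets in the chosen encoding. It works well. I hope this helps someone.

Below is some C# code, but anyone should be able to read it and re-code it in their preferred language. Assuming text is the text whose sentiment is being analyzed, the following code converts Google's offset into the correct string offset.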

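// Requires: using System.Text;
// Converts a UTF-8 byte offset returned by the API into a C# string index.
int TransformOffset(string text, int offset)
{
   // Decode only the first `offset` bytes and count the resulting chars.
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}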
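
For readers not using C#, here is a minimal Python sketch of the same idea, assuming the offsets really are UTF-8 byte offsets as described above. The function name utf8_byte_offset_to_index and the sample text are made up for illustration; they are not part of any Google library.

```python
def utf8_byte_offset_to_index(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python string index (code points)."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset
    # happens to fall inside a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


# Hypothetical sample: the curly apostrophe takes 3 bytes in UTF-8.
text = "It\u2019s fine. Next one."
byte_offset = len("It\u2019s fine. ".encode("utf-8"))  # start of "Next one." in bytes
index = utf8_byte_offset_to_index(text, byte_offset)
print(byte_offset, index, text[index:])
```

Here the byte offset is 13 while the string index is 11, which is the kind of drift described in the question. Note that this yields a Python index (code points); in a language whose strings are indexed by UTF-16 code units, the decoded prefix length would need to be measured in that language instead.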

This is related to encoding. Try one of the encoding types, or simply use one of the sample methods provided in their GitHub repository:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

The key code block:


import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    # Narrow builds (sys.maxunicode == 65535) index strings by UTF-16 units.
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'

This worked for me. The offsets were being thrown off by characters such as ’ (i.e. \u2019).
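
To see why a single character like that can shift every later offset, here is a small Python sketch that measures the same string in the units the different encodings use. The helper name lengths is hypothetical, and the mapping of each encoding type to bytes, 16-bit units, and code points is my understanding of how the offsets are counted, not something confirmed in this thread.

```python
def lengths(s: str) -> dict:
    """Length of s measured in UTF-8 bytes, UTF-16 units, and code points."""
    return {
        "UTF8": len(s.encode("utf-8")),            # bytes
        "UTF16": len(s.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF32": len(s),                           # code points (Python 3)
    }


print(lengths("it's"))           # plain ASCII apostrophe
print(lengths("it\u2019s"))      # right single quotation mark
print(lengths("\U0001F600"))     # an emoji outside the Basic Multilingual Plane
```

With the ASCII apostrophe all three counts are 4. With \u2019 the UTF-8 count is 6 while the other two stay at 4, so a UTF-8 offset used directly as a string index lands 2 positions too far for everything after that character.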

You should set the encodingType on the request.

An example using the Java client library with UTF-8 encoded text:

Document doc = Document.newBuilder().setContent(dreamText).setType(Type.PLAIN_TEXT).build();
        
AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder().setEncodingType(EncodingType.UTF8).setDocument(doc).build();
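
If you call the REST endpoint directly instead of using a client library, the same setting goes into the JSON body. The sketch below only builds that body and sends nothing. The helper name build_request_body is hypothetical, and the field names document and encodingType and the allowed values are assumptions based on my reading of the public REST reference, so check them against the current documentation.

```python
import json

# Assumed set of values accepted for encodingType.
ALLOWED_ENCODING_TYPES = {"NONE", "UTF8", "UTF16", "UTF32"}


def build_request_body(text: str, encoding_type: str = "UTF16") -> str:
    """Build a JSON request body that asks for offsets in a given encoding."""
    if encoding_type not in ALLOWED_ENCODING_TYPES:
        raise ValueError("unknown encoding type: " + encoding_type)
    body = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": encoding_type,
    }
    return json.dumps(body)


print(build_request_body("Good Job Baby!", "UTF8"))
```

The default of UTF16 here is a choice made with the question's Node.js code in mind, since JavaScript strings are indexed by UTF-16 code units; pick whichever value matches how your own language indexes strings.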
