Why does Google Natural Language return an incorrect beginOffset for the analyzed string?

I am using the google-cloud/language API to make an #annotate call and analyze the entities and sentiment of comments I have collected from various online sources.

To begin with, the string I want to analyze includes comment IDs, so I reformat this:

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

to this, so that it does not include any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!
"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

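I then send a request to google-cloud/language to #annotate the text. I receive a response that includes the sentiment score and magnitude of each sentence. Each sentence also comes with a beginOffset value, which relates to the index of that sentence in the original string (the string in the request).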

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

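My aim is to locate the original comment in the original string, which should be simple enough. Something like (originalString[beginOffset]) .....

This value is incorrect!

I assume some characters are not being counted, but I have tried a variety of regular expressions and nothing seems to work. Does anyone know what might be causing the problem?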

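I know this is an old question, but the problem still seems to exist today. I recently ran into the same issue and solved it by interpreting Google's offset as a byte offset rather than as a string offset in the chosen encoding. That worked well. I hope it helps someone.

Below is some C# code, but anyone should be able to read it and rewrite it in their preferred language. Assuming text is the text actually being analyzed for sentiment, the following code transforms Google's offset into the correct string offset.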

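using System.Text;

// Treat Google's offset as a UTF-8 byte offset and convert it to a string index.
int TransformOffset(string text, int offset)
{
   // Decode only the first `offset` bytes; the decoded length is the string index.
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}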
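For readers not using C#, the same idea can be sketched in Python. This is a rough port that assumes, as the answer above does, that the offset counts UTF-8 bytes; the name transform_offset is illustrative and not part of any Google library. Note that Python counts code points where C#'s Length counts UTF-16 code units, so the two can differ for characters outside the Basic Multilingual Plane.

```python
def transform_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python string index (code points).

    Assumption: byte_offset counts bytes of the UTF-8 encoding of `text`.
    """
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset
    # happens to fall in the middle of a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


if __name__ == "__main__":
    sample = "it\u2019s fine. Next"
    # The right single quote occupies 3 bytes in UTF-8, so "Next" starts
    # at byte 13 but at string index 11.
    index = transform_offset(sample, 13)
    print(index, repr(sample[index:]))
```

Slicing the original string at the returned index, rather than at the raw offset, should then land on the start of the sentence.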

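This has to do with encoding. Experiment with the available encoding types, or simply use the approach from one of the samples in their GitHub repository: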

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

The key code block:


import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    # Narrow builds (some Python 2 interpreters) index strings by UTF-16 code unit.
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'

This worked for me. The offsets were being thrown off by characters such as ’ (i.e. \u2019).
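To see why a single character like \u2019 shifts everything after it, the following sketch computes where a substring starts under each of the three encoding types. The helper name offsets_by_encoding is made up for illustration, and the figures come from Python's own codecs rather than from the API.

```python
def offsets_by_encoding(text: str, needle: str) -> dict:
    """Return the start of `needle` in `text`, measured in the units of
    each encoding: bytes for UTF8, 16-bit code units for UTF16, and
    code points for UTF32."""
    prefix = text[:text.index(needle)]
    return {
        "UTF8": len(prefix.encode("utf-8")),
        "UTF16": len(prefix.encode("utf-16-le")) // 2,
        "UTF32": len(prefix.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    # \u2019 is one code point and one UTF-16 unit, but three UTF-8 bytes.
    print(offsets_by_encoding("x51\u2019s brick. Nice logo", "Nice"))
    # An emoji outside the BMP is two UTF-16 units and four UTF-8 bytes.
    print(offsets_by_encoding("\U0001F600 ok", "ok"))
```

Whichever unit the offsets are reported in has to match the unit used to index the original string, otherwise every non-ASCII character before the sentence adds to the error.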

You should set the encodingType on the request.

An example using the Java client library with UTF-8 encoded text:

Document doc = Document.newBuilder().setContent(dreamText).setType(Type.PLAIN_TEXT).build();
        
AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder().setEncodingType(EncodingType.UTF8).setDocument(doc).build();
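Once the offsets are in the same units as the string indices, the question's original goal of finding which comment a sentence belongs to becomes a simple lookup. The sketch below assumes each comment sits on its own line of the request string and that the offset has already been converted to a Python string index; comment_index_at is a hypothetical helper name.

```python
from bisect import bisect_right


def comment_index_at(original: str, char_offset: int) -> int:
    """Return the zero-based index of the newline-separated comment
    that contains the character at `char_offset`."""
    # Record the string index at which each comment (line) starts.
    starts = [0]
    for i, ch in enumerate(original):
        if ch == "\n":
            starts.append(i + 1)
    # The containing comment is the last one starting at or before the offset.
    return bisect_right(starts, char_offset) - 1


if __name__ == "__main__":
    original = "first comment\nsecond one\nthird"
    print(comment_index_at(original, 14))
```

A sentence that spans a newline, as in the response shown in the question, is attributed to the comment in which it begins.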
