I have a Spark DataFrame with three columns: ['id', 'title', 'desc']. Both 'title' and 'desc' are text; 'id' is the document ID. A few sample rows look like this:
[Row(id=-33753621, title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', desc=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequentlyxa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),
Row(id=-761323061, title=u'Teen sexting is prompting an overhaul in child pornography laws', desc=u"Rampant teen sexting has left politicians and law enforcement authorities around the country struggling to find some kind of legal middle ground between prosecuting students for child porn and letting them off the hook.Most states consider sexually explicit images of minors to be child pornography, meaning even teenagers who share nude selfies among themselves can, in theory at least, be hit with felony charges that can carry heavy prison sentences and require lifetime registration as a sex offender.Many authorities consider that overkill, however, and at least 20 states have adopted sexting laws with less-serious penalties, mostly within the past five years. Eleven states have made sexting between teens a misdemeanor; in some of those places, prosecutors can require youngsters to take courses on the dangers of social media instead of charging them with a crime.Hawaii passed a 2012 law saying youths can escape conviction if they take steps to delete explicit photos. Arkansas adopted a 2013 law sentencing first-time youth sexters to eight hours of community service. New Mexico last month removed criminal penalties altogether in such cases.At least 12 other states are considering sexting laws this year, many to create new a category of crime that would apply to young people.But one such proposal in Colorado has revealed deep divisions about how to treat the phenomenon. Though prosecutors and researchers agree that felony sex crimes shouldn't apply to a pair of 16-year-olds sending each other selfies, they disagree about whether sexting should be a crime at all.Colorado's bill was prompted by a scandal last year at a Canon City high school where more than 100 students were found with explicit images of other teens. The news sent shockwaves through the city of 16,000. Dozens of students were suspended, and the football team forfeited the final game of the season.Fremont County prosecutors ultimately decided against filing any criminal charges, saying Colorado law doesn't properly distinguish between adult sexual predators and misbehaving teenagers.In a similar case last year out Fayetteville, North Carolina, two dating teens who exchanged nude selfies at age 16 were charged as adults with a felony u2014 sexual exploitation of a minor. After an uproar, the cha"),
I want to convert this 'desc' column (the actual text of each document) into TF-IDF vectors in Spark. Here is what I did:
def tfIdf(df):
    """This function takes the text data and converts it into a term frequency-inverse document frequency (TF-IDF) vector.
    parameter: df - dataframe with a 'desc' text column
    returns: dataframe with tf-idf vectors
    """
    # Importing the feature transformation classes for doing TF-IDF
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
    # Carrying out the tokenization of the text documents (splitting into words)
    tokenizer = Tokenizer(inputCol="desc", outputCol="tokenised_text")
    tokensDf = tokenizer.transform(df)
    # Carrying out the stop-words removal before TF-IDF
    stopwordsremover = StopWordsRemover(inputCol="tokenised_text", outputCol="words")
    swremovedDf = stopwordsremover.transform(tokensDf)
    # Creating the term-frequency vector for each document
    cv = CountVectorizer(inputCol="words", outputCol="tf_features", vocabSize=3, minDF=2.0)
    cvModel = cv.fit(swremovedDf)
    tfDf = cvModel.transform(swremovedDf)
    # Carrying out inverse document frequency on the TF data
    idf = IDF(inputCol="tf_features", outputCol="tf-idf_features")
    idfModel = idf.fit(tfDf)
    tfidfDf = idfModel.transform(tfDf)
    tfidfDf.cache().count()
    return tfidfDf

tfidfDf = tfIdf(sdf_cleaned)
I first tokenized each text document in the 'desc' column using the Tokenizer class, then removed stop words with the StopWordsRemover class. After that I converted the result into a bag-of-words model and obtained the term frequencies with the CountVectorizer class.
Finally, I applied the IDF weights to the term-frequency vectors using the IDF class.
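For reference, the same four stages can also be chained into a single pyspark.ml Pipeline, which keeps the fitted CountVectorizerModel around for inspecting the learned vocabulary later. A minimal sketch under the same column names (the vocabSize of 2000 here is just an illustrative choice):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

# The stages mirror the tfIdf() function above.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="desc", outputCol="tokenised_text"),
    StopWordsRemover(inputCol="tokenised_text", outputCol="words"),
    CountVectorizer(inputCol="words", outputCol="tf_features", vocabSize=2000),
    IDF(inputCol="tf_features", outputCol="tf-idf_features"),
])
pipelineModel = pipeline.fit(sdf_cleaned)
tfidfDf = pipelineModel.transform(sdf_cleaned)
# pipelineModel.stages[2] is the fitted CountVectorizerModel; its
# .vocabulary attribute maps SparseVector indices back to words.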
The final returned dataframe looks as follows (showing only the first row):
[Row(id=-33753621, title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', desc=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequentlyxa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. 
Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before.", tokenised_text=[u'if', u'you', u'hate', u'dealing', u'with', u'bank', u'tellers', u'or', u'customer', u'service', u'representatives,', u'then', u'the', u'royal', u'bank', u'of', u'scotland', u'might', u'have', u'a', u'solution', u'for', u'you.if', u'this', u'program', u'is', u'successful,', u'it', u'could', u'be', u'a', u'big', u'step', u'forward', u'on', u'the', u'road', u'to', u'automated', u'customer', u'service', u'through', u'the', u'use', u'of', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'for', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'that', u'luvo', u'does', u'not', u'operate', u'via', u'a', u'third-party', u'app', u'such', u'as', u'facebook', u'messenger,', u'wechat,', u'or', u'kik,', u'all', u'of', u'which', u'are', u'currently', u'trying', u'to', u'create', u'bots', u'that', u'would', u'assist', u'in', u'customer', u'service', u'within', u'their', u'respective', u'platforms.luvo', u'would', u'be', u'available', u'through', u'the', u'web', u'and', u'through', u'smartphones.', u'it', u'would', u'also', u'use', u'machine', u'learning', u'to', u'learn', u'from', u'its', u'mistakes,', u'which', u'should', u'ultimately', u'help', u'with', u'its', u'response', u'accuracy.down', u'the', u'road,', u'luvo', u'would', u'become', u'a', u'supplement', u'to', u'the', u'human', u'staff.', u'it', u'can', u'currently', u'answer', u'20', u'set', u'questions', u'but', u'as', u'that', u'number', u'grows,', u'it', u'would', u'allow', u'the', u'human', u'employees', u'to', u'more', u'complicated', u'issues.', u'if', u'a', u'problem', u'is', u'beyond', u"luvo's", u'comprehension,', u'then', u'it', u'would', u'refer', u'the', u'customer', u'to', u'a', u'bank', u'employee;', u'however,xa0a', u'user', u'could', u'choose', u'to', u'speak', u'with', u'a', u'human', u'instead', u'of', u'luvo', u'anyway.ai', u'such', u'as', u'luvo,', u'if', u'successful,', u'could', u'help', u'businesses', u'become', u'more', u'efficient', u'and', u'increase', u'their', u'productivity,', u'while', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'which', u'would', u'consequentlyxa0save', u'money', u'that', u'would', u'otherwise', u'go', u'toward', u'manpower.and', u'this', u'trend', u'is', u'already', u'starting.', u'google,', u'microsoft,', u'and', u'ibm', u'are', u'investing', u'significantly', u'into', u'ai', u'research.', u'furthermore,', u'the', u'global', u'ai', u'market', u'is', u'estimated', u'to', u'grow', u'from', u'approximately', u'$420', u'million', u'in', u'2014', u'to', u'$5.05', u'billion', u'in', u'2020,', u'according', u'to', u'a', u'forecast', u'by', u'research', u'and', u'markets.xa0the', u'move', u'toward', u'ai', u'would', u'be', u'just', u'one', u'more', u'way', u'in', u'which', u'the', u'digital', u'age', u'is', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'are', u'increasingly', u'moving', u'toward', u'digital', u'banking,', u'and', u'as', u'a', u'result,', u"they're", u'walking', u'into', u'their', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'less', u'often', u'than', u'ever', u'before.'], words=[u'hate', u'dealing', u'bank', u'tellers', u'customer', u'service', u'representatives,', u'royal', u'bank', u'scotland', 
u'solution', u'you.if', u'program', u'successful,', u'big', u'step', u'forward', u'road', u'automated', u'customer', u'service', u'use', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'luvo', u'does', u'operate', u'third-party', u'app', u'facebook', u'messenger,', u'wechat,', u'kik,', u'currently', u'trying', u'create', u'bots', u'assist', u'customer', u'service', u'respective', u'platforms.luvo', u'available', u'web', u'smartphones.', u'use', u'machine', u'learning', u'learn', u'mistakes,', u'ultimately', u'help', u'response', u'accuracy.down', u'road,', u'luvo', u'supplement', u'human', u'staff.', u'currently', u'answer', u'20', u'set', u'questions', u'number', u'grows,', u'allow', u'human', u'employees', u'complicated', u'issues.', u'problem', u"luvo's", u'comprehension,', u'refer', u'customer', u'bank', u'employee;', u'however,xa0a', u'user', u'choose', u'speak', u'human', u'instead', u'luvo', u'anyway.ai', u'luvo,', u'successful,', u'help', u'businesses', u'efficient', u'increase', u'productivity,', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'consequentlyxa0save', u'money', u'manpower.and', u'trend', u'starting.', u'google,', u'microsoft,', u'ibm', u'investing', u'significantly', u'ai', u'research.', u'furthermore,', u'global', u'ai', u'market', u'estimated', u'grow', u'approximately', u'$420', u'million', u'2014', u'$5.05', u'billion', u'2020,', u'according', u'forecast', u'research', u'markets.xa0the', u'ai', u'just', u'way', u'digital', u'age', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'increasingly', u'moving', u'digital', u'banking,', u'result,', u"they're", u'walking', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'before.'], tf_features=SparseVector(3, {}), tf-idf_features=SparseVector(3, {})),
So the first three columns are the original ['id', 'title', 'desc'], and a new column is added for each transformation applied. As you can see, the Tokenizer and the StopWordsRemover are working fine, since their output columns look correct.
However, I am not sure why the tf_features column from CountVectorizer and the tf-idf_features column from IDF are null rather than vectors of TF and TF-IDF values.
Also, in Spark we pass one document per cell of the column, so how does Spark find the vocabulary for the TF vectors? The vocabulary is the set of unique words occurring across the whole corpus (all documents), not just within one document, while the TF count is the frequency of each term within a single document. So how does this work here?
Please advise.
EDIT 1:
I changed the vocabulary size, since 3 obviously makes no sense, and now I do get tf_features. It looks as follows:
tf_features=SparseVector(2000, {6: 1.0, 8: 1.0, 14: 1.0, 17: 2.0, 18: 1.0, 20: 1.0, 32: 1.0, 35: 2.0, 42: 1.0, 52: 1.0, 53: 3.0, 54: 1.0, 62: 1.0, 65: 1.0, 68: 1.0, 79: 1.0, 93: 4.0, 95: 2.0, 98: 1.0, 118: 1.0, 132: 1.0, 133: 1.0, 149: 1.0, 157: 1.0, 167: 5.0, 202: 3.0, 215: 1.0, 219: 1.0, 224: 1.0, 232: 1.0, 265: 3.0, 302: 1.0, 303: 1.0, 324: 2.0, 330: 1.0, 355: 1.0, 383: 1.0, 395: 1.0, 405: 1.0, 432: 1.0, 456: 1.0, 466: 1.0, 472: 1.0, 501: 1.0, 525: 1.0, 537: 1.0, 548: 1.0, 620: 1.0, 630: 1.0, 639: 1.0, 657: 1.0, 662: 1.0, 674: 1.0, 720: 1.0, 734: 1.0, 975: 1.0, 1003: 1.0, 1057: 1.0, 1148: 1.0, 1187: 1.0, 1255: 1.0, 1273: 1.0, 1294: 1.0, 1386: 1.0, 1400: 1.0, 1463: 1.0, 1477: 1.0, 1491: 1.0, 1724: 1.0, 1898: 1.0, 1937: 3.0, 1954: 1.0})
I am trying to understand this output. Is the first value the number of features (terms)? That is the vocabulary size I passed in. But what is the dictionary that follows? Are the keys the indices of the terms (words) and the values their term frequencies? If so, how can I map these indices back to the original words, i.e. if I want to know which words the above entries stand for, how do I look up the dict keys?
Secondly, this output is not a vector (it looks like a dict). Is it consumable by ML algorithms as-is? I need a feature vector for the next step, not a dict. How does this work?
The tf_features and tf-idf_features in the example you've shown are not null. They are valid SparseVectors in which all features are equal to 0.0 ([0.0, 0.0, 0.0]).
I believe the culprit is the unreasonable configuration of CountVectorizer. With vocabSize equal to 3 you consider only the three most frequent terms ("CountVectorizer will build a vocabulary that only considers the top vocabSize terms ordered by term frequency across the corpus"). On this particular text, that yields exactly the output you observe:
from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
    (["a"], ),  # 'a' occurs only once
    (["b", "c"], ), (["c", "c", "d"], ), (["b", "d"], )
]).toDF(["tokens"])

vectorizer = CountVectorizer(
    inputCol="tokens", outputCol="features", vocabSize=3
).fit(df)

# With vocabulary size 3, 'a' is not in the vocabulary (only the 3 most common words are kept)
vectorizer.vocabulary
['c', 'd', 'b']

vectorizer.transform(df).take(3)
[Row(tokens=['a'], features=SparseVector(3, {})),
 Row(tokens=['b', 'c'], features=SparseVector(3, {0: 1.0, 2: 1.0})),
 Row(tokens=['c', 'c', 'd'], features=SparseVector(3, {0: 2.0, 1: 1.0}))]
As you can see, the first document does not contain any tokens from the vocabulary, so all of its features are equal to 0:
SparseVector(3, {}).toArray()
array([ 0., 0., 0.])
For comparison, the third document contains c twice and d once:
v = SparseVector(3, {0: 2.0, 1: 1.0})
{vectorizer.vocabulary[i]: cnt for (i, cnt) in zip(v.indices, v.values)}
{'c': 2.0, 'd': 1.0}
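The same lookup works for the data in the question; a minimal sketch, assuming you keep the fitted CountVectorizerModel around (cvModel in the tfIdf function above, which is currently local to the function and would have to be returned alongside the dataframe):

v = tfidfDf.first()["tf_features"]
# Each index i in the SparseVector refers to cvModel.vocabulary[i];
# the associated value is that term's frequency in the document.
{cvModel.vocabulary[int(i)]: cnt for (i, cnt) in zip(v.indices, v.values)}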
A more detailed explanation of the CountVectorizer behavior can be found in "Handle unseen categorical string Spark CountVectorizer".
Depending on the application, using a vocabSize in the hundreds of thousands or more is not uncommon, especially if you consider applying some dimensionality reduction technique afterwards.
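For illustration, one such follow-up could be a PCA stage on the TF-IDF output; a minimal sketch (the column names match the question, and k=100 is an arbitrary assumption to tune per application):

from pyspark.ml.feature import PCA

# Project the sparse TF-IDF vectors down to k dense principal components.
# Note that Spark's PCA builds a covariance matrix internally, so this is
# only practical for moderate vocabulary sizes.
pca = PCA(k=100, inputCol="tf-idf_features", outputCol="pca_features")
pcaModel = pca.fit(tfidfDf)
reducedDf = pcaModel.transform(tfidfDf)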