Keras Tokenizer只标记CSV文件的第一行

我是keras API的新手，我可能被一个非常简单的任务困住了。我有一个4列的csv文件。目前我只想使用这些列中的1列。我使用pandas库读取csv，并选择仅使用'host'列。

这可以正常工作，但是当我通过keras标记器函数对数据进行标记时，它只读取csv文件中的第一行。

我需要标记器读取csv并在字符级别对其进行标记，它似乎正在这样做，但只针对第一行。请参阅下面的代码，任何帮助都是非常感谢的。

fields=['host']
test_dataset = pd.read_csv('dga_data.csv',usecols=fields)
test_dataset_tok= Tokenizer(split=',',char_level=True, oov_token=True)
print(test_dataset_tok)
test_dataset_tok.fit_on_texts(test_dataset)
print(test_dataset_tok)
test_dataset_sequences=test_dataset_tok.texts_to_sequences(test_dataset)
print(test_dataset_sequences)
print(test_dataset_tok.word_index)

您正在将数据帧传递给fit_on_texts你需要传递一份名单。从文档:

texts:可以是字符串列表，字符串生成器(为了节省内存)，或者字符串列表的列表。

因此，您需要传递一个列表，或者至少是一个PandasSeries，所以当fit_on_texts执行这个for循环，它遍历CSV文件的每一行，而不仅仅是数据帧轴标签。

In [22]: type(test_dataset)
Out[22]: pandas.core.frame.DataFrame
In [23]: type(test_dataset['host'])
Out[23]: pandas.core.series.Series

import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
test_dataset = pd.DataFrame({'host': [
'Aspire to inspire before we expire.',
'Let the beauty of what you love be what you do.',
'The meaning of life is to give life meaning.',
'I have nothing to lose but something to gain.',
]})
# pandas.core.series.Series
test_dataset = test_dataset['host']
test_dataset_tok= Tokenizer(split=',',char_level=True, oov_token=True)
print(test_dataset_tok)
test_dataset_tok.fit_on_texts(test_dataset)
print(test_dataset_tok)
test_dataset_sequences=test_dataset_tok.texts_to_sequences(test_dataset)
print(test_dataset_sequences)
print(test_dataset_tok.word_index)

输出:

<keras_preprocessing.text.Tokenizer object at 0x0000019AFFA65CD0>
<keras_preprocessing.text.Tokenizer object at 0x0000019AFFA65CD0>
[
[8, 11, 18, 4, 14, 3, 2, 6, 5, 2, 4, 7, 11, 18, 4, 14, 3, 2, 15, 3, 12, 5, 14, 3, 2, 19, 3, 2, 3, 23, 18, 4, 14, 3, 16],
[13, 3, 6, 2, 6, 9, 3, 2, 15, 3, 8, 17, 6, 20, 2, 5, 12, 2, 19, 9, 8, 6, 2, 20, 5, 17, 2, 13, 5, 21, 3, 2, 15, 3, 2, 19, 9, 8, 6, 2, 20, 5, 17, 2, 24, 5, 16],
[6, 9, 3, 2, 22, 3, 8, 7, 4, 7, 10, 2, 5, 12, 2, 13, 4, 12, 3, 2, 4, 11, 2, 6, 5, 2, 10, 4, 21, 3, 2, 13, 4, 12, 3, 2, 22, 3, 8, 7, 4, 7, 10, 16], 
[4, 2, 9, 8, 21, 3, 2, 7, 5, 6, 9, 4, 7, 10, 2, 6, 5, 2, 13, 5, 11, 3, 2, 15, 17, 6, 2, 11, 5, 22, 3, 6, 9, 4, 7, 10, 2, 6, 5, 2, 10, 8, 4, 7, 16]
]
{
True: 1, ' ': 2, 'e': 3, 'i': 4, 'o': 5, 't': 6, 'n': 7, 'a': 8,
'h': 9, 'g': 10, 's': 11, 'f': 12, 'l': 13, 'r': 14, 'b': 15, '.': 16,
'u': 17, 'p': 18, 'w': 19, 'y': 20, 'v': 21, 'm': 22, 'x': 23, 'd': 24
}

相关内容

最新更新

热门标签：