I used BERTopic and KeyBERT to extract some topics from a set of docs:
from bertopic import BERTopic
topic_model = BERTopic(
    nr_topics="auto",
    verbose=True,
    n_gram_range=(1, 4),
    calculate_probabilities=True,
    embedding_model="paraphrase-MiniLM-L3-v2",
    min_topic_size=3,
)
topics, probs = topic_model.fit_transform(docs)
I can now access the topic names:
freq = topic_model.get_topic_info()
print("Number of topics: {}".format(len(freq)))
freq.head(30)
   Topic  Count                              Name
0     -1      1  -1_default_greenbone_gmp_manager
1      0     14            0_http_tls_ssl tls_ssl
2      1      8   1_jboss_console_web_application
and inspect individual topics (for example with topic_model.get_topic(0) and topic_model.get_topic(1)):
[('http', 0.0855701486234524),
('tls', 0.061977919455444744),
('ssl tls', 0.061977919455444744),
('ssl', 0.061977919455444744),
('tcp', 0.04551718585531556),
('number', 0.04551718585531556)]
[('jboss', 0.14014705432060262),
('console', 0.09285308122803233),
('web', 0.07323749337563096),
('application', 0.0622930523123512),
('management', 0.0622930523123512),
('apache', 0.05032395169459188)]
What I want is a final dataframe with the topic name in one column and the elements of the topic in another column.
Expected outcome:
                           class             entities
0           http_tls_ssl tls_ssl           HTTP...etc
1  jboss_console_web_application  JBoss, console, etc
and a second dataframe with the topic names as separate columns:
  http_tls_ssl tls_ssl  jboss_console_web_application
0                 http                          JBoss
1                  tls                        console
2                  etc                            etc
I don't know how to do this. Is there a way?
Here is one way to do it.
Setup:
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic()
# To keep the example reproducible in a reasonable time, limit to 3,000 docs
topics, probs = topic_model.fit_transform(docs[:3_000])
df = topic_model.get_topic_info()
print(df)
# Output
Topic Count Name
0 -1 23 -1_the_of_in_to
1 0 2635 0_the_to_of_and
2 1 114 1_the_he_to_in
3 2 103 2_the_to_in_and
4 3 59 3_ditto_was__
5 4 34 4_pool_andy_table_tell
6 5 32 5_the_to_game_and
First dataframe, using Pandas string methods:
df = (
    df.rename(columns={"Name": "class"})
    .drop(columns=["Topic", "Count"])
    .reset_index(drop=True)
)
# Keep only the words (drop the scores); empty words become NA
df["entities"] = [
    [word if word else pd.NA for word, _ in words]
    for words in topic_model.get_topics().values()
]
# Remove the -1 (outlier) topic; .copy() avoids a SettingWithCopyWarning below
df = df.loc[~df["class"].str.startswith("-1"), :].copy()
# Remove the numeric prefix ('0_', '1_', ...); note the escaped \d
df["class"] = df["class"].replace(r"^-?\d+_", "", regex=True)
print(df)
# Output
class entities
1 the_to_of_and [the, to, of, and, is, in, that, it, for, you]
2 the_he_to_in [the, he, to, in, and, that, is, of, his, year]
3 the_to_in_and [the, to, in, and, of, he, team, that, was, game]
4 ditto_was__ [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
5 pool_andy_table_tell [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
6 the_to_game_and [the, to, game, and, games, espn, on, in, is, have]
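The same table can also be built directly from the topic dictionary, which avoids relying on get_topic_info() and get_topics() listing topics in the same order. This is only a sketch under some assumptions: `topic_words` stands in for the return value of topic_model.get_topics() (topic id mapped to a list of (word, score) tuples), its values for topics 0 and 1 are copied from the question, the scores for topic -1 are placeholders, and the helper name `build_class_entities` is made up for illustration. It also assumes the class name is the first four words joined by underscores, which is what the names in the question look like.

```python
import pandas as pd

# Stand-in for topic_model.get_topics(); topics 0 and 1 are copied from the question.
topic_words = {
    -1: [("default", 0.0), ("greenbone", 0.0), ("gmp", 0.0), ("manager", 0.0)],  # placeholder scores
    0: [("http", 0.0855701486234524), ("tls", 0.061977919455444744),
        ("ssl tls", 0.061977919455444744), ("ssl", 0.061977919455444744),
        ("tcp", 0.04551718585531556), ("number", 0.04551718585531556)],
    1: [("jboss", 0.14014705432060262), ("console", 0.09285308122803233),
        ("web", 0.07323749337563096), ("application", 0.0622930523123512),
        ("management", 0.0622930523123512), ("apache", 0.05032395169459188)],
}


def build_class_entities(topic_words, n_name_words=4):
    """Hypothetical helper: one row per topic with a name and its words."""
    rows = []
    for topic_id, pairs in topic_words.items():
        if topic_id == -1:  # skip the outlier topic
            continue
        words = [word for word, _score in pairs if word]  # drop empty padding words
        rows.append(
            {
                "class": "_".join(words[:n_name_words]),
                "entities": ", ".join(words),
            }
        )
    return pd.DataFrame(rows)


result = build_class_entities(topic_words)
print(result)
```

Here the entities are joined into one comma-separated string, as in the expected outcome of the question; keep `words` as a list instead if the column is going to be processed further.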
Second dataframe, using Pandas transpose:
other_df = df.T.reset_index(drop=True)
new_col_labels = other_df.iloc[0] # save first row
other_df = other_df[1:] # remove first row
other_df.columns = new_col_labels
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})
print(other_df)
# Output
the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
0 the the the ditto pool the
1 to he to was andy to
2 of to in <NA> table game
3 and in and <NA> tell and
4 is and of <NA> us games
5 in that he <NA> well espn
6 that is team <NA> your on
7 it of that <NA> about in
8 for his was <NA> <NA> is
9 you year game <NA> <NA> have
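The wide table can also be built without transposing, by mapping each class name to its list of words. A small sketch, assuming a dataframe with a "class" column and a list-valued "entities" column like the one printed earlier; the two rows are copied from that output and the name `long_df` is only for illustration. Wrapping each list in pd.Series lets pandas pad columns of different lengths with NaN instead of raising an error.

```python
import pandas as pd

# Two rows copied from the first dataframe's output above.
long_df = pd.DataFrame(
    {
        "class": ["ditto_was__", "pool_andy_table_tell"],
        "entities": [
            ["ditto", "was"] + [pd.NA] * 8,
            ["pool", "andy", "table", "tell", "us", "well", "your", "about", pd.NA, pd.NA],
        ],
    }
)

# One column per class; each Series is aligned on its positional index.
wide = pd.DataFrame(
    {name: pd.Series(words) for name, words in zip(long_df["class"], long_df["entities"])}
)
print(wide)
```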