Coercing topic modeling results into a DataFrame



I used BERTopic with KeyBERT to extract some topics from some docs:

from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size= 3)
topics, probs = topic_model.fit_transform(docs)

Now I can access the topic names:

freq = topic_model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head(30)
Topic    Count   Name
0   -1       1     -1_default_greenbone_gmp_manager
1    0      14      0_http_tls_ssl tls_ssl
2    1      8       1_jboss_console_web_application

and inspect the topics:

[('http', 0.0855701486234524),          
('tls', 0.061977919455444744),
('ssl tls', 0.061977919455444744),
('ssl', 0.061977919455444744),
('tcp', 0.04551718585531556),
('number', 0.04551718585531556)]
[('jboss', 0.14014705432060262),
('console', 0.09285308122803233),
('web', 0.07323749337563096),
('application', 0.0622930523123512),
('management', 0.0622930523123512),
('apache', 0.05032395169459188)]

What I want is a final dataframe that has the topic name in one column and the elements of the topic in another column.

expected outcome:
class                         entities
0 http_tls_ssl tls_ssl           HTTP...etc
1 jboss_console_web_application  JBoss, console, etc

and a second dataframe with the topic names as separate columns:

http_tls_ssl tls_ssl           jboss_console_web_application
0 http                           JBoss
1 tls                            console
2 etc                            etc

I don't know how to do this. Is there a way?

Here is one way to do it:

Setup
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic()
# To keep the example reproducible in a reasonable time, limit to 3,000 docs
topics, probs = topic_model.fit_transform(docs[:3_000])
df = topic_model.get_topic_info()
print(df)
# Output
Topic  Count                    Name
0     -1     23         -1_the_of_in_to
1      0   2635         0_the_to_of_and
2      1    114          1_the_he_to_in
3      2    103         2_the_to_in_and
4      3     59           3_ditto_was__
5      4     34  4_pool_andy_table_tell
6      5     32       5_the_to_game_and
First dataframe

Using Pandas string methods:

df = (
    df.rename(columns={"Name": "class"})
    .drop(columns=["Topic", "Count"])
    .reset_index(drop=True)
)
# One list of words per topic; empty strings (unused word slots) become <NA>
df["entities"] = [
    [item[0] if item[0] else pd.NA for item in words]
    for words in topic_model.get_topics().values()
]
df = df.loc[~df["class"].str.startswith("-1"), :].copy()  # Remove -1 topic
df["class"] = df["class"].replace(
    r"^-?\d+_", "", regex=True
)  # remove prefix '1_', '2_', ...
print(df)
# Output
class                                                      entities
1         the_to_of_and                [the, to, of, and, is, in, that, it, for, you]
2          the_he_to_in               [the, he, to, in, and, that, is, of, his, year]
3         the_to_in_and             [the, to, in, and, of, he, team, that, was, game]
4           ditto_was__  [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
5  pool_andy_table_tell  [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
6       the_to_game_and           [the, to, game, and, games, espn, on, in, is, have]
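The same long-format table can also be sketched without a fitted model, which makes the steps easier to check in isolation. The snippet below is a minimal illustration, not the answer's exact method: `topic_words` and `topic_info` are hypothetical stand-ins for what `topic_model.get_topics()` and `topic_model.get_topic_info()` return, filled with the words and names shown in the question (scores rounded, word lists shortened).

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_topics():
# {topic_id: [(word, score), ...]}, values taken from the question and rounded.
topic_words = {
    0: [("http", 0.0856), ("tls", 0.0620), ("ssl tls", 0.0620), ("ssl", 0.0620)],
    1: [("jboss", 0.1401), ("console", 0.0929), ("web", 0.0732), ("application", 0.0623)],
}

# Hypothetical stand-in for the Topic and Name columns of get_topic_info().
topic_info = pd.DataFrame(
    {
        "Topic": [-1, 0, 1],
        "Name": [
            "-1_default_greenbone_gmp_manager",
            "0_http_tls_ssl tls_ssl",
            "1_jboss_console_web_application",
        ],
    }
)

# Keep only the words, dropping the scores.
entities = {tid: [word for word, _ in pairs] for tid, pairs in topic_words.items()}

# Drop the outlier topic (-1), then look up each row's words by topic id
# rather than relying on the two objects being in the same order.
result = topic_info.loc[topic_info["Topic"] != -1].copy()
result["entities"] = [entities[tid] for tid in result["Topic"]]

# Strip the numeric prefix ('0_', '1_', '-1_', ...) from the topic name.
result["class"] = result["Name"].str.replace(r"^-?\d+_", "", regex=True)
result = result[["class", "entities"]].reset_index(drop=True)
print(result)
```

Looking words up by topic id is a design choice: it keeps names and word lists aligned even if the two sources are ordered differently.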
Second dataframe

Using Pandas transpose:

other_df = df.T.reset_index(drop=True)
new_col_labels = other_df.iloc[0]  # save first row
other_df = other_df[1:]  # remove first row
other_df.columns = new_col_labels
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})
print(other_df)
# Output
the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
0           the          the           the       ditto                 pool             the
1            to           he            to         was                 andy              to
2            of           to            in        <NA>                table            game
3           and           in           and        <NA>                 tell             and
4            is          and            of        <NA>                   us           games
5            in         that            he        <NA>                 well            espn
6          that           is          team        <NA>                 your              on
7            it           of          that        <NA>                about              in
8           for          his           was        <NA>                 <NA>              is
9           you         year          game        <NA>                 <NA>            have
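As a possible alternative to transposing, the wide table can be built directly from a mapping of topic name to word list. This is a minimal sketch under assumptions: `words_by_topic` is a hypothetical dictionary using the words from the question, and the second list is deliberately cut short to show how topics of unequal length are handled. Wrapping each list in a `pd.Series` lets pandas pad the shorter columns with `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical mapping of topic name -> words (taken from the question).
# The second list is shortened on purpose to show unequal lengths.
words_by_topic = {
    "http_tls_ssl tls_ssl": ["http", "tls", "ssl tls", "ssl", "tcp", "number"],
    "jboss_console_web_application": ["jboss", "console", "web", "application"],
}

# One column per topic; each Series is aligned on its integer index,
# so shorter topics are padded with NaN at the bottom.
wide = pd.DataFrame({name: pd.Series(words) for name, words in words_by_topic.items()})
print(wide)
```

With the long dataframe from the first step, the same mapping could be obtained with `dict(zip(df["class"], df["entities"]))`.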
