How to check a column for specific words/tags in Python and show their presence in new corresponding columns

I am new to Python. I have a column in an MS Excel file that uses four tags: LOC, ORG, PER and MISC. The given data looks like this:

1 LOC/Thai Buddhist temple;
2 PER/louis;
3 ORG/WikiLeaks;LOC/Southern Ocean;
4 ORG/queen;
5 PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;
6 
7 PER/Thomas Watson;
...................
...................
.............#continue up to 2,000 rows

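I want a result that shows whether each tag is present in a given row: if a tag is present, put "1" in its own new column (shown below); if it is absent, put "0". I want all four columns, LOC/ORG/PER/MISC, in this Excel file; with the given data as the first column, they would be columns 2, 3, 4 and 5. The file contains nearly 2815 rows, and each row has a different combination of these LOC/ORG/PER/MISC tags.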

My goal is to get these counts from the new columns:

total LOC, total ORG, total PER and total MISC

The result would look like this:

given data                            LOC  ORG  PER  MISC
1 LOC/Thai Buddhist temple;           1    0    0    0     #here only LOC is present
2 PER/louis;                          0    0    1    0     #here only PER is present
3 ORG/WikiLeaks;LOC/Southern Ocean;   1    1    0    0     #here LOC and ORG are present
4 PER/Eli Wallach;MISC/The Good;      0    0    1    1     #here PER and MISC are present
5    .................................................
6                                     0    0    0    0     #here no tag is present
7 .....................................................
.......................................................
..................................continue up to 2815 rows....

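I am a beginner in Python, so I did my best to search for solution code, but I could not find any program related to my problem, which is why I am posting here. I would be grateful if someone could help me.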

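I assume you have already read the data from Excel and created a DataFrame in Python using pandas (to read the Excel file, use df1 = pd.read_excel("file/path/name.xls", header=0), or header=None if the file has no header row).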

Here is the layout of the DataFrame df1:

Colnum | Tagstring
1      |LOC/Thai Buddhist temple;
2      |PER/louis;
3      |ORG/WikiLeaks;LOC/Southern Ocean;
4      |ORG/queen;
5      |PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;
6      |PER/Thomas Watson;
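
To experiment without the Excel file, the layout above can be rebuilt by hand. This is only a sketch: the column names Colnum and Tagstring are taken from the layout, and the real file may name them differently.

```python
import pandas as pd

# Rebuild the six sample rows shown in the layout above.
df1 = pd.DataFrame(
    {
        "Colnum": [1, 2, 3, 4, 5, 6],
        "Tagstring": [
            "LOC/Thai Buddhist temple;",
            "PER/louis;",
            "ORG/WikiLeaks;LOC/Southern Ocean;",
            "ORG/queen;",
            "PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;",
            "PER/Thomas Watson;",
        ],
    }
)

print(df1.shape)  # (6, 2)
```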

Now, there are several ways to search for text within a string.

I will demonstrate the find function:

Syntax: str.find(str, beg=0, end=len(string))
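
As a quick illustration of how find behaves on a single cell value, here is a small sketch on a plain Python string. The helper name has_tag is made up for this example and is not part of any library.

```python
# str.find returns the lowest index where the substring starts,
# or -1 when the substring does not occur.
row = "ORG/WikiLeaks;LOC/Southern Ocean;"

print(row.find("ORG"))  # 0
print(row.find("LOC"))  # 14
print(row.find("PER"))  # -1


def has_tag(text, tag):
    """Return 1 if tag occurs anywhere in text, else 0 (hypothetical helper)."""
    return 1 if text.find(tag) >= 0 else 0


print(has_tag(row, "LOC"), has_tag(row, "PER"))  # 1 0
```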

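str1 = "LOC"
str2 = "PER"
str3 = "ORG"
str4 = "MISC"
# Empty cells are read as NaN, so replace them with "" before searching.
tagstring = df1["Tagstring"].fillna("")
# find returns -1 when the substring is absent, so >= 0 means the tag is present.
df1["LOC"] = tagstring.apply(lambda s: 1 if s.find(str1) >= 0 else 0).astype('int')
df1["PER"] = tagstring.apply(lambda s: 1 if s.find(str2) >= 0 else 0).astype('int')
df1["ORG"] = tagstring.apply(lambda s: 1 if s.find(str3) >= 0 else 0).astype('int')
df1["MISC"] = tagstring.apply(lambda s: 1 if s.find(str4) >= 0 else 0).astype('int')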
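
One caveat with a plain substring search is that it also matches when the entity text itself happens to contain the letters of a tag. If that matters for your data, a stricter variant could require the tag to sit at the start of the cell or right after a ";" and to be followed by "/". The sketch below assumes that "TAG/text;" format; the sample rows are hypothetical and chosen only to show the difference.

```python
import pandas as pd

# Hypothetical rows: the second entity text happens to contain "LOC",
# and the third cell is empty.
df1 = pd.DataFrame({"Tagstring": ["LOC/Southern Ocean;", "ORG/LOCKHEED;", None]})

for tag in ["LOC", "PER", "ORG", "MISC"]:
    # Match the tag only at the start of the cell or right after a ";",
    # and only when it is immediately followed by "/".
    pattern = rf"(?:^|;)\s*{tag}/"
    # na=False treats empty cells as "tag not present".
    df1[tag] = df1["Tagstring"].str.contains(pattern, regex=True, na=False).astype(int)

print(df1[["LOC", "ORG"]])
```

With these rows, a plain find("LOC") would flag the second row as LOC, while the pattern above flags it only as ORG.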

If you have already read your data into df, you can do:

pd.concat([df,pd.DataFrame({i:df.Tagstring.str.contains(i, na=False).astype(int) for i in 'LOC  ORG  PER MISC'.split()})],axis=1)
Out[716]: 
Tagstring  LOC  ORG  PER    MISC 
Colnum                                                                      
1                                LOC/Thai Buddhist temple;    1    0    0       0
2                                               PER/louis;    0    0    1       0
3                        ORG/WikiLeaks;LOC/Southern Ocean;    1    1    0       0
4                                               ORG/queen;    0    1    0       0
5        PER/Sanchez;PER/Eli Wallach;MISC/The Good, The...    0    0    1       1
6                                       PER/Thomas Watson;    0    0    1       0
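
Since the stated goal is the total for each tag, the 0/1 columns can simply be summed. The following is a minimal sketch that rebuilds the sample rows plus one empty row (an assumption, to mirror the empty row in the question) and then sums each indicator column.

```python
import pandas as pd

# Sample rows from the question, plus one empty cell (None).
df = pd.DataFrame(
    {
        "Tagstring": [
            "LOC/Thai Buddhist temple;",
            "PER/louis;",
            "ORG/WikiLeaks;LOC/Southern Ocean;",
            "ORG/queen;",
            "PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;",
            "PER/Thomas Watson;",
            None,
        ]
    },
    index=pd.Index(range(1, 8), name="Colnum"),
)

tags = ["LOC", "ORG", "PER", "MISC"]
for tag in tags:
    # 1 if the tag appears in the row, 0 otherwise (empty cells count as 0).
    df[tag] = df["Tagstring"].str.contains(tag, na=False).astype(int)

# Summing a 0/1 column gives the number of rows that contain that tag.
totals = df[tags].sum()
print(totals)
```

Note that this counts rows containing a tag, not occurrences: row 5 has two PER entries but contributes 1 to the PER total. If occurrences are what you need, Series.str.count may be a better fit.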
