Check for trailing characters in a DataFrame and replace them

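I want to add two new columns to a pandas DataFrame based on the following conditions:

If a sentence ends with "...", add a new column with the value 1, otherwise 0.
If a sentence ends with "...", add a new column containing the text without the trailing "...".

Something like this: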

Text
bla bla bla ...
once upon a time
pretty little liars
Batman ...

Expected:

Text                T    Clean
bla bla bla ...     1    bla bla bla 
once upon a time    0    once upon a time 
pretty little liars 0    pretty little liars
Batman ...          1    Batman

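I tried applying a regex, but str.endswith is probably a better way to check whether a sentence ends with ..., since it yields a boolean value (my T column).

I have already tried df['Text'].str.endswith('...'), but I need to create a new column containing 1s and 0s. To clean the text, I would check whether T is true and, if so, remove the ... at the end, for example: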

df['Clean'] = df['Text'].str.rstrip('...')

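or df['Clean'] = df['Text'].str[:-3] (but this does not include any logical condition or information about the ...)

or df['Clean'] = df['Text'].str.replace(r'...$', '')

The important point is that I only want to consider sentences that end with ..., so that I don't remove a ... in the middle of a sentence, where it has a different meaning.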

For the first column, I would use the approach you suggested:

df['T'] = df['Text'].str.endswith('...')

(Technically, this creates a boolean column rather than an integer column. If you care about that, you can convert it with astype().)
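As a minimal sketch of that conversion, assuming the sample data from the question (the column name T follows the question's expected output):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Text": ["bla bla bla ...", "once upon a time",
                            "pretty little liars", "Batman ..."]})

# str.endswith returns a boolean Series; astype(int) maps True/False to 1/0.
df["T"] = df["Text"].str.endswith("...").astype(int)
flags = df["T"].tolist()
print(flags)
```

Keeping the boolean column is also fine if it will only be used as a mask; the integer cast matters mainly for display or export.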

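For the second column, I would just replace unconditionally:

df['Clean'] = df['Text'].str.replace(r'\.\.\.$', '', regex=True)

If the text doesn't end with ..., this does nothing. Note that the dots must be escaped: an unescaped r'...$' matches any three trailing characters. Passing regex=True explicitly matters too, because recent pandas versions treat the pattern as a literal string by default.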
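To illustrate why the escaping matters, here is a small sketch comparing the two patterns. The sample strings are made up for illustration, and the optional `\s*` in the second pattern is my own addition (an assumption that the space before the ellipsis should go as well):

```python
import pandas as pd

texts = pd.Series(["bla bla bla ...", "once upon a time",
                   "wait ... what", "Batman ..."])

# Unescaped: each '.' matches any character, so the last three
# characters of every string are removed.
unescaped = texts.str.replace(r"...$", "", regex=True)

# Escaped: only a literal trailing '...' (plus any whitespace right
# before it) is removed; an ellipsis in the middle is left alone.
escaped = texts.str.replace(r"\s*\.\.\.$", "", regex=True)

print(unescaped.tolist())
print(escaped.tolist())
```

The unescaped version silently truncates "once upon a time" to "once upon a t", which is easy to miss when every test string happens to end in an ellipsis.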

If you want to replace the "trailing" ellipsis only on the rows of text that have it:

# Rows that do not match are left as NaN in both new columns.
df.loc[df['Text'].str.endswith('...'), 'ends_in_ellipsis'] = 1
df.loc[df['ends_in_ellipsis'] == 1, 'Text_2'] = df.loc[df['ends_in_ellipsis'] == 1, 'Text'].str.rstrip('...')

Now, if you want to do it all in one line (less readable for others, but you save a dummy column and the memory it takes up):

df.loc[df['Text'].str.endswith('...'), 'Text_2'] = df.loc[df['Text'].str.endswith('...'), 'Text'].str.rstrip('...')
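A variation on the conditional approach, sketched under the assumption that non-matching rows should keep their original text rather than become NaN. The mask is computed once and reused; the column names T and Clean follow the question, and stripping the leftover space is my own choice:

```python
import pandas as pd

df = pd.DataFrame({"Text": ["bla bla bla ...", "once upon a time",
                            "pretty little liars", "Batman ..."]})

# Compute the boolean mask once and reuse it for both columns.
mask = df["Text"].str.endswith("...")
df["T"] = mask.astype(int)

# Start from the original text, then overwrite only the matching rows:
# drop the last three characters and any whitespace left behind.
df["Clean"] = df["Text"]
df.loc[mask, "Clean"] = df.loc[mask, "Text"].str[:-3].str.rstrip()

print(df)
```

Because the slice is applied only where the mask is true, str[:-3] is safe here even though it carries no condition of its own.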

Let's try endswith + rstrip:

df['new1'] = df.Text.str.endswith('...').astype(int)
df['new2'] = df.Text.str.rstrip(' ...')  # rstrip only strips from the end, so a ... in the middle is kept
df
                  Text  new1                 new2
0      bla bla bla ...     1          bla bla bla
1     once upon a time     0     once upon a time
2  pretty little liars     0  pretty little liars
3           Batman ...     1               Batman
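One caveat worth sketching: rstrip takes a set of characters, not a suffix, so it also strips a single trailing period or a longer run of dots. The example below contrasts it with str.removesuffix, which I believe is available from pandas 1.4 onward; the sample strings are made up for illustration:

```python
import pandas as pd

s = pd.Series(["Batman ...", "The end.", "wait ... what", "Hmm...."])

# rstrip(' ...') removes ANY trailing spaces and dots, however many.
stripped = s.str.rstrip(" ...")

# removesuffix removes the literal suffix '...' at most once;
# the second rstrip() only trims leftover whitespace.
suffix_removed = s.str.removesuffix("...").str.rstrip()

print(stripped.tolist())
print(suffix_removed.tolist())
```

For the data in the question both give the same result; they differ only when a sentence ends with an ordinary period or with more than three dots.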
