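Is there an R or Python function for separating information in a non-delimited string, where the information varies?

I am currently cleaning up a messy data sheet in which all the information is given in a single Excel cell, with no delimiter between the different attributes (no commas, and the spaces are random). So my problem is how to separate the different pieces of information when there is no delimiter I could rely on in my code (I can't use a split command).

I assume I need to describe some characteristics of each piece of information so that the corresponding attribute can be recognized. However, I have no idea how to do that, since I am quite new to Python and have only used R for regression models and other statistical analysis.

Short data example. Input: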

"WMIN CBOND12/05/2022 23554132121"

or

"WalMaInCBND 12/05/2022-23554132121"

or

"WalmartI CorpBond12/05/2022|23554132121"

Expected output:

"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"

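So each "x" should be sorted into a new column with the corresponding header (company, security, maturity, account number).

As you can see, the inputs vary randomly, but I want the same output for each of the three inputs given above (I have more than 200k data points with different companies, securities, etc.).

The first question is how to separate the information efficiently when there is no systematic pattern to rely on.

The second question (lower priority) is how to identify the companies without having to build a dictionary with 50 different input variants for each of 50k companies.

Thanks for your help!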

I would suggest first introducing useful separators wherever possible, and building a dictionary of replacements that can then be applied with regular expressions.

import re

s = 'WMIN CBOND12/05/2022 23554132121'

# CAREFUL: this is not a real date regex, it should just
# illustrate the principle of regex
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile(r'([0-9]{2}/[0-9]{2}/[0-9]{4})')

# prepend a whitespace before the date
# this is achieved by searching the date within the string
# and replacing it with itself with a prepended whitespace
# \1 means "insert the first capture group", which in our
# case is the date
s = re.sub(date_re, r' \1', s)

# split by one or more whitespaces and insert
# a separator (';') to make working with the string
# easier
s = ';'.join(s.split())

# build a dictionary of replacements
replacements = {
    'WMIN': 'Walmart Inc.',
    'CBOND': 'Corporate Bond',
}

# for each replacement apply substitution
# a better, but more complicated solution for
# this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
    s = re.sub(pattern, r, s)

# use our custom separator to split the parts
out = s.split(';')
print(out)
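
The steps above split on whitespace only, so the inputs that put '-' or '|' next to the date need one more normalization step. Here is a minimal sketch of that idea, assuming the date is always MM/DD/YYYY with slashes and that whitespace, '-' and '|' are the only separator characters. DATE_RE and to_fields are names made up for this illustration.

```python
import re

# Assumption: dates look like MM/DD/YYYY, and only whitespace, '-' and '|'
# ever appear directly around the date.
DATE_RE = re.compile(r"[\s|-]*(\d{2}/\d{2}/\d{4})[\s|-]*")

def to_fields(s):
    """Surround the date with single spaces, then split on whitespace."""
    return DATE_RE.sub(r" \1 ", s).split()

samples = [
    "WMIN CBOND12/05/2022 23554132121",
    "WalMaInCBND 12/05/2022-23554132121",
    "WalmartI CorpBond12/05/2022|23554132121",
]
for raw in samples:
    print(to_fields(raw))
```

The first and third inputs come out as four fields. The second one yields only three, because company and security are still glued together as "WalMaInCBND"; that part needs a lookup rather than a separator.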

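Using Python and regular expressions:

import re

def make_filter(pattern):
    # compile once, reuse for every string
    pattern = re.compile(pattern)
    # note: the name "filter" shadows Python's built-in filter()
    def filter(s):
        filtered = pattern.match(s)
        return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
    return filter

filter = make_filter(r"^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
filter("WMIN CBOND12/05/2022 23554132121")

The make_filter function is just a utility that lets you change the pattern. It returns a function that filters a string according to that pattern. I use it with the pattern "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$", which expects some text, a whitespace, some text, a date, a whitespace, and a number. If you want this pattern modified, please provide more information about the data. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").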
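
One caveat with the filter above: pattern.match returns None for a string that does not fit, so the .group calls raise an AttributeError on such rows. A hedged variation is sketched below, using named groups and returning None for rows that need manual attention. RECORD_RE and parse_record are hypothetical names, and the set of tolerated separators (whitespace, '-' and '|') is an assumption based on the three sample inputs.

```python
import re

# Assumption: company and security are letter-only tokens separated by
# whitespace; the date may be glued to the security; whitespace, '-' or '|'
# may sit between the date and the account number.
RECORD_RE = re.compile(
    r"^(?P<company>[A-Za-z]+)\s+"
    r"(?P<security>[A-Za-z]+)\s*"
    r"(?P<maturity>\d{1,2}/\d{1,2}/\d{4})[\s|-]*"
    r"(?P<account>\d+)$"
)

def parse_record(s):
    """Return a dict of the four fields, or None if the row does not fit."""
    m = RECORD_RE.match(s.strip())
    return m.groupdict() if m else None

print(parse_record("WMIN CBOND12/05/2022 23554132121"))
print(parse_record("WalMaInCBND 12/05/2022-23554132121"))  # no match: None
```

Collecting the rows that come back as None gives a list of inputs that need a different strategy, instead of a crash in the middle of 200k rows.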

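Welcome! Yes, we would definitely need to see more examples, and regex seems to be the way to go... but since there does not seem to be any structure, I think it is better to treat this as a series of separate steps.

We know there is a date of the form (X)X/(X)X/XXXX (i.e. a one- or two-digit day, a one- or two-digit month, and a four-digit year, perhaps with or without slashes, right?), followed by a number. So solve that part first, leaving only the first two categories. That is actually the easy part :) but don't be discouraged!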
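
A rough Python sketch of this "date and number first" step, assuming the slashes are always present (the variant without slashes is ambiguous and is not handled here). TAIL_RE and split_tail are names invented for the example.

```python
import re

# Assumption: the date keeps its slashes and is followed, with or without
# separator characters, by the account number at the end of the string.
TAIL_RE = re.compile(r"(\d{1,2}/\d{1,2}/\d{4})\D*(\d+)\s*$")

def split_tail(s):
    """Return (head, date, number); head is whatever precedes the date."""
    m = TAIL_RE.search(s)
    if m is None:
        return None
    return s[:m.start()].strip(), m.group(1), m.group(2)

print(split_tail("WMIN CBOND12/05/2022 23554132121"))
print(split_tail("WMINCBOND 12/05/202223554132121"))
```

Because the year is fixed at four digits, this also separates a date and a number that are glued together, as in the second call.
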
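If those two categories might not have any separator between them (e.g. WMINCBOND 12/05/202223554132121), or if the separator is not always a separator (e.g. IMAGINARY COMPANY X CBOND), then you are in big trouble. :) But here is what we can do:
1. Gather a list of all the codes (hopefully you have one).
2. Use str_detect() on each code and see whether you can identify the exact strings in any of the data (if you have the codes, let me know and I will write the code for this part).
What is left after identifying the code will be the CBOND part, whatever that is... so do that part last: the remainder of the string will be that. Alternatively, if you have a list of whatever the CBOND values can be, you can use the same str_detect() approach.
Only after you have identified everything can you replace the codes with what they stand for. If you have a list of the codes, let me know and I will post the code.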
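
For readers working in Python rather than R, the lookup described above could look roughly like the sketch below. It is a prefix check with str.startswith rather than a literal port of str_detect(), and both code tables are invented for illustration; a real list would have to come from your data.

```python
# Hypothetical code lists, made up for this example.
COMPANY_CODES = {
    "WMIN": "Walmart Inc.",
    "WalMaIn": "Walmart Inc.",
    "WalmartI": "Walmart Inc.",
}
SECURITY_CODES = {
    "CBOND": "Corporate Bond",
    "CBND": "Corporate Bond",
    "CorpBond": "Corporate Bond",
}

def identify(head):
    """Find a known company code at the start; the rest is the security."""
    compact = "".join(head.split())  # drop the unreliable spaces
    # Longest codes first, so a short code cannot shadow a longer one.
    for code in sorted(COMPANY_CODES, key=len, reverse=True):
        if compact.startswith(code):
            rest = compact[len(code):]
            return COMPANY_CODES[code], SECURITY_CODES.get(rest, rest)
    return None, compact

print(identify("WMINCBOND"))
print(identify("WalMaInCBND"))
print(identify("IMAGINARY COMPANY X CBOND"))
```

The last call returns (None, "IMAGINARYCOMPANYXCBOND"), which marks the row as one whose company code is not in the list yet.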

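Edit

s = c("WMIN CBOND12/05/2022 23554132121",
      "WalMaInCBND 12/05/2022-23554132121",
      "WalmartI CorpBond12/05/2022|23554132121")

# leading run of letters
ID = gsub("([a-zA-Z]+).*", "\\1", s)
# run of letters after the last space that is followed by a letter
ID2 = gsub(".* ([a-zA-Z]+).*", "\\1", s)
# the date that follows the letters
date = gsub("[a-zA-Z ]+(\\d+/\\d+/\\d+).*", "\\1", s)
# everything after the last non-digit character
num = gsub("^.*[^0-9](.*$)", "\\1", s)
data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)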
           ID                                ID2       date         num
1        WMIN                              CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3    WalmartI                           CorpBond 12/05/2022 23554132121

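This works for cases 1 and 3, but I haven't figured out the logic for the second case yet: if they are not separated, how do we know where to split the string that contains both the company and the security?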
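
One possible way to approach the second case, sketched in Python under the assumption that there are far fewer security types than companies: strip a known security code from the end of the glued string, then fuzzy-match what remains against a list of official company names with difflib from the standard library. The tables and function names here are invented for illustration, and the 0.6 cutoff is simply the difflib default.

```python
import difflib

# Hypothetical tables, made up for this example.
SECURITY_CODES = {
    "CBOND": "Corporate Bond",
    "CBND": "Corporate Bond",
    "CorpBond": "Corporate Bond",
}
COMPANIES = ["Walmart Inc.", "Target Corp.", "Costco Wholesale Corp."]

def split_head(head):
    """Strip a known security code from the end; the rest is the company."""
    compact = head.replace(" ", "")
    for code in sorted(SECURITY_CODES, key=len, reverse=True):
        if compact.lower().endswith(code.lower()):
            return compact[:-len(code)], SECURITY_CODES[code]
    return compact, None

def guess_company(token, cutoff=0.6):
    """Return the closest official name, or None if nothing is close enough."""
    lowered = {name.lower(): name for name in COMPANIES}
    hits = difflib.get_close_matches(token.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

token, security = split_head("WalMaInCBND")
print(token, security, guess_company(token))
print(guess_company("WMIN"))
```

In this sketch "WalMaIn" is close enough to "Walmart Inc." to be matched, while a heavy abbreviation such as "WMIN" falls below the cutoff and returns None, so abbreviations of that kind would still need explicit dictionary entries.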
