Text processing in Python with a given set of patterns

I have an input.csv that looks like this:

field_name,field_friendly_name,include in report
LastNm,Last_Name,
cntn_last_mod_wrkr_full_nm,Last_Name,
contact_last_nm,Last_Name,
contact_first_last_nm,Last_Name,
,Last_Name,
last_english_nm,Last_Name,
,,
last_pronunciation_nm,Last_Name,
,Last_Name,
last_nm,Last_Name,
lead_space_last_nm,Last_Name,
last_mod_usr_nm,Last_Name,
lcl_last_nm,Last_Name,
adobe_last_topic_nm,Last_Name,
last_changed_user_nm,Last_Name,
last_purchased_product_service_nm,Last_Name,
last_imported_source_nm,Last_Name,
submt_last_nm,Last_Name,
cntct_last_nm,Last_Name,
cust_submt_last_nm,Last_Name,
cust_cntct_last_nm,Last_Name,
last_mod_by_nm,Last_Name,
last_mod_als_nm,Last_Name,
last_mod_nm,Last_Name,
ship_last_nm,Last_Name,
billing_last_nm,Last_Name,
last_upd_by_nm,Last_Name,
wrkr_last_nm,Last_Name,
trns_line_itm_last_chg_psn_nm,Last_Name,
trns_line_itm_last_cre_psn_nm,Last_Name,
trns_hdr_last_chg_psn_nm,Last_Name,
altr_last_nm,Last_Name,
trns_last_chg_nm,Last_Name,
lastrepaction_nm,Last_Name,
last_build_nm,Last_Name,
LegalLastNm,Last_Name,
ManagerLastNm,Last_Name,
4-LastNm,Last_Name,
NextLevelManagerLastNm,Last_Name,
ManagerLegalLastNm,Last_Name,

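I need help writing Python code that produces a CSV file with the desired output.

Conditions:

Read each line from the input.csv file (the first line holds the column names).
Condition: if the column1 value consists only of the given set of words (in this case name, nm, last, lst, -, _, [0-9]), update column2 and column3 to found and TRUE respectively. The search should be case-insensitive.
If the condition fails, update column2 and column3 to not_found and FALSE respectively.
If column1 is empty, delete the whole row.

The output I want is as follows: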

field_name,field_friendly_name,include in report
LastNm,found,TRUE
cntn_last_mod_wrkr_full_nm,not_found,FALSE
contact_last_nm,not_found,FALSE
contact_first_last_nm,not_found,FALSE
last_english_nm,not_found,FALSE
last_pronunciation_nm,not_found,FALSE
last_nm,found,TRUE
lead_space_last_nm,not_found,FALSE
last_mod_usr_nm,not_found,FALSE
lcl_last_nm,not_found,FALSE
adobe_last_topic_nm,not_found,FALSE
last_changed_user_nm,not_found,FALSE
last_purchased_product_service_nm,not_found,FALSE
last_imported_source_nm,not_found,FALSE
submt_last_nm,not_found,FALSE
cntct_last_nm,not_found,FALSE
cust_submt_last_nm,not_found,FALSE
cust_cntct_last_nm,not_found,FALSE
last_mod_by_nm,not_found,FALSE
last_mod_als_nm,not_found,FALSE
last_mod_nm,not_found,FALSE
ship_last_nm,not_found,FALSE
billing_last_nm,not_found,FALSE
last_upd_by_nm,not_found,FALSE
wrkr_last_nm,not_found,FALSE
trns_line_itm_last_chg_psn_nm,not_found,FALSE
trns_line_itm_last_cre_psn_nm,not_found,FALSE
trns_hdr_last_chg_psn_nm,not_found,FALSE
altr_last_nm,not_found,FALSE
trns_last_chg_nm,not_found,FALSE
lastrepaction_nm,not_found,FALSE
last_build_nm,not_found,FALSE
LegalLastNm,not_found,FALSE
ManagerLastNm,not_found,FALSE
4-LastNm,found,TRUE
NextLevelManagerLastNm,not_found,FALSE
ManagerLegalLastNm,not_found,FALSE

FYI, in the desired output, the following are the only column1 values that match the condition.

LastNm,found,TRUE
last_nm,found,TRUE
4-LastNm,found,TRUE

I can do part of this with awk on Unix, and I need help doing the same in Python: which commands or packages are needed, and what is the simplest code that does this in Python?

awk -F , -v OFS=, 'gensub(/last|lst|name|nm|[0-9_-]*/,"","g",tolower($1))=="" {
    $2="found";
    print $1, $2
}' file

With this code, I get the following output:

LastNm,Found
last_nm,Found
4-LastNm,Found

Since this is a fairly broad question, here are some pointers in the right direction.

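You are working with a CSV file, so I suggest you look at Python's csv module. Its documentation has examples of reading and writing files. You will also need the re module, in particular re.compile() and re.match().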
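As a rough sketch of the re part (the pattern and the helper name `only_tokens` are illustrative assumptions, not taken from the question): `match()` only anchors at the start of the string, so a check for "consists only of these tokens" needs `fullmatch()` or an explicit end anchor.

```python
import re

# Illustrative pattern: any run of the allowed tokens, case-insensitive.
TOKEN_RUN = re.compile(r'(?:last|lst|name|nm|[0-9_-])+', re.IGNORECASE)

def only_tokens(name):
    """Return True if name is built entirely from the allowed tokens."""
    return TOKEN_RUN.fullmatch(name) is not None

# match() succeeds as soon as a prefix matches, so it accepts too much here.
print(TOKEN_RUN.match('last_mod_nm') is not None)  # True: the prefix 'last_' matches
print(only_tokens('last_mod_nm'))                  # False: 'mod' is not an allowed token
print(only_tokens('4-LastNm'))                     # True
```

With `fullmatch()` the whole field has to be consumed by allowed tokens, which is the condition described in the question.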

From there, I would build a list containing each modified row and write it to the file with the csv writer's writerows() method.
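A minimal sketch of that list-then-write approach, using in-memory text instead of real files so it can be run as is. The function name `process` and the `is_match` callback are hypothetical, and the predicate used in the demo is only a placeholder for the real check. The csv writer method that writes many rows at once is `writerows()`.

```python
import csv
import io

def process(rows, is_match):
    """Build the list of output rows; rows with an empty first field are dropped."""
    out = []
    for row in rows:
        name = row[0].strip() if row else ''
        if not name:
            continue
        if is_match(name):
            out.append([name, 'found', 'TRUE'])
        else:
            out.append([name, 'not_found', 'FALSE'])
    return out

sample = (
    "field_name,field_friendly_name,include in report\n"
    "last_nm,Last_Name,\n"
    ",Last_Name,\n"
    ",,\n"
    "last_mod_nm,Last_Name,\n"
)
reader = csv.reader(io.StringIO(sample))
header = next(reader)  # keep the header line unchanged

# Placeholder predicate for the demo; the real check would be the regex test.
rows = process(reader, is_match=lambda name: name.lower() == 'last_nm')

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Collecting the rows first keeps the transformation separate from the file handling, at the cost of holding the whole result in memory.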

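There are two separate problems here. The first is handling the CSV file: the csv module does that perfectly well. The second is testing whether a name consists only of elements from the set, ignoring case. For that I would use a regular expression with the re module.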
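One way to express "consists only of elements from the set" is to remove every allowed token and check that nothing is left, as the awk command in the question does. The sketch below builds the pattern from a word list; `ALLOWED_WORDS`, `ALLOWED_CHARS` and `build_checker` are names assumed for illustration.

```python
import re

ALLOWED_WORDS = ['last', 'lst', 'name', 'nm']  # assumed word set
ALLOWED_CHARS = '0-9_-'                        # digits, underscore, hyphen

def build_checker(words, chars):
    """Return a function testing whether a string contains only the given tokens."""
    alternatives = '|'.join(re.escape(w) for w in words)
    pattern = re.compile(rf'{alternatives}|[{chars}]+', re.IGNORECASE)
    # If removing every token leaves an empty string, nothing else was present.
    return lambda text: text != '' and pattern.sub('', text) == ''

check = build_checker(ALLOWED_WORDS, ALLOWED_CHARS)
for name in ['LastNm', 'last_nm', '4-LastNm', 'ManagerLastNm', 'lastrepaction_nm']:
    print(name, check(name))
```

Keeping the word set in a list means the pattern does not have to be edited by hand when the set changes.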

The code could be:

import csv
import re

# Any allowed token: one of the listed words, or a run of digits, '-' and '_'
rx = re.compile(r'(last)|(name)|(nm)|(lst)|([-_0-9]+)', re.I)
with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd = csv.reader(fd)
    wr = csv.writer(fdout)
    wr.writerow(next(rd))  # copy header line
    for row in rd:
        txt = row[0].strip()  # ignore leading or trailing blank characters
        if txt != '':         # reject lines with an empty first field
            if '' == rx.sub('', txt):     # only elements from the set
                wr.writerow((txt, 'found', 'TRUE'))
            else:
                wr.writerow((txt, 'not_found', 'FALSE'))


import re
import csv
rx = re.compile(r'(last)|(name)|(nm)|(lst)|([-_0-9]+)', re.I)
with open('C:\temp\input.csv') as fd, open('C:\temp\output.csv', 'w',  newline='') as fdout:
    rd = csv.reader(fd)
    wr = csv.writer(fdout)
    wr.writerow(next(rd)) # copy header line
    for row in rd:
        txt = row[0].strip()  # ignore leading or ending blank characters
        if txt != '':         # reject lines with an empty first field
            if '' == rx.sub('', txt):     # only elements from the set
                wr.writerow((txt, 'found', 'TRUE'))
            else:
                wr.writerow((txt, 'not_found', 'FALSE'))

Error:

OSError                                   Traceback (most recent call last)
<ipython-input-7-1a294c9517ff> in <module>()
      3 
      4 rx = re.compile(r'(last)|(name)|(nm)|(lst)|([-_0-9]+)', re.I)
----> 5 with open('C:\temp\input.csv') as fd, open('C:\temp\output.csv', 'w',  newline='') as fdout:
      6     rd = csv.reader(fd)
      7     wr = csv.writer(fdout)

OSError: [Errno 22] Invalid argument: 'C:\temp\\input.csv'
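
A likely cause, assuming the paths were written with single backslashes in ordinary string literals (which the message suggests): `\t` in a normal string literal is read as a tab character, so the name passed to open() is not the intended path. Raw strings or forward slashes avoid this. The directory below is simply the one from the traceback.

```python
from pathlib import PureWindowsPath

plain = 'C:\temp\\input.csv'   # '\t' is turned into a tab character
raw = r'C:\temp\input.csv'     # raw string keeps every backslash
forward = 'C:/temp/input.csv'  # forward slashes are also accepted on Windows

print('\t' in plain)           # True: the path contains a tab
print('\t' in raw)             # False
print(PureWindowsPath(raw) == PureWindowsPath(forward))  # True: same path
```

Passing `raw` or `forward` to open() in the code above should then refer to the intended file, provided it exists at that location.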
