I have a CSV file with two columns, Query and Description. Here is a sample of the file:
| Query | Description |
| -------- | -------------- |
| What is the type of <mach-name> machine | <mach-name> is ... |
| What is the use of <mach-name> machine | The use of <mach-name> is ... |
| How long it takes to rain in <state-name> | It rains for ... hours in <state-name> |
| What is the best restaurant in <state-name> | <state-name>'s best food is in ... |
... etc.
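Each Query and Description column contains unique strings like these. Assume the CSV file is read into a dataframe `df` with Pandas. The goal is to replace the `<>`-type elements, such as `<mach-name>`, based on certain conditions.

The `<>` tags need to be replaced with the elements of these lists:

    mach_name = ["Drilling", "ABC", "XYZ", ... etc.]
    state_name = ["New York", "London", "Delhi", ... etc.]

Example: if `<mach-name>` occurs in both the Query column and the Description column of any row, replace the tag with the corresponding element of the `mach_name` list. So, for example, if the `mach_name` list has 10 elements, one such sentence per element needs to be appended to the dataframe `df`. The expected output looks like this: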
| Query | Description |
| -------- | -------------- |
| What is the type of Drilling machine. | Drilling is ... |
| What is the type of ABC machine. | ABC is ... |
| What is the type of XYZ machine. | XYZ is ... |
| What is the use of Drilling machine | The use of Drilling is ... |
| What is the use of ABC machine | The use of ABC is ... |
| What is the use of XYZ machine. | The use of XYZ is ... |
| How long it takes to rain in New York | It rains for ... hours in New York |
| How long it takes to rain in London | It rains for ... hours in London |
| How long it takes to rain in Delhi | It rains for ... hours in Delhi |
| What is the best restaurant in New York | New York's best food is in ... |
| What is the best restaurant in London | London's best food is in ... |
| What is the best restaurant in Delhi | Delhi's best food is in ... |

... etc.
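I was hoping to do a simple Python replace using `str.replace()`, but that would probably involve a `for` loop iterating over the Pandas dataframe, and the answers I have found advise against iterating over a dataframe. I could not find a clear way to replace values based on these conditions while also adding new rows based on the list elements. Any help/guidance is appreciated. Thanks.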
It would be easier to read the raw CSV, process it, and then convert the result into a pandas dataframe, but if you need to read it into a dataframe first, this could be an option:

    import pandas as pd

    data = [
        {"query": "What is the type of <mach-name> machine", "description": "<mach-name> is ..."},
        {"query": "What is the use of <mach-name> machine", "description": "The use of <mach-name> is ..."},
        {"query": "How long it takes to rain in <state-name>", "description": "It rains for ... hours in <state-name>"},
    ]
    df = pd.DataFrame(data)

    # mark rows that satisfy the conditions (tag present in both columns)
    df["replace_mach"] = (df["query"].str.contains("<mach-name>", regex=False)
                          & df["description"].str.contains("<mach-name>", regex=False))
    df["replace_state"] = (df["query"].str.contains("<state-name>", regex=False)
                           & df["description"].str.contains("<state-name>", regex=False))

    dfs_list = []
    mach_name = ["Drilling", "ABC", "XYZ"]
    state_name = ["New York", "London", "Delhi"]

    # one copy of the marked rows per list element; regex=False treats the tag literally
    for n in mach_name:
        aux = df[df["replace_mach"]].copy()
        aux["query"] = aux["query"].str.replace("<mach-name>", n, regex=False)
        aux["description"] = aux["description"].str.replace("<mach-name>", n, regex=False)
        dfs_list.append(aux)

    for n in state_name:
        aux = df[df["replace_state"]].copy()
        aux["query"] = aux["query"].str.replace("<state-name>", n, regex=False)
        aux["description"] = aux["description"].str.replace("<state-name>", n, regex=False)
        dfs_list.append(aux)

    # add records without wildcards to the dataframe
    dfs_list.append(df[~(df["replace_mach"] | df["replace_state"])])
    replaced_df = pd.concat(dfs_list)
    replaced_df
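
For the other route mentioned at the start of this answer (process the raw CSV rows first, then build the dataframe), a minimal sketch could look like the following. It uses only the standard library; `RAW_CSV`, `REPLACEMENTS`, `expand_row` and `expand_rows` are hypothetical names made up for illustration, the in-memory string stands in for the real file, and the third sample row is invented to show what happens to rows without a tag. It assumes at most one kind of tag per row, as in the sample data.

```python
import csv
import io

# Hypothetical stand-in for the CSV file; a real file would be opened with open(...).
RAW_CSV = """Query,Description
What is the type of <mach-name> machine,<mach-name> is ...
How long it takes to rain in <state-name>,It rains for ... hours in <state-name>
What time is it,It is ...
"""

# Placeholder tag -> replacement values (lists taken from the question).
REPLACEMENTS = {
    "<mach-name>": ["Drilling", "ABC", "XYZ"],
    "<state-name>": ["New York", "London", "Delhi"],
}


def expand_row(row, replacements):
    """Return one copy of `row` per replacement value, or the row unchanged."""
    for tag, values in replacements.items():
        # Condition from the question: the tag must appear in both columns.
        if all(tag in text for text in row.values()):
            return [
                {col: text.replace(tag, value) for col, text in row.items()}
                for value in values
            ]
    # No tag is present in every column: keep the row as it is.
    return [dict(row)]


def expand_rows(rows, replacements):
    """Expand every row in order, so copies of one template stay together."""
    expanded = []
    for row in rows:
        expanded.extend(expand_row(row, replacements))
    return expanded


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
expanded = expand_rows(rows, REPLACEMENTS)
for row in expanded:
    print(row["Query"], "|", row["Description"])
```

The resulting list of dicts can be passed to `pd.DataFrame(expanded)`. Because each template row is expanded in place, the rows come out grouped by template, which matches the order of the expected output in the question.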