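Right now I parse my file with skiprows, but skiprows is unreliable because the data can change. I would like to skip rows based on keywords such as "Ferrari", "Apple", and "Baseball". How can I do that? Could you provide some examples?
Edit: If possible, another solution that would work better for me is to skip n rows from the start and then stop reading values once a blank entry is reached in a column. Is that possible?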
import pandas as pd
import pyodbc
df = pd.read_csv(r'C://mycsvfile.csv', skiprows=[3,108,109,110,111,112,114,115,116,118])
"""
Step 2 Specify columns we want to import
"""
columns = ['Run Date','Action','Symbol','Security Description','Security Type','Quantity','Price ($)','Commission ($)','Fees ($)','Accrued Interest ($)','Amount ($)','Settlement Date']
df_data = df[columns]
records = df_data.values.tolist()
print(df)
You can check each column for the keywords you want and drop the rows where a keyword appears.
df = df[~df["Run Date"].str.contains("Ferrari", na=False)]
Put it in a loop to cover every keyword and column.
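As a rough sketch of what such a loop could look like: the sample data, the `Action` values, and the names `columns_to_check`, `keep`, and `filtered` are assumptions made for illustration, not taken from the question's file.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the question's file.
df = pd.DataFrame({
    "Run Date": ["01/02/2021", "Ferrari note", "01/03/2021", None],
    "Action": ["BUY", "n/a", "Apple dividend", "SELL"],
})

keywords = ["Ferrari", "Apple", "Baseball"]
columns_to_check = ["Run Date", "Action"]

# Start with every row kept, then clear the flag for each keyword hit.
keep = pd.Series(True, index=df.index)
for col in columns_to_check:
    for kw in keywords:
        # regex=False treats the keyword as literal text, case=False ignores case,
        # and na=False makes missing cells count as "no match".
        hit = df[col].astype(str).str.contains(kw, case=False, regex=False, na=False)
        keep = keep & ~hit

filtered = df[keep]
print(filtered)
```

Building one boolean mask and filtering once at the end avoids re-slicing the DataFrame on every pass through the loop.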
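There are several ways to do this. Here is my solution.
- Lowercase all keywords to remove case sensitivity
- Define the columns to check for keywords (I can change this to check all columns if needed)
- Concatenate the columns so they are all checked at once instead of iterating over each column
- Lowercase the cell contents as well (see the first step)
- Keep the rows that do not contain a keyword
Code: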
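import pandas as pd

df = pd.DataFrame([['I love apples.', '', 1, 'Jan 1, 2021'],
                   ['Apple is tasty.', 'Ferrari', 2, 'Jan 2, 2022'],
                   ['This does not contain a keyword', 'Nor does this.', 15, 'Mar 1, 2021'],
                   ['This row is ok', 'But it has baseball in it.', 34, 'Feb 1, 2021']],
                  columns=['A', 'B', 'Value', 'Date'])

# Build one lowercase regex alternation: 'ferrari|apple|baseball'
keywords = ['Ferrari', 'Apple', 'Baseball']
keywords = '|'.join(keywords).lower()

columns_to_check = ['A', 'B', 'Value']
# Join the checked columns into one lowercase string per row (a space between
# cells prevents accidental matches across column boundaries)
combined = df[columns_to_check].astype(str).agg(' '.join, axis=1).str.lower()
# Keep only the rows that contain none of the keywords
df = df[~combined.str.contains(keywords)]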
Input:
print(df.to_string())
                                 A                           B  Value         Date
0                   I love apples.                                  1  Jan 1, 2021
1                  Apple is tasty.                     Ferrari      2  Jan 2, 2022
2  This does not contain a keyword              Nor does this.     15  Mar 1, 2021
3                   This row is ok  But it has baseball in it.     34  Feb 1, 2021
Output:
print(df.to_string())
                                 A               B  Value         Date
2  This does not contain a keyword  Nor does this.     15  Mar 1, 2021
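The edit to the question also asks about skipping n rows from the start and then stopping at the first blank entry in a column. One possible sketch is below. It assumes the number of leading lines `n` is known, that the blank entry shows up as a missing value in the `Run Date` column, and that the file contents in `raw` are a made-up stand-in for the real CSV.

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines, a header, data rows,
# then a blank entry followed by footer text that should be ignored.
raw = """report generated by broker
account 12345
Run Date,Action,Quantity
01/02/2021,BUY,10
01/03/2021,SELL,5
,,
footer note,Ferrari,
"""

n = 2  # number of leading lines to skip (assumed known)

# skip_blank_lines=False keeps completely empty lines as NaN rows,
# so they can also serve as the stop marker.
df = pd.read_csv(io.StringIO(raw), skiprows=n, skip_blank_lines=False)

# Empty fields are read as NaN by default; find the first one in "Run Date"
# and keep only the rows before it.
blank = df["Run Date"].isna()
if blank.any():
    first_blank = int(blank.to_numpy().argmax())  # position of the first True
    df = df.iloc[:first_blank]

print(df)
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path. This reads the whole file and then truncates, which is simple but not the cheapest option for very large files.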