Using Python v3.5 to delete lines that start the same but end differently, across multiple text files

I have a folder full of .GPS files, e.g. 1.GPS, 2.GPS, etc. Each file contains the following five lines:

Trace #1 at position 0.004610
$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21
$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C
$GNVTG,39.0304,T,39.0304,M,0.029,N,0.054,K,D*32
$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27

followed by the same data structure with different values in the next five lines:

Trace #6 at position 0.249839
$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21
$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C
$GNVTG,247.2375,T,247.2375,M,0.081,N,0.149,K,D*3D
$GNGGA,233706.00,3731.1971997,S,14544.3075178,E,4,09,1.0,514.689,M,,,0.71,3023*2F

(I realise that the values after $GNGSA do not change in the examples above. That is just a bad example; in the real data set they do vary!)

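I need to delete the lines that start with "$GNGSA" and "$GNVTG" (i.e. I need to delete lines 2, 3 and 4 from each group of five lines in every .GPS file).

This five-line pattern repeats a different number of times in each file (some files may contain only two five-line groups, while others may contain hundreds). Deleting the lines by absolute line number therefore will not work, because the line numbers vary from file to file.

The problem I am having (as the examples above show) is that the text following "$GNGSA" or "$GNVTG" is different every time.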

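I am currently learning Python (I am using v3.5), so I thought this would be a good project for learning some new tricks...

What I have already tried:

So far I have managed to write code that cycles through the entire folder: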

import os
indir = '/Users/dhunter/GRID01/'  # input directory
for i in os.listdir(indir):  # for each "i" (iteration) within the indir variable directory...
    if i.endswith('.GPS'):  # if the filename of an iteration ends with .GPS, then...
        print(i + ' loaded')  # print the filename to CLI, simply for debugging purposes.
        with open(indir + i, 'r') as my_file:  # open the iteration file
            file_lines = my_file.readlines()    # uses the readlines method to create a list of all lines in the file.
            print(file_lines)  # this prints the entire contents of each file to CLI for debugging purposes.

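All of the above works perfectly.

What I need help with:

How do I detect and delete the lines themselves, and then save the file (to the same location; there is no need to save under a different filename)?
The filenames usually end in ".GPS" but sometimes in ".gps" (the case is the only difference). My code above only works for the uppercase files. Apart from duplicating the whole code and changing the endswith argument, how can I make it work for both cases?

In the end, my files need to look like this: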

Trace #1 at position 0.004610
$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27
Trace #6 at position 0.249839
$GNGGA,233706.00,3731.1971997,S,14544.3075178,E,4,09,1.0,514.689,M,,,0.71,3023*2F

Any suggestions? Thanks in advance. :)

You are almost there.

import os
indir = '/Users/dhunter/GRID01/'  # input directory
for i in os.listdir(indir):  # for each "i" (iteration) within the indir variable directory...
    if i.endswith('.GPS'):  # if the filename of an iteration ends with .GPS, then...
        print(i + ' loaded')  # print the filename to CLI, simply for debugging purposes.
        with open(indir + i, 'r') as my_file:  # open the iteration file
            for line in my_file:
                if not line.startswith('$GNGSA') and not line.startswith('$GNVTG'):
                    print(line, end='')  # the line already ends with its own newline

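Going by what the others have said, you are on the right track! Where you are going wrong is the case-sensitive file extension check, and reading the entire file contents in at once (which is not wrong in itself, but may add complexity we do not need).

I have commented your code, with all the debugging removed for simplicity, to illustrate what I mean: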

import os
indir = '/path/to/files'
for i in os.listdir(indir):
    if i.endswith('.GPS'):  # This CASE SENSITIVELY checks the file extension
        with open(os.path.join(indir, i), 'r') as my_file:  # Opens the file
            file_lines = my_file.readlines()  # This reads the ENTIRE file at once into a list of lines

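So we need to fix the case-sensitivity issue and, instead of reading all the lines in, read the file line by line, check each line to see whether we want to discard it, and write the lines we are interested in to an output file.

So, incorporating @tdelaney's case-insensitive fix for the filename, we replace the if i.endswith('.GPS') line with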

if i.lower().endswith('.gps'): # Case-insensitively check the file name
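
A quick way to convince yourself that the lowercase comparison covers every spelling is to try it on a handful of names. The helper name is_gps_file and the sample filenames below are made up purely for illustration:

```python
def is_gps_file(filename):
    """Return True for names ending in .gps, whatever the letter case."""
    return filename.lower().endswith('.gps')


# Hypothetical filenames, only for illustration
names = ['1.GPS', '2.gps', '3.Gps', 'notes.txt', '4.GPS.bak']
matches = [name for name in names if is_gps_file(name)]
print(matches)
```

Only the first three names should be selected, since the other two do not end in the extension at all.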

Rather than reading the whole file in at once, we iterate over the file stream and write out each line we want to keep

with open(indir + i) as in_file, open(indir + i + 'new.gps', 'w') as out_file:  # Opens the input file for reading and creates + opens a new output file for writing - thanks @tdelaney once again!
    for line in in_file:  # This reads each line one-by-one from the in file
        if not line.startswith('$GNGSA') and not line.startswith('$GNVTG'):  # Check the line has what we want (thanks Avinash)
            out_file.write(line)  # Write the line to the new output file; it already ends with its own newline

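Note that you should make sure you open the output file outside the "for line in in_file" loop; otherwise the file is overwritten on every iteration, which erases whatever you have written so far (I suspect this is the problem you ran into with the earlier answers). If you open both files at the same time, you cannot go wrong.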
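
Putting these pieces together, here is one possible sketch that also covers the "save to the same location" part of the question. It is not taken from any of the answers: the function names, the '.tmp' suffix and the throwaway demo folder are assumptions made for illustration. The filtered lines go to a temporary file, which then replaces the original via os.replace.

```python
import os
import tempfile

UNWANTED_PREFIXES = ('$GNGSA', '$GNVTG')  # str.startswith accepts a tuple


def strip_unwanted_lines(path):
    """Rewrite the file at `path` without the $GNGSA and $GNVTG lines."""
    tmp_path = path + '.tmp'  # hypothetical temporary name next to the original
    # Both files are opened once, outside the loop over the lines
    with open(path, 'r') as in_file, open(tmp_path, 'w') as out_file:
        for line in in_file:
            if not line.startswith(UNWANTED_PREFIXES):
                out_file.write(line)  # the line still carries its own newline
    os.replace(tmp_path, path)  # swap the filtered copy in for the original


def strip_folder(indir):
    """Apply strip_unwanted_lines to every .gps/.GPS file in `indir`."""
    for name in os.listdir(indir):
        if name.lower().endswith('.gps'):
            strip_unwanted_lines(os.path.join(indir, name))


# Small demonstration on a throwaway folder
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, '1.GPS')
with open(demo_file, 'w') as f:
    f.write('Trace #1 at position 0.004610\n'
            '$GNGSA,A,3,02,06,12,19,24,25,,,,,,,2.2,1.0,2.0*21\n'
            '$GNGSA,A,3,75,86,87,,,,,,,,,,2.2,1.0,2.0*2C\n'
            '$GNVTG,39.0304,T,39.0304,M,0.029,N,0.054,K,D*32\n'
            '$GNGGA,233701.00,3731.1972590,S,14544.3073733,E,4,09,1.0,514.675,M,,,0.49,3023*27\n')
strip_folder(demo_dir)
with open(demo_file) as f:
    result = f.read()
print(result)
```

Writing to a separate file first means the original is only replaced once the filtered copy is complete. Try it on a copy of the data before pointing it at the real folder.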

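Alternatively, you can specify the file access mode when you open the file, as in

with open(indir + i + 'new.gps', 'a') as out_file:

This opens the file in append mode, a write mode that preserves the existing contents of the file and adds new data to the end instead of overwriting what is already there.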
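
The difference between the two modes is easy to see on a scratch file. The sketch below is only an illustration; the file name and the strings written are made up:

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # throwaway file

# 'w' truncates the file on every open, so only the last write survives
for text in ['first\n', 'second\n']:
    with open(demo_path, 'w') as f:
        f.write(text)
with open(demo_path) as f:
    after_w = f.read()

# 'a' keeps what is already there and adds to the end
for text in ['third\n', 'fourth\n']:
    with open(demo_path, 'a') as f:
        f.write(text)
with open(demo_path) as f:
    after_a = f.read()

print(repr(after_w))
print(repr(after_a))
```

One consequence worth keeping in mind: because append mode never clears the file, running a script twice against the same output file adds the lines a second time.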

OK, based on the suggestions from Avinash Raj, tdelaney and Sampson Oliver here on Stack Overflow, plus another friend who helped privately, this is the solution that now works:

import os
indir = '/Users/dhunter/GRID01/'  # input directory
for i in os.listdir(indir):  # for each "i" (iteration) within the indir variable directory...
    if i.lower().endswith('.gps'):  # if the filename of an iteration ends with .GPS, then...
        if not i.lower().endswith('.gpsnew.gps'):  # if the filename does not end with .gpsnew.gps, then...
            print(i + ' loaded')  # print the filename to CLI.
            with open (indir + i, 'r') as my_file:
                for line in my_file:
                    if not line.startswith('$GNGSA'):
                        if not line.startswith('$GNVTG'):
                            with open(indir + i + 'new.gps', 'a') as outputfile:
                                outputfile.write(line)
                                outputfile.write('\r\n')

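(You will see that I had to add another layer of if statement, "if not i.lower().endswith('.gpsnew.gps'):", to stop the script from processing the output files of its own previous runs. Anyone using these instructions in the future can easily remove that line.)

We switched the open mode on the third-last line to "a" for append, so that all the right lines are saved to the file instead of it being overwritten each time.

We also added the last line, which writes a line break at the end of each line.

Thanks, everyone, for the help, explanations and suggestions. Hopefully this solution will be useful to someone in the future. :)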

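2. Filenames:

if accepts any expression that returns a truth value, and you can combine expressions with the standard Boolean operators: if i.endswith('.GPS') or i.endswith('.gps'). You can also put the ... or ... expression in parentheses after the if to feel more certain, but that is not required.

Alternatively, as a less general solution (but since you want to learn some tricks :)), you can use string manipulation in this case: objects of type string have a lot of methods. '.gps'.upper() gives '.GPS'. Try it and see whether you can make use of that! (Even a string literal is a string object, and a variable holding a string behaves the same way.)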
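
As a small sketch of both ideas (the filenames are made up for illustration), note that str.endswith also accepts a tuple of suffixes, which saves writing the or by hand:

```python
ext = '.gps'
names = ['1.GPS', '2.gps', '3.Gps', 'readme.txt']  # made-up filenames

# Option 1: derive the upper-case variant from the lower-case one
by_upper = [n for n in names if n.endswith(ext) or n.endswith(ext.upper())]

# Option 2: pass both suffixes to endswith as a tuple
by_tuple = [n for n in names if n.endswith((ext, ext.upper()))]

print(by_upper)
print(by_tuple)
```

Both variants skip the mixed-case name '3.Gps', which is why this is the less general solution: it only knows the two spellings it was given.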

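1. Finding the lines:

As you can see in the other solution, you do not need to read in all the lines; you can check "on the fly" whether you want to keep them. But I will stick with your readlines approach. It gives you a list, and lists support indexing and slicing. Try:

anylist[startindex:endindex:stride], for any values, e.g. try: newlist = list(range(100))[1::5].

It is always helpful to try simple basic operations in interactive mode or at the start of a script. Here list(range(100)) is just some example list. You also see here that Python's for syntax works differently from other languages: you can iterate over any list, and if you only need integers, range() gives you a sequence of them.

So this works the same way with any other list, for example the one you get from readlines().

The slice above selects part of the list, starting at the second element, ending at the end (because the end index is omitted), and taking every 5th element. Now that you have this sublist, you can delete it from the original list. So, for the example with range: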

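a = list(range(100))  # in Python 3, range() must be turned into a list before deleting from it
del a[1::5]
print(a)

So you see, the corresponding items have been removed. Now do the same with file_lines, and then go on to delete the other lines you want to remove.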
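
Applied to the five-line groups from the question, the repeated deletion could look like the sketch below. The list is a shortened, made-up stand-in for what readlines() would return. The stride shrinks by one after each deletion because every group has become one line shorter.

```python
# Stand-in for file_lines: two five-line groups with shortened, made-up contents
file_lines = [
    'Trace #1\n', '$GNGSA,a\n', '$GNGSA,b\n', '$GNVTG,c\n', '$GNGGA,d\n',
    'Trace #6\n', '$GNGSA,e\n', '$GNGSA,f\n', '$GNVTG,g\n', '$GNGGA,h\n',
]

del file_lines[1::5]  # drops line 2 of every group; groups are now 4 lines long
del file_lines[1::4]  # drops the former line 3; groups are now 3 lines long
del file_lines[1::3]  # drops the former line 4; groups are now 2 lines long

print(file_lines)
```

This only works if every group really has exactly five lines; a single missing or extra line shifts everything that follows.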

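Then, in a new with block, open the file for writing and call writelines(file_lines), so that the remaining lines are written back to the file.

Of course, you can also take the approach of looking at the content of each line, using a for loop over your list together with startswith(). Or you can combine the two methods and check whether deleting lines by position leaves lines with the right beginnings, so that you can print an error if something unexpected turns up...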
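
A rough sketch of that combination follows. The function names and the shortened sample lines are invented for illustration, and the first group is deliberately missing its $GNVTG line to show how deletion by position can go wrong while the content check notices it.

```python
def keep_wanted(lines):
    """Filter by content: keep lines not starting with an unwanted prefix."""
    kept = []
    for line in lines:
        if not line.startswith(('$GNGSA', '$GNVTG')):
            kept.append(line)
    return kept


def find_unexpected(lines):
    """Return (index, line) pairs that break the Trace / $GNGGA alternation."""
    expected = ('Trace', '$GNGGA')
    problems = []
    for index, line in enumerate(lines):
        if not line.startswith(expected[index % 2]):
            problems.append((index, line))
    return problems


# Made-up sample: the first group lacks its $GNVTG line
lines = [
    'Trace #1\n', '$GNGSA,a\n', '$GNGSA,b\n', '$GNGGA,c\n',
    'Trace #6\n', '$GNGSA,d\n', '$GNGSA,e\n', '$GNVTG,f\n', '$GNGGA,g\n',
]

by_position = list(lines)  # work on a copy
del by_position[1::5]
del by_position[1::4]
del by_position[1::3]

by_content = keep_wanted(lines)

print(find_unexpected(by_position))  # non-empty: positional deletion went wrong
print(find_unexpected(by_content))   # empty: content-based filtering is fine
```

In a real script the non-empty result would be the cue to print an error and leave the file untouched.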

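3. Saving the file

You can close the file once you have stored the lines with readlines(). In fact, this happens automatically when the with block ends. Then just open it in 'w' mode instead of 'r' and call yourfilename.writelines(yourlist). You do not need to save explicitly; the file is saved when it is closed.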
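
As a minimal sketch of this read-then-rewrite sequence (the function name, the throwaway file and its shortened contents are assumptions for illustration):

```python
import os
import tempfile


def rewrite_in_place(path):
    """Read all lines, drop the unwanted ones, and write the rest back."""
    with open(path, 'r') as f:
        file_lines = f.readlines()  # the file is closed when this block ends

    kept = [line for line in file_lines
            if not line.startswith(('$GNGSA', '$GNVTG'))]

    with open(path, 'w') as f:  # 'w' empties the file before writing
        f.writelines(kept)      # the lines still carry their own newlines


# Demonstration on a throwaway file with shortened, made-up contents
demo_path = os.path.join(tempfile.mkdtemp(), '1.gps')
with open(demo_path, 'w') as f:
    f.write('Trace #1\n$GNGSA,a\n$GNGSA,b\n$GNVTG,c\n$GNGGA,d\n')

rewrite_in_place(demo_path)
with open(demo_path) as f:
    rewritten = f.read()
print(rewritten)
```

Because the whole file is held in memory between the two with blocks, this suits files of modest size such as the ones described in the question.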
