Parsing CSV data to count the rows that share a duplicate value

I have a CSV file (data.csv):

data
cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com
cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com
cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com
cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com
cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com
cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com
cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com
cn=Monty Python,ou=users,ou=store,dc=company,dc=com
cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com
cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com
cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com
cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com
cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com

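I need to parse the data so that I can count how many rows have an ou with the same name. For example, Sprinkles_HQ, Sprinkles, and sprinkles should all count as the same name. If a single row contains both Sprinkles_HQ and sprinkles (two entries with the same name), that row should still count as one, not two.

The output I want looks something like this: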

News, 4
Ice Cream, 2
Store, 4
Sprinkles, 4

My first step was to read the CSV file and turn it into a DataFrame. I did this with pandas:

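import pandas as pd

# open the file (directory holds the path to data.csv)
file = open(directory)
# read the CSV, keeping only the column I want
df = pd.read_csv(file, usecols=['data'])
# read_csv already returns a DataFrame, so this wrapper is redundant
rowData = pd.DataFrame(df)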

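Then, to make the data easier to parse, I split each row on commas and converted the result into a list of lists (one list per row). Next I removed any None values. After that, I need to move every item that starts with "OU=" into its own list, and drop any of those items that contain "user", "userMger", or "employee". This is my code so far: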

# split each row of the DataFrame on commas
lines = rowData['data'].str.split(",", expand=True)
# turn the DataFrame into a list of lists
a = lines.values.tolist()
# flatten my list of lists into a single list
employeeList = []
for i in range(len(a)):
    for j in range(len(a[0])):
        # there are some None values once converted to a list
        if a[i][j] is not None:
            employeeList.append(a[i][j])
# list for storing only OUs
ouList = []
# move only the OU items into ouList
for i in range(len(employeeList)):
    if employeeList[i].startswith('OU='):
        ouList.append(employeeList[i])
# need to iterate in reverse as I am removing items from the list
# here I remove the other items
for i in reversed(range(len(ouList))):
    if ouList[i].endswith('users') or ouList[i].endswith('userMger') or ouList[i].endswith('employee'):
        ouList.remove(ouList[i])

#my list now only contains specific OUs        
print(ouList)
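
One thing worth noting about the check above: `startswith('OU=')` is case-sensitive, so with the sample data only the entry written as `OU=Sprinkles` would match. Below is a minimal sketch of a case-insensitive check, assuming the same flat list of strings. The helper name `is_ou` is hypothetical, and the sample components are copied from one row of the data.

```python
# Hypothetical helper: case-insensitive test for an "ou=" component.
def is_ou(component):
    return component.lower().startswith("ou=")

# Components taken from one of the sample rows.
employeeList = ["cn=Harry Styles", "OU=Sprinkles", "ou=sprinkles",
                "ou=employee", "dc=company", "dc=com"]

case_sensitive = [c for c in employeeList if c.startswith("OU=")]
case_insensitive = [c for c in employeeList if is_ou(c)]

print(case_sensitive)    # ['OU=Sprinkles']
print(case_insensitive)  # ['OU=Sprinkles', 'ou=sprinkles', 'ou=employee']
```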

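I believe I'm on the right track, but my code doesn't yet remove the duplicates within each row's list, such as Sprinkles_HQ, Sprinkles, and sprinkles. Before I build employeeList, I need a way to remove those duplicates and append what remains to a new list. That would make the counting easier.

I've looked into how to remove duplicates from a list of lists. I tried something like this: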

new_list = []
for elem in a:
    # compare the element itself, not the whole list a
    if elem not in new_list:
        new_list.append(elem)

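But this doesn't account for words that start the same way. I tried using startswith and .lower(), since the data mixes upper and lower case, but it didn't work for me: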

new_list = []
for i in range(len(a)):
    for j in range(len(a[0])):
        if a[i][j].lower().startswith(a[i][j].lower()) not in new_list:
            new_list.append(a[i][j])
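
For what it's worth, the condition in this attempt compares each string with itself, so `startswith` always returns `True`, and the test then becomes `True not in new_list`. The sketch below reproduces that behavior on a made-up two-item row. In the original loop, a `None` cell would additionally raise an `AttributeError` at `.lower()`.

```python
# Made-up row with two spellings of the same department.
row = ["ou=Sprinkles_HQ", "ou=sprinkles"]

new_list = []
for item in row:
    # A string always starts with itself, so this is always True.
    check = item.lower().startswith(item.lower())
    # The membership test looks for the boolean True, not for the string,
    # and True is never equal to any of the stored strings.
    if check not in new_list:
        new_list.append(item)

print(new_list)  # ['ou=Sprinkles_HQ', 'ou=sprinkles']
```

Both entries survive, so nothing is deduplicated.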

Any suggestions would be greatly appreciated.

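I worked out a solution in several steps. My first problem was case: I needed everything in lowercase. So after appending the items to employeeList (kept here as a list of lists, one sub-list per row), I added this code:

for i in range(len(employeeList)):
    for j in range(len(employeeList[i])):
        employeeList[i][j] = employeeList[i][j].lower()

This makes everything in employeeList lowercase.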

With that solved, I needed to change ouList from a single flat list to a list of lists, so that each row's ou= entries end up in their own sub-list of ouList:

# list for storing only OUs
ouList = []
# move only the OU items into ouList, one sub-list per row
for i in range(len(employeeList)):
    ouList.append([])
    for j in range(len(employeeList[i])):
        if employeeList[i][j].startswith('ou='):
            ouList[i].append(employeeList[i][j])

Then I needed to remove any item ending in users, usermger, or employee. I iterated in reverse and used .endswith() to do this, and it ran without errors.

# need to iterate in reverse as I am removing items from the list
for i in reversed(range(len(ouList))):
    for j in reversed(range(len(ouList[i]))):
        if (ouList[i][j].endswith('users')
                or ouList[i][j].endswith('usermger')
                or ouList[i][j].endswith('employee')):
            ouList[i].remove(ouList[i][j])
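
An alternative to deleting in reverse is to build each sub-list anew and keep only the wanted entries, which avoids mutating a list while indexing into it. This is only a sketch, assuming lowercase `ou=` strings as above. The name `GENERIC_SUFFIXES` and the two sample rows are illustrative.

```python
# str.endswith accepts a tuple of suffixes, so one call covers all three.
GENERIC_SUFFIXES = ("users", "usermger", "employee")

# Two sample rows, already lowercased and reduced to their ou= components.
ouList = [
    ["ou=users", "ou=news", "ou=employee"],
    ["ou=usermger", "ou=ice cream", "ou=employee"],
]

# Keep only the entries that do not end in one of the generic suffixes.
filtered = [
    [ou for ou in row if not ou.endswith(GENERIC_SUFFIXES)]
    for row in ouList
]

print(filtered)  # [['ou=news'], ['ou=ice cream']]
```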

Then, to strip out ou= and the other unneeded strings, I used re (regular expressions, also known as regex). I appended the new values to another list called ouListStrip:

import re

# stripping ou= and other strings
ouListStrip = []
for i in range(len(ouList)):
    ouListStrip.append([])
    for j in range(len(ouList[i])):
        ou = re.sub("ou=|_hq", "", ouList[i][j])
        ouListStrip[i].append(ou)

This list outputs the following:

[['news'], ['news', 'news'], ['news', 'news'], ['news'], ['ice cream', 'ice cream'], ['ice cream'], ['store'], ['store', 'store'], ['store'], ['store'], ['sprinkles', 'sprinkles'], ['sprinkles', 'sprinkles'], ['sprinkles'], ['sprinkles']]
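
The same normalization can be wrapped in a small function, which makes it easy to check individual values. This is a sketch: `normalize_ou` is a hypothetical name, and anchoring the pattern with `^` and `$` is my own assumption, so that `ou=` and `_hq` are only removed at the start and end of the string.

```python
import re

def normalize_ou(component):
    """Lowercase an 'ou=' component, then drop the 'ou=' prefix and '_hq' suffix."""
    return re.sub(r"^ou=|_hq$", "", component.lower())

print(normalize_ou("ou=Sprinkles_HQ"))  # sprinkles
print(normalize_ou("OU=Sprinkles"))     # sprinkles
print(normalize_ou("ou=Ice Cream"))     # ice cream
```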

Now that I have a list of lists, I can remove the duplicates within each sub-list. I did this with not in, appending the results as a list of lists again.

no_repeats = []
for i in range(len(ouListStrip)):
    no_repeats.append([])
    for j in range(len(ouListStrip[i])):
        if ouListStrip[i][j] not in no_repeats[i]:
            no_repeats[i].append(ouListStrip[i][j])

no_repeats outputs the following:

[['news'], ['news'], ['news'], ['news'], ['ice cream'], ['ice cream'], ['store'], ['store'], ['store'], ['store'], ['sprinkles'], ['sprinkles'], ['sprinkles'], ['sprinkles']]
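
A shorter way to get the same per-row deduplication is `dict.fromkeys`, which keeps the first occurrence of each value in order. A sketch on a shortened version of the list:

```python
# Shortened sample of ouListStrip: some rows repeat the same name.
ouListStrip = [["news", "news"], ["ice cream"], ["sprinkles", "sprinkles"]]

# dict keys are unique and keep insertion order, so this drops repeats per row.
no_repeats = [list(dict.fromkeys(row)) for row in ouListStrip]

print(no_repeats)  # [['news'], ['ice cream'], ['sprinkles']]
```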

Finally, I merged the list of lists into a single list:

allOUs = []
for i in range(len(no_repeats)):
    for j in range(len(no_repeats[i])):
        allOUs.append(no_repeats[i][j])

allOUs outputs:

['news', 'news', 'news', 'news', 'ice cream', 'ice cream', 'store', 'store', 'store', 'store', 'sprinkles', 'sprinkles', 'sprinkles', 'sprinkles']

Then I turned this list into a dictionary, using .count() to count the items:

dict_of_counts = {item:allOUs.count(item) for item in allOUs}

Output:

{'news': 4, 'ice cream': 2, 'store': 4, 'sprinkles': 4}
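
Calling `.count()` inside the comprehension rescans the whole list for every item. `collections.Counter` from the standard library does the tally in a single pass. A sketch with a shortened list:

```python
from collections import Counter

# Shortened sample of allOUs.
allOUs = ["news", "news", "ice cream", "store", "news"]

# Counter is a dict subclass mapping each item to its number of occurrences.
counts = Counter(allOUs)

print(dict(counts))  # {'news': 3, 'ice cream': 1, 'store': 1}
```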

To make it look similar to what I wanted:

for key, value in dict_of_counts.items():
    print(key, ',', value)

Output:

news , 4
ice cream , 2
store , 4
sprinkles , 4
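
For comparison, the whole pipeline can be condensed with a set per row and a `Counter`. This is a sketch rather than a drop-in replacement. It works on the raw lines instead of a DataFrame, the names `departments` and `GENERIC` are hypothetical, and it filters generic OUs by exact match rather than with `.endswith()`, which is my own assumption about the intent.

```python
import re
from collections import Counter

# Sample rows copied from data.csv (header line omitted).
rows = [
    "cn=Clark Kent,ou=users,ou=news,ou=employee,dc=company,dc=com",
    "cn=Peter Parker,ou=News,ou=news,ou=employee,dc=company,dc=com",
    "cn=Mary Jane,ou=News_HQ,ou=news,ou=employee,dc=company,dc=com",
    "cn=Oliver Twist,ou=users,ou=news,ou=employee,dc=company,dc=com",
    "cn=Mary Poppins,ou=Ice Cream,ou=ice cream,dc=company,dc=com",
    "cn=David Tenant,ou=userMger,ou=ice cream,ou=employee,dc=company,dc=com",
    "cn=Pepper Jack,ou=users,ou=store,ou=employee,dc=company,dc=com",
    "cn=Eren Jaeger,ou=Store,ou=store,ou=employee,dc=company,dc=com",
    "cn=Monty Python,ou=users,ou=store,dc=company,dc=com",
    "cn=John Smith,ou=userMger,ou=store,ou=employee,dc=company,dc=com",
    "cn=Anne Potts,ou=Sprinkles_HQ,ou=sprinkles,dc=company,dc=com",
    "cn=Harry Styles,OU=Sprinkles,ou=sprinkles,ou=employee,dc=company,dc=com",
    "cn=James Bond,ou=Sprinkles_HQ,ou=employee,dc=company,dc=com",
    "cn=Harry Potter,ou=users,ou=sprinkles,ou=employee,dc=company,dc=com",
]

# Assumption: these OUs are generic and never name a department.
GENERIC = {"users", "usermger", "employee"}

def departments(dn):
    """Return the set of normalized department names found in one row."""
    names = set()
    for component in dn.lower().split(","):
        if component.startswith("ou="):
            name = re.sub(r"^ou=|_hq$", "", component)
            if name not in GENERIC:
                names.add(name)
    return names

# A set per row means a department is counted at most once per row.
counts = Counter()
for dn in rows:
    counts.update(departments(dn))

for name, count in counts.items():
    print(f"{name.title()}, {count}")
```

With the sample rows this prints News, 4; Ice Cream, 2; Store, 4; Sprinkles, 4, one per line. The `.title()` call is only there to match the capitalization in the desired output.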
