Removing rows that have duplicates
Founder
2024-04-06 13:44:22

【Question】

I have a CSV file to which both duplicate and unique data are added on a daily basis, so it accumulates many duplicates. I have to remove the duplicates based on specific columns. For example:

csvfile1:

title1	title2	title3	title4	title5
abcdef 	12	13	14	15
jklmn 	12	13	56	76
abcdef 	12	13	98	89
bvnjkl 	56	76	86	96

Now, based on title1, title2 and title3, I have to remove the duplicates and write the unique entries to a new CSV file. As you can see, the abcdef row is not unique: it repeats on title1, title2 and title3, so every copy of it should be removed, and the output should look like this:

Expected Output CSV File:

title1 title2 title3 title4 title5
jklmn  12     13     56     76
bvnjkl 56     76     86     96

The code I have tried is below. Script that writes the CSV input file:

import csv

f = open("1.csv", 'a+')
writer = csv.writer(f)
writer.writerow(("t1", "t2", "t3"))
# This list changes daily, so new and duplicate data get added daily
a = [["a", 'b', 'c'], ["g", "h", "i"], ['a', 'b', 'c']]
for i in range(2):
    writer.writerow((a[i]))
f.close()

Duplicate removal script:

import csv

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)

My Output: 2.csv:

t1 t2 t3
a  b  c
g  h  i

Now, I do not want a b c in 2.csv at all. Based on t1 and t2, only the unique row g h i should remain.

Someone posted the solution below, but the original poster said they could not understand it:

import csv

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()
    seentwice = set()
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    rows = []
    for row in reader:
        if (row[0],row[1]) in seen:
            seentwice.add((row[0],row[1]))
        seen.add((row[0],row[1]))
        rows.append(row)
    for row in rows:
        if (row[0],row[1]) not in seentwice:
            writer.writerow(row)
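
Since the original poster found the two-set version hard to follow, here is a minimal sketch of the same two-pass idea written with collections.Counter. The function name drop_repeated_keys and the key_cols parameter are hypothetical, introduced only for illustration. The sketch works on rows already loaded in memory and, like the snippet above, treats the header line as an ordinary row.

```python
from collections import Counter

def drop_repeated_keys(rows, key_cols=(0, 1)):
    """Keep only rows whose key (the values in key_cols) occurs exactly once."""
    rows = list(rows)
    # First pass: count how often each key appears.
    counts = Counter(tuple(row[i] for i in key_cols) for row in rows)
    # Second pass: keep rows whose key was counted once, in original order.
    return [row for row in rows if counts[tuple(row[i] for i in key_cols)] == 1]

sample = [["t1", "t2", "t3"], ["a", "b", "c"], ["g", "h", "i"], ["a", "b", "c"]]
print(drop_repeated_keys(sample))
# prints [['t1', 't2', 't3'], ['g', 'h', 'i']]
```

A key that appears two or more times removes every row carrying it, which is why a b c disappears entirely instead of being kept once.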

【Answer】

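Simply group the rows by the first three fields, select the groups whose member count equals 1, and then concatenate the records of those groups. Unless there are special requirements, this kind of structured computation is much simpler and easier to follow when implemented in SPL: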

	A
1	=file("d:\\source.csv").import@t()
2	=A1.group(title1,title2,title3).select(~.len()==1).conj()
3	=file("d:\\result.csv").export@c(A2)

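A1: Read the contents of the file source.csv.

A2: Group by the first three fields, select the groups whose member count equals 1, and then concatenate the records of those groups.

A3: Write the result of A2 to the file result.csv.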
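
For readers who want to stay in Python, the group / select / concatenate steps described above can be sketched roughly as follows. This is an illustration, not a translation of the SPL engine's behavior: the function name unique_by_fields is hypothetical, the input is assumed to be non-empty and tab-delimited with a header row, and the output is written comma-separated. Surviving rows are emitted in order of first appearance.

```python
import csv
import io

def unique_by_fields(src, dst, fields, delimiter="\t"):
    """Copy rows from src to dst whose values in `fields` occur in exactly one row."""
    reader = csv.DictReader(src, delimiter=delimiter)
    groups = {}  # key tuple -> list of rows sharing that key
    for row in reader:
        groups.setdefault(tuple(row[f] for f in fields), []).append(row)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for members in groups.values():
        if len(members) == 1:          # keep groups with a single member
            writer.writerows(members)  # append that group's record to the result

# Sample data from the question, held in memory instead of on disk.
sample_text = (
    "title1\ttitle2\ttitle3\ttitle4\ttitle5\n"
    "abcdef\t12\t13\t14\t15\n"
    "jklmn\t12\t13\t56\t76\n"
    "abcdef\t12\t13\t98\t89\n"
    "bvnjkl\t56\t76\t86\t96\n"
)
out = io.StringIO()
unique_by_fields(io.StringIO(sample_text), out, ["title1", "title2", "title3"])
print(out.getvalue())
```

With real files, the same function can be called with objects returned by open(..., newline=""), which is how the csv module documentation recommends opening CSV files.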
