Script to find duplicates in a CSV file


Question

I have a 40 MB CSV file with 50,000 records. It's a giant product listing. Each row has close to 20 fields [Item#, UPC, Desc, etc.].

How can I:

a) Find and print duplicate rows? [This file was built by appending, so it contains multiple header rows that I need to remove, which is why I want to know exactly which rows are duplicated first.]

b) Find and print duplicate rows based on a column? [To see whether a single UPC is assigned to multiple products.]

I need to run the command or script on the server, where I have Perl and Python installed. A bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried,

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected answer.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks


See: Remove duplicate rows from a large file in Python over on Stack Overflow

Try the following:

# Sort before using the uniq command
sort largefile.csv | uniq -d

uniq is a very basic command and only reports duplicates that appear on adjacent lines, which is why the input has to be sorted first.
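For part (b), here is a minimal sketch for finding duplicates on a single column, assuming the UPC is the second comma-separated field and that no field contains embedded commas or quotes:

# List the UPC values that occur on more than one row
cut -d, -f2 largefile.csv | sort | uniq -d

# Print the full rows whose UPC occurs more than once (reads the file twice:
# the first pass counts UPCs, the second pass prints rows whose count is > 1)
awk -F, 'NR==FNR { count[$2]++; next } count[$2] > 1' largefile.csv largefile.csv

If the UPC sits in a different position, change the field number in both commands; a file with quoted fields that contain commas would need a real CSV parser instead.

For part (a), once you know which rows are duplicated, keeping only the first occurrence of every line (which also drops the repeated header rows) can be done in one pass; deduped.csv is just an example output name:

# Keep the first occurrence of each line, drop later repeats
awk '!seen[$0]++' largefile.csv > deduped.csv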

