Removing more than 2 duplicates from a CSV file - awk

I have found the following script to remove duplicates:
awk -F, '!x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
When it finds duplicate records, instead of deleting all the duplicates and keeping only the first record, it would be amazing if it could keep the first 2 or 3 records and remove the rest. So basically it would allow the original and one duplicate, but delete the entire row of anything beyond that.
How can I adjust it so that it keeps the original record and the first duplicate, and deletes the entire rows of any duplicates beyond the first?

You can use awk like this:
awk -F, '++x[$7] <= 2' business-records.csv > business-records-deduped.csv
This keeps the first 2 records for each value of the 7th column and deletes any further dupes, as you desire.
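For example, on a hypothetical input where the same 7th-field value appears three times (the field values below are made up purely for illustration), only the first two occurrences survive:
$ printf '%s\n' 'a,b,c,d,e,f,KEY1' 'a,b,c,d,e,f,KEY1' 'a,b,c,d,e,f,KEY1' 'a,b,c,d,e,f,KEY2' | awk -F, '++x[$7] <= 2'
a,b,c,d,e,f,KEY1
a,b,c,d,e,f,KEY1
a,b,c,d,e,f,KEY2
To keep 3 copies per value instead of 2, change the 2 in the test to 3.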

I propose the following minimal amelioration of your code:
awk -F, '2>x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
Explanation: ++ here is the post-increment operator, so the order of evaluation may be somewhat counter-intuitive:
x[$7] fetches the value from array x for the key given by the content of the 7th field; if the key is not present, 0 is assumed.
2> is the test that decides about printing: if this condition holds, the line is printed.
++ then increases the value inside array x, so the next time you encounter the same 7th-field content the value will be bigger by 1.
Observe that the sole thing altered is the test: for non-negative integers, ! is true for zero and false for values above 0, which is why your original keeps only the first record.
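If the post-increment ordering still feels unintuitive, this throwaway one-liner (not part of the answer above, just a demonstration) shows the value the test sees before the increment takes effect, for one repeated key:
$ awk 'BEGIN{for(i=1;i<=4;i++){before=x["k"]; kept=(2>x["k"]++); printf "test saw %d -> %s\n", before, (kept ? "printed" : "skipped")}}'
test saw 0 -> printed
test saw 1 -> printed
test saw 2 -> skipped
test saw 3 -> skipped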

Related

OpenRefine: Remove row if specific cell in this row is empty

The input for OpenRefine is a csv file containing data like this
phy,205.4,,,Unterwasserakustik. Sonar-Technik,,
phy,205.6,,,Lärm. Lärmbekämpfung. Schallschutz. Filter (vgl.a.san 525),,
phy,205.9,,,Sonstiges,,
,,,,,,
,,Wärme. Statistische Physik (Temperaturstrahlung s. phy 495),,,,
,220,,Gesamtgebiet,,,
I would like to remove all rows where the second column (the numeric code) is empty.
In OpenRefine I created a Facet->CustomizedFacet->FacetByBlank on the second column. In the menu that appears on the left, I clicked true (197 false, 2 true - which is correct). Then I went to All->EditRows->RemoveAllMatchingRows. Instead of removing only the two rows, OpenRefine removes 143 rows and no data is shown anymore.
What has happened? And how can I remove only the two rows with an empty second column?
It might be connected to the row counter in the All column: the first time the entry "phy" in the first column is missing, there is no row count anymore.
1. phy 205.4 ...
2. phy 205.6 ...
3. phy 205.9 ...
Wärme...
220 ...
The 220 row does not contain "phy" in its first column and is incorrectly ignored.
It looks like you may be operating in "record mode" as opposed to "row mode." If the facet says 197 false, 2 true and you have selected true, you should only see two rows displayed on the screen when you go to do your delete. If you see more than that, try selecting row mode.

How to split a column containing two records separately

I have millions of observations in different columns, and one of the columns contains records of two factors together: for instance, 136789. I want to split the first character (1) and the rest (36789) into separate columns for all observations.
The field looks like this
#136789
I want to see like this
#1 36789
You can make use of the sub() function.
For example:
kent$ awk 'BEGIN{x="123456";sub(/^./,"& ",x);print x}'
1 23456
In your code, you need to apply sub() to the relevant column ($x), as sketched below.
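A minimal sketch of what that could look like on a whole file, assuming the combined values sit in column 3 and the leading # in your example is just a marker rather than part of the data (the column number and file name are placeholders):
$ awk 'BEGIN{FS=OFS=","} {sub(/^./, "& ", $3)} 1' yourfile.csv
If you want the two parts to become genuinely separate CSV columns rather than a space-separated pair, replace "& " with "&," so that the inserted character is the field separator itself.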

Numbering repeated values in column in Excel using VBA

I have a column with varying values, and some of these values can sometimes be repeated. If there are two of the same value, I need to have the first value followed by 1 and the second followed by 2.
For Example:
Apple1
Apple2
Lemon1
Apple3
Pear1
Lemon2
Apple4
Orange1
Pear2
I've tried using nested if loops but I can't seem to find an efficient way to do this.
You can use 2 loops to go through all the elements.
By the way, you can add one more step that checks whether the last character is already numeric and skips that cell, for faster processing.

AWK: Ignore lines grouped by an unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to first group lines by a unique value in the first field and accumulate the occurrences of a specific value in another field within each group of lines. If the number of occurrences doesn't meet a self-defined threshold, the lines in that group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that "because the unique values in the first field 222 and 444 don't have at least one (which can be any desired threshold) P in the third field, lines corresponding to 222 and 444 are ignored."
Furthermore, this should be done without editing the original file and have to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. By doing this, a few lines will not be involved in the resulted split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field together with a per-group count c, which we use later. If the third field is P, a key is set in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for the value, then print the lines from a.
You mention a threshold number of entries in your question. If by that you mean that there must be N occurrences of "P" in order for the lines to be printed, you could change {p[$1]} to {++p[$1]}, then change if(i in p) to if(p[i]>=N) in the END block, as sketched below.
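Put together, a thresholded version might look like this (N=2 is just an example threshold and file is a placeholder name):
$ awk -F, -v N=2 '{a[$1,++c[$1]]=$0} $3=="P"{++p[$1]} END{for(i in c)if(p[i]>=N)for(j=1;j<=c[i];++j)print a[i,j]}' file
Note that, as in the original one-liner, for (i in c) does not guarantee any particular order of the groups in the output.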

How to change a particular column value for certain number of rows in Pig latin

I have a Pig file with, say, 10000 rows. Is there any quick way I can change the value of a certain column for, say, the first 1000 rows?
Since some info is missing, I will make a few assumptions and then offer a solution.
by "first 1000 rows" you mean that you can order the records using some column
you wish to change the value of column $1 in the first 1000 records when ordering by column $2
The following code snippet will do what you asked for:
a = load ...
b = rank a by $2;
c = foreach b generate $0, (rank_a <= 1000 ? 3*$1 : $1), $2..;
Alternatively, use the FOREACH and LIMIT operators to achieve the effect.