AWK: Ignore lines grouped by a unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to first group lines by the value in the first field, then count the occurrences of a specific value in another field within each group. If the count doesn't meet a user-defined threshold, the lines in that group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
That is, because the first-field values 222 and 444 don't have at least one P in the third field (one being the threshold here, which could be any desired value), the lines for 222 and 444 are ignored.
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue "Split CSV to Multiple Files Containing a Set Number of Unique Field Values". As a result, a few lines will not appear in the resulting split files.

I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field and a per-group count c, which we use later. If the third field is P, a key is set in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for the value, print the lines stored in a. Note that for(i in c) visits keys in an unspecified order, so groups are not guaranteed to print in input order.
You mention a threshold number of entries in your question. If by that you mean that there must be at least N occurrences of "P" for the lines to be printed, you could change {p[$1]} to {++p[$1]}, then change if(i in p) to if(p[i]>=N) in the END block, passing the threshold on the command line with -v N=...
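The threshold variant described above can be sketched as follows. This is only a sketch under stated assumptions, not the answerer's exact command: the function name filter_groups, the variable N passed with -v, and the extra order array (added so that groups print in input order, which for(i in c) does not guarantee) are all introduced here for illustration.

```shell
# filter_groups N: read CSV on stdin, keep only groups (keyed on field 1)
# that contain at least N rows whose third field is "P".
# The names filter_groups, N and order are illustrative assumptions.
filter_groups() {
  awk -F, -v N="$1" '
    !($1 in c)  { order[++n] = $1 }      # remember first-seen order of keys
                { a[$1, ++c[$1]] = $0 }  # store each line under (key, sequence)
    $3 == "P"   { ++p[$1] }              # count P rows per key
    END {
      for (k = 1; k <= n; ++k) {
        i = order[k]
        if (p[i] >= N)
          for (j = 1; j <= c[i]; ++j) print a[i, j]
      }
    }'
}

# Sample input taken from the question.
input='111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0'

result=$(printf '%s\n' "$input" | filter_groups 2)
printf '%s\n' "$result"
```

With a threshold of 2, only groups 111 and 333 are printed, because 555 and 666 each have a single P row; a threshold of 1 reproduces the output requested in the question.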

Related

Removing more than 2 duplicates from a CSV file

I have found the following script to remove duplicates:
awk -F, '!x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
When it finds duplicate records, it deletes all of the duplicates and keeps only the first record. It would be great if it could instead keep the first 2 or 3 records and remove the rest, that is, allow the original and one or two duplicates but delete the entire row of any beyond that.
How can I adjust it so that it keeps the original record and the first duplicate and deletes the entire row of any further duplicates?
You can use awk like this:
awk -F, '++x[$7] <= 2' business-records.csv > business-records-deduped.csv
This keeps at most two records for each value of the 7th column (the original and its first duplicate) and deletes any further duplicates, as you asked.
I propose the following minimal modification of your code:
awk -F, '2>x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
Explanation: ++ here is the post-increment operator, so the execution order might be somewhat counter-intuitive:
x[$7] gets the value from array x for the key given by the content of the 7th field, assuming 0 if it is not present
2> is the test that decides about printing; if the condition holds, the line is printed
++ then increases the value stored in array x, so the next time the same 7th-field content is encountered the value will be bigger by 1
Observe that the only thing altered is the test: for non-negative integers, ! is true for zero and false for values above 0.
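The post-increment test explained above can be generalised so that both the column and the number of rows to keep are parameters. This is a small sketch for illustration; the function name keep_first and the variables col and max are assumptions, not part of the original answers, and the sample data is made up.

```shell
# keep_first COL MAX: keep at most MAX rows for each distinct value of column COL.
# keep_first, col and max are illustrative names.
keep_first() {
  # x[$col]++ yields the count seen so far (0 the first time), then increments it,
  # so the line is printed only while fewer than max rows with that value were seen.
  awk -F, -v col="$1" -v max="$2" 'x[$col]++ < max'
}

sample='a,1
b,2
a,3
a,4
b,5
c,6'

kept=$(printf '%s\n' "$sample" | keep_first 1 2)
printf '%s\n' "$kept"
```

Here the third row with key a (a,4) is dropped, while every other row is kept because its key has been seen fewer than two times.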

How to read csv files correctly using pandas?

I have a CSV file like the one below. I need to check whether any row has more fields than the header has columns. For example:
name,age,profession
"a","24","teacher","cake"
"b",31,"Doctor",""
"c",27,"Engineer","tea"
If I try to read it using
print(pd.read_csv('test.csv'))
it prints the following:
name age profession
a 24 teacher cake
b 31 Doctor NaN
c 27 Engineer tea
But that is wrong. It happens because the header has fewer columns than the rows. So I need to identify this scenario as an invalid CSV format. What is the best way to test for this, other than reading the file as strings and checking the length of each row?
An important point is that the columns can differ from file to file. There are no mandatory columns.
You can try passing header=None to read_csv. pandas then takes the expected number of columns from the first line and raises a ParserError when a later row has more fields than that. For example:
import pandas as pd

try:
    df = pd.read_csv("your_file.csv", header=None)
except pd.errors.ParserError:
    print("File Invalid")

How to split a column containing two records separately

I have millions of observations in different columns, and one of the columns contains the records of two factors together, for instance 136789. I want to split off the first character (1) and the rest (36789) into separate columns for all observations.
The field looks like this
#136789
I want to see like this
#1 36789
You can make use of the sub() function.
For example:
kent$ awk 'BEGIN{x="123456";sub(/^./,"& ",x);print x}'
1 23456
In your code, you need to apply sub() to the relevant column ($x).
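Applying sub() to one column could look like the sketch below. It assumes comma-separated input and inserts the output separator (rather than a space) after the first character, so the value becomes two columns; the function name split_first_char, the col variable, and the sample rows are hypothetical and should be adapted to the real file layout.

```shell
# split_first_char COL: split column COL into its first character and the rest.
# Assumes comma-separated input; split_first_char and col are illustrative names.
split_first_char() {
  # "&" in the replacement stands for the matched first character;
  # appending OFS after it turns the one field into two on output.
  awk -F, -v OFS=, -v col="$1" '{ sub(/^./, "&" OFS, $col) } 1'
}

rows='x,136789,foo
y,248,bar'

split=$(printf '%s\n' "$rows" | split_first_char 2)
printf '%s\n' "$split"
```

Because the program runs once per input line without storing anything, it should scale to millions of rows.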

How to change a particular column value for certain number of rows in Pig latin

I have a Pig file with, say, 10000 rows. Is there a quick way to change the value of a certain column for, say, the first 1000 rows?
Since some information is missing, I will make a few assumptions and then offer a solution:
by "first 1000 rows" you mean that the records can be ordered using some column
you wish to change the value of column $1 in the first 1000 records when ordering by column $2
The following code snippet will do what you asked for:
a = load ...
b = rank a by $2;
c = foreach b generate $0, (rank_a<=1000?3*$1:$1), $2..;
Use the FOREACH and LIMIT operations to achieve the effect.

How to access columns by their names and not by their positions?

I have just tried my first sqlite select-statement and got a result (an iterator over tuples). So, in other words, every row is represented by a tuple and I can access value in the cells of the row like this: r[7] or r[3] (get value from the column 7 or column 3). But I would like to access columns not by their positions but by their names. Let us say, I would like to know the value in the column user_name. What is the way to do it?
I found the answer to my question here:
cursor.execute("PRAGMA table_info(tablename)")
print(cursor.fetchall())  # one row per column of the table; the column name is at index 1