Countif like function in AWK with field headers - awk

I am looking for a way to count the number of times a value in a field appears across that field in a csv file, much like COUNTIF in Excel, although I would like to use an awk command if possible.
So column 6 holds the values, and column 7 should hold the number of times that value appears in column 6, as shown below.
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
#adds field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous, similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
very similar question to this one
similar python related Q, for my ref

I would use GNU AWK for this task in the following way. Let file.txt contain
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I tell GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header), I print it followed by count and tell GNU AWK to go to the next line, so nothing else is done with the 1st line. During the first pass, i.e. while the global line number (NR) equals the line number within the current file (FNR), I count the occurrences of each value in the 6th field, store the counts in the array arr, and go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from arr.
(tested in GNU Awk 5.0.1)
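The same two-pass idea can be laid out as a commented script. This is only a sketch: the helper name countif_col and its column argument are assumptions added for illustration, and the sample file is recreated in a temporary location.

```shell
# Recreate the sample data from the answer in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
EOF

# countif_col FILE COL: hypothetical helper (not part of the answer).
# The file is passed to awk twice, once per pass.
countif_col() {
    awk -v col="$2" '
    BEGIN     { FS = OFS = "," }
    NR == 1   { print $0, "count"; next }   # header line of the first pass
    FNR == NR { cnt[$col]++; next }         # first pass: tally each value
    FNR > 1   { print $0, cnt[$col] }       # second pass: skip the repeated header
    ' "$1" "$1"
}

countif_col "$tmp" 6
```

Passing the column number with -v keeps the program reusable for other columns; the cost of the two-pass design is that the input must be a regular file rather than a pipe.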

You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6], since it's the sixth field that holds the key you're counting. Because the header line is skipped, l[1] is never set, so the loop should also start at i=1 (making the first index used 2); otherwise a stray , line is printed first:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}' file3
You can also output the header with the new field name:
awk -F, -v OFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}' file3

One awk idea:
awk '
BEGIN { FS=OFS="," }     # define input/output field delimiters as comma
      { lines[NR]=$0
        if (NR==1) next
        col6[NR]=$6      # copy field 6 so we do not have to parse the contents of lines[] in the END block
        cnt[$6]++
      }
END   { for (i=1;i<=NR;i++)
            print lines[i], (i==1 ? "count" : cnt[col6[i]] )
      }
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Related

How to use awk script to generate a file

I have a very large compressed file (dataFile.gz) from which I want to generate another file using cat and awk: using cat to view the contents and then piping them to awk to generate the new file.
The contents of the compressed file look like this:
Time,SequenceNumber,MsgType,MsgLength,CityOrign,RTime
7:20:13,1,A,34,Tokyo,0
7:20:13,2,C,35,Nairobi,7:20:14
7:20:14,3,E,30,Berlin,7:20:15
7:20:16,4,A,34,Berlin,7:20:17
7:20:17,5,C,35,Denver,0
7:20:17,6,D,33,Helsinki,7:20:18
7:20:18,7,F,37,Tokyo,0
….
….
….
For the new file, I only want Time, MsgType and RTime, meaning columns 0, 2 and 5. For column 5, if the value is 0, replace it with the value in column 0, i.e. replace RTime with Time:
Time,MsgType,RTime
7:20:13,A,7:20:13
7:20:13,C,7:20:14
7:20:14,E,7:20:15
7:20:16,A,7:20:17
7:20:17,C,7:20:17
7:20:17,D,7:20:18
7:20:18,F,7:20:18
This is my script so far:
#!/usr/bin/awk -f
BEGIN {FS=","
print %0,%2,
if ($5 == "0") {
print $0
} else {
print $5
}
}
My question is, will this script work and how do I call it. Can I call it on the terminal like below?
zcat dataFile.gz | <awk script> > generatedFile.csv
awk field indexes start at 1, and $0 represents the full record. So the column numbers would be 1, 3 and 6.
You may use this awk:
awk 'BEGIN{FS=OFS=","} !$6{$6=$1} {print $1, $3, $6}' file
Time,MsgType,RTime
7:20:13,A,7:20:13
7:20:13,C,7:20:14
7:20:14,E,7:20:15
7:20:16,A,7:20:17
7:20:17,C,7:20:17
7:20:17,D,7:20:18
7:20:18,F,7:20:18
Here is a slightly shorter version of @anubhava's solution. It does not assign to the 6th field; it only checks whether that field is zero and prints the appropriate value. The conditional is parenthesized so that it also parses in awk implementations that are strict about unparenthesized ?: inside print.
awk 'BEGIN{FS=OFS=","} {print $1, $3, ($6==0?$1:$6)}' Input_file
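On the "how do I call it" part of the question, one possible sketch is to save the awk program in a script file and run it with awk -f, piping the decompressed data in. The temporary script path and the wrapper name extract_cols are assumptions for illustration; the sample rows stand in for zcat dataFile.gz.

```shell
# Save the awk program to a script file (a temporary file is used here;
# any path works).
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { FS = OFS = "," }
{ print $1, $3, ($6 == 0 ? $1 : $6) }   # fall back to Time when RTime is 0
EOF

# extract_cols is a hypothetical wrapper that filters standard input.
extract_cols() {
    awk -f "$script"
}

# With the real data this would be:
#   zcat dataFile.gz | extract_cols > generatedFile.csv
printf '%s\n' \
    'Time,SequenceNumber,MsgType,MsgLength,CityOrign,RTime' \
    '7:20:13,1,A,34,Tokyo,0' \
    '7:20:13,2,C,35,Nairobi,7:20:14' | extract_cols
```

A script that starts with the #!/usr/bin/awk -f line from the question and is made executable with chmod +x could be placed directly in the pipeline instead, assuming awk lives at that path on the system.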

Filtering rows based on column values of csv file

I have a dataset with 1000 rows and 10 columns. Here is the sample dataset
A,B,C,D,E,F,
a,b,c,d,e,f,
g,h,i,j,k,l,
m,n,o,p,q,r,
s,t,u,v,w,x,
From this dataset I want to copy the rows whose value in column A is 'a' or 'm' to a new csv file. I also want the header to be copied.
I have tried using awk. It copied all the rows but not the header.
awk '{$1~/a//m/ print}' inputfile.csv > outputfile.csv
How can I copy the header also into the new outputfile.csv?
Thanks in advance.
Assuming your header is always on the 1st row, try the following:
awk 'BEGIN{FS=OFS=","} FNR==1{print;next} $1 ~ /^a$|^m$/' Input_file > outputfile.csv
Or, per Cyrus's comment:
awk 'BEGIN{FS=OFS=","} FNR==1{print;next} $1 ~ /^(a|m)$/' Input_file > outputfile.csv
Or, per Ed's comment:
awk -F, 'NR==1 || $1~/^[am]$/' Input_file > outputfile.csv
Corrections to the OP's attempt:
Set FS and OFS to , for all lines, since the lines are comma-delimited.
Added the condition FNR==1, which matches the 1st line and simply prints it, since we want the header in the output file; next then skips the remaining statements for that line.
Used a better regex for checking the 1st field: $1 ~ /^a$|^m$/
This might work for you (GNU sed):
sed '1b;/^[am],/!d' oldFile >newFile
Always print the first line and delete any other line that does not begin with a, or m,.
Alternative:
awk 'NR==1 || /^[am],/' oldFile >newFile
With awk. Set the field separator (FS) to , and output the current row if it is the first row or if its first column is exactly a or m.
awk 'NR==1 || $1=="a" || $1=="m"' FS=',' in.csv >out.csv
Output to out.csv:
A,B,C,D,E,F,
a,b,c,d,e,f,
m,n,o,p,q,r,
$ awk -F, 'BEGIN{split("a,m",tmp); for (i in tmp) tgts[tmp[i]]} NR==1 || $1 in tgts' file
A,B,C,D,E,F,
a,b,c,d,e,f,
m,n,o,p,q,r,
It appears that awk's default delimiter is whitespace.
The delimiter can be changed by setting the FS variable:
awk 'BEGIN { FS = "," } ; { print $2 }'
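A small sketch of the difference, using a made-up input line (not taken from the question's dataset): with the default separator the line is split on whitespace, while FS = "," splits it on commas.

```shell
line='a,b c,d'

# Default FS: fields are "a,b" and "c,d", so $2 is "c,d".
default_split=$(printf '%s\n' "$line" | awk '{ print $2 }')

# FS set to a comma: fields are "a", "b c" and "d", so $2 is "b c".
comma_split=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "," } { print $2 }')

printf 'default: %s\ncomma:   %s\n' "$default_split" "$comma_split"
```

This is the likely reason a program without -F, or FS="," never sees the first CSV column on its own in $1.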

Duplicate Lines 2 times and transpose from row to column

I would like to duplicate each line 2 times and print the values of columns 5 and 6 separately (transposing the values of columns 5 and 6 from columns to rows) for each line.
That is, the value of column 5 goes on the first line and the value of column 6 on the second line.
Input File
08,1218864123180000,3201338573,VV,22,27
08,1218864264864000,3243738789,VV,15,23
08,1218864278580000,3244738513,VV,3,13
08,1218864310380000,3243938789,VV,15,23
08,1218864324180000,3244538513,VV,3,13
08,1218864334380000,3200538561,VV,22,27
Desired Output
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
I use this code to duplicate the lines 2 times, but I can't figure out the condition for the values of columns 5 and 6:
awk '{print;print}' file
Thanks in advance
To repeatedly print the start of a line for each of the last N fields where N is 2 in this case:
$ awk -v n=2 '
    BEGIN { FS=OFS="," }
    {
        base = $0
        sub("("FS"[^"FS"]+){"n"}$","",base)
        for (i=NF-n+1; i<=NF; i++) {
            print base, $i
        }
    }
' file
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
In this simple case where the last field has to be removed and placed on the last line, you can do
awk -F , -v OFS=, '{ x = $6; NF = 5; print; $5 = x; print }'
Here -F , and -v OFS=, will set the input and output field separators to a comma, respectively, and the code does
{
    x = $6    # remember sixth field
    NF = 5    # Set field number to 5, so the last one won't be printed
    print     # print those first five fields
    $5 = x    # replace value of fifth field with remembered value of sixth
    print     # print modified line
}
This approach can be extended to handle fields in the middle with a function like the one in the accepted answer of this question.
EDIT: As Ed notes in the comments, writing to NF is not explicitly defined to trigger a rebuild of $0 (the whole-line record that print prints) in the POSIX standard. The above code works with GNU awk and mawk, but with BSD awk (as found on *BSD and probably Mac OS X) it fails to do anything.
So to be standards-compliant, we have to be a little more explicit and force awk to rebuild $0 from the modified field state. This can be done by assigning to any of the field variables $1...$NF, and it's common to use $1=$1 when this problem pops up in other contexts (for example: when only the field separator needs to be changed but not any of the data):
awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }'
I've tested this with GNU awk, mawk and BSD awk (which are all the awks I can lay my hands on), and I believe this to be covered by the awk bit in POSIX where it says "setting any other field causes the re-evaluation of $0" right at the top. Mind you, the spec could be more explicit on this point, and I'd be interested to test if more exotic awks behave the same way.
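The $1=$1 idiom mentioned above can be seen in isolation with a minimal sketch; the input line here is made up for illustration. Without any field assignment, print emits $0 untouched; after $1=$1, $0 is rebuilt from the fields using OFS.

```shell
# No field is assigned, so $0 is printed as read and OFS has no effect.
untouched=$(printf 'a b c\n' | awk -v OFS=, '{ print }')

# Assigning a field (even to itself) rebuilds $0 with OFS between fields.
rebuilt=$(printf 'a b c\n' | awk -v OFS=, '{ $1 = $1; print }')

printf '%s\n%s\n' "$untouched" "$rebuilt"
```

The first line printed is a b c and the second is a,b,c.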
Try the following, assuming your Input_file always looks like the sample and you always want the first four fields printed, followed by each of the remaining fields in turn (one per output line).
awk 'BEGIN{FS=OFS=","}{for(i=5;i<=NF;i++){print $1,$2,$3,$4,$i}}' Input_file
This might work for you (GNU awk):
awk '{print gensub(/((.*,).*),/,"\\1\n\\2",1)}' file
Replace the last comma by a newline and the previous fields less the penultimate.

How to increment a column value with an increasing number in a csv file

I have a text file with 3 columns as below.
$ cat test.txt
1,A,300
1,B,300
1,C,300
So far I have tried: awk -F, '{$3=$3+1;print}' OFS=, test.txt
But output is coming as:
1,A,301
1,B,301
1,C,301
Below is my desired output.
I want to increment only the third column, by an increasing number, so the output should look like this:
1,A,300
1,B,301
1,C,302
How can I achieve the desired output?
It could be done like this (assuming the lines are sequential, as in your sample):
awk -F ',' '{sub($3"$",$3+NR-1)}7' YourFile
It uses the line number as the increment value and changes the end of the line rather than the field value (a difference from awk's point of view: it does not need to rebuild the line with the separator).
An alternative if there are empty or other lines between the modifiable lines (I arbitrarily use NF as the filter, but it depends on your criteria, if any):
awk -F ',' 'NF{sub($3"$",$3+i++)}7' YourFile
awk 'BEGIN{x=0;FS=OFS=","} NF>1{$3=$3+x;x++}1' inputfile
1,A,300
1,B,301
1,C,302
Explanation:
BEGIN block: it initializes x, a counter set to zero, along with FS and OFS. NF>1 is used to skip blank lines (remove this part if there are no blank lines). $3=$3+x adds the counter's value to $3. x++ increments the counter.
Try this. NR starts at 1, so NR-1 should give you the correct number:
awk -F, '{$3=$3+NR-1;print}' OFS=, test.txt
Yet another:
awk 'BEGIN{ FS=OFS="," } ($3+=i++)||1 ' file
awk 'BEGIN{i=0;FS=OFS=","} NF>1{$3=$3+i;i++}1' filename
The BEGIN block initializes i, a counter set to zero, along with FS and OFS. NF>1 is used to skip blank lines (remove this part if there are no blank lines).
$3=$3+i adds the counter's value to $3. i++ increments the counter. Make sure there is a space between awk and the quoted program, and between the program and the filename.
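A short sketch of why a separate counter can be preferable to NR-1 when blank lines may occur. The input below is the question's sample with one blank line inserted, which is an assumption for illustration: the counter only advances on lines that actually have fields, so the blank line passes through unchanged and does not consume an increment.

```shell
# bump_third is a hypothetical helper name; it reads CSV on standard input.
bump_third() {
    awk 'BEGIN { FS = OFS = "," } NF > 1 { $3 += i++ } 1'
}

# Sample input with a blank line between the first and second records.
printf '1,A,300\n\n1,B,300\n1,C,300\n' | bump_third
```

The three data lines come out with 300, 301 and 302 in the third column, and the blank line stays blank.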

Awk Field number of matched pattern

I was wondering if there's a built in command in awk to get the field number of the phrase that you just matched.
Banana is yellow.
awk {
/yellow/{ for (i=1;i<=NF;i++) if($i ~/yellow/) print $i}'
Is there a way to avoid writing the loop?
Your command doesn't work when I test it. Here's my version:
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) if($i ~/yellow/) print i}'
The output is :
3
As far as I know, there is no such built-in feature. To improve your command: the pattern match /yellow/ at the beginning is not necessary, and $i prints the matching field rather than the field number that you need.
Alternatively, you can use an array to store each field and its corresponding index number, and then print the field number with arr["yellow"].
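A minimal sketch of that array idea, with the array name pos chosen here for illustration. It still loops over the fields once to build the lookup, and it relies on an exact match of the whole field, so a field such as yellow. (with a trailing period) would not be found under the key yellow.

```shell
# Map each field's text to its position, then look the word up directly.
field_no=$(echo "banana is yellow" |
    awk '{ for (i = 1; i <= NF; i++) pos[$i] = i; print pos["yellow"] }')

echo "$field_no"
```

For this input the lookup yields 3. If a word occurs more than once on the line, the array keeps the position of its last occurrence.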
If the input string is a oneline string you can set the record delimiter to the field delimiter. Doing so you can use NR to print the position:
awk 'BEGIN{RS=FS}/yellow/{print NR}' <<< 'banana is yellow'
3