Count the number of records in a field using awk

I have a big file like this one
C1,C2,C3
C1,C2
C5
C3,C5
I expect output like this
C1,C2,C3 3
C1,C2 2
C5 1
C3,C5 2
I would like to do this in the shell. Could you help me, please?
Thank you

Something like
awk 'BEGIN{FS=","}{printf "%-20s\t%d\n",$0,NF;}' file
should give
C1,C2,C3 3
C1,C2 2
C5 1
C3,C5 2
Note: adjust the width (20 here) to fit the maximum length of your lines.
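If aligned columns are not needed, the width question goes away. A minimal sketch, assuming a single space between the line and its count is acceptable (count_fields is a hypothetical helper name, not part of the answer above):

```shell
# Hypothetical helper: reads comma-separated lines on stdin and
# prints each line followed by its number of fields.
count_fields() {
    awk -F, '{ print $0, NF }'
}

# Usage with the sample data from the question.
printf 'C1,C2,C3\nC1,C2\nC5\nC3,C5\n' | count_fields
```

An empty input line is reported with a count of 0, since awk sets NF to 0 for it.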

Another in awk:
$ awk '{
    m=(m<(n=length($0))?n:m)   # get the max record length
    a[NR]=$0                   # store each record in array a
}
END {
    for(i=1;i<=NR;i++)         # iterate and (below) output nicely
        printf "%s%"(m-length(a[i])+4)"s\n",a[i],gsub(/,/,"&",a[i])+1
}' file
C1,C2,C3 3
C1,C2 2
C5 1
C3,C5 2
If you want to change the distance between the line and its count, adjust the +4 in the printf.

Related

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool), I want to extract only the rows with SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Update: the SRMAPQ entry can be anywhere in the line, e.g.
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
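Or, if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach. The value is reset for every line and forced to a number with +0, so a line without SRMAPQ never reuses the previous value and the comparison is numeric rather than string-based:
awk '{ v = 0 } match($0, /SRMAPQ=[0-9]+/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) + 0 } v > 60' file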
I would use GNU AWK as follows. Let the content of file.txt be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2+0>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test whether it works when SRMAPQ is not last. Explanation: I set FS to SRMAPQ=, so what is before it becomes the first field ($1) and what is after it becomes the second field ($2). In the 2nd line $2 is 67;SOMETHING=2; adding 0 forces a numeric conversion, which uses the longest leading prefix that forms a number, in this case 67. Without the +0, such a field would be compared as a string. The other lines have just numbers. Disclaimer: this solution assumes that every entry except the last has a trailing ;. If this does not hold, please test the solution fully before use.
(tested in gawk 4.2.1)
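For awks other than GNU awk, the same filter can be sketched by splitting the key=value list on ";" and looking for the SRMAPQ entry explicitly. This is only a sketch under assumptions: the list is the last whitespace-separated column, and srmapq_over is a hypothetical helper name that takes the threshold as its argument.

```shell
# Hypothetical helper: print lines whose SRMAPQ value exceeds $1.
# Assumes the key=value list is the last whitespace-separated column.
srmapq_over() {
    awk -v min="$1" '{
        n = split($NF, kv, ";")                 # one array element per key=value pair
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^SRMAPQ=/) {
                val = substr(kv[i], 8) + 0      # text after "SRMAPQ=", forced numeric
                if (val > min) print
                next                            # only the first SRMAPQ entry counts
            }
    }'
}

# Usage: srmapq_over 60 < file.txt
```

Anchoring the pattern with ^ keeps MAPQ= from being mistaken for SRMAPQ=, and the +0 makes values such as 7 or 100 compare as numbers.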

awk: Perform arithmetic on subset of columns and print all columns with modified values

I have columns containing some values e.g.:
A 10 20 30 AA AAA AAAA
B 40 50 60 BB BBB BBBB
C 70 80 90 CC CCC CCCC
I want to perform an arithmetic operation like multiplication on cols 2,3,4 and return a new table.
A 100 200 300 AA AAA AAAA
B 400 500 600 BB BBB BBBB
C 700 800 900 CC CCC CCCC
I can operate specifically on cols 2,3,4 using
awk '{print $2*10"\s"$3*10"\s"$4*10}' inp > out
but I don't know how to print the entire table with the modified values in those columns. Is there a way to do this in awk?
Adding a generic solution here, written and tested with the shown samples in GNU awk. List all field numbers, comma separated, in the awk variable fields, and set the number to multiply those fields by in multiplyBy; that should do the trick.
awk -v multiplyBy="10" -v fields="2,3,4" '
BEGIN{
  num=split(fields,arr,",")
  for(i=1;i<=num;i++){
    look[arr[i]]          # mark each listed field number
  }
}
{
  for(i=1;i<=NF;i++){
    if(i in look){
      $i=($i * multiplyBy)
    }
  }
}
1' Input_file
NOTE: I just saw the user's comments on the other answer. If someone wants to skip the first 5 lines, change the { before the for loop to FNR>5{ and that should do the trick.
In your example you calculate and print together. With awk you can first modify the fields and then print the whole line, or a part of it, like this:
awk '{$2=10*$2; $3=10*$3; $4=10*$4} {print}' file
{print} with no arguments means {print $0}, print the whole line. Also it can be replaced by any true condition, like 1, for example awk '1' file means print every line.
So your command can be also:
awk '{$2=10*$2; $3=10*$3; $4=10*$4} 1' file
Additionally, any action block ({}) can be preceded by a condition. For example, to skip the first 5 lines, the condition is NR>5, where NR is the record (usually row) number. Here the first 5 lines are left out of the calculation, but they are still printed together with all the other lines:
awk 'NR>5 {$2=10*$2; $3=10*$3; $4=10*$4} {print}' file
Here we ignore the first 5 lines entirely and do not print them either:
awk 'NR>5 {$2=10*$2; $3=10*$3; $4=10*$4; print}' file
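One thing to keep in mind with these commands: assigning to a field makes awk rebuild the line with the output field separator (a single space by default), so tabs or runs of spaces in the input are not preserved. A minimal sketch, assuming the table is tab-separated (scale_cols is a hypothetical helper name):

```shell
# Hypothetical helper: multiply columns 2-4 by the factor given as $1,
# keeping tabs as the separator in the output.
scale_cols() {
    awk -v factor="$1" 'BEGIN { FS = OFS = "\t" }   # read and write tabs
        { $2 *= factor; $3 *= factor; $4 *= factor } 1'
}

# Usage with one tab-separated row from the question.
printf 'A\t10\t20\t30\tAA\tAAA\tAAAA\n' | scale_cols 10
```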

awk compare two elements in the same line with regular expression

I have very long files where I have to compare two chromosome numbers present in the same line. I would like to use awk to create a file that keeps only the lines where the chromosome numbers are different.
Here is the example of my file:
CHROM ALT
1 ]1:1234567]T
1 T[1:2345678[
1 A[12:3456789[
2 etc...
In this example, I wish to compare the number of the chromosome (here '1' in the CHROM column) and the number that is between the first bracket ([ or ]) and the ":" symbol. If these numbers are different, I wish to print the corresponding line.
Here, the result should be like this:
1 A[12:3456789[
Thank you for your help.
$ awk -F'[][]' '$1+0 != $2+0' file
1 A[12:3456789[
2 etc...
This requires GNU awk for the 3 argument match() function:
gawk 'match($2, /[][]([0-9]+):/, a) && $1 != a[1]' file
Thanks again for the different answers.
Here is what my data looks like with several columns:
CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789[
2 ... ... . ...
My question is: how do I modify the previous code, when I have several columns?
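One way to adapt the earlier idea to several columns, sketched under assumptions: the file has one header line, CHROM is the first column, ALT is the last column, and chromosomes are purely numeric (names such as X or Y would all convert to 0). diff_chrom is a hypothetical helper name.

```shell
# Hypothetical helper: print rows whose CHROM (column 1) differs from the
# chromosome number that follows the first bracket in ALT (last column).
diff_chrom() {
    awk 'NR > 1 {                            # skip the header line
        split($NF, part, /[][]/)             # part[2] is the text after the first [ or ]
        if ($1 + 0 != part[2] + 0) print     # +0 keeps only the leading number
    }'
}

# Usage: diff_chrom < file
```

Because only $NF is inspected, the colons in the ID column (123:1 and so on) do not interfere.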

AWK select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words the number of columns is consistent within a file but different files have different number of columns. The first column is a key, the other columns are some arbitrary values.
I need to filter out the rows where the values are not the same. For example, assuming that the number of columns is 4, I need to keep the first 2 rows and filter out the 3-rd:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the fact that the number of columns is unknown. The case of the known number of columns is simple, this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution for arbitrary number of columns?
If you can guarantee that no field contains regex-active characters, that the first field never matches the second, and that there are no blank lines in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and less extendable than others.
Another slight twist on the approach. You want to compare every field from the 2nd onward, so you can simply loop with i=3;i<=NF, checking whether $i!=$(i-1); if any pair differs, skip the record with next instead of printing it, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B
Could you please try the following. It compares all columns from the 2nd to the last and checks whether every element is equal. If they are all the same, it prints the line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=0}' Input_file
OR (hard-coding $2 in the comparison: since $2=$3 and $3=$4 imply $2=$3=$4, every field is compared with $2 directly instead of with its predecessor via i-1.)
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=0}' Input_file
I'd use a counter t with initial value of 2 to add the number of times $i == $(i+1) where i iterates from 2 to NF-1. print the line only if t==NF is true:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
Here is a generalisation of the problem:
Select all lines where a set of columns have the same value: c1 c2 c3 c4 ..., where ci can be any number:
Assume we want to select the columns: 2 3 4 11 15
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file
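Two details from the question may matter here: the file is tab-separated, and the values are arbitrary. A sketch that covers both, assuming values should match as text (so 1.0 and 1 count as different); all_equal is a hypothetical helper name:

```shell
# Hypothetical helper: keep rows in which every column from the 2nd
# to the last holds exactly the same text.
all_equal() {
    awk -F'\t' '{
        for (i = 3; i <= NF; i++)
            if (($i "") != ($2 "")) next   # appending "" forces a string comparison
        print
    }'
}

# Usage: all_equal < file.txt
```

Without the appended "", awk compares two fields numerically when both look like numbers, which would treat 1.0 and 1 as equal.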

Copy lines by rows in awk

I have an input file that contains, per row, a value and two weights.
I would like to generate two output files - where the value in the first column is repeated once per line, according to the weights. This is probably best explained with a short example. If the input file is:
file.in:
35 2 0
37 2 3
38 0 4
Then I would like to generate two output files:
file.out1:
35
35
37
37
file.out2:
37
37
37
38
38
38
38
I will then use these output files to calculate the average and median of first column according to the weights in the second and third column.
This is pretty easy in awk.
awk '{for(i=0;i<$2;i++) print $1;}' file.in > file.out1
generates the first file, and
awk '{for(i=0;i<$3;i++) print $1;}' file.in > file.out2
generates the second
It is not clear from your question whether you know how to compute the mean and median from these files - it seems you just wanted to create these output files. Let me know if the rest is giving you trouble, or whether the above scripts are not clear (I think they are pretty self-explanatory).
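If the weighted average is all that is needed, it can also be computed straight from file.in without creating the expanded files. A minimal sketch, assuming the value is in column 1 and the weight column number is passed as an argument (weighted_mean is a hypothetical helper name):

```shell
# Hypothetical helper: weighted mean of column 1, using column $1 as the weight.
weighted_mean() {
    awk -v col="$1" '{ sum += $1 * $col; weight += $col }
        END { if (weight > 0) print sum / weight }'   # avoid dividing by zero
}

# Usage with the sample data: weights from column 2.
printf '35 2 0\n37 2 3\n38 0 4\n' | weighted_mean 2
```

For the sample input this gives the same result as averaging file.out1, because repeating a value w times and summing is the same as multiplying it by w.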
If I understood correctly, you need the average and the median.
Average:
awk '{a+=$1}END{print a/NR}' file.in
36.6667
Median:
awk '{print $1}' file.in | sort -n | awk '{a[NR]=$1}END{ b=NR/2; b=b%1?int(b)+1:b; print a[b] }'
37
Explanation:
Put simply, NR is a variable that holds the number of lines read; for the average you want the sum of every line divided by the number of lines.
For the median you want your input sorted numerically and then pick the middle value, but it's not so simple for your input: if you divide the number of lines, which is 3, by 2 you get 1.5, so you need a ceiling function, which awk doesn't have, so I am doing it with b=NR/2; b=b%1?int(b)+1:b;
I hope this helps.