I have about 100 files named a.vcf, b.vcf, d.vcf, and so on.
For example:
a.vcf
##contig= ID=chr1,length=249250621
##contig= ID=chr2,length=243199373
##contig= ID=chr3,length=198022430
##contig= ID=chr4,length=191154276
b.vcf
##contig= ID=chr5,length=180915260
##contig= ID=chr6,length=171115067
##contig= ID=chr7,length=159138663
##contig= ID=chr8,length=146364022
##contig= ID=chr9,length=141213431
##contig= ID=chr10,length=135534747
I want to append the file name as an additional last column and write the result to a new file, for example a_a.vcf:
a_a.vcf
##contig= ID=chr1,length=249250621 a.vcf
##contig= ID=chr2,length=243199373 a.vcf
##contig= ID=chr3,length=198022430 a.vcf
##contig= ID=chr4,length=191154276 a.vcf
For a single VCF file, I used the following code:
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf
Then I want to apply this to all the files in this folder.
for d in *.vcf; do
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a_$d
done
But I found that $0 was replaced with -zsh. How can I fix this?
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_a.vcf
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_b.vcf
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_c.vcf
GNU AWK is not limited to a single input file; you can pass several files to a single awk invocation by listing the filenames separated by spaces. In your case, try
awk 'FNR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf b.vcf c.vcf
which should give the same output as
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' b.vcf
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' c.vcf
Note that I used FNR in place of NR, i.e. the line number within the current file rather than the global line number. As suggested in the comments, you might further improve your code by using the OFS variable as follows
awk 'BEGIN{OFS=" "}FNR == 1 {print $0, "name_file"; next}{print $0, FILENAME}' a.vcf b.vcf c.vcf
If you want to know more about OFS and the other built-in variables, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
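If you need one output file per input (a.vcf -> a_a.vcf), a loop along these lines might do it. This is only a sketch: the scratch directory and sample lines are made up for the demonstration, and the `case` guard assumes none of your original files already start with `a_`. The awk program stays in single quotes so that the shell leaves `$0` alone.

```shell
# Scratch directory with two small sample files (hypothetical demo data).
workdir=$(mktemp -d)
printf '%s\n' '##contig= ID=chr1,length=249250621' \
              '##contig= ID=chr2,length=243199373' > "$workdir/a.vcf"
printf '%s\n' '##contig= ID=chr5,length=180915260' \
              '##contig= ID=chr6,length=171115067' > "$workdir/b.vcf"

(
    cd "$workdir" || exit 1
    for f in *.vcf; do
        # Skip outputs of an earlier run so they are not annotated twice.
        case $f in a_*) continue ;; esac
        # Single quotes keep $0 and FILENAME for awk, not for the shell.
        awk 'FNR == 1 {print $0 " name_file"; next} {print $0 " " FILENAME}' "$f" > "a_$f"
    done
)

cat "$workdir/a_a.vcf"
```

Running awk from inside the directory keeps FILENAME as the bare name (a.vcf) rather than a full path.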
I have to compare two files using awk.
Both files have the same structure: path checksum
File1.txt
/content/cr444/commun/ 50d174f143d115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb91
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
File2.txt
/content/cr555/test/ 51d174f14f6115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb78
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
The expected result is a .csv file (with | as the separator):
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
I assume the order of output lines is not important. Then you could:
1. Collect lines from File1.txt into an array indexed by path ($1 -> $2).
2. Process lines from File2.txt:
   - If $1 is in the array from step 1, compare the two checksums, print accordingly, and remove the entry from the array.
   - If $1 is not in the array from step 1, print accordingly.
3. Print all remaining items from the array from step 1.
Here's the code:
$ awk 'BEGIN{OFS="|"} NR==FNR{f1[$1]=$2; next} {if ($1 in f1) { print $1,f1[$1],$2,($2==f1[$1]?"":"not ")"same checksum"; delete f1[$1]} else print $1,"",$2,"not in file1"} END{for (i in f1) print i,f1[i],"","not in file2"}' File1.txt File2.txt
Output:
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
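For readability, the one-liner above can also be laid out over several lines. The following is a sketch of the same logic, not a different method. The scratch directory, the `result` variable and the final `sort` (there only to make the output order predictable, since `for (p in f1)` has no defined order) are my additions.

```shell
# Recreate the two sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/File1.txt" <<'EOF'
/content/cr444/commun/ 50d174f143d115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb91
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
EOF
cat > "$workdir/File2.txt" <<'EOF'
/content/cr555/test/ 51d174f14f6115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb78
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
EOF

result=$(awk '
BEGIN { OFS = "|" }
# First file: remember the checksum for each path.
NR == FNR { f1[$1] = $2; next }
# Second file, path also in File1: compare checksums, then forget the path.
$1 in f1 {
    print $1, f1[$1], $2, ($2 == f1[$1] ? "" : "not ") "same checksum"
    delete f1[$1]
    next
}
# Second file, path unknown to File1.
{ print $1, "", $2, "not in file1" }
# Paths still left in f1 never appeared in File2.
END { for (p in f1) print p, f1[p], "", "not in file2" }
' "$workdir/File1.txt" "$workdir/File2.txt" | sort)

printf '%s\n' "$result"
```

One caveat with the `NR == FNR` idiom: if File1.txt is empty, the condition is also true for every line of File2.txt, so nothing would be reported.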
One way, using join to merge the two files, and awk to compare the checksums on each line:
$ join -a1 -a2 -11 -21 -e XXXX -o 0,1.2,2.2 <(sort -k1 File1.txt) <(sort -k1 File2.txt) |
awk -v OFS='|' '$2 == "XXXX" { print $1, "", $3, "not in file1"; next }
$3 == "XXXX" { print $1, $2, "", "not in file2"; next }
$2 == $3 { print $1, $2, $3, "same checksum"; next }
{ print $1, $2, $3, "not same checksum" }'
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
I would like to aggregate values in a file based on a specific field value which is a kind of group attribute. The ending file should have one line per group.
MWE:
$ head -n6 foo
X;Y;OID;ID;OQTE;QTE;OTYPE;TYPE;Z
603.311;800.928;930;982963;0;XTX;49;comment;191.299
603.512;810.700;930;982963;0;XTX;49;comment;191.341
604.815;802.475;930;982963;0;XTX;49;comment;191.393
601.901;858.701;122;982954;0;XTX;50;comment;194.547
601.851;832.317;122;982954;0;XTX;50;comment;193.733
There are two groups here: 982963 and 982954.
Target:
$ head -n3 bar
CODE;OID;ID;OQTE;QTE;OTYPE;TYPE
"FLW (603.311 800.928 191.299, 603.512 801.700 191.341, 604.815 802.475 191.393)";982963;0;XTX;49;comment
"FLW (601.901 858.701 194.547, 601.851 832.317 193.733)";982954;0;XTX;49;comment
The group field is the fourth field of the foo file. All other fields may vary.
The X, Y and Z values of each record in a group should be stored within the FLW parentheses, in the same order as they appear in the input file.
I've tried many things, and as I'm far from an expert with awk, this kind of code doesn't work at all:
awk -F ";" 'NR==1 {print "CODE;"$3";"$4";"$5";"$6";"$7";"$8}; NR>1 {a[$4]=a[$4]}END{for(i in a) { print "\"FLW ("$1","$2","$NF")\";"$3";"i""a[i]";"$5";"$6";"$7";"$8 }}' foo
Try:
$ awk -F ";" 'NR==1 {print "CODE;"$3";"$4";"$5";"$6";"$7";"$8}; NR>1 {a[$4]=$5";"$6";"$7";"$8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
CODE;OID;ID;OQTE;QTE;OTYPE;TYPE
"FLW (601.901 858.701 194.547, 601.851 832.317 193.733)";982954;0;XTX;50;comment
"FLW (603.311 800.928 191.299, 603.512 810.700 191.341, 604.815 802.475 191.393)";982963;0;XTX;49;comment
Or, as spread out over multiple lines:
awk -F ";" '
NR==1 {
print "CODE;"$3";"$4";"$5";"$6";"$7";"$8
}
NR>1 {
a[$4]=$5";"$6";"$7";"$8
b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF
}
END{
for(i in a)
printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]
}
' foo
Alternate styles
For one, we can replace ";" with FS:
awk -F";" 'NR==1 {print "CODE;"$3 FS $4 FS $5 FS $6 FS $7 FS $8}; NR>1 {a[$4]=$5 FS $6 FS $7 FS $8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
For another, the first print can also be replaced with a printf:
awk -F";" 'NR==1 {printf "CODE;%s;%s;%s;%s;%s;%s\n",$3,$4,$5,$6,$7,$8}; NR>1 {a[$4]=$5 FS $6 FS $7 FS $8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
Variation
If, as per the comments, the group field is the third, not the fourth, then:
awk -F";" 'NR==1 {print "CODE;"$3 FS $4 FS $5 FS $6 FS $7 FS $8}; NR>1 {a[$3]= $4 FS $5 FS $6 FS $7 FS $8; b[$3]=(b[$3]?b[$3]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
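Note that `for (i in a)` visits the groups in an unspecified order, which is why 982954 can come out before 982963. If the groups should appear in the order in which they first occur in foo, one possible approach is to record that order in a second array. The sketch below uses the fourth field as the group, as in the first version. The `order` array, the counter `n`, the `bar` variable and the scratch directory are names I made up for the illustration.

```shell
# Sample input from the question, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/foo" <<'EOF'
X;Y;OID;ID;OQTE;QTE;OTYPE;TYPE;Z
603.311;800.928;930;982963;0;XTX;49;comment;191.299
603.512;810.700;930;982963;0;XTX;49;comment;191.341
604.815;802.475;930;982963;0;XTX;49;comment;191.393
601.901;858.701;122;982954;0;XTX;50;comment;194.547
601.851;832.317;122;982954;0;XTX;50;comment;193.733
EOF

bar=$(awk -F ';' '
NR == 1 { print "CODE" FS $3 FS $4 FS $5 FS $6 FS $7 FS $8; next }
{
    if (!($4 in a)) order[++n] = $4     # remember first appearance of each group
    a[$4] = $5 FS $6 FS $7 FS $8        # trailing fields, last record of the group wins
    b[$4] = (b[$4] ? b[$4] ", " : "") $1 " " $2 " " $NF
}
END {
    # Walk the groups in input order instead of using for (i in a).
    for (k = 1; k <= n; k++) {
        g = order[k]
        printf "\"FLW (%s)\";%s;%s\n", b[g], g, a[g]
    }
}
' "$workdir/foo")

printf '%s\n' "$bar"
```

With the sample data, the 982963 line is now printed before the 982954 line.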
I would like suggestions for improving this command and removing unnecessary processing to save time.
I am trying to find the line count and the sum of $6, grouped by $2,substr($3,4,6),substr($4,4,6),$10,$8,$6.
The gzipped input file contains around 300 million lines.
Input.gz
2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
I am using the command below, and it works fine.
zcat Input.gz | awk -F"," '{OFS=","; print $2,substr($3,4,6),substr($4,4,6),$10,$8,$6}' | \
awk -F"," 'BEGIN {count=0; sum=0; OFS=","} {key=$0; a[key]++;b[key]=b[key]+$6} \
END {for (i in a) print i,a[i],b[i]}' >Output.txt
Output.txt
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Any suggestions to improve the above command are welcome.
This should be more efficient, since it builds the key and aggregates in a single awk process:
zcat Input.gz | awk -F, '{key=$2","substr($3,4,6)","substr($4,4,6)","$10","$8","$6;++a[key];b[key]=b[key]+$6}END{for(i in a)print i","a[i]","b[i]}'
Output:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Uncondensed form:
zcat Input.gz | awk -F, '{
key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
++a[key]
b[key] = b[key] + $6
}
END {
for (i in a)
print i "," a[i] "," b[i]
}'
You can do this with one awk invocation by redefining the fields according to the first awk script, i.e. something like this:
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
No need to change $6 as that is the same field. Now if you base the key on the new fields, the second script will work almost unaltered. Here is how I would write it, moving the code into a script file for better readability and maintainability:
zcat Input.gz | awk -f parse.awk
Where parse.awk contains:
BEGIN {
FS = OFS = ","
}
{
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
key = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
a[key]++
b[key] += $6
}
END {
for (i in a)
print i, a[i], b[i]
}
You can of course still run it as a one-liner, but it will look more cryptic:
zcat Input.gz | awk '{ key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6; a[key]++; b[key]+=$6 } END { for (i in a) print i,a[i],b[i] }' FS=, OFS=,
Output in both cases:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
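Before running either version on the full 300-million-line file, it may be worth checking the grouping on a handful of rows. Here is a small self-contained sketch that does this with the sample lines from the question. The scratch directory, the array names `count` and `sum`, and the trailing `sort` (only there to make the order reproducible) are my own choices, and `gzip -dc` is used as a stand-in for `zcat`.

```shell
# Build a tiny Input.gz from the sample rows in the question.
workdir=$(mktemp -d)
cat > "$workdir/Input" <<'EOF'
2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
EOF
gzip "$workdir/Input"    # replaces Input with Input.gz

out=$(gzip -dc "$workdir/Input.gz" | awk -F, -v OFS=, '
{
    # Group key: $2, month-year of $3 and $4, then $10, $8 and $6.
    key = $2 OFS substr($3, 4, 6) OFS substr($4, 4, 6) OFS $10 OFS $8 OFS $6
    count[key]++        # number of lines in the group
    sum[key] += $6      # running total of field 6
}
END { for (k in count) print k, count[k], sum[k] }
' | sort)

printf '%s\n' "$out"
```

Building the key directly from the original fields, as here and in the one-liner, also avoids reassigning $1 to $5 on every line.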