Add a col using file name - awk

I have about 100 files, named a.vcf, b.vcf, d.vcf, and so on.
For example:
a.vcf
##contig= ID=chr1,length=249250621
##contig= ID=chr2,length=243199373
##contig= ID=chr3,length=198022430
##contig= ID=chr4,length=191154276
b.vcf
##contig= ID=chr5,length=180915260
##contig= ID=chr6,length=171115067
##contig= ID=chr7,length=159138663
##contig= ID=chr8,length=146364022
##contig= ID=chr9,length=141213431
##contig= ID=chr10,length=135534747
I want to add an extra column as the last column, holding the file name. For example, the new file a_a.vcf:
a_a.vcf
##contig= ID=chr1,length=249250621 a.vcf
##contig= ID=chr2,length=243199373 a.vcf
##contig= ID=chr3,length=198022430 a.vcf
##contig= ID=chr4,length=191154276 a.vcf
For a single vcf file, I used the following code:
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf
Then I want to apply this to all the files in this folder.
for d in *.vcf; do
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a_$d
done
But I found that -zsh replaced $0. How can I fix the problem?
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_a.vcf
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_b.vcf
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_c.vcf

GNU AWK is not limited to a single input file; you can pass several files to one awk invocation by listing the file names separated by spaces. In your case, try
awk 'FNR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf b.vcf c.vcf
which should give same output as
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' b.vcf
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' c.vcf
Note that I used FNR in place of NR, i.e. the line number within the current file rather than the (global) line number across all files. As suggested in the comments, you can further improve the code by using the OFS variable as follows
awk 'BEGIN{OFS=" "}FNR == 1 {print $0, "name_file"; next}{print $0, FILENAME}' a.vcf b.vcf c.vcf
If you want to know more about OFS and the other built-in variables, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
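The question also asked for one new file per input (a_a.vcf from a.vcf). A minimal sketch of that, assuming the a_ prefix from the question; the scratch directory and the two small sample files are made up for illustration. It redirects print to a name built from FILENAME and closes the previous output whenever a new input file starts.

```shell
# Scratch directory with two tiny sample inputs (made up for this demo).
dir=$(mktemp -d)
printf '%s\n' '##contig= ID=chr1,length=249250621' \
              '##contig= ID=chr2,length=243199373' > "$dir/a.vcf"
printf '%s\n' '##contig= ID=chr5,length=180915260' \
              '##contig= ID=chr6,length=171115067' > "$dir/b.vcf"

# Run inside the directory so FILENAME is the bare file name.
(
  cd "$dir" || exit 1
  awk '
  FNR == 1 {                       # first line of each input file
      if (out != "") close(out)    # finish the previous output file
      out = "a_" FILENAME          # e.g. a_a.vcf for a.vcf
      print $0, "name_file" > out
      next
  }
  { print $0, FILENAME > out }     # every other line gets the file name
  ' a.vcf b.vcf
)

cat "$dir/a_a.vcf" "$dir/a_b.vcf"
```

The awk program stays in single quotes so that the shell leaves $0 alone; inside double quotes the shell would expand $0 to its own name (such as -zsh) before awk sees it. With a glob like *.vcf, a second run would also pick up the a_*.vcf outputs, so writing them to another directory may be safer.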

Related

Compare the columns of the two files

I have two files and am trying to compare them on the basis of columns.
File_1
CALL_3 CALL_1
CALL_2 CALL_5
CALL_3 CALL_2
CALL_1 CALL_4
File_2
CALL_1 GAP:A GAP:G
CALL_3 GAP:C GAP:Q GAP:R
CALL_5 GAP:R GAP:A
CALL_4 GAP:C GAP:D GAP:A GAP:W
CALL_2 GAP:C GAP:R GAP:A
I want to print only those interactions from file_1 whose two calls have at least one GAP id in common in file_2.
Expected output
CALL_2 CALL_5 GAP:A GAP:R
CALL_3 CALL_2 GAP:C GAP:R
CALL_1 CALL_4 GAP:A
I tried the following :
awk 'NR==FNR {
a[$1]=($1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9)
next
}
($1 in a)&&($2 in a) {
print a[$1],a[$2]
}' File_2 File_1
It works well for a fixed number of columns, but the number of columns in file_2 is not fixed (there can be more than 1000). How can I get the expected output?
Could you please try following.
awk '
FNR==NR{
val=$1
$1=""
$0=$0
$1=$1
a[val]=$0
next
}
{
val=""
num1=split(a[$1],array1," ")
for(i=1;i<=num1;i++){
array3[array1[i]]
}
num2=split(a[$2],array2," ")
for(i=1;i<=num2;i++){
array4[array2[i]]
}
for(k in array3){
if(k in array4){
val=(val?val OFS:"")k
}
}
if(val){
print $0,val
}
val=""
delete array1
delete array2
delete array3
delete array4
}
' Input_file2 Input_file1
Output will be as follows.
CALL_2 CALL_5 GAP:A GAP:R
CALL_3 CALL_2 GAP:C GAP:R
CALL_1 CALL_4 GAP:A
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE for first Input_file is being read.
val=$1 ##Creating a variable named val whose value is $1 of current line.
$1="" ##Nullifying $1 here.
$0=$0 ##Re-assigning the current line to itself, so that it is re-split into fields (the emptied $1 is dropped).
$1=$1 ##Re-assigning $1 to itself, so that the line is rebuilt and the initial space is removed.
a[val]=$0 ##Creating an array named a whose index is val and value is $0.
next ##next will skip all further statements from here.
}
{
val="" ##Nullifying variable val here.
num1=split(a[$1],array1," ") ##splitting array a with index $1 to array1 and having its total number in num1.
for(i=1;i<=num1;i++){ ##Starting a for loop from i=1 till value of num1
array3[array1[i]] ##Creating an array named array3 with index of array1 with index i.
}
num2=split(a[$2],array2," ") ##splitting array a with index $2 to array2 and having its total number in num2.
for(i=1;i<=num2;i++){ ##Starting a for loop from i=1 till value of num2.
array4[array2[i]] ##Creating an array named array4 with value of array2 with index i.
}
for(k in array3){ ##Traversing through array3 here.
if(k in array4){ ##Checking condition if k which is index of array3 is present in array4 then do following.
val=(val?val OFS:"")k ##Creating variable named val whose value is variable k with concatenating its own value each time to it.
}
}
if(val){ ##Checking condition if variable val is NOT NULL then do following.
print $0,val ##Printing current line and variable val here.
}
val="" ##Nullifying variable val here.
delete array1 ##Deleting array1 here.
delete array2 ##Deleting array2 here.
delete array3 ##Deleting array3 here.
delete array4 ##Deleting array4 here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {
for (i=2; i<=NF; i++) {
gaps[$1][$i]
}
next
}
{
common = ""
for (gap in gaps[$1]) {
if (gap in gaps[$2]) {
common = common OFS gap
}
}
if ( common != "" ) {
print $0 common
}
}
$ awk -f tst.awk file2 file1
CALL_2 CALL_5 GAP:A GAP:R
CALL_3 CALL_2 GAP:C GAP:R
CALL_1 CALL_4 GAP:A
With any awk:
$ cat tst.awk
NR==FNR {
key = $1
sub(/[^[:space:]]+[[:space:]]+/,"")
gaps[key] = $0
next
}
{
mkSet(gaps[$1],gaps1)
mkSet(gaps[$2],gaps2)
common = ""
for (gap in gaps1) {
if (gap in gaps2) {
common = common OFS gap
}
}
if ( common != "" ) {
print $0 common
}
}
function mkSet(str,arr, i,tmp) {
delete arr
split(str,tmp)
for (i in tmp) {
arr[tmp[i]]
}
}
$ awk -f tst.awk file2 file1
CALL_2 CALL_5 GAP:A GAP:R
CALL_3 CALL_2 GAP:C GAP:R
CALL_1 CALL_4 GAP:A
I did it in bash with coreutils. A oneliner:
join -12 -21 <(join -11 -21 <(sort file_1) <(sort file_2) | sort -k2) <(sort file_2) | xargs -l1 bash -c 'a=$(<<<"${*:3}" tr " " "\n" | sort | uniq -d | tr "\n" " "); if [ -n "$a" ]; then printf "%s %s %s\n" "$1" "$2" "$a"; fi' --
Or a bit more lines:
join -12 -21 <(
join -11 -21 <(sort file_1) <(sort file_2) | sort -k2
) <(
sort file_2
) |
xargs -l1 bash -c '
a=$(<<<"${*:3}" tr " " "\n" | sort | uniq -d | tr "\n" " ");
if [ -n "$a" ]; then
printf "%s %s %s\n" "$1" "$2" "$a"
fi
' --
1. Join file_1 with file_2 on the first field.
2. Join the result from step 1, on its second field, with file_2 again.
3. Then, for each line:
   - Get only the duplicates among the GAP* parts.
   - If there are any duplicates, print the CALL_* names with the duplicates.
Results in:
CALL_2 CALL_3 GAP:C GAP:R
CALL_4 CALL_1 GAP:A
CALL_5 CALL_2 GAP:A GAP:R
With awk this is straightforward:
$ awk '(NR==FNR){$1=$1;a[$1]=$0;next}
{str=strt=$1 OFS $2}
{split(a[$1],b,OFS)}
{for(i in b) if(index(a[$2] OFS, OFS b[i] OFS)) str=str OFS b[i]}
(str!=strt){print str}' file2 file1
How does this work:
(NR==FNR){$1=$1;a[$1]=$0;next}
The first line buffers file2 in an associative array a[key]=value where key is the first field and value the full line. E.g.
a["CALL_1"]="CALL_1 GAP:A GAP:G"
Note that $1=$1 rebuilds the record, replacing every FS with OFS.
{str=strt=$1 OFS $2}
This just stores the two call names of the current file1 line, e.g. CALL_3 CALL_1, in both str and strt
{split(a[$1],b,OFS)}: split the buffered line into array b
{for(i in b) if(index(a[$2] OFS, OFS b[i] OFS)) str=str OFS b[i]}
For all entries in array b, check if the string OFS b[i] OFS is found in the string a[$2] OFS; if so, append b[i] to str. We add the extra OFS on both sides to ensure whole-field matches. We do also test values like OFS CALL_2 OFS, but these never match. This is a tiny overhead, but fixing it would create much more overhead.
(str!=strt){print str}: print the line only if at least one common entry was appended
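To see why the extra OFS on both sides matters, here is a small sketch with a made-up buffered line (the GAP:RA field is hypothetical, chosen so that one id is a prefix of another). A bare index() finds GAP:R inside GAP:RA, while the delimited test only matches whole fields.

```shell
out=$(awk 'BEGIN {
    OFS = " "
    line = "CALL_5 GAP:RA GAP:A"     # hypothetical buffered line from file2
    # Bare substring test: "GAP:R" is found inside "GAP:RA" (false positive).
    bare = index(line, "GAP:R")
    # Delimited test: OFS on both sides only matches a complete field.
    whole_miss = index(line OFS, OFS "GAP:R" OFS)
    whole_hit  = index(line OFS, OFS "GAP:A" OFS)
    print bare, whole_miss, whole_hit
}')
printf '%s\n' "$out"
```

A zero means no match; any other value is the position of the match in the searched string.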
A more optimised version would read:
$ awk '(NR==FNR){k=$1;$1="";a[k]=$0;c[k]=NF-1;next}
{str=strt=$1 OFS $2}
(c[$1]< c[$2]) {split(substr(a[$1],2),b,OFS);s=a[$2] OFS}
(c[$1]>=c[$2]) {split(substr(a[$2],2),b,OFS);s=a[$1] OFS}
{for(i in b) if(index(s, OFS b[i] OFS)) str=str OFS b[i]}
(str!=strt){print str}' file2 file1

awk : compare 2 files with 2 columns

I have to compare 2 files using awk.
Both files have the same structure: path checksum
File1.txt
/content/cr444/commun/ 50d174f143d115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb91
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
File2.txt
/content/cr555/test/ 51d174f14f6115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb78
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
Result expected is a .csv (with separator |):
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
I assume the order of output lines is not important. Then you could:
1. Collect lines from File1.txt into an associative array ($1 -> $2).
2. Process lines from File2.txt:
   - If $1 is in the array from (1), compare the checksums, print accordingly, and delete the entry from the array.
   - If $1 is not in the array from (1), print accordingly.
3. Print all remaining items from the array from (1).
Here's the code:
$ awk 'BEGIN{OFS="|"} NR==FNR{f1[$1]=$2; next} {if ($1 in f1) { print $1,f1[$1],$2,($2==f1[$1]?"":"not ")"same checksum"; delete f1[$1]} else print $1,"",$2,"not in file1"} END{for (i in f1) print i,f1[i],"","not in file2"}' File1.txt File2.txt
Output:
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
One way, using join to merge the two files, and awk to compare the checksums on each line:
$ join -a1 -a2 -11 -21 -e XXXX -o 0,1.2,2.2 <(sort -k1 file1.txt) <(sort -k1 file2.txt) |
awk -v OFS='|' '$2 == "XXXX" { print $1, "", $3, "not in file1"; next }
$3 == "XXXX" { print $1, $2, "", "not in file2"; next }
$2 == $3 { print $1, $2, $3, "same checksum"; next }
{ print $1, $2, $3, "not same checksum" }'
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum

Awk output formatting

I have 2 .po files, and some words in them have 2 different meanings.
I want to use awk to turn them into some kind of translator.
For example
in .po file 1
msgid "example"
msgstr "something"
in .po file 2
msgid "example"
msgstr "somethingelse"
I came up with this
awk -F'"' 'match($2, /^example$/) {printf "%s", $2": ";getline; printf "%s", $2}' file1.po file2.po
The output will be
example:something example:somethinelse
How do I make it into this kind of format
example : something, somethingelse.
Reformatting
example:something example:somethinelse
into
example : something, somethingelse
can be done with this one-liner:
awk -F":| " -v OFS="," '{printf "%s:", $1; for (i=1;i<=NF;i++) if (i % 2 == 0)printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
Testing:
$ echo "example:something example:somethinelse example:something3 example:something4" | \
awk -F":| " -v OFS="," '{ \
printf "%s:", $1; \
for (i=1;i<=NF;i++) \
if (i % 2 == 0) \
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
example:something,somethinelse,something3,something4
Explanation:
$ cat tst.awk
BEGIN{FS=":| ";OFS=","} # define field sep and output field sep
{ printf "%s:", $1 # print header line "example:"
for (i=1;i<=NF;i++) # loop over all fields
if (i % 2 == 0) # we're only interested in all "even" fields
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))
}
But you could have done the whole thing in one go with something like this (GNU awk is needed for the three-argument match()):
$ cat tst.awk
BEGIN{OFS=","} # set output field sep to ","
NF{ # if NF (i.e. number of fields) > 0
# - to skip empty lines -
if (match($0,/msgid "(.*)"/,a)) id=a[1] # if line matches 'msgid "something",
# set "id" to "something"
if (match($0,/msgstr "(.*)"/,b)) str=b[1] # same here for 'msgstr'
if (id && str){ # if both "id" and "str" are set
r[id]=(id in r)?r[id] OFS str:str # save "str" in array r with index "id".
# if index "id" already exists,
# add "str" preceded by OFS (i.e. "," here)
id=str=0 # after saving, reset "id" and "str"
}
}
END { for (i in r) printf "%s : %s\n", i, r[i] } # print array "r"
and call this like:
awk -f tst.awk *.po
$ awk -F'"' 'NR%2{k=$2; next} NR==FNR{a[k]=$2; next} {print k" : "a[k]", "$2}' file1 file2
example : something, somethingelse

awk: aggregate several lines in only one based on a field value

I would like to aggregate values in a file based on a specific field value which is a kind of group attribute. The ending file should have one line per group.
MWE:
$ head -n4 foo
X;Y;OID;ID;OQTE;QTE;OTYPE;TYPE;Z
603.311;800.928;930;982963;0;XTX;49;comment;191.299
603.512;810.700;930;982963;0;XTX;49;comment;191.341
604.815;802.475;930;982963;0;XTX;49;comment;191.393
601.901;858.701;122;982954;0;XTX;50;comment;194.547
601.851;832.317;122;982954;0;XTX;50;comment;193.733
There are two groups here: 982963 and 982954.
Target:
$ head -n2 bar
CODE;OID;ID;OQTE;QTE;OTYPE;TYPE
"FLW (603.311 800.928 191.299, 603.512 801.700 191.341, 604.815 802.475 191.393)";982963;0;XTX;49;comment
"FLW (601.901 858.701 194.547, 601.851 832.317 193.733)";982954;0;XTX;49;comment
The group field is the fourth field of the foo file. All others may vary.
X Y Z values of each record composing the group should be stored within the FLW parenthesis, following the same order as they appear in the first file lines.
I've tried many things, and as I'm absolutely not an expert with awk yet, this kind of code doesn't work at all:
awk -F ";" 'NR==1 {print "CODE;"$3";"$4";"$5";"$6";"$7";"$8}; NR>1 {a[$4]=a[$4]}END{for(i in a) { print "\"FLW ("$1","$2","$NF")\";"$3";"i""a[i]";"$5";"$6";"$7";"$8 }}' foo
Try:
$ awk -F ";" 'NR==1 {print "CODE;"$3";"$4";"$5";"$6";"$7";"$8}; NR>1 {a[$4]=$5";"$6";"$7";"$8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
CODE;OID;ID;OQTE;QTE;OTYPE;TYPE
"FLW (601.901 858.701 194.547, 601.851 832.317 193.733)";982954;0;XTX;50;comment
"FLW (603.311 800.928 191.299, 603.512 810.700 191.341, 604.815 802.475 191.393)";982963;0;XTX;49;comment
Or, as spread out over multiple lines:
awk -F ";" '
NR==1 {
print "CODE;"$3";"$4";"$5";"$6";"$7";"$8
}
NR>1 {
a[$4]=$5";"$6";"$7";"$8
b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF
}
END{
for(i in a)
printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]
}
' foo
Alternate styles
For one, we can replace ";" with FS:
awk -F";" 'NR==1 {print "CODE;"$3 FS $4 FS $5 FS $6 FS $7 FS $8}; NR>1 {a[$4]=$5 FS $6 FS $7 FS $8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
For another, the first print can also be replaced with a printf:
awk -F";" 'NR==1 {printf "CODE;%s;%s;%s;%s;%s;%s\n",$3,$4,$5,$6,$7,$8}; NR>1 {a[$4]=$5 FS $6 FS $7 FS $8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
Variation
If, as per the comments, the group field is the third, not the fourth, then:
awk -F";" 'NR==1 {print "CODE;"$3 FS $4 FS $5 FS $6 FS $7 FS $8}; NR>1 {a[$3]= $4 FS $5 FS $6 FS $7 FS $8; b[$3]=(b[$3]?b[$3]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo

awk improve command - Count & Sum

I would like your suggestions for improving this command and removing unnecessary work, to reduce its run time.
I am trying to find the count of lines and the sum of $6, grouped by $2,substr($3,4,6),substr($4,4,6),$10,$8,$6.
The gzipped input file contains around 300 million lines.
Input.gz
2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
I am using the command below, and it works fine.
zcat Input.gz | awk -F"," '{OFS=","; print $2,substr($3,4,6),substr($4,4,6),$10,$8,$6}' | \
awk -F"," 'BEGIN {count=0; sum=0; OFS=","} {key=$0; a[key]++;b[key]=b[key]+$6} \
END {for (i in a) print i,a[i],b[i]}' >Output.txt
Output.txt
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Any suggestions to improve the above command are welcome.
This seems more efficient:
zcat Input.gz | awk -F, '{key=$2","substr($3,4,6)","substr($4,4,6)","$10","$8","$6;++a[key];b[key]=b[key]+$6}END{for(i in a)print i","a[i]","b[i]}'
Output:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Uncondensed form:
zcat Input.gz | awk -F, '{
key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
++a[key]
b[key] = b[key] + $6
}
END {
for (i in a)
print i "," a[i] "," b[i]
}'
You can do this with one awk invocation by redefining the fields according to the first awk script, i.e. something like this:
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
No need to change $6 as that is the same field. Now if you base the key on the new fields, the second script will work almost unaltered. Here is how I would write it, moving the code into a script file for better readability and maintainability:
zcat Input.gz | awk -f parse.awk
Where parse.awk contains:
BEGIN {
FS = OFS = ","
}
{
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
key = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
a[key]++
b[key] += $6
}
END {
for (i in a)
print i, a[i], b[i]
}
You can of course still run it as a one-liner, but it will look more cryptic:
zcat Input.gz | awk '{ key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6; a[key]++; b[key]+=$6 } END { for (i in a) print i,a[i],b[i] }' FS=, OFS=,
Output in both cases:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150