IGNORECASE in file comparison with awk - awk

Here is my code:
'FNR==NR {a[$1 FS $2 FS $3 FS $4]++; next} !a[$1 FS $2 FS $3 FS $5]' comparegam.txt workstudyusers1.csv >noidea6.txt
it is doing exactly what I need except I need to ignorecase. I have tried using IGNORECASE=1 in various places but I cant get it to work. it either fails, gives me zero results, or ignores it all together. I tried using BEGIN [IGNORECASE = 1} with no luck.
any help would be appreciated, I am at a lost. I am running this in a terminal window, not from a bash script yet. that is the end goal
Note: Output has to contain case to match original files
Here is the exact code that I have tried with IGNORECASE:
awk -F, 'IGNORECASE=1 FNR==NR {a[$1 FS $2 FS $3 FS $4]++; next} !a[$1 FS $2 FS $3 FS $5]' comparegam.txt workstudyusers1.csv >noidea6.txt
awk -F, 'FNR==NR {IGNORECASE=1} {a[$1 FS $2 FS $3 FS $4]++; next} !a[$1 FS $2 FS $3 FS $5]' comparegam.txt workstudyusers1.csv >noidea6.txt
awk -F, '{IGNORECASE = 1} FNR==NR {a[$1 FS $2 FS $3 FS $4]++; next} !a[$1 FS $2 FS $3 FS $5]' comparegam.txt workstudyusers1.csv >noidea6.txt
awk -F, 'BEGIN {IGNORECASE = 1} FNR==NR {a[$1 FS $2 FS $3 FS $4]++; next} !a[$1 FS $2 FS $3 FS $5]' comparegam.txt workstudyusers1.csv >noidea6.txt
awk -F, 'FNR==NR {{IGNORECASE=1} a[$1 FS $2 FS $3 FS $4]++; next} !a[$1 FS $2 FS $3 FS $5]' comparegam.txt workstudyusers1.csv >noidea6.txt
and various iterations of these.

'
FNR==NR {
a[tolower($1 FS $2 FS $3 FS $4)]++;
next
}
!a[tolower($1 FS $2 FS $3 FS $5)]
' comparegam.txt workstudyusers1.csv >noidea6.txt

Related

Add a col using file name

I have ~ 100 files, they are a.vcf; b.vcf, d.vcf......
For example:
a.vcf
##contig= ID=chr1,length=249250621
##contig= ID=chr2,length=243199373
##contig= ID=chr3,length=198022430
##contig= ID=chr4,length=191154276
b.vcf
##contig= ID=chr5,length=180915260
##contig= ID=chr6,length=171115067
##contig= ID=chr7,length=159138663
##contig= ID=chr8,length=146364022
##contig= ID=chr9,length=141213431
##contig= ID=chr10,length=135534747
I want to add additional col as the last col, for examples, new file a_a.vcf
a_a.vcf
##contig= ID=chr1,length=249250621 a.vcf
##contig= ID=chr2,length=243199373 a.vcf
##contig= ID=chr3,length=198022430 a.vcf
##contig= ID=chr4,length=191154276 a.vcf
For single vcf file, I used the following code:
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf
Then I want to apply this to all the files in this folder.
for d in *.vcf; do
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a_$d
done
But I found the -zsh replaced $0, How could I fix the problem?
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_a.vcf
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_b.vcf
awk 'NR == 1 {print -zsh name_file; next;}{print -zsh FILENAME;}' a_c.vcf
GNU AWK is not limited to single input file, you might provide multiple files to single awk by using filenames sheared by spaces, in your case try
awk 'FNR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf b.vcf c.vcf
which should give same output as
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' a.vcf
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' b.vcf
awk 'NR == 1 {print $0 " name_file"; next;}{print $0 " " FILENAME;}' c.vcf
Note that I used FNR in place of NR i.e. number of line inside file rather than (global) number of line. As suggested in comments, you might further ameliorate your code exploiting OFS variable as follows
awk 'BEGIN{OFS=" "}FNR == 1 {print $0, "name_file"; next}{print $0, FILENAME}' a.vcf b.vcf c.vcf
If you want to know more about OFS and other read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

awk : compare 2 files with 2 columns

i have to compare 2 files using awk.
The structure of each files is the same : path checksum
File1.txt
/content/cr444/commun/ 50d174f143d115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb91
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
File2.txt
/content/cr555/test/ 51d174f14f6115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb78
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
Result expected is a .csv (with separator |):
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr764/commun||10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
I assume the order of output lines is not important. Then you could:
Collect lines from File1.txt into an indexed array ($1 -> $2)
Process lines from File2.txt:
If $1 is in the indexed array from (1) compare their checksums and print accordingly
If $1 is not in the indexed array from (1), print accordingly
Print all remaining itmes from array (1)
Here's the code:
$ awk 'BEGIN{OFS="|"} NR==FNR{f1[$1]=$2; next} {if ($1 in f1) { print $1,f1[$1],$2,($2==f1[$1]?"":"not ")"same checksum"; delete f1[$1]} else print $1,"",$2,"not in file1"} END{for (i in f1) print i,f1[i],"","not in file2"}' File1.txt File2.txt
Output:
/content/cr555/test/|51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
One way, using join to merge the two files, and awk to compare the checksums on each line:
$ join -a1 -a2 -11 -21 -e XXXX -o 0,1.2,2.2 <(sort -k1 file1.txt) <(sort -k1 file2.txt) |
awk -v OFS='|' '$2 == "XXXX" { print $1, "", $3, "not in file1"; next }
$3 == "XXXX" { print $1, $2, "", "not in file2"; next }
$2 == $3 { print $1, $2, $3, "same checksum"; next }
{ print $1, $2, $3, "not same checksum" }'
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum

awk: aggregate several lines in only one based on a field value

I would like to aggregate values in a file based on a specific field value which is a kind of group attribute. The ending file should have one line per group.
MWE:
$ head -n4 foo
X;Y;OID;ID;OQTE;QTE;OTYPE;TYPE;Z
603.311;800.928;930;982963;0;XTX;49;comment;191.299
603.512;810.700;930;982963;0;XTX;49;comment;191.341
604.815;802.475;930;982963;0;XTX;49;comment;191.393
601.901;858.701;122;982954;0;XTX;50;comment;194.547
601.851;832.317;122;982954;0;XTX;50;comment;193.733
There is two groups here; 982963 and 982954.
Target:
$ head -n2 bar
CODE;OID;ID;OQTE;QTE;OTYPE;TYPE
"FLW (603.311 800.928 191.299, 603.512 801.700 191.341, 604.815 802.475 191.393)";982963;0;XTX;49;comment
"FLW (601.901 858.701 194.547, 601.851 832.317 193.733)";982954;0;XTX;49;comment
The group field is the 4 of the foo file. All other may vary.
X Y Z values of each record composing the group should be stored within the FLW parenthesis, following the same order as they appear in the first file lines.
I've tried many things ans as I'm absolutely not an expert using awk yet, this kind of code doesn't work at all:
awk -F ";" 'NR==1 {print "CODE;"$3";"$4";"$5";"$6";"$7";"$8}; NR>1 {a[$4]=a[$4]}END{for(i in a) { print "\"FLW ("$1","$2","$NF")\";"$3";"i""a[i]";"$5";"$6";"$7";"$8 }}' foo
Try:
$ awk -F ";" 'NR==1 {print "CODE;"$3";"$4";"$5";"$6";"$7";"$8}; NR>1 {a[$4]=$5";"$6";"$7";"$8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
CODE;OID;ID;OQTE;QTE;OTYPE;TYPE
"FLW (601.901 858.701 194.547, 601.851 832.317 193.733)";982954;0;XTX;50;comment
"FLW (603.311 800.928 191.299, 603.512 810.700 191.341, 604.815 802.475 191.393)";982963;0;XTX;49;comment
Or, as spread out over multiple lines:
awk -F ";" '
NR==1 {
print "CODE;"$3";"$4";"$5";"$6";"$7";"$8
}
NR>1 {
a[$4]=$5";"$6";"$7";"$8
b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF
}
END{
for(i in a)
printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]
}
' foo
Alternate styles
For one, we can replace ";" with FS:
awk -F";" 'NR==1 {print "CODE;"$3 FS $4 FS $5 FS $6 FS $7 FS $8}; NR>1 {a[$4]=$5 FS $6 FS $7 FS $8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
For another, the first print can also be replaced with a printf:
awk -F";" 'NR==1 {printf "CODE;%s;%s;%s;%s;%s;%s",$3,$4,$5,$6,$7,$8}; NR>1 {a[$4]=$5 FS $6 FS $7 FS $8; b[$4]=(b[$4]?b[$4]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}' foo
Variation
If, as per the comments, the group field is the third, not the fourth, then:
awk -F";" 'NR==1 {print "CODE;"$3 FS $4 FS $5 FS $6 FS $7 FS $8}; NR>1 {a[$3]= $4 FS $5 FS $6 FS $7 FS $8; b[$3]=(b[$3]?b[$3]", ":"")$1" "$2" "$NF;}END{for(i in a) printf "\"FLW (%s)\";%s;%s\n", b[i], i, a[i]}'

add a new column to the file based on another file

I have two files file1 and file2 as shown below. file1 has two columns and file2 has one column. I want to add second column to the file2 based on file1. How can I do this with awk?
file1
2WPN B
2WUS A
2X83 A
2XFG A
2XQR C
file2
2WPN_1
2WPN_2
2WPN_3
2WUS
2X83
2XFG_1
2XFG_2
2XQR
Desired Output
2WPN_1 B
2WPN_2 B
2WPN_3 B
2WUS A
2X83 A
2XFG_1 A
2XFG_2 A
2XQR C
your help would be appreciated.
awk -v OFS='\t' 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); print $1, a[t] }' file1 file2
Or
awk 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); printf "%s\t%s\n", $1, a[t] }' file1 file2
Output:
2WPN_1 B
2WPN_2 B
2WPN_3 B
2WUS A
2X83 A
2XFG_1 A
2XFG_2 A
2XQR C
You may pass output to column -t to keep it uniform with spaces and not tabs.

awk improve command - Count & Sum

Would like to get your suggestion to improve this command and want to remove unwanted execution to avoid time consumption,
actually i am trying to find CountOfLines and SumOf$6 group by $2,substr($3,4,6),substr($4,4,6),$10,$8,$6.
GunZip Input file contains around 300 Mn rows of lines.
Input.gz
2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
Am using the below command and working fine.
zcat Input.gz | awk -F"," '{OFS=","; print $2,substr($3,4,6),substr($4,4,6),$10,$8,$6}' | \
awk -F"," 'BEGIN {count=0; sum=0; OFS=","} {key=$0; a[key]++;b[key]=b[key]+$6} \
END {for (i in a) print i,a[i],b[i]}' >Output.txt
Output.txt
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Any suggestion to improve the above command are welcome ..
This seems more efficient:
zcat Input.gz | awk -F, '{key=$2","substr($3,4,6)","substr($4,4,6)","$10","$8","$6;++a[key];b[key]=b[key]+$6}END{for(i in a)print i","a[i]","b[i]}'
Output:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Uncondensed form:
zcat Input.gz | awk -F, '{
key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
++a[key]
b[key] = b[key] + $6
}
END {
for (i in a)
print i "," a[i] "," b[i]
}'
You can do this with one awk invocation by redefining the fields according to the first awk script, i.e. something like this:
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
No need to change $6 as that is the same field. Now if you base the key on the new fields, the second script will work almost unaltered. Here is how I would write it, moving the code into a script file for better readability and maintainability:
zcat Input.gz | awk -f parse.awk
Where parse.awk contains:
BEGIN {
FS = OFS = ","
}
{
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
key = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
a[key]++
b[key] += $6
}
END {
for (i in a)
print i, a[i], b[i]
}
You can of course still run it as a one-liner, but it will look more cryptic:
zcat Input.gz | awk '{ key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6; a[key]++; b[key]+=$6 } END { for (i in a) print i,a[i],b[i] }' FS=, OFS=,
Output in both cases:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150