Awk output formatting - awk

I have 2 .po files and some word in there has 2 different meanings
and want to use awk to turn it into some kind of translator
For example
in .po file 1
msgid "example"
msgstr "something"
in .po file 2
msgid "example"
msgstr "somethingelse"
I came up with this
awk -F'"' 'match($2, /^example$/) {printf "%s", $2": ";getline; printf "%s", $2}' file1.po file2.po
The output will be
example:something example:somethinelse
How do I make it into this kind of format
example : something, somethingelse.

Reformatting
example:something example:somethinelse
into
example : something, somethingelse
can be done with this one-liner:
awk -F":| " -v OFS="," '{printf "%s:", $1; for (i=1;i<=NF;i++) if (i % 2 == 0)printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
Testing:
$ echo "example:something example:somethinelse example:something3 example:something4" | \
awk -F":| " -v OFS="," '{ \
printf "%s:", $1; \
for (i=1;i<=NF;i++) \
if (i % 2 == 0) \
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
example:something,somethinelse,something3,something4
Explanation:
$ cat tst.awk
BEGIN{FS=":| ";OFS=","} # define field sep and output field sep
{ printf "%s:", $1 # print header line "example:"
for (i=1;i<=NF;i++) # loop over all fields
if (i % 2 == 0) # we're only interested in all "even" fields
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))
}
But you could have done the whole thing in one go with something like this:
$ cat tst.awk
BEGIN{OFS=","} # set output field sep to ","
NF{ # if NF (i.e. number of fields) > 0
# - to skip empty lines -
if (match($0,/msgid "(.*)"/,a)) id=a[1] # if line matches 'msgid "something",
# set "id" to "something"
if (match($0,/msgstr "(.*)"/,b)) str=b[1] # same here for 'msgstr'
if (id && str){ # if both "id" and "str" are set
r[id]=(id in r)?r[id] OFS str:str # save "str" in array r with index "id".
# if index "id" already exists,
# add "str" preceded by OFS (i.e. "," here)
id=str=0 # after printing, reset "id" and "str"
}
}
END { for (i in r) printf "%s : %s\n", i, r[i] } # print array "r"
and call this like:
awk -f tst.awk *.po

$ awk -F'"' 'NR%2{k=$2; next} NR==FNR{a[k]=$2; next} {print k" : "a[k]", "$2}' file1 file2
example : something, somethingelse

Related

compare and print 2 columns from 2 files in awk ou perl

I have 2 files with 2 million lines.
I need to compare 2 columns in 2 different files and I want to print the lines of the 2 files where there are equal items.
this awk code works, but it does not print lines from the 2 files:
awk 'NR == FNR {a[$3]; next}$3 in a' file1.txt file2.txt
file1.txt
0001 00000001 084010800001080
0001 00000010 041140000100004
file2.txt
2451 00000009 401208008004000
2451 00000010 084010800001080
desired output:
file1[$1]-file2[$1] file1[$2]-file2[$2] $3 ( same on both files )
0001-2451 00000001-00000010 084010800001080
how to do this in awk or perl?
Assuming your $3 values are unique within each input file as shown in your sample input/output:
$ cat tst.awk
NR==FNR {
foos[$3] = $1
bars[$3] = $2
next
}
$3 in foos {
print foos[$3] "-" $1, bars[$3] "-" $2, $3
}
$ awk -f tst.awk file1.txt file2.txt
0001-2451 00000001-00000010 084010800001080
I named the arrays foos[] and bars[] as I don't know what the first 2 columns of your input actually represent - choose a more meaningful name.
With your shown samples, please try following awk code. Fair warning
I haven't tested it yet with millions of lines.
awk '
FNR == NR{
arr1[$3]=$0
next
}
($3 in arr1){
split(arr1[$3],arr2)
print (arr2[1]"-"$1,arr2[2]"-"$2,$3)
delete arr2
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR == NR{ ##checking condition which will be TRUE when first Input_file is being read.
arr1[$3]=$0 ##Creating arr1 array with value of $1 OFS $2 and $3
next ##next will skip all further statements from here.
}
($3 in arr1){ ##checking if $3 is present in arr1 then do following.
split(arr1[$3],arr2) ##Splitting value of arr1 into arr2.
print (arr2[1]"-"$1,arr2[2]"-"$2,$3) ##printing values as per requirement of OP.
delete arr2 ##Deleting arr2 array here.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
If you have two massive files, you may want to use sort, join and awk to produce your output without having to have the first file mostly in memory.
Based on your example, this pipe would do that:
join -1 3 -2 3 <(sort -k3 -n file1) <(sort -k3 -n file2) | awk '{printf("%s-%s %s-%s %s\n",$2,$4,$3,$5,$1)}'
Prints:
0001-2451 00000001-00000010 084010800001080
If your files are that big, you might want to avoid storing the data in memory. It's a whole lot of comparisons, 2 million lines times 2 million lines = 4 * 1012 comparisons.
use strict;
use warnings;
use feature 'say';
my $file1 = shift;
my $file2 = shift;
open my $fh1, "<", $file1 or die "Cannot open '$file1': $!";
while (<$fh1>) {
my #F = split;
open my $fh2, "<", $file2 or die "Cannot open '$file2': $!";
# for each line of file1 file2 is reopened and read again
while (my $cmp = <$fh2>) {
my #C = split ' ', $cmp;
if ($F[2] eq $C[2]) { # check string equality
say "$F[0]-$C[0] $F[1]-$C[1] $F[2]";
}
}
}
With your rather limited test set, I get the following output:
0001-2451 00000001-00000010 084010800001080
Python: tested with 2.000.000 rows each file
d = {}
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
for line in f1:
if not line: break
c0,c1,c2 = line.split()
d[(c2)] = (c0,c1)
for line in f2:
if not line: break
c0,c1,c2 = line.split()
if (c2) in d: print("{}-{} {}-{} {}".format(d[(c2)][0], c0, d[(c2)][1], c1, c2))
$ time python3 comapre.py
1001-2001 10000001-20000001 224010800001084
1042-2013 10000042-20000013 224010800001096
real 0m3.555s
user 0m3.234s
sys 0m0.321s

awk : compare 2 files with 2 columns

i have to compare 2 files using awk.
The structure of each files is the same : path checksum
File1.txt
/content/cr444/commun/ 50d174f143d115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb91
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
File2.txt
/content/cr555/test/ 51d174f14f6115b2d12d09c152a2ca59be7fbb91
/content/cr764/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbb78
/content/cr999/commun/ 10d174f14fd115b2d12d09c152a2ca59be7fbbpp
Result expected is a .csv (with separator |):
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr764/commun||10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
I assume the order of output lines is not important. Then you could:
Collect lines from File1.txt into an indexed array ($1 -> $2)
Process lines from File2.txt:
If $1 is in the indexed array from (1) compare their checksums and print accordingly
If $1 is not in the indexed array from (1), print accordingly
Print all remaining itmes from array (1)
Here's the code:
$ awk 'BEGIN{OFS="|"} NR==FNR{f1[$1]=$2; next} {if ($1 in f1) { print $1,f1[$1],$2,($2==f1[$1]?"":"not ")"same checksum"; delete f1[$1]} else print $1,"",$2,"not in file1"} END{for (i in f1) print i,f1[i],"","not in file2"}' File1.txt File2.txt
Output:
/content/cr555/test/|51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
One way, using join to merge the two files, and awk to compare the checksums on each line:
$ join -a1 -a2 -11 -21 -e XXXX -o 0,1.2,2.2 <(sort -k1 file1.txt) <(sort -k1 file2.txt) |
awk -v OFS='|' '$2 == "XXXX" { print $1, "", $3, "not in file1"; next }
$3 == "XXXX" { print $1, $2, "", "not in file2"; next }
$2 == $3 { print $1, $2, $3, "same checksum"; next }
{ print $1, $2, $3, "not same checksum" }'
/content/cr444/commun/|50d174f143d115b2d12d09c152a2ca59be7fbb91||not in file2
/content/cr555/test/||51d174f14f6115b2d12d09c152a2ca59be7fbb91|not in file1
/content/cr764/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbb91|10d174f14fd115b2d12d09c152a2ca59be7fbb78|not same checksum
/content/cr999/commun/|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|10d174f14fd115b2d12d09c152a2ca59be7fbbpp|same checksum

awk one row substracts the next row if their first two colums are the same

If we would like to substract $17 if their $1 & $2 are the same: input
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
The desired output should be (1.7511-(-1.5812)=3.3323); (-1-(-3)=2)
3.3323
2
First attempt:
awk -F, ' last != $1""$2 && last{ # ONLY When last key "TargetID + Cpd_number"
print C # differs from actual , print line + substraction
C=0} # reset acumulators
{ # This block process each line of infile
C -= $17 # C calc
line=$0 # Line will be actual line without activity
last=$1""$2} # Store the key in orther to track switching
END{ # This block triggers after the complete file read
# to print the last average that cannot be trigger during
# the previous block
print C}' input
It will give the output:
-0.1699
4
The second attempt:
#!/bin/bash
tail -n+2 test > test2 # remove the title/header
awk -F, '$1 == $1 && $2 == $2 {print $17}' test2 >> test3 # print $17 if the $1 and $2 are the same
awk 'NR==1{s=$1;next}{s-=$1}END{print s}' test3
rm test2 test3
test3 will be
1.7511
-1.5812
-1
-3
Output is
7.3323
Could any guru kindly give some comments? Thanks!
You could try the below awk command,
$ awk -F, 'NR==1{next} {var=$1; foo=$2; bar=$17; getline;} $1==var && $2==foo{xxx=bar-$17; print xxx}' file
3.3323
2
awk '
BEGIN { FS = "," }
NR == 1 { next } # skip header line
{ # accumulate totals
if ($1 SUBSEP $2 in a) # if key already exists
a[$1,$2] -= $17 # subtract $17 from value
else # if first appearance of this key
a[$1,$2] = $17 # set value to $17
}
END { # print results
for (x in a)
print a[x]
}
' file

awk improve command - Count & Sum

Would like to get your suggestion to improve this command and want to remove unwanted execution to avoid time consumption,
actually i am trying to find CountOfLines and SumOf$6 group by $2,substr($3,4,6),substr($4,4,6),$10,$8,$6.
GunZip Input file contains around 300 Mn rows of lines.
Input.gz
2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
Am using the below command and working fine.
zcat Input.gz | awk -F"," '{OFS=","; print $2,substr($3,4,6),substr($4,4,6),$10,$8,$6}' | \
awk -F"," 'BEGIN {count=0; sum=0; OFS=","} {key=$0; a[key]++;b[key]=b[key]+$6} \
END {for (i in a) print i,a[i],b[i]}' >Output.txt
Output.txt
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Any suggestion to improve the above command are welcome ..
This seems more efficient:
zcat Input.gz | awk -F, '{key=$2","substr($3,4,6)","substr($4,4,6)","$10","$8","$6;++a[key];b[key]=b[key]+$6}END{for(i in a)print i","a[i]","b[i]}'
Output:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
Uncondensed form:
zcat Input.gz | awk -F, '{
key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
++a[key]
b[key] = b[key] + $6
}
END {
for (i in a)
print i "," a[i] "," b[i]
}'
You can do this with one awk invocation by redefining the fields according to the first awk script, i.e. something like this:
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
No need to change $6 as that is the same field. Now if you base the key on the new fields, the second script will work almost unaltered. Here is how I would write it, moving the code into a script file for better readability and maintainability:
zcat Input.gz | awk -f parse.awk
Where parse.awk contains:
BEGIN {
FS = OFS = ","
}
{
$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8
key = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
a[key]++
b[key] += $6
}
END {
for (i in a)
print i, a[i], b[i]
}
You can of course still run it as a one-liner, but it will look more cryptic:
zcat Input.gz | awk '{ key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6; a[key]++; b[key]+=$6 } END { for (i in a) print i,a[i],b[i] }' FS=, OFS=,
Output in both cases:
0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150

gawk "not equal to", field separator

gawk -F ";" '!($2 == "-") {print $0}' file.csv > out.file
This is my file:
MYH1;Myosin-1;MYH1_HUMAN
CALML6;Calmodulin-like;protein
UNQ6494;-;-
This is the output I want:
MYH1;Myosin-1;MYH1_HUMAN
CALML6;Calmodulin-like;protein
I thought that I understood this, but I get this output (same as input):
MYH1;Myosin-1;MYH1_HUMAN
CALML6;Calmodulin-like;protein
UNQ6494;-;-