How to compare number of lines of two files using Awk - awk

I am new to awk and need to compare the number of lines of two files.
The script shall return true if
lines(f1) == lines(f2)+1
and false otherwise. How can I do that?
Best regards

If it has to be awk:
awk 'NR==FNR{x++} END{ if(x!=FNR){exit 1} }' file1 file2
The variable x is incremented while the first file is read, so it ends up holding the number of lines of file1, while FNR holds the number of lines of file2 when END is reached. The two counts are compared and the script exits with status 0 or 1.
See an example:
user@host:~$ awk 'NR==FNR{x++} END{ if(x!=FNR){exit 1} }' shortfile longfile
user@host:~$ echo $?
1
user@host:~$ awk 'NR==FNR{x++} END{ if(x!=FNR){exit 1} }' samefile samefile
user@host:~$ echo $?
0
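Note that this compares for equality, while the question asked for lines(f1) == (lines(f2)+1). A minimal sketch of that exact check (the sample files here are made up):

```shell
printf 'a\nb\nc\n' > f1   # 3 lines
printf 'a\nb\n'    > f2   # 2 lines
# Exit 0 (shell "true") only when file1 has exactly one line more than file2.
# Caveat: NR==FNR identifies the first file reliably only when file1 is non-empty.
awk 'NR==FNR{x++} END{exit (x==FNR+1 ? 0 : 1)}' f1 f2 && echo true || echo false
```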

Something like this should suit your purposes:
[ oele3110 $] cat line_compare.awk
#!/usr/bin/gawk -f
NR==FNR{
n_file1++;
}
NR!=FNR{
n_file2++;
}
END{
n_file2++;                       # pretend file2 has one extra line
if(n_file1==n_file2){exit(1);}   # exit status 1 when lines(f1) == lines(f2)+1
}
[ oele3110 $] cat f1
1
1
1
1
1
1
[ oele3110 $] cat f2
1
1
1
1
1
[ oele3110 $] cat f3
1
1
1
1
1
[ oele3110 $]
[ oele3110 $] wc -l f*
6 f1
5 f2
5 f3
16 total
[ oele3110 $] ./line_compare.awk f1 f2
[ oele3110 $] echo $?
1
[ oele3110 $] ./line_compare.awk f2 f3
[ oele3110 $] echo $?
0
[ oele3110 $]
Actually, I think I should have asked you to invest a bit more effort before giving you the answer. I'll leave it for now, but next time I won't make the same mistake.
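For comparison, if awk is not strictly required, the same check can be done with wc -l alone (a sketch; the file names are placeholders):

```shell
printf '1\n2\n3\n' > f1   # 3 lines
printf '1\n2\n'    > f2   # 2 lines
# True (exit 0) when f1 has exactly one line more than f2.
if [ "$(wc -l < f1)" -eq "$(( $(wc -l < f2) + 1 ))" ]; then
  echo true
else
  echo false
fi
```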

Related

join 2 files with different number of rows

Good morning,
I have 2 files and I want to join them.
I am using awk but I can use other bash commands.
The problem is that when I run my awk command, some records that are not present in both files do not appear in the final file.
file1
supply_DBReplication, 27336
test_after_upgrade, 0
test_describe_topic, 0
teste2e_funcional, 0
test_latency, 0
test_replication, 0
ticket_dl, 90010356798
ticket_dl.replica_cloudera, 0
traza_auditoria_eventos, 0
Ezequiel1,473789563
Ezequiel2,526210437
Ezequiel3,1000000000
file2
Domimio2,supply_bdsupply-stock-valorado-sherpa
Domimio8,supply_DBReplication
Domimio9,test_after_upgrade
Domimio7,test_describe_topic
Domimio3,teste2e_funcional
,test_latency
,test_replication
,ticket_dl
,ticket_dl.replica_cloudera
,traza_auditoria_eventos
And I wish:
file3
Domimio2,0
Domimio8,27336
Domimio9,0
Domimio7,0
Domimio3,0
NoDomain,0
NoDomain,0
NoDomain,90010356798
NoDomain,0
NoDomain,0
NoDomain,473789563
NoDomain,526210437
NoDomain,1000000000
I am executing this
awk 'NR==FNR {T[$1]=FS $2; next} {print $1 T[$2]}' FS="," file1 file2
But I got:
Domimio2, 0
Domimio8, 27336
Domimio9, 0
Domimio7, 0
Domimio3, 0
, 0
, 0
, 90010356798
, 0
, 23034
, 0
How can I do it?
Thank you
Assumptions:
join criteria: file1.field#1 == file2.field#2
output format: file2.field#1 , file1.field#2
file2 - if field#1 is blank then replace with NoDomain
file2.field#2 - if no match in file1.field#1 then output file2.field#1 + 0
file1.field#1 - if no match in file2.field#2 then output NoDomain + file1.field#2 (sorted by field#2 values)
One GNU awk idea:
awk '
BEGIN { FS=OFS="," }
NR==FNR { gsub(" ","",$2) # strip blanks from field #2
a[$1]=$2
next
}
{ $1 = ($1 == "") ? "NoDomain" : $1 # if file2.field#1 is missing then set to "NoDomain"
print $1,a[$2]+0
delete a[$2] # delete file1 entry so we do not print again in the END{} block
}
END { PROCINFO["sorted_in"]="#val_num_asc" # any entries leftover from file1 (ie, no matches) then sort by value and ...
for (i in a)
print "NoDomain",a[i] # print to stdout
}
' file1 file2
NOTE: GNU awk is required for the use of PROCINFO["sorted_in"]. If sorting of the file1 leftovers is not required, the PROCINFO["sorted_in"]="#val_num_asc" line can be removed from the code.
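If GNU awk is not available, one possible workaround (a sketch; the inlined file1/file2 samples here are cut down from the question) is to pipe only the leftover file1 entries through an external sort from the END block:

```shell
cat > file1 <<'EOF'
supply_DBReplication, 27336
Ezequiel1,473789563
EOF
cat > file2 <<'EOF'
Domimio8,supply_DBReplication
,test_latency
EOF
awk '
BEGIN { FS=OFS="," }
NR==FNR { gsub(" ","",$2); a[$1]=$2; next }   # load file1: name -> count
{ $1 = ($1 == "") ? "NoDomain" : $1           # blank field#1 becomes NoDomain
  print $1, a[$2]+0
  delete a[$2] }                              # drop matched file1 entries
END { fflush()                                # flush rows printed so far
      cmd = "sort -t, -k2,2n"                 # sort leftovers numerically by value
      for (i in a) print "NoDomain", a[i] | cmd
      close(cmd) }
' file1 file2
```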
This generates:
Domimio2,0
Domimio8,27336
Domimio9,0
Domimio7,0
Domimio3,0
NoDomain,0
NoDomain,0
NoDomain,90010356798
NoDomain,0
NoDomain,0
NoDomain,473789563
NoDomain,526210437
NoDomain,1000000000

Counting the number of unique values based on more than two columns in bash

I need to modify the below code to work on more than one column.
Counting the number of unique values based on two columns in bash
awk ' ##Starting awk program from here.
BEGIN{
  FS=OFS="\t"
}
!found[$0]++{ ##If the current line is NOT yet present in the found array then do the following.
  val[$1]++   ##Creating val with the 1st column as index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
  for(i in val){ ##Traversing through array val here.
    print i,val[i] ##Printing i and value of val with index i here.
  }
}
' Input_file ##Mentioning Input_file name here.
I want a table counting how many times each value occurs, across all the DIS columns:
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
550.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if ($i+0 == $i)
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' file
The comparison $i+0 == $i is a check to make sure the column value is numeric.
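A quick way to see which tokens that test accepts (an illustrative one-liner, not part of the original answer; it relies on awk's numeric-string comparison rules, so "041" compares numerically equal to 41 while "abc" and "NA" do not):

```shell
echo '041 20.5 abc NA' |
awk '{ sep=""; for (i=1; i<=NF; i++) if ($i+0 == $i) { printf "%s%s", sep, $i; sep=" " } print "" }'
# -> 041 20.5
```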
If the ordering must be preserved, then you need an additional array b[] to keep the order each number is encountered, e.g.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if ($i~/^[0-9]/) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
> for (i=3;i<=NF;i++)
> if ($i~/^[0-9]/) {
> if (!($i in a))
> b[++n] = $i;
> a[$i]++
> }
> }
> END {
> for (i=1;i<=n;i++)
> print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.
With all due respect to the 2 answers above (by anubhava and David), and taking their solutions as a base, here are 2 variants that add a small tweak: an integer-value check matching the OP's shown samples. Written and tested with the shown samples only.
1st solution: If order doesn't matter in the output, try:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if (int($i))
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' Input_file
2nd solution: If order matters, then based on David's answer, try:
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if (int($i)) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' Input_file
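One caveat with the int($i) tweak (my observation, not from the answers above): it also drops legitimate values whose integer part is 0, such as 0.5, because int(0.5) is 0 and therefore false:

```shell
# Column 3 holds 0.5, which int() rejects; only 041 survives the check.
printf 'p s DISa DISb DISc\np1 m 0.5 NA 041\n' |
awk 'FNR>1 { for (i=3; i<=NF; i++) if (int($i)) print $i }'
# -> 041
```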
Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of fields really matters just pipe the above to awk '{print $2, $1}'.
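Putting that whole pipeline together on a tiny made-up sample (the patients file name and data here are illustrative; an awk with regex RS support, such as GNU awk, is assumed):

```shell
cat > patients <<'EOF'
patient sex DISa DISb
patient1 male 550.1 550.5
patient2 female 041 550.1
EOF
# Split on any whitespace, keep numeric tokens, count them, swap the columns.
awk -v RS='[[:space:]]+' '$0+0 == $0' patients | sort | uniq -c | awk '{print $2, $1}'
```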

How to print something multiple times in awk

I have a file sample.txt that looks like this:
Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I have used awk to extract values for parameters that are relevant for me like this:
paste <(awk '/Indices/ {print $2}' sample.txt) <(awk '/Period size/ {print $3}' sample.txt) <(awk '/Copynumber/ {print $5}' sample.txt) <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)
Output:
2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
Now I want to add the parameter Sequence to every row.
Desired output:
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I want to do this for many files in a loop so I need a solution that would work with a different number of Repeats as well.
$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }
$1 == "Indices:" { ind = $2; next }
$1 == "Period" { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }
isCon { print seq":"ind, per, cpy, $1; isCon=0 }
$ awk -f tst.awk file
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
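Since the OP wants to run this over many files in a loop, a possible wrapper (a sketch; the sample file names and data are made up, and the awk program is the tst.awk from above, inlined):

```shell
# Create a small sample input (illustrative data, same layout as sample.txt).
cat > sample1.txt <<'EOF'
Sequence: chrX_test
Repeat 1
Indices: 10--20 Score: 5
Period size: 3 Copynumber: 2.0 Consensus size: 3
Consensus pattern (3 bp):
ACG
EOF
for f in sample*.txt; do
  awk '
  BEGIN { OFS="\t" }
  $1 == "Sequence:"  { seq = $2; next }
  $1 == "Indices:"   { ind = $2; next }
  $1 == "Period"     { per = $3; cpy = $5; next }
  $1 == "Consensus"  { isCon=1; next }
  isCon { print seq":"ind, per, cpy, $1; isCon=0 }
  ' "$f" > "${f%.txt}.tsv"    # one output file per input file
done
cat sample1.tsv
```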

Concatenate multiple files and create new file based on the value

I have more than 50 files as like this
dell.txt
Name Id Year Value
xx.01 45 1990 2k
SS.01 89 2000 6.0k
Hp.txt
Name Id Year Value
xx.01 48 1994 21k
SS.01 80 2001 2k
Apple.txt
Name Id Year Value
xx.02 45 1990 20k
SS.01 89 2000 60k
kp.03 23 1996 530k
I just need to make a new file as like this
Name Id Year dell Hp Apple
xx.01 45 1990 2k 0 0
xx.01 48 1994 0 21k 0
xx.02 45 1990 0 0 20k
SS.01 80 2001 0 2k 0
SS.01 89 2000 6.0k 0 60k
kp.03 23 1996 0 0 530k
I tried paste for concatenation, but it puts the records in a different order. Is there another way using awk? I used the following code:
$ awk ' FNR==1{ if (!($0 in h)){file=h[$0]=i++} else{file=h[$0];next} } {print >> (file)} ' *.txt
Could you please try the following, written and tested with GNU awk; the output is printed in sorted order.
awk '
FNR==1{
  tfile=FILENAME
  sub(/\..*/,"",tfile)
  file=(file?file OFS:"")tfile
  header=($1 FS $2 FS $3)
  next
}
{
  a[$1 FS $2 FS $3 "#" FILENAME]=$NF
}
END{
  print header,file
  for(i in a){
    split(i,arr,"#")
    if (seen[arr[1]]++) continue
    printf("%s ",arr[1])
    for(j=1;j<=ARGIND;j++){
      key=(arr[1] "#" ARGV[j])
      val=(val?val OFS:"")(key in a?a[key]:0)
    }
    printf("%s\n",val)
    val=""
  }
}
' dell.txt Hp.txt Apple.txt | sort -k1 | column -t
Explanation: Adding detailed explanation for above.
awk '                                  ##Starting awk program from here.
FNR==1{                                ##Checking if this is the 1st line of a file.
  tfile=FILENAME                       ##Copying current Input_file name to tfile.
  sub(/\..*/,"",tfile)                 ##Stripping the extension from tfile.
  file=(file?file OFS:"")tfile         ##Creating file which has all Input_file names in it.
  header=($1 FS $2 FS $3)              ##header has 3 fields in it from the 1st line.
  next                                 ##next will skip all further statements from here.
}
{
  a[$1 FS $2 FS $3 "#" FILENAME]=$NF   ##Creating a with index of 1st, 2nd, 3rd fields "#" Input_file name, value is the last field.
}
END{                                   ##Starting END block of this awk program from here.
  print header,file                    ##Printing header and file variables here.
  for(i in a){                         ##Traversing through a here.
    split(i,arr,"#")                   ##Splitting i into arr on delimiter # here.
    if (seen[arr[1]]++) continue       ##Skipping keys whose row was already printed.
    printf("%s ",arr[1])               ##Printing the key part (first 3 fields) here.
    for(j=1;j<=ARGIND;j++){            ##Running a for loop over all Input_file names.
      key=(arr[1] "#" ARGV[j])         ##Building the lookup key for this Input_file.
      val=(val?val OFS:"")(key in a?a[key]:0) ##Appending this Input_file value to val, or 0 if there is none.
    }
    printf("%s\n",val)                 ##Printing val here with a new line.
    val=""                             ##Nullifying val here.
  }
}
' dell.txt Hp.txt Apple.txt | sort -k1 | column -t ##Mentioning Input_file names, sorting output and then using column -t to align it.
Output will be as follows.
Name Id Year dell Hp Apple
SS.01 80 2001 0 2k 0
SS.01 89 2000 6.0k 0 60k
kp.03 23 1996 0 0 530k
xx.01 45 1990 2k 0 0
xx.01 48 1994 0 21k 0
xx.02 45 1990 0 0 20k
Here is an awk script to join the files as required.
BEGIN { OFS = "\t"}
NR==1 { col[++c] = $1 OFS $2 OFS $3 }
FNR==1 {
split(FILENAME, arr, ".")
f = arr[1]
col[++c] = f
next
}
{
id[$1 OFS $2 OFS $3] = $4
cell[$1 OFS $2 OFS $3 OFS f] = $4
}
END {
for (i=1; i<=length(col); i++) {
printf "%s%s", col[i], OFS
}
printf "%s", ORS
for (i in id) {
printf "%s%s", i, OFS
for (c=2; c<=length(col); c++) {
printf "%s%s", (cell[i OFS col[c]] ? cell[i OFS col[c]] : "0"), OFS
}
printf "%s", ORS
}
}
Usage:
awk -f tst.awk *.txt | sort -nk3
Note that the glob fetches the files in alphabetical order and the arguments order determines the column order of the output. If you want a different column order, you have to order the arguments, for example like below.
Output is a real tab-columned file, but if you want a tab-like look with spaces, pipe to column -t
Testing
Using your sample files and providing their order:
> awk -f tst.awk dell.txt Hp.txt Apple.txt | sort -nk3 | column -t
Name Id Year dell Hp Apple
xx.01 45 1990 2k 0 0
xx.02 45 1990 0 0 20k
xx.01 48 1994 0 21k 0
kp.03 23 1996 0 0 530k
SS.01 89 2000 6.0k 0 60k
SS.01 80 2001 0 2k 0

editing an output file to be delimited with a semicolon when the input file is a CSV - kornshell

My input file is CSV
AED,E ,3.67295,20160105,20:10:00,UAE DIRHAM
ATS,E ,10.9814,20160105,20:10:00,AUSTRIAN SHILLINGS
AUD,A ,0.71525,20160105,20:10:00,AUSTRALIAN DOLLAR
I want to read it in and output it like so:
EUR;1.127650;USD/EUR;EURO;Cash
JPY;124.335000;JPY/USD;JAPANESE YEN;Cash
GBP;1.538050;USD/GBP;BRITISH POUND;Cash
Actual code:
cat $FILE2 | while read a b c d e f
do
echo $a $c $a/USD $f Cash \
| awk -F, 'BEGIN { OFS =";" } {print $1, $2, $3, $4, $5}' >> my_ratesoutput.csv
done
output:
Cash;;;;95 AED/USD UAE DIRHAM
Cash;;;;14 ATS/USD AUSTRIAN SHILLINGS
Cash;;;;25 AUD/USD AUSTRALIAN DOLLAR
Cash;;;;/USD BARBADOS DOLLAR
export IFS=","
semico=';'
FILE=rates.csv
FILE2=rateswork.csv
echo $FILE
rm my_ratesoutput.csv
cp -p $FILE $FILE2
sed 1d $FILE2 > temp.csv
mv temp.csv $FILE2
echo "Currency;Spot Rate;Terms;Name;Curve" >>my_ratesoutput.csv
cat $FILE2 |while read a b c d e f
do
echo $a$semico$c$semico$a/USD$semico$f${semico}Cash >> my_ratesoutput.csv
done
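For reference, the whole read loop can be replaced by a single awk pass over the CSV (a sketch based on the column meanings visible in the sample; the header text and the Cash literal are taken from the desired output, and the sample rates.csv below is made up):

```shell
cat > rates.csv <<'EOF'
CCY,Flag,Rate,Date,Time,Description
AED,E ,3.67295,20160105,20:10:00,UAE DIRHAM
AUD,A ,0.71525,20160105,20:10:00,AUSTRALIAN DOLLAR
EOF
awk -F, '
NR==1 { print "Currency;Spot Rate;Terms;Name;Curve"; next }  # replace the CSV header
{ printf "%s;%s;%s/USD;%s;Cash\n", $1, $3, $1, $6 }          # currency;rate;terms;name;curve
' rates.csv > my_ratesoutput.csv
cat my_ratesoutput.csv
```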