Reading from 2 text files one line at a time in UNIX - awk

I have 2 files, file1 and file2. I am trying to read one line from file1 and the corresponding line from file2, and insert HTML tags to make the result usable in an HTML file. I have been trying to work with awk with little success. Can someone please help?
File1:
SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem
SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes
File2:
FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt
FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt
Desired output:
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>

Using bash:
printItem() { printf "<%s>%s</%s>\n" "$1" "${!1}" "$1"; }
paste file1 file2 |
while read workflow File; do
echo "<ParameterFile>"
printItem workflow
printItem File
done
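If the ${!1} inside printItem looks unfamiliar, it is bash's indirect expansion; a minimal sketch (the variable names here are purely illustrative):
# ${!name} expands to the value of the variable whose name is stored in $name
name="workflow"
workflow="SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem"
echo "${!name}"   # prints the workflow value assigned above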
With awk, it would be:
awk '
NR==FNR {workflow[FNR]=$1; next}
{
print "<ParameterFile>"
printf "<workflow>%s</workflow>\n", workflow[FNR]
printf "<File>%s</File>\n", $1
}
' file1 file2
Another approach that does not require storing the first file in memory:
awk '{
print "<ParameterFile>"
print "<workflow>" $0 "</workflow>"
getline < "file2"
print "<File>" $0 "</File>"
}' file1
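A slightly more defensive variant of the same idea (a sketch, not part of the original answer) checks getline's return value, so a missing or shorter file2 does not silently reuse the previous line:
awk '{
    print "<ParameterFile>"
    print "<workflow>" $0 "</workflow>"
    # getline returns 1 on success, 0 at end of file, -1 on error
    if ((getline line < "file2") > 0)
        print "<File>" line "</File>"
}' file1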

If you don't mind mixing in some shell:
$ paste -d$'\n' file1 file2 |
awk '{ printf (NR%2 ? "<ParameterFile>\n<workflow>%s</workflow>\n" : "<File>%s</File>\n"), $0 }'
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
otherwise see @GlennJackman's solution for the pure-awk way to do it.

Related

How to improve the speed of this awk script

I have a large file, say file1.log, that looks like this:
1322 a#gmail.com
2411 b#yahoo.com
and a smaller file, say file2.log, that looks like this:
a#gmail.com
c#yahoo.com
In fact, file1.log contains about 6500000 lines and file2.log contains about 140000.
I want to find all lines in file2.log that do not appear in file1.log. I wrote this awk command:
awk 'NR==FNR{c[$2]++} NR!=FNR && c[$1]==0 {print $0}' file1.log file2.log > result.log
After half an hour or so I found the command still running, and less result.log showed that result.log was empty.
I am wondering whether there is something I can do to make the job run quicker.
Hash the smaller file, file2, into memory. Remember the Tao of Programming, 1.3: How could it be otherwise?
$ awk '
NR==FNR { # hash file2 since it's smaller
a[$0]
next
}
($2 in a) { # if file1 entry found in hash
delete a[$2] # remove it
}
END { # in the end
for(i in a) # print the ones that remain in the hash
print i
}' file2 file1 # mind the order
Output:
c#yahoo.com
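The bare a[$0] works because merely referencing an array element creates its key, which is all the later ($2 in a) membership test needs; a tiny sketch of that idiom:
$ printf 'x\ny\n' | awk '{seen[$0]} END{x=("x" in seen); z=("z" in seen); print x, z}'
1 0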
If you sort the files, you can use comm to print only those lines that are present in the second file but not in the first:
comm -13 <(awk '{ print $2 }' file1.log | sort) <(sort file2.log)
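For a quick feel for what comm -13 keeps, here is a sketch with two tiny pre-sorted throwaway files (the file names are made up):
$ printf 'a#gmail.com\nb#yahoo.com\n' > big.col2.sorted
$ printf 'a#gmail.com\nc#yahoo.com\n' > small.sorted
$ comm -13 big.col2.sorted small.sorted
c#yahoo.com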
I believe the easiest approach is a simple grep pipeline:
grep -Fwof file2 file1 | grep -Fwovf - file2
You can also just extract the second column of file1 and reuse the last part of the command above:
awk '{print $2}' file1 | grep -Fwovf - file2
Or everything in a single awk:
awk '(NR==FNR){a[$2]; next}!($1 in a)' file1 file2

Concatenate the sequence to the ID in fasta file

Here is my input file
>OTU1;size=4;
ATTCCGGGTTTACT
ATTCCTTTTATCGA
ATC
>OTU2;size=10;
CGGATCTAGGCGAT
ACT
>OTU3;size=5;
ATTCCCGGGATCTA
ACTTTTC
The expected output file is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
I've tried the code from Remove line breaks in a FASTA file
but this doesn't work for me, and I am not sure how to modify the code from that post...
Any suggestions? Thanks in advance!
Here is another awk script, using awk's internal record-parsing mechanism.
awk 'BEGIN{RS=">";OFS="";}NR>1{$1=$1;print ">"$0}' input.txt
Output is:
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
Explanation:
awk '
BEGIN { # initialize awk internal variables
RS=">"; # set `RS`=record separator to `>`
OFS=""; # set `OFS`=output field separator to empty string.
}
NR>1 { # handle from 2nd record (1st record is empty).
$1=$1; # touch a field so awk rebuilds $0, joining the fields with OFS
print ">"$0 # print out ">" with computed output line
}' input.txt
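To see why the $1=$1 step matters (a sketch, not from the original answer): with RS=">" and the default FS, every line of a record becomes its own field, and assigning to any field makes awk rebuild $0 with the fields joined by OFS, here the empty string:
$ printf 'OTU1;size=4;\nATTCC\nGGTTT\n' | awk 'BEGIN{RS=">"; OFS=""} {$1=$1; print}'
OTU1;size=4;ATTCCGGTTT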
$ awk '{printf "%s%s", (/^>/ ? ors : ""), $0; ors=ORS} END{print ""}' file
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC
You could also try the following.
awk -v RS=">" 'NR>1{gsub(/\n/,"");print ">"$0}' Input_file
My original attempt was awk -v RS=">" -v FS="\n" -v OFS="" 'NF>1{$1=$1;print ">"$0}' Input_file, but I later saw that dudi boy had already answered with that approach, so I wrote the other (first-mentioned) one instead.
Similar to my answer here:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
{ print ">" name seq }' file1.fasta file2.fasta file3.fasta ...

Get only part of a file name in Awk

I have tried
awk '{print FILENAME}'
And the result was the full path of the file.
I want to get only the file name; for example, from "test/testing.test.txt" I just want "testing", without ".test.txt".
Use -F to delimit by the period and print the first string before that delimiter:
awk -F'.' '{ print $1 }'
Alternatively,
ls -l | awk '{ print $9 }' | awk -F"." '{ print $1 }'
will run through the whole folder.
(there's a fancier way to do it, but that's easy).
Use the sub and/or split functions to extract the part of FILENAME you want.
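For example, a sketch along those lines (assuming the test/testing.test.txt layout from the question):
awk 'FNR==1{
    name = FILENAME
    sub(/.*\//, "", name)    # strip the directory part, leaving "testing.test.txt"
    sub(/\..*$/, "", name)   # strip everything from the first dot, leaving "testing"
    print name
}' test/testing.test.txt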

Awk print string with variables

How do I print a string with variables?
I am trying this:
awk -F ',' '{printf /p/${3}_abc/xyz/${5}_abc_def/}' file
and need this as the output:
/p/APPLE_abc/xyz/MANGO_abc_def/
where ${3} = APPLE
and ${5} = MANGO
printf lets you substitute values into a format string. With this as the test file:
$ cat file
a,b,APPLE,d,MANGO,f
We can use printf to achieve the output you want as follows:
$ awk -F, '{printf "/p/%s_abc/xyz/%s_abc_def/\n",$3,$5;}' file
/p/APPLE_abc/xyz/MANGO_abc_def/
In the printf format string, %s means "insert a value here as a string". We have two occurrences of %s, one for $3 and one for $5.
It's not as readable, but printf isn't strictly necessary here: awk concatenates adjacent strings, so you can place the fields directly between quoted string pieces.
$ cat file.txt
1,2,APPLE,4,MANGO,6,7,8
$ awk -F, '{print "/p/" $3 "_abc/xyz/" $5 "_abc_def/"}' file.txt
/p/APPLE_abc/xyz/MANGO_abc_def/
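If the values live in shell variables rather than in fields of the input, a hedged sketch using awk -v (the p3/p5 names are made up for illustration):
p3=APPLE; p5=MANGO
awk -v a="$p3" -v b="$p5" 'BEGIN{printf "/p/%s_abc/xyz/%s_abc_def/\n", a, b}'
# prints /p/APPLE_abc/xyz/MANGO_abc_def/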

use awk to print a column, adding a comma

I have a file, from which I want to retrieve the first column, and add a comma between each value.
Example:
AAAA 12345 xccvbn
BBBB 43431 fkodks
CCCC 51234 plafad
to obtain
AAAA,BBBB,CCCC
I decided to use awk, so I did
awk '{ $1=$1","; print $1 }'
The problem is: this also adds a comma after the last value, which is not what I want, and I also get a space between values.
How do I remove the comma after the last element, and how do I remove the space? I spent 20 minutes looking at the manual without luck.
$ awk '{printf "%s%s",sep,$1; sep=","} END{print ""}' file
AAAA,BBBB,CCCC
or if you prefer:
$ awk '{printf "%s%s",(NR>1?",":""),$1} END{print ""}' file
AAAA,BBBB,CCCC
or if you like golf and don't mind it being inefficient for large files:
$ awk '{r=r s $1;s=","} END{print r}' file
AAAA,BBBB,CCCC
awk '{print $1","$2","$3}' file_name
This is the shortest I know
Why make it complicated :) (as long as the file is not too large):
awk '{a=NR==1?$1:a","$1} END {print a}' file
AAAA,BBBB,CCCC
For better portability:
awk '{a=(NR>1?a",":"")$1} END {print a}' file
You can do this:
awk 'a++{printf ","}{printf "%s", $1}' file
a++ is evaluated as a condition. On the first row its value is 0 (false), so the comma is not added.
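You can watch the post-increment at work with a quick sketch:
$ printf 'a\nb\nc\n' | awk '{print NR, a++}'
1 0
2 1
3 2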
EDIT:
If you want a newline, you have to add END{printf "\n"}. If you have problems reading in the file, you can also try:
cat file | awk 'a++{printf ","}{printf "%s", $1}'
awk 'NR==1{printf "%s",$1;next;}{printf "%s%s",",",$1;}' input.txt
It says: if it is the first line, print only the first field; for the other lines, first print a comma, then print the first field.
Output:
AAAA,BBBB,CCCC
In this case, a simple cut-and-paste solution:
cut -d" " -f1 file | paste -s -d,
In case somebody like me wants to use awk for cleaning up Docker images:
docker image ls | grep tag_name | awk '{print $1":"$2}'
Surprised that no one is using OFS (the output field separator). Here is probably the simplest solution that sticks with awk and works on Linux and Mac: use -v OFS=, to output with a comma as the delimiter:
$ echo '1:2:3:4' | awk -F: -v OFS=, '{print $1, $2, $4, $3}' generates:
1,2,4,3
It works with a multi-character separator too:
$ echo '1:2:3:4' | awk -F: -v OFS=., '{print $1, $2, $4, $3}' outputs:
1.,2.,4.,3
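One caveat worth adding (not from the original answer): OFS only takes effect when awk rebuilds the record, e.g. when you list fields in print or assign to a field; printing $0 untouched keeps the original separators:
$ echo '1:2:3' | awk -F: -v OFS=, '{print $0}'
1:2:3
$ echo '1:2:3' | awk -F: -v OFS=, '{$1=$1; print $0}'
1,2,3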
Using Perl
$ cat group_col.txt
AAAA 12345 xccvbn
BBBB 43431 fkodks
CCCC 51234 plafad
$ perl -lane ' push(@x,$F[0]); END { print join(",",@x) } ' group_col.txt
AAAA,BBBB,CCCC
$
This can be done very simply, like this:
awk -F',' '{print $1","$1","$2","$3}' inputFile
where the input file is, for example:
1,2,3
2,3,4
and so on.
I used the following because it lists the api-resource names along with it, which is useful if you want to access them directly. I also use an "application" label to find specific apps in a namespace:
kubectl -n ops-tools get $(kubectl api-resources --no-headers=true --sort-by=name | awk '{printf "%s%s",sep,$1; sep=","}') -l app.kubernetes.io/instance=application