Update tab-delimited file in-place with gawk - gawk

I am trying to add field headers to a file in-place using gawk. The input file is tab-delimited, so I added that to the command. If I substitute gawk -i inplace with just awk, the command runs but the file is not updated. I know awk doesn't have an in-place edit option like sed, but can gawk be used, or is there a better way?
gawk -i inplace '
BEGIN { FS = OFS = "\t" }
NR == 1 {
    $1 = "Chr"
    $2 = "Start"
    $3 = "End"
    $4 = "Gene"
}
1' file
file (input file to update)
chr7 121738788 121738930 AASS
chr7 121738788 121738930 AASS
chr7 121738788 121738930 AASS
desired output
Chr Start End Gene
chr7 121738788 121738930 AASS
chr7 121738788 121738930 AASS
chr7 121738788 121738930 AASS
I was using the SO Q&A awk save modifications in place as a guide but was not able to solve my issue.

awk 'BEGIN {print "Chr\tStart\tEnd\tGene"}1' file > newFile && mv newFile file
Output
Chr Start End Gene
chr7 121738788 121738930 AASS
chr7 121738788 121738930 AASS
chr7 121738788 121738930 AASS
As it seems you're mostly interested in adding a header line, just print it before any input is read (via the BEGIN block). The 1 is a "true" pattern, so all input lines are printed (the default action). You could replace it with the longhand {print $0} if you want code that non-awk-gurus will understand.
Even with the -i inplace option, the program does the same as awk 'code' file > newFile && mv newFile file behind the scenes, so there is no savings in processing when adding a header to a file. The file has to be rewritten in either case.
IHTH

It'd be more efficient to just do:
cat - file <<<$'Chr\tStart\tEnd\tGene' > newfile && mv newfile file
with no awk involvement at all.
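The <<< herestring is a bash feature; a POSIX-sh sketch of the same idea (same assumed header and temporary file name) would be:

```shell
# Write the header first, then append the original data and swap the files.
printf 'Chr\tStart\tEnd\tGene\n' > newfile
cat file >> newfile && mv newfile file
```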

Related

awk :: how to find matching words in two files

Some good folk here on StackOverflow helped me find common lines in two files using awk:
awk 'NR==FNR{a[tolower($0)]; next} tolower($0) in a' 1.txt 2.txt
But how do I find common words in two files when several words appear on one line?
For example, let's say that I have 1.txt with these words:
apple
orange
butter
flower
And then 2.txt with these words:
dog cat Butter tower
How to return butter or Butter?
I just want to find the common words.
This grep should do the job:
grep -oiwFf 1.txt 2.txt
Butter
Or else this simple GNU awk one-liner would also work:
awk -v RS='[[:space:]]+' 'NR==FNR {w[tolower($1)]; next} tolower($1) in w' 1.txt 2.txt
Butter
Given:
$ cat file1
apple
orange
butter
flower
$ cat file2
dog cat Butter tower
I would write it this way:
awk 'FNR==NR{for(i=1;i<=NF;i++) words[tolower($i)]; next}
{for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
Note there is a field-by-field loop in the FNR==NR case to handle files that may have more than one word per line. If you know that is not the case, you can simplify to:
awk 'FNR==NR{words[tolower($1)]; next}
{for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
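As a quick end-to-end check, the two-file loop above can be exercised with the sample data (recreated here with printf):

```shell
printf 'apple\norange\nbutter\nflower\n' > file1
printf 'dog cat Butter tower\n' > file2
awk 'FNR==NR{for(i=1;i<=NF;i++) words[tolower($i)]; next}
     {for (i=1;i<=NF;i++) if (tolower($i) in words) print $i}
' file1 file2
# Butter
```

Note it prints the word in its original case from file2, since $i (not its lowercased copy) is printed.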
If this is not working on Windows, it may be an issue with \r\n line endings. If awk is using RS="\n", then the \r is left on the last word of each line; butter\r does not match butter.
Try:
awk -v RS='[ \r\n\t]' 'FNR==NR{words[tolower($0)]; next}
tolower($0) in words' file1 file2
Comments on your WSL comments in the link:
There are many workarounds for handling DOS line endings with Unix tools.
Create file1 with DOS line endings this way:
$ printf 'apple\r\norange\r\nbutter\r\nflower\r\n' >file1
Now you can test / see the file has those line endings with cat -v:
$ cat -v file1
apple^M
orange^M
butter^M
flower^M
You can also remove those line endings with sed, perl, awk, etc. Here is awk removing the \r from the files:
$ cat -v <(awk 1 RS='\r\n' ORS='\n' file1)
apple
orange
butter
flower
A sed and perl:
$ cat -v <(sed 's/\r$//' file1)
#same
or
$ cat -v <(perl -0777 -lpe 's/\r\n/\n/g' file1)
etc. Then use that same construct with awk-on-windows:
awk 'your_awk_program' <(awk 1 RS='\r\n' ORS='\n' file1) <(awk 1 RS='\r\n' ORS='\n' file2)
The downside: each input is still treated as a separate logical file, so the FNR==NR test still works, but the awk special variable FILENAME is lost in the process (it names the process-substitution pipe rather than the original file). If you want to keep FILENAME associated with the actual file, preprocess the files before feeding them to awk, or deal with the \r inside your awk script.
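Dealing with the \r inside the script could be sketched like this: strip CRs at the top of the program so FILENAME keeps naming the real files, then apply the usual field-by-field lookup.

```shell
awk '{ gsub(/\r/, "") }                              # drop any CRs before matching
     FNR==NR { for (i=1;i<=NF;i++) words[tolower($i)]; next }
     { for (i=1;i<=NF;i++) if (tolower($i) in words) print FILENAME ": " $i }
' file1 file2
```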
You need to loop over every field of each line (of 2.txt) and check:
awk 'NR==FNR{a[tolower($0)];next}{for(i=1;i<=NF;i++){if(tolower($i) in a){print $i}}}' \
1.txt 2.txt
An alternative way to do this in awk would be to add whitespace to the input record separator when processing the 2nd file:
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' 1.txt RS="[\n ]" 2.txt

awk to remove lines in file with character in it

Trying to use awk to remove each line that has an _ in $5. The formatting of the file makes it look like it's $4, but neither works. I also tried sed '/_/d', but that removed all lines. Thank you :).
file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 114713789 114713988 NRAS_3
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
chr1 247423836 247425609 NLRP3_3
file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
awk
awk -F\t '$4 !~ /_/'
awk -F\t '$5 !~ /_/'
Could you please try the following, written and tested with the shown samples in GNU awk. Note that the lines to delete have only 4 fields, so $5 is empty there; test the last field instead:
awk -F'[[:blank:]]+' '$NF!~/_/' Input_file
Explanation: simply making the [[:blank:]] character class the field separator for all lines of Input_file, then checking whether the last field does NOT contain _; with no action mentioned, the default action prints the line. (The original -F\t attempts failed because the unquoted \t reaches awk as the plain letter t; quote it as -F'\t'.)
2nd solution: or, if it's always the last field in your Input_file, then try the following.
awk '$NF!~/_/' Input_file
You may use $NF as that field is last field in every line:
awk -F '\t' '$NF !~ /_/' file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
Or you can avoid regex:
awk -F '\t' 'index($NF, "_") == 0' file
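For reference, index() returns the 1-based position of the substring, or 0 when it is absent, which is why == 0 means "no underscore". A quick check with values from the sample:

```shell
awk 'BEGIN {
    print index("NLRP3_3", "_")   # 6: "_" is the 6th character
    print index("NLRP3", "_")     # 0: not found
}'
```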

Print line modified and the line after using awk

I want to modify lines in a file using awk and print each new line followed by the line after it.
My file is like this
Name_Name2_ Name3_Name4
ASHRGSJFSJRGDJRG
Name5_Name6_Name7_Name8
ADGTHEGHGTJKLGRTIWRK
I want
Name-Name2
ASHRGSJFSJRGDJRG
Name5-Name6
ADGTHEGHGTJKLGRTIWRK
I have used awk to modify my file:
awk -F'_' '{print $1 "-" $2}' file > newfile
but I don't know how to tell it to also print the line just after (ABDJRH)
Surely it is possible with awk, something like x=NR+1 NR<=x?
thanks
The following awk may help you with the same.
awk -F"_" '/_/{print $1"-"$2;next} 1' Input_file
assuming the structure in your sample (no separator inside the "data" lines):
awk '$0=$1' Input_file
# or with sed
sed 's/[[:space:]].*//' Input_file

find match, print first occurrence and continue until the end of the file - awk

I have a pretty large file from which I'd like to extract only the first line of those containing my match, and keep doing that until the end of the file. Example input and desired output below.
Input
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
S,23,4,7,78,8,9,6
S,3,5,67,8,54,56
S,4,8,9,54,3,4,52
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
S,5,6,4,5,67,87,88
S,4,23,5,8,5,7,3
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
Output
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
The match in my lines is S, which usually comes after a line that starts with either E or C. What I'm struggling with is telling awk to print only the first S line after those with E or C. Another way to put it: print the first of each bunch of lines containing S. Any idea?
Does this one-liner help?
awk '/^S/&&!i{print;i=1}!/^S/{i=0}' file
or more "readable":
awk -v p=1 '/^S/&&p{print;p=0}!/^S/{p=1}' file
You can use sed, like this:
sed -rn '/^(E|C)/{:a;n;/^S/!ba;p}' file
Here's a multi-liner to enter in a file (e.g. u.awk):
/^[CE]/ {ON=1; next}
/^S/ {if (ON) print}
{ON=0}
then run: awk -f u.awk inputdatafile
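Exercised on a shortened version of the sample, the state-machine script picks out exactly the first S line of each group:

```shell
# Recreate u.awk and feed it a trimmed copy of the sample input.
cat > u.awk <<'EOF'
/^[CE]/ {ON=1; next}
/^S/ {if (ON) print}
{ON=0}
EOF
printf 'C,4,2,5,6,8,9,5\nS,4,23,567,2,4,5\nS,23,4,7,78,8,9,6\nE,2,3,213,5,8,44\nS,9,8,9,7,9,32,4\n' |
awk -f u.awk
# S,4,23,567,2,4,5
# S,9,8,9,7,9,32,4
```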
awk to the rescue!
$ awk '/^[CE]/{p=1} /^S/&&p{p=0;print}' file
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
$ awk '/^S/{if (!f) print; f=1; next} {print; f=0}' file
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65

Reading from 2 text files one line at a time in UNIX

I have 2 files, file1 and file2. I am trying to read one line from file1 and one line from file2 and insert HTML tags to make it usable in an HTML file. I have been trying to work with awk with little success. Can someone please help?
File1:
SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem
SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes
File2:
FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt
FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt
Desired output:
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
Using bash:
printItem() { printf "<%s>%s</%s>\n" "$1" "${!1}" "$1"; }
paste file1 file2 |
while read workflow File; do
echo "<ParameterFile>"
printItem workflow
printItem File
done
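The ${!1} in printItem is bash indirect expansion: it expands the variable whose name is stored in $1. A tiny illustration with a made-up value:

```shell
printItem() { printf "<%s>%s</%s>\n" "$1" "${!1}" "$1"; }
workflow="demo.value"        # hypothetical value, just for illustration
printItem workflow           # prints <workflow>demo.value</workflow>
```

This is why the loop variables in the read line must be named after the tags they produce.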
With awk, it would be:
awk '
NR==FNR {workflow[FNR]=$1; next}
{
print "<ParameterFile>"
printf "<workflow>%s</workflow>\n", workflow[FNR]
printf "<File>%s</File>\n", $1
}
' file1 file2
Another approach that does not require storing the first file in memory:
awk '{
print "<ParameterFile>"
print "<workflow>" $0 "</workflow>"
getline < "file2"
print "<File>" $0 "</File>"
}' file1
If you don't mind mixing in some shell:
$ paste -d$'\n' file1 file2 |
awk '{ printf (NR%2 ? "<ParameterFile>\n<workflow>%s</workflow>\n" : "<File>%s</File>\n"), $0 }'
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
otherwise see @GlennJackman's solution for the pure awk way to do it.