How to delete specific words in a line with awk?

I have a text file as shown below. I need only PDB IDs after the > symbol. How can I do this with awk?
>results for sequence "files/1H8U.pdb" starting "ASPILEGLUGLY"
DIEGREKQQPSRVS
>results for sequence "files/1P6K.pdb" starting "ILEALALYSASP"
IAKDVAKEGSDGATKQRTHPQDSASI
Desired output
>1H8U
DIEGREKQQPSRVS
>1P6K
IAKDVAKEGSDGATKQRTHPQDSASI

I would probably use sed for this, but here's the awk:
awk '/^>/ { sub (/[^\/]+\//,">", $0); sub (/\..+/, "", $0) }1' file.txt
Here's the sed:
sed -r '/^>/s%[^/]+/%>%;s%\..+%%' file.txt

This might work for you:
awk -F'[/.]' '/^>/{$1=">"$2;NF=1};1' file
or:
sed '/^>.*\/\([^.]*\)\..*/s//>\1/' file
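As a further sketch in the same spirit, the header can be split on / and . with split() instead of rewriting fields, which avoids relying on assignments to NF. The function name extract_ids is hypothetical, and the sketch assumes every header line has the form shown in the question (one / before the ID and a . right after it).

```shell
# Hypothetical helper: reads FASTA-like text from files or stdin.
extract_ids() {
  awk '
    /^>/ {
      # Split the header on "/" or "."; the ID is the second piece.
      split($0, parts, "[/.]")
      print ">" parts[2]
      next
    }
    { print }   # sequence lines pass through unchanged
  ' "$@"
}

printf '%s\n' \
  '>results for sequence "files/1H8U.pdb" starting "ASPILEGLUGLY"' \
  'DIEGREKQQPSRVS' | extract_ids
```

Run against the file from the question as: extract_ids file.txt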

Related

printing strings and inputs from file

I have file named files.txt:
file1.F
data.dat
image.png
I would like the output file to contain:
IN='file1.F'
IN='data.dat'
IN='image.png'
How can I achieve that? I tried this, but the syntax is wrong:
awk '{print 'IN=\''$1'\''}' files.txt > input
Could you please try the following?
awk -v s1="\047" -v var="IN=" '{print var s1 $0 s1}' Input_file
Output will be as follows.
IN='file1.F'
IN='data.dat'
IN='image.png'
If sed is an option:
sed "s/.*/IN='&'/" file
Output:
IN='file1.F'
IN='data.dat'
IN='image.png'
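Another way to write the awk version, shown only as a sketch: put the octal escape \047 (a single quote) directly in a printf format string, so no quote character has to be passed in from the shell. The function name quote_lines is hypothetical.

```shell
quote_lines() {
  # \047 is the octal escape for a single quote inside an awk string.
  awk '{ printf "IN=\047%s\047\n", $0 }' "$@"
}

printf '%s\n' file1.F data.dat image.png | quote_lines
```

With the file from the question this would be called as: quote_lines files.txt > input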

With sed or awk, move line matching pattern to bottom of file

I have a similar problem. I need to move a line in /etc/sudoers to the end of the file.
The line I am wanting to move:
#includedir /etc/sudoers.d
I have tried with a variable
#creates variable value
templine=$(cat /etc/sudoers | grep "#includedir /etc/sudoers.d")
#delete value
sed '/"${templine}"/d' /etc/sudoers
#write value to the bottom of the file
cat ${templine} >> /etc/sudoers
Not getting any errors nor the result I am looking for.
Any suggestions?
With awk:
awk '$0=="#includedir /etc/sudoers.d"{lastline=$0;next}{print $0}END{print lastline}' /etc/sudoers
That says:
If the line $0 is "#includedir /etc/sudoers.d", then set the variable lastline to this line's value $0 and skip to the next line with next.
If you are still here, print the line with {print $0}.
Once every line in the file has been processed, print whatever is in the lastline variable.
Example:
$ cat test.txt
hi
this
is
#includedir /etc/sudoers.d
a
test
$ awk '$0=="#includedir /etc/sudoers.d"{lastline=$0;next}{print $0}END{print lastline}' test.txt
hi
this
is
a
test
#includedir /etc/sudoers.d
You could do the whole thing with sed:
sed -e '/#includedir .etc.sudoers.d/ { h; $p; d; }' -e '$G' /etc/sudoers
This might work for you (GNU sed):
sed -n '/regexp/H;//!p;$x;$s/.//p' file
This removes line(s) containing a specified regexp and appends them to the end of the file.
To only move the first line that matches the regexp, use:
sed -n '/regexp/{h;$p;$b;:a;n;p;$!ba;x};p' file
This uses a loop to read/print the remainder of the file and then append the matched line.
If you have multiple entries which you want to move to the end of the file, you can do the following:
awk '/regex/{a[++c]=$0;next}1;END{for(i=1;i<=c;++i) print a[i]}' file
or
sed -n '/regex/!{p;ba};H;:a;${x;s/.//;p}' file
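One edge case of the single-line awk approach is that END{print lastline} emits an empty line when the target line is not present at all. The following is a minimal sketch that guards against this with a flag. The function name move_to_end and the variables target and found are hypothetical, and the sketch assumes the target string contains no backslashes (awk -v interprets escape sequences).

```shell
move_to_end() {
  # $1: exact line to move; remaining arguments: input files (stdin if none)
  target=$1
  shift
  awk -v target="$target" '
    $0 == target { found = 1; next }   # drop the line, remember it was seen
    { print }                          # all other lines pass through
    END { if (found) print target }    # re-emit it once, only if it was seen
  ' "$@"
}

printf '%s\n' hi '#includedir /etc/sudoers.d' a test |
  move_to_end '#includedir /etc/sudoers.d'
```

Like the one-liners above, this writes to standard output and does not modify the input file.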

Why does awk not filter the first column in the first line of my files?

I've got a file with following records:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt;2;CLI001
depots/import/HDN1YYAA_20102018.txt;32;CLI001
depots/import/HDN1YYAA_25102018.txt;1;CAB001
depots/import/HDN1YYAA_50102018.txt;1;CAB001
depots/import/HDN1YYAA_65102018.txt;1;CAB001
depots/import/HDN1YYAA_80102018.txt;2;CLI001
depots/import/HDN1YYAA_93102018.txt;2;CLI001
When I execute following oneliner awk:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR==1){print $1}}END {}'
the output is not what I expected:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
I am supposed to get only the first column instead.
If I run it through all the records:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR>0){print $1}}END {}'
then it will start filtering only after the second line and I get the following output:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_25102018.txt
depots/import/HDN1YYAA_50102018.txt
depots/import/HDN1YYAA_65102018.txt
depots/import/HDN1YYAA_80102018.txt
depots/import/HDN1YYAA_93102018.txt
Does anybody know why awk is skipping only the first line?
I tried deleting the first record, but the behaviour is the same: it still skips the first line.
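First, FS must be set before the first record is read. Written as a pattern, FS=";" is an assignment that runs only after the first record has already been split on whitespace, so it takes effect from the second record onward. It should be: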
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}END {}' filename
You can omit the END block if it is empty:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}' filename
You can use the -F command line argument to set the field delimiter:
awk -F';' '{if(NR==1){print $1}}' filename
Furthermore, awk programs consist of a sequence of CONDITION [{ACTIONS}] elements, so you can omit the if:
awk -F';' 'NR==1 {print $1}' filename
You need to specify the delimiter either in a BEGIN block or as a command-line option:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}'
awk -F ';' '{ if(NR==1){print $1}}'
cut might be better suited here. For all lines:
$ cut -d';' -f1 file
To skip the first line:
$ sed 1d file | cut -d';' -f1
To get the first line only:
$ sed 1q file | cut -d';' -f1
However, at this point it's better to switch to awk.
If you have a large file and are only interested in the first line, it's better to exit early:
$ awk -F';' '{print $1; exit}' file
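The difference between assigning FS in a pattern and assigning it in BEGIN can be seen on a two-line sample. This is only a sketch: the variable names sample, late and early are made up for the illustration.

```shell
sample=$(printf '%s\n' 'a;b;c' 'd;e;f')

# FS assigned in a pattern: the assignment runs after record 1 has
# already been split, so record 1 is typically printed whole.
late=$(printf '%s\n' "$sample" | awk 'FS=";" { print $1 }')

# FS assigned in BEGIN: in effect before any record is read.
early=$(printf '%s\n' "$sample" | awk 'BEGIN { FS=";" } { print $1 }')

printf 'pattern:\n%s\nBEGIN:\n%s\n' "$late" "$early"
```

The BEGIN form (or the equivalent -F';') is the one to use whenever the first record matters.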

Grep part of string after symbol and shuffle columns

I would like to take the number after the - sign and put it as column 2 in my matrix. I know how to grep the string but not how to print it after the text string.
in:
1-967764 GGCTGGTCCGATGGTAGTGGGTTATCAGAACT
3-425354 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC
4-376323 GGCTGGTCCGATGGTAGTGGGTTATCAGAAC
5-221398 GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT
6-180339 TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT
out:
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
awk -F'[[:space:]-]+' '{print $3,$2}' file
Seems like a simple substitution should do the job:
sed -E 's/[0-9]+-([0-9]+)[[:space:]]*(.*)/\2 \1/' file
Capture the parts you're interested in and use them in the replacement.
Alternatively, using awk:
awk 'sub(/^[0-9]+-/, "") { print $2, $1 }' file
Remove the leading digits and - from the start of the line. When this is successful, sub returns the number of substitutions made (1 here), which is true, so the action is performed, printing the second field followed by the first.
Using regex ( +|-) as field separator:
$ awk -F"( +|-)" '{print $3,$2}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
Here is another awk:
$ awk 'split($1,a,"-") {print $2,a[2]}' file
awk '{sub(/^[^-]*-/,"");print $2,$1}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339

find match, print first occurrence and continue until the end of the file - awk

I have a pretty large file from which I'd like to extract only the first line of those containing my match, and then continue doing that until the end of the file. An example of the input and the desired output is below.
Input
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
S,23,4,7,78,8,9,6
S,3,5,67,8,54,56
S,4,8,9,54,3,4,52
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
S,5,6,4,5,67,87,88
S,4,23,5,8,5,7,3
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
Output
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
The match in my lines is S, which usually comes after a line that starts with either E or C. What I'm struggling with is telling awk to print only the first line after those with E or C. Another way would be to print the first of each bunch of lines starting with S. Any ideas?
Does this one-liner help?
awk '/^S/&&!i{print;i=1}!/^S/{i=0}' file
or more "readable":
awk -v p=1 '/^S/&&p{print;p=0}!/^S/{p=1}' file
You can use sed, like this:
sed -rn '/^(E|C)/{:a;n;/^S/!ba;p}' file
Here's a multi-line version to save in a file (e.g. u.awk):
/^[CE]/ {ON=1; next}
/^S/ {if (ON) print}
{ON=0}
Then run: awk -f u.awk inputdatafile
awk to the rescue!
$ awk '/^[CE]/{p=1} /^S/&&p{p=0;print}' file
S,4,23,567,2,4,5
S,9,8,9,7,9,32,4
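If you instead want to keep all other lines and only collapse each run of S lines to its first line:
$ awk '/^S/{if (!f) print; f=1; next} {print; f=0}' file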
C,4,2,5,6,8,9,5
C,4,5,4,5,4,43,6
S,4,23,567,2,4,5
E,2,3,213,5,8,44
E,5,7,9,67,89,33
E,54,526,54,43,53
S,9,8,9,7,9,32,4
E,4,6,4,8,9,32,23
E,43,7,1,78,9,8,65
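The "first of each bunch of S lines" idea from the question can also be sketched by remembering the first field of the previous line. The function name first_of_run and the variable prev are hypothetical. Note that, unlike the E/C-based versions, this also prints an S line that happens to be the very first line of the file.

```shell
first_of_run() {
  # Print a line whose first field is S only when the previous
  # line's first field was not S; then remember the current one.
  awk -F, '$1 == "S" && prev != "S" { print } { prev = $1 }' "$@"
}

printf '%s\n' 'C,4,2' 'S,4,23' 'S,23,4' 'E,2,3' 'S,9,8' 'S,5,6' | first_of_run
```

On this shortened sample only the lines S,4,23 and S,9,8 are printed, one per run.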