Delete text before comma in a delimited field - awk

I have a pipe delimited file where I want to remove all text before a comma in field 9.
Example line:
www.upstate.edu|upadhyap|Prashant K Upadhyaya, MD||General Surgery|http://www.upstate.edu/hospital/providers/doctors/?docID=upadhyap|Patricia J. Numann Center for Breast, Endocrine & Plastic Surgery|Upstate Specialty Services at Harrison Center|Suite D, 550 Harrison Street||Syracuse|NY|13202|
so the targeted field is: |Suite D, 550 Harrison Street|
and I want it to look like: |550 Harrison Street|
So far what I have tried has either deleted information from other fields (usually the name in field 3) or has had no effect.
The .awk script I have been trying to write looks like this:
mv $1 $1.bak4
cat $1.bak4 | awk -F "|" '{
gsub(/*,/,"", $9);
print $0
}' > $1

The pattern argument to gsub is a regex, not a glob, so your * isn't matching what you expect; you want /.*, */ there (the trailing " *" also eats the blank after the comma, so field 9 isn't left with a leading space). You are also going to need to set OFS to | to keep that delimiter on output.
mv "$1" "$1.bak4"
awk 'BEGIN{ FS = OFS = "|" } { gsub(/.*, */, "", $9) } 1' "$1.bak4" > "$1"
I also replaced the verbose print block you had with a true pattern (1) that uses the fact that the default action is print.
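As a quick sanity check, the corrected one-liner can be run against a shortened sample line (the file name and the short field values here are made up for illustration):

```shell
# Build a one-line sample with the target text in field 9 (hypothetical data)
printf 'a|b|c|d|e|f|g|h|Suite D, 550 Harrison Street|j\n' > sample.txt
# Strip everything through the comma, and the space after it, in field 9 only
awk 'BEGIN{ FS = OFS = "|" } { gsub(/.*, */, "", $9) } 1' sample.txt
# prints: a|b|c|d|e|f|g|h|550 Harrison Street|j
```

Because the edit is scoped to $9, commas in other fields (such as the name in field 3 of the real data) are untouched.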

Related

how to ignore action on printing a field value using awk

Hello I have several files whose starting line (or record) follows this format:
cat file_1.txt | grep '>'
> CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome
I want to retrieve the second field of that record, which corresponds to the genus taxonomic category; in this example it is "Kluyvera". So I use this:
awk 'NR==1{print $2}' file.txt
and I got
Kluyvera
The issue is that in some files the second field doesn't correspond to the genus taxonomic category; instead, the genus is preceded by the string "Candidatus":
cat file_2.txt | grep '>'
> NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence
On the record above, "Thioglobus" is the genus of the species, so the awk command above returns "Candidatus".
I want awk to print "this file has candidatus" instead of retrieving the second field for that record.
Let's say you have an input file like this:
cat file
CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome
NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence
You can use awk like this with a conditional print:
awk '{print ($2 == "Candidatus" ? $3 : $2)}' file
Kluyvera
Thioglobus
Or if you want to print a custom string for the Candidatus record then use:
awk '{print ($2 == "Candidatus" ? "this file has candidatus" : $2)}' file
Kluyvera
this file has candidatus
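Since the question mentions several files, the same ternary can be combined with FNR==1 to check the first record of each file in one pass (the file names and shortened header lines below are illustrative, and assume the ">" is attached to the accession as in the asker's real data):

```shell
# Create two tiny sample files whose first record mimics the headers shown
printf '>CP022114.1 Kluyvera georgiana strain YDC799 chromosome\n' > file_1.txt
printf '>NTKC01000006.1 Candidatus Thioglobus sp. MED-G25\n' > file_2.txt
# FNR resets per file, so the NR==1-style check runs once for every input file
awk 'FNR==1 {print FILENAME ": " ($2 == "Candidatus" ? "this file has candidatus" : $2)}' file_1.txt file_2.txt
# prints:
# file_1.txt: Kluyvera
# file_2.txt: this file has candidatus
```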

removing lines with special characters in awk

I have a text file like this:
VAREAKAVVLRDRKSTRLN 2888
ACP*VRWPIYTACGP 292
RDRKSTRLNSSHVVTSRMP 114
VAREA*KAVVLRDRRAHV*T 73
In some rows the 1st column contains a "*". I want to remove all the lines containing that "*". Here is the expected output:
VAREAKAVVLRDRKSTRLN 2888
RDRKSTRLNSSHVVTSRMP 114
to do so, I am using this code:
awk -F "\t" '{ if(($1 == '*')) { print $1 "," $2} }' infile.txt > outfile.txt
This code does not return the expected output. How can I fix it?
You did
awk -F "\t" '{ if(($1 == '*')) { print $1 "," $2} }' infile.txt > outfile.txt
By writing $1 == "*" you are asking "is the first field exactly *?", not "does the first field contain *?". You might use the index function, which returns the position of the match if found, or 0 otherwise. Let infile.txt content be
VAREAKAVVLRDRKSTRLN 2888
ACP*VRWPIYTACGP 292
RDRKSTRLNSSHVVTSRMP 114
VAREA*KAVVLRDRRAHV*T 73
then
awk 'index($1,"*")==0{print $1,$2}' infile.txt
output
VAREAKAVVLRDRKSTRLN 2888
RDRKSTRLNSSHVVTSRMP 114
Note that if you use index rather than a pattern /.../, you do not have to care about characters with special meaning, e.g. `.`. Note also that for the data you have, you do not have to set the field separator (FS) explicitly. Important: ' is not a legal string delimiter in GNU AWK; you should use " for that purpose, unless your intent is to summon hard-to-find bugs.
(tested in gawk 4.2.1)
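The literal-match behavior of index can be seen with a field that contains a regex metacharacter (the two sample rows below are made up for illustration):

```shell
# index() searches for the needle literally, so "." needs no escaping;
# only rows whose first field lacks a literal "." are printed
printf 'a.b 1\naxb 2\n' | awk 'index($1, ".") == 0 {print $1, $2}'
# prints only: axb 2
```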
With your shown samples, please try the following awk program.
awk '$1!~/\*/' Input_file
The above prints the complete line whenever $1 does not contain a literal *. In case you want to print only the 1st and 2nd fields of those lines, try the following:
awk '$1!~/\*/{print $1,$2}' Input_file
Use grep like so to remove the lines that contain a literal asterisk (*). Note that it should be escaped with a backslash (\*) or put in a character class ([*]) to prevent grep from interpreting * as a quantifier meaning 0 or more of the preceding character:
printf 'A*B\nCD\n' | grep -v '[*]'
CD
Here, GNU grep uses the following options:
-v : Print lines that do not match.

Print first column of a file and the subtraction of two columns plus a number, changing the separator

I am trying to print the first column of this file as well as the subtraction between the fifth and fourth columns plus 1. In addition, I want to change the separator from a space to a tab.
This is the file:
A gene . 200 500 y
H gene . 1000 2000 j
T exon 1 550 650 m
U intron . 300 400 o
My expected output is:
A 301
H 1001
T 101
U 101
I've tried:
awk '{print $1'\t'$5-$4+1}' myFile
But my output is not tab separated, in fact, columns are not even separated by spaces.
I also tried:
awk OFS='\t' '{print $1 $5-$4+1}' myFile
But then I get a syntax error
Do you know how can I solve this?
Thanks!
Could you please try the following, written with your shown samples.
awk 'BEGIN{OFS="\t"} {print $1,(($5-$4)+1)}' Input_file
Explanation: your output is not tab separated because you haven't used a , (comma) between the items in print, so the values are concatenated and printed like A301 and so on. Also, in case you want to set OFS at the variable level in awk, you should use awk -v OFS='\t' '{print $1,(($5-$4)+1)}' Input_file, where -v is important to let awk know that you are defining a variable's value (a TAB here). I have also used parentheses around the subtraction and addition to make it clearer.
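Run against a copy of the sample data (the file name myFile is taken from the question), the command produces the tab-separated result:

```shell
# Recreate the sample input shown in the question
printf 'A gene . 200 500 y\nH gene . 1000 2000 j\nT exon 1 550 650 m\nU intron . 300 400 o\n' > myFile
# Comma between print items emits OFS (a TAB) between the columns
awk 'BEGIN{OFS="\t"} {print $1, (($5-$4)+1)}' myFile
```

Each output line is the first column, a TAB, then $5-$4+1 (e.g. 500-200+1 = 301 for the first row).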

If Field3 is contained in Field1 then Modify Field1 (Awk)

I am trying to implement this logic with Awk:
If $3 is in $1, then replace the "$3 part of $1 plus one blank space" with "" (nothing).
Print this new line and all other lines.
e.g. In my input (below), "Paris" in field $3 is found in field $1, so "Paris " is replaced by "" in field $1.
INPUT FILE
field1|field2|field3
abc Paris Match|xxxx|Paris
aaaaa|yyyyy|London
OUTPUT NEEDED
field1|field2|field3
abc Match|xxxx|Paris
aaaaa|yyyyy|London
CODE TRIED (Not working)
awk ' BEGIN{ FS=OFS="|"}
{
{ if ( $1 ~ /$3/ ) { print $0 } }
} '
$ awk 'BEGIN{FS=OFS="|"} s=index($1,$3){$1=substr($1,1,s-1) substr($1,s+length($3)); gsub(/ +/," ",$1)} 1' file
field1|field2|field3
abc Match|xxxx|Paris
aaaaa|yyyyy|London
The above assumes you don't care if any existing sequences of multiple blanks in $1 are also compressed to single blanks.
Here is one approach:
$ awk 'BEGIN {FS=OFS="|"}
{sub($3,"",$1)}1' file
field1|field2|field3
abc  Match|xxxx|Paris
aaaaa|yyyyy|London
Note that this leaves two spaces between the remaining words, since only $3 itself is removed.
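A small variation of the second approach, deleting "$3 plus one trailing space", avoids the double blank. This is a sketch that assumes $3 contains no regex metacharacters, since sub treats its first argument as a pattern:

```shell
# Sample input from the question
printf 'field1|field2|field3\nabc Paris Match|xxxx|Paris\naaaaa|yyyyy|London\n' > file
# Delete "$3 plus one space" from $1 so no double blank is left behind
awk 'BEGIN{FS=OFS="|"} {sub($3" ","",$1)} 1' file
# prints:
# field1|field2|field3
# abc Match|xxxx|Paris
# aaaaa|yyyyy|London
```

The header line passes through unchanged because "field3 " (with the space) never occurs in $1.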

How to remove field separators in awk when printing $0?

e.g., each row of the file looks like:
1, 2, 3, 4,..., 1000
How can I print out
1 2 3 4 ... 1000
instead?
If you just want to delete the commas, you can use tr:
$ tr -d ',' <file
1 2 3 4 1000
If it is something more general, you can set FS and OFS (read about FS and OFS) in your begin block:
awk 'BEGIN{FS=","; OFS=""} ...' file
You need to set OFS (the output field separator). Unfortunately, this has no effect unless you also modify the record, leading to the rather cryptic:
awk '{$1=$1}1' FS=, OFS=
Although, if you are happy with some additional space being added, you can leave OFS at its default value (a single space), and do:
awk -F, '{$1=$1}1'
and if you don't mind omitting blank lines in the output, you can simplify further to:
awk -F, '$1=$1'
You could also remove the field separators:
awk -F, '{gsub(FS,"")} 1'
Set FS to the input field separator (here a comma plus optional whitespace; \s is a GNU awk extension). Assigning to $1 will then rebuild the record using the output field separator, which defaults to a space:
awk -F',\s*' '{$1 = $1; print}'
See the GNU Awk Manual for an explanation of $1 = $1
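Putting the FS/OFS technique together on a row shaped like the sample (a sketch; -v assignments are equivalent to the trailing FS=/OFS= arguments used above):

```shell
# Split on "comma plus space", rebuild the record via $1 = $1, and print it
printf '1, 2, 3, 4, 1000\n' | awk -v FS=', ' -v OFS=' ' '{$1 = $1} 1'
# prints: 1 2 3 4 1000
```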