remove decimal places in string IDs using awk - awk

I want to remove the decimal part from a list of string identifiers:
ENSG00000166224.12
ENSG00000102897.5
ENSG00000168496.3
ENSG00000010295.15
ENSG00000147533.12
ENSG00000119242.4
My desired output will be
ENSG00000166224
ENSG00000102897
ENSG00000168496
ENSG00000010295
ENSG00000147533
ENSG00000119242
I would like to do it with awk; I have been playing with printf but with no success.
UPDATE:
The awk answer setting the field separator to . works well in files with only one column, but what if the file is composed of different columns (strings and float numbers)?
Here is an example:
ENSG00000166224.12 0.0730716237772557 -0.147970450702234
ENSG00000102897.5 0.156405616866614 -0.0398488625782745
ENSG00000168496.3 -0.110396121325736 -0.0147093758392248
How can I remove only the decimal places in the first field?
Thanks

You can set the field separator to the dot and print the first element:
$ awk -F. '{print $1}' file
ENSG00000166224
ENSG00000102897
ENSG00000168496
ENSG00000010295
ENSG00000147533
ENSG00000119242
In sed you would say sed 's/\.[^\.]*$//' file, which will catch everything from the last dot on and remove it.
You would be able to do it with printf if it were just a number; then you could use a format that drops the decimal places. However, since it is an alphanumeric string, it is best to handle it as a string.
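For illustration, if the column really were numeric, a truncating printf would do it (a quick sketch with a made-up value):
$ echo '12.34' | awk '{printf "%d\n", $1}'
12
For the ENSG identifiers, though, treating them as strings, as above, is the right call.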
Update
Use gsub to remove everything from the first . onward in the first field:
$ awk '{gsub(/\..*$/,"",$1)}1' a
ENSG00000166224 0.0730716237772557 -0.147970450702234
ENSG00000102897 0.156405616866614 -0.0398488625782745
ENSG00000168496 -0.110396121325736 -0.0147093758392248

You can also use the sub function:
awk '{sub(/\..*/, "")}1' file
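If the file also has the extra numeric columns from the update, the same sub can be restricted to the first field, much like the gsub answer above (a sketch on the sample data):
$ awk '{sub(/\..*/, "", $1)}1' file
ENSG00000166224 0.0730716237772557 -0.147970450702234
ENSG00000102897 0.156405616866614 -0.0398488625782745
ENSG00000168496 -0.110396121325736 -0.0147093758392248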

Using cut:
$ cut -d. -f1 file
ENSG00000166224
ENSG00000102897
ENSG00000168496
ENSG00000010295
ENSG00000147533
ENSG00000119242

If you are looking for a solution in perl:
perl -pne 's/\..*$//' file.txt
This removes the first dot and everything after it.

Related

Awk - Grep - Match the exact string in a file

I have a file that looks like this
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I need to match the values in the fourth column exactly as they are written. So if I am searching for "Super" I need it to return the line with "Super" only.
ON,111111,TEN000812,Super,7483747483,767,Free
Likewise, if I'm looking for "Super Man" I need that exact line returned.
ON,454644,FRED84848,Super Man,65757,555,Free
I have tried using grep, but grep will match all instances that contain Super. So if I do this:
grep -i "Super" file.txt
It returns all lines, because they all contain "Super"
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I have also tried with awk, and I believe I'm close, but when I do:
awk '$4==Super' file.txt
I still get output like this:
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
I have been at this for hours, and any help would be greatly appreciated at this point.
You were close, or I should say very close. Just set the field delimiter to a comma in your solution and you are all set.
awk 'BEGIN{FS=","} $4=="Super"' Input_file
Also, one more thing about the OP's attempt: when comparing the 4th field with a string value, the string should be wrapped in double quotes (").
Or, in case you want to pass the value to be compared as an awk variable, try the following.
awk -v value="Super" 'BEGIN{FS=","} $4==value' Input_file
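To see why the quotes matter: without them, awk treats Super as an (uninitialized, hence empty) variable, so the 4th field is compared with the empty string rather than with the text Super. A quick check on the sample file:
$ awk 'BEGIN{FS=","} $4==Super' Input_file
This prints nothing, because no 4th field is empty.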
You are quite close actually; you can try:
awk -F, '$4=="Super" {print}' file.txt
I find this form easier to grasp. It is slightly longer than @RavinderSingh13's, though.
-F is the field separator, in this case comma
Next you have a condition followed by an action
The condition checks whether the fourth field is exactly the string Super
If it is, the line is printed
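The same exact comparison also handles the other case from the question, e.g. "Super Man" (assuming the same file.txt as above):
$ awk -F, '$4=="Super Man" {print}' file.txt
ON,454644,FRED84848,Super Man,65757,555,Free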

Delete string from line that matches regex with AWK

I have a file that contains a lot of data like this, and I have to delete everything that matches this regex: [-]+\d+(.*)
Input:
zxczxc-6-9hw7w
qweqweqweqweqwe-18-8c5r6
asdasdasasdsad-11-br9ft
Output should be:
zxczxc
qweqweqweqweqwe
asdasdasasdsad
How can I do this with AWK?
sed might be easier...
$ sed -E 's/-+[0-9].*//' file
note that .* covers +.*
AFAIK awk doesn't support \d, so you could use [0-9] instead. Your regex is correct; you only need to put it in the right awk function.
awk '{sub(/-+[0-9].*/,"")} 1' Input_file
You don't need the extra + sign after [0-9], as it is covered by the .*
Generally, if you want to delete a string that matches a regular expression, all you need to do is substitute it with an empty string. The most straightforward solution is sed, as presented by karafka; the other solution uses awk, as presented by RavinderSingh13.
The overall syntax would look like this:
sed -e 's/ere//g' file
awk '{gsub(/ere/,"")}1' file
where ere is the regular expression. Note that I use g and gsub here to substitute all non-overlapping matches.
Due to the nature of the regular expression in the OP (it ends with .*), the g can be dropped. It also allows us to write a different awk solution that works with field separators:
awk -F '-+[0-9]' '{print $1}' file

awk to filter lines in file by removing pattern

Trying to use awk to remove the IonCode_ plus 4 digits (always 4, though the digits may differ) and leave the file extension. Is the below the best way? Thank you :).
file
1112233 ID_1234_000000-Control_z_zzzz_zz_zz_zz_zz_zz_zzz_zz-zzzz-zzz-zzz_zzzz_zzzz_zzz_zzz_zzz_zzz_zzz.txt
1112231 ID_1234_000000-Control_z_zzzz_zz_zz_zz_zz_zz_zzz_zz-zzzz-zzz-zzz_zzzz_zzzz_zzz_zzz_zzz_zzz_zzz.txt
awk
awk '/_tn_/ {next} gsub("^.*/|_.*$|IonCode_...._", "", $2)' file
current
1112233 000000-Control
1112231 000000-Control
desired
1112233 000000-Control.txt
1112231 000000-Control.txt
Split records by 1+ spaces or underscore, so the 4th field will be the part you're interested in.
awk -F '[[:space:]]+|_' '!/_tn_/{print $1,$4".txt"}' file
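To see why field 4 is the one to print, you can dump the first few fields with the same separator (a quick check against the sample file above):
$ awk -F '[[:space:]]+|_' '{print $1, $2, $3, $4}' file
1112233 ID 1234 000000-Control
1112231 ID 1234 000000-Control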
Could you please try the following. It is the simplest approach I could think of; we could also do it by counting fields, but that would mean hard-coding numbers, so I went with this approach here.
awk '
{
  sub(/[^_]*_/, "", $2)     # drop the first underscore-delimited chunk of the 2nd field (e.g. "ID_")
  sub(/[^_]*_/, "", $2)     # drop the second chunk (e.g. "1234_")
  sub(/_.*/, ".txt", $2)    # replace everything from the next "_" onward with ".txt"
}
1
' Input_file
With sed:
$ sed -E 's/ID_[0-9]{4}_([^_]+).*(\..*)/\1\2/' file
1112233 000000-Control.txt
1112231 000000-Control.txt

Need help with an AWK script

Could you let me know how to print the "user.%" string in the text below using awk?
The value of 'user' is not fixed, and the number of strings inside '( )' is not fixed either.
start user1.table% NOT (%OLD, %2016%) user.% another strings
UPDATE
This comes from SQL processing: $2 is a schema.table, but here the user can use '%' and can also exclude values with the NOT keyword, whose list ends with ')'. The next token is a second schema.table, and that is the one I want to catch.
I think I should parse the string after ')' with a regular expression, but I failed.
Regular expression:
[)]\s+(\S+)
I guess the above expression can be used to catch that string.
How can I apply it in an awk script (not a one-liner)?
If the structure of the query keeps the same, you can use this:
awk -F'[).]' '{print $3".%"}'
I'm using the closing parenthesis or the literal dot as the delimiter. Doing so, the value of interest is in field 3.
While this is simple, it leaves some whitespace in front of user. We can enhance the field separator regex to fix this:
awk -F')[[:space:]]*|[.]' '{print $3".%"}'
Btw, you may use this sed command alternatively:
sed 's/.*)[[:space:]]*\([^.]*\).*/\1.%/'
or if you have GNU grep, use this:
grep -oP '\)\s*\K[^%]*%'
Try this (GNU awk):
awk '{match($0, /[)] +([^ ]+)/, var);print var[1];}'
You need to call match first; the three-argument form used here is a GNU awk extension that stores the captured group in an array.
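If you prefer a standalone awk script rather than a one-liner, and without the GNU-only three-argument match, the same idea can be written with plain match, substr and sub (a sketch; the file name extract.awk is just an example):
#!/usr/bin/awk -f
# print the first token that follows a closing parenthesis, e.g. "user.%"
match($0, /[)][[:space:]]+[^[:space:]]+/) {
    s = substr($0, RSTART, RLENGTH)   # ") user.%"
    sub(/^[)][[:space:]]+/, "", s)    # drop the ")" and the spaces
    print s
}
Run it as awk -f extract.awk file.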
Given your posted sample input, all you need is:
awk '{print $6}'
e.g.:
$ echo 'start user1.table% NOT (%OLD, %2016%) user.% another strings' |
awk '{print $6}'
user.%
If that doesn't work for you then your posted sample input isn't representative enough of your real input so edit your question to include a few lines of truly representative sample input and the expected output given that input.

how to replace a pattern with a string depending on part of the pattern?

I have the following problem. I'm interpreting an input file, and now I'm encountering this:
I need to translate %%BLANKx to x spaces.
So, wherever in the input file I find, for example, %%BLANK8, I need to replace it with 8 spaces, %%BLANK10 with 10 spaces, etc.
You can split your string on the %%BLANK tag.
Then, for each resulting token, read its leading number and convert it into that many spaces.
Finally, concatenate every token into a new string, as sketched below.
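A minimal awk sketch of that idea (assuming every %%BLANK tag is immediately followed by its digit count):
awk '{
    n = split($0, tok, "%%BLANK")            # cut the line at every %%BLANK tag
    out = tok[1]
    for (i = 2; i <= n; i++) {
        match(tok[i], /^[0-9]+/)             # leading number of this token
        w = substr(tok[i], 1, RLENGTH) + 0   # the width, e.g. 8
        rest = substr(tok[i], RLENGTH + 1)   # text after the number
        out = out sprintf("%" w "s", "") rest
    }
    print out
}' file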
perl -pe 's/%%BLANK(\d+)/" " x $1/e' input_file
Try this; I have not tested it exhaustively:
$ awk '/BLANK/{ match($0,/%%BLANK([0-9]+)/,a);s=sprintf("%"a[1]"s","") ; gsub(a[0],s)}1' file
Or Ruby (1.9+):
$ ruby -ne 'print $_.gsub(/%%BLANK(\d+)/){|m|" "*$1.to_i}' file
using "%%BLANK" as record seperator , now if any new record that starts with a number replace the number with spaces.
awk 'BEGIN {RS="%%BLANK";ORS=""}{MatchFound=match($0,"^[0-9]+",Matched_string);if(MatchFound){sub(Matched_string[0],"",$0);for (i=0;i<Matched_string[0];i++){$0=" "$0};print $0}else{print $0}}' InputFile.txt