Using AWK to extract one column from a tab separated file

I know this is a simple question, but the awk command is literally melting my brain. I have a tab-separated file "inputfile.gtf" and I need to extract one column from it and put it into a new file "newfile.tsv". I cannot for the life of me figure out the proper syntax to do this with awk. Here is what I've tried:
awk -F, 'BEGIN{OFS="/t"} {print $8}' inputfile.gtf > newfile.tsv
also
awk 'BEGIN{OFS="/t";FS="/t"};{print $8}' inputfile.gtf > newfile.tsv
Both of these just give me an empty file. Everywhere I search, people seem to have completely different ways of trying to achieve this simple task, and at this point I am completely lost. Any help would be greatly appreciated. Thanks.

Why not something simpler:
awk -F'\t' '{print $8}' inputfile.gtf > newfile.tsv

You have specified the wrong delimiter: /t is not a tab. The tab character is written as \t:
awk 'BEGIN{ FS=OFS="\t" }{ print $8 }' inputfile.gtf > newfile.tsv

Your 1st command :
awk -F, 'BEGIN{OFS="/t"} {print $8}' inputfile.gtf > newfile.tsv
You are setting -F, which is wrong here because your file is not comma-separated.
Next, OFS="/t" is incorrect; it should be OFS="\t". But you don't need it at all: you are printing only a single field, and OFS is not involved unless you print at least two fields.
Your 2nd command :
awk 'BEGIN{OFS="/t";FS="/t"};{print $8}' inputfile.gtf > newfile.tsv
Again, it's not /t; it should be \t. Also, FS="\t" is equivalent to -F "\t".
What you actually need is :
awk -F"\t" '{print $8}' inputfile.gtf > newfile.tsv
or
awk -v FS="\t" '{print $8}' inputfile.gtf > newfile.tsv
And if your file uses only tabs as separators, none of your fields contain spaces, and no field is empty, then you can rely on awk's default whitespace splitting and simply use:
awk '{print $8}' inputfile.gtf > newfile.tsv
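To see why the explicit tab separator is usually the safer choice, you can try both forms on a tiny file. This is only a sketch with made-up sample data (not a real GTF file); the temporary file and the variable names are assumptions for illustration:

```shell
# Build a one-line tab-separated sample: fields are "a", "b c", "" and "d".
tmp=$(mktemp)
printf 'a\tb c\t\td\n' > "$tmp"

# Explicit tab separator: the space inside "b c" is kept as part of field 2.
with_tab=$(awk -F'\t' '{print $2}' "$tmp")

# Default splitting: any run of blanks separates fields, so field 2 is just "b".
default_split=$(awk '{print $2}' "$tmp")

echo "with -F'\\t': $with_tab"
echo "default:     $default_split"
rm -f "$tmp"
```

With default splitting the empty third field also disappears, so later column numbers shift.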

Related

How to extract string from a file in bash

I have a file called DB_create.sql which has this line
CREATE DATABASE testrepo;
I want to extract only testrepo from this. So I've tried
cat DB_create.sql | awk '{print $3}'
This gives me testrepo;
I need only testrepo. How do I get this ?
With your shown samples, please try following.
awk -F'[ ;]' '{print $(NF-1)}' DB_create.sql
OR
awk -F'[ ;]' '{print $3}' DB_create.sql
OR without setting any field separators try:
awk '{sub(/;$/,"");print $3}' DB_create.sql
A simple explanation: set the field separator to a space or a semicolon, then print the second-to-last field, $(NF-1), which is what the OP needs here. Also, you don't need the cat command with awk, because awk can read the input file by itself.
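If it helps, here is a small sketch of what awk sees with that separator. The trailing ; produces an empty last field, which is why the second-to-last field is the one you want. Note that the parentheses matter: $(NF-1) is a field reference, while $NF-1 is arithmetic on the last field. The variable names are just for illustration:

```shell
line='CREATE DATABASE testrepo;'

# Splitting on a space or ";" gives 4 fields; the last one is empty.
nf=$(printf '%s\n' "$line" | awk -F'[ ;]' '{print NF}')

# $(NF-1) is field number NF-1, i.e. field 3 here.
second_last=$(printf '%s\n' "$line" | awk -F'[ ;]' '{print $(NF-1)}')

# $NF-1 means "value of the last field, minus 1": the empty field counts as 0.
arith=$(printf '%s\n' "$line" | awk -F'[ ;]' '{print $NF-1}')

echo "NF=$nf  \$(NF-1)=$second_last  \$NF-1=$arith"
```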
Using gnu awk, you can set record separator as ; + line break:
awk -v RS=';\r?\n' '{print $3}' file.sql
testrepo
Or using any POSIX awk, just do a call to sub to strip trailing ;:
awk '{sub(/;$/, "", $3); print $3}' file.sql
testrepo
You can use
awk -F'[;[:space:]]+' '{print $3}' DB_create.sql
where the field separator is set to a [;[:space:]]+ regex that matches one or more occurrences of ; or/and whitespace chars. Then, Field 3 will contain the string you need without the semi-colon.
More pattern details:
[ - start of a bracket expression
; - a ; char
[:space:] - any whitespace char
] - end of the bracket expression
+ - a POSIX ERE one or more occurrences quantifier.
Use your own code, but add a call to sub():
cat DB_create.sql | awk '{sub(/;$/, "",$3);print $3}'
Although it's better not to use cat. Here you can see why: Comparison of cat pipe awk operation to awk command on a file
So it's better this way:
awk '{sub(/;$/, "",$3);print $3}' file

AWK:Remove multiple columns and retain the column delimiters [duplicate]

This command works. It outputs the field separator (in this case, a comma):
$ echo "hi,ho"|awk -F, '/hi/{print $0}'
hi,ho
This command has strange output (it is missing the comma):
$ echo "hi,ho"|awk -F, '/hi/{$2="low";print $0}'
hi low
Setting the OFS (output field separator) variable to a comma fixes this case, but it really does not explain this behaviour.
Can I tell awk to keep the OFS?
When you modify a field, awk reconstructs $0 from all the fields and puts the value of OFS between them, which by default is a single space. You modified the value of $2, which forced awk to rebuild $0.
When you print the line as is using $0, as in your first case, you did not modify any fields, so awk did not rebuild the record and the original field separator is preserved.
To keep the separator after modifying a field, set OFS to match it, in any of these ways:
BEGIN block:
$ echo "hi,ho" | awk 'BEGIN{FS=OFS=","}/hi/{$2="low";print $0}'
hi,low
Using -v option:
$ echo "hi,ho" | awk -F, -v OFS="," '/hi/{$2="low";print $0}'
hi,low
Defining at the end of awk:
$ echo "hi,ho" | awk -F, '/hi/{$2="low";print $0}' OFS=","
hi,low
Your first example does not change anything, so the line is printed exactly as it was read.
In the second example, you change the line, so it is rebuilt using the default OFS, which is one space.
So to overcome this:
echo "hi,ho"|awk -F, '/hi/{$2="low";print $0}' OFS=","
hi,low
In your BEGIN action, set OFS = FS.
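The behaviour described in these answers can be checked side by side. This is a minimal sketch; the variable names are only for illustration. The last line uses the common $1=$1 idiom, which assigns a field to itself purely to force awk to rebuild $0 with the current OFS:

```shell
# No field is modified, so $0 is printed exactly as it was read.
untouched=$(echo "hi,ho" | awk -F, '{print $0}')

# Assigning to a field rebuilds $0 using OFS (a single space by default).
rebuilt=$(echo "hi,ho" | awk -F, '{$2="low"; print $0}')

# Setting OFS to the same value as FS keeps the comma after the rebuild.
kept=$(echo "hi,ho" | awk 'BEGIN{FS=OFS=","} {$2="low"; print $0}')

# $1=$1 changes nothing but still triggers the rebuild: commas become tabs.
converted=$(echo "hi,ho" | awk -F, -v OFS='\t' '{$1=$1; print}')

printf '%s\n' "$untouched" "$rebuilt" "$kept" "$converted"
```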

awk, print all columns and add new column with substr

I have this table
USI,Name,2D-3D
RO0001,Patate,2D
RO0002,Haricot,3D
RO0003,Banane,2D
RO0004,Pomme,2D
RO0005,Poire,2D
and I want this
USI,Name,2D-3D
RO0001,Patate,2D,RO_2D_Patate
RO0002,Haricot,3D,RO_3D_Haricot
RO0003,Banane,2D,RO_2D_Banane
RO0004,Pomme,2D,RO_2D_Pomme
RO0005,Poire,2D,RO_2D_Poire
I managed to obtain the construction "RO_2D_Patate" with awk:
awk -F "," '{print substr($1,1,2)"_"substr($3,1,2)"_"$2}' Test4.txt
But I want to print all my columns ($0) before it, as in my second table.
I tried everything, but I am still a novice!
Any ideas?
awk -F, '{print $0 (NR>1 ? FS substr($1,1,2)"_"$3"_"$2 : "")}' Test4.txt
$ awk -F, -v OFS=, 'NR>1{$4=substr($1,1,2)"_"$3"_"$2}1' Test4.txt
USI,Name,2D-3D
RO0001,Patate,2D,RO_2D_Patate
RO0002,Haricot,3D,RO_3D_Haricot
RO0003,Banane,2D,RO_2D_Banane
RO0004,Pomme,2D,RO_2D_Pomme
RO0005,Poire,2D,RO_2D_Poire
awk -F, 'NR>1{print $0,substr($1,1,2)"_"$NF"_"$2}/USI/' OFS=, file
USI,Name,2D-3D
RO0001,Patate,2D,RO_2D_Patate
RO0002,Haricot,3D,RO_3D_Haricot
RO0003,Banane,2D,RO_2D_Banane
RO0004,Pomme,2D,RO_2D_Pomme
RO0005,Poire,2D,RO_2D_Poire
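For anyone who wants to try this without the original file, here is a self-contained sketch along the same lines as the answers above. It uses a shortened copy of the table in a temporary file and appends the new value as field NF+1, so it does not hard-code the column count; the temporary file and variable names are assumptions for illustration:

```shell
# Shortened copy of the sample table from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
USI,Name,2D-3D
RO0001,Patate,2D
RO0002,Haricot,3D
EOF

# Skip the header (NR>1), append a new last field, then print every line.
# FS and OFS are both "," so the rebuilt record keeps its commas.
result=$(awk 'BEGIN{FS=OFS=","} NR>1 {$(NF+1) = substr($1,1,2) "_" $3 "_" $2} 1' "$tmp")

echo "$result"
rm -f "$tmp"
```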

How to delete first three columns in a delimited file

For example, I have a csv file as follows,
12345432|1346283301|5676438284971|13564357342151697 ...
87540258|1356433301|1125438284971|135643643462151697 ...
67323266|1356563471|1823543828471|13564386436651697 ...
and hundreds more columns, but I want to remove the first three columns and save the result to a new file (writing back to the same file would be even better for me).
This is the result I want.
13564357342151697 ...
135643643462151697 ...
13564386436651697 ...
I have been looking and trying but I am not able to do it. And below is the code I have.
awk -F'|' '{print $1 > "newfile"; sub(/^[^|]+\|/,"")}1' old.csv > new.csv
Appreciate if someone can help me. Thank you.
You can use cut :
cut -f4- -d'|' old.csv > new.csv
@Heng: try:
awk -F"|" '{for(i=4;i<=NF;i++){printf("%s%s",$i,i==NF?"":"|")};print ""}' Input_file
OR
awk -F"|" '{for(i=4;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":"|")};}' Input_file
You can redirect this command's output into a file as needed.
EDIT: to process several input files and write each one's result to its own output file:
awk -F"|" 'FNR==1{++e;fi="REPORT_A1_"e;} {for(i=4;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":"|") > fi}}' Input_file1 Input_file2 Input_file3
This is what you're looking for:
awk -F '|' '{$1=$2=$3=""; print $0}' oldfile > newfile
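But the output will have leading separators, and the remaining fields will be joined with spaces unless OFS is also set to |. So set FS and OFS together in a BEGIN block and add the following substitution:
sub(/^[ \t|]+/,"") (this strips the leading '|' characters left behind by the emptied columns)
awk 'BEGIN{FS=OFS="|"} {$1=$2=$3=""; sub(/^[ \t|]+/,""); print $0}' oldFile > newFile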
awk -F\| '{print $NF}' file >newfile
13564357342151697 ...
135643643462151697 ...
13564386436651697 ...
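To compare the approaches on something small, here is a sketch that drops the first three |-separated columns with both cut and an awk loop and keeps every remaining column, including empty ones. The sample data, temporary file and variable names are made up for illustration:

```shell
# Two sample rows with five columns each; the second has an empty 4th column.
tmp=$(mktemp)
printf '%s\n' '1|2|3|4|5' 'a|b|c||e' > "$tmp"

# cut: keep field 4 through the end of the line.
with_cut=$(cut -d'|' -f4- "$tmp")

# awk: start from field 4 and re-join the remaining fields with the separator.
with_awk=$(awk -F'|' '{out = $4; for (i = 5; i <= NF; i++) out = out FS $i; print out}' "$tmp")

echo "$with_cut"
echo "$with_awk"
rm -f "$tmp"
```

Neither command edits the file in place; to overwrite the original you would write to a temporary file first and then move it over the old one.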