Hello, I have several files whose starting line (or record) follows this format:
cat file_1.txt | grep '>'
>CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome
I want to retrieve the second field of that record, which corresponds to the genus taxonomic category; in this example it is "Kluyvera". So I use this:
awk 'NR==1{print $2}' file_1.txt
and I get
Kluyvera
The issue is that in some files the second field doesn't correspond to the genus taxonomic category, because the genus is preceded by the string "Candidatus":
cat file_2.txt | grep '>'
>NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence
In the above record, "Thioglobus" is the genus of the species, so when I try the above awk command it retrieves "Candidatus".
I want awk to print "this file has candidatus" instead of retrieving the second field for that record.
Let's say you have an input file like this:
cat file
CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome
NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence
You can use awk like this with a conditional print:
awk '{print ($2 == "Candidatus" ? $3 : $2)}' file
Kluyvera
Thioglobus
Or if you want to print a custom string for the Candidatus record then use:
awk '{print ($2 == "Candidatus" ? "this file has candidatus" : $2)}' file
Kluyvera
this file has candidatus
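And if, as in the question, only the first record of each file matters, you can combine the same ternary with the NR==1 test from the question. A minimal sketch, assuming the > in your headers is attached to the accession (as shown above) so the genus is still the second field:
awk 'NR==1{print ($2 == "Candidatus" ? "this file has candidatus" : $2)}' file_1.txt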
Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And I am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
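The $1=$1 assignment is what forces awk to rebuild the record with OFS; without it, $0 is printed unchanged. A quick comparison (using the -v notation above on the same three-line input):
$ printf 'one\ntwo\nthree\n' | awk -v FS='\n' -v OFS='|' -v RS='' '1'
one
two
three
$ printf 'one\ntwo\nthree\n' | awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1'
one|two|three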
The awk solutions work great. Here is a tr + sed solution:
tr '\n' '|' < file | sed 's/|$//'
one|two|three
Just flatten it:
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : --NF )'
ASCII | = octal \174 = hex 0x7C. The reason for --NF is that, more often than not, the input includes a trailing newline, which makes the field count one too many and results in
1|2|3|
Both NF=NF and --NF are similar in concept to $1=$1. Empty inputs, whether or not they contain trailing newlines, result in nothing being printed.
At the OFS spot, you can use any delimiter string you like, instead of being constrained by tr, which has inconsistent behavior. For instance:
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
On bsd-tr, \n gets replaced by the Unicode character properly, giving 1高2高3高, but gnu-tr only keeps the leading byte of the character, resulting in
1 \351 2 \351 . . .
For Unicode equivalence classes, bsd-tr works as expected, while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if you attempt equivalence classes with an arbitrary non-ASCII byte, bsd-tr does nothing, while gnu-tr gladly obliges, even if it means slicing straight through UTF-8 characters:
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it the following way, using GNU AWK. Let test.txt content be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: if it is the first line, print the line content without a separator; otherwise print | followed by the line content. printf adds no trailing newline. Note that I assumed test.txt has no trailing newline; if that is not the case, test this solution before applying it.
(tested in gawk 5.0.1)
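If you do want the output to end with a newline (printf never adds one), one small variation, my own addition rather than part of the original answer, is:
awk '{printf NR==1?"%s":"|%s", $0} END{print ""}' test.txt
The END{print ""} emits just the output record separator, i.e. a final newline, once all lines have been joined.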
You can also try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression: ORS stays | except on every third line, and the 3 here matches the number of lines in test.txt, so the separator switches back to a newline only on the last line.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
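If you'd rather not hardcode the line count, the join can also be written with a separator variable (my own variant, not from the linked explanation):
awk '{printf "%s%s", sep, $0; sep="|"} END{print ""}' file
The separator is empty for the first line and | for every line after it, so no trailing | is produced regardless of how many lines the file has.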
With GNU sed, you can pass the -z option so the whole file is read as one NUL-delimited string, and then all you need to do is replace each newline except the one at the end of the string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes in place, add the -i option (mind that the sed command is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt
I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting errors
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How do I correctly use $i inside {print}?
The following single awk command may help you here:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > ("file" i); close("file" i)}}' Input_file
An all awk solution. First test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for iterates from 2 to the last field. If there are more fields than you desire to process, change the NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to fix what you were trying to do:
for i in {2..99}; do
awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note
the -v switch of awk passes shell variables into awk
$x is the column whose number is held in your variable x
Note 2: this is not the fastest solution (a single awk call is fastest), but it corrects your logic. Ideally, take the time to understand awk; it's never wasted time.
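For completeness, a single-call version in the spirit of the other answers here, writing data2 through data99 directly, might look like this (a sketch, not a drop-in tested replacement):
awk '{for(i=2;i<=99;i++) print $1, $i > ("data" i)}' input.txt
If 98 simultaneously open files are a problem in your awk, the close() approach shown in the other answer applies here as well.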
I have a csv file where the third column is a number. Some of the entries don't have a value in this column.
I want to pull 100k blocks from the file, but only entries with a valid value for that column.
I could use split, but how do I make it check that column for a value?
$ cat test.txt
1,2,3,get me
4,5,,skip me
6,7,8,get me
9,10,11,stop before me
$ awk -F, '$3!="" && ++i<=2' test.txt
1,2,3,get me
6,7,8,get me
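To pull a block of 100k valid entries as described in the question, the same counter just needs a larger limit; for example (the output file name here is only an assumption):
awk -F, '$3!="" && ++i<=100000' test.txt > block_1.csv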
If you're trying to verify whether the third field within a record has a value, and output its contents if it does, you could try the following:
awk -F , '{ if($3 != ""){print $3} }'
This could also be written as:
awk -F , '$3 != ""{print $3}'
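If you want the entire matching record rather than just the third field, drop the explicit print; the default action prints the whole line, as in the first answer:
awk -F , '$3 != ""' test.txt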
I have a pipe-delimited file where I want to remove all text before a comma in field 9.
Example line:
www.upstate.edu|upadhyap|Prashant K Upadhyaya, MD||General Surgery|http://www.upstate.edu/hospital/providers/doctors/?docID=upadhyap|Patricia J. Numann Center for Breast, Endocrine & Plastic Surgery|Upstate Specialty Services at Harrison Center|Suite D, 550 Harrison Street||Syracuse|NY|13202|
so the targeted field is: |Suite D, 550 Harrison Street|
and I want it to look like: |550 Harrison Street|
So far what I have tried has either deleted information from other fields (usually the name in field 3) or has had no effect.
The .awk script I have been trying to write looks like this:
mv $1 $1.bak4
cat $1.bak4 | awk -F "|" '{
gsub(/*,/,"", $9);
print $0
}' > $1
The pattern argument to gsub is a regex, not a glob. Your * isn't matching what you expect it to; you want /.*,/ there. You are also going to need to set OFS to | to keep that delimiter.
mv $1 $1.bak4
awk 'BEGIN{ FS = OFS = "|" }{ gsub(/.*,/,"",$9) } 1' $1.bak4 > $1
I also replaced the verbose print line you had with a true pattern (1) that uses the fact that the default action is print.
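As a side note, the sample field comes out as " 550 Harrison Street" with a leading space, since only the text up to and including the comma is removed. If you also want to drop the space after the comma, a small tweak to the regex (my addition, not part of the original answer) takes care of it:
awk 'BEGIN{ FS = OFS = "|" }{ gsub(/.*, */,"",$9) } 1' $1.bak4 > $1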
I want to create a number of files from a much larger file, dividing by columns. For example, the header of my larger file looks like this:
Name Chr Position SNP1A SNP1B SNP1C SNP2A SNP2B SNP2C SNP3A SNP3B SNP3C
and I want to create these files:
Name Chr Position SNP1A SNP1B SNP1C
Name Chr Position SNP2A SNP2B SNP2C
Name Chr Position SNP3A SNP3B SNP3C
I've been trying to use awk, but I'm a bit of a novice with it, so my command currently reads:
for ((i=1; i<=440; i++)); do awk -f printindivs.awk inputfile; done
Where printindivs.awk is:
{print $1 $2 $3 $((3*$i)+1) $((3*$i)+2) $((3*$i)+3))}
The output I'm getting suggests that my way of trying to get the sets of three is wrong: how can I do this?
Thanks
You can do this easily with just a simple awk script:
$ awk '{for(i=4;i<=NF;i+=3)print $1,$2,$3,$i,$(i+1),$(i+2) > ("out"++j)}' file
The output files will be in out[1..n]:
$ cat out1
Name Chr Position SNP1A SNP1B SNP1C
$ cat out2
Name Chr Position SNP2A SNP2B SNP2C
$ cat out3
Name Chr Position SNP3A SNP3B SNP3C
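One caveat: ++j keeps counting across input lines, so on a file with data rows below the header, each new row would start a fresh set of output files. If your real file has many rows, a variant that derives the file number from the column position (my own adjustment, not part of the original answer) keeps every row of a group in the same file:
awk '{for(i=4;i<=NF;i+=3) print $1,$2,$3,$i,$(i+1),$(i+2) > ("out" (i-1)/3)}' file
As with the column-splitting answer earlier on this page, add close() calls if the number of simultaneously open files becomes a problem.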