I have a file whose head looks like this:
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591 nReads=4 cov=3.26022
>PZ7180000024036 nReads=14 cov=5.86079
>PZ15501_A nReads=1 cov=1
I want to add the string _nogroup onto the first column of each line that does not have _XX already designated (i.e. the 1st column on the 1st line is fine but the 1st column on the 2nd line should read >PZ7180000031590_nogroup).
Can I do this using awk? I would like to use the command line.
You can use this awk command:
awk '!($1 ~ /_[a-zA-Z]{2}$/) {$1=$1 "_nogroup"} 1' file
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590_nogroup nReads=3 cov=2.59465
>PZ7180000027934_nogroup nReads=5 cov=2.32231
>PZ456916_nogroup nReads=1 cov=1
>PZ7180000037718_nogroup nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591_nogroup nReads=4 cov=3.26022
>PZ7180000024036_nogroup nReads=14 cov=5.86079
>PZ15501_A_nogroup nReads=1 cov=1
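A variant of the same idea, sketched under two assumptions: that your awk may not support the {2} interval expression (some older builds do not), and that you want the original spacing between columns left untouched (assigning to $1 makes awk rebuild the line with single spaces). The function name add_nogroup is only illustrative.

```shell
# add_nogroup (hypothetical name): append _nogroup to the first column
# unless it already ends in an underscore followed by exactly two letters.
add_nogroup() {
  awk '$1 !~ /_[A-Za-z][A-Za-z]$/ {
         # sub() edits $0 in place; the first run of non-blank characters is $1
         sub(/[^ \t]+/, "&_nogroup")
       }
       1' "$@"
}

printf '%s\n' '>PZ7180000000004_TX nReads=26 cov=9.436' \
              '>PZ7180000031590 nReads=3 cov=2.59465' \
              '>PZ15501_A nReads=1 cov=1' | add_nogroup
```

Called with a file name (add_nogroup file) it reads that file instead of standard input.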
I am trying to split a file column using the substr awk command. So the input is as follows (it consists of 4 lines, one blank line):
#NS500645:122:HYGVMBGX2:4:21402:2606:16446:ACCTAGAAGG:R1
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I want to split the second line at the pattern "GATC", keeping the pattern at the start of the right-hand substring, like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
I want the last line to be split into pieces of the same lengths as the sequence pieces, and to regenerate the file like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTAT
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
GATCC
EEEEE
To split the last column I am using this awk script:
cat prove | paste - - - - | awk 'BEGIN {FS="\t"; OFS="\t"}
{gsub("GATC","\tGATC", $2); split($2, a, "\t");
for (i in a) print substr($4, length(a[i-1])+1, length(a[i-1])+length(a[i]))}'
But the output is as follows:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
The second and third lines are longer than expected.
I checked the calculated values that are passed to the substr command and they look correct:
1 30
31 70
41 45
Using these values the output should be:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEE
But, as shown above, that is not the case.
Any suggestions?
I guess you're looking for something like this, but the formatting of your question is really confusing:
$ awk -v OFS='\t' 'NR==1 {next}
NR==2 {n=index($0,"GATC")}
/^[^+]/ {print substr($0,1,n-1),substr($0,n)}' file
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I assumed your file is in this format:
dummy header line to be ignored
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
+
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
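On the substr() question itself: the third argument of substr() is a length, not an end position, so substr($4, 31, 70) returns up to 70 characters starting at position 31, which here runs to the end of the line. The start offset also has to be the cumulative length of all earlier pieces, not just the previous one. Here is a minimal sketch that produces the interleaved output asked for, assuming 4-line records (header, sequence, "+", quality); the function name split_at_gatc is made up for illustration.

```shell
# split_at_gatc (hypothetical name): cut the sequence line before every "GATC"
# and cut the quality line at the same offsets.
# Assumes 4-line records: header, sequence, "+", quality.
split_at_gatc() {
  awk 'NR % 4 == 2 { seq = $0 }
       NR % 4 == 0 {
         qual = $0
         start = 1
         while (start <= length(seq)) {
           # look for the next GATC after the first base of the current piece
           rest = substr(seq, start + 1)
           p = index(rest, "GATC")
           len = (p ? p : length(seq) - start + 1)
           # third argument of substr() is a length, not an end position
           print substr(seq, start, len)
           print substr(qual, start, len)
           start += len
         }
       }' "$@"
}

printf '%s\n' '@r1' 'AAGATCTTGATCC' '+' 'IIJJJJKKLLLLL' | split_at_gatc
```

For the small record above this prints AA, II, GATCTT, JJJJKK, GATCC, LLLLL, one per line. Header lines are dropped, as in the desired output shown in the question.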
I want to extract the value in each row of a file that comes after AA. I can do this like so:
awk -F'[;=|]' '{for(i=1;i<=NF;i++)if($i=="AA"){print toupper($(i+1));next}}'
This gives me the exact information I need and converts to uppercase, which is exactly what I want to do. How can I do this and then print the entire row with this altered value in its previous position? I am essentially trying to do a find and replace where the value is changed to uppercase.
EDIT:
Here is a sample input line:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP
and here is how I would like the output to look:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=G|||;VT=SNP
All that has changed is the g after AA= is changed to uppercase.
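The following awk command may help:
awk '
{
  # Uppercase only the value that follows "AA=", up to the first "|" or ";".
  # Lines without an AA= entry are printed unchanged.
  if (match($0,/AA=[^;|]*/))
    $0 = substr($0,1,RSTART+2) toupper(substr($0,RSTART+3,RLENGTH-3)) substr($0,RSTART+RLENGTH)
  print
}
' Input_file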
With GNU sed and perl, using word boundaries:
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | sed 's/\bAA=[^;=|]*\b/\U&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | perl -pe 's/\bAA=[^;=|]*\b/\U$&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
\U uppercases the text that follows it, up to the end of the replacement, an \E, or another case modifier.
Use the g modifier if there can be more than one match per line.
I am trying to extract a pattern along with printing the starting string of the line.
Input
Saureus1000(37 genes,10 taxa): Saureus08BA02176_00020(Saureus08BA02176) Saureus1269_00069(Saureus1269) Saureus170_00062(Saureus170) Saureus71193_00020(Saureus71193) SaureusED133_00019(SaureusED133) SaureusED98_00019(SaureusED98) SaureusLGA251_00019(SaureusLGA251) SaureusN305_00605(SaureusN305) SaureusRF122_00019(SaureusRF122) SaureusST398_00020(SaureusST398) Saureus08BA02176_01763(Saureus08BA02176) Saureus08BA02176_01805(Saureus08BA02176) Saureus08BA02176_01808(Saureus08BA02176) Saureus1269_01194(Saureus1269) Saureus1269_01237(Saureus1269) Saureus1269_01240(Saureus1269) Saureus71193_01635(Saureus71193) Saureus71193_01678(Saureus71193) Saureus71193_01681(Saureus71193) SaureusED133_01798(SaureusED133) SaureusED133_01840(SaureusED133) SaureusED133_01843(SaureusED133) SaureusED98_01777(SaureusED98) SaureusED98_01821(SaureusED98) SaureusED98_01824(SaureusED98) SaureusLGA251_01748(SaureusLGA251) SaureusLGA251_01790(SaureusLGA251) SaureusLGA251_01793(SaureusLGA251) SaureusN305_00013(SaureusN305) SaureusN305_00016(SaureusN305) SaureusN305_00059(SaureusN305) SaureusRF122_01807(SaureusRF122) SaureusRF122_01848(SaureusRF122) SaureusRF122_01851(SaureusRF122) SaureusST398_01884(SaureusST398) SaureusST398_01927(SaureusST398) SaureusST398_01930(SaureusST398)
Saureus1001(35 genes,12 taxa): Saureus08BA02176_01441(Saureus08BA02176) Saureus1269_02301(Saureus1269) Saureus1269_02527(Saureus1269) Saureus71193_01310(Saureus71193) SaureusED98_01421(SaureusED98) SaureusED98_01424(SaureusED98) SaureusN305_02184(SaureusN305) SaureusN305_02188(SaureusN305) SaureusN305_02190(SaureusN305) SaureusRF122_01383(SaureusRF122) SaureusRF122_01386(SaureusRF122) SaureusST398_01476(SaureusST398) Saureus08BA02176_01442(Saureus08BA02176) Saureus08BA02176_01443(Saureus08BA02176) Saureus08BA02176_01445(Saureus08BA02176) Saureus1269_02302(Saureus1269) Saureus1269_02529(Saureus1269) Saureus1364_00430(Saureus1364) Saureus170_00571(Saureus170) Saureus170_00574(Saureus170) Saureus302_00352(Saureus302) Saureus302_00556(Saureus302) Saureus71193_01311(Saureus71193) Saureus71193_01312(Saureus71193) Saureus71193_01314(Saureus71193) SaureusED98_01423(SaureusED98) SaureusED98_01426(SaureusED98) SaureusLGA251_01423(SaureusLGA251) SaureusN305_02185(SaureusN305) SaureusN305_02187(SaureusN305) SaureusST398_01477(SaureusST398) SaureusST398_01478(SaureusST398) SaureusST398_01548(SaureusST398) SaureusED133_01465(SaureusED133) Saureus302_01433(Saureus302)
Required output
Saureus1000 Saureus08BA02176_00020
I am using this code, but it does not give the required output on a single line:
awk '{print $1} {for(i=1;i<=NF;i++){if($i~/^Saureus08BA/){print $i}}}' file > test
Output for this command
Saureus1000(37
Saureus08BA02176_00020(Saureus08BA02176)
Saureus08BA02176_01763(Saureus08BA02176)
Saureus08BA02176_01805(Saureus08BA02176)
Saureus08BA02176_01808(Saureus08BA02176)
Saureus1001(35
Saureus08BA02176_01441(Saureus08BA02176)
Saureus08BA02176_01442(Saureus08BA02176)
Saureus08BA02176_01443(Saureus08BA02176)
Saureus08BA02176_01445(Saureus08BA02176)
GNU awk solution (the three-argument form of match() is a gawk extension):
awk 'match($0,/^([^(]+)\([^(]+(Saureus08BA[0-9]+_[0-9]+)/,a){ print a[1],a[2] }' file
([^(]+) - capturing the needed part from the 1st field
(Saureus08BA[0-9]+_[0-9]+) - the 2nd captured group containing the next "Saureus" item
The output:
Saureus1000 Saureus08BA02176_00020
Saureus1001 Saureus08BA02176_01441
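If gawk is not available, or the wanted item is not the first one after the colon, something along these lines should also work in any POSIX awk. It is only a sketch: the function name first_hit and the idea of passing the prefix as an argument are my own additions, and it assumes the members are whitespace-separated as in the sample.

```shell
# first_hit (hypothetical name): print the cluster name and the first member
# whose ID starts with the given prefix, one line per cluster.
first_hit() {
  prefix=$1; shift
  awk -v prefix="$prefix" '{
    name = $1
    sub(/\(.*/, "", name)          # "Saureus1000(37" -> "Saureus1000"
    for (i = 2; i <= NF; i++) {
      if (index($i, prefix) == 1) {
        id = $i
        sub(/\(.*/, "", id)        # drop the trailing "(taxon)"
        print name, id
        break                      # only the first match per line
      }
    }
  }' "$@"
}

printf '%s\n' 'C1(3 genes,2 taxa): Xa_01(Xa) Sa08BA1_02(Sa08BA1) Sa08BA1_03(Sa08BA1)' |
  first_hit Sa08BA
```

On the real data the call would be first_hit Saureus08BA file. Lines with no matching member print nothing.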
I would like to search a file, using awk, to output rows that have a value commencing at a specific column number. e.g.
I am looking for 979719 starting at column number 10:
moobaaraa**979719**
moobaaraa123456
moo**979719**123456
moobaaraa**979719**
moobaaraa123456
As you can see, there are no delimiters. It is a raw data text file. I would like to output rows 1 and 4, but not row 3, which contains the pattern but not at the desired column number.
awk '/979719$/' file
moobaaraa979719
moobaaraa979719
A simple sed approach.
$ cat file
moobaaraa979719
moobaaraa123456
moo979719123456
moobaaraa979719
moobaaraa123456
Just search for a pattern that ends with 979719 and print the line:
$ sed -n '/^.*979719$/p' file
moobaaraa979719
moobaaraa979719
This code works:
awk 'length($1) == 9' FS="979719" raw-text-file
This code sets 979719 as the field separator and checks whether the first field is 9 characters long. If it is, awk prints the line (the default action).
awk 'substr($0,10,6) == 979719' file
You can drop the ,6 if you want to search from the 10th char to the end of each line.
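The substr() approach generalizes to any pattern and column. A minimal sketch, where the function name at_column and its argument order (pattern, 1-based column, optional files) are my own choices: unlike the end-of-line anchored versions above, it also matches when more text follows the pattern.

```shell
# at_column (hypothetical name): print lines where the literal text "pat"
# starts exactly at 1-based character position "col".
at_column() {
  pat=$1; col=$2; shift 2
  # (pat "") forces a string comparison even when pat looks like a number
  awk -v pat="$pat" -v col="$col" \
      'substr($0, col, length(pat)) == (pat "")' "$@"
}

printf '%s\n' moobaaraa979719 moobaaraa123456 moo979719123456 |
  at_column 979719 10
```

Here only the first sample line is printed; the third contains 979719 but starting at column 4.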
I have a 2-column TSV and I need to insert a new first column built from part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.
$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n).
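Note that with -F'[/.]' the second field comes from the first path on the line. If the ID really has to be taken from column 2, or the directory part might itself contain a period, a sketch like the following may be safer. It assumes tab-separated input, and the function name add_sample_column is hypothetical.

```shell
# add_sample_column (hypothetical name): prepend the sample ID taken from
# column 2 (text between the last "/" and the first "." after it).
add_sample_column() {
  awk -F'\t' -v OFS='\t' '{
    id = $2
    sub(/^.*\//, "", id)   # drop the directory part, e.g. "fastq/"
    sub(/\..*$/, "", id)   # drop everything from the first period on
    print id, $0
  }' "$@"
}

printf 'fastq/D0110.L001_R1_001.fastq\tfastq/D0110.L001_R2_001.fastq\n' |
  add_sample_column
```

Because $0 is never modified, the original two columns are printed exactly as they came in, with the new ID and a tab in front.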