How to print first column of row along with specific pattern? - awk
I am trying to extract a pattern along with printing the starting string of the line.
Input
Saureus1000(37 genes,10 taxa): Saureus08BA02176_00020(Saureus08BA02176) Saureus1269_00069(Saureus1269) Saureus170_00062(Saureus170) Saureus71193_00020(Saureus71193) SaureusED133_00019(SaureusED133) SaureusED98_00019(SaureusED98) SaureusLGA251_00019(SaureusLGA251) SaureusN305_00605(SaureusN305) SaureusRF122_00019(SaureusRF122) SaureusST398_00020(SaureusST398) Saureus08BA02176_01763(Saureus08BA02176) Saureus08BA02176_01805(Saureus08BA02176) Saureus08BA02176_01808(Saureus08BA02176) Saureus1269_01194(Saureus1269) Saureus1269_01237(Saureus1269) Saureus1269_01240(Saureus1269) Saureus71193_01635(Saureus71193) Saureus71193_01678(Saureus71193) Saureus71193_01681(Saureus71193) SaureusED133_01798(SaureusED133) SaureusED133_01840(SaureusED133) SaureusED133_01843(SaureusED133) SaureusED98_01777(SaureusED98) SaureusED98_01821(SaureusED98) SaureusED98_01824(SaureusED98) SaureusLGA251_01748(SaureusLGA251) SaureusLGA251_01790(SaureusLGA251) SaureusLGA251_01793(SaureusLGA251) SaureusN305_00013(SaureusN305) SaureusN305_00016(SaureusN305) SaureusN305_00059(SaureusN305) SaureusRF122_01807(SaureusRF122) SaureusRF122_01848(SaureusRF122) SaureusRF122_01851(SaureusRF122) SaureusST398_01884(SaureusST398) SaureusST398_01927(SaureusST398) SaureusST398_01930(SaureusST398)
Saureus1001(35 genes,12 taxa): Saureus08BA02176_01441(Saureus08BA02176) Saureus1269_02301(Saureus1269) Saureus1269_02527(Saureus1269) Saureus71193_01310(Saureus71193) SaureusED98_01421(SaureusED98) SaureusED98_01424(SaureusED98) SaureusN305_02184(SaureusN305) SaureusN305_02188(SaureusN305) SaureusN305_02190(SaureusN305) SaureusRF122_01383(SaureusRF122) SaureusRF122_01386(SaureusRF122) SaureusST398_01476(SaureusST398) Saureus08BA02176_01442(Saureus08BA02176) Saureus08BA02176_01443(Saureus08BA02176) Saureus08BA02176_01445(Saureus08BA02176) Saureus1269_02302(Saureus1269) Saureus1269_02529(Saureus1269) Saureus1364_00430(Saureus1364) Saureus170_00571(Saureus170) Saureus170_00574(Saureus170) Saureus302_00352(Saureus302) Saureus302_00556(Saureus302) Saureus71193_01311(Saureus71193) Saureus71193_01312(Saureus71193) Saureus71193_01314(Saureus71193) SaureusED98_01423(SaureusED98) SaureusED98_01426(SaureusED98) SaureusLGA251_01423(SaureusLGA251) SaureusN305_02185(SaureusN305) SaureusN305_02187(SaureusN305) SaureusST398_01477(SaureusST398) SaureusST398_01478(SaureusST398) SaureusST398_01548(SaureusST398) SaureusED133_01465(SaureusED133) Saureus302_01433(Saureus302)
Required output
Saureus1000 Saureus08BA02176_00020
I am using the code below, but it does not print the required output on a single line:
awk '{print $1} {for(i=1;i<=NF;i++){if($i~/^Saureus08BA/){print $i}}}' file > test
Output for this command
Saureus1000(37
Saureus08BA02176_00020(Saureus08BA02176)
Saureus08BA02176_01763(Saureus08BA02176)
Saureus08BA02176_01805(Saureus08BA02176)
Saureus08BA02176_01808(Saureus08BA02176)
Saureus1001(35
Saureus08BA02176_01441(Saureus08BA02176)
Saureus08BA02176_01442(Saureus08BA02176)
Saureus08BA02176_01443(Saureus08BA02176)
Saureus08BA02176_01445(Saureus08BA02176)
GNU awk solution:
awk 'match($0,/^([^(]+)\([^(]+(Saureus08BA[0-9]+_[0-9]+)/,a){ print a[1],a[2] }' file
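([^(]+) - the first captured group: the part of the 1st field before the opening parenthesis
(Saureus08BA[0-9]+_[0-9]+) - the second captured group: the Saureus08BA item that follows the header
Passing an array as the third argument of match() is a GNU awk extension. Because [^(]+ cannot cross a parenthesis, this pattern matches only when the Saureus08BA item is the first gene listed on the line, as it is in both sample lines.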
The output:
Saureus1000 Saureus08BA02176_00020
Saureus1001 Saureus08BA02176_01441
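For awks without the three-argument match(), the same result can be sketched with a plain field loop. This is only an illustration, not the answerer's method: the file names clusters_demo.txt and clusters_out.txt are made up, and the sample lines are abbreviated and reordered so that the Saureus08BA item is not always the first gene on the line.

```shell
# Abbreviated, reordered sample input (hypothetical file name).
cat > clusters_demo.txt <<'EOF'
Saureus1000(37 genes,10 taxa): Saureus1269_00069(Saureus1269) Saureus08BA02176_00020(Saureus08BA02176) Saureus08BA02176_01763(Saureus08BA02176)
Saureus1001(35 genes,12 taxa): Saureus08BA02176_01441(Saureus08BA02176) Saureus1269_02301(Saureus1269)
EOF

awk '{
    key = $1
    sub(/\(.*/, "", key)              # "Saureus1000(37" -> "Saureus1000"
    for (i = 2; i <= NF; i++)
        if ($i ~ /^Saureus08BA/) {
            item = $i
            sub(/\(.*/, "", item)     # strip the "(taxon)" suffix
            print key, item
            break                     # keep only the first match per line
        }
}' clusters_demo.txt > clusters_out.txt

cat clusters_out.txt
```

Dropping the break would print every Saureus08BA item on the line, each paired with the cluster name.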
Related
Countif like function in AWK with field headers
I am looking for a way to count the number of times a value in a field appears in a range of fields in a CSV file, much like COUNTIF in Excel, although I would like to use an awk command if possible. Column 6 holds the range of values, and column 7 should hold the number of times that row's value appears in column 6, as shown below.
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
# adds the field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"} NR>1{ print $0}' file3
How do I get the output I want? I have tried this, adapted from a previous, similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
(There is a very similar question to this one, and a similar Python-related question, for my reference.)
I would use GNU AWK for this task in the following way. Let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done for the 1st line. During the first pass, i.e. where the global line number (NR) is equal to the line number within the current file (FNR), I count the occurrences of each value in the 6th field and store the counts in the array arr, then instruct GNU AWK to go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from the array arr.
(tested in GNU Awk 5.0.1)
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6], since it's the sixth field that holds the key you're counting. The loop should also start at the second line, because the header is never stored in l:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}' file3
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}' file3
One awk idea:
awk '
BEGIN { FS=OFS="," }        # define input/output field delimiters as comma
      { lines[NR]=$0
        if (NR==1) next
        col6[NR]=$6         # copy field 6 so we do not have to parse the contents of lines[] in the END block
        cnt[$6]++
      }
END   { for (i=1;i<=NR;i++)
            print lines[i], (i==1 ? "count" : cnt[col6[i]] )
      }
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Problems with awk substr
I am trying to split a column of a file using awk's substr function. The input is as follows (it consists of 4 lines, one of them blank):
#NS500645:122:HYGVMBGX2:4:21402:2606:16446:ACCTAGAAGG:R1
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC

/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I want to split the second line at the pattern "GATC", keeping the pattern at the start of the right-hand substring, like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
I also want the last line to be cut into pieces of the same lengths as the split sequence, and to regenerate the file like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTAT
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
GATCC
EEEEE
To split the last column I am using this awk script:
cat prove | paste - - - - | awk 'BEGIN {FS="\t"; OFS="\t"}\
{gsub("GATC","\tGATC", $2); {split ($2, a, "\t")};\
for (i in a) print substr($4, length(a[i-1])+1, length(a[i-1])+length(a[i]))}'
But the output is as follows:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
The second and third lines are longer than expected. I checked the calculated values that are passed to substr, and they are correct:
1 30
31 70
41 45
Using these values the output should be:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEE
But as I showed, that is not the case. Any suggestions?
I guess you're looking for something like this, but your question's formatting is really confusing:
$ awk -v OFS='\t' 'NR==1 {next} NR==2 {n=index($0,"GATC")} /^[^+]/ {print substr($0,1,n-1),substr($0,n)}' file
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I assumed your file is in this format:
dummy header line to be ignored
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
+
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
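One likely reason the question's pieces come out too long is that the third argument of substr is a length, not an end position, so substr($4, 31, 70) asks for 70 characters starting at position 31. The sketch below splits at every GATC and cuts the quality line into pieces of matching length. It assumes a 4-line record with the sequence on line 2 and the quality string on line 4; the short read and the file names reads_demo.txt and reads_out.txt are made up for illustration.

```shell
# Made-up 4-line record: header, sequence, blank line, quality.
cat > reads_demo.txt <<'EOF'
#read1
AAAAGATCCCGATCT

123456789abcdef
EOF

awk '
NR % 4 == 2 { seq = $0 }                 # remember the sequence line
NR % 4 == 0 {                            # quality line: emit paired chunks
    qual = $0
    start = 1
    # Search for the next GATC after the first base of the current chunk,
    # so a chunk that itself begins with GATC is not split into an empty piece.
    while ((pos = index(substr(seq, start + 1), "GATC")) > 0) {
        print substr(seq, start, pos)    # third argument is a length
        print substr(qual, start, pos)
        start += pos
    }
    print substr(seq, start)             # last chunk runs to the end
    print substr(qual, start)
}' reads_demo.txt > reads_out.txt

cat reads_out.txt
```

Tracking a running start position avoids recomputing offsets from the lengths of earlier pieces, which is where the original formula went wrong for the third piece.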
awk to store field length in variable then use in print
In the awk below I am trying to store the length of $5 in a variable il if the condition is met (in the two lines it is) and then add that variable to $3 in the print statement. The two sub statements are to remove the matching part from both $5 and $6. The script as is executes and produces the current output. However, il does not seem to be populated and added in the print. It seems close, but I'm not sure why the variable isn't being stored. Thank you :)
awk
awk 'BEGIN{FS=OFS="\t"}  # define fs and output
FNR==NR{  # process each field in each line of file
  if(length($5) < length($6)) {  # condition
    il=$(length($5))
    echo $il
    sub($5,"",$6) && sub($6,"",$5)  # removing matching
    print $1,$2,$3+$il,$3+$il,"-",$6  # print desired output
    next
  }
}' in
in tab-delimited
id1 1 116268178 GAAA GAAAA
id2 2 228197304 A AATCC
current output tab-delimited
id1 1 116268178 116268178 - A
id2 2 228197304 228197304 - ATCC
desired output tab-delimited
since `$5` is 4 in line 1 that is added to `$3`
since `$5` is 1 in line 2 that is added to `$3`
id1 1 116268181 116268181 - A
id2 2 228197305 228197305 - ATCC
The following awk may help you here:
awk '{$3+=length($4);$3=$3 OFS $3;sub($4,"",$5);$4="-"} 1' Input_file
Please add BEGIN{FS=OFS="\t"} if your Input_file is TAB-delimited and you need TAB-delimited output too.
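As for why il stayed empty in the question: in awk, $(length($5)) means "the field whose number is length($5)", not the length itself, $il is again a field reference, and echo is a shell command rather than an awk statement. A minimal sketch with a plain variable follows. It assumes the five-column layout of the sample rows (so the two sequences are $4 and $5), and the file names indel_demo.tsv and indel_out.tsv are made up. Adding the full length gives 116268182 for the first row, which is one more than the value listed in the question's desired output.

```shell
# Sample rows from the question, written tab-delimited (hypothetical file name).
printf 'id1\t1\t116268178\tGAAA\tGAAAA\nid2\t2\t228197304\tA\tAATCC\n' > indel_demo.tsv

awk 'BEGIN { FS = OFS = "\t" }
length($4) < length($5) {
    il = length($4)          # plain variable: refer to it as il, not $il
    sub($4, "", $5)          # remove the first occurrence of $4 from $5
    print $1, $2, $3 + il, $3 + il, "-", $5
}' indel_demo.tsv > indel_out.tsv

cat indel_out.tsv
```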
gawk to create first column based on part of second column
I have a 2-column TSV into which I need to insert a new first column, using part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.
$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n).
Add a string to the end of column 1 using awk
I have a file whose head looks like this:
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591 nReads=4 cov=3.26022
>PZ7180000024036 nReads=14 cov=5.86079
>PZ15501_A nReads=1 cov=1
I want to add the string _nogroup onto the first column of each line that does not already have _XX designated (i.e. the 1st column on the 1st line is fine, but the 1st column on the 2nd line should read >PZ7180000031590_nogroup). Can I do this using awk? I would like to use the command line.
You can use this awk command:
awk '!($1 ~ /_[a-zA-Z]{2}$/) {$1=$1 "_nogroup"} 1' file
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590_nogroup nReads=3 cov=2.59465
>PZ7180000027934_nogroup nReads=5 cov=2.32231
>PZ456916_nogroup nReads=1 cov=1
>PZ7180000037718_nogroup nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591_nogroup nReads=4 cov=3.26022
>PZ7180000024036_nogroup nReads=14 cov=5.86079
>PZ15501_A_nogroup nReads=1 cov=1
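The {2} interval expression may not be supported by some older awk builds. A variant that spells out the two letters should behave the same way on this input. This is only a sketch on three of the sample lines, and the file names contigs_demo.txt and contigs_out.txt are made up.

```shell
# Three lines taken from the sample (hypothetical file name).
cat > contigs_demo.txt <<'EOF'
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ15501_A nReads=1 cov=1
EOF

# Append _nogroup unless field 1 already ends in "_" plus exactly two letters.
awk '$1 !~ /_[A-Za-z][A-Za-z]$/ { $1 = $1 "_nogroup" } 1' contigs_demo.txt > contigs_out.txt

cat contigs_out.txt
```

As in the original answer, a single-letter suffix such as _A does not count as a group, so that line still receives _nogroup.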