gawk to create first column based on part of second column - gawk

I have a 2-column TSV into which I need to insert a new first column, using part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.

$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n)
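The field splitting can be checked quickly by feeding a single sample line from the question to the command (a sketch using printf):

```shell
# With -F'[/.]' the fields of the sample line are:
# $1="fastq", $2="D0110", $3="L001_R1_001", ...
out=$(printf 'fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq\n' |
      awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}')
printf '%s\n' "$out"
# D0110<TAB>fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
```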

Related

awk/sed replace multiple newlines in the record except end of record

I have file where:
field delimiter is \x01
the record delimiter is \n
Some records contain embedded newlines that I need to remove; however, I don't want to remove the legitimate newline at the end of each record. I have tried this with awk:
awk -F '\x01' 'NF < 87 {getline s; $0 = $0 s} 1' infile > outfile
But this is only working when the line contains one newline in the record (except end of line newline). This does not work for multiple newlines.
Note: the record contains 87 fields.
What am I doing wrong here?
Example of file:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000
test^A12102120^A00000^A00^A^A
NOTE: The example file contains 11 fields; field separator \x01; record separator \n
Expected result:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note: I need to preserve the field delimiter (\x01) and record delimiter (\n)
Thank you very much in advance for looking into this.
The file always contains 87 fields;
The field delimiter is '\x01', but when viewed in a terminal it is represented as '^A'
Some lines contain newlines - I need to remove them, but I don't want to remove the legitimate newlines at the end of each line.
The unwanted newline appears twice in the first and second records and once in the third record - these are the newlines I want to remove.
In the examples/expected results there are 11 delimiters "\x01", represented as "^A",
I expect to have 3 records and not 6, i.e.:
First record:
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to the first line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Second record
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to that previous line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Third record:
test^A12102120^A00000^A00^A^A should be joined to the previous line
forming one record:
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note:
The awk example provided works when there is one unwanted newline in the record, but not when there are multiple newlines
Thank you so very much. It works perfectly. Thank you for explaining it so well to me too.
This might work for you (GNU sed):
sed ':a;N;s/\x01/&/87;Ta;s/\n//g' file
Gather up lines until there are 87 separators, remove any newlines and print the result.
What's wrong with your attempt is that you concatenate two lines, print the result and move to the next line. NF is then reset to the next fields count. As all your lines have less than 87 fields the NF < 87 condition is useless, your script would work the same without it.
Try this awk script:
$ awk -F'\x01' -vn=87 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file
Here, we use the real \x01 field separator and the NF fields count. Variable i counts the number of already printed fields. We first print the current line without the trailing newline (printf("%s", $0)). Then we update our i fields counter. If it is equal to n we reset it and print a newline. Else we decrement it such that we do not count the last field of this line and the first of the next as 2 separate fields.
Demo with n=12 instead of 87 and your own input file (with \x01 field separators):
$ awk -F'\x01' -vn=12 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file |
sed 's/\x01/|/g'
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
SL|Nov-21|30-11-2021|B|0000|1234567|00000test|12102120|00000|00||
The sed command shows the result with the \x01 replaced by | for easier viewing.

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. The first column contains strings (IDs) and the second contains values. The strings contain a variable number of dots. I would like to split these strings on the last dot. I found in the forum how to remove the part after the last dot, but I don't want to remove it: I would like to move it into a new column, using a bash command (e.g. awk).
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #1 to match 0 or more whitespace characters followed by 1+ non-whitespace characters.
\.: Match a dot. Since the preceding capture group is greedy, this matches the last dot
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
With your shown samples, here is one more variant: a rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: rev reverses each line (printing it from last character to first); its output is piped into awk, which substitutes the first dot it sees (which is the last dot of the original line, per OP's samples) with a space and prints every line; piping through rev again restores the original character order.
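A sketch of the pipeline on two of the sample lines (rev is from util-linux and is usually available):

```shell
# rev reverses each line; awk replaces the first dot it sees
# (the last dot of the original line) with a space; rev restores the order
out=$(printf '5_8S_A.3-C_1.A 50\nH.YU-201.D 80\n' |
      rev | awk '{sub(/\./,OFS)} 1' | rev)
printf '%s\n' "$out"
# 5_8S_A.3-C_1 A 50
# H.YU-201 D 80
```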
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

Awk pattern matching on rows that have a value at specific column. No delimiter

I would like to search a file, using awk, and output rows that have a value commencing at a specific column number, e.g.
I am looking for 979719 starting at column number 10:
moobaaraa979719
moobaaraa123456
moo979719123456
moobaaraa979719
moobaaraa123456
As you can see, there are no delimiters. It is a raw data text file. I would like to output rows 1 and 4, but not row 3, which does contain the pattern, just not at the desired column number.
awk '/979719$/' file
moobaaraa979719
moobaaraa979719
A simple sed approach.
$ cat file
moobaaraa979719
moobaaraa123456
moo979719123456
moobaaraa979719
moobaaraa123456
Just search for a pattern that ends with 979719 and print the line:
$ sed -n '/^.*979719$/p' file
moobaaraa979719
moobaaraa979719
This code works:
awk 'length($1) == 9' FS="979719" raw-text-file
This code sets 979719 as the field separator, and checks whether the first field has a length of 9 characters. Then prints the line (as default action).
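A quick check of the field-separator trick on the sample lines (a sketch; moobaaraa is 9 characters long, so a match starting at column 10 leaves a 9-character first field):

```shell
# With FS="979719", $1 is everything before the first occurrence of the pattern;
# length($1)==9 means the pattern starts at column 10.
# A line without the pattern has the whole line as $1 (length 15 here), so it is excluded.
out=$(printf 'moobaaraa979719\nmoobaaraa123456\nmoo979719123456\n' |
      awk 'length($1) == 9' FS="979719")
printf '%s\n' "$out"
# moobaaraa979719
```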
awk 'substr($0,10,6) == 979719' file
You can drop the ,6 if you want to search from the 10th char to the end of each line.
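The substr() test can be verified the same way (a sketch on the question's sample lines, quoting the pattern to force a string comparison):

```shell
# substr($0,10,6) extracts the 6 characters starting at column 10;
# only lines where that slice equals 979719 are printed
out=$(printf 'moobaaraa979719\nmoobaaraa123456\nmoo979719123456\nmoobaaraa979719\n' |
      awk 'substr($0,10,6) == "979719"')
printf '%s\n' "$out"
# moobaaraa979719
# moobaaraa979719
```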

add header to strings by using the string itself

I want to add a header to my input strings. The header should be >, directly followed by the string, then the number that follows the string, separated by a _.
To add a header I used awk '{print ">"$0;print}', but I don't know how to append the number.
input:
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT 2
AGAACGAAAGTCGGAGGTTCGAAGACGATC 14
TACCCTGTAGAACCGAANTTGT 1
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 2
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 4
TAATACGGCCGGGTAATGATGGA 0
CCAGATGATGAACTTATTGACGGGCGGACAGAAACTGTGTGCTGATTGTCA 7240
CGCCCGATCTCGTCTGATCTCG 34
GCAGGGGTGGTTCAGTGGTAGAATTCTCGCC 3
output:
>CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT_2
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT
>AGAACGAAAGTCGGAGGTTCGAAGACGATC_14
AGAACGAAAGTCGGAGGTTCGAAGACGATC
....
$ awk '{printf ">%s_%s\n%s\n",$1,$2,$1;}' file
>CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT_2
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT
>AGAACGAAAGTCGGAGGTTCGAAGACGATC_14
AGAACGAAAGTCGGAGGTTCGAAGACGATC
>TACCCTGTAGAACCGAANTTGT_1
TACCCTGTAGAACCGAANTTGT
>TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG_2
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG
>GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG_4
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG
>TAATACGGCCGGGTAATGATGGA_0
TAATACGGCCGGGTAATGATGGA
>CCAGATGATGAACTTATTGACGGGCGGACAGAAACTGTGTGCTGATTGTCA_7240
CCAGATGATGAACTTATTGACGGGCGGACAGAAACTGTGTGCTGATTGTCA
>CGCCCGATCTCGTCTGATCTCG_34
CGCCCGATCTCGTCTGATCTCG
>GCAGGGGTGGTTCAGTGGTAGAATTCTCGCC_3
GCAGGGGTGGTTCAGTGGTAGAATTCTCGCC
How it works
The awk script consists of a single command:
printf ">%s_%s\n%s\n",$1,$2,$1
By default, awk splits up input lines into fields based on white space. So, for the first line for example, field 1 is CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT and field 2 is 2. The printf allows us to rearrange the input into the desired format. For each input line, two lines are written. The first one, with format >%s_%s\n, writes > followed by field 1, followed by _, followed by field 2, followed by a newline character. The format for the second output line is %s\n, which outputs field 1 followed by a newline character.
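A minimal run on one of the input lines (a sketch):

```shell
# For each line, print the >string_number header, then the string itself
out=$(printf 'TACCCTGTAGAACCGAANTTGT 1\n' |
      awk '{printf ">%s_%s\n%s\n",$1,$2,$1}')
printf '%s\n' "$out"
# >TACCCTGTAGAACCGAANTTGT_1
# TACCCTGTAGAACCGAANTTGT
```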

Add a string to the end of column 1 using awk

I have a file whose head looks like this:
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591 nReads=4 cov=3.26022
>PZ7180000024036 nReads=14 cov=5.86079
>PZ15501_A nReads=1 cov=1
I want to add the string _nogroup onto the first column of each line that does not have _XX already designated (i.e. the 1st column on the 1st line is fine but the 1st column on the 2nd line should read >PZ7180000031590_nogroup).
Can I do this using awk? I would like to use the command line.
You can use this awk command:
awk '!($1 ~ /_[a-zA-Z]{2}$/) {$1=$1 "_nogroup"} 1' file
>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590_nogroup nReads=3 cov=2.59465
>PZ7180000027934_nogroup nReads=5 cov=2.32231
>PZ456916_nogroup nReads=1 cov=1
>PZ7180000037718_nogroup nReads=9 cov=6.26448
>PZ7180000000004_TY nReads=86 cov=36.4238
>PZ7180000000067_AF nReads=16 cov=12.0608
>PZ7180000031591_nogroup nReads=4 cov=3.26022
>PZ7180000024036_nogroup nReads=14 cov=5.86079
>PZ15501_A_nogroup nReads=1 cov=1
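How it works: $1 ~ /_[a-zA-Z]{2}$/ tests whether the first field already ends in an underscore plus exactly two letters; when it does not, _nogroup is appended to $1 and the rebuilt line is printed by the trailing 1. A sketch on two of the sample lines; the {2} interval expression needs a POSIX-compliant awk such as gawk, so the sketch spells the regex without it for portability:

```shell
# Line 1 ends in _TX (underscore + two letters) and is left alone;
# line 2 has no such suffix, so _nogroup is appended to its first field
out=$(printf '>PZ7180000000004_TX nReads=26 cov=9.436\n>PZ7180000031590 nReads=3 cov=2.59465\n' |
      awk '!($1 ~ /_[a-zA-Z][a-zA-Z]$/) {$1=$1 "_nogroup"} 1')
printf '%s\n' "$out"
# >PZ7180000000004_TX nReads=26 cov=9.436
# >PZ7180000031590_nogroup nReads=3 cov=2.59465
```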