add header to strings by using the string itself - awk

I want to add a header to my input strings. The header should be > directly followed by the string and the number after the string, separated by a _.
To add a header I used awk '{print ">"$0;print}'. However, I don't know how to add the number behind it.
input:
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT 2
AGAACGAAAGTCGGAGGTTCGAAGACGATC 14
TACCCTGTAGAACCGAANTTGT 1
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 2
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 4
TAATACGGCCGGGTAATGATGGA 0
CCAGATGATGAACTTATTGACGGGCGGACAGAAACTGTGTGCTGATTGTCA 7240
CGCCCGATCTCGTCTGATCTCG 34
GCAGGGGTGGTTCAGTGGTAGAATTCTCGCC 3
output:
>CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT_2
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT
>AGAACGAAAGTCGGAGGTTCGAAGACGATC_14
AGAACGAAAGTCGGAGGTTCGAAGACGATC
....

$ awk '{printf ">%s_%s\n%s\n",$1,$2,$1;}' file
>CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT_2
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT
>AGAACGAAAGTCGGAGGTTCGAAGACGATC_14
AGAACGAAAGTCGGAGGTTCGAAGACGATC
>TACCCTGTAGAACCGAANTTGT_1
TACCCTGTAGAACCGAANTTGT
>TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG_2
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG
>GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG_4
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG
>TAATACGGCCGGGTAATGATGGA_0
TAATACGGCCGGGTAATGATGGA
>CCAGATGATGAACTTATTGACGGGCGGACAGAAACTGTGTGCTGATTGTCA_7240
CCAGATGATGAACTTATTGACGGGCGGACAGAAACTGTGTGCTGATTGTCA
>CGCCCGATCTCGTCTGATCTCG_34
CGCCCGATCTCGTCTGATCTCG
>GCAGGGGTGGTTCAGTGGTAGAATTCTCGCC_3
GCAGGGGTGGTTCAGTGGTAGAATTCTCGCC
How it works
The awk script consists of a single command:
printf ">%s_%s\n%s\n",$1,$2,$1
By default, awk splits input lines into fields on whitespace. For the first line, for example, field 1 is CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT and field 2 is 2. printf lets us rearrange the input into the desired format. For each input line, two lines are written. The first, with format >%s_%s\n, writes > followed by field 1, then _, then field 2, then a newline character. The format for the second output line is %s\n, which outputs field 1 followed by a newline character.
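As a small sketch of the same printf idea, the command can be wrapped in a shell function and tried on toy input. The function name add_headers and the NF == 2 guard (which skips blank or malformed lines) are my own assumptions, not part of the answer above.

```shell
# Hypothetical helper: reads "sequence count" lines on stdin and writes
# a ">sequence_count" header line followed by the sequence itself.
add_headers() {
  # NF == 2 is an assumed safety guard: lines without exactly two fields are skipped
  awk 'NF == 2 { printf ">%s_%s\n%s\n", $1, $2, $1 }'
}

printf 'ACGT 2\nTTGA 14\n' | add_headers
```

For real data you would feed the file instead, for example add_headers < file.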

Related

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. The first column contains several strings (IDs) and the second contains values. The strings contain a variable number of dots. I would like to split these strings at the last dot. I found in the forum how to remove the last part after the last dot, but I don't want to remove it. I would like to create a new column with the last part of each string, using a bash command (e.g. awk).
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #1 to match 0 or more whitespaces followed by 1+ non-whitespace characters.
\.: Match a dot. Because the preceding group is greedy, this is the last dot of the first field
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
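A shorter awk variant, offered only as a sketch: instead of looping over the pieces from split(), match() can locate the last dot directly, since the pattern \.[^.]*$ can only match at a dot that has no further dots after it. The function name split_last_dot and the handling of IDs without a dot (an empty middle column) are assumptions of mine.

```shell
# Hypothetical helper: splits field 1 at its last dot and prints
# prefix, suffix and field 2 separated by tabs.
split_last_dot() {
  awk '{
    if (match($1, /\.[^.]*$/))   # RSTART is the position of the last dot
      print substr($1, 1, RSTART - 1) "\t" substr($1, RSTART + 1) "\t" $2
    else                         # assumed fallback: no dot, leave the new column empty
      print $1 "\t\t" $2
  }'
}

printf '5_8S_A.3-C_1.A 50\nH.YU-201.D 80\n' | split_last_dot
```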
With your shown samples, here is one more variant, a rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: rev prints each line in reverse order (from the last character to the first). Its output is piped to awk, which replaces the first dot (which is the last dot of the original line, as per OP's shown samples only) with OFS, a single space by default, and prints every line. That output is piped to rev again to restore the original character order (undoing the effect of the first rev).
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

Bash command/script to split line on a certain character

I would like to split the below data to the expected output:
Raw Data:
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
Expected Output:
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0
931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
Basically the \n character is getting lost sometimes in the data and the lines are getting merged. Sometimes more than 1 line gets merged as well (even the opposite happens but we can get to that later).
The data always has 43 columns | separated. The last but one column(42nd) always is a timestamp and the last column is usually 0 or 1.
Trying for the below approach:
If cols > 43
Split 44th column to add \n and print the remaining.
Repeat process until cols=43
echo "${curr}" | awk -F\| ' { if(NF > 43) {for(i=43;i<NF;i++) "sed '${NR}s/\(^0\)/\1\n/p' $i" }}' filename
A less complex approach:
awk 'BEGIN {FS=OFS="|"}
NF>43 {for(i=43;i<NF;i+=42) {t=$i; $i=substr(t,1,1) ORS substr(t,2)}}1' file
931096|376601|1|ART|AT-2151780724|2151780724|2|102809198|I|CGM44I|MIL3VF03|52576377.3600|PENDING|MO|PEND-INFO|Pend ACS4R|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|52576377.3600|1317720|system|2020-02-13 02:00:42|0
931097|375789|1|AYT|AT-2151509210|2151509210|7|102614605|A|CTHGMI|OZF19|444006.6400|APPROVED|NULL|APPROVED|Approved|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|kg17718|NULL|NULL|0.0000|1317722|system|2020-02-13 02:00:43|0
931098|375979|1|AHT|AT-2151780726|2151780726|2|102809199|I|CGMI|MILaesLF11|26312.0000|PENDING|MO|PEND-INFO|Pend ACRES|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|26312.0000|1317721|system|2020-02-13 02:00:43|0
931099|376572|1|AT|AT-2151399812|2151399812|5|102673999|I|CG2rMI|WEL44LF15|60991.6956|PENDING|MO|PEND-INFO|Pend ACERS|N|N|N|N|N|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|NULL|N|NULL|NULL|N|system|NULL|NULL|0.0000|1317723|system|2020-02-13 02:00:45|0
This follows your spec, with two corrections:
If cols > 43, split the 43rd column (not the 44th) to add
\n and print the remaining. Repeat the process until the end (not until cols=43).
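The same field-counting idea can also be spelled out as an explicit loop, which may be easier to adapt. This is only a sketch: the function name split_merged and the cols parameter are hypothetical, and it assumes (as stated in the question) that the last column of every record is a single character.

```shell
# Hypothetical helper: re-splits merged records on stdin.
# $1 = number of columns per record (defaults to 43, as in the question).
split_merged() {
  awk -v cols="${1:-43}" 'BEGIN { FS = "|" }
  {
    out = ""; n = 0
    for (i = 1; i <= NF; i++) {
      n++
      if (n == cols && i < NF) {
        # last column of a merged record: keep its first character,
        # start a new record with the rest
        out = out substr($i, 1, 1) "\n" substr($i, 2)
        n = 1
      } else {
        out = out $i
      }
      if (i < NF) out = out "|"
    }
    print out
  }'
}

# Toy input with 3 columns per record instead of 43
printf 'a|b|0c|d|1e|f|0\n' | split_merged 3
```

With the real data you would run split_merged < file and rely on the default of 43 columns.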
The usual way with sed: write a regex that matches 42 | characters with anything in between, followed by a digit. Then insert a newline after the matched string.
sed 's/[0-9]\{6\}\(|[^|]*\)\{41\}|[0-9]/&\n/g ; s/\n$//'
#                                               ^^^^^^^ - remove the leftover newline
#                                       ^ - the matched string
#                                 ^^^^^ - trailing digit
#                                ^ - 42nd pipe character
#                ^^^^^^^^^^^^^^^^ - 41 fields with anything in between
#      ^^^^^^^^^^ - leading 6 digits
tested on repl
Or maybe match 42 pipes with anything in front and a digit:
sed 's/\([^|]*|\)\{42\}[0-9]/&\n/g ; s/\n$//'
Or match a character after 42 pipes and a digit and insert a newline in between:
sed 's/\(\([^|]*|\)\{42\}[0-9]\)\(.\)/\1\n\3/g'
Could you please try the following? It was written and tested with the shown samples, and is intended to insert newlines even when a single line contains more than one merged record.
awk '
match($0,/[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|0/){
val=substr($0,RSTART+RLENGTH)
if(val){
num=gsub(/[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|0/,"&")
while(++count<num){
sub(/[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|0/,"&\n")
}
}
val=count=num=""
}
1
' Input_file
You may not trust the source of the data: perhaps it adds another | somewhere and the number of columns is wrong.
Another approach is to assume that you can trust the timestamp field.
So try to split the line when the field after the timestamp has more than one character (splitting after the first character).
sed -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\|.)(.)/\1\n\2/g' file
This might work for you (GNU sed):
sed 's/[^|]*/\n&/44;s/\(|.\)\([^|]*|\)\n/\1\n\2/;P;D' file
If there is a 44th field, insert a newline before it. Then remove that newline and insert it following the first character of the 43rd field. Print the first line, delete the first line and repeat.

gawk to create first column based on part of second column

I have a 2-column TSV into which I need to insert a new first column, using part of the value in column 2.
What I have:
fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
What I want:
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
I want to pull everything between "fastq/" and the first period and print that as the new first column.
$ awk -F'[/.]' '{printf "%s\t%s\n",$2,$0}' file
D0110 fastq/D0110.L001_R1_001.fastq fastq/D0110.L001_R2_001.fastq
D0206 fastq/D0206.L001_R1_001.fastq fastq/D0206.L001_R2_001.fastq
D0208 fastq/D0208.L001_R1_001.fastq fastq/D0208.L001_R2_001.fastq
How it works
awk implicitly loops over all input lines.
-F'[/.]'
This tells awk to use any occurrence of / or . as a field separator. This means that, for your input, the string you are looking for will be the second field.
printf "%s\t%s\n",$2,$0
This tells awk to print the second field ($2), followed by a tab (\t), followed by the input line ($0), followed by a newline character (\n)
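If the paths might contain extra directory levels, the second field under -F'[/.]' would no longer be the sample name. A possible alternative, sketched here under that assumption, strips the directory part and everything from the first period with two sub() calls. The function name add_sample_column is hypothetical.

```shell
# Hypothetical helper: prepends the sample ID (basename of column 1 up to
# its first period) as a new tab-separated first column.
add_sample_column() {
  awk '{
    id = $1
    sub(/^.*\//, "", id)   # drop everything up to the last slash
    sub(/\..*$/, "", id)   # drop everything from the first period on
    print id "\t" $0
  }'
}

printf 'fastq/D0110.L001_R1_001.fastq\tfastq/D0110.L001_R2_001.fastq\n' | add_sample_column
```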

Print lines containing the same second field for more than 3 times in a text file

Here is what I am doing.
The text file is comma-separated and has three fields,
and I want to extract all the lines whose second field occurs
more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is below. Inside awk, it cats the whole text file, greps for the second field of each line, and counts the matching lines.
If the count is greater than 2, it prints the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the output contains only the first three consecutive lines;
it does not also print the last three lines containing "keyword1", which occurs six times in the text.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
It is relatively straight-forward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print out the rows where the value in column 2 occurs more than your threshold value of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
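For the pipe case, one possible approach (a sketch, not the only way) is to buffer every line in memory while counting, and print the qualifying lines in the END block. This assumes the input fits in memory; the function name filter_repeated and its threshold argument are hypothetical.

```shell
# Hypothetical helper: reads comma-separated lines on stdin and prints those
# whose second field occurs more than "min" times (default 3).
filter_repeated() {
  awk -F, -v min="${1:-3}" '
    { line[NR] = $0; key[NR] = $2; count[$2]++ }   # buffer each line, count its key
    END {
      for (i = 1; i <= NR; i++)                    # replay in input order
        if (count[key[i]] > min) print line[i]
    }'
}

printf '11,keyword1,content1\n6,keyword2,content5\n4,keyword1,content3\n' | filter_repeated 1
```

Because the whole input is held in arrays, this trades memory for the ability to read from a pipe.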

formatted reading using awk

I am trying to read in a formatted file using awk. The content looks like the following:
    1PS1     A1    1  11.197   5.497   7.783
    1PS1     A1    1  11.189   5.846   7.700
.
.
.
In C printf notation, these lines have the following format:
"%5d%5s%5s%5d%8.3f%8.3f%8.3f"
where, first 5 positions are integer (1), next 5 positions are characters (PS1), next 5 positions are characters (A1), next 5 positions are integer (1), next 24 positions are divided into 3 columns of 8 positions with 3 decimal point floating numbers.
What I've been using is just calling these lines separated by columns using "$1, $2, $3". For example,
cat test.gro | awk 'BEGIN{i=0} {MolID[i]=$1; id[i]=$2; num[i]=$3; x[i]=$4;
y[i]=$5; z[i]=$6; i++} END { ...}' >test1.gro
But I ran into some problems with this, and now I am trying to read these files in a formatted way as discussed above.
Any idea how I do this?
Looking at your sample input, it seems the format string is actually "%5d%-5s%5s%5d%8.3f%8.3f%8.3f", with the first string field being left-justified. It's too bad awk doesn't have a scanf() function, but you can get your data with a few substr() calls:
awk -v OFS=: '
{
a=substr($0,1,5)
b=substr($0,6,5)
c=substr($0,11,5)
d=substr($0,16,5)
e=substr($0,21,8)
f=substr($0,29,8)
g=substr($0,37,8)
print a,b,c,d,e,f,g
}
'
outputs
    1:PS1  :   A1:    1:  11.197:   5.497:   7.783
    1:PS1  :   A1:    1:  11.189:   5.846:   7.700
If you have GNU awk, you can use the FIELDWIDTHS variable like this:
gawk -v FIELDWIDTHS="5 5 5 5 8 8 8" -v OFS=: '{print $1, $2, $3, $4, $5, $6, $7}'
also outputs
    1:PS1  :   A1:    1:  11.197:   5.497:   7.783
    1:PS1  :   A1:    1:  11.189:   5.846:   7.700
You never said exactly which fields you expect to have which number, so I'd like to be clear about how awk numbers them. (The way you explicitly count the whitespace as part of the fields in your format string makes me worry a little; you might have a different idea about this than awk.)
From the manpage:
An input line is normally made up of fields separated by white space,
or by regular expression FS. The fields are denoted $1, $2, ..., while
$0 refers to the entire line. If FS is null, the input line is split
into one field per character.
Take note that the whitespace in the input line does not get assigned a field number and that sequential whitespace is treated as a single field separator.
You can test this with something like:
echo "1 2 3 4" | awk '{print "1:" $1 "\t2:" $2 "\t3:" $3 "\t4:" $4}'
at the command line.
All of this assumes that you have not changed the FS variable, of course.
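To see the difference FS makes, here is a small sketch that counts fields on the same line under the default splitting and under a separator that matches exactly one space. The function name count_fields is hypothetical.

```shell
# Hypothetical helper: prints the number of fields awk sees on each stdin line.
# $1 is an optional field separator; without it awk uses its default splitting.
count_fields() {
  if [ "$#" -gt 0 ]; then
    awk -F "$1" '{ print NF }'
  else
    awk '{ print NF }'
  fi
}

printf '1   2\t3  4\n' | count_fields         # default: runs of blanks and tabs count once, prints 4
printf '1   2\t3  4\n' | count_fields '[ ]'   # every single space separates, prints 6
```

With the bracketed separator, each extra space produces an empty field and the tab is no longer a separator at all, which is why the count changes.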