Rearrange lines to tabular format - awk

I would like to grab the part after "-" and combine it with the following letter-string into a tab-output. I tried something like cut -d "*-" -f 2 <<< "$your_str" but I am not sure how to do the whole shuffling.
Input:
>1-395652
TATTGCACTTGTCCCGGCCTGT
>2-369990
TATTGCACTCGTCCCGGCCTCC
>3-132234
TATTGCACTCGTCCCGGCCTC
>4-122014
TATTGCACTTGTCCCGGCCTGTAA
>5-118616
Output:
TATTGCACTTGTCCCGGCCTGT 395652
TATTGCACTCGTCCCGGCCTCC 369990

awk to the rescue!
awk -F- '/^>/{k=$2; next} {print $0, k}' file
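The question asks for tab-separated output, while the one-liner above prints a space between the two columns. A minimal sketch of the same idea with the output separator set to a tab; the function name fasta_to_table is made up for illustration, and it assumes each ">" header is followed by exactly one sequence line:

```shell
# Hypothetical helper: reads the ">N-count" / sequence pairs on stdin
# and writes "sequence<TAB>count" lines.
fasta_to_table() {
  awk -F- -v OFS='\t' '
    /^>/ { count = $2; next }   # header line: remember the part after "-"
    { print $0, count }         # sequence line: emit it with the saved count
  '
}

printf '>1-395652\nTATTGCACTTGTCCCGGCCTGT\n>2-369990\nTATTGCACTCGTCCCGGCCTCC\n' | fasta_to_table
```

A header with no sequence after it (like >5-118616 in the sample) simply produces no output line.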

With GNU sed:
sed -nE 'N;s/.*-([0-9]+)\n(.*)/\2\t\1/p' file
Output:
TATTGCACTTGTCCCGGCCTGT 395652
TATTGCACTCGTCCCGGCCTCC 369990
TATTGCACTCGTCCCGGCCTC 132234
TATTGCACTTGTCCCGGCCTGTAA 122014

Portable sed:
sed -n 's/.*-//;x;n;G;s/\n/ /p' inputfile
Output:
TATTGCACTTGTCCCGGCCTGT 395652
TATTGCACTCGTCCCGGCCTCC 369990
TATTGCACTCGTCCCGGCCTC 132234
TATTGCACTTGTCCCGGCCTGTAA 122014

Related

Regexp in gawk matches multiple ways

I have some text I need to split up to extract the relevant argument, and my [g]awk match command does not behave - I just want to understand why?! (I have written a less elegant way around it now...).
So the string is blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header
I want to output just the contents of msgcontent1=, so did
echo "blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header" | gawk '{ if (match($0,/msgcontent1=(.*)[|]/,a)) { print a[1]; } }'
The trouble is that instead of getting
HeaderUUIiewConsenFlagPSMessage
I get the match with everything from there to the last pipe of the string HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002
Now I accept this is because the regexp in /msgcontent1=(.*)[|]/ can match multiple ways, but HOW do I make it match the way I want it to??
With your shown samples, please try the following. Written and tested in GNU awk, it prints only the content from msgcontent1= up to the first occurrence of |.
awk 'match($0,/msgcontent1=[^|]*/){print substr($0,RSTART+12,RLENGTH-12)}' Input_file
OR with echo + awk try:
echo "blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header" |
awk 'match($0,/msgcontent1=[^|]*/){print substr($0,RSTART+12,RLENGTH-12)}'
With FPAT option in GNU awk:
awk -v FPAT='msgcontent1=[^|]*' '{sub(/.*=/,"",$1);print $1}' Input_file
This is your input:
s='blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header'
You may use gnu awk like this to extract value after msgcontent1=:
awk -F= -v RS='|' '$1 == "msgcontent1" {print $2}' <<< "$s"
HeaderUUIiewConsenFlagPSMessage
or using this sed:
sed -E 's/^(.*\|)?msgcontent1=([^|]+).*/\2/' <<< "$s"
HeaderUUIiewConsenFlagPSMessage
Or using this gnu grep:
grep -oP '(^|\|)msgcontent1=\K[^|]+' <<< "$s"
HeaderUUIiewConsenFlagPSMessage
echo "blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header" | gawk '{ if (match($0,/msgcontent1=([^|]*)/,a)) print a[1] }'
this prints HeaderUUIiewConsenFlagPSMessage
The reason your regex matches msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002 is that matching is greedy, so .* always takes the longest possible match.
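To see the difference side by side, here is a small sketch that runs both patterns on the string from the question. It uses only the two-argument match() with RSTART and RLENGTH, so it should not need gawk; the variable names greedy and bounded are just for illustration:

```shell
s='blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header'

# Greedy: .* runs as far as it can, so the match ends at the LAST "|" in the line.
# RLENGTH - 13 drops the 12-character "msgcontent1=" prefix and the trailing "|".
greedy=$(printf '%s\n' "$s" |
  awk 'match($0, /msgcontent1=.*[|]/) { print substr($0, RSTART + 12, RLENGTH - 13) }')

# Negated class: [^|]* cannot cross a "|", so the match stops before the first one.
bounded=$(printf '%s\n' "$s" |
  awk 'match($0, /msgcontent1=[^|]*/) { print substr($0, RSTART + 12, RLENGTH - 12) }')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

awk has no non-greedy quantifier like .*?, so excluding the delimiter with a negated bracket expression is the usual way to stop at the first occurrence.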
Also with awk:
echo 'blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header' | awk -v FS='[=|]' '$2 == "msgcontent1" {print $3}'
HeaderUUIiewConsenFlagPSMessage

How to remove n-numbers of characters after a certain pattern using sed or awk?

Let's say I have a file with lines that look like this:
12:03:22.141245532 12:03:22.892612543 my_script_bla_bla
I need to make all the lines of the file look like this:
12:03:22.1 12:03:22.8 my_script_bla_bla
How can I achieve it using awk or sed?
You may use this gnu awk command:
awk '{for (i=1; i<=2; ++i) $i=gensub(/(\.[0-9])[0-9]+$/, "\\1", "1", $i)} 1' file
12:03:22.1 12:03:22.8 my_script_bla_bla
sed -e 's/^\([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\.[0-9]\)[0-9]* \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\.[0-9]\)[0-9]*/\1 \2/'
Match two time strings, capturing the information up to and including the first digit after the decimal point, ignoring any extra digits, replacing the input with the two captured strings.
It's not very compact.
I would use GNU AWK's gensub in the following way. Let the content of file.txt be:
12:03:22.141245532 12:03:22.892612543 my_script_bla_bla
then
awk '{print gensub(/([0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9])[0-9]+/, "\\1", "g")}' file.txt
output
12:03:22.1 12:03:22.8 my_script_bla_bla
Explanation: I used a single capturing group enclosing the characters to be preserved, and the global (g) mode of gensub.
(tested in gawk 4.2.1)
This might work for you (GNU sed ):
sed -E 's/(..:..:..\..)\S+/\1/g' file
Match each timestamp and drop everything after the first fractional digit.
N.B. This will probably work but if you want belt-and-braces use:
sed -E 's/^([[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}\.[[:digit:]])[[:digit:]]* ([[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}\.[[:digit:]])[[:digit:]]*/\1 \2/g' file
another awk
$ awk '{$1=substr($1,1,10);$2=substr($2,1,10)}1' file
12:03:22.1 12:03:22.8 my_script_bla_bla
This looks like timestamp-formatted data; trim the first two fields after the 10th character.
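The substr version hard-codes position 10, which only works while the part before the "." is always 8 characters wide. A rough sketch that locates the "." instead and keeps a configurable number of digits after it; the function name trim_fraction and its argument are invented for this example, and it assumes only the first two fields are timestamps:

```shell
# Hypothetical helper: $1 is how many fractional digits to keep (default 1).
trim_fraction() {
  awk -v keep="${1:-1}" '
    {
      for (i = 1; i <= 2; i++) {
        dot = index($i, ".")                      # position of the first "."
        if (dot > 0) $i = substr($i, 1, dot + keep)
      }
      print                                       # fields are rejoined with single spaces
    }
  '
}

echo '12:03:22.141245532 12:03:22.892612543 my_script_bla_bla' | trim_fraction 1
```

Calling it as trim_fraction 3 would keep three digits after the "." instead of one.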
Using Perl
$ echo "12:03:22.141245532 12:03:22.892612543 my_script_bla_bla" | perl -ne ' s/\.(.)\S+/.$1/ ; s/\.(.)\S+/.$1/; print '
12:03:22.1 12:03:22.8 my_script_bla_bla
$
If my_script_bla_bla doesn't contain any ".", then
$ echo "12:03:22.141245532 12:03:22.892612543 my_script_bla_bla" | perl -pe ' s/\.\K(.)\S+/$1/g '
12:03:22.1 12:03:22.8 my_script_bla_bla
$

printing strings and inputs from file

I have file named files.txt:
file1.F
data.dat
image.png
I would like to produce a file containing:
IN='file1.F'
IN='data.dat'
IN='image.png'
How to reach that? I tried this, but the syntax is poor:
awk '{print 'IN=\''$1'\''}' files.txt > input
Could you please try the following.
awk -v s1="\047" -v var="IN=" '{print var s1 $0 s1}' Input_file
Output will be as follows.
IN='file1.F'
IN='data.dat'
IN='image.png'
If sed is an option.
sed "s/.*/IN='&'/" file
Output:
IN='file1.F'
IN='data.dat'
IN='image.png'
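The attempt in the question breaks because a single quote cannot appear inside a single-quoted shell string, so the awk program gets cut into pieces before awk ever sees it. One way around that, sketched here with a made-up function name quote_lines, is to write the quote as the octal escape \047 inside the printf format:

```shell
# Hypothetical helper: wraps every input line as IN='...'.
quote_lines() {
  # \047 is the octal escape for the single-quote character, so the awk
  # program contains no literal quote that would end the shell string.
  awk '{ printf "IN=\047%s\047\n", $0 }'
}

printf 'file1.F\ndata.dat\nimage.png\n' | quote_lines
```

This is the same trick as the -v s1="\047" answer, just without the extra variables.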

AWK pipe output to SED to replace

I'm trying to replace a string using AWK pipe out to SED
grep pdo_user /html/data/_default_/configs/application.ini | awk '{print $3}' | sed -i 's/$1/"username"/g' /html/data/_default_/configs/application.ini
but found string is not replaced
Output for
grep pdo_user /html/data/_default_/configs/application.ini | awk '{print $3}'
is
"root"
Any tips on that?
I suggest to use awk and mv:
awk '/pdo_user/ && $3=="\"root\"" {$3="\"username\""}1' /path/to/application.ini > /path/to/application.tmp
mv /path/to/application.tmp /path/to/application.ini
Working solution based on Shelter's tip using AWK and SED
sed -i 's/'"$(awk '/pdo_user/{print $3}' /path/to/application.ini)"'/"username"/' /path/to/application.ini
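For what it's worth, the original pipeline cannot work for two reasons: sed takes its pattern from the command line, not from stdin, and $1 inside single quotes is never expanded by the shell. A step-by-step sketch of the same fix, run against a throwaway file; the "key = value" layout and the pdo_pass line are assumptions made up for the demo, and it assumes the old value contains no "/" or regex metacharacters:

```shell
# Hypothetical stand-in for application.ini so the example is safe to run.
conf=$(mktemp)
printf 'pdo_user = "root"\npdo_pass = "secret"\n' > "$conf"

# Capture the current value (third field of the pdo_user line) in a variable.
old=$(awk '/pdo_user/ { print $3 }' "$conf")

# Double quotes let the shell expand $old into the sed script.
# Write to a temp file and move it into place instead of relying on sed -i.
sed "/pdo_user/ s/$old/\"username\"/" "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"

cat "$conf"
```

Restricting the substitution to lines matching /pdo_user/ keeps other lines that happen to contain the same value untouched.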

use awk to print a column, adding a comma

I have a file, from which I want to retrieve the first column, and add a comma between each value.
Example:
AAAA 12345 xccvbn
BBBB 43431 fkodks
CCCC 51234 plafad
to obtain
AAAA,BBBB,CCCC
I decided to use awk, so I did
awk '{ $1=$1","; print $1 }'
Problem is: this adds a comma also on the last value, which is not what I want to achieve, and also I get a space between values.
How do I remove the comma on the last element, and how do I remove the space? Spent 20 minutes looking at the manual without luck.
$ awk '{printf "%s%s",sep,$1; sep=","} END{print ""}' file
AAAA,BBBB,CCCC
or if you prefer:
$ awk '{printf "%s%s",(NR>1?",":""),$1} END{print ""}' file
AAAA,BBBB,CCCC
or if you like golf and don't mind it being inefficient for large files:
$ awk '{r=r s $1;s=","} END{print r}' file
AAAA,BBBB,CCCC
awk {'print $1","$2","$3'} file_name
This is the shortest I know
Why make it complicated :) (as long as file is not too large)
awk '{a=NR==1?$1:a","$1} END {print a}' file
AAAA,BBBB,CCCC
For better portability:
awk '{a=(NR>1?a",":"")$1} END {print a}' file
You can do this:
awk 'a++{printf ","}{printf "%s", $1}' file
a++ is interpreted as a condition. In the first row its value is 0, so the comma is not added.
EDIT:
If you want a newline, you have to add END{printf "\n"}. If you have problems reading in the file, you can also try:
cat file | awk 'a++{printf ","}{printf "%s", $1}'
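Putting the a++ condition and the END newline together, a complete version might look like the sketch below. The function name join_first_column and the variable name seen are only illustrative:

```shell
# Hypothetical helper: joins the first column of stdin with commas.
join_first_column() {
  # seen++ evaluates to 0 (false) on the first record, so no comma is
  # printed before the first value; END adds the final newline.
  awk 'seen++ { printf "," } { printf "%s", $1 } END { print "" }'
}

printf 'AAAA 12345 xccvbn\nBBBB 43431 fkodks\nCCCC 51234 plafad\n' | join_first_column
```

Because the comma is printed before every value except the first, there is never a trailing comma to strip afterwards.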
awk 'NR==1{printf "%s",$1;next;}{printf "%s%s",",",$1;}' input.txt
It says: If it is first line only print first field, for the other lines first print , then print first field.
Output:
AAAA,BBBB,CCCC
In this case, a simple cut and paste solution:
cut -d" " -f1 file | paste -s -d, -
In case somebody like me wants to use awk for cleaning docker images:
docker image ls | grep tag_name | awk '{print $1":"$2}'
Surprised that no one is using OFS (output field separator). Here is probably the simplest solution that sticks with awk and works on Linux and Mac: use "-v OFS=," to output with a comma as the delimiter:
$ echo '1:2:3:4' | awk -F: -v OFS=, '{print $1, $2, $4, $3}' generates:
1,2,4,3
It works for multiple char too:
$ echo '1:2:3:4' | awk -F: -v OFS=., '{print $1, $2, $4, $3}' outputs:
1.,2.,4.,3
Using Perl
$ cat group_col.txt
AAAA 12345 xccvbn
BBBB 43431 fkodks
CCCC 51234 plafad
$ perl -lane ' push(@x,$F[0]); END { print join(",",@x) } ' group_col.txt
AAAA,BBBB,CCCC
$
This can be very simple like this:
awk -F',' '{print $1","$1","$2","$3}' inputFile
where input file is : 1,2,3
2,3,4 etc.
I used the following, because it lists the api-resource names with it, which is useful, if you want to access it directly. I also use a label "application" to find specific apps in a namespace:
kubectl -n ops-tools get $(kubectl api-resources --no-headers=true --sort-by=name | awk '{printf "%s%s",sep,$1; sep=","}') -l app.kubernetes.io/instance=application