gawk: extracting data from special symbols

I am trying to get the total time from strace -T, which is reported as:
pid command [time]
(for each system call)
Now I want to sum the [time]. I am using gawk, and I know that the last field can be accessed with $NF. However, $NF reports [time] (with brackets) instead of just time, which I obviously can't sum up. So what I am asking is: how do I get time instead of [time]?
Thanks

You can get to "time" in "[time]" by changing the field separator:
awk 'BEGIN {FS="[\\[\\]]"}; {print $(NF-1)}'
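To actually sum the times rather than just print them, accumulate the extracted field and print the total in an END block. A minimal sketch, assuming every line of interest ends in [time] (strace writes its trace to stderr, hence the 2>&1, and ls is just a stand-in for the traced command):
strace -T ls 2>&1 | awk 'BEGIN {FS="[\\[\\]]"} /\[/ {sum += $(NF-1)} END {print sum}'
The /\[/ guard skips lines without a bracketed time, such as the final +++ exited +++ status line.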

Related

Bash - does the awk function have a built in julian day function?

I have a csv file where the 1st column is the date and time the data was generated. I am trying to use awk to convert that date into a Julian day and add it as an extra column on the end:
"2021-01-22 22:02:00",475673,485,0,0,0,0,0,0,0,0,-1.308788,-4.421722,-99
"2021-01-22 23:03:00",475674,485,0,0,0,0,0,0,0,0,-1.329033,-4.373959,-99
"2021-01-22 24:04:00",475675,485,0,0,0,0,0,0,0,0,-1.320374,-4.359528,-99
"2021-01-22 25:05:00",475676,485,0,0,0,0,0,0,0,0,-1.329685,-4.494766,-99
"2021-01-22 26:06:00",475677,485,0,0,0,0,0,0,0,0,-1.343422,-4.650154,-99
I have written a script in bash that is called when a file arrives for processing. I have tried a couple of different variations on the below:
awk '{ jday=date -d(substr($0,2,10)) +%j;print $0","jday }' temp.CMP
The reason I am using the awk command is because I am also extracting the year, month, day, hour, minute data and adding as individual columns on the end of each line.
Is what I am trying possible using awk in Bash?
Thanks in advance for any help.
If you have access to GNU awk, you can try the following:
awk -F, '{ dattim=gensub("[-:\"]"," ","g",$1);print $0","strftime("%j",mktime(dattim))}' file
Use gensub to replace all "-", ":" and double-quote characters with spaces in the first comma-delimited field, and store the result in the variable dattim. Then feed this variable to mktime, and use strftime with the %j format to append the Julian day to the end of the line.
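To see what mktime receives, you can run just the gensub step on a sample record (a sketch using the first line from the question); the dashes, colons and quotes all become spaces, which matches the "YYYY MM DD HH MM SS" layout mktime expects, and the stray leading/trailing spaces are harmless:
echo '"2021-01-22 22:02:00",475673' | gawk -F, '{print gensub("[-:\"]"," ","g",$1)}'
 2021 01 22 22 02 00
Note that some of the sample rows carry out-of-range clock times (24:04, 25:05, ...); mktime normalizes those into the following day, which also shifts the %j value.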

How to filter the OTU by counts with AWK?

I am trying to filter all the singletons out of a fasta file.
Here is my input file:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU2;size=1;
ATCCGGGACTGATC
>OTU3;size=5;
GAACTATCGGGTAA
>OTU4;size=1;
AATTGGCCATCT
The expected output is:
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
I've tried
awk -F'>' '{if($1>=2) {print $0}}' input.fasta > output.fasta
but this removes all the headers for each OTU.
Could anyone help me out?
Could you please try the following. With = and ; as field separators, the size on a header line is $3: the flag is cleared on every header, set again whenever the size is at least 2 (i.e. not a singleton), and every line is printed while the flag is set:
awk -F'[=;]' '/^>/{flag=""} $3>=2{flag=1} flag' Input_file
On each header line, the following sets f to 1 if the line contains =1; (a singleton) and to 0 otherwise, then prints all lines while f is false:
$ awk '/>/{f=/=1;/} !f' file
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
awk -v FS='[;=]' 'prev_sz>=2 && !/size/{print prev RS $0} /size/{prev=$0;prev_sz=$(NF-1)}'
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA
Store the size from each header line in the prev_sz variable and the whole line in prev. On the following sequence line, check whether the stored size is >= 2; if it is, print the previous (header) line and the current line, joined with RS (the record separator, a newline by default).
While all the above methods work, they are limited by the fact that the input always has to look the same, i.e. the sequence name in your fasta file needs to have the form:
>NAME;size=value;
A few solutions can handle somewhat more extended sequence names, but none handles the case where things get a bit more generic, i.e.
>NAME;label1=value1;label2=value2;STRING;label3=value3;
Print sequence where label xxx matches value vvv:
awk '/>/{f = /;xxx=vvv;/} f' file.fasta
Print sequence where label xxx has a numeric value p bigger than q:
awk -v label="xxx" -v limit=q '
BEGIN { ere = ";" label "=" }
/>/ { f = 0
      if (match($0, ere)) {   # only test the value when the label is present
        value = 0 + substr($0, RSTART + length(ere)); f = (value > limit) } }
f' <file>
In the above, ere is a dynamic regular expression we try to match; we use it to find the location of the value attached to label xxx. The substring starting there still has non-numeric characters after the value, but adding 0 to it converts it to a number, dropping everything from the first non-numeric character on (i.e. 3;label4=value4; is converted to 3). We check whether the value is bigger than our limit, and print the sequence based on that result.
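For instance, with label="size" and limit=1 this generic script reproduces the filtering asked for above; a sketch run against the question's input:
$ awk -v label="size" -v limit=1 'BEGIN{ere=";" label "="} />/{f=0; if (match($0,ere)) {value=0+substr($0,RSTART+length(ere)); f=(value>limit)}} f' input.fasta
>OTU1;size=3;
ATTCCCCGGGGGGG
>OTU3;size=5;
GAACTATCGGGTAA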

Finding max value using AWK

I have a file with two-column data and I want to find the max value and print it.
file = load_measure
11:20,18.03
11:25,17.85
11:30,18.24
11:35,19.19
11:40,18.45
11:45,17.53
11:50,17.56
11:55,17.60
12:00,18.51
12:05,18.50
I tried the code below but it returns 0:
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print max}' load_measure
0
I tried printing max as $max but it does not give the real max:
awk 'BEGIN {max = 0} {if ($2>max) max=$2} END {print $max}' load_measure
12:05,18.50
Can anyone explain what I'm doing wrong?
Thank you!
When your fields are separated by something other than white space, you need to tell awk what that something is by populating FS. You also need to set max to the first value read so the script will work for all-negative input, and you need to print max+0 in the END block to ensure numeric output even if the input file is empty:
awk -F, 'NR==1{max=$2} $2>max{max=$2} END{print max+0}' file
When max is 2, print max prints the value of max, i.e. 2, while print $max prints the value of the field indexed by the value of max, i.e. $2, which in an END section will either be null or the value of $2 on the last line read (undefined behavior per POSIX, so awk-dependent).
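With the sample data above, the corrected command prints the true maximum:
$ awk -F, 'NR==1{max=$2} $2>max{max=$2} END{print max+0}' load_measure
19.19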
You should specify the value of FS, the input field separator. It describes how each record is split into fields; it may even be an extended regular expression.
On awk's command line, FS can be specified as -F <sep> (or -v FS=<sep>). You can also set it in the BEGIN block.
I normally use the latter method, but that's just a personal preference:
BEGIN {max=0; FS=","} ...
Your problem can also be solved like this:
awk -F, -v m=0 '$2>m {m=$2} END {print m}'
thus sparing an if statement.
The POSIX-mandated default value of FS is a single space (0x20), but be aware that with this default a run of blanks (more than one space or tab) is treated as a single field separator, and leading blanks are skipped.
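A quick illustration of the difference: the default FS collapses runs of blanks, while an explicit separator keeps empty fields.
$ echo 'a   b' | awk '{print NF}'
2
$ echo 'a,,b' | awk -F, '{print NF}'
3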
The official documentation for GNU Awk is at https://www.gnu.org/software/gawk/manual/.

Bash script to process a csv file line by line, updating $6 with a different value but keeping other values unchanged

I am a beginner at bash scripting and I have been trying to fix this for more than 8 hours.
I have searched on Stack Overflow and tried adapting the answers to fit my needs, but without success.
I want to use bash script to change csv file's date value to current date.
I am using a dummy .csv file ( http://eforexcel.com/wp/wp-content/uploads/2017/07/100-Sales-Records.zip ) and I want to change the 6th value (date) to the current date.
What I have been doing so far:
I have created one line csv to test the script
cat oneline.csv:
Australia and Oceania,Tuvalu,Baby Food,Offline,H,5/28/2010,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50
Then I tested the one-line script:
echo `cat oneline.csv | awk -F, '{ print $1"," $2"," $3"," $4"," $5","}'` `date` `cat oneline.csv | awk -F, '{print $7"," $8"," $9"," $10"," $11"," $12"," $13"," $14"\n"}'`
Then I have this code for the whole 100-line file in source.sh:
#I want to change 6th value for every line of source.csv to current date and keep the rest and export it to output.csv
while read
do
part1=$(`cat source.csv | awk -F, '{ print $1"," $2"," $3"," $4"," $5","}'`)
datum=$(`date`)
part2=$(`cat source.csv |awk -F, '{print $7"," $8"," $9"," $10"," $11"," $12"," $13"," $14"\n"}'`)
echo `$part1 $datum $part2`
done
and I expect to run the command like ./source.sh > output.csv
What I want for the full 100-line file is a result like:
Australia and Oceania,Tuvalu,Baby Food,Offline,H,Thu Jan 17 06:34:03 EST 2019,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50
Could you guide me how to change the code to get the result?
Refactor everything to a single Awk script; that also avoids the echo in backticks.
awk -v datum="$(date)" -F , 'BEGIN { OFS=FS }
{ $6 = datum } 1' source.csv >output.csv
Briefly, we split on comma (-F ,) and replace the value of the sixth field with the value of the variable we passed in with -v. OFS=FS sets the output field separator to the input field separator (comma). Then the 1 means "print unconditionally".
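For example, with the one-line sample from the question (the date string will of course reflect the moment you run it):
$ awk -v datum="$(date)" -F , 'BEGIN { OFS=FS } { $6 = datum } 1' oneline.csv
Australia and Oceania,Tuvalu,Baby Food,Offline,H,Thu Jan 17 06:34:03 EST 2019,669165933,6/27/2010,9925,255.28,159.42,2533654.00,1582243.50,951410.50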
Generally speaking, you should probably avoid while read.
Tangentially, your quoting looks wacky; you don't want backticks around $part1 unless it is a command you want the shell to run (which in turn is probably a bad idea in itself). Also, backticks have long been deprecated in favor of the $(command) syntax, which is more legible and offers some syntactic advantages.
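For instance, capturing and then using the date needs command substitution only around date itself, and plain expansion (no backticks at all) where the variable is used; a minimal sketch:
datum=$(date)    # command substitution: run date, capture its output
echo "$datum"    # plain variable expansion; no backticks here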

Using awk to fill in SQL Dates

I am trying to generate SQL filling in a date via command line, using awk's printf. The code I am using is:
awk 'BEGIN{ printf " convert_tz(time,\047GMT\047,\047America/New_York\047) as timestamp , date_format(convert_tz(time,\047GMT\047,\047America/New_York\047), \047\045Y\045m\045d \045H\045i\045s\047) as dt from table where time >= convert_tz(%s,\047America/New_York\047,\047GMT\047) and time <= convert_tz(%s + interval 1 day ,\047America/New_York\047,\047GMT\047);", "2011-01-01", "2011-01-01"}'
I believe I have the escaping correct, but I get the following result:
awk: fatal: not enough arguments to satisfy format string
Does anyone have an idea of why the %s is not getting caught and populated?
The specific version of awk I'm using is GNU Awk 3.1.6.
Your escaping is a bit off: you can't replace % with \045, because the escape is translated back to % when the string literal is parsed, before printf is called, so printf still sees a bare % and expects an argument for it. The way to escape % in a printf format string is to use %% instead, and it will work well:
...\047%%Y%%m%%d %%H%%i%%s\047...
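A minimal demonstration of the difference; the \045 version reproduces the error from the question, because printf receives a bare %s with no argument to satisfy it:
$ awk 'BEGIN{ printf "\045s\n" }'
awk: fatal: not enough arguments to satisfy format string
$ awk 'BEGIN{ printf "%%s\n" }'
%s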