bash or awk - generating report from complex data set

I have a program that generates a large data file; a small sample is in the input section below. Starting from an AOUT row, I want to look at its 4th column to find the next connection, which shows up in the 2nd column of some other line in the file, and repeat until the chain ends with an AIN in the first column. The number of connections between the AOUT and the AIN varies from one to over ten. If the chain doesn't end in an AIN, there should be no output. The output should start with the AOUT and show each connection until it reaches the AIN. Is there a way to use awk (or anything else) to create my desired output?
Input (this is a small section; there are many more rows, and the order they appear in is not standard):
AOUT,03xx:LY0372A,LIC0372.OUT,LIC0372
PIDA,03xx:LIC0372,LT372_SEL.OUT,LT372_SEL
SIGSEL,03xx:LT372_SEL,LT1_0372.PNT,LT1_0372
AIN,03xx:LT1_0372
Output:
03xx:LY0372A
=03xx:LT372_SEL.OUT
=03xx:LT1_0372.PNT
=03xx:LT1_0372
Output format:
(AOUT)
=(any number of jumps)
=(any number of jumps)
=(AIN)

As long as you don't provide more input and answers to the questions in the comments above, a possible solution in AWK could be:
#!/bin/bash
awk -F',' '{
    if ($1 == "AOUT") {                      # start of a new chain
        output = $2 "\n"
        connector = $4                       # tag to look for next
        sub(":.*", "", $2)                   # keep the prefix, e.g. "03xx"
        label = $2
    }
    else if ($1 == "AIN") {
        if (output != "" && $2 == label ":" connector) {
            output = output "=" $2           # chain ends in an AIN: report it
            print output
        }
        output = ""                          # otherwise drop the partial chain
    }
    else if (output != "") {
        if ($2 == label ":" connector) {     # next link in the chain
            output = output "=" label ":" $3 "\n"
            connector = $4
        }
    }
}' input.csv
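Note that the script above assumes the rows of a chain appear in file order. Since you say the order is not standard, a variant that first loads the whole file into arrays and only walks each chain in the END block may be more robust. A sketch (untested, and assuming every tag carries the same prefix as its AOUT, e.g. 03xx:, as in your sample):
#!/bin/bash
awk -F',' '
{
    if ($1 == "AOUT") aout[++n] = $2   # remember chain starting points
    type[$2] = $1                      # row type, keyed by tag (column 2)
    ref[$2] = $3                       # output reference (column 3)
    nxt[$2] = $4                       # next connector (column 4)
}
END {
    for (i = 1; i <= n; i++) {
        tag = aout[i]
        prefix = tag
        sub(/:.*/, "", prefix)         # e.g. "03xx"
        chain = tag
        cur = prefix ":" nxt[tag]
        hops = 0                       # guard against cyclic chains
        while (cur in type && type[cur] != "AIN" && ++hops < 100) {
            chain = chain "\n=" prefix ":" ref[cur]
            cur = prefix ":" nxt[cur]
        }
        if (type[cur] == "AIN")        # print only chains that end in an AIN
            print chain "\n=" cur
    }
}' input.csv
On your sample this walks AOUT -> PIDA -> SIGSEL -> AIN and reproduces the four output lines shown above; chains that never reach an AIN print nothing.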

Related

Sum of specific columns based on date using awk

I have data separated by commas:
LBA0SF004,2018-10-01,4681,4681
LBA0SF004,2018-10-01,919,919
LBA0SF004,2018-10-01,3,3
LBA0SF004,2018-10-01,11453,11453
LBA0SF004,2018-10-02,4681,4681
LBA0SF004,2018-10-02,1052,1052
LBA0SF004,2018-10-02,3,3
LBA0SF004,2018-10-02,8032,8032
I need an awk command to sum the 3rd and 4th columns, grouped by date. Where the same server appears with different dates, I need the sums per date, like this:
LBA0SF004 2018-10-01 17056 17056
LBA0SF004 2018-10-02 13768 13768
The GNU AWK construct below should do what you are looking for (it uses gawk's true multidimensional arrays):
awk '
BEGIN {
    FS = ","
    OFS = " "
}
{
    if (NF == 4) {
        a[$1][$2]["3rd"] += $3
        a[$1][$2]["4th"] += $4
    }
}
END {
    for (i in a)
        for (j in a[i])
            print i, j, a[i][j]["3rd"], a[i][j]["4th"]
}
' Input_File.txt
Explanation:
FS is the input field separator, which in your case is ,
OFS is the output field separator, which here is a single space
Create an array a keyed on the first and second columns, holding the sums of the third and fourth columns
At the END, print the contents of the array
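One caveat: for (i in a) does not guarantee any particular output order. If you need the rows in date order, as in your desired output, you can either pipe the result through sort or set gawk's traversal order up front. A minimal sketch of the latter, assuming the same input file:
awk '
BEGIN {
    FS = ","
    OFS = " "
    # gawk-specific: traverse array indices in ascending string order
    PROCINFO["sorted_in"] = "@ind_str_asc"
}
NF == 4 {
    a[$1][$2]["3rd"] += $3
    a[$1][$2]["4th"] += $4
}
END {
    for (i in a)
        for (j in a[i])
            print i, j, a[i][j]["3rd"], a[i][j]["4th"]
}
' Input_File.txt
With a POSIX awk, appending | sort to the pipeline is the portable alternative.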

Analysing two files using awk with if condition

I have two files. The first contains names, numbers and days for all samples:
sam_name.csv
Number,Day,Sample
171386,0,38_171386_D0_2-1.raw
171386,0,38_171386_D0_2-2.raw
171386,2,30_171386_D2_1-1.raw
171386,2,30_171386_D2_1-2.raw
171386,-1,40_171386_D-1_1-1.raw
171386,-1,40_171386_D-1_1-2.raw
The second includes information about batches (last column)
sam_batch.csv
Number,Day,Quar,Code,M.F,Status,Batch
171386,0,1,x,F,C,1
171386,1,1,x,F,C,2
171386,2,1,x,F,C,5
171386,-1,1,x,F,C,6
I would like to get the batch information (matching on two conditions: number and day) and add it to the first file. I have used an awk command to do that, but I am getting results only for one time point (-1).
Here is my command:
awk -F"," 'NR==FNR{number[$1]=$1;day[$1]=$2;batch[$1]=$7; next}{if($1==number[$1] && $2==day[$1]){print $0 "," number[$1] "," day[$1] "," batch[$1]}}' sam_batch.csv sam_nam.csv
Here are my results: the contents of sam_name, the number and day from sam_batch (just to check that the condition is working), and the batch number (the value I need):
Number,Day,Sample,Number,Day, Batch
171386,-1,40_171386_D-1_1-1.raw,171386,-1,6
171386,-1,40_171386_D-1_1-2.raw,171386,-1,6
175618,-1,08_175618_D-1_1-1.raw,175618,-1,2
Here I corrected your AWK code:
awk -F"," 'NR==FNR{
    number_day = $1 FS $2          # composite key: number and day
    batch[number_day] = $7
    next
}
{
    number_day = $1 FS $2
    print $0 "," batch[number_day]
}' sam_batch.csv sam_name.csv
Output:
Number,Day,Sample,Batch
171386,0,38_171386_D0_2-1.raw,1
171386,0,38_171386_D0_2-2.raw,1
171386,2,30_171386_D2_1-1.raw,5
171386,2,30_171386_D2_1-2.raw,5
171386,-1,40_171386_D-1_1-1.raw,6
171386,-1,40_171386_D-1_1-2.raw,6
(No need for double-checking if you understand how the script works.)
Here's another AWK solution (my original answer):
awk -v "b=sam_batch.csv" 'BEGIN {
FS=OFS=","
while(( getline line < b) > 0) {
n = split(line,a)
nd = a[1] FS a[2]
nd2b[nd] = a[n]
}
}
{ print $1,$2,$3,nd2b[$1 FS $2] }' sam_name.csv
Both solutions parse file sam_batch.csv at the beginning to form a dictionary of (number, day) -> batch. Then they parse sam_name.csv, printing out the first three fields together with the matching "Batch" value from the other file.
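If a (number, day) pair from sam_name.csv can be missing from sam_batch.csv, both scripts silently append an empty batch field. A variant of the first script (a sketch; the "NA" placeholder is only an illustration) makes such rows visible:
awk -F"," 'NR==FNR { batch[$1 FS $2] = $7; next }
{
    key = $1 FS $2
    # print a hypothetical "NA" marker when no batch is known for this key
    print $0 "," (key in batch ? batch[key] : "NA")
}' sam_batch.csv sam_name.csv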

Finding a string with AWK before or after an iteration of a repeated sequence

So I've got a text document that looks like this (truncated)
[FRAME]
pkt_pts_time=0.000000
pict_type=I
[/FRAME]
[FRAME]
pkt_pts_time=0.250250
pict_type=B
[/FRAME]
[FRAME]
pkt_pts_time=0.500500
pict_type=P
[/FRAME]
[FRAME]
pkt_pts_time=0.750750
pict_type=B
[/FRAME]
[FRAME]
pkt_pts_time=0.959292
pict_type=I
[/FRAME]
This text was created with this command:
ffprobe -select_streams v -show_frames -show_entries frame=pkt_pts_time,pict_type,frame_number -v quiet input.mp4
As you can see, the [Frame] to [/Frame] sequence is repeated. So this is a way for me to count the frames and find which frame is an I frame. In each sequence the "pict_type=" value changes. I was wondering if there was a way for me to use AWK to input an iteration number and output the preceding pkt_pts_time value where the pict_type value equals I.
For instance, if my frame number is 3, I would enter the number 3, and the awk expression would go to the third [FRAME] to [/FRAME] sequence and then look back from there until it found a "pict_type=I" string. It would then see that the pkt_pts_time for that sequence was "pkt_pts_time=0.000000" and it would output 0.000000.
Check this. I will explain how it works if it does what you want.
I count frames by the ending tag [/FRAME], but it can be changed to the starting tag [FRAME].
awk -F '=' -v frame_number=3 '
$1 == "[/FRAME]" {
    frame_cnt++;
}
$1 == "pkt_pts_time" {
    tmp_time = $2;
}
$2 == "I" {
    i_time = tmp_time;
}
frame_cnt == frame_number {
    print i_time;
    exit;
}' input.txt
The version with the frame number after the I frame:
awk -F '=' -v frame_number=3 '
$1 == "[/FRAME]" {
    frame_cnt++;
}
$1 == "pkt_pts_time" {
    tmp_time = $2;
}
$2 == "I" {
    i_time = tmp_time;
    i_frame_number = frame_cnt + 1;
}
frame_cnt == frame_number {
    print "The I frame time = " i_time;
    print "The I frame number + 1 = " i_frame_number + 1;
    exit;
}' input.txt
This version prints lower and upper "I" frame values, nearest to the target frame:
awk -F '=' -v frame_number=3 '
# The frame counter - each time the first field of the line
# equals the [FRAME] string, the counter increments.
$1 == "[FRAME]" {
    frame_cnt++;
}
# The "tmp_time" variable is updated each time "pkt_pts_time" occurs,
# so it does not hold a fixed value - it changes as the file is read.
$1 == "pkt_pts_time" {
    tmp_time = $2;
}
# Here we determine the nearest "I" frame before the target frame.
# It works this way: each time an "I" frame occurs, the "i_lower" value
# is updated. This keeps happening until we reach the target frame, which
# is the last time "i_lower" is updated. So we have found the nearest
# "I" frame before the target frame.
frame_cnt <= frame_number && $2 == "I" {
    i_lower = tmp_time;
}
# Here we determine the nearest "I" frame after the target frame.
# When it occurs, the lower and upper "I" frame values are printed
# and the script exits.
# Note that if no upper "I" frame exists, the script prints nothing,
# because the condition never becomes true.
frame_cnt >= frame_number && $2 == "I" {
    print "lower I = " i_lower;
    print "upper I = " tmp_time;
    exit;
}' input.txt
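For the sample input above and frame_number=3, this prints lower I = 0.000000 and upper I = 0.959292: frames 1 and 5 are the nearest I frames on either side of frame 3.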
Another approach: a gawk one-liner using its record structure (a multi-character, regex RS):
$ awk -v RS='\\[/FRAME\\]' '/pict_type=I/ { for (i=1; i<=NF; i++)
        if ($i ~ /pkt_pts_time/) { time = $i; break } }
    NR==3 { split(time, t, "="); print t[2]; exit }' input.txt
Store the time whenever an I frame is seen; when it is the third record, print the latest one seen.
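With the sample data this prints 0.000000: record 3 holds the P frame at 0.500500, and the most recent I frame seen by that point is frame 1 at pkt_pts_time=0.000000.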
It SOUNDS like this is what you're asking for, but it won't produce any output from your sample input if you ask about frame 3, because nothing in your sample input meets your requirements (as I understand them) for that frame:
$ cat tst.awk
BEGIN { FS="=" }
$1=="[FRAME]" { ++frameNr }
{ frame[$1] = $2 }
$1=="[/FRAME]" {
    if ( frameNr == n ) {
        if ( frame["pict_type"] == "I" ) {
            print frame["pkt_pts_time"]
        }
    }
    delete frame
}
$ awk -v n=3 -f tst.awk file
$ awk -v n=5 -f tst.awk file
0.959292
Anyway, hopefully it's obvious enough what it's doing that you can massage it to suit if it's not exactly what you need.

awk | Add new row or update existing row in a file

I want to update file1 on the basis of file2. If any row is new in file2, it should be added to file1. If any row from file2 is already in file1, that row should be updated with the row from file2 when the time (3rd field) in file2 is greater.
file1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
file2
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
newfile1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
Notes:
The 2nd field should be unique in the output.
Addition of a new value: for "1111111104" the latest row in file2 is taken, i.e. the one whose date (3rd field, 201312051016) is newer than the other (201312051014).
Update of an existing value: "1111111102" is updated with the newer value on the basis of the date in the 3rd column.
file1 is very LARGE whereas file2 has 5-10 entries only.
The row with 2nd field "1111111101" doesn't need to be updated because its entry in file1 already has the later date "201312051014" compared to "201312041013" in file2.
I haven't tried much on this because the conditions are really complex for me as a beginner, but here is my attempt:
BEGIN { FS = OFS = "," }
FNR == NR {
    m = $2;
    a[m] = $0;
    next
}
{
    if ($2 in a) {
        split(a[$2], datetime, ",")
        if ($3 > datetime[3])
            print $0;
        else
            print a[$2] "Old time"
    }
    else print $0 "NOMATCH";
    delete a[$2];
}
Assuming that you can start your awk as follows:
awk -f script.awk input2.csv input1.csv > result.csv
you can use the following script to obtain the desired output:
BEGIN {
    FS = OFS = ","
}
FILENAME == "input2.csv" {
    date[$2] = $3   # NB: if a key repeats in input2.csv, the last row wins
    data[$2] = $0
    used[$2] = 0
}
FILENAME == "input1.csv" {
    if ($2 in date) {
        used[$2] = 1
        if ($3 < date[$2])
            print data[$2]
        else
            print $0
    } else {
        print $0
    }
}
END {
    for (key in used) {
        if (used[key] == 0)
            print data[key]
    }
}
Notes:
The script takes advantage of the assumption that file2 is smaller than file1, because it uses an array only for the few entries in file2.
The new entries are simply appended to the output; there is no sorting. If sorting is required, extra effort will be needed.
EDIT
Heeding @JonathanLeffler's remark about the way I determine which file is being processed, I would like to offer an alternate version that may (or may not :-) ) be a little more straightforward to understand than checking NR==FNR. However, it only works for sufficiently recent versions of awk that can return the size of an array as length(array):
BEGIN {
    FS = ","
}
{
    # The following effectively creates an array entry for each filename
    # found (for "known" filenames existing entries are overwritten).
    files[FILENAME] = 1
    # check the number of files we have so far
    if (length(files) == 1) {
        # we are still in the first file
        date[$2] = $3
        data[$2] = $0
        used[$2] = 0
    } else {
        # we are in the second file (or any other following file)
        if ($2 in date) {
            used[$2] = 1
            if ($3 < date[$2])
                print data[$2]
            else
                print $0
        } else {
            print $0
        }
    }
}
END {
    for (key in used) {
        if (used[key] == 0)
            print data[key]
    }
}
Also, if you require your output to be sorted on the second field, you can replace the call to awk with this:
awk -f script.awk input2.csv input1.csv | sort -t "," -n -k 2 > result.csv
The latter, of course, works for both versions of the script.
Since file1 is very large but file2 is very small (5-10 entries), you should read all of file2 into memory first, dealing with its duplicate keys as you go. You then have an array, indexed by the key field, holding the new data, plus a second array recording the date for each key. As you read the main file, you look up the key and the date in the arrays and, where needed, substitute the saved new record for the incoming old record.
Your outline script is most of the way there. It is more complex because you didn't save the dates as they came in. This more or less works:
awk -F, '
FNR == NR {   # first file (file2): keep only the newest row per key
    if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 }
    next
}
{
    if ($2 in date) {
        if (date[$2] > $3)
            print line[$2]
        else
            print
        delete line[$2]
        delete date[$2]
    }
    else
        print
}
END { for (l in line) print line[l] }' file2 file1
Sample output for given data:
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
However, if there were 4 new records, there's no guarantee that they'd be in sorted order, though they would all be at the end of the list. It would be possible to upgrade the script to print the new records at the appropriate place in the list if the input is guaranteed to be in sorted order. You simply have to search through the list of lines to see whether there are any lines that should be printed before the current line, and if so, do so (and delete the record so that they are not printed at the end).
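For what it's worth, here is an untested sketch of that upgrade. It assumes file1 is sorted on field 2 and that the keys are fixed-width, so string comparison agrees with numeric order:
awk -F, '
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 } next }
{
    # file1 is sorted on field 2, so a saved key that sorts before the
    # current key can never match a later line: it must be a new record.
    for (k in line)
        if (k < $2) { print line[k]; delete line[k]; delete date[k] }
    if ($2 in date) {
        if (date[$2] > $3) print line[$2]; else print
        delete line[$2]; delete date[$2]
    }
    else print
}
END { for (k in line) print line[k] }' file2 file1
Note that for (k in line) traverses in no particular order, so if several new keys fall between two lines of file1 they may come out unsorted among themselves; with only 5-10 entries in file2, a final pass through sort as shown earlier is the simpler fix.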
Note that uniqueness in the output depends on uniqueness in the input (file1). That is, if field 2 in the input is repeated, this code won't notice. There is also nothing that can be done with the current design even if a duplicate was spotted; the old row has been printed so printing the new row will simply cause the duplicate. If you were worried about this, you could design the awk script to keep the whole of file1 in memory and only print anything when the whole of the input has been processed. Needless to say, this uses a lot more memory than the current design, and will generally be less efficient because of that. Nevertheless, it could be done if needed.

awk | Rearrange fields of CSV file on the basis of column value

I need your help in writing awk for the problem below. I have a source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not organised in the source file.
In the output file, fields are organised in a specific format, for example: all a values are in the 2nd column, then b, then c.
The value c appears n times in the second line, so in the output its values are merged with a PIPE symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN { FS="[,:]"; split("session,a,b,c",order) }
{
    split("",val)   # or "delete val" in gawk
    for (i=1; i<NF; i+=2) {
        val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
    }
    for (i=1; i in order; i++) {
        name = order[i]
        printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
    }
    print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
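For instance, to have the e values printed last, that split() would become:
BEGIN { FS="[,:]"; split("session,a,b,c,e",order) }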
Try with GNU awk (asorti() and length(array) are gawk-specific):
awk '
BEGIN {
    FS = "[,:]"
    OFS = ","
}
{
    for ( i = 1; i <= NF; i += 2 ) {
        if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
        hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
    }
    asorti( hash, hash_orig )
    for ( i = 1; i <= length(hash); i++ ) {
        printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
    }
    printf "\n"
    delete hash
    delete hash_orig
}
' infile
It splits each line on any comma or colon and traverses the odd-numbered fields, saving them and their values in a hash that is printed at the end of each line. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9
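Note the output above uses name:value pairs; if you need the name=value format from the question for the non-session fields, the second printf would presumably become:
printf ",%s=%s", hash_orig[i], hash[ hash_orig[i] ]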