Generating a new file after processing data in Shell script - awk

The input file shown below was generated by combining the results of 2 other files, i.e.
awk 'BEGIN{FS=OFS=","} FNR==NR{arr[$0];next} {print $1,$2,$3,$5,($4 in arr)?1:0}' $NGW_REG_RESP_FILE $NGW_REG_REQ_FILE >> $NGW_REG_FILE
$NGW_REG_FILE contains the data below; based on that, I have to create a new file.
2020-12-21,18,1,1,1
2020-12-21,18,1,1,0
2020-12-21,18,1,2,1
2020-12-21,18,1,2,1
2020-12-21,18,2,1,1
2020-12-21,18,2,1,1
2020-12-21,18,2,1,0
2020-12-21,18,3,2,1
2020-12-21,18,3,2,1
2020-12-21,18,4,2,0
2020-12-21,18,4,2,1
2020-12-21,18,3,2,0
What this data indicates is:
Date,Hour,Quarter,ReqType,Success/Failed
ReqType has 2 possibilities: 1 -> incoming, 2 -> outgoing
Last field: 1 -> success, 0 -> failed
Quarter -> 1, 2, 3, 4
I want to read this file and generate a new file that contains data like below (MY OUTPUT FILE):
2020-12-21,18,1,1,1,1
2020-12-21,18,1,2,2,0
2020-12-21,18,2,1,2,1
.....
Explanation:
Heading: date,hour,quarter,reqType,success_count,failure_count (for reference, to understand the output file)
2020-12-21,18,1,1,1,1
Explanation: in the input file, both ReqTypes (1 and 2) were present for quarter 1; there will be at most 2 entries per quarter.
In quarter 1 there were 2 requests for ReqType 1: one succeeded and the other failed, so the success count is 1 and the failure count is 1.
2020-12-21,18,1,2,2,0
Here, in quarter 1 for ReqType 2, there were 2 requests and both succeeded, so the success count is 2 and the failure count is 0.
UPDATE
The answer given below worked exactly as I was looking for.
I have an update to the sample input file: one more column gets added before the last column, the STATUS CODE, which you can see in the input below, i.e. 200, 400, 300:
2020-12-21,18,1,1,200,1
2020-12-21,18,2,1,400,0
2020-12-21,18,2,1,300,0
The existing code gives the total count of success/failed in each quarter in the output file, which is correct.
What I want to do is add one more column to the output file, next to the total failed count: a list holding those status codes.
2020-12-21,18,1,1,1,0,[]          <- empty list at the end because there is no failed request; 1 successful request
2020-12-21,18,2,1,0,2,[400,300]   <- here 2 failed requests, 0 successful requests
<DATE>,<HOUR>,<QUARTER>,<REQ_TYPE>,<SUCCESS_COUNT>,<FAIL_CNT>,<LIST_OF_STATUS_CODES>
I have made the changes below to the code, but I am not getting how to iterate inside the same for loop:
grep -v Orig "$input_file" | awk -F, '{
    if ($NF==1) {
        map[$1][$2][$3][$4]["success"]++
    }
    else {
        map[$1][$2][$3][$4]["fail"]++
        harish[$1][$2][$3][$4][$5]++   # ADDED THIS
    }
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc";
    for (i in map) {
        for (j in map[i]) {
            for (k in map[i][j]) {
                for (l in map[i][j][k]) {
                    print i","j","k","l","(map[i][j][k][l]["success"]==""?"0":map[i][j][k][l]["success"])","(map[i][j][k][l]["fail"]==""?"0":map[i][j][k][l]["fail"])
                }
            }
        }
    }
}' >> OUTPUT_FILE.txt

With awk (GNU awk for array sorting):
awk -F, '{ if ($NF==1) { map[$1][$2][$3][$4]["success"]++ } else { map[$1][$2][$3][$4]["fail"]++ } } END { PROCINFO["sorted_in"]="#ind_num_asc";for (i in map) { for (j in map[i]) { for (k in map[i][j]) { for (l in map[i][j][k]) { print i","j","k","l","(map[i][j][k][l]["success"]==""?"0":map[i][j][k][l]["success"])","(map[i][j][k][l]["fail"]==""?"0":map[i][j][k][l]["fail"]) } } } } }' $NGW_REG_FILE
Explanation:
awk -F, '{
    if ($NF==1) {
        map[$1][$2][$3][$4]["success"]++   # If the last field is 1, increment a success index in array map, with the other fields as further indexes
    }
    else {
        map[$1][$2][$3][$4]["fail"]++      # Otherwise increment a fail index
    }
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc";  # Set the array ordering
    for (i in map) {
        for (j in map[i]) {
            for (k in map[i][j]) {
                for (l in map[i][j][k]) {
                    # Loop through the array and print the data in the format required.
                    # If there is no entry in the success or fail index, print 0.
                    print i","j","k","l","(map[i][j][k][l]["success"]==""?"0":map[i][j][k][l]["success"])","(map[i][j][k][l]["fail"]==""?"0":map[i][j][k][l]["fail"])
                }
            }
        }
    }
}' $NGW_REG_FILE
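For the UPDATE (collecting the status codes of failed requests), here is a minimal sketch, assuming GNU awk (needed for the true multidimensional arrays and PROCINFO); the codes array and the bracketed list format are illustrative, not part of the original answer:

awk -F, '{
    if ($NF==1) {
        map[$1][$2][$3][$4]["success"]++
    }
    else {
        map[$1][$2][$3][$4]["fail"]++
        # Append this row's status code ($5) to a comma-separated list for the same key
        codes[$1][$2][$3][$4] = (codes[$1][$2][$3][$4]=="" ? $5 : codes[$1][$2][$3][$4] "," $5)
    }
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"
    for (i in map)
        for (j in map[i])
            for (k in map[i][j])
                for (l in map[i][j][k])
                    # +0 turns a missing count into 0; codes[...] is empty when nothing failed
                    print i","j","k","l","(map[i][j][k][l]["success"]+0)","(map[i][j][k][l]["fail"]+0)",["codes[i][j][k][l]"]"
}' "$NGW_REG_FILE"

This keeps the failed status codes in the order they were seen, duplicates included; if you want each distinct code only once, keep your harish-style counter instead and join its indexes in the END block.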

Related

awk - split column then get average

I'm trying to compute the average based on the values in column 5.
I intend to split the entries at the comma, sum the two numbers, compute the average and assign them to two new columns (ave1 and ave2).
I keep getting the error fatal: division by zero attempted, and I cannot get around it.
The input table was posted as an image (a markdown table failed to format), so it is not reproduced here.
Here's my code:
awk -v FS='\t' -v OFS='\t' '{split($5,a,","); sum=a[1]+a[2]}{ print $0, "ave1="a[1]/sum, "ave2="a[2]/sum}' vcf_table.txt
Never write a[2]/sum, always write (sum ? a[2]/sum : 0) or similar instead to protect from divide-by-zero.
You also aren't taking your header row into account. Try this:
awk '
BEGIN { FS=OFS="\t" }
NR == 1 {
    ave1 = "AVE1"
    ave2 = "AVE2"
}
NR > 1 {
    split($5,a,",")
    sum = a[1] + a[2]
    if ( sum ) {
        ave1 = a[1] / sum
        ave2 = a[2] / sum
    }
    else {
        ave1 = 0
        ave2 = 0
    }
}
{ print $0, ave1, ave2 }
' vcf_table.txt

Swapping / rearranging of columns and its values based on inputs using Unix scripts

Team,
I have a requirement to change/order the columns of CSV files based on inputs.
Example:
The data file (source file) will always have standard columns and their values, for example:
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis
These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) are standard for source files; what I am looking for is a solution that reformats this and generates a new file with the columns we prefer.
Note: the source/data file will always be comma delimited.
Example: if I pass PRODUCTCODE,BATCHID as input, then I would like only those columns and their data extracted from the source file into a new file.
Something like: script_name <output_columns> <source_file_name> <target_file_name>
Target file example:
PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1
If I pass output_columns as "LV1P_DESCRIPTION,PRODUCTCODE", then the output file should look like this:
LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envois,MK3
flatsylio,MK3
bioCovis,MK3
It would be great if anyone could help with this.
I have tried an awk script (found on some site), but it was not working as expected, and since I don't have Unix knowledge I am finding it difficult to modify it.
awk code:
BEGIN {
    FS = ","
}
NR==1 {
    split(c, ca, ",")
    for (i = 1 ; i <= length(ca) ; i++) {
        gsub(/ /, "", ca[i])
        cm[ca[i]] = 1
    }
    for (i = 1 ; i <= NF ; i++) {
        if (cm[$i] == 1) {
            cc[i] = 1
        }
    }
    if (length(cc) == 0) {
        exit 1
    }
}
{
    ci = ""
    for (i = 1 ; i <= NF ; i++) {
        if (cc[i] == 1) {
            if (ci == "") {
                ci = $i
            } else {
                ci = ci "," $i
            }
        }
    }
    print ci
}
The above code is saved as Remove.awk, and it is called from another script as below:
var1="BATCHID,LV2P_DESCRIPTION"   ## these are the input field values used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file
Explanation:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" '  # Pass the fields to print as the variable flds
BEGIN {
    split(flds,map,",")            # Split flds into an array map using , as the delimiter
}
NR==1 {
    for (i=1;i<=NF;i++) {
        map1[$i]=i                 # Loop through the header and create an array map1 with the column header as the index and the column number as the value
    }
}
{
    printf "%s",$map1[map[1]]      # Print the first field specified (first index of map)
    for(i=2;i<=length(map);i++) {
        printf ",%s",$map1[map[i]] # Loop through the other specified fields, printing their contents
    }
    printf "\n"
}' file
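To get the script_name <output_columns> <source_file> <target_file> interface the question asks for, here is a minimal wrapper sketch around the same awk program; the script name reorder_columns.sh is hypothetical:

#!/bin/sh
# Usage: reorder_columns.sh <output_columns> <source_file> <target_file>
# e.g.:  reorder_columns.sh "LV1P_DESCRIPTION,PRODUCTCODE" RESULT.csv test.csv
flds=$1
src=$2
tgt=$3

awk -F, -v flds="$flds" '
BEGIN { split(flds,map,",") }
NR==1 { for (i=1;i<=NF;i++) map1[$i]=i }   # map header names to column numbers
{
    printf "%s", $map1[map[1]]             # first requested column
    for (i=2;i<=length(map);i++)
        printf ",%s", $map1[map[i]]        # remaining requested columns
    printf "\n"
}' "$src" > "$tgt"

Note that length() on an array is a GNU awk extension, consistent with the answer above being a GNU awk solution.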

Stored each of the first 2 blocks of lines in arrays

I've been doing this in Google Sheets, but it takes a long time, so I figured I'd handle it with awk.
input.txt
Column 1
2
2
2
4
4
Column 2
562
564
119
215
12
Range
13455,13457
13161
11409
13285,13277-13269
11409
I've tried this script, which is supposed to rearrange the values:
awk '/Column 1/' RS= input.txt
(as referenced in How can I set the grep after context to be "until the next blank line"?)
But it seems it only takes one matched block. The values should be matched up by their respective lines.
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
It should be something like that: the comma means the values from Column 1 and Column 2 are repeated for each part of the Range, e.g.:
Range:
13455,13457
Result:
562Value2#13455
562Value2#13457
I don't know what sorting has to do with it, but it seems like this is what you're looking for:
$ cat tst.awk
BEGIN { FS=","; recNr=1; print "Result:" }
!NF { ++recNr; lineNr=0; next }    # blank line: the next block starts
{ ++lineNr }
lineNr == 1 { next }               # skip each block's header line
recNr == 1 { a[lineNr] = $0 }      # first block: Column 1 values
recNr == 2 { b[lineNr] = $0 }      # second block: Column 2 values
recNr == 3 {                       # third block: Range values
    for (i=1; i<=NF; i++) {        # one output line per comma-separated part
        print b[lineNr] "Value" a[lineNr] "#" $i
    }
}
$ awk -f tst.awk input.txt
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
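Since the question mentioned RS=, here is an equivalent paragraph-mode sketch; it assumes exactly the three blocks shown, separated by blank lines:

awk -v RS= -F'\n' '
NR == 1 { for (i=2; i<=NF; i++) a[i-1] = $i }   # first block: Column 1 values
NR == 2 { for (i=2; i<=NF; i++) b[i-1] = $i }   # second block: Column 2 values
NR == 3 {
    print "Result:"
    for (i=2; i<=NF; i++) {
        n = split($i, r, ",")                   # a comma repeats the Column 1/2 values
        for (j=1; j<=n; j++)
            print b[i-1] "Value" a[i-1] "#" r[j]
    }
}' input.txt

With RS set to the empty string, each blank-line-separated block is one record, and with FS set to a newline each line is one field, so $1 is the block's header and $2..$NF are its values.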

awk | Add new row or update existing row in a file

I want to update file1 on the basis of file2. If any row is new in file2, then it should be added to file1. If any row from file2 is already in file1, then update that row with the row from file2 if the time is greater in file2.
file1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
file2
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
newfile1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
Notes:
The 2nd field should be unique in the output.
Addition of a new value: for "1111111104", the row in file2 with the newer date (201312051016) is taken over the older one (201312051014), on the basis of the date column (3rd field).
Update of an existing value: "1111111102" is updated with the newer value on the basis of the date in the 3rd column.
file1 is very LARGE, whereas file2 has 5-10 entries only.
The row with 2nd field "1111111101" doesn't need to be updated, because its entry in file1 already has the later date "201312051014" compared to the new date "201312041013" in file2.
I haven't tried much on this because the conditions are complex for me as a beginner; this is how far I got:
BEGIN { FS = OFS = "," }
FNR == NR {
    m = $2
    a[m] = $0
    next
}
{
    if ($2 in a) {
        split(a[$2], datetime, ",")
        if ($3 > datetime[3])
            print $0
        else
            print a[$2] "Old time"
    }
    else print $0 "NOMATCH"
    delete a[$2]
}
Assuming that you can start your awk as follows:
awk -f script.awk input2.csv input1.csv > result.csv
you can use the following script to obtain the desired output:
BEGIN {
    FS = OFS = ","
}
FILENAME == "input2.csv" {
    date[$2] = $3
    data[$2] = $0
    used[$2] = 0
}
FILENAME == "input1.csv" {
    if ($2 in date) {
        used[$2] = 1
        if ($3 < date[$2])
            print data[$2]
        else
            print $0
    } else {
        print $0
    }
}
END {
    for (key in used) {
        if (used[key] == 0)
            print data[key]
    }
}
Notes:
The script takes advantage of the assumption that file2 is smaller than file1, because it uses an array only for the few entries in file2.
The new entries are simply appended to the output. There is no sorting. If this is required there will have to be an extra effort.
EDIT
Heeding @JonathanLeffler's remark about the way I determine which file is being processed, I would like to offer an alternate version that may (or may not :-) ) be a little more straightforward to understand than checking NR==FNR. However, it only works for sufficiently recent versions of awk, which are capable of returning the size of an array as length(array):
BEGIN {
    FS = ","
}
{
    # The following effectively creates an array entry for each filename found
    # (for "known" filenames, existing entries are overwritten).
    files[FILENAME] = 1
    # Check the number of files seen so far.
    if (length(files) == 1) {
        # We are still in the first file.
        date[$2] = $3
        data[$2] = $0
        used[$2] = 0
    } else {
        # We are in the second file (or any other following file).
        if ($2 in date) {
            used[$2] = 1
            if ($3 < date[$2])
                print data[$2]
            else
                print $0
        } else {
            print $0
        }
    }
}
END {
    for (key in used) {
        if (used[key] == 0)
            print data[key]
    }
}
Also, if you require your output to be sorted according to the second field, you can replace the call to awk with this:
awk -f script.awk input2.csv input1.csv | sort -t "," -n -k 2 > result.csv
The latter, of course, works for both versions of the script.
Since file1 is very large but file2 is very small (5-10 entries), you need to read all of file2 into memory first, dealing with the duplicate values. As a result, you'll have an array indexed by the key (the 2nd field) containing the new data; you should also record the date for each key in a separate array. Then, as you read the main file, you look up the key and the date in the arrays, and if you need to, substitute the saved new record for the incoming old record.
Your outline script is most of the way there; the main thing it misses is saving the dates as they come in. This more or less works:
awk -F, '
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 } next }
{
    if ($2 in date) {
        if (date[$2] > $3)
            print line[$2]
        else
            print
        delete line[$2]
        delete date[$2]
    }
    else
        print
}
END { for (l in line) print line[l] }' file2 file1
Sample output for given data:
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
However, if there were 4 new records, there's no guarantee that they'd be in sorted order, though they would all be at the end of the list. It would be possible to upgrade the script to print the new records at the appropriate place in the list if the input is guaranteed to be in sorted order. You simply have to search through the list of lines to see whether there are any lines that should be printed before the current line, and if so, do so (and delete the record so that they are not printed at the end).
Note that uniqueness in the output depends on uniqueness in the input (file1). That is, if field 2 in the input is repeated, this code won't notice. There is also nothing that can be done with the current design even if a duplicate was spotted; the old row has been printed so printing the new row will simply cause the duplicate. If you were worried about this, you could design the awk script to keep the whole of file1 in memory and only print anything when the whole of the input has been processed. Needless to say, this uses a lot more memory than the current design, and will generally be less efficient because of that. Nevertheless, it could be done if needed.
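A sketch of that keep-the-whole-of-file1-in-memory variant, under the same assumptions; the order, old, and olddate array names are illustrative:

awk -F, '
FNR == NR {                          # file2: remember the newest row per key
    if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 }
    next
}
{                                    # file1: buffer every row, keeping the newest per key
    if (!($2 in olddate)) order[++n] = $2
    if (!($2 in olddate) || olddate[$2] < $3) { olddate[$2] = $3; old[$2] = $0 }
}
END {
    for (i = 1; i <= n; i++) {
        k = order[i]
        if (k in date && date[k] > olddate[k])
            print line[k]            # the file2 row is newer: use it
        else
            print old[k]             # keep the (deduplicated) file1 row
        delete line[k]
    }
    for (k in line) print line[k]    # brand-new keys from file2, appended unsorted
}' file2 file1

This trades memory for duplicate handling, so it only makes sense while file1 still fits in memory.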

How can I delete one line before and two after a string?

Say you have records in a text file which look like this:
header
data1
data2
data3
I would like to delete the whole record if data1 is a given string. I presume this needs awk, which I do not know.
Awk can handle these multiline records by setting the record separator to the empty string:
BEGIN { RS = ""; ORS = "\n\n" }
$2 == "some string" { next } # skip this record
{ print } # print (non-skipped) record
You can save this in a file (e.g. remove.awk) and execute it with awk -f remove.awk data.txt > newdata.txt
This assumes your data is in the format:
header
data
...

header
data
...
If there are no blank lines between the records, you need to split the records manually (this is with 4 lines per record):
{ a[++i] = $0 }                                          # buffer the current record
i == 2 && a[i] == "some string" { skip = 1 }             # mark the record if data1 matches
i == 4 && ! skip { for (j = 1; j <= 4; j++) print a[j] } # print the record unless marked
i == 4 { skip = 0; i = 0 }                               # reset for the next record
Without knowing what output you desire, and with insufficient sample input: this prints every blank-line-separated record that does not contain data1:
awk 'BEGIN{RS=""} !/data1/' file