I currently have an awk method that checks whether the output of an expression contains more than one line. If it does, it aggregates and prints the sum. For example:
someexpression=$'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'
might be the one-liner where it DOESN'T yield any information. Then,
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
printf "%d\n", a[i]
}
}'
this will yield NULL or an empty return. Instead, I would like to have it return a numeric value of 0 if empty. How can I modify the above to do this?
Nothing in UNIX "returns" anything (despite the unfortunately named keyword for setting the exit status of a function). Everything (tools, functions, scripts) outputs X and exits with status Y.
Consider these 2 identical functions named foo(), one in C and one in shell:
C (x=foo() means set x to the return code of foo()):
foo() {
printf "7\n"; // this is outputting 7 from the full program
return 3; // this is returning 3 from this function
}
x=foo(); <- 7 is output on screen and x has value '3'
shell (x=$(foo) means set x to the output of foo):
foo() {
printf "7\n"; # this is outputting 7 from just this function
return 3; # this is setting this function's exit status to 3
}
x=$(foo) <- nothing is output on screen, x has value '7', and '$?' has value '3'
Note that what the return statement does is vastly different in each. Within an awk script, printing and return codes from functions behave the same as they do in C but in terms of a call to the awk tool, externally it behaves the same as every other UNIX tool and shell script and produces output and sets an exit status.
So when discussing anything in UNIX avoid using the term "return" as it's imprecise and ambiguous and so different people will think you mean "output" while others think you mean "exit status".
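As a minimal sketch of the difference between output and exit status, the shell version above can be run like this. The variable name `status` is just illustrative, and the `|| status=$?` form is only there so the snippet also behaves under `set -e`:

```shell
#!/bin/sh
# foo writes "7" to standard output and sets its exit status to 3.
foo() {
    printf '7\n'
    return 3
}

status=0
# Command substitution captures the OUTPUT of foo in x; the exit STATUS
# of foo is only visible through $?, which is saved here when non-zero.
x=$(foo) || status=$?

printf 'output=%s status=%s\n' "$x" "$status"
```

Run as-is, this should print `output=7 status=3`: the two values travel through separate channels.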
In this case I assume you mean "output" BUT you should instead consider setting a non-zero exit status when there's no match like grep does, e.g.:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
for (i in a) {
print a[i]
}
exit (NR < 2)
}'
and then your code that uses the above can test for the success/fail exit status rather than testing for a specific output value, just like if you were doing the equivalent with grep.
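A rough sketch of what the calling code could look like, assuming the awk script is wrapped in a shell function (the name `count_users` and the messages are made up for illustration):

```shell
#!/bin/sh
# count_users is a hypothetical wrapper: it prints one count per user
# (field 4) and exits non-zero when the input has no data rows.
count_users() {
    printf '%s\n' "$1" | awk '
        NR > 1 { a[$4]++ }
        END {
            for (i in a) print a[i]
            exit (NR < 2)
        }'
}

header='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)'

# The if tests the exit status; the output is still captured in counts.
if counts=$(count_users "$header"); then
    printf 'per-user counts:\n%s\n' "$counts"
else
    printf 'no jobs listed\n'
fi
```

With only the header line as input, the `else` branch runs, which is the same pattern you would use around a `grep` call.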
You can of course tweak the above to:
echo "$someexpression" | awk '
NR>1 {a[$4]++}
END {
if ( NR > 1 ) {
for (i in a) {
print a[i]
}
}
else {
print 0
exit 1
}
}'
if necessary and then you have both a specific output value and a success/fail exit status.
You can set a flag inside the for loop to detect whether the loop body has executed:
echo "$someexpression" |
awk 'NR>1 {
a[$4]++
}
END {
for (i in a) {
p = 1
printf "%d\n", a[i]
}
if (!p)
print 0
}'
0
I've figured out how to get the average of a file that contains numbers in all lines such as:
Numbers.txt
1
2
4
8
Output:
Average: 3.75
This is the code I use for that:
awk '{ sum += $1; tot++ } END { print sum / tot; }' Numbers.txt
However, the problem is that this doesn't take into account possible strings that might be in the file. For example, a file that looks like this:
NumbersAndExtras.txt
1
2
4
8
Hello
4
5
6
Cat
Dog
2
4
3
For such a file I'd want to print the multiple averages of the consecutive numbers, ignoring the strings such that the result looks something like this:
Output:
Average: 3.75
Average: 5
Average: 3
I could devise some complicated code that might accomplish that with variables, 'if' statements, loops and whatnot, but I've been told it's easier than that given some of awk's features. I'd like to know what that might look like, along with an explanation of why it works.
BEGIN runs before reading the first line from file. Set sum and count to 0.
awk 'BEGIN{ sum=0; count=0} {if ( /[a-zA-Z]/ ) { if (count > 0) {avg = sum/count; print "Average: " avg;} count=0; sum=0} else { count++; sum += $1} } END{if (count > 0) {avg = sum/count; print "Average: " avg}} ' NumbersAndExtras.txt
When there is a letter on the line, calculate and print the average so far, then reset sum and count.
And do the same in the END block that runs after processing the whole file.
Keep it simple:
awk '/^$/{next}
/^[0-9]+/{a+=$1+0;c++;next}
c{print "Average: "a/c;a=c=0}
END{if(c){print "Average: "a/c}}' input_file
Results:
Average: 3.75
Average: 5
Average: 3
Another one:
$ awk '
function avg(s, c) { print "Average:", s/c }
NF && !/^[[:digit:]]/ { if (count) avg(sum, count); sum = 0; count = 0; next}
NF { sum += $1; count++ }
END {if (count) avg(sum, count)}
' <file
Note: The value of this answer is in explaining the solution; other answers offer more concise alternatives.
Try the following:
Note that this is an awk command with a script specified as a multi-line shell string literal - you can paste the whole thing into your terminal to try it; while it is possible to cram this into a single line, it hurts readability and the ability to comment:
awk '
# Define output function that prints an average.
function printAvg() { print "Average:", sum/count }
# Skip blank lines
NF == 0 { next}
# Is the line non-numeric?
/[[:alpha:]]/ {
# If this line ends a numeric block, print its
# average now and reset the variables to start the next group.
if (count) {
printAvg()
sum = count = 0
}
# Skip to next line.
next
}
# Numeric line: add to the sum and increment the counter.
{ sum += $1; count++ }
# Finally:
END {
# If there is a group whose average has not been printed yet,
# do it now.
if (count) printAvg()
}
' NumbersAndExtras.txt
If we condense whitespace and strip the comments, we still get a reasonably readable solution, as long as we still use multiple lines:
awk '
function printAvg() { print "Average:", sum/count }
NF == 0 { next}
/[[:alpha:]]/ { if (count) { printAvg(); sum = count = 0 } next }
{ sum += $1; count++ }
END { if (count) printAvg() }
' NumbersAndExtras.txt
I want to update file1 on the basis of file2. If any row is new in file2 then it should be added in file1. If any row from file2 is already in file1, then update that row with the row from file2 if the time is greater in file2.
file1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
file2
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
newfile1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
Notes:
2nd field should be unique in the output.
Addition of a new value: the latest row for 2nd field "1111111104" in file2 is taken, since it is newer (201312051016) than the older one (201312051014) on the basis of the date column (3rd field).
Update an existing value: updated "1111111102" with newer value on the basis of date in 3rd column
file1 is very LARGE whereas file2 has 5-10 entries only.
The row with 2nd field "1111111101" doesn't need to be updated because its entry in file1 already has the later date "201312051014" compared to the new date "201312041013" in file2.
I haven't tried much on this because the conditions are really complex for me as a beginner.
BEGIN { FS = OFS = "," }
FNR == NR {
m=$2;
a[m] = $0;
next
}
{
if($2 in a)
{
split(a[$2],datetime,",")
if($3>datetime[3])
print $0;
else
print a[$2]"Old time"
}
else print $0"NOMATCH";
delete a[$2];
}
Assuming that you can start your awk as follows:
awk -f script.awk input2.csv input1.csv > result.csv
you can use the following script to obtain the desired output:
BEGIN {
FS = OFS = ","
}
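FILENAME == "input2.csv" {
# keep only the newest row for each key, in case input2.csv repeats a key
if (!($2 in date) || $3 > date[$2]) {
date[$2] = $3
data[$2] = $0
}
used[$2] = 0
}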
FILENAME == "input1.csv" {
if ($2 in date) {
used[$2] = 1
if ($3 < date[$2])
print data[$2]
else
print $0
} else {
print $0
}
}
END {
for (key in used) {
if (used[key] == 0)
print data[key]
}
}
Notes:
The script takes advantage of the assumption that file2 is smaller than file1 because it uses an array only for the few entries in file2.
The new entries are simply appended to the output. There is no sorting. If sorted output is required, extra work is needed.
EDIT
Heeding @JonathanLeffler's remark about the way I determine which file is being processed, I would like to offer an alternate version that may (or may not :-) ) be a little more straightforward to understand than checking NR == FNR. However, it only works for sufficiently recent versions of awk which are capable of returning the size of an array as length(array):
BEGIN {
FS = ","
}
{
# The following effectively creates an array entry for each filename found (for "known" filenames existing entries are overwritten).
files[FILENAME] = 1
# check the number of files we have so far
if (length(files) == 1) {
# we are still in the first file
# keep only the newest row for each key
if (!($2 in date) || $3 > date[$2]) {
date[$2] = $3
data[$2] = $0
}
used[$2] = 0
} else {
# we are in the second file (or any other following file)
if ($2 in date) {
used[$2] = 1
if ($3 < date[$2])
print data[$2]
else
print $0
} else {
print $0
}
}
}
END {
for (key in used) {
if (used[key] == 0)
print data[key]
}
}
Also, if you require your output to be sorted according to the second field, you can replace the call to awk with this:
awk -f script.awk input2.csv input1.csv | sort -t "," -n -k 2 > result.csv
The latter, of course, works for both versions of the script.
Since file1 is very large but file2 is very small (5-10 entries), you need to read all of file2 into memory first, dealing with the duplicate values. As a result, you'll have an array indexed by the record number (field 2) with the new data; you should also have a record of the date for each record in a separate array. Then, as you read the main file, you look up the record number and the date in the arrays, and if you need to, substitute the saved new record for the incoming old record.
Your outline script is most of the way there. It is more complex because you didn't save the dates coming in. This more or less works:
awk -F, '
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0; } next; }
{ if ($2 in date)
{
if (date[$2] > $3)
print line[$2]
else
print
delete line[$2]
delete date[$2]
}
else
print
}
END { for (l in line) print line[l]; }' file2 file1
Sample output for given data:
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
However, if there were 4 new records, there's no guarantee that they'd be in sorted order, though they would all be at the end of the list. It would be possible to upgrade the script to print the new records at the appropriate place in the list if the input is guaranteed to be in sorted order. You simply have to search through the list of lines to see whether there are any lines that should be printed before the current line, and if so, do so (and delete the record so that they are not printed at the end).
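For what it's worth, here is one way that upgrade might look. It is only a sketch: it assumes file1 is already sorted in ascending numeric order on field 2, that file2 is non-empty, and the names `merge_sorted` and `flush` are invented for this example.

```shell
#!/bin/sh
# merge_sorted (hypothetical name); usage: merge_sorted file2 file1
merge_sorted() {
    awk -F, '
    # Print, in ascending key order, every pending new record whose key
    # is below "limit"; an empty limit flushes everything that is left.
    function flush(limit,    k, min, found) {
        while (1) {
            found = 0
            for (k in line)
                if ((limit == "" || k + 0 < limit + 0) && (!found || k + 0 < min + 0)) {
                    min = k; found = 1
                }
            if (!found) return
            print line[min]
            delete line[min]; delete date[min]
        }
    }
    # First file (file2): remember the newest row for each key.
    FNR == NR {
        if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 }
        next
    }
    # Second file (file1): emit pending smaller keys first, then this row
    # or its newer replacement.
    {
        flush($2)
        if ($2 in date) {
            print (date[$2] > $3 ? line[$2] : $0)
            delete line[$2]; delete date[$2]
        } else
            print
    }
    END { flush("") }' "$1" "$2"
}
```

Because file2 only has a handful of entries, rescanning the pending keys for every line of file1 should stay cheap. Any pending key smaller than the current file1 key must be new, since a matching key would already have been deleted when its file1 row went past.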
Note that uniqueness in the output depends on uniqueness in the input (file1). That is, if field 2 in the input is repeated, this code won't notice. There is also nothing that can be done with the current design even if a duplicate was spotted; the old row has been printed so printing the new row will simply cause the duplicate. If you were worried about this, you could design the awk script to keep the whole of file1 in memory and only print anything when the whole of the input has been processed. Needless to say, this uses a lot more memory than the current design, and will generally be less efficient because of that. Nevertheless, it could be done if needed.
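If you did want to go that way, a minimal sketch of the keep-everything-in-memory variant could look like the following. The function name `merge_in_memory` is made up, the final `sort` is my addition to get a stable order, and on equal dates the row seen first wins, which is an arbitrary choice.

```shell
#!/bin/sh
# merge_in_memory (hypothetical name); usage: merge_in_memory file1 file2
merge_in_memory() {
    awk -F, '
    # Keep only the newest row seen so far for each key (field 2),
    # regardless of which file it came from.
    !($2 in date) || $3 > date[$2] { date[$2] = $3; line[$2] = $0 }
    # Nothing is printed until all input has been read.
    END { for (k in line) print line[k] }' "$@" | sort -t, -k2,2n
}
```

Here the order of the file arguments no longer matters, and duplicate keys inside file1 collapse to their newest row, at the cost of holding one row per distinct key in memory.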