AWK Merge line accumulator not working

Below is a single record split over 2 lines by an embedded newline after field 3 (the line break between them is the embedded newline):
peter,9,ghi
mno
The algorithm: if a record has fewer than 4 fields, keep merging subsequent lines into it until it has 4 fields, then output the record.
I have awk code that is supposed to do this. There are two cases.
CASE 1
> If the number of fields in the current line plus the accumulated
> number of fields equals 4, then: if there were no previous fields,
> print the current line; otherwise print the previously accumulated
> line plus the current line.
CASE 2
> Append the current line to the accumulated previous lines.
BEGIN {
    FS = ","
    flds = 4
    prevF = 0
}
flds == NF + prevF {
    print (prevF == 0) ? $0 : prevLine $0
    prevF = 0
    prevLine = ""
    next
}
{
    prevLine = (prevF == 0) ? $0 FS : prevLine $0
    prevF = prevF + NF
}
Simple enough algorithm. But when I run it against the data snippet I get
,mnohi,9,ghi
instead of the second line tacked onto the end of the first.
I am interested in understanding why the code behaves the way it does, and in awk solutions.
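For reference, here is a minimal sketch of the merge-until-4-fields logic described above, written as a single accumulate-then-test rule. It assumes comma-separated input with Unix line endings and that every record eventually reaches exactly 4 fields; it illustrates the algorithm rather than diagnosing the output above.
awk -F',' '
NF == 0 { next }                         # skip empty lines
{
    rec = (rec == "") ? $0 : rec FS $0   # join chunks with the field separator
    n += NF
    if (n == 4) {                        # record complete: print it and reset
        print rec
        rec = ""
        n = 0
    }
}' file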

Related

Bash one-liner for calculating the average of a specific row of numbers

I just began learning bash.
I am trying to figure out how to convert a two-liner into a one-liner using bash.
The first line of code:
searches the first column of input.txt for the word KEYWORD,
captures every number in this KEYWORD row from column 2 until the last column, and
dumps all these numbers into the values.txt file, placing each number on a new line.
The second line of code calculates the average of all the numbers in the first column of values.txt, then prints this value.
awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt
How do I create a one-liner from this?
Something like
awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
    for (n = 2; n <= NF; n++) sum += $n
    count += NF - 1
}
END { print sum/count }' input.txt
Just keep track of the sum and total count of numbers in the first (and only) pass through the file and then average them at the end instead of printing a value for each matching line.
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.
awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt
This code searches the input file for the row containing "KEYWORD".
It then sums all the fields from the 3rd column to the last column,
and prints the average (i.e. the mean) of those numbers.

How to delete lines after first checking the next 3 lines

I have a text file similar to this
00:00:24.752
8,594
3,847
0
00:00:25.228
0
1,692
0
00:00:25.738
6,548
5,304
0
00:00:26.248
1,807
417
0
00:00:26.758
3,913
5,335
0
00:00:26.792
0
00:00:27.234
0
00:00:27.268
0
0
0
00:00:27.778
9,903
2,345
0
00:00:27.812
0
00:00:28.322
0
9,501
0
This is network traffic: the first part is a timestamp, while the next two values are sent and received traffic. The third is a zero, and I do not know why it is there.
My goal is to keep only the entries that have at least one sent/received traffic value, and also to delete the third 0 every time, so that I get a result like this:
00:00:24.752
8,594
3,847
00:00:25.228
0
1,692
00:00:25.738
6,548
5,304
00:00:26.248
1,807
417
00:00:26.758
3,913
5,335
00:00:27.778
9,903
2,345
00:00:28.322
0
9,501
I have tried using awk to check the length of the current line: if the line is less than 8 characters, print that line and the next 2. But since the file does not always have at least 2 values after a timestamp, that does not work properly.
awk '
/[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}/ {
    if (NR > 1) p()
    i = 0
}
{ buf[++i] = $0 }
END { p() }
function p() {
    if (buf[2] || buf[3]) {
        print buf[1]
        print buf[2]
        print buf[3]
    }
    delete buf
}' file
p is a function that prints the buffered lines if the 2nd and 3rd of them are not empty or zero, then clears the buffer. It is called whenever a timestamp is seen (and it is not the first line of the file) and when EOF is hit. So the script above buffers the lines between two timestamps and prints them only if they meet the criteria: there must be at least two lines after the timestamp, and they must not both be zero.
This might work for you (GNU sed):
sed '/:/!{H;$!d};x;/\n.*\n.*\n/{/\n0\n0\n0/!s/\n0$//p};x;h;d' file
If the current line is not a time stamp (does not contain a :), append it to the hold space and, if it is not the last line, delete it.
If the current line is either the last line or a time stamp, swap to the hold space and check that the previous record contains 4 lines and that the last 3 lines are not all zero; if so, remove the last line of the record and print the amended record.
Swap back to the pattern space, replace the hold space with the current line (the time stamp), and delete it.
N.B. When a line is deleted, no further sed processing takes place for the current line.
If you want to omit every 4th line, this awk script achieves it:
awk 'NR % 4 {print}' input.txt
This produces your desired output.

How to append last column of every other row with the last column of the subsequent row

I'd like to append every other row (the odd-numbered rows) with the last column of the subsequent (even-numbered) row. I've tried several different commands, but none of them do what I'm after.
Raw data:
user|396012_232|program|30720Mn|
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|
|391983_233.batch|batch|30720Mn|5050424K
I'd like to take the last field of each "batch" line and append it to the line above it.
Desired output:
user|396012_232|program|30720Mn|5108656K
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|36426336K
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|5050424K
|391983_233.batch|batch|30720Mn|5050424K
The "batch" lines would then be discarded from the output, so in those lines there is no preference if the line is cut or copied or changed in any way.
Where I got stumped, my attempts to finish the logic were embarrassingly illogical:
awk 'BEGIN{OFS="|"} {FS="|"} {if ($3=="batch") {a=$5} else {} ' file.data
Thanks!
If you do not need to keep the lines with batch in Field 3, you may use
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; $3=="batch" { print prev $5 }' file.data
or
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; NR%2==0 { print prev $5 }' file.data
Details
BEGIN{OFS=FS="|"} - sets the input and output field separators to a pipe
NR%2==1 { prev=$0 }; - saves the odd lines in the prev variable
$3=="batch" - checks whether Field 3 equals batch (with this logic, you may replace it with NR%2==0 to match the even lines)
{ print prev $5 } - prints the previous line followed by Field 5.
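If you do want to keep the batch lines, as in the desired output shown above, a small variant of the same approach (a sketch, not tested beyond the sample data) also prints the current line:
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; NR%2==0 { print prev $5; print }' file.data
Because each odd line already ends with |, concatenating prev $5 drops the appended value straight into the empty last column.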
You may also consider a sed option:
sed 'N;s/\x0A.*|\([^|]*\)$/\1/' file.data > newfile
Details
N; - appends a newline and then the next line of input to the pattern space; if there is no more input, sed exits without processing any more commands
s/\x0A.*|\([^|]*\)$/\1/ - replaces the following with the contents of Group 1:
\x0A - a newline
.*| - any 0+ chars up to the last | and
\([^|]*\) - (Capturing group 1): any 0+ chars other than |
$ - end of line
If your data is in the file 'd', try GNU awk:
awk 'BEGIN{FS="|"}
{
    if (getline n) {               # read the following line into n
        if (n ~ /batch/) {         # only act when it is a batch line
            b = split(n, a, "|")   # b is the number of fields in n
            print $0 a[b] "\n" n   # current line plus last field, then the batch line
        }
    }
}' d

Print every line from a large file where the previous N lines meet specific criteria

I'd like to print every line from a large file where the previous 10 lines have a specific value in a specific column (in the example below, a value < 1 in column 9). I don't want to store the whole file in memory. I am trying to use awk for this purpose as follows:
awk 'BEGIN{FS=","}
{
for (i=FNR,i<FNR+10, i++) saved[++s] = $0 ; next
for (i=1,i<s, i++)
if ($9<1)
print saved[s]; delete saved; s=0
}' file.csv
The goal of this command is to save the 10 previous lines, then check that column 9 in each of those lines meets my criteria, then print the current line. Any help with this, or a suggestion for a more efficient way to do it, is much appreciated!
No need to store anything in memory or do any explicit looping on values. To print the current line if the last 10 lines (inclusive) had a $9 value < 1 is just:
awk -F, '(c=($9<1?c+1:0))>9' file
Untested, of course, since you didn't provide any sample input or expected output, so check the math; but that is the right approach, and if the math is off, the tweak to fix it is just to change >9 to >10 or whatever you need.
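To make the counter logic explicit, here is the same idea written out (a sketch, equally untested). If the current line itself should not have to satisfy the condition, this variant tests the counter before updating it:
awk -F, '
{
    if (c >= 10) print           # the 10 lines before this one all had $9 < 1
    c = ($9 < 1) ? c + 1 : 0     # extend or reset the streak of qualifying lines
}' file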
Here is a solution for GNU Awk:
chk_prev_lines.awk
BEGIN {
    FS = ","
    CMP_LINE_NR = 10
    CMP_VAL = 1
}
FNR > CMP_LINE_NR {
    ok = 1
    # check the stored values
    for (i = 0; i < CMP_LINE_NR; i++) {
        if (!(prev_Field9[i] < CMP_VAL)) {
            ok = 0
            break # early exit
        }
    }
    if (ok) print
}
{ # store $9 for the comparison
    prev_Field9[FNR % CMP_LINE_NR] = $9
}
Use it like this: awk -f chk_prev_lines.awk your_file.
Explanation
CMP_LINE_NR determines how many values from previous lines are stored.
CMP_VAL is the value used for the comparison.
The condition FNR > CMP_LINE_NR ensures that the first line whose previous lines are checked is line CMP_LINE_NR + 1, the first line that has that many previous lines.
The last action stores the value of $9; it is executed for every line.

Print lines containing the same second field more than 3 times in a text file

Here is what I am doing.
The text file is comma-separated and has three fields,
and I want to extract all the lines whose second field
occurs more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is below: it cats the whole text file inside awk, greps for the second field of each line, and counts the matching lines.
If the count is greater than 2, it prints the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command outputs only the first three consecutive lines,
instead of also printing the last three lines containing "keyword1", which occurs in the text six times.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
It is relatively straightforward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print the rows whose column-2 value occurs more than your threshold of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
         FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
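One such harder approach (a sketch; it works on a pipe but gives up the low-memory advantage by buffering every line) stores the lines as they stream in and does the second pass in the END block:
cat text | awk -F, '
{
    line[NR] = $0    # buffer every line in memory
    count[$2]++      # count occurrences of each column-2 value
}
END {
    for (i = 1; i <= NR; i++) {
        split(line[i], f, FS)    # recover column 2 of the buffered line
        if (count[f[2]] > 3) print line[i]
    }
}'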