Print lines whose second field occurs more than 3 times in a text file - awk

Here is what I am doing.
The text file is comma-separated and has three fields,
and I want to extract all the lines whose second field occurs
more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is below: it cats the whole text file inside awk, greps for the second field of each line, and counts the matching lines.
If the number of matching lines is greater than 2, it prints the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command output contains only the first three consecutive lines,
instead of also printing the last three lines containing "keyword1", which occurs six times in the text.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?

It is relatively straightforward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print out the rows where the value in column 2 occurs more than your threshold value of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
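One single-pass sketch that does work on a pipe is to buffer every line in memory and defer all printing to the END block (fine as long as the input fits in memory):
awk -F, '{ count[$2]++; line[NR] = $0; key[NR] = $2 }
END { for (i = 1; i <= NR; i++) if (count[key[i]] > 3) print line[i] }'
As for what the original command is doing wrong: awk caches command pipes by their literal command string and keeps them open until close() is called. The first keyword1 line runs the pipeline and getline reads 6; on every later keyword1 line the same pipe is already at end-of-file, getline returns 0, and var silently keeps whatever it was last assigned (2, after the keyword2 lines), so the trailing keyword1 lines are skipped. A minimal repair in the spirit of the original, closing the pipe after each read (it still spawns one grep per input line, so it is far slower than the two-pass version):
awk -F, '{ cmd = "grep -c " $2 " text"; cmd | getline var; close(cmd); if (var > 2) print }' text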

Related

Bash one-liner for calculating the average of a specific row of numbers

I just began learning bash.
I am trying to figure out how to convert a two-liner into a one-liner using bash.
The first line of code:
searches the first column of input.txt for the word KEYWORD;
captures every number in this KEYWORD row from column 2 until the last column;
dumps all these numbers into the values.txt file, placing each number on a new line.
The second line of code calculates the average of all the numbers in the first column of values.txt, then prints out this value.
awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt
How do I create a one-liner from this?
Something like
awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
for (n = 2; n <= NF; n++) sum += $n
count += NF - 1
}
END { print sum/count }' input.txt
Just keep track of the sum and total count of numbers in the first (and only) pass through the file and then average them at the end instead of printing a value for each matching line.
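If you literally want one line, the same logic collapses down; the BEGIN block can even be dropped, since awk variables start out as 0:
awk '$1=="KEYWORD"{for(n=2;n<=NF;n++)sum+=$n;count+=NF-1} END{print sum/count}' input.txt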
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.
awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt
This code searches the input file for the row containing "KEYWORD",
then sums up all the fields from the 3rd column to the last column,
and prints out the average value of all those numbers (i.e. the mean).

How to append last column of every other row with the last column of the subsequent row

I'd like to append every other row (the odd-numbered rows) with the last column of the subsequent row (the even-numbered rows). I've tried several different commands, but none accomplish the task I'm trying to achieve.
Raw data:
user|396012_232|program|30720Mn|
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|
|391983_233.batch|batch|30720Mn|5050424K
I'd like to take the last field of each "batch" line and append it to the line above.
Desired output:
user|396012_232|program|30720Mn|5108656K
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|36426336K
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|5050424K
|391983_233.batch|batch|30720Mn|5050424K
The "batch" lines would then be discarded from the output, so in those lines there is no preference if the line is cut or copied or changed in any way.
Here is where I got stumped; my attempts to finish the logic were embarrassingly illogical:
awk 'BEGIN{OFS="|"} {FS="|"} {if ($3=="batch") {a=$5} else {} ' file.data
Thanks!
If you do not need to keep the lines with batch in Field 3, you may use
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; $3=="batch" { print prev $5 }' file.data
or
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; NR%2==0 { print prev $5 }' file.data
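On the sample data above, either command produces the three merged lines:
user|396012_232|program|30720Mn|5108656K
user|398498_2|program|102400Mn|36426336K
user|391983_233|program|30720Mn|5050424K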
Details
BEGIN{OFS=FS="|"} - sets both the input and output field separators to a pipe
NR%2==1 { prev=$0 }; - saves each odd-numbered line in the prev variable
$3=="batch" - checks whether Field 3 equals batch (with this logic you can probably replace it with NR%2==0 to match the even lines)
{ print prev $5 } - prints the previous line followed by Field 5; since the odd lines already end with |, plain concatenation yields the desired row.
You may also consider a sed option:
sed 'N;s/\x0A.*|\([^|]*\)$/\1/' file.data > newfile
Details
N; - adds a newline to the pattern space, then appends the next line of input to the pattern space; if there is no more input, sed exits without processing any more commands
s/\x0A.*|\([^|]*\)$/\1/ - replaces the following with the Group 1 contents:
\x0A - a newline
.*| - any 0+ chars up to the last | and
\([^|]*\) - (Capturing group 1): any 0+ chars other than |
$ - end of line
If your data is in a file named 'd', try GNU awk:
awk 'BEGIN{FS="|"} {if(getline n) {if(n~/batch/){b=split(n,a,"|");print $0 a[b]"\n"n} } }' d
This reads the following line into n with getline; if n contains batch, it splits n on | and prints the current line with the last piece of n appended, followed by n itself.

Awk: Append output to new field in existing file

Is there a way to print the output of an awk script to an existing file as a new field every time?
Hi!
I'm very new to awk (so my terminology might not be correct, sorry about that!) and I'm trying to print the output of a script that will operate on several hundred files into the same file, in different fields.
For example, my data files have this structure:
#File1
1
Values, 2, Hanna
20
15
Values, 2, Josh
30
56
Values, 2, Anna
50
70
#File2
2
Values, 2, Hanna
45
60
Values, 2, Josh
98
63
Values, 2, Anna
10
56
I have several of these files, split by numbered month, with the same names but different values. I want files named after each person, with the values in fields by month, like so:
#Hanna
20 45
15 60
#Josh
30 98
56 63
#Anna
50 10
70 56
In my script, I search for the word "Values" and determine which records to print (based on the number after "Values"). This works fine. Then I want to print these values. It works fine for one file, with the command:
print $0 > name  # the variable name I have saved as $3 of the correct row
This creates three files correctly named "Hanna", "Josh" and "Anna", with their values. However, I would like to run the script over all my data files and append to only one "Hanna" file etc., in a new field.
So what I'm looking for is something like print $0 > $month name, reading as "print the record to the field corresponding to the month".
I have tried to find a solution, but most solutions either paste temporary files together or append the values after the existing ones (so that they all end up in field 1). I want to avoid the temporary files and keep the months in different fields (so that I get a kind of matrix structure).
Thank you in advance!
Try the following, though I have not checked all permutations and combinations and have only considered your post. Also, your output for the Josh column is not consistent (please let us know if there are more conditions for that case). Let me know how it goes.
awk 'FNR==NR{if($0 ~ /^Values/){Q=$NF;B[$NF]=$NF;i="";next};A[Q,++i]=$0;next} /^Values/{V=$NF;print "#"B[V];i="";next} B[V]{print A[V,++i],$0}' file1 file2
EDIT: Adding a non-one-liner form of the solution too.
awk 'FNR==NR{
if($0 ~ /^Values/){
Q=$NF;
B[$NF]=$NF;
i="";
next
};
A[Q,++i]=$0;
next
}
/^Values/{
V=$NF;
print "#"B[V];
i="";
next
}
B[V]{
print A[V,++i],$0
}
' file1 file2
EDIT2: Now adding an explanation as well.
awk 'FNR==NR{ ###Checking condition FNR==NR, which is TRUE only while the first file named file1 is being read. FNR and NR both indicate the number of lines read from an Input_file; the only difference between them is that FNR is RESET whenever the next Input_file starts being read, while NR keeps increasing until all the Input_files are read.
if($0 ~ /^Values/){ ###Checking here if a line starts with the string Values; if yes, perform the following operations.
Q=$NF; ###Creating a variable named Q whose value is the last field of the line.
B[$NF]=$NF; ###Creating an array named B whose index is $NF (the last field of the line) and whose value is the same.
i=""; ###Setting the value of variable i to NULL now.
next ###next is a built-in awk keyword that skips all further statements for this line.
};
A[Q,++i]=$0; ###Storing the current line in array A, indexed by Q together with the counter i, which is incremented by 1 each time this statement is reached.
next ###Using next skips all further statements now.
}
/^Values/{ ###All statements from here on are executed while the second file named file2 is being read. Checking here if a line starts with the string Values; if so, do the following.
V=$NF; ###Creating a variable V whose value is $NF of the current line.
print "#"B[V]; ###Printing the string # followed by the value of array B at index V.
i=""; ###Nullifying the value of variable i here.
next ###next skips all further statements now.
}
B[V]{ ###Checking here whether array B has a value at index V; if so, perform the following.
print A[V,++i],$0 ###Printing the value of array A at index V and the incremented counter i, followed by the current line.
}
' file1 file2 ###Mentioning the Input_files here, named file1 and file2.
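For the several-hundred-file case, here is a rough sketch of the same idea generalized to any number of input files: buffer the whole matrix in memory and write one file per person in the END block. It takes the #Hanna etc. labels in the desired output to be file names, and assumes the value rows of each block appear in the same order in every file:
awk '
FNR == 1  { name = ""; next }           # skip the leading month number in each file
/^Values/ { name = $NF; r = 0; next }   # block header: remember the person, reset the row counter
name {                                  # a value line inside a block
    ++r
    m[name, r] = (m[name, r] == "" ? $0 : m[name, r] " " $0)
    if (r > rows[name]) rows[name] = r
}
END {                                   # one output file per person, one column per input month
    for (p in rows)
        for (r = 1; r <= rows[p]; r++)
            print m[p, r] > p
}' file1 file2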

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tabs. I am trying to print all columns except the first one, but I want to print them all in a single column with awk. The format of the file is
col 1 col 2 ... col n
There are at least 2 columns in each row.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print column 2 through column n (for n greater than 2), without printing blank lines when there is no info in column n of that row, all in one column like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this on the command line, and it works:
awk '{i=2;
while ($i ~ /[0-9]+/)
{
printf "%s\n", $i
i++
}
}' bth.data
This is more of a request for approval than a question: is it the right way of doing something like this in awk, or is there a better/shorter way?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several pre-defined awk variables. It indicates the number of fields on a given input line. For instance, it is useful if you want to always print out the last field in a line: print $NF. It is also useful, of course, if you want to iterate through all or part of the fields on a given line.
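A tiny illustration on the sample data, just to make NF concrete:
awk '{ print NF, $NF }' bth.data   # prints the field count, then the last field, of each line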
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that -s squeezes repeated tabs, which avoids printing blank lines, as stated in the original problem.

Awk - print next record following matched record

I'm trying to get the next field after a matching field using awk.
Is there an option to do that, or do I need to scan the record into an array, check each field in the array, and print the one after the match?
What I have so far:
The file format is:
<FIELD><separator "1"><VALUE><separator "1"><FIELD><separator "1"><VALUE>
... and so on; field|value pairs are repeated, with at least one pair per line and fewer than 10 pairs per line
dat.txt:
FIELDA01VALUEA01FIELDA21VALUEA21FIELDA31VALUEA3
FIELDB01VALUEB01FIELDB21VALUEB21FIELDB31VALUEB3
FIELDC01VALUEC01FIELDC21VALUEC21FIELDC31VALUEC3
FIELDD01VALUED01FIELDD21VALUED21FIELDD31VALUED3
FIELDE01VALUEE01FIELDE21VALUEE21FIELDE31VALUEE3
With a simple awk script which prints the second field on a line matching FIELDB2:
#!/bin/awk -f
BEGIN { FS = "1" }
/FIELDB2/ { print $2 }
Running the above:
> ./scrpt.awk dat.txt
Gives me:
VALUEB0
This is because the matching line is:
FIELDB01VALUEB01FIELDB21VALUEB21FIELDB31VALUEB3
which, when split into fields, looks like:
FIELDB0 VALUEB0 FIELDB2 VALUEB2 FIELDB3 VALUEB3
and the second field there is VALUEB0.
Now, I don't know which FIELDXX will match, but I would like to print the field immediately after the FIELDXX that matched; in this specific example, when FIELDB2 matches, I need to print VALUEB2.
Any suggestions?
You can match the line, then loop over the fields for a match, and print the next field:
awk -F1 '/FIELDB2/ { for (x=1;x<=NF;x++) if ($x~"FIELDB2") print $(x+1) }'
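If the field name to search for is not fixed, a variant of the same loop can take it as a variable (key is just an illustrative name here):
awk -v key="FIELDB2" -F1 '{ for (x = 1; x < NF; x++) if ($x == key) print $(x+1) }' dat.txt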
No need to slow things down with a loop :)
awk -F1 '/FIELDB2/ {f=NR} f&&NR-1==f' RS="1" file
VALUEB2
Setting RS="1" makes every FIELD/VALUE token its own record; the first rule remembers the record number of the match, and the following record (NR-1==f) is printed by the default action.