Awk - print next field following matched field

I'm trying to get the next field after a matching field using awk.
Is there an option to do that, or do I need to split the record into an array, check each field, and print the one after the match?
What I have so far:
The file format is:
<FIELD><separator "1"><VALUE><separator "1"><FIELD><separator "1"><VALUE>
... and so on; the field/value pairs repeat. There will be at least one pair per line, and fewer than 10 pairs per line.
dat.txt:
FIELDA01VALUEA01FIELDA21VALUEA21FIELDA31VALUEA3
FIELDB01VALUEB01FIELDB21VALUEB21FIELDB31VALUEB3
FIELDC01VALUEC01FIELDC21VALUEC21FIELDC31VALUEC3
FIELDD01VALUED01FIELDD21VALUED21FIELDD31VALUED3
FIELDE01VALUEE01FIELDE21VALUEE21FIELDE31VALUEE3
Here is a simple awk script which prints the second field of a line matching FIELDB2:
#!/bin/awk -f
BEGIN { FS = "1" }
/FIELDB2/ { print $2 }
Running the above:
> ./scrpt.awk dat.txt
Gives me:
VALUEB0
This is because the matching line:
FIELDB01VALUEB01FIELDB21VALUEB21FIELDB31VALUEB3
when split into fields, looks like this:
FIELDB0 VALUEB0 FIELDB2 VALUEB2 FIELDB3 VALUEB3
From which the second field is VALUEB0
Now, I don't know in advance which FIELDXX will match, but I would like to print the field that follows the matched FIELDXX on that line. In this specific example, when FIELDB2 matches I need to print VALUEB2.
Any suggestions?

You can match the line, then loop over the fields for a match, and print the next field:
awk -F1 '/FIELDB2/ { for (x=1;x<=NF;x++) if ($x~"FIELDB2") print $(x+1) }'
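Note that $x~"FIELDB2" is a regular-expression match, so it would also fire on a field such as FIELDB2X. If your field names can overlap like that, an exact string comparison is safer (a minor variant of the same loop):
awk -F1 '{ for (x=1; x<NF; x++) if ($x == "FIELDB2") print $(x+1) }' dat.txt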

No need to slow things down with a loop :)
awk -F1 '/FIELDB2/ {f=NR} f&&NR-1==f' RS="1" file
VALUEB2

Setting RS="1" turns every field into its own record, so when the FIELDB2 record matches, f remembers its record number, and the very next record (the value) satisfies NR-1==f and is printed.


awk/sed replace multiple newlines in the record except end of record

I have a file where:
field delimiter is \x01
the record delimiter is \n
Some records contain embedded newlines that I need to remove; however, I don't want to remove the legitimate newline at the end of each record. I have tried this with awk:
awk -F '\x01' 'NF < 87 {getline s; $0 = $0 s} 1' infile > outfile
But this only works when the record contains one embedded newline (besides the end-of-record newline); it does not work for multiple newlines.
Note: the record contains 87 fields.
What am I doing wrong here?
Example of file:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000
test^A00000000
Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000
test^A12102120^A00000^A00^A^A
NOTE: the field separator is \x01 and the record separator is \n; each example record contains 11 delimiters (12 fields).
Expected result:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note: I need to preserve the field delimiter (\x01) and record delimiter (\n)
Thank you very much in advance for looking into this.
The file always contains 87 fields;
The field delimiter is '\x01', but when viewed in Linux it is displayed as '^A'.
Some lines contain newlines - I need to remove them, but I don't want to remove the legitimate newlines at the end of each line.
An embedded newline appears twice in the first and second records and once in the third record; these are the newlines I want to remove.
In the examples/expected results there are 11 delimiters (\x01, represented as ^A) per record,
I expect to have 3 records and not 6, i.e.:
First record:
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to the first line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Second record:
test^A00000000 should be joined to the previous line
Test^A^A^A^A should be joined to that previous line as well
forming one record:
PL^ANov-21^A29-11-2021^A0^A00^A00^A0000000test^A00000000 Test^A^A^A^A
Third record:
test^A12102120^A00000^A00^A^A should be joined to the previous line
forming one record:
SL^ANov-21^A30-11-2021^AB^A0000^A1234567^A00000test^A12102120^A00000^A00^A^A
Note: the awk example provided works when there is one unwanted newline in the record, but not when there are multiple newlines.
Thank you so very much. It works perfectly. Thank you for explaining it so well to me too.
This might work for you (GNU sed):
sed ':a;N;s/\x01/&/87;Ta;s/\n//g' file
Gather up lines until there are 87 separators, remove any newlines and print the result.
What's wrong with your attempt is that it concatenates exactly two lines, prints the result, and moves on to the next line: getline runs only once per record, so a record split across three or more physical lines is never fully reassembled. Also, since every physical line has fewer than 87 fields, the NF < 87 condition is always true; your script would behave the same without it.
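For reference, a minimal sketch that fixes the original getline approach by looping until the record is complete (assuming GNU awk, since \x01 in -F is a gawk escape; assigning to $0 re-splits the record, so NF is updated after each appended line):
awk -F'\x01' '{ while (NF < 87 && (getline line) > 0) $0 = $0 line; print }' infile > outfile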
Try this awk script:
$ awk -F'\x01' -vn=87 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file
Here, we use the real \x01 field separator and the NF field count. Variable i counts the number of fields printed so far. We first print the current line without a trailing newline (printf("%s", $0)). Then we update the i field counter. If it equals n, we reset it and print a newline. Otherwise we decrement it, so that the last field of this line and the first field of the next are not counted as two separate fields (they will be joined).
Demo with n=12 instead of 87 and your own input file (with \x01 field separators):
$ awk -F'\x01' -vn=12 -vi=0 '
{printf("%s", $0); i+=NF; if(i==n) {i=0; print "";} else i-=1;}' file |
sed 's/\x01/|/g'
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
PL|Nov-21|29-11-2021|0|00|00|0000000test|00000000 Test||||
SL|Nov-21|30-11-2021|B|0000|1234567|00000test|12102120|00000|00||
The sed command shows the result with the \x01 replaced by | for easier viewing.

How to append last column of every other row with the last column of the subsequent row

I'd like to append to every other row (the odd-numbered rows) the last column of the subsequent row (the even-numbered rows). I've tried several different commands, but none seem to do what I'm trying to achieve.
Raw data:
user|396012_232|program|30720Mn|
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|
|391983_233.batch|batch|30720Mn|5050424K
I'd like to take the last field of each "batch" line and append it to the line above.
Desired output:
user|396012_232|program|30720Mn|5108656K
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|36426336K
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|5050424K
|391983_233.batch|batch|30720Mn|5050424K
The "batch" lines would then be discarded from the output, so in those lines there is no preference if the line is cut or copied or changed in any way.
Where I got stumped, my attempts to finish the logic were embarrassingly illogical:
awk 'BEGIN{OFS="|"} {FS="|"} {if ($3=="batch") {a=$5} else {} ' file.data
Thanks!
If you do not need to keep the lines with batch in Field 3, you may use
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; $3=="batch" { print prev $5 }' file.data
or
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; NR%2==0 { print prev $5 }' file.data
Details
BEGIN{OFS=FS="|"} - sets the input and output field separators to a pipe
NR%2==1 { prev=$0 }; - saves each odd line in the prev variable
$3=="batch" - checks if Field 3 is equal to batch (probably, with this logic you may replace it with NR%2==0 to get the even line)
{ print prev $5 } - prints the previous line and Field 5.
You may also consider a sed option:
sed 'N;s/\x0A.*|\([^|]*\)$/\1/' file.data > newfile
Details
N - appends a newline and then the next line of input to the pattern space; if there is no more input, sed exits without processing any more commands
s/\x0A.*|\([^|]*\)$/\1/ - replaces the following with the contents of Group 1:
\x0A - a newline
.*| - any 0+ chars up to the last | and
\([^|]*\) - (Capturing group 1): any 0+ chars other than |
$ - end of line
If your data is in file 'd', try GNU awk:
awk 'BEGIN{FS="|"} {if(getline n) {if(n~/batch/){b=split(n,a,"|");print $0 a[b]"\n"n} } }' d
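Spelled out with comments, the same getline idea looks like this (a sketch of the one-liner above, not a different method):
awk 'BEGIN { FS = "|" }
{
    if ((getline n) > 0) {        # read the following line into n
        if (n ~ /batch/) {        # only act when it is a "batch" line
            b = split(n, a, "|")  # split it; a[b] is its last field
            print $0 a[b] "\n" n  # current line + last field, then the batch line
        }
    }
}' d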

Print lines containing the same second field for more than 3 times in a text file

Here is what I am doing.
The text file is comma-separated and has three fields, and I want to extract all the lines whose second field occurs more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is below. I cat the whole text file inside awk, grep for the second field of each line, and count the matching lines. If the count is greater than 2, print the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command output contains only the first three consecutive lines, instead of also printing the last three lines containing "keyword1", which occurs six times in the text.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
It is relatively straightforward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print the rows where the value in column 2 occurs more than your threshold of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
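If you do have to read from a pipe, one option is to buffer the stream in memory and do the second pass in the END block, at the cost of holding the whole input in memory. A sketch of that idea:
awk -F, '{ count[$2]++; line[NR] = $0; key[NR] = $2 }
END { for (i = 1; i <= NR; i++) if (count[key[i]] > 3) print line[i] }'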

find duplicate in first field, then combine text from second field of duplicate lines

I have file.csv with two fields similar to this:
text,something
more,somethingelse
text,another
foo,bar
I sort the file so that the first field is in order and all the duplicates in the first column are grouped together.
foo,bar
more,somethingelse
text,something
text,another
What I need to do, but can't figure out, is to move the text in the second field onto the same line as the duplicated first field, separated by a ";". The order of the second field doesn't matter. I just want the output to be something like this:
foo,bar
more,somethingelse
text,something; another
I've tried this but it doesn't work. Not surprising since I'm just learning awk.
sort file.csv | awk 'BEGIN{last = ""; value = 0;} {if ($1 == last) {print $0, "; value";}}'
I wanted 'last' to hold the value of the first field of the previous line and 'value' to hold the value of the second field of the previous line. But I couldn't figure out how to make that work.
Is it possible to do this with a shell script? Thanks for any input.
This should work without the need for sort:
awk -F, '{
lines[$1] = (lines[$1] ? lines[$1] "; " $2 : $0)
}
END {
for (line in lines) print lines[line]
}' file
more,somethingelse
text,something; another
foo,bar
Set the input field separator to ,.
Check if column 1 already exists as a key in our lines array. If it does, append the second column, separated by ;.
If column 1 is not present in the array, assign the entire line as the value.
In the END block, iterate through the array and print the values.
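One caveat: for (line in lines) visits keys in an unspecified order, which is why the output above is not in input order. If order matters, here is a sketch along the same lines that remembers first-seen order in a parallel array:
awk -F, '!($1 in lines) { order[++n] = $1 }
{ lines[$1] = (lines[$1] ? lines[$1] "; " $2 : $0) }
END { for (i = 1; i <= n; i++) print lines[order[i]] }' file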

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tabs. I am trying to print all columns except the first one, all in a single column (one field per line), with AWK. The format of the file is
col 1 col 2 ... col n
There are at least 2 columns in each row.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print columns 2 through n, without printing blank lines when there is no info in a given column of that row, all in one column like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this on the command line, and it works:
awk '{i=2;
while ($i ~ /[0-9]+/)
{
printf "%s\n", $i
i++
}
}' bth.data
This is more me seeking approval than asking a question: is this the right way to do something like this in AWK, or is there a better/shorter way?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several pre-defined awk variables. It indicates the number of fields on a given input line. For instance, it is useful if you want to always print the last field of a line: print $NF. It is also useful, of course, if you want to iterate through all or part of the fields on a given line to the end of the line.
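For example, against the sample data above, printing only the last field of each row is a one-liner:
awk '{ print $NF }' bth.data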
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that with -s, this avoids printing blank lines as stated in the original problem.