awk command to print only part of matching lines

I need an awk command to compare lines in a file and print only the first line when the other lines merely add new words to it.
For example: file.txt is having
i am going
i am going today
i am going with my friend
output should be
i am going

This will work for the sample input but perhaps will fail for the actual one; without a representative input we can't know...
$ awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going
you may want to play with p=$0 to restrict the number of fields that must match if the line lengths are not in increasing order...
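For instance, if the input could contain regex metacharacters, a literal prefix test with index() avoids treating the stored line as a pattern; a small sketch, still assuming the shortest line comes first:
$ awk 'NR>1 && index($0,p)==1 {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going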

Related

How do I match a pattern and then copy multiple lines?

I have two files that I am working with. The first is a master database file that I have to search through. The second is a file I create myself, naming the items from the master database that I would like to pull out. I have managed to write an AWK solution that searches the master database and extracts the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 are the ones I am matching on. The last number in such a row tells me how many lines there are to copy. So for the first line, 40005X/50005/60005/3/10/9/, I would be matching on the 40005X, and the 9 in that line tells me that there are 9 lines underneath it that I need to copy along with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, but I am open to any suggestion. I am working on a UNIX system, and I would like to keep it as a KSH script, since most of the other scripts I use with this are written in that format and it is what I am most familiar with.
Thank you for your help!!
Your existing awk correctly matches the rows from the ids file; you now need to add a condition to print the N lines that follow, after reading the last field of the matching row. So we set a variable p to the number of lines to print plus one (for the current row), and decrement it as each row is printed.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with the last condition written in a more "awkish" way (by Ed Morton), covering the extreme case of a huge file:
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
Here the print action is omitted, since printing is the default action, and the condition p&&p-- is true as long as p is positive, decrementing p each time. Unlike p-->0, it stops decrementing once p reaches zero, so p cannot drift ever further negative over millions of non-matching lines.
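To see the difference, feed each form five non-matching records and check where p ends up (a tiny experiment, using seq only to generate input):
$ seq 5 | awk 'p-->0{print} END{print "p is now", p}'
p is now -5
$ seq 5 | awk 'p&&p--{print} END{print "p is now", p}'
p is now 0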
another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
This handles the case where a content line happens to match one of the given ids: it only starts looking for another id after the specified number of lines has been printed.
Could you please try the following, written and tested in GNU awk with the shown samples. It assumes that the lines you want to print start from a line beginning with digits followed by X. Here Input_file2 is the file having only the ids, and Input_file1 is the master file from the question.
awk '
{
  sub(/ +$/,"")                                # strip trailing spaces
}
FNR==NR{                                       # first file: collect the ids
  a[$0]
  next
}
/^[0-9]+X/{                                    # header row: extract the count
  match($0,/[0-9]+\/$/)
  no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
  found=count=""
}
{
  if(count==no_of_lines_to_print){ count=found="" }
  for(i in a){
    if(match($0,i)){                           # header matches a wanted id
      found=1
      print
      next
    }
  }
}
found{                                         # we are inside a wanted block
  ++count
}
count<=no_of_lines_to_print && count!=""       # print while within the count
' Input_file2 Input_file1

Using pipe and shell command in awk script

I am writing an awk script which needs to produce an output which needs to be sorted.
I am able to get the desired unsorted output in an awk array. I tried the following code to sort the array; it works, but I don't know why, or whether it is the expected behavior.
Sample Input to the question:
Ram,London,200
Alex,London,500
David,Birmingham,300
Axel,Mumbai,150
John,Seoul,450
Jen,Tokyo,600
Sarah,Tokyo,630
The expected output should be:
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
The script should show each city name along with the cumulative total of the integers in the third field.
BEGIN {
    FS = ","
    OFS = ","
}
{
    if ($2 in arr) {
        arr[$2] += $3
    } else {
        arr[$2] = $3
    }
}
END {
    for (i in arr) {
        print i, arr[i] | "sort"
    }
}
The following code is in question:
for (i in arr) {
    print i, arr[i] | "sort"
}
The output of the print is piped to sort, which is a bash command.
So, how does this output travel from awk to bash?
Is this the expected behavior or a mere side effect?
Is there a better awk way to do it? I have tried asort and asorti already, but they exist in gawk, not in plain awk.
PS: I am trying specifically to write a .awk file for the task, without using bash commands. Please suggest along those lines.
Addressing your specific questions in order:
So, how does this output travel from awk to bash?
A pipe to a spawned process.
Is this the expected behavior or a mere side effect?
Expected
Is there a better awk way to do it? I have tried asort and asorti already, but they exist in gawk, not in plain awk.
Yes, pipe the output of the whole awk command to sort.
PS: I am trying specifically to write a .awk file for the task, without using bash commands. Please suggest along those lines.
See https://web.archive.org/web/20150928141114/http://awk.info/?Sorting for the implementation of a few common sorting algorithms in awk. See also https://rosettacode.org/wiki/Category:Sorting_Algorithms.
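As a taste of what those pages cover, here is a minimal sketch in portable awk that replaces the sort pipe with an insertion sort over the array indices in the END rule (no gawk extensions, no external commands):
BEGIN { FS = ","; OFS = "," }
{ arr[$2] += $3 }
END {
    # collect the indices into idx[1..n]
    n = 0
    for (i in arr) idx[++n] = i
    # insertion-sort idx by string comparison
    for (j = 2; j <= n; j++) {
        key = idx[j]
        for (k = j - 1; k >= 1 && idx[k] > key; k--)
            idx[k+1] = idx[k]
        idx[k+1] = key
    }
    for (j = 1; j <= n; j++) print idx[j], arr[idx[j]]
}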
With respect to the question in your comments:
Since a process is spawned to sort from within the loop in the END rule, I was confused whether this calls sort on a single line, with the spawned process dying right after, and a new sort process being spawned in the next iteration of the loop
The spawned process won't die until your awk script terminates or you call close("sort").
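For instance, to make that lifetime explicit, the END rule from the question can close the pipe itself once the loop is done (a sketch, same arr as above):
END {
    cmd = "sort"
    for (i in arr)
        print i, arr[i] | cmd   # every iteration writes to the same sort process
    close(cmd)                  # flush the pipe; sort now emits its output
}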
Could you please try changing your sort to sort -t',' -k1 in your code? Since your delimiter is a comma, you need to tell sort that your delimiter is different from the default; by default, sort treats whitespace as the field delimiter.
Also, you could remove the if/else block from your main block and use just arr[$2]+=$3. Keep the rest of the code as it is, apart from the sort changes I mentioned above.
I am on mobile so couldn't paste all the code, but the explanation should help you here.
What I would suggest is piping the output of awk to sort, rather than trying to pipe the output within the END rule. While GNU awk provides asorti() to sort the contents of an array, in this case it is just the output you want sorted, so a single pipe to sort after your awk script completes is all you need, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file | sort
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
And since it is a single pipe of the output, you incur no per-iteration overhead for the subshell required by the pipe.
If you want to avoid the pipe altogether, if you have bash, you can simply use process-substitution with redirection, e.g.
$ sort < <(awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file)
(same result)
If you have GNU awk, then asorti() will sort a by index and you can place the sorted array in a new array b and then output the sorted results within the END rule, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{asorti(a,b);for(i in b)print b[i], a[b[i]]}' file
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
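If your gawk is 4.0 or later (an assumption worth checking), PROCINFO["sorted_in"] is another option: it makes the for-in traversal itself ordered, so no second array is needed, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{PROCINFO["sorted_in"]="@ind_str_asc";for(i in a)print i, a[i]}' file
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230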

awk to store and reset variable from file

I am trying to use awk to look up the string from file1 (which is always just one field) in the corresponding line of file2. That is, if row 1 is being used in file1, then only row 1 is used in file2. Since it is possible for the value to be missing, this check is done to ensure it is there. This is just an idea, so there is probably a better way, but I just wanted to see. Thank you :).
file1
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
desired output for $filename
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
After a bunch of processes are run using $filename, I need to reset that variable with a new one.
file1 (after rerunning some process)
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2 (after rerunning some process)
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
The oldest folder is R_2017_01_13_14_46_04_user_S5-00580-25-Medexome, created on 2017-01-17+06:53:07.3194950000 and analysis done using v1.4 by cmccabe at 01/18/17 06:59:08 AM
desired output for $filename now is (since this value is new)
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
awk
filename=$(awk 'NR==1{print $1}' file1 file2)
You want to check if the last line of file2 contains a string given in file1.
For this, you just have to read that last line and then see if it matches any of the words in file1.
$ awk 'ENDFILE {line=$0} FNR<NR && line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
This uses:
ENDFILE {line=$0}
after reading a file, $0 contains the last line that was read (well, not always, but I assume you have a modern version of awk, since ENDFILE is a GNU awk extension). With this, we store that last line in line, so that we can use it when reading the next file.
FNR<NR && line ~ $1
while reading file1, check whether the given word is present in the stored line. If so, print is automatically triggered.
This uses the FNR<NR trick, where FNR holds the line number within the current file, while NR holds the overall line number across all input. This way, FNR==NR is only true while reading the first file, and FNR<NR from the second file on.
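If your awk lacks ENDFILE, a portable sketch gets the same effect by overwriting a variable on every record of file2, so that it ends up holding the last line:
$ awk 'NR==FNR {line=$0; next} line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome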
If you only need to check the last line of file2 continuously, you could:
$ awk 'NR==FNR{a[$1];next}{for(i in a)if($0 ~ i) print i}' file1 <(tail -f file2)
Explained:
NR==FNR{a[$1];next} read into a array the search terms from file1
file2 is tail -f'd into awk using process substitution, i.e. awk reads each record appended to the end of file2, goes through all the search words in a, and looks for them in the record, printing the search word if there is a match

Output field separators in awk after substitution in fields

Is it always the case that, after modifying a specific field in awk, the information about the original field separators is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script that takes account of the input field separator and prints each line of my file, such as
awk -F: '{print $0}' example
I will see the original line. If, however, I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line; rather, I see the fields separated by the default whitespace separator, i.e.:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
In the case, however, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS-matching strings because of how expensive it would be in time and memory to store the string that matches FS every time you split a record into fields. Instead, the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That design was the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and the gawk maintainers, who all agreed it was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
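Spelled out with comments, the same technique looks like this (GNU awk only; seps[0] and seps[n] hold any leading and trailing separator text):
$ awk '{
    n = split($0, f, /[:;]/, seps)   # seps[i] is the text that matched between f[i] and f[i+1]
    f[2] = f[2] "!"                  # modify the field copy
    out = seps[0]                    # leading separator, if any (usually empty)
    for (i = 1; i <= n; i++)
        out = out f[i] seps[i]       # reassemble with the original separators
    print out
}' example2
a:e!;i:o;u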

print last two words of last line

I have a script which returns a few lines of output, and I am trying to print the last two words of the last line (irrespective of how many lines the output has).
$ ./test.sh
service is running..
check are getting done
status is now open..
the test is passed
I tried running it as below, but it prints the last word of each line.
$ ./test.sh | awk '{ print $NF }'
running..
done
open..
passed
How do I print the last two words, "is passed", using awk or sed?
Just say:
awk 'END {print $(NF-1), $NF}'
"normal" awks store the last line (but not all of them!), so that it is still accessible by the time you reach the END block.
Then, it is a matter of printing the penultimate and the last one. This can be done using the NF-1 and NF trick.
For robustness, in case your last line can contain just 1 field or your awk doesn't retain the field values in the END section:
awk '{split($0,a)} END{print (NF>1?a[NF-1]OFS:"") a[NF]}'
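A quick check in GNU awk, with a contrived input whose last line holds a single word, shows why the ternary guard is there:
$ printf 'the test is\npassed\n' | awk '{split($0,a)} END{print (NF>1?a[NF-1]OFS:"") a[NF]}'
passed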
This might work for you (GNU sed):
sed '$s/.*\(\<..*\<.*\)/\1/p;d' file
This deletes every line, but on the last line it first replaces the whole line with its last two words and prints the result if the substitution succeeds.
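For example, fed the output shown above (this assumes GNU sed, since \< is a GNU word-boundary extension):
$ ./test.sh | sed '$s/.*\(\<..*\<.*\)/\1/p;d'
is passed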