How to combine two awk commands:
awk NF file_name.csv > file_name_after_NF.csv, whose output is used in the next step:
awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' file_name_after_NF.csv > file_name_postprocess.csv
Assuming the intermediate file file_name_after_NF.csv was written solely to feed the 'no blank lines' version of the .csv file into the 'remove repeat lines' command, the two procedures can be combined by making NF the condition pattern for the main awk code block:
awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}}' file_name.csv > file_name_postprocess.csv
In the above procedure, the main awk block is only applied where there are one or more fields in the record.
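As a quick illustration with a made-up sample (a repeated header line plus a blank line), the combined command removes both in one pass:
$ printf 'h1,h2\n\nh1,h2\na,b\n' | awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}}'
h1,h2
a,b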
If you need a file copy of file_name_after_NF.csv, this can be created by adding a file-write block within the main block of the previous awk procedure:
awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}{print $0 > "file_name_after_NF.csv"}}' file_name.csv > file_name_postprocess.csv
{print $0 > "file_name_after_NF.csv"} does the file writing from within awk (note the filename must be quoted, otherwise awk treats it as a variable). As this block is inside the main block governed by the NF condition pattern, only lines with fields are written to file_name_after_NF.csv.
More generally, if a little more cumbersomely, awk procedures can be combined by piping their output into successive awk procedures. In your case this could be achieved using:
awk NF file_name.csv | awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' > file_name_postprocess.csv
or, if you need the intermediate file, again include an awk file print block:
awk NF file_name.csv | awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}{print $0 > "file_name_after_NF.csv"}}' > file_name_postprocess.csv
Edit: dealing with cases where the header line comes after one or more blank lines
@glennjackman raised a point relating to an example where blank lines exist before the first/header row, since the NR==1 condition in the first two examples above would no longer set f to contain the header. The last two examples, where awk procedures are joined by pipes, would still work, but the first two would not.
To fix this, a further variable can be added to the BEGIN awk block that is updated as soon as the first non-empty line is seen in the main block. This allows for that line to be identified so that prior empty lines do not matter:
awk 'BEGIN{f="";headerFlag="none"} NF{if($0!=f){print $0}if(headerFlag=="none"){f=$0;headerFlag="yes"}}' file_name.csv > file_name_postprocess.csv
The NR==1 conditional in the original script is changed here to check whether the headerFlag set in BEGIN has been changed. If not, f is set to track repeats of that line, and headerFlag is changed so the block will only run on the first non-empty record encountered.
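To illustrate with a made-up sample where two blank lines precede the header, the repeated header is still removed:
$ printf '\n\nh1,h2\na,b\nh1,h2\nc,d\n' | awk 'BEGIN{f="";headerFlag="none"} NF{if($0!=f){print $0}if(headerFlag=="none"){f=$0;headerFlag="yes"}}'
h1,h2
a,b
c,d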
The same change can be used in the second solution above.
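For completeness, applying the same change to that file-writing variant would look like this (following the same pattern; untested on your data):
awk 'BEGIN{f="";headerFlag="none"} NF{if($0!=f){print $0}if(headerFlag=="none"){f=$0;headerFlag="yes"}{print $0 > "file_name_after_NF.csv"}}' file_name.csv > file_name_postprocess.csv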
I wasn't planning to try to answer this until after you posted sample input and expected output but since you already have answers, here's my guess at what you might be asking how to write:
awk 'BEGIN{f=""} !NF{next} $0!=f; f==""{f=$0}' file_name.csv > file_name_postprocess.csv
but without sample input/output it's untested.
I'd recommend you start using white space in your code to improve readability btw.
Also - testing NF for a CSV without setting FS to , is probably the wrong thing to do, but I don't know if you're trying to skip lines of all blanks, lines of all commas, or something else, so I don't know what the right thing to do is. Maybe it's this:
awk 'BEGIN{FS=","; f=""} $0~("^"FS"*$"){next} $0!=f; f==""{f=$0}' file_name.csv > file_name_postprocess.csv
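For example, on a made-up sample with a repeated header line and a line containing only a comma (your real data may differ):
$ printf 'h1,h2\n,\na,b\nh1,h2\nc,d\n' | awk 'BEGIN{FS=","; f=""} $0~("^"FS"*$"){next} $0!=f; f==""{f=$0}'
h1,h2
a,b
c,d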
Related
I have two files that I am working with. The first file is a master database file that I am having to search through. The second file is a file that I can make that allows me to name the items from the master database that I would like to pull out. I have managed to make an AWK solution that will search the master database and extract the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 are the lines that I am matching against. The last number in such a row tells me how many lines there are to copy. So in the first line, 40005X/50005/60005/3/10/9/, I would be matching on 40005X, and the 9 in that line tells me that there are 9 lines underneath it that I need to copy along with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, but I am open to any suggestion. I am working on a UNIX system, and I would like to keep this as a KSH script, since most of the other scripts that I use with it are written in that format and it is what I am most familiar with.
Thank you for your help!!
Your existing awk correctly matches the rows from the ids file; you now need to add a condition to print the N lines that follow, reading the count from the last number on the matching row. So we set a variable p to the number of lines to print plus one (for the current line), and decrement it as each row is printed.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with the last condition written in a more "awkish" way (by Ed Morton), which also covers the extreme case of a huge file:
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
Here the print action is omitted, as it is the default, and the condition remains true for as long as p is still positive before being decremented.
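As a tiny isolated illustration of the p&&p-- idiom (made-up input, not the OP's data), setting p on a marker line prints that line plus the next two:
$ printf 'start\na\nb\nc\n' | awk '/start/{p=3} p&&p--'
start
a
b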
another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
This takes care of the case where any of the content lines themselves match one of the given ids: it will only look for another id after the specified number of lines has been printed.
Could you please try the following, written and tested in GNU awk with the shown samples. It assumes you want to start printing from the lines that begin with digits followed by X. Here Input_file2 is the file containing only the ids and Input_file1 is the master file, as per the OP's question.
awk '
{
sub(/ +$/,"")
}
FNR==NR{
a[$0]
next
}
/^[0-9]+X/{
match($0,/[0-9]+\/$/)
no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
found=count=""
}
{
if(count==no_of_lines_to_print){ count=found="" }
for(i in a){
if(match($0,i)){
found=1
print
next
}
}
}
found{
++count
}
count<=no_of_lines_to_print && count!=""
' Input_file2 Input_file1
I want to delete the records in a file once a pattern is found, removing all lines from that pattern until the end of the file.
I tried doing this in awk, but I haven't found whether there is a simpler way to do it.
So I want to match the pattern in the second column and then delete records from that pattern until the end of the file.
awk -F"," '$2 ~ /100000/ {next} {print}' file.csv
The above code skips those lines; however, as you can see, I would need to add a match pattern for every subsequent value in order to ignore the lines that come after the one with 100000 in the 2nd column. Please note that the values in the 2nd column appear sequentially, so after 100000 comes 100001, and there is no fixed end number.
I'm not sure if I've understood your problem correctly, but IMHO I believe you need the following.
awk -F',' '$2==100000{print;exit} 1' Input_file
This will print up to and including the line whose 2nd column is 100000, and then, since you don't want to print anything after that, instead of simply reading and skipping the rest of the file, it exits, which additionally saves time.
OR as per Ed Morton sir's nice suggestion:
awk -F',' '1; $2==100000{exit}' Input_file
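For instance, on a small made-up CSV (values chosen only to illustrate):
$ printf 'a,99999\nb,100000\nc,100001\nd,100002\n' | awk -F',' '1; $2==100000{exit}'
a,99999
b,100000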
awk command to compare lines in a file and print only the first line if there are some new words in the other lines.
For example, file.txt contains:
i am going
i am going today
i am going with my friend
output should be
I am going
This will work for the sample input but may well fail for the actual one; unless you post representative input we can't know...
$ awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going
You may want to play with p=$0 to restrict the matching to a certain number of fields if the line lengths are not in increasing order...
Is it always the case, after modifying a specific field in awk, that information on the output field separator is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script that takes account of the input field separator and prints each line in my file, such as
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line; rather, I see the fields separated by the default whitespace separator, i.e.:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
In the case, however, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That is the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and gawk providers before all agreeing that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
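To see what that 4th argument captures, here's a quick illustration using the same sample line (GNU awk only; the field/separator printout is just for demonstration):
$ echo 'a:e;i:o;u' | awk -F'[:;]' '{n=split($0,f,FS,s); for (i=1;i<=n;i++) print i, f[i], s[i]}'
1 a :
2 e ;
3 i :
4 o ;
5 u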
I want AWK to process my file and change only some lines, but it prints only the rule-matched lines. So I added a /*/ {print $0} rule, but then I got duplications.
awk '/rule 1/ {actions 1;} /*/ {print $0}' file.txt
I want all lines in the output, with 'rule 1' lines changed.
Adding a 1 at the end of the script gives awk a condition that is always true, with the default action of printing the line, which has the effect of printing all lines by default. For example, the following will print all lines in the file; however, if a line contains the words rule 1, only the first field of that line will be printed.
awk '/rule 1/ { print $1; next }1' file
The word next skips processing the rest of the code for that particular line. You can apply whatever action you'd like. I just chose to print $1. HTH.
I'll make a big edit, following Ed Morton's latest explanations.
As Ed Morton pointed out, it can even be simplified to: change the lines matching specific patterns, and then print all lines.
awk '/regex1/ {actions_1} 1' file.txt
(see his comments below for the reason why it's preferable to the one I proposed)
For the record, there exist ways to skip the rest of the processing for the current line, such as continue or break if it is inside a loop, or next if it is in the main input loop.
See for example : http://www.gnu.org/software/gawk/manual/html_node/Next-Statement.html#Next-Statement
Or assign the result of actions 1 to $0:
awk '/rule 1/{$0=actions 1}1' file.txt
for example:
awk '/rule 1/{$0=$1}1' file.txt
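For instance, with a made-up two-line input (text chosen only for illustration), the matching line is replaced by its first field and everything else passes through unchanged:
$ printf 'rule 1 foo bar\nanother line\n' | awk '/rule 1/{$0=$1}1'
rule
another line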