This is my sample input file:
xxxxx,12345,yy,ABN,ABE,47,20171018130030,122021010147421,2,IN,3,13,9741588177,32
xxxxxx,9741588177,yy,ABN,ABE,54,20171018130030,122025010227014,2,IN,3,15,12345,32
I want to compare two consecutive lines in this file under these conditions:
1. The 12th field of the 1st line and the 12th field of the 2nd line must be 13 and 15, respectively.
2. If condition 1 is met, then the 2nd field of line 1 (which has the 12th field value as 13) must match the 13th field of line 2 (which has the 12th field as 15).
The file contains many lines where these conditions are not met; I would like to print only those lines which meet conditions 1 and 2.
Any help in this regard is greatly appreciated!
It's not clear whether you want to compare the lines in groups of 2 (i.e., compare lines 1 and 2, then lines 3 and 4) or serially (i.e., compare lines 1 and 2, then lines 2 and 3). For the latter:
awk 'NR > 1 && prev_12 == 13 && $12 == 15 &&
prev_2 == $13 {print prev; print $0}
{prev=$0; prev_12=$12; prev_2=$2}' FS=, input-file
For the former, add the condition NR % 2 == 0. (I'm assuming the fields are comma-separated, which appears to be the case judging by the input.)
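For the grouped reading, here is a minimal sketch of where the extra test could go. The helper name pairs_only and the shortened sample rows are made up for illustration; the rows keep 14 comma-separated fields so that $2, $12 and $13 line up with the question.

```shell
# pairs_only: hypothetical helper; compares lines 1-2, 3-4, 5-6, ... only.
pairs_only() {
  awk -F, '
    # Only even-numbered lines can close a pair.
    NR % 2 == 0 && prev_12 == 13 && $12 == 15 && prev_2 == $13 {
      print prev
      print $0
    }
    { prev = $0; prev_12 = $12; prev_2 = $2 }
  ' "$@"
}

# Lines 2 and 3 would match serially, but they sit in different pairs.
# Lines 5 and 6 form a pair and match.
printf '%s\n' \
  'a,111,c,d,e,f,g,h,i,j,k,15,999,n' \
  'b,222,c,d,e,f,g,h,i,j,k,13,333,n' \
  'c,333,c,d,e,f,g,h,i,j,k,15,222,n' \
  'd,444,c,d,e,f,g,h,i,j,k,10,555,n' \
  'e,777,c,d,e,f,g,h,i,j,k,13,000,n' \
  'f,888,c,d,e,f,g,h,i,j,k,15,777,n' | pairs_only
# prints the last two rows (e,... and f,...)
```

Dropping the NR % 2 == 0 test gives the serial behaviour again, in which case rows b and c would be printed as well.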
Wish you'd used a few more lines of sample input and provided expected output so we're not all just guessing but MAYBE this is what you want to do:
$ cat tst.awk
BEGIN { FS="," }
(p[12] == 13) && ($12 == 15) && (p[2] == $13) { print p[0] ORS $0 }
{ split($0,p); p[0]=$0 }
$ awk -f tst.awk file
xxxxx,12345,yy,ABN,ABE,47,20171018130030,122021010147421,2,IN,3,13,9741588177,32
xxxxxx,9741588177,yy,ABN,ABE,54,20171018130030,122025010227014,2,IN,3,15,12345,32
Another awk:
$ awk -F, '$12==13 {p0=$0; p2=$2; c=1; next}
c&&c-- && $12==15 && p2==$13 {print p0; print}' file
This starts capturing only after a line whose $12 is 13 has been seen.
c&&c-- is a smart counter (a count-down here) that stops at 0, thanks to the first c before the ampersands. Ed Morton has a post with many more examples of smart counters.
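To see the counter on its own, here is a small sketch that prints the n lines following each match. The helper name after_match and the sample words are invented for this illustration.

```shell
# after_match: hypothetical helper; prints the n lines that follow each line
# matching the regex re (the matching line itself is skipped).
after_match() {
  awk -v re="$1" -v n="$2" '
    $0 ~ re { c = n; next }   # (re)arm the counter on every match
    c && c--                  # true while c > 0, decrementing as it goes
  '
}

printf '%s\n' one START two three four | after_match START 2
# prints:
# two
# three
```

Because the left-hand c is tested first, the decrement is skipped once c reaches 0, so the counter never goes negative.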
The Problem
I just need to combine a whole bunch of files and strip out the header (line 1) from the 1st file.
The Data
Here are the last three lines (with line 1: header) from three of these files:
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
Problem (Continued)
As you can see, the last line has a number (it's a column total) in column 5. Of course, I don't want that last line. But it's (obviously) on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've Tried
I've tried a number of combinations of things, but I guess the one that I'm most surprised does not work is:
gawk '
{ if (!$1 ) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected Output (by request)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
And, of course, I've tried searching both Google and SO. Most of the answers I see require much more awk knowledge than I have, just to understand them. (I'm not a data wrangler, but I have a data wrangling task.)
Thanks for any help!
this should do...
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
This prints the first header, then skips the other headers and the last line of each file.
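The trick is that every line is printed one step late, so the final line of each file is never printed. A minimal sketch with throwaway files; the helper name merge_csv, the file names and the one-word rows are assumptions made for the demonstration.

```shell
# Throwaway files in a temp directory; names and contents are illustrative.
dir=$(mktemp -d)
printf '%s\n' 'H' 'a1' 'a2' 'total-a' > "$dir/f1"
printf '%s\n' 'H' 'b1' 'total-b'      > "$dir/f2"

# merge_csv: hypothetical wrapper around the one-liner above.
merge_csv() {
  awk '
    NR == 1              # very first line overall: print the header
    FNR == 1 { next }    # first line of every file: nothing more to do
    FNR > 2 { print p }  # print the previous line, one step behind
    { p = $0 }           # remember the current line
  ' "$@"
}

merge_csv "$dir/f1" "$dir/f2"
# prints:
# H
# a1
# a2
# b1
rm -r "$dir"
```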
Another awk approach:
awk -F, '
NR == 1 {
header = $0
print
next
}
FNR > 1 && $1 != "\"\""
' *.csv
Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
Split the records by comma: -F","
If this is the first record awk encounters, set the variable header to the entire contents of the line and then print the header: NR==1{header=$0; print $0}
If the current line is not a header and the first field isn't an empty pair of quotes (which marks a "total" line), print the line: $0!=header && $1!="\"\""{print $0}. awk does not strip the CSV quotes, so the empty first field is the two characters "", not an empty string.
As mentioned in my comment below, if the first field of your records always begins with an 8-digit date, then you could simplify (this is less generic than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv
Essentially that says: if this is the first record to process, print it (it's a header), OR (||) if the first field is an 8-digit number surrounded by double quotes, print it.
I would like to extract the lines whose date in the second field ($2) falls between 5th Apr and 10th Apr. There are many gzip files in the directory.
Inputs.gz
Des1,DATE,Des1,Des2,Des3
ab,01-APR-15,10,0,4
ab,04-APR-15,25,0,12
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
ab,11-APR-15,85,0,1
I have tried the command below, but it is incomplete:
zcat Inputs*.gz | awk 'BEGIN{FS=OFS=","} { if ( (substr($2,1,2) >=5) && (substr($2,1,2) <=10) ) print $0 }' > Output.txt
Expected Output
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
Please suggest ...
Try this:
awk -F",|-" '$2 >= 5 && $2 <= 10'
It adds the date delimiter to the FS using the -F flag. To ensure that it's APR of 2015, you could separately add tests like:
awk -F",|-" '$2 >= 5 && $2 <= 10 && $3=="APR" && $4==15'
While this makes the date easy to parse up front, if you want to print it out again, you'll need to reconstruct it with something like _date = $2 "-" $3 "-" $4. And if you need to manipulate the data in general, you'd want to add back in the BEGIN {OFS=","} part.
The field numbering I used assumes there are no "-" delimiters in the first field.
I get the following output:
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
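The reconstruction mentioned above can be sketched as follows. The helper name april_range and the choice to print only the first value column are assumptions for the example; the point is that the date has to be glued back together and OFS set once the record is split on both "," and "-".

```shell
# april_range: hypothetical helper; keeps 05-APR-15 .. 10-APR-15 and prints
# the first field, the rebuilt date and the first value column.
april_range() {
  awk -F',|-' -v OFS=',' '
    $2 >= 5 && $2 <= 10 && $3 == "APR" && $4 == 15 {
      date = $2 "-" $3 "-" $4   # put the split date back together
      print $1, date, $5
    }
  '
}

printf '%s\n' \
  'Des1,DATE,Des1,Des2,Des3' \
  'ab,04-APR-15,25,0,12' \
  'ab,05-APR-15,40,0,6' \
  'ab,11-APR-15,85,0,1' | april_range
# prints: ab,05-APR-15,40
```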
If you have a whole mess of dates and you really only care about the one in the 2nd field via comma delimiters, you could use split like:
awk -F"," '{ split($2, darr, "-") } darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15'
which is like saying:
for every line, parse the 2nd field into the darr array using the - delimiter
for every line, if the logic darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15 is true print the whole line.
Another simple solution uses a regular expression:
awk -F',' '$2 ~ /([0][5-9]|10)-APR-15/{ print $0 }' txt
-F sets the field separator.
$2 is the second field.
~ matches against a regular expression.
/([0][5-9]|10)-APR-15/ is a regular expression that matches 05 to 09 or 10, followed by -APR-15.
Setting the field separator in a BEGIN block instead:
awk 'BEGIN{ FS="," } $2 ~ /([0][5-9]|10)-APR-15/{ print $0 }' txt
Using an explicit list of day numbers:
awk 'BEGIN{ FS="," } $2 ~ /(05|06|07|08|09|10)-APR-15/{ print $0 }' txt
I have a rather big file with 255 comma-separated columns and I need to print out every third column only.
I was trying something like this:
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints everything in one long column. Can anybody help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");}
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
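It behaves like that for two reasons. By default awk splits fields on whitespace, so you have to tell it to split on commas, which is done using the FS variable or the -F switch. And print ends every call with a newline, which is why each value lands on its own line; use printf and emit the newline only after the last selected field. Besides that, the first field is number one. Field zero is the whole line, so also change the initial value of the for loop:
awk -F',' '{ for (i=1;i<=NF;i+=3) printf "%s%s", $i, (i+3<=NF ? "," : "\n") }' file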
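One way to keep each record on a single row is to write the separator before every field except the first and to end the row with a single print. A minimal sketch; the helper name every_third is made up here, and the input is the example from the other answer.

```shell
# every_third: hypothetical helper; prints fields 1, 4, 7, ... of each record
# on one output row, joined by commas.
every_third() {
  awk -F',' -v OFS=',' '{
    for (i = 1; i <= NF; i += 3)
      printf "%s%s", (i > 1 ? OFS : ""), $i   # separator before all but the first
    print ""                                  # end the row
  }' "$@"
}

printf '%s\n' '1,2,3,4,5,6,7,8,9,10' | every_third
# prints: 1,4,7,10
```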
I am a beginner in AWK, so please help me learn it. I have a text file named snd and its contents are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column value is the minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. Only if every value in the 3rd column is negative could you also skip the NR == 1 ... test, since min's value at the beginning of the script is zero; with positive values, $3 < min would never be true and no line would be remembered.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{x=$3;line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x {x=$3;line=$0} -- On each line, compare the third field against our stored value, and if the condition is true, store the new minimum and the line. (We could make this run only on NR>1, but it doesn't matter.)
END {print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
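Rows that are not already in ascending order make a better test than the sample: the stored value x has to be updated together with the line, otherwise a later row that only beats the first row would win. A small sketch, with a made-up helper name min_row and made-up rows:

```shell
# min_row: hypothetical helper; prints the row whose third field is smallest.
min_row() {
  awk '
    NR == 1 { x = $3; line = $0 }   # seed with the first row
    $3 < x  { x = $3; line = $0 }   # new minimum: remember value and row
    END     { print line }
  ' "$@"
}

printf '%s\n' '1 0 141' '1 2 90' '1 3 250' '1 4 120' | min_row
# prints: 1 2 90
```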
A short answer for this would be:
sort -k3,3n temp|head -1
since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I always prefer the shorter one.
To calculate the smallest value in any column, say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
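This prints only the smallest value in the column.
If the complete line is needed, it is better to use sort (without -r, so that the smallest value comes first) and keep the first line:
sort -n -t [delimiter] -k[column] [file name] | head -n 1
The same in awk: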
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
This prints the line with the smallest value; if several lines share that value, the last one encountered is printed, because a tie ($NF equal to a) also replaces b.
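If the earliest of several equal minima is wanted instead, a strict comparison is enough. A minimal sketch, assuming ";" as the separator as above; the helper name first_min_row and the sample rows are invented.

```shell
# first_min_row: hypothetical helper; on ties, the earliest row wins because
# the comparison is strict.
first_min_row() {
  awk -F';' '
    NR == 1 || $NF < a { a = $NF; b = $0 }
    END { print b }
  ' "$@"
}

printf '%s\n' 'x;5' 'y;2' 'z;2' | first_min_row
# prints: y;2
```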
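To get the minimum and maximum of field 2 for each value of field 1:
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if ( b[$1]<$2 || b[$1]=="") {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" (and likewise for b) because the first time a value of field 1 is encountered, a[$1] is still empty. Without that test the numeric comparison treats the empty entry as 0, so a minimum above 0 or a maximum below 0 would never be stored.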
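A variant of the same idea tests membership with in instead of comparing against an empty string, which also covers keys whose values are all negative. This is a sketch only: the helper name min_max_by_key, the array names lo and hi and the sample rows are made up, and the output is piped through sort because the order of for (k in lo) is unspecified.

```shell
# min_max_by_key: hypothetical helper; prints key,min,max of column 2
# grouped by column 1.
min_max_by_key() {
  awk 'BEGIN { OFS = FS = "," }
    {
      if (!($1 in lo) || $2 < lo[$1]) lo[$1] = $2
      if (!($1 in hi) || $2 > hi[$1]) hi[$1] = $2
    }
    END { for (k in lo) print k, lo[k], hi[k] }
  ' "$@"
}

printf '%s\n' 'a,5' 'b,-2' 'a,3' 'b,-7' 'a,9' | min_max_by_key | sort
# prints:
# a,3,9
# b,-7,-2
```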