Conditional Performance of Matching Expression in Gawk - awk
I'm using gawk to match database entries (a simple text file, "fields" separated with ::, one line = one record). I have up to 8 variables I want to match, but the variables are based on user input and don't necessarily exist / may be empty. My logical operator is AND (&&). I only want to perform a match for a particular variable if the variable exists, so that an empty variable does not return "false" for the entire search.
For example, my variables are "date" and "reps". I've tried:
{ if ( date ) { $2 ~ date } && if ( reps ) { $3 ~ reps }}
and I've also tried:
{ if ( date ) { $2 ~ date; && if ( reps ) $3 ~ reps }}
but the "&&" gives a syntax error (there may be other problems, too, of course).
How do I (1) perform a conditional match and (2) how do I string several of those together?
Follow-up: from the answers received so far (thank you!) I can tell I didn't state my logical requirements clearly. What I'm trying to achieve on a per-field basis is: if the variable exists and matches, select the record; but if the variable does not exist, ignore it as a test condition. What I don't want is for a nonexistent variable to still be used as a test condition and cause the record not to be selected. (Also, I'm not concerned about the variable existing and not matching.) For an entire record, I want to apply all existing variables cumulatively.
You might try something like
awk '
# start by accepting all records
{ filter = 1 }
# then, individually, see if any conditions fail
filter && date && $2 !~ date { filter = 0 }
filter && reps && $3 !~ reps { filter = 0 }
# ...
# then print the record if it has passed all the conditions
filter
'
Add in your mechanism to pass variables into awk, as sketched below.
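For example, assuming the shell variables are named DATE and REPS and the fields really are ::-separated (both details taken from the question), the full invocation might look like this sketch, with database.txt standing in for your data file:

awk -F'::' -v date="$DATE" -v reps="$REPS" '
# start by accepting all records
{ filter = 1 }
# then, individually, see if any conditions fail
filter && date && $2 !~ date { filter = 0 }
filter && reps && $3 !~ reps { filter = 0 }
# then print the record if it has passed all the conditions
filter
' database.txt

An empty shell variable arrives in awk as an empty string, which is false, so its test is skipped automatically.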
You can string them together into one giant condition, but readability suffers:

if ((!date || $2 ~ date) && (!reps || $3 ~ reps) && ...)
awk -F'yourfs' -v date="$DATE" -v reps="$REPS" '
BEGIN{ if(!date || !reps) exit }
$1 ~ date && $2 ~ reps
' file
How do I (...) perform a conditional match

Take a look at the so-called ternary operator, condition ? valueiftrue : valueiffalse, and consider the following example. Let's say you have file.txt as follows:
0 abc
1 abc
0 def
1 def
and the 1st column determines whether the check is to be made, the check being that the 2nd column is abc. Then you can do
awk '$1?$2=="abc":1' file.txt
which will give
0 abc
1 abc
0 def
(tested in GNU Awk 5.0.1)
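Applied to the original question's ::-separated records, the same ternary idea can make the match conditional on the variable being non-empty. A minimal sketch, assuming date is passed in with -v:

awk -F'::' -v date="$DATE" 'date ? ($2 ~ date) : 1' file

An empty date selects every record; a non-empty one selects only the records whose 2nd field matches it.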
Value of an assignment in awk, ternary operator parsing in awk
I'm new to awk and playing around with it. When trying to use the ternary operator, at some point I wanted to execute two operations upon a true condition, and as I couldn't find the syntax to do so, I tried to smuggle one of the two operations inside the condition to take advantage of the lazy evaluation. I have an input file as follows:

file.csv

A,B,C,D
1,,,
2,,,
3,,,
4,,,
5,,,
6,,,
7,,,
8,,,

And I'd like, for the sake of the exercise, to assign B and C to 0 if A is less than 5; C to 1 if A is 5 or more. I guess the ternary operator is a terrible way to do this, but that is not my point. The question is: why does the following line output this? How does awk parse this expression?

awk '(FNR!=1){$1<5 && $3=0 ? $2=0 : $2=1}{print $0}' FS=, OFS=, file.csv

Output:

1,1,1,
2,1,1,
3,1,1,
4,1,1,
5,,,
6,,,
7,,,
8,,,

I was expecting the $3=0 expression to be executed and evaluated to true, and to be skipped when the first part of the condition ($1<5) is false. Expected result:

1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,

Extra question: can I actually use the ternary operator and have several instructions in it executed depending on the condition value? Is it only bad practice, or actually impossible?
1st solution: You could use code like this, written and tested with your shown samples and attempts. I have used ternary operators to check whether the value of the 1st field is less than 5 and, based on that, to set the values of the 2nd and 3rd fields:

awk '
BEGIN { FS=OFS="," }
FNR==1{
  print
  next
}
{
  $2=($1<5?0:1)
  $3=($1<5?0:$3)
}
1
' Input_file

2nd solution (generic approach): If you have to pass N fields to be checked, it is better to create a function and do the checks and assignments there, again using ternary operators. Where:

threshold is an awk variable assigned the value 5, against which you want to compare the 1st field.
fieldCompare is again an awk variable, which contains 1 in this case since we want to compare the 1st field's value to 5 here.
checkValue is the function to which the field numbers (e.g. 2 and 3 in this case) are passed as comma-separated values, to be checked in a single shot within the function.

awk -v threshold="5" -v fieldCompare="1" '
function checkValue(fields){
  num=split(fields,arr,",")
  for(i=1;i<=num;i++){
    fieldNum = arr[i]
    $fieldNum = ($fieldCompare<threshold?0:$fieldNum)
  }
}
BEGIN { FS=OFS="," }
FNR==1{
  print
  next
}
checkValue("2,3")
1
' Input_file
If I look at the expected outcome, the 2nd field should be one: set fields 2 and 3 to zero if field 1 is smaller than five, else set field 2 to one. The 1 at the end (}1) evaluates to true and will print the whole line.

awk 'BEGIN{FS=OFS=","}(FNR!=1){($1 < 5) ? $2=$3=0 : $2=1}1' file.csv

Output:

A,B,C,D
1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,
If you want to write cryptic code, this is one way to do it. You don't even need the ternary operator.

$ awk 'BEGIN {FS=OFS=","} NR>1 {$2=$1>=5 || $3=0 }1' file
A,B,C,D
1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,
"I was expecting the $3=0 expression to be executed and evaluated to true"

The result of an assignment is the value assigned. Zero is false.

"... and being skipped when the first part of the condition ($1<5) is false."

Since && has a higher precedence than ?:, and ?: has a higher precedence than =, awk is doing this:

$1<5 && ($3 = (0 ? $2=0 : $2=1))

When $1 < 5, if 0 is true (it is not), then assign $3 the result of $2 = 0, else assign $3 the result of $2 = 1. When $1 >= 5, do nothing.
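The first point (an assignment evaluates to the value assigned) is easy to check in isolation. A minimal sketch:

awk 'BEGIN {
    r = (x = 0) ? "true" : "false"; print r   # assignment yields 0 -> false
    r = (x = 7) ? "true" : "false"; print r   # assignment yields 7 -> true
}'

This prints false, then true.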
Tested and confirmed working on mawk-1, mawk-2, gawk, and nawk; the only difference between them is the order of precedence in the 3rd section:

{g,n}awk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || ($_=_^_<+$!_) || $(_--)=!++_ ""'

or

mawk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || ($_=_^_<+$!_) || $++_ = !--_ ""'

A,B,C,D
1,0,0,
2,0,0,
3,0,0,
4,0,0,
5,1,,
6,1,,
7,1,,
8,1,,

Concatenating with an empty string ("") at the tail ensures a printout for a zero-value assignment.
Loop through files in a directory and select rows based on column value using awk for large files
I have 15 text files (each about 1.5-2 GB) in a folder, each with about 300,000 to 500,000 rows and about 250 columns, and each with a header row of column names. I also have a list of five values ("a123", "b234", "c345", "d456", and "e567"). (These are arbitrary values, the values are not in order, and they have no relation to each other.)

For each of the five values, I would like to query each of the 15 text files and select the rows where "COL_ABC" or "COL_DEF" equals the value. ("COL_ABC" and "COL_DEF" are arbitrary names and the column names have no relation to each other.) I do not know which column number "COL_ABC" or "COL_DEF" is. They differ between files because each file has a different number of columns, but "COL_ABC"/"COL_DEF" would be named "COL_ABC"/"COL_DEF" in each of the files. Additionally, some of the files have both "COL_ABC" and "COL_DEF", but others have only "COL_ABC". If only "COL_ABC" exists, I would like to do the query on "COL_ABC", but if both exist, I would like to do the query on both columns (i.e. check if "a123" is present in either "COL_ABC" or "COL_DEF" and select the row if true).

I'm very new to awk, so forgive me if this is a simple question. I am only able to do simple filtering such as:

awk -F "\t" '{ if(($1 == "1") && ($2 == "2")) { print } }' file1.txt

For each of the fifteen files, I would like to print the results to a new file. Typically I could do this in R, but my files are too big to be read into R. Thank you!
Assuming:

The input filenames have the form "*.txt".
The columns are separated by a tab character.
Each of the five values is compared with the target column (COL_ABC or COL_DEF) one by one, and individual result files are created according to the value. Then 15 x 5 = 75 files will be created. (If this is not what you want, please let me know.)

Then would you please try:

awk -F"\t" '
BEGIN {
    values["a123"]                   # assign values
    values["b234"]
    values["c345"]
    values["d456"]
    values["e567"]
}
FNR==1 {                             # header line
    for (i in values) {              # loop over values
        if (outfile[i] != "") close(outfile[i])   # close previous file
        outfile[i] = "result_" i "_" FILENAME     # filename to create
        print > outfile[i]           # print the header
    }
    abc = def = 0                    # reset the indexes
    for (i = 1; i <= NF; i++) {      # loop over the column names
        if ($i == "COL_ABC") abc = i        # "COL_ABC" is found: assign abc to the index
        else if ($i == "COL_DEF") def = i   # "COL_DEF" is found: assign def to the index
    }
    next
}
{
    for (i in values) {
        if (abc > 0 && $abc == i || def > 0 && $def == i)
            print > outfile[i]       # abc_th column or def_th column matches i
    }
}
' *.txt

If your 15 text files are located in a directory, e.g. /path/to/the/dir/, and you want to specify the directory as an argument, change the *.txt in the last line to /path/to/the/dir/*.txt.
for f in file*.txt; do
    awk -F'\t' '
        BEGIN {
            n1="COL_DEF"
            n2="COL_ABC"
            val["a123"]
            val["b234"]
            val["c345"]
            val["d456"]
            val["e567"]
        }
        NR==1 {
            for(i=1; i<=NF; i++) col[$i]=i
            c=col[n1]
            if(!c) c=col[n2]
            next
        }
        $c in val { print }
    ' "$f" > "$f.new"
done

We don't really need to set n1 and n2 (we could use the string values directly), but it keeps all definitions in one place.
awk doesn't have a very nice way to declare all elements of an entire array at once, so we set the val elements individually (alternatively, for simple values, we could use split).
On the first line of the file (NR==1), we store the header names, then immediately look up the ones we care about and store the index in c: we take col[n1] if it is defined (non-zero), otherwise col[n2], as the column index to be searched.
next skips the remaining awk actions for this line.
Then, for every remaining line, we check whether the value in the relevant column is one of the values in val and, if so, print that line.

The awk script is enclosed in a bash for loop, and we write output to a new file based on the loop variable. (This could all be done in awk itself, but this way is easy enough.)
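Note the question also asks to search both columns when both exist, whereas this script searches only one of them. A hedged tweak (an untested sketch) that would replace the two non-BEGIN blocks:

NR==1 {
    for(i=1; i<=NF; i++) col[$i]=i
    c1 = col["COL_ABC"]
    c2 = col["COL_DEF"]
    next
}
(c1 && $c1 in val) || (c2 && $c2 in val)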
AWK script - Not showing data
I'm trying to create a variable to sum columns 26 to 30 and 32. So far I have this code, which prints the header and the output format like I want, but no data is being shown.

#! /usr/bin/awk -f
BEGIN { FS="," }
NR>1 {
    TotalPositiveStats= ($26+$27+$28+$29+$30+$32)
}
{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n", EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}
NR==1 { print "EndYear,Rk,G,Date,Years,Days,Age,Tm,HOme,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats" } #header

Input data:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3

Output expected:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,35
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4,34
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,54
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7,38
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2,29
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,36
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3,51

This script will be called like gawk -f script.awk <filename>. Currently, when calling it, it seems to be calculating the variable, but the rest of the fields are empty.
awk is well suited to summing columns:

awk 'NR>1{$(NF+1)=$26+$27+$28+$29+$30+$32}1' FS=, OFS=, input-file > tmp
mv tmp input-file

That doesn't add a field in the header line, so you might want something like:

awk '{$(NF+1) = NR>1 ? ($26+$27+$28+$29+$30+$32) : "TotalPositiveStats"}1' FS=, OFS=,
An explanation of the issues with the current printf output is covered in the 2nd half of this answer (below).

It appears OP's objective is to reformat three of the current fields while also adding a new field on the end of each line. (NOTE: certain aspects of OP's code are not reflected in the expected output, so I'm not 100% sure what OP is looking to generate; regardless, OP should be able to tweak the provided code to generate the desired result.)

Using sprintf() to reformat the three fields, we can rewrite OP's current code as:

awk '
BEGIN { FS=OFS="," }
NR==1 { print $0, "TotalPositiveStats"; next }
{
    TotalPositiveStats = ($26+$27+$28+$29+$30+$32)
    $17 = sprintf("%.3f",$17)                   # FG_PCT
    if ($20 != "") $20 = sprintf("%.3f",$20)    # 3P_PCT
    $23 = sprintf("%.3f",$23)                   # FT_PCT
    print $0, TotalPositiveStats
}
' raw.dat

NOTE: while OP's printf shows a format of %.2f % for the 3 fields of interest ($17, $20, $23), the expected output shows that the fields are not actually being reformatted (e.g., $17 remains %.3f, $20 is an empty string, $23 remains %.2f); I've opted to leave an empty $20 blank and otherwise reformat all 3 fields as %.3f; OP can modify the sprintf() calls as needed.

This generates:

EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,40
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1.000,3,2,5,5,2,1,3,4,21,19.4,37
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,57
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1.000,2,2,4,5,3,1,6,5,25,14.7,44
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.750,3,2,5,5,1,1,2,4,17,13.2,31
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,41
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.750,4,4,8,5,3,2,5,2,33,29.3,56

NOTE: in OP's expected output it appears the last/new field (TotalPositiveStats) does not contain the value from $30, hence the mismatch between the expected results and this answer; again, OP can modify the assignment statement for TotalPositiveStats to include/exclude fields as needed.

Regarding the issues with the current printf ...

{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n", EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}

... it is referencing (awk) variables that have not been defined (e.g., EndYear, Rk, G). [NOTE: one exception is the very last variable in the list - TotalPositiveStats - which has in fact been defined earlier in the script.]
The default value of an undefined variable is the empty string ("") or zero (0), depending on how the awk code references the variable, e.g.:

printf "%s", EndYear => EndYear is treated as a string and the printed result is an empty string; with an output field delimiter of a comma (,) these empty strings show up as 2 commas next to each other (,,)

printf "%.2f %", FG_PCT => FG_PCT is treated as a number (because of the %f format) and the printed result is 0.00 %

Where it gets a little interesting is when the (undefined) variable name starts with a numeral (e.g., 3P), in which case the P is ignored and the entire reference is treated as a number, e.g.:

printf "%s", 3P => 3P is processed as 3 and the printed result is 3

This should explain the 5 static values (0.00 %, 3, 3, 3.00 % and 0.00 %) printed in all output lines, as well as the 'missing' values between the rest of the commas (e.g., ,,,,). Obviously the last value in the line is an actual number, i.e., the value of the awk variable TotalPositiveStats.
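These defaults are easy to reproduce standalone; a minimal sketch (the variable names are simply borrowed from the question):

awk 'BEGIN {
    printf "[%s]\n", EndYear    # undefined in string context  -> []
    printf "[%.2f]\n", FG_PCT   # undefined in numeric context -> [0.00]
    printf "[%s]\n", 3P         # 3 concatenated with undefined P -> [3]
}'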
Print every line from a large file where the previous N lines meet specific criteria
I'd like to print every line from a large file where the previous 10 lines have a specific value in a specific column (in the example below, column 9 has a value < 1). I don't want to store the whole file in memory. I am trying to use awk for this purpose as follows:

awk 'BEGIN{FS=","}
{
    for (i=FNR,i<FNR+10, i++) saved[++s] = $0 ; next
    for (i=1,i<s, i++) if ($9<1) print saved[s];
    delete saved; s=0
}' file.csv

The goal of this command is to save the 10 previous lines, then check that column 9 in each of those lines meets my criteria, then print the current line. Any help with this, or a suggestion on a more efficient way to do this, is much appreciated!
No need to store anything in memory or do any explicit looping on values. To print the current line if the last 10 lines (inclusive) had a $9 value < 1 is just:

awk -F, '(c=($9<1?c+1:0))>9' file

Untested, of course, since you didn't provide any sample input or expected output, so check the math; but that is the right approach, and if the math is wrong, the tweak to fix it is just to change >9 to >10 or whatever you need.
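The same logic spread out and commented, in case the condensed form is hard to read (the behavior is identical):

awk -F, '
{ c = ($9 < 1) ? c + 1 : 0 }   # c = length of the current run of lines with $9 < 1
c > 9                          # print once the run covers this line and the 9 before it
' file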
Here is a solution for GNU Awk:

chk_prev_lines.awk

BEGIN {
    FS=","
    CMP_LINE_NR=10
    CMP_VAL = 1
}
FNR > CMP_LINE_NR {
    ok = 1
    # check the stored values
    for( i = 0; i < CMP_LINE_NR; i++ ) {
        if ( !(prev_Field9[ i ] < CMP_VAL) ) {
            ok = 0
            break  # early return
        }
    }
    if( ok ) print
}
{
    # store $9 for the comparison
    prev_Field9[ FNR % CMP_LINE_NR ] = $9
}

Use it like this: awk -f chk_prev_lines.awk your_file

Explanation:

CMP_LINE_NR determines how many values from previous lines are stored.
CMP_VAL is the value used for the comparison.
The condition FNR > CMP_LINE_NR takes care that the first line whose previous lines are checked is line CMP_LINE_NR + 1, the first one with that many previous lines.
The last action stores $9. This action is executed for all lines.
AWK script for two columns
I have two columns like this:

(A)      (B)
Adam     30
Jon      55
Robert   35
Jokim    99
Adam     32
Adam     31
Jokim    88

I want an AWK script to check whether Adam (or any name) in column A becomes 30 in column B, and if so, delete all Adam entries from column A (it does not matter whether Adam becomes 31 or 32 later) and then print the results. In reality I have a log list, and I do not want the code to depend on "Adam". So what I want exactly is: wherever 30 exists in $2, delete the respective value in $1, and also search $1 for all values which are the same as the deleted value.
You can read the columns into variables and check the value of the second column for the value you are looking for, then sed the file to delete all the column 1 entries:

cp test.txt out.txt && CHK=30 && while read a b; do
    [ "${b}" = "${CHK}" ] && sed -i "/^${a}/d" out.txt
done < test.txt

Note: if you may have regex values in the columns, you may need to escape them; also, if you may have blanks, you may want to check for null first before the test on column 2.

And since you specified AWK, here is a somewhat elegant awk way to do this, using a check flag to suppress printing of a name once it has been seen with the target value:

awk -vCHK=30 '{if($2~CHK)block=$1; if($1!=block)print}' test.txt
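With the sample data above saved as test.txt, the awk one-liner should print the following (a sketch, not a verified run; note the flag only suppresses a name from its CHK match onward, which works here because the Adam 30 line comes first):

$ awk -vCHK=30 '{if($2~CHK)block=$1; if($1!=block)print}' test.txt
Jon      55
Robert   35
Jokim    99
Jokim    88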
To remove the entries from the first occurrence of Adam, 30:

$1 == "Adam" && $2 == 30 { found = 1 }
!(found && $1 == "Adam")

To remove all Adam entries if any Adam, 30 exists:

$1 == "Adam" && $2 == 30 { found = 1 }
!(found && $1 == "Adam") { lines[nlines++] = $0 }
END { for (i = 0; i < nlines; i++) print lines[i] }

To remove all names which have a 30 in the second column:

NR == FNR && $2 == 30 { foundnames[$1] = 1 }
NR != FNR && !($1 in foundnames)

You must call this last version with the input filename twice, i.e. awk -f process.awk file.txt file.txt
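For instance, with the question's sample saved as file.txt, the two-pass version should keep every line whose name never appears alongside 30 (an untested sketch of the run):

$ awk -f process.awk file.txt file.txt
Jon      55
Robert   35
Jokim    99
Jokim    88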