Faster way to do a lookup in awk - awk
I have a list in a file like the following (in reality around 335 K entries):
abc
efg
hij
I want to look up each entry of this list in a set of files, all of which share the extension .count, so that for each .count file my output is the binary presence of the above list, i.e.:
abc 1
efg 0
hij 1
(just gives me a binary score of 1 for present and 0 for absent)
In my code I loop through each file with the .count extension and look up the binary score for each entry of the list as follows:
awk -v lookup="$block" '$1 == lookup {count++ ; if (count > 0) exit} END {if (count) print 1 ; else print 0}' $file.count
The lookup takes forever, and I wonder if there is another way to expedite it?
First, this doesn't make much sense:
{count++ ; if (count > 0) exit}
can you see why?
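The reason: count++ followed by if (count > 0) is always true once the increment has happened, so the script already exits on the first match and the counter adds nothing. A minimal simplification of the same per-entry lookup might look like this (a sketch; found is just an illustrative variable name):

awk -v lookup="$block" '$1 == lookup {found = 1; exit} END {print found + 0}' "$file.count"

The found + 0 forces numeric output, so 0 is printed when nothing matched.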
Second, you can reduce the looping by loading the lookup list into an array, for example,
awk 'NR==FNR{a[$1];next} {print $1 in a}' lookupfile otherfiles*
will print a 1/0 digit for each line.
To print the ids as well:
awk 'NR==FNR{a[$1];next} {print $1, $1 in a}' lookupfile otherfiles*
UPDATE: fixed the typo
for your example
$ echo -e "abc\ndef\nghi" > lookup
$ echo ghi > file1
$ awk 'NR==FNR{a[$1];next} {print $1, $1 in a}' lookup file1
ghi 1
UPDATE2: enhanced example
It would be easier if the order didn't matter, but the following preserves the order too and can process multiple files in one run. You can tweak how the header is printed (the print f statements).
with this setup
$ echo -e "abc\ndef\nghi" > lookup
$ echo ghi > file1
$ echo abc > file2
you can run
$ awk 'NR==FNR{a[NR]=$1;c++;next}
FNR==1 && f{print f;
for(k=1;k<=c;k++) print a[k], a[k] in b; delete b}
{b[$1]; f=FILENAME}
END{print f;
for(k=1;k<=c;k++) print a[k], a[k] in b; delete b}' lookup file1 file2
file1
abc 0
def 0
ghi 1
file2
abc 1
def 0
ghi 0
Explanation
NR==FNR{a[NR]=$1;c++;next} loads the lookup table into an array in order (awk arrays are hash structures, so iteration order can be arbitrary) and counts the number of entries.
FNR==1 && f{print f; at the start of each file after the first one, prints the name of the file just processed
for(k=1...) print a[k], a[k] in b; delete b} iterates over the lookup table in order, checks whether the file just processed has the corresponding entry, and clears the processed file's values (in b)
{b[$1]; f=FILENAME} loads the entries of each data file and records the filename (which is used above to defer printing until after each file has been fully read)
END{print f; ... the same printing step explained above, now for the last file.
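If a single pass over all files isn't required, a simpler per-file variant of the same idea is possible (a sketch: it loads each .count file into an array first and then walks the lookup file, so the lookup order is preserved automatically):

for f in *.count; do
  echo "$f"
  awk 'NR==FNR{seen[$1];next} {print $1, ($1 in seen)}' "$f" lookup
done

The trade-off is one awk invocation per .count file instead of a single process over all of them.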
Related
selecting columns in awk discarding corresponding header
How to properly select columns in awk after some processing. My file here:

cat foo
A;B;C
9;6;7
8;5;4
1;2;3

I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:

awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo

gives me this unexpected output:

linenumber;A;B;C
1 9 7
2 8 4
3 1 3

but expected is (note B is now the third column as we added linenumber as first):

linenumber;B
1;6
2;5
3;2
To get your expected output, use:

$ awk 'BEGIN { FS=OFS=";" } { print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1) }' file

Output:

linenumber;C
1;9
2;8
3;1

To add a column with line number and extract first and last columns, use:

$ awk 'BEGIN { FS=OFS=";" } { print (FNR==1?"linenumber":FNR-1),$1,$NF }' file

Output this time:

linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why do you print 3 (FNR-1, $1 and $3)? Finally, the reason why your output field separators are spaces instead of the expected ; is simply that you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also using the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}) as you wish (there are differences between these 3 methods but they don't matter here).

[EDIT]: see a generic solution at the end.

If the field you want to keep is the second of the input file (the B column), try:

$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2

or

$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2

Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:

$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2

Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):

$ awk -F\; -v cols='1;3' '
  BEGIN { OFS = ";"; n = split(cols, c); }
  {
    printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
    for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
    printf("\n");
  }' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
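Since the cols variable drives the whole thing, passing cols='2' to the same generic script should (as a quick sanity check) give exactly the B-only output the question asked for:

$ awk -F\; -v cols='2' '
  BEGIN { OFS = ";"; n = split(cols, c); }
  {
    printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
    for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
    printf("\n");
  }' foo
linenumber;B
1;6
2;5
3;2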
linux csv file concatenate columns into one column
I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through. I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to recursively concatenate column 10 with column 11 per row such that every row has exactly 14 columns. In other words, this:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p

will become:

a,b,c,d,e,f,g,h,i,jkl,m,n,o,p

I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row. Examples (by request):

How many columns are in the row?

sed 's/[^,]//g' | wc -c

Get the first 10 columns:

cut -d, -f1-10

Get the last 4 columns:

rev | cut -d, -f1-4 | rev

Concatenate columns 10 and 11, showing columns 1-10 after that:

awk -F',' ' NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11}'
Awk solution:

awk 'BEGIN{ FS=OFS="," }
{
  diff = NF - 14
  for (i = 1; i <= NF; i++)
    printf "%s%s", $i, (diff >= 1 && i >= 10 && i < (10 + diff) ? "" : (i == NF ? ORS : ","))
}' file

The output:

a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
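As a quick check (a sketch using the 18-column sample row that also appears in the perl answer below), fields 10 through 14 should collapse into one, leaving 14 columns:

$ awk 'BEGIN{ FS=OFS="," } { diff = NF - 14; for (i = 1; i <= NF; i++) printf "%s%s", $i, (diff >= 1 && i >= 10 && i < (10 + diff) ? "" : (i == NF ? ORS : ",")) }' <<< '1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4'
1,2,3,4,5,6,3,4,2,43432,5,2,3,4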
With GNU awk for the 3rd arg to match() and gensub():

$ cat tst.awk
BEGIN{ FS="," }
match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) {
    $0 = a[1] gensub(/,/,"","g",a[3]) a[5]
}
{ print }

$ awk -f tst.awk file
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
If perl is okay - it can be used just like awk for stream processing:

$ cat ip.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4

$ awk -F, '{print NF}' ip.txt
16
18
22

$ perl -F, -lane '$n = $#F - 4; print join ",", (@F[0..8], join("", @F[9..$n]), @F[$n+1..$#F])' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4

-F, -lane : split on , with the results saved in the @F array
$n = $#F - 4 : magic number, to ensure the output ends with 14 columns; $#F gives the index of the last element of the array (won't work if an input line has fewer than 14 columns)
join : helps to stitch array elements together with the specified string
@F[0..8] : array slice with the first 9 elements
@F[9..$n] and @F[$n+1..$#F] : the other slices as needed

Borrowing from Ed Morton's regex based solution:

$ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4

$n=$#F-13 : magic number
^([^,]*,){9}\K : first 9 fields
([^,]*,){$n} : fields to change
$&=~tr|,||dr : use tr to delete the commas
e : this modifier allows use of Perl code in the replacement section

This solution also has the added advantage of working even if an input line has fewer than 14 columns.
You can try this gnu sed:

sed -E '
  # replace the 9th and all later commas with newlines
  s/,/\n/9g
  :A
  # repeatedly delete the newline that precedes the last four
  # newline-terminated fields, merging two of the middle fields
  s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/
  tA
  # turn the remaining newlines back into commas
  s/\n/,/g
' infile
First variant - with awk:

awk -F, '
{
  for(i = 1; i <= NF; i++) {
    OFS = (i > 9 && i < NF - 4) ? "" : ","
    if(i == NF) OFS = "\n"
    printf "%s%s", $i, OFS
  }
}' input.txt

Second variant - with sed:

sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt

or, more straightforwardly (without a loop) and probably faster:

sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt

Testing

Input:

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u

Output:

a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r
a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
Solved a similar problem using csvtool. Source file, copied from one of the other answers:

$ cat input.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4

Concatenating columns:

$ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' -
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,,
1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,,
1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4
awk: print each column of a file into separate files
I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use

for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done

But I am getting errors:

awk: illegal field $(), name "i"
 input record number 1, file input.txt
 source line number 1

How do I correctly use $i inside the {print}?
The following single awk command may help you here:

awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > "file"i;close("file"i)}}' Input_file
An all-awk solution. First test data:

$ cat foo
11 12 13
21 22 23

Then the awk:

$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo

and results:

$ ls data*
data2 data3
$ cat data2
11 12
21 22

The for iterates from 2 to the last field. If there are more fields than you desire to process, change the NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:

$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
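One caveat with the close variant (an observation about repeated runs, not from the original answer): because each output file is reopened in append mode for every line, results from a previous run are kept and mixed in. Clearing the targets first avoids that, e.g. assuming the outputs match the data[0-9]* glob:

rm -f data[0-9]* && awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo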
If you want to do what you tried to accomplish:

for i in {2..99}; do
  awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done

Note the -v switch of awk to pass variables: $x is the nth column, defined by your variable x.
Note2: this is not the fastest solution (one awk call is fastest), but I am just trying to correct your logic. Ideally, take the time to understand awk; it's never wasted time.
How to find the difference between two files using multiple conditions?
I have two files file1.txt and file2.txt like below -

cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N

cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

The requirement is to fetch the records which are not available in file1.txt, matching on the condition below:

file1.txt                          file2.txt
col1 (date)                        col1 (date)
col2 (number: 4343250019)          col2 (last digit of the number: 9)
col3 (number)                      col3 (number)
col5 (alphanumeric)                col5 (alphanumeric)

Expected Output:

2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one

This output line isn't available in file1.txt but is available in file2.txt once the matching criteria are applied.

I was trying the below steps to achieve this output -

### Replace the space/tab in file1.txt with a pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1

### Loop over the combination of four columns of file1.txt1 against the corresponding (modified) columns of file2.txt; write the result to output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt

### Finally, replace the "N" in the 8th column with "NULL" if the value is "N"
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1

What is the issue? My 2nd operation is not working, and I would like to combine all 3 operations into one.
awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt

and your sample output is wrong; it should be (the last field is different: data45333):

2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

Commented code:

# separator for both files: blank(s) for the first, `|` for the second
awk -F'[|]|[[:blank:]]+' '
   # for the first file
   FNR==NR{
      # create an index entry based on the 4 fields; the format of the fields
      # allows using them directly, without a separator (unambiguous)
      E[ $1 ( $2 % 10 ) $3 $5 ]++
      # for this line (file), don't go further
      next
      }

   # for the lines of the next file:
   # if not in the index of entries, print the line (default action)
   ! ( ( $1 $2 $3 $5 ) in E ) { print }
   ' file1.txt file2.txt
Input

$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N

$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

Output

$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one

Explanation

awk '                                # call awk
FNR==NR{                             # true only while awk reads the first file
  a[$1,substr($2,length($2)),$3,$5]  # array a indexed by $1, the last char of $2, $3 and $5
  next                               # stop processing, go to the next line
}
!(($1,$2,$3,$5) in a)                # while reading f2, check whether index $1,$2,$3,$5 exists in array a
' f1 FS="|" f2                       # read f1, then set FS, then read f2

FNR==NR : true if the number of records read so far in the current file equals the number of records read so far across all files, a condition which can only be true for the first file read.

a[$1,substr($2,length($2)),$3,$5] : populate array a, indexed by the first field, the last char of the second field, the third field and the fifth field of the current record of file1.

next : move on to the next record so we don't do any processing intended for records of the second file.

!(($1,$2,$3,$5) in a) : if the index constructed from fields $1,$2,$3,$5 of the current record of file2 does not exist in array a, the expression is true (! is the logical NOT operator: it reverses the logical state of its operand), so awk performs its default action and prints $0 from file2.

f1 FS="|" f2 : read file1 (f1), set the field separator to "|" after the first file has been read, then read file2 (f2).

--edit--

When the file size is huge, around 60GB (900 million rows), it's not a good idea to process the file twice. The 3rd operation (replace "N" with "NULL" in column 8, as in awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt) can be folded in:

$ awk 'FNR==NR{
          a[$1,substr($2,length($2)),$3,$5];
          next
       }
       !(($1,$2,$3,$5) in a){
          sub(/N/,"NULL",$8);
          print
       }' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
You can try this awk:

awk -F'[ |]*' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2

Here,

a[] - an associative array
$1":"su":"$3":"$5 - this forms the key for an array index. su is the last digit of field $2 (su=substr($2,length($2),1)). Then, 1 is assigned as the value for this key.
NR==FNR{...;next} - this block handles the processing of f1.

Update:

awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2
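For reference, running the updated one-liner on the f1/f2 samples from the answer above should, if I'm tracing it correctly, reproduce the expected line with NULL in the 8th column:

$ awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one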
awk print line of file2 based on condition of file1
I have two files:

cat file1:
0 xxx
1 yyy
1 zzz
0 aaa

cat file2:
A bbb
B ccc
C ddd
D eee

How do I get the following output using awk:

B ccc
C ddd

My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?

Additional information:

Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):

awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2

On multiple lines it can be clearer (reminder: awk works like condition{action}):

awk '
  NR==FNR{arr[NR]=$1}
  NR!=FNR && arr[FNR]
' file1 file2

With the "clever" parts spelled out as explicit pattern-action pairs:

awk '
  NR == FNR {arr[NR] = $1}
  NR != FNR && arr[FNR] {print $0}
' file1 file2

When awk finds a condition alone (without an action) like NR!=FNR && arr[FNR], it implicitly prints the current record to STDOUT if the expression is TRUE (> 0).

Explanations

NR is the number of the current record from the start of the input
FNR is the ordinal number of the current record in the current file (so NR differs from FNR on the second file)
arr[NR]=$1 : feeds the array arr, indexed by the current NR, with the first column
NR!=FNR : we are in the second file; if the array value for this line number is 1, the line from file2 is printed
Not as clean as an awk solution:

$ paste file2 file1 | sed '/0/d' | cut -f1
B ccc
C ddd

You mentioned something about millions of lines; in order to do just a single pass through the files, I'd resort to python. Something like this perhaps (python 2.7; izip streams the paired lines instead of building a list the way zip would):

from itertools import izip

with open("file1") as fd1, open("file2") as fd2:
    for l1, l2 in izip(fd1, fd2):
        if not l1.startswith('0'):
            print l2.strip()
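Staying with paste but keeping awk (a sketch, assuming neither file contains tab characters, since paste joins each pair of lines with a tab): this tests file1's first field exactly, instead of deleting any line that merely contains a 0 somewhere, and prints file2's line untouched:

paste file1 file2 | awk -F'\t' '{ split($1, c, " "); if (c[1] == 1) print $2 }'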
awk '{ getline value <"file2"; if ($1) print value; }' file1
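A slightly more defensive form of this getline approach (a sketch, not from the original answer) checks getline's return value, so a short or missing file2 doesn't silently reuse the previous value:

awk '{ if ((getline value < "file2") > 0 && $1) print value }' file1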