I would like to subset a file while I keep the separator in the subsetted output using ´awk´ in bash.
That´s what I am using:
The input file is created in R language with:
inp <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AA 0.005 0.55'
inp <- read.table(text=inp, header=F)
write.table(inp, "inp.txt", col.names=F, row.names=F, quote=F, sep="\t")
(So fields are separated by tabs)
The code in bash:
awk {'print $1 $2 $3'} inp.txt
The result:
AX-11125
AX-22456
AX-333445
Please note that my columns were merged in the awkoutput (and I would like it to be tab delimited as the input file). Probably it is a simple syntax problem, but I would be grateful to any ideas.
Use
awk -v OFS='\t' '{ print $1, $2, $3 }'
or
awk '{ print $1 "\t" $2 "\t" $3 }'
Written one after another without an operator between them, variables in awk are concatenated - $1 $2 $3 is no different from $1$2$3 in this respect.
The first solution sets the output field separator OFS to a tab, then uses the comma operator to print separated fields. The second solution simply sprinkles tabs in there directly, and everything is concatenated as it was before.
Related
I have two files file1.txt and file2.txt like below -
cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Requirement is to fetch the records which are not available in
file1.txt using below condition.
file1.txt file2.txt
col1(date) col1(Date)
col2(number: 4343250019 ) col2(last value of number: 9)
col3(number) col3(number)
col5(alphanumeric) col5(alphanumeric)
Expected Output :
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
This output line doesn't available in file1.txt but available in
file2.txt after satisfying the matching criteria.
I was trying below steps to achieve this output -
###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1
### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt
###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1
What is the issue?
My 2nd operation is not working and I am trying to put all 3 operations in one operation.
awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt
and your sample output is wrong, it should be (last field if different: data45333)
2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Commented code
# separator for both file first with blank, second with `|`
awk -F'[|]|[[:blank:]]+' '
# for first file
FNR==NR{
# create en index entry based on the 4 field. The forat of filed allow to use them directly without separator (univoq)
E[ $1 ( $2 % 10 ) $3 $5 ]++
# for this line (file) don't go further
next
}
# for next file lines
# if not in the index list of entry, print the line (default action)
! ( ( $1 $2 $3 $5 ) in E ) { print }
' file1.txt file2.txt
Input
$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Output
$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Explanation
awk ' # call awk.
FNR==NR{ # This is true when awk reads first file
a[$1,substr($2,length($2)),$3,$5] # array a where index being $1(field1), last char from $2, $3 and $5
next # stop processing go to next line
}
!(($1,$2,$3,$5) in a) # here we check index $1,$2,$3,$5 exists in array a by reading file f2
' f1 FS="|" f2 # Read f1 then
# set FS and then read f2
FNR==NR If the number of records read so far in the current file
is equal to the number of records read so far across all files,
condition which can only be true for the first file read.
a[$1,substr($2,length($2)),$3,$5] populate array "a" such that the
indexed by the first
field, last char of second field, third field and fifth field from
current record of file1
next Move on to the next record so we don't do any processing
intended for records from the second file.
!(($1,$2,$3,$5) in a) IF the array a index constructed from the
fields ($1,$2,$3,$5) of the current record of file2 does not exist
in array a, we get boolean true (! Called Logical NOT Operator. It is used to reverse the logical state of its operand. If a condition is true, then Logical NOT operator will make it false and vice versa.) so awk does default operation print $0 from file2
f1 FS="|" f2 read file1(f1), set field separator "|" after
reading first file, and then read file2(f2)
--edit--
When filesize is huge around 60GB(900 millions rows), its not a good
idea to process the file two times. 3rd operation - (replace "N" with
"NULL" from col - 8 ""awk -F'|' '{ gsub ("N","NULL",$8);print}'
OFS="|" output.txt
$ awk 'FNR==NR{
a[$1,substr($2,length($2)),$3,$5];
next
}
!(($1,$2,$3,$5) in a){
sub(/N/,"NULL",$8);
print
}' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
You can try this awk:
awk -F'[ |]*' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2
Here,
a[] - an associative array
$1":"su":"$3":"$5 - this forms key for an array index. su is last digit of field $2 (su=substr($2,length($2),1)). Then, assigning an 1 as value for this key.
NR==FNR{...;next} - this block works for processing f1.
Update:
awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2
$ cat awk.txt
12 32 45
5 2 3
33 11 33
$ cat awk.txt | awk '{FS='\t'} $1==5 {print $0}'
5 2 3
$ cat awk.txt | awk '{FS='\t'} $1==33 {print $0}'
Nothing is returned when judging the first field is 33 or not. It's confusing.
By saying
awk '{FS='\t'} $1==5 {print}' file
You are defining the field separator incorrectly. To make it be a tab, you need to say "\t" (with double quotes). Further reading: awk not capturing first line / separator.
Also, you are setting it every line, so it does not affect the first one. You want to use:
awk 'BEGIN{FS='\t'} $1==5' file
Yes, but why did it work in one case but not in the other?
awk '{FS='\t'} $1==5' file # it works
awk '{FS='\t'} $1==33' file # it does not work
You're using single quotes around '\t', which means that you're actually concatenating 3 strings together: '{FS=', \t and '} $1==5' to produce your awk command. The shell interprets the \t as t, so your awk script is actually:
awk '{FS=t} $1==5'
The variable t is unset, so you're setting the field separator to the empty string "". This means that the line is split into as many fields as characters you have. You can see it doing awk 'BEGIN{FS='\t'} {print NF}' file, that will show how many fields each record has.
Then, $1 is just 3 and $2 contains the second 3.
first of all !. Could you explain better what you really want to do before you ask ?. look....!
more awk.txt
12 32 45
5 2 3
33 11 33
awk -F"[ \t]" '$1 == 5 { print $0}' awk.txt
5 2 3
awk -F"[ \t]" '$1 == 33 { print $0}' awk.txt
33 11 33
awk -F"[ \t]" '$1 == 12 { print $0}' awk.txt
12 32 45
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_23.html
Fcs
I'm doing some simple calculations on a file using awk, but cannot get the output formatting right. The OFS is for some reason only applied to the first line (i.e. only within the BEGIN block), and for other rows a single space is inserted between fields.
Input:
title c1 c2 c3 n
AA 14 6 3 40
BB 8 2 2 38
Oneliner:
cat file.txt | awk -F'\t' 'BEGIN {OFS="\t"; print "Title","Freq1","Freq2","Freq3","Total"}; NR>1{printf "%s %.3f %.3f %.3f %d\n", $1, $2/$5, $3/$5, $4/$5, $5;}' > file2.txt
I've tried removing the header from BEGIN but this does not make a difference, and neither does BEGIN{FS="\t";OFS="\t";...}. I'm using awk in cygwin.
Since you're using printf you are actively avoiding OFS entirely. If you want to incorporate OFS, it gets ugly:
NR>1 {printf "%s%s%.3f%s%.3f%s%.3f%s%d\n", $1, OFS, $2/$5, OFS, $3/$5, OFS, $4/$5, OFS, $5}
Or, don't use printf:
NR>1 {print $1, sprintf("%.3f",$2/$5), sprintf("%.3f",$3/$5), sprintf("%.3f",$4/$5), $5}
I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiples lines it can be clearer (reminder, awk works like this : condition{action} :
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet :
awk '
if (NR == FNR) {arr[NR]=$1}
if (NR != FNR && arr[FNR]) {print $0}
' file1 file2
When awk find a condition alone (without action) like NR!=FNR && arr[FNR], it print by default on STDOUT implicitly is the expressions is TRUE (> 0)
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR is different than FNR on the second file)
arr[NR]=$1 : feeding array arr with indice of the current NR with the first column
if NR!=FNR we are in the next file and if the value of the array if 1, then we print
No as clean as a awk solution
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines, in order to just do a single pass through the files, I'd resort to python. Something like this perhaps (python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
getline value <"file2";
if ($1)
print value;
}' file1
I have two folders A1 and A2. The names and the number of files are same in these two folders. Each file has 15 columns. Column 6 of each file in folder 'A1' needs to substrate from the column 6 of each file in folder 'A2'. I would like to print column 2 and 6(after subtraction) from each file to a folder A3 with the same filenames. How can I do this with awk?
f1.txt file in folder A1
RAM AA 159.03 113.3 122.9 34.78 116.3
RAM BB 151.24 70 122.9 142.78 66.4
RAM CC 156.70 80 86.2 70.1 54.8
f1.txt file in folder A2
RAM AA 110.05 113 122.9 34.78 116.3
RAM BB 150.15 70 122.9 140.60 69.4
RAM CC 154.70 89.2 86.2 72.1 55.8
desired output
AA 0
BB 2.18
CC -2
Try this:
paste {A1,A2}/f1.txt | awk '{print $2,$6-$13}'
In bash: {A1,A2}/f1.txt will expand to A1/f1.txt A2/f1.txt (It's just a shortcut. Never mind.)
I use paste command to merge files vertically.
The awk command is quite simple here.
One way using awk:
awk 'FNR==NR { array[$1]=$2; next } { if ($1 in array) print $1, array[$1] - $2 > "A3/f1.txt" }' ~/A1/f1.txt ~/A2/f1.txt
FIRST EDIT:
Assuming an equal number of files in both directories (A1 and A2) with filenames paired in the way you describe:
for i in A1/*; do awk -v FILE=A3/${i/A1\//} 'FNR==NR { array[$1]=$2; next } { if ($1 in array) print $1, array[$1] - $2 > FILE }' A1/${i/A1\//} A2/${i/A1\//}; done
You will need to create the directory A3 first, or you'll get an error.
SECOND EDIT:
awk 'FNR==NR { array[$2]=$6; next } { if ($2 in array) print $2, array[$2] - $6 > "A3/f1.txt" }' ~/A1/f1.txt ~/A2/f1.txt
THIRD EDIT:
for i in A1/*; do awk -v FILE=A3/${i/A1\//} 'FNR==NR { array[$2]=$6; next } { if ($2 in array) print $2, array[$2] - $6 > FILE }' A1/${i/A1\//} A2/${i/A1\//}; done