I have two separate awk commands that replaces the third column of specific rows of the destination_file with the third column of specific rows of the source file. The commands work but is cumbersome with a lot of IO operations. Is there a way to combine the awk commands into one?
The commands I have at the moment are:
z2=$(awk 'NR==12 { print$1 }' ../source_folder/source_file)
z3=$(awk 'NR==14 { print$1 }' ../source_folder/source_file)
awk -v z2=$z2 'NR==10{$3=z2} 1' destination_file > temp && mv temp destination_file
awk -v z3=$z3 'NR==11{$3=z3} 1' destination_file > temp && mv temp destination_file
As an expansion to this problem, the rows I obtain values for z2, z3, z4 ... are from NR==12, 14, 16 ... and the rows in the destination_file that I want to replace are NR==10, 11, 12 ...; is there a way to write a sequence to do this string replacement operation instead of listing all the z1, z2, z3 ...?
You could combine both of your programs into one. Have only 1 awk program to check both conditions NR==10 and NR==11 and do respective assignments of values to respective fields as per condition. This will save output into output file then.
z2=$(awk 'NR==12 { print$1 }' ../source_folder/source_file)
z3=$(awk 'NR==14 { print$1 }' ../source_folder/source_file)
awk -v z2="$z2" -v z3="$z3" 'NR==10{$3=z2} NR==11{$3=z3} 1' destination_file > temp && mv temp destination_file
You don't need multiple calls to awk and shell variables for this:
awk '
NR==FNR {
if ( FNR==12 ) { map[10] = $1 }
else if ( FNR==14 ) { map[11] = $1; nextfile }
next
}
FNR in map {
$3 = map[FNR]
}
{ print }
' ../source_folder/source_file destination_file
or if your line numbers always increase by 2 in the source file and 1 in the dest file:
awk '
NR==FNR {
if ( (12 <= FNR) && !(FNR%2) ) {
map[10 + (c++)] = $1
}
next
}
FNR in map {
$3 = map[FNR]
}
{ print }
' ../source_folder/source_file destination_file
Related
I have a sparse matrix ("matrix.csv") with 10k rows and 4 columns (1st column is "user", and the rest columns are called "slots" and contain 0s or 1s), like this:
user1,0,1,0,0
user2,0,1,0,1
user3,1,0,0,0
Some of the slots that contain a "0" should be changed to contain a "1".
I have another file ("slots2change.csv") that tells me which slots should be changed, like this:
user1,3
user3,2
user3,4
So for user1, I need to change slot3 to contain a "1" instead of a "0", and for user3 I should change slot2 and slot4 to contain a "1" instead of a "0", and so on.
Expected result:
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
How can I achieve this using awk or sed?
Looking at this post: awk or sed change field of file based on another input file, a user proposed an answer that is valid if the "slots2change.csv" file do not contain the same user in diferent rows, which is not the case in here.
The solution proposed was:
awk 'BEGIN{FS=OFS=","}
NR==FNR{arr[$1]=$2;next}
NR!=FNR {for (i in arr)
if ($1 == i) {
F=arr[i] + 1
$F=1
}
print
}
' slots2change.csv matrix.csv
But that answer doesn't apply in the case where the "slots2change.csv" file contain the same user in different rows, as is now the case.
Any ideas?
Using GNU awk for arrays of arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
users2slots[$1][$2]
next
}
$1 in users2slots {
for ( slot in users2slots[$1] ) {
$(slot+1) = 1
}
}
{ print }
$ awk -f tst.awk slots2change.csv matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
or using any awk:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
if ( !seen[$0]++ ) {
users2slots[$1] = ($1 in users2slots ? users2slots[$1] FS : "") $2
}
next
}
$1 in users2slots {
split(users2slots[$1],slots)
for ( idx in slots ) {
slot = slots[idx]
$(slot+1) = 1
}
}
{ print }
$ awk -f tst.awk slots2change.csv matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
Using sed
while IFS="," read -r user slot; do
sed -Ei "/$user/{s/(([^,]*,){$slot})[^,]*/\11/}" matrix.csv
done < slots2change.csv
$ cat matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
If the order in which the users are outputted doesn't matter then you could do something like this:
awk '
BEGIN { FS = OFS = "," }
FNR == NR {
fieldsCount[$1] = NF
for (i = 1; i <= NF; i++ )
matrix[$1,i] = $i
next
}
{ matrix[$1,$2+1] = 1 }
END {
for ( id in fieldsCount ) {
nf = fieldsCount[id]
for (i = 1; i <= nf; i++)
printf "%s%s", matrix[id,i], (i < nf ? OFS : ORS)
}
}
' matrix.csv slots2change.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
This might work for you (GNU sed):
sed -E 's#(.*),(.*)#/^\1/s/,[01]/,1/\2#' fileChanges | sed -f - fileCsv
Create a sed script from the file containing the changes and apply it to the intended file.
The solution above, manufactures a match and substitution for each line in the file changes. This is then piped through to second invocation of sed which applies the sed script to the csv file.
I have a file that looks like this:
3, abc, x
2, def, y
3, ghi, z
I want to find the highest value in $1 and print all rows that contain this highest value in $1.
sort -t, -k1,1n| tail -n1
would just give one of the rows that contain 3 in $1, but I need both.
Any suggestions are appreciated (:
I’m not sure if this is the nicest way to get lines while they have the same value with awk, but:
awk 'NR == 1 { t = $1; print } NR > 1 { if (t != $1) { exit; } print }'
which can be combined with sort as follows:
sort -t, -k1,1nr | awk 'NR == 1 { t = $1; print } NR > 1 { if (t != $1) { exit; } print }'
There’s also this, but it does unnecessary work:
sort -t, -k1,1nr | awk 'NR == 1 { t = $1 } t == $1 { print }'
Here is another approach that does not require sorting, but requires two passes over the data.
max=$(awk -F',' '{if(max < $1) max = $1}END{print max}' Input.txt )
awk -v max=$max -F',' '$1 == max' Input.txt
In awk, only one pass over the data:
$ awk -F, '
$1>m { # when new max is found
delete a; m=$1; i=0 # reset all
}
a[1]=="" || $1==m { # if $1 equals max or we're processing the first record
a[++i]=$0 # store the record to a
}
END { # in the end
for(j=1;j<=i;j++)
print a[j] # print a with stored records
}
' file
3, abc, x
3, ghi, z
aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input (sample of million of line) i have in csv file.
I want to sum up duration based on date.
My concern is i want to sum up first 1000000 lines
the awk program i'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
i run my script as
$awk -f test.awk 'file'
The input i have doesn't condsidered my condition NR>1 && NR<=1000000
ANY SUGGESTION? PLEASE!
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
sub(/ .*/, "", $3)
key = sprintf("%10s",$3)
duration[key] += $5
}
END {
printf "%-10s %16s\n", "dAccused", "Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with the number 1
You didn't post any expected output and your duration field is always NULL so it's still not clear what you really want output, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
sub(/ .*/,"",$(f["startDate"]))
sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to create an array f[] that maps the field names to their order in each line so instead of having to hard-code that duration is field 4 (or whatever) you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it - use it so your script is not coupled to the order of fields in your input file.
If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
Now the following attempt could do separate the duplicate without any problem. However, the first occurrence of the duplicate will still be included into the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another built similar to glenn jackmans but all in awk
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
I use awk to extract and calculate information from two different files and I want to merge the results into a single file in columns ( for example, the output of first file in columns 1 and 2 and the output of the second one in 3 and 4 ).
The input files contain:
file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
To parse the first file I do this:
awk '
{
s = NF
center = $1
}
{
printf "%s\t %d\n", center, s
}
' file1
To parse the second file I do this:
awk '
/^>/ {
if (count != "")
printf "%s\t %d\n", seq_id, count
count = 0
seq_id = $0
next
}
NF {
long = length($0)
count = count+long
}
END{
if (count != "")
printf "%s\t %d\n", seq_id, count
}
' file2
My provisional solution is create one temporal and overwrite in the second step. There is a more "elegant" way to get this output?
I am not fully clear on the requirement and if you can update the question may be we can help improvise the answer. However, from what I have gathered is that you would like to summarize the output from both files. I have made an assumption that content in both files are in sequential order. If that is not the case, then we will have to add additional checks while printing the summary.
Content of script.awk (re-using most of your existing code):
NR==FNR {
s[NR] = NF
center[NR] = $1
next
}
/^>/ {
seq_id[++y] = $0
++i
next
}
NF {
long[i] += length($0)
}
END {
for(x=1;x<=length(s);x++) {
printf "%s\t %d\t %d\n", center[x], s[x], long[x]
}
}
Test:
$ cat file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
$ cat file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
$ awk -f script.awk file1 file2
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 4 200
ST695_116193610:4:2206:10596:165949 3 0