How do I split input.txt so that each output file's running sum of the size column (column 1) stays under 80000?
input.txt
11736 textDYN.txt
65736 textMV.txt
61812 textDYN_1.txt
11750 textGB.txt
$1 < 80000
total sum = 77472, under 80000, so these lines will be output as output_001.txt
11736 textDYN.txt
65736 textMV.txt
total sum = 73562, under 80000, so these lines will be output as output_002.txt
61812 textDYN_1.txt
11750 textGB.txt
$ cat tst.awk
{
sub(/\r$/,"")
sum += $1
}
(NR == 1) || (sum > 80000) {
close(out)
out = sprintf("output_%03d.txt",++cnt)
sum = $1
}
{ print $0 " > " out }
$ awk -f tst.awk file
11736 textDYN.txt > output_001.txt
65736 textMV.txt > output_001.txt
61812 textDYN_1.txt > output_002.txt
11750 textGB.txt > output_002.txt
Change print $0 " > " out to print > out when done testing.
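Once the redirect is in place, the script itself prints nothing to the terminal; you can verify the split with head, which labels each file. Given the sample input above, the expected result would be:

$ awk -f tst.awk file
$ head output_*.txt
==> output_001.txt <==
11736 textDYN.txt
65736 textMV.txt

==> output_002.txt <==
61812 textDYN_1.txt
11750 textGB.txt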
If I understand how you want to split your files, you need to handle three cases: the case where the running sum of $1 is less than 80000 (add to sum), the case where it reaches 80000 or more (update the output filename and reset the sum to the current line's value), and a third rule that writes each line to the current filename.
You could do something like:
awk 'BEGIN {sum=0; cnt=1; fn=sprintf("output_%03d.txt", cnt++)}
sum < 80000 {sum+=$1}
sum >= 80000 {fn=sprintf("output_%03d.txt", cnt++); sum=$1}
{print $0 > fn}
' file
An example using your input data in file would be:
$ awk 'BEGIN {sum=0; cnt=1; fn=sprintf("output_%03d.txt", cnt++)}
> sum < 80000 {sum+=$1}
> sum >= 80000 {fn=sprintf("output_%03d.txt", cnt++); sum=$1}
> {print $0 > fn}
> ' file
Resulting Files
$ cat output_001.txt
11736 textDYN.txt
65736 textMV.txt
$ cat output_002.txt
61812 textDYN_1.txt
11750 textGB.txt
If that isn't what you were attempting, please let me know and I'm happy to help further.
Good morning,
I have 2 files and I want to join them.
I am using awk, but I can use other bash commands.
The problem is that when I use awk, some records that are not in both files do not appear in the final file.
file1
supply_DBReplication, 27336
test_after_upgrade, 0
test_describe_topic, 0
teste2e_funcional, 0
test_latency, 0
test_replication, 0
ticket_dl, 90010356798
ticket_dl.replica_cloudera, 0
traza_auditoria_eventos, 0
Ezequiel1,473789563
Ezequiel2,526210437
Ezequiel3,1000000000
file2
Domimio2,supply_bdsupply-stock-valorado-sherpa
Domimio8,supply_DBReplication
Domimio9,test_after_upgrade
Domimio7,test_describe_topic
Domimio3,teste2e_funcional
,test_latency
,test_replication
,ticket_dl
,ticket_dl.replica_cloudera
,traza_auditoria_eventos
And I wish:
file3
Domimio2,0
Domimio8,27336
Domimio9,0
Domimio7,0
Domimio3,0
NoDomain,0
NoDomain,0
NoDomain,90010356798
NoDomain,0
NoDomain,0
NoDomain,473789563
NoDomain,526210437
NoDomain,1000000000
I am executing this:
awk 'NR==FNR {T[$1]=FS $2; next} {print $1 T[$2]}' FS="," file1 file2
But I got:
Domimio2, 0
Domimio8, 27336
Domimio9, 0
Domimio7, 0
Domimio3, 0
, 0
, 0
, 90010356798
, 0
, 23034
, 0
How can I do it?
Thank you
Assumptions:
join criteria: file1.field#1 == file2.field#2
output format: file2.field#1 , file1.field#2
file2 - if field#1 is blank then replace with NoDomain
file2.field#2 - if no match in file1.field#1 then output file2.field#1 + 0
file1.field#1 - if no match in file2.field#2 then output NoDomain + file1.field#2 (sorted by field#2 values)
One GNU awk idea:
awk '
BEGIN { FS=OFS="," }
NR==FNR { gsub(" ","",$2) # strip blanks from field #2
a[$1]=$2
next
}
{ $1 = ($1 == "") ? "NoDomain" : $1 # if file2.field#1 is missing then set to "NoDomain"
print $1,a[$2]+0
delete a[$2] # delete file1 entry so we do not print again in the END{} block
}
END { PROCINFO["sorted_in"]="@val_num_asc" # any entries leftover from file1 (ie, no matches) are sorted by value and ...
for (i in a)
print "NoDomain",a[i] # print to stdout
}
' file1 file2
NOTE: GNU awk is required for the use of PROCINFO["sorted_in"]. If sorting of the file1 leftovers is not required, then the PROCINFO["sorted_in"]="@val_num_asc" line can be removed from the code.
This generates:
Domimio2,0
Domimio8,27336
Domimio9,0
Domimio7,0
Domimio3,0
NoDomain,0
NoDomain,0
NoDomain,90010356798
NoDomain,0
NoDomain,0
NoDomain,473789563
NoDomain,526210437
NoDomain,1000000000
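If GNU awk is not available, one untested, portable sketch keeps the main logic identical and hands the file1 leftovers to an external sort command in the END block; the sort invocation and the fflush() call are assumptions on my part:

awk '
BEGIN { FS=OFS="," }
NR==FNR { gsub(" ","",$2); a[$1]=$2; next }
{ $1 = ($1 == "") ? "NoDomain" : $1
  print $1, a[$2]+0
  delete a[$2]
}
END { fflush()                      # flush matched lines before the pipe output appears
      cmd = "sort -t, -k2,2n"      # numeric sort on field #2 (assumes sort is available)
      for (i in a)
          print "NoDomain", a[i] | cmd
      close(cmd)
}
' file1 file2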
I have a file in the following format. Column 1 has ~20,000 unique entries, column 2 has ~120,000 different entries, and column 3 has a count associated with column 2. For a single entry in column 1 there can be multiple entries in column 2. For each unique entry in column 1, I am trying to get the ratio of the maximum value to the second maximum value of column 3.
F1.txt
S1 S2 C1
A A1 1
A AA 10
A A6 5
A A0 4
B BB 12
B BC 11
B B1 19
B B9 4
Expected Output
S1 S2 C1
B B1 19 1.58333
A AA 10 2
I can do it in steps like below. But is there a smarter way of doing it in one script?
awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r "}' F1.txt | awk '!seen[$1]++' >del1.txt
awk 'FNR==NR{a[$2]=1; next}FNR==1{print $0;}!a[$2]' del1.txt F1.txt | awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r"}' | awk '!seen[$1]++' >del2.txt
awk 'FNR==NR{a[$1]=$3; next}FNR==1{print $0"\t";"RT"}FNR>1 a[$1]{print $0"\t"$3/a[$1]}' del2.txt del1.txt
#!/usr/bin/awk -f
NR == 1 { print $1, $2, $3; next }   # pass the header through
{ data[$1][$3] = $2 }                # index as group -> count -> subkey
END {
    for (key in data) {
        asorti(data[key], s, "@ind_num_desc")   # sort counts numerically, descending
        print key, data[key][s[1]], s[1], s[1] / s[2]
    }
}
This^^^ assumes an arbitrary permutation of the lines, and requires gawk (which is pretty common) or another implementation with native multi-dimensional "arrays".
If you can make more assumptions about the input, e.g. that it is always grouped by the first column, then you can make it more memory-efficient and get rid of the multi-dimensional arrays: instead of delaying the evaluation until END, calculate the result in a per-line block each time the first column's value changes (and then one last time in END).
To get a different handling of equal numeric values (e.g. to report the "subkey" (column 2) of the first rather than the last encountered occurrence of a value), you could add if (!($3 in data[$1])) or the like, as sketched below.
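A minimal sketch of that first-occurrence variant: only the per-line rule changes, everything else stays the same:

{ if (!($3 in data[$1])) data[$1][$3] = $2 }   # keep the first subkey seen for each value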
Whenever you find yourself creating a pipeline containing awk, there is a very good chance that what you are trying to do can be done in a single call to awk much more efficiently.
A non-GNU awk approach that presumes all field1 'A' records are together and all 'B' records are together (as you show in your sample data) could be:
awk '
FNR==1 { print; next } ## 1st line, output heading
$1 != n { ## 1st field changed
if (n) { ## if n set, output result of last block
printf "%s\t%s\n", rec, max / nextmax
}
rec = $0 ## initialize vars for next block
n = $1
max = $3
nextmax = 1
next ## skip to next record
}
{
if ($3 > max) { ## check if 3rd field > max
rec = $0 ## save record
nextmax = max ## update nextmax
max = $3 ## update max
}
else if ($3 > nextmax) { ## if 3rd field > nextmax
nextmax = $3 ## update nextmax
}
} ## output final block results
END { printf "%s\t%s\n", rec, max / nextmax }
' file
Example Use/Output
With your data in the file file, you would have:
$ awk '
> FNR==1 { print; next } ## 1st line, output heading
> $1 != n { ## 1st field changed
> if (n) { ## if n set, output result of last block
> printf "%s\t%s\n", rec, max / nextmax
> }
> rec = $0 ## initialize vars for next block
> n = $1
> max = $3
> nextmax = 1
> next ## skip to next record
> }
> {
> if ($3 > max) { ## check if 3rd field > max
> rec = $0 ## save record
> nextmax = max ## update nextmax
> max = $3 ## update max
> }
> else if ($3 > nextmax) { ## if 3rd field > nextmax
> nextmax = $3 ## update nextmax
> }
> } ## output final block results
> END { printf "%s\t%s\n", rec, max / nextmax }
> ' file
S1 S2 C1
A AA 10 2
B B1 19 1.58333
Using any awk in any shell on every Unix box and using almost no memory (important since your input file would be huge given your description of it):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 { print; next }
$1 != prev {
    if ( prev != "" ) {
        print prev, val, max, (preMax ? max/preMax : 0)
    }
    prev = $1
    max = ""
}
(max == "") || ($3 > max) {
    val = $2
    preMax = max                 # the old max becomes the runner-up
    max = $3
    next
}
(preMax == "") || ($3 > preMax) {
    preMax = $3                  # track the second-highest value
}
END { print prev, val, max, (preMax ? max/preMax : 0) }
$ awk -f tst.awk F1.txt
S1 S2 C1
A AA 10 2
B B1 19 1.58333
I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays (pos and name), and then I want to loop over them to determine whether each point belongs to the given interval ($1 is the name, $2 is the start, and $3 is the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I get a syntax error: "syntax error near unexpected token `i"
I am a beginner in awk. Any help is highly appreciated. Thanks
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1; }
}
}
END {
for(i = 1; i <=FNR; i++){
print sum[i];
}
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
gave me the desired output:
2
3
1
1
Thank You
Your closing ' is in the wrong place, so the awk command is not terminated properly. Could you please try the following? I couldn't test it since no samples were given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
A non-one-liner form of the above solution:
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[NR] += 1 }
}
}
END{
for(i = 1; i <=length(sum); i++){
print sum[i]
}
}
' file1 file2 > out
As per @EdMorton's comment, following are a few recommendations. Again, these are not tested since no samples were given, but you could try to apply them.
sum[NR] should be sum[FNR] in case you want the index to follow the line number of Input_file2. The difference between NR and FNR is that NR keeps increasing until all Input_file(s) are read, but FNR is RESET to 1 whenever a new Input_file starts being read.
Then, length(sum) could be replaced by the value of FNR, because you are basically looking for the total number of times the loop has to run, which you can get from the FNR value.
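Putting both recommendations together, an untested sketch would be (the +0 in print sum[i]+0 is my own addition, so that windows with no points print 0 instead of an empty line):

awk '
NR==FNR{
  name[NR]=$1
  pos[NR]=$2
  next
}
{
  for(i in name){
    if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1 }
  }
}
END{
  for(i = 1; i <=FNR; i++){
    print sum[i]+0      # +0 prints 0 for windows with no matching points
  }
}
' file1 file2 > out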
I have two files which have a component name and a version number separated by a space:
cat file1
com.acc.invm:FNS_PROD 94.0.5
com.acc.invm:FNS_TEST_DCCC_Mangment 94.1.6
com.acc.invm:FNS_APIPlat_BDMap 100.0.9
com.acc.invm:SendEmail 29.6.113
com.acc.invm:SendSms 12.23.65
cat file2
com.acc.invm:FNS_PROD 95.0.5
com.acc.invm:FNS_TEST_DCCC_Mangment 94.0.6
com.acc.invm:FNS_APIPlat_BDMap 100.0.10
com.acc.invm:SendEmail 29.60.113
com.acc.invm:SendSms 133.28.65
com.acc.invm:distri_cob 110
desired output :
com.acc.invm:FNS_PROD 95.0.5
com.acc.invm:SendSms 133.28.65
The needed output is: all components present in both files whose version in file2 is higher than in file1, considering only the first number (before the first dot); the file2 line is printed.
In the desired output, "com.acc.invm:FNS_PROD" appears because 95 (in file2) > 94 (in file1).
"com.acc.invm:FNS_TEST_DCCC_Mangment" does not appear because 94.1.6 (in file1) vs 94.0.6 (in file2) have the same first number (94 = 94).
I tried the awk code below, but no luck.
tst.awk
{ split($2,a,/\./); curr = a[1]*10000 + a[2]*100 + a[3] }
NR==FNR { prev[$1] = curr; next }
!($1 in prev) || (curr > prev[$1])
/usr/bin/nawk -f tst.awk file1 file2
Any suggestion will be welcome.
According to your statement (only the first decimal position matters), you don't need curr = a[1]*10000 + a[2]*100 + a[3]; just using curr = a[1] would be fine.
As your desired output only contains lines present in both file1 and file2, ($1 in prev) && (curr > prev[$1]) is needed.
{split($2,a,/\./); curr = a[1];}
NR==FNR {prev[$1] = curr; next }
($1 in prev) && (curr > prev[$1])
DEMO
lo@ubuntu:~$ cat f1
com.acc.invm:FNS_PROD 94.0.5
com.acc.invm:FNS_TEST_DCCC_Mangment 94.1.6
com.acc.invm:FNS_APIPlat_BDMap 100.0.9
com.acc.invm:SendEmail 29.6.113
com.acc.invm:SendSms 12.23.65
lo@ubuntu:~$ cat f2
com.acc.invm:FNS_PROD 95.0.5
com.acc.invm:FNS_TEST_DCCC_Mangment 94.0.6
com.acc.invm:FNS_APIPlat_BDMap 100.0.10
com.acc.invm:SendEmail 29.60.113
com.acc.invm:SendSms 133.28.65
com.acc.invm:distri_cob 110
lo@ubuntu:~$ awk -f t.awk f1 f2
com.acc.invm:FNS_PROD 95.0.5
com.acc.invm:SendSms 133.28.65
lo@ubuntu:~$ cat t.awk
{split($2,a,/\./); curr = a[1];}
NR==FNR {prev[$1] = curr; next }
($1 in prev) && (curr > prev[$1])
awk '{ Version = $2
sub( /[.].*/, "", Version)
if ( FNR == NR ) Versionning[ $1] = Version
else if( Versionning[ $1] < Version) print
}' file1 file2
You can adapt the last if to discard lines/products that do not exist in file1 by changing the condition to Versionning[ $1] != "" && Versionning[ $1] < Version, as shown below.
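A sketch of that adaptation (same code, only the condition changed; not separately tested):

awk '{ Version = $2
       sub( /[.].*/, "", Version)            # keep only the first number
       if ( FNR == NR ) Versionning[ $1] = Version
       else if ( Versionning[ $1] != "" && Versionning[ $1] < Version) print
     }' file1 file2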
If we would like to subtract the $17 values of rows whose $1 and $2 are the same: input
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
The desired output should be (1.7511 - (-1.5812) = 3.3323) and (-1 - (-3) = 2):
3.3323
2
First attempt:
awk -F, ' last != $1""$2 && last{ # ONLY when the last key "TargetID + Cpd_number"
print C # differs from the current one, print the subtraction
C=0} # reset accumulator
{ # This block processes each line of the input file
C -= $17 # accumulate the subtraction
line=$0 # keep the current line
last=$1""$2} # store the key in order to track switching
END{ # This block triggers after the complete file is read,
# to print the last result, which cannot be triggered
# during the previous block
print C}' input
It will give the output:
-0.1699
4
The second attempt:
#!/bin/bash
tail -n+2 test > test2 # remove the title/header
awk -F, '$1 == $1 && $2 == $2 {print $17}' test2 >> test3 # print $17 if the $1 and $2 are the same
awk 'NR==1{s=$1;next}{s-=$1}END{print s}' test3
rm test2 test3
test3 will be:
1.7511
-1.5812
-1
-3
The output is:
7.3323
Could any guru kindly give some comments? Thanks!
You could try the awk command below (note that it reads rows in pairs via getline, so it assumes matching rows are adjacent):
$ awk -F, 'NR==1{next} {var=$1; foo=$2; bar=$17; getline;} $1==var && $2==foo{xxx=bar-$17; print xxx}' file
3.3323
2
awk '
BEGIN { FS = "," }
NR == 1 { next } # skip header line
{ # accumulate totals
if ($1 SUBSEP $2 in a) # if key already exists
a[$1,$2] -= $17 # subtract $17 from value
else # if first appearance of this key
a[$1,$2] = $17 # set value to $17
}
END { # print results
for (x in a)
print a[x]
}
' file
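With the posted input, this prints the two differences (in arbitrary order, since for (x in a) does not guarantee any particular order):

3.3323
2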