awk: sum and collect values in groups

This question is about an awk script (a follow-up to a previous question from a few weeks ago), but it is a bit more complicated.
The input file looks like this:
Group1
id val1 val2
---------------------------
idone 2 10
idone 3 12
idone 6 9
idtwo 8 3
idtwo 14 1
Subtotal 33 35
Group2
id val1 val2
------------------------
idone 2 3
idone 1 4
idtwo 3 6
idtwo 4 7
Subtotal 10 20
Total 43 55
There might be more groups, and more entries within each group; I limited my example to two detail names (idone, idtwo) and two groups.
The purpose is to have them summarized per id, with a result like this:
val1 val2
idone 14 38
idtwo 29 17
total 43 55
The output layout is free to choose; if you prefer, it may look like this as well:
total_idone_val1=14
total_idone_val2=38
total_idtwo_val1=29
total_idtwo_val2=17
overall_total_val1=43
overall_total_val2=55

give this awk cmd a try (the numeric test $2+0==$2 keeps only real data lines, so the repeated "id val1 val2" headers are skipped along with the Subtotal/Total lines):
awk 'NF==3 && $2+0==$2 && !/[Tt]otal/ { v1[$1]+=$2; v2[$1]+=$3 }
END {
  print "id","v1","v2"
  for (x in v1) {
    print x, v1[x], v2[x]
    s1 += v1[x]
    s2 += v2[x]
  }
  print "total", s1, s2
}' file
it gives:
id v1 v2
idtwo 29 17
idone 14 38
total 43 55
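Note that for (x in v1) visits the ids in arbitrary hash order, which is why idtwo prints before idone here. If you want the ids sorted, a small variant (assuming GNU awk, whose PROCINFO["sorted_in"] controls iteration order):
awk 'NF==3 && $2+0==$2 && !/[Tt]otal/ { v1[$1]+=$2; v2[$1]+=$3 }
END {
  PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk only: iterate keys in ascending string order
  print "id","v1","v2"
  for (x in v1) { print x, v1[x], v2[x]; s1+=v1[x]; s2+=v2[x] }
  print "total", s1, s2
}' file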


Print Pandas Unique Rows by Column Condition

I am trying to print the rows of a pandas DataFrame where a condition is met, based on the unique values in the DataFrame. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need the result to print the rows where the max in the 'temp' column occurs for each site, such as this for the final result:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
output (note this takes each column's max independently, so a row here need not exist in the data):
temp month day
site
A 22 9 18
B 9 5 23
Let us do sort_values + drop_duplicates (sorting by temp descending first means drop_duplicates keeps each site's highest-temp row):
df = df.sort_values('temp', ascending=False).drop_duplicates('site')
Out[190]:
site temp month day
2 A 22 9 3
3 B 9 4 23
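If you need the whole row where each site's temp is maximal (including month and day), a sketch using idxmax (recreating the sample frame from the question):
import pandas as pd

df = pd.DataFrame({
    'site':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'temp':  [15, 11, 22, 9, 3, -1],
    'month': [7, 6, 9, 4, 2, 5],
    'day':   [18, 12, 3, 23, 11, 18],
})

# idxmax returns the index label of the max temp within each site;
# .loc then pulls back the complete rows, so month/day stay aligned.
print(df.loc[df.groupby('site')['temp'].idxmax()])
This prints rows 2 (A 22 9 3) and 3 (B 9 4 23), matching the expected result.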

SQL: Subtracting certain rows with restrictions from a data table into a new table

I have a data table in PostgreSQL with these columns and some rows like this:
st  epochnum  satnum  l1  l2  c1  p1  p2
1   1         1       10  11  12  13  14
1   1         2       15  16  17  18  19
1   2         1       20  21  22  23  24
1   2         2       25  26  27  28  29
20  1         1       30  41  52  63  74
20  1         2       75  76  87  88  null
20  2         1       ...
I want to pair up rows that have the same epochnum and satnum but different values of st. I also have a list specifying which st pairs should be subtracted; it is just another table that looks like this:
st1  st2
1    20
The rows in the first table have to be subtracted in l1, l2, c1, p1 and p2 (matching on epochnum and satnum) according to this pairs table, and the results inserted into a new table like this:
epochnum  st1  st2  satnum  dl1  dl2  dc1  dp1  dp2
1         1    20   1       20   30   40   50   60
1         1    20   2       60   60   70   70   null
...
The actual data has more than 400,000 rows with matching epochnums and satnums like this. I have tried Java in NetBeans, using loops to issue a query per row and build the new table.
But I think that is inefficient and takes unnecessarily long because of the many queries that have to be run from Java.
I wonder if this can be done with just a few queries, perhaps by creating extra tables. I haven't come up with the best solution yet.
Are you looking for joins like this?
select t1.st as st1, t2.st as st2, t1.epochnum, t1.satnum,
       (t2.l1 - t1.l1) as dl1,
       (t2.l2 - t1.l2) as dl2,
       (t2.c1 - t1.c1) as dc1,
       (t2.p1 - t1.p1) as dp1,
       (t2.p2 - t1.p2) as dp2
from t t1 join
     t t2
     on t1.epochnum = t2.epochnum and
        t1.satnum = t2.satnum join
     pairs p
     on t1.st = p.st1 and t2.st = p.st2
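To materialize the differences in one statement, a sketch (assuming the data table is named t, the pairs table pairs, and a new table diffs; adjust the names to your schema):
create table diffs as
select t1.epochnum, p.st1, p.st2, t1.satnum,
       (t2.l1 - t1.l1) as dl1,
       (t2.l2 - t1.l2) as dl2,
       (t2.c1 - t1.c1) as dc1,
       (t2.p1 - t1.p1) as dp1,
       (t2.p2 - t1.p2) as dp2
from pairs p
join t t1 on t1.st = p.st1
join t t2 on t2.st = p.st2
         and t2.epochnum = t1.epochnum
         and t2.satnum = t1.satnum;
A single set-based statement like this lets PostgreSQL resolve the matches with one join instead of 400,000 round trips from Java.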

Match data values for two tables in Teradata Sql

Using Teradata: I have two tables with 10 records and 3 variables. All columns and values are the same except for three values in one variable.
My task is to make code changes so that table2 matches table1 again, without hard-coding any value.
The second table was created from the first table, so there is no way to pick the values by a join on them, etc.
Code :
Create multiset table table2 as (
Select * from table1 )
With data primary index(var1);
Eg:
table1:
Var1  Var2  Var3
1     Abc   20
2     Cde   30
3     kgk   87
4     kjj   98
5     gvy   67
6     jbn   78
7     hvb   56
8     ihg   62
9     jhn   22
10    hbn   34
table2:
Var1  Var2  Var3
1     Abc   20
2     Cde   30
3     kgk   87
4     kjj   98
5     gvy   67
6     jbn   78
7     hvb   56
8     ihg   77
9     jhn   56
10    hbn   23
Not sure what you want, but you can find all the matching records using exists as follows:
select t.* from table2 t
where exists
    (select 1 from table1 tt
     where t.var1 = tt.var1 and t.var2 = tt.var2)
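If the goal is to find the rows that differ and pull table2 back in line with table1 without hard-coding values, a sketch (assuming var1 is the row key; update ... from is the usual Teradata form):
-- rows in table2 whose var3 no longer matches table1
select t2.*
from table2 t2
join table1 t1 on t1.var1 = t2.var1
where t1.var3 <> t2.var3;

-- copy the original values back
update t2
from table2 t2, table1 t1
set var3 = t1.var3
where t2.var1 = t1.var1
and t2.var3 <> t1.var3;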

calculating sum and average only for selected data set only

I have a dataset as below:
col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8
0 17 215 55.7059 947 BMR_42 O22-BMR_1 O23-H23
1 1 1 1.0000 1 BMR_42 O23-BMR_1 O23-H23
2 31 3 1.0968 34 BMR_31 O22-BMR_1 O26-H26
3 11 2 1.0909 12 BMR_31 O13-BMR_1 O26-H26
4 20 5 1.8500 37 BMR_49 O22-BMR_1 O26-H26
5 24 4 1.7917 43 BMR_49 O23-BMR_1 O26-H26
6 41 2 1.0488 43 BMR_49 O12-BMR_1 O12-H12
7 28 2 1.0357 29 BMR_49 O22-BMR_1 O13-H13
8 1 1000 1000.0000 1000 BMR_49 O13-BMR_1 O13-H13
9 1 1 1.0000 1 BMR_22 O12-BMR_2 O22-H22
10 50 62 18.9400 947 BMR_59 O13-BMR_2 O22-H22
11 1 1 1.0000 1 BMR_59 O25-BMR_2 O23-H23
12 34 5 1.1471 39 BMR_59 O13-BMR_2 O23-H23
13 7 6 2.1429 15 BMR_59 O26-BMR_2 O24-H24
14 6 8 3.6667 22 BMR_59 O25-BMR_2 O24-H24
15 28 2 1.1071 31 BMR_10 O26-BMR_2 O26-H26
16 52 121 15.1346 787 BMR_10 O25-BMR_2 O26-H26
17 65 9 1.9231 125 BMR_10 O13-BMR_2 O26-H26
18 4 4 2.2500 9 BMR_59 O26-BMR_2 O26-H26
19 9 1 1.0000 9 BMR_22 O15-BMR_2 O13-H13
20 1 1 1.0000 1 BMR_10 O11-BMR_2 O16-H16
21 7 2 1.1429 8 BMR_53 O13-BMR_2 O16-H16
22 2 3 2.5000 5 BMR_33 O13-BMR_3 O22-H22
23 97 54 6.8247 662 BMR_61 O26-BMR_3 O22-H22
24 1 1 1.0000 1 BMR_29 O26-BMR_3 O23-H23
25 31 36 3.3226 103 BMR_29 O16-BMR_3 O23-H23
(The real file contains over 2000 lines).
I want to select data under certain criteria and find its sum and average. For example, I want to select the lines containing O22 in both column $7 and column $8 and calculate the sum and average of the values in column $4.
I tried a script as below:
awk '$7 ~ /O22/ && $8 ~ /O22/ {sum += $4} END {print sum, (sum/NR) }' hhsolute.lifetime2.dat
This code selects the lines correctly, but when I calculate the average (sum/NR), I don't get the correct value.
I would appreciate some help with this: how can I get the sum and average for only the data lines I want?
Thanks in advance.
NR is the number of lines read so far (all lines, not just the matching ones), so sum/NR divides by the wrong count; track the matches yourself:
awk -v tgt="O22" '
$7 ~ tgt && $8 ~ tgt { sum+=$4; cnt++ }
END { print sum+0, (cnt ? sum/cnt : 0) }
' file
Try this:
awk '$7~/O22/ && $8~/O22/{++n;sum+=$4}END{if(n) print "Sum = " sum, "Average = " sum/n}' File
If the 7th and 8th fields both contain the pattern O22, add the 4th field's value to sum and increment n. In the END block, print the sum and average.

PostgreSQL: Select particular column and its total row count

I will explain my problem with a sample:
create table foo(id int,idx int,idy int,fld int,fldx varchar);
insert into foo values (1,2,3,55,'AA'),(2,3,4,77,'AB'),(3,4,8,55,'AX'),(9,10,15,77,'AR'),
(3,4,8,11,'AX'),(3,4,8,65,'AX'),(3,4,8,77,'AX');
id,idx,idy, fld,fldx
1 2 3 55 AA
2 3 4 77 AB
3 4 8 55 AX
9 10 15 77 AR
3 4 8 11 AX
3 4 8 65 AX
3 4 8 77 AX
I need to select only the column fld and the total row count of each distinct fld value, in descending order of the count.
Expected Result :
fld count
---------
77 3
55 2
11 1
65 1
select fld
      ,count(fld) as rw_count
from foo
group by fld
order by rw_count desc;
Using group by with positional references:
select fld, count(*) from foo group by 1 order by 2 desc;
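Both queries return the same counts. Note that ties (here 11 and 65 both have count 1) come back in arbitrary order, so if you want deterministic output add a secondary sort key, e.g.:
select fld, count(*) as rw_count
from foo
group by fld
order by rw_count desc, fld;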