I have 2 dataframes. 1 with clients and departments where they belong, 2 with departments and managers located there. In 1 department there can be any number of clients from 1 to 1000, for example, and the number of departments and managers there can also change every day. I need to connect the dataframes so that there is 1 manager per 1 client and so that the distribution of clients is divided exactly between the managers in the department (if it is not exactly divided, then with a difference of 1). For example, if there are 5 clients and 2 managers in one department, then respectively one manager has 3 clients, the second has 2 clients. If the number of managers is more than clients, then distribute the random (if we have 2 clients, but 3 managers.)
df_all = df_c.merge(df_d, how='left', on='dep')
when I do a regular merge, the data is duplicated
df_all_new = df_all.groupby(['cli']).head(1)
Tried through groupby and head(1), but in this case, only 1 manager got it, the rest were lost.
First df with clients and deps
client
dep
001
34
002
34
003
34
004
34
005
34
006
42
007
42
second df with deps and managers
dep
manager
34
847
34
147
47
472
47
198
47
374
and the result should be
client
dep
manager
001
34
847
002
34
847
003
34
847
004
34
147
005
34
147
006
42
472
007
42
198
Related
new to learning sql/postgresql and have been hunting all over looking for help with a query to find only the matching id values in a table so i can pull data from another table for a learning project. I have been trying to use the count command, which doesn't seem right, and struggling with group by.
here is my table
id acct_num sales_tot cust_id date_of_purchase
1 9001 1106.10 116 12-Jan-00
2 9002 645.22 125 13-Jan-00
3 9003 1096.18 137 14-Jan-00
4 9004 1482.16 118 15-Jan-00
5 9005 1132.88 141 16-Jan-00
6 9006 482.16 137 17-Jan-00
7 9007 1748.65 147 18-Jan-00
8 9008 3206.29 122 19-Jan-00
9 9009 1184.16 115 20-Jan-00
10 9010 2198.25 133 21-Jan-00
11 9011 769.22 141 22-Jan-00
12 9012 2639.17 117 23-Jan-00
13 9013 546.12 122 24-Jan-00
14 9014 3149.18 116 25-Jan-00
trying to write a simple query to only find matching customer id's, and export them to the query window.
I have a dataset, in which it has a lot of entries for a single location. I am trying to find a way to sum up all of those entries without affecting any of the other columns. So, just in case I'm not explaining it well enough, I want to use a dataset like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 10 12 14 17 27
Bedford 11 40 34 9 1
Bedford 7 1 2 3 3
Leeds 1 1 2 0 0
Leeds 20 13 6 1 1
Bath 101 20 33 41 3
Bath 11 2 3 1 0
And turn it into something like this:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
Bedford 28 53 50 29 31
Leeds 21 33 39 1 1
Bath 111 22 36 42 3
Now, I have read up that a groupby should work in a way, but from my understanding a group by will change it into 2 columns and I don't particularly want to make hundreds of 2 columns and then merge it all. Surely there's a much simpler way to do this?
IIUC, groupby+sum will work for you:
df.groupby('Locations',as_index=False,sort=False).sum()
Output:
Locations Cyclists maleRunners femaleRunners maleCyclists femaleCyclists
0 Bedford 28 53 50 29 31
1 Leeds 21 14 8 1 1
2 Bath 112 22 36 42 3
Pivot table should work for you.
new_df = pd.pivot_table(df, values=['Cyclists', 'maleRunners', 'femalRunners',
'maleCyclists','femaleCyclists'],index='Locations', aggfunc=np.sum)
I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
in this file there are 5 tab separated columns. the first column is considered as ID. for example in the first row the whole "chr10:103909786-103910082" is ID.
1- in the 1st step I would like to filter out the rows based on the 4th column.
if the number in the 4th column is less than 10 and the same row but in the 5th column the group is BA, that row will be filtered out. also if the number in the 4th column is less than 5 and the same row but in the 5th column the group is CA, that row will be filtered out.
3- 3rd step:
I want to get the ratio of number in 4th column. in fact in the 1st column there are repeated values which represent the same ID. I want to get one ratio per ID, so in the output every ID will be repeated only once. each ID has both BA and CA in the 5th column. for each ID I should get 2 values for CA and BA separately and get the ration of CA/BA as the final value for each ID. to get one value as CA, I should add up all values in the 4th column which belong the same ID and classified as CA and to get one value as BA, I should add up all values in the 4th column which belong the same ID and classified as BA. the last step is to get the ration of CA/BA per ID. the expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output(this ratio is made using the values in 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
in the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk
awk 'ID==$1 {
if (ID) {
print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0;
}
ID=$1
}
$5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
END{ print ID, a["CA"]/a["BA"] }' file.txt
I tried to use the code but did not succeed. this code returns one number. in fact sum of all CA and divides it by sum of all BAs but I want to do that per ID and get the ration per ID. do you know how to solve the problem and correct the code?
awk '$4 >= 5 && $5 == "CA" { a[$1]+=$4 }
$4 >= 10 && $5 == "BA" { b[$1]+=$4 }
END{ for ( i in a) print i,a[i]/b[i]}' file
output:
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
I will explain my problem briefly
have duplicate rino like below (actually this rino is the serial number in front end)
chqid rino branchid
----- ---- --------
876 6 2
14 6 2
18 10 2
828 10 2
829 11 2
19 11 2
830 12 2
20 12 2
78 40 2
1092 40 2
1094 41 2
79 41 2
413 43 2
1103 43 2
82 44 2
1104 44 2
1105 45 2
83 45 2
91 46 2
1106 46 2
here in my case I don't want to delete these duplicate rino instead of that I planned to update the rino having max date(column not specified in the above sample actually a date column is there) to the next rino number
what exactly I meant is :
if I sort out the above result according to the max(date) I will get
chqid rino branchid
----- ---- --------
876 6 2
828 10 2
19 11 2
830 12 2
1092 40 2
79 41 2
413 43 2
82 44 2
83 45 2
1106 46 2
(NOTE : total number of duplicate rows are 10 in branchid=2)
the last entered rino in the table for branchid=2 is 245
So I just want to update the 10 rows(Of column rino) with numbers starting from 246 to 255( just added 245+10 like this select lastno+ generate_series(1,10) nos from tab where cola=4 and branchid = 2 and vrid=20;)
Expected Output:
chqid rino branchid
----- ---- --------
876 246 2
828 247 2
19 248 2
830 249 2
1092 250 2
79 251 2
413 252 2
82 253 2
83 254 2
1106 255 2
using postgresql
Finally I found a solution, am using dynamic-sql to solve my issue
do
$$
declare
arow record;
begin
for arow in
select chqid,rino,branchid from (
select chqid,rino::int ,vrid,branchid ,row_number()over (partition by rino::int ) rn
from tab
where vrid =20
and branchid = 2)t
where rn >1
loop
execute format('
update tab
set rino=(select max(rino::int)+1 from gtab19 where acyrid=4 and branchid = 2 and vrid=20)
where chqid=%s
',arow.chqid);
end loop;
end;
$$;
I have two tables:
person table accident table
pid name phone acc_id pid type
1 Mike 3123223232 132 1 car
2 Kyle 133 3 snow
3 Nick 3124567654 134 4 cold
4 John 3124566663 135 2 sun
5 Pety 4234345453 136 3 hot
137 2 sun
138 3 snow
139 2 cold
140 1 hot
I need to find all accidents acc_id with a reference to each other that incurred to the same person given that she has a valid telephone number
So the result would be the following:
acc_id reference
133 136
133 138
136 133
136 138
138 133
138 136
132 140
140 132
So, person with pid = 3 had accidents 133, 136, 138 and this person has a phone, thus these three acc_id refer to each other. Next, pid = 2 also had three accidents, however since her phone number is unknown, we do not include her. Next, pid = 1 had two accidents 132, 140 and she has a phone number , so we include her accident numbers.
I know a method how to write a query to do this (for the sake of space I did not include), but it includes joining these tables two times and I think that there must be a more efficient way. Can anybody help me?
How about something like this? (not sure if this is what you already had)
select acc1.acc_id, acc2.acc_id as reference
from accidents acc1
inner join accidents acct2 on acc1.pid = acc2.pid and acc1.acc_id <> acc2.acc_id
inner join people on people.pid = acc1.pid
where people.phone <> ""