Pandas - Grouping Rows With Same Value in Dataframe - pandas

Here is the dataframe in question:
|City|District|Population| Code | ID |
| A | 4 | 2000 | 3 | 21 |
| A | 8 | 7000 | 3 | 21 |
| A | 38 | 3000 | 3 | 21 |
| A | 7 | 2000 | 3 | 21 |
| B | 34 | 3000 | 6 | 84 |
| B | 9 | 5000 | 6 | 84 |
| C | 4 | 9000 | 1 | 28 |
| C | 21 | 1000 | 1 | 28 |
| C | 32 | 5000 | 1 | 28 |
| C | 46 | 20 | 1 | 28 |
I want to regroup the population counts by city to have this kind of output:
|City|Population| Code | ID |
| A | 14000 | 3 | 21 |
| B | 8000 | 6 | 84 |
| C | 15020 | 1 | 28 |

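df = df.groupby(['City', 'Code', 'ID'], as_index=False)['Population'].sum()
Group by 'City', 'Code' and 'ID', then sum 'Population'. Passing as_index=False keeps the grouping keys as regular columns instead of moving them into the index.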
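
As a minimal sketch, the grouping can be checked against the sample data from the question. The variable name `result` and the final column reordering are only for illustration:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "City": ["A"] * 4 + ["B"] * 2 + ["C"] * 4,
    "District": [4, 8, 38, 7, 34, 9, 4, 21, 32, 46],
    "Population": [2000, 7000, 3000, 2000, 3000, 5000, 9000, 1000, 5000, 20],
    "Code": [3] * 4 + [6] * 2 + [1] * 4,
    "ID": [21] * 4 + [84] * 2 + [28] * 4,
})

# Sum Population per (City, Code, ID); as_index=False keeps the keys as columns.
grouped = df.groupby(["City", "Code", "ID"], as_index=False)["Population"].sum()

# Reorder the columns to match the layout requested in the question.
result = grouped[["City", "Population", "Code", "ID"]]
print(result)
```

Grouping on 'Code' and 'ID' as well as 'City' works here because both are constant within each city; if they varied, each city would be split into several rows.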

Related

Theil–Sen estimator using Hive

I would like to calculate the Theil–Sen estimator per ID for the value column in the sample table below using Hive. The Theil–Sen estimator is defined at https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator. I tried to use arrays but could not figure out a solution. Any help is appreciated.
+----+-------+-------+
| 1 | 1 | 10 |
| 1 | 2 | 20 |
| 1 | 3 | 30 |
| 1 | 4 | 40 |
| 1 | 5 | 50 |
| 2 | 1 | 100 |
| 2 | 2 | 90 |
| 2 | 3 | 102 |
| 2 | 4 | 75 |
| 2 | 5 | 70 |
| 2 | 6 | 50 |
| 2 | 7 | 100 |
| 2 | 8 | 80 |
| 2 | 9 | 60 |
| 2 | 10 | 50 |
| 2 | 11 | 40 |
| 2 | 12 | 40 |
+----+-------+-------+
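
This is not a Hive solution, only a reference computation in Python that may help when validating one. The Theil–Sen slope is the median of the slopes over all pairs of points. The table above has lost its header row, so the column names used here (`id`, `x`, `value`) are assumptions, as is the helper name `theil_sen_slope`:

```python
from itertools import combinations
from statistics import median

# Rows copied from the question; the column meaning (id, x, value) is assumed.
rows = [
    (1, 1, 10), (1, 2, 20), (1, 3, 30), (1, 4, 40), (1, 5, 50),
    (2, 1, 100), (2, 2, 90), (2, 3, 102), (2, 4, 75), (2, 5, 70),
    (2, 6, 50), (2, 7, 100), (2, 8, 80), (2, 9, 60), (2, 10, 50),
    (2, 11, 40), (2, 12, 40),
]


def theil_sen_slope(points):
    """Median of the slopes over all pairs of points with distinct x."""
    pair_slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    return median(pair_slopes)


# Collect the (x, value) points for each id.
points_by_id = {}
for id_, x, value in rows:
    points_by_id.setdefault(id_, []).append((x, value))

slopes = {id_: theil_sen_slope(pts) for id_, pts in points_by_id.items()}
print(slopes)
```

In SQL terms, the pairing step corresponds to a self-join on the ID with `a.x < b.x`, followed by a median (50th percentile) of the pairwise slopes per ID.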

Get aggregate quantity from JOINED tables

I have the following two tables in my database
inventory_transactions table
id | date_created | company_id | product_id | quantity | amount | is_verified | buy_or_sell_to | transaction_type | parent_tx | invoice_id | order_id | transaction_comment
----+----------------------------+------------+------------+----------+--------+-------------+----------------+------------------+-----------+------------+----------+---------------------
1 | 2022-04-25 10:42:00.627495 | 20 | 100 | 23 | 7659 | t | | BUY | | 1 | |
2 | 2022-04-25 10:48:48.02342 | 21 | 2 | 10 | 100 | t | | BUY | | 2 | |
3 | 2022-04-25 11:00:11.624176 | 21 | 7 | 10 | 100 | t | | BUY | | 3 | |
4 | 2022-04-25 11:08:14.607117 | 23 | 1 | 11 | 1210 | t | | BUY | | 4 | |
5 | 2022-04-25 11:13:24.084845 | 23 | 28 | 16 | 2560 | t | | BUY | | 5 | |
6 | 2022-04-25 11:26:56.338881 | 23 | 28 | 15 | 3525 | t | 5 | BUY | | 6 | 1 |
7 | 2022-04-25 11:26:56.340112 | 5 | 28 | 15 | 3525 | t | 23 | SELL | 6 | 6 | 1 |
8 | 2022-04-25 11:30:08.529288 | 23 | 30 | 65 | 15925 | t | 5 | BUY | | 7 | 2 |
9 | 2022-04-25 11:30:08.531005 | 5 | 30 | 65 | 15925 | t | 23 | SELL | 8 | 7 | 2 |
14 | 2022-04-25 12:28:51.658902 | 23 | 28 | 235 | 55225 | t | 5 | BUY | | 11 | 5 |
15 | 2022-04-25 12:28:51.660478 | 5 | 28 | 235 | 55225 | t | 23 | SELL | 14 | 11 | 5 |
20 | 2022-04-25 13:01:31.091524 | 20 | 4 | 4 | 176 | t | | BUY | | 15 | |
10 | 2022-04-25 11:50:48.4519 | 21 | 38 | 1 | 10 | t | | BUY | | 8 | |
11 | 2022-04-25 11:50:48.454118 | 21 | 36 | 1 | 10 | t | | BUY | | 8 | |
12 | 2022-04-25 11:52:19.827671 | 21 | 29 | 1 | 10 | t | | BUY | | 9 | |
13 | 2022-04-25 11:53:16.699881 | 21 | 74 | 1 | 10 | t | | BUY | | 10 | |
16 | 2022-04-25 12:37:39.739125 | 20 | 1 | 228 | 58824 | t | | BUY | | 12 | |
17 | 2022-04-25 12:37:39.741106 | 20 | 3 | 228 | 58824 | t | | BUY | | 12 | |
18 | 2022-04-25 12:49:09.922686 | 21 | 41 | 10 | 1000 | t | | BUY | | 13 | |
19 | 2022-04-25 12:55:11.986451 | 20 | 5 | 22 | 484 | t | | BUY | | 14 | |
NOTE: each transaction in the inventory_transactions table is recorded twice, with company_id and buy_or_sell_to swapped in the second row and transaction_type (BUY or SELL) reversed, similar to how a journal is maintained in accounting.
db# select * from inventory_transactions where buy_or_sell_to is not Null order by date_created limit 50;
id | date_created | company_id | product_id | quantity | amount | is_verified | buy_or_sell_to | transaction_type | parent_tx | invoice_id | order_id | transaction_comment
----+----------------------------+------------+------------+----------+--------+-------------+----------------+------------------+-----------+------------+----------+---------------------
6 | 2022-04-25 11:26:56.338881 | 23 | 28 | 15 | 3525 | t | 5 | BUY | | 6 | 1 |
7 | 2022-04-25 11:26:56.340112 | 5 | 28 | 15 | 3525 | t | 23 | SELL | 6 | 6 | 1 |
8 | 2022-04-25 11:30:08.529288 | 23 | 30 | 65 | 15925 | t | 5 | BUY | | 7 | 2 |
9 | 2022-04-25 11:30:08.531005 | 5 | 30 | 65 | 15925 | t | 23 | SELL | 8 | 7 | 2 |
companies table (consider this as the users table, in my project all users are companies)
id | company_type | gstin | name | phone_no | address | pincode | is_hymbee_verified | is_active | district_id | pancard_no
----+--------------+-----------------+-------------+------------+---------+---------+--------------------+-----------+-------------+------------
26 | RETAILER | XXXXXXXXXXXXXXX | ACD LLC | 12345%7898 | AQWSAQW | 319401 | | | 11 | AQWSDERFVV
27 | DISTRIBUTOR | XXXXXXXXXXXXXXX | CDF LLC | 123XX7898 | AGWSAQW | 319201 | | | 13 | AQWSDERFVV
28 | RETAILER | XXXXXXXXXXXXXXX | !## LLC | 1234!67XX9 | AQCCAQW | 319101 | | | 16 | AQWSDERFVV
29 | COMPANY | XXXXXXXXXXXXXXX | ZAZ LLC | 123456S898 | AQWQQQW | 319001 | | | 19 | AQWSDERFVV
Problem statement
The query I am trying to write should fetch the quantity sold by users who are RETAILERs or DISTRIBUTORs to other users who are RETAILERs or DISTRIBUTORs.
For example, if a user is a RETAILER, we need to calculate how much quantity this RETAILER has sold to other users who are either RETAILERs or DISTRIBUTORs.
In other words, for every row in the companies table, check whether the company_type is RETAILER or DISTRIBUTOR, and then use the inventory_transactions table to work out how much quantity that particular RETAILER or DISTRIBUTOR has sold to other RETAILERs and DISTRIBUTORs.
I have very basic knowledge of SQL and have only gotten so far:
select Seller.id as Seller_ROW, Buyer.id as Buyer_row, Seller.company_id, Buyer.buy_or_sell_to, Seller.company_type as Seller_Type, Buyer.company_type as Buyer_Type, Seller.quantity, Buyer.quantity
FROM
(select t.id, t.company_id, t.quantity, c.company_type
from inventory_transactions as t
join companies as c on c.id = t.company_id
where c.company_type = 'RETAILER' or company_type = 'DISTRIBUTOR'
) as Seller
JOIN
(select t.id, t.buy_or_sell_to, t.quantity, c.company_type
from inventory_transactions as t
join companies as c on c.id = t.buy_or_sell_to
where c.company_type = 'RETAILER' or company_type = 'DISTRIBUTOR') as Buyer on Seller.id = Buyer.id
output
seller_row | buyer_row | company_id | buy_or_sell_to | seller_type | buyer_type | quantity | quantity
------------+-----------+------------+----------------+-------------+-------------+----------+----------
25 | 25 | 22 | 25 | RETAILER | DISTRIBUTOR | 1 | 1
26 | 26 | 25 | 22 | DISTRIBUTOR | RETAILER | 1 | 1
31 | 31 | 37 | 43 | DISTRIBUTOR | RETAILER | 10 | 10
32 | 32 | 43 | 37 | RETAILER | DISTRIBUTOR | 10 | 10
33 | 33 | 21 | 43 | DISTRIBUTOR | RETAILER | 1 | 1
34 | 34 | 43 | 21 | RETAILER | DISTRIBUTOR | 1 | 1
35 | 35 | 21 | 49 | DISTRIBUTOR | RETAILER | 1 | 1
36 | 36 | 49 | 21 | RETAILER | DISTRIBUTOR | 1 | 1
37 | 37 | 21 | 51 | DISTRIBUTOR | RETAILER | 1 | 1
38 | 38 | 51 | 21 | RETAILER | DISTRIBUTOR | 1 | 1
There are duplicate rows in the resulting table, so I am unable to do a SUM().
Expected result
SELLER.company_id | SELLER.company_name | SELLER.company_type | QUANTITY | BUYER.company_type
26 | XYZ Retail Co. | RETAILER | 14 | RETAILER
26 | XYZ Retail Co. | RETAILER | 1 | DISTRIBUTOR
27 | ACD Distributions | DISTRIBUTOR | 0 | RETAILER
27 | ACD Distributions | DISTRIBUTOR | 10 | DISTRIBUTOR
This answer assumes that every sale is represented as two rows in inventory_transactions, which makes it possible to avoid duplicates by working with only one transaction_type, so we'll filter on SELL transactions.
SELECT t.company_id AS seller_company_id
     , s.name AS seller_company_name  -- the companies table stores the name in "name"
     , s.company_type AS seller_company_type
     , SUM(t.quantity) AS quantity
     , b.company_type AS buyer_company_type
FROM inventory_transactions AS t
INNER JOIN companies AS s  -- seller
    ON s.id = t.company_id
INNER JOIN companies AS b  -- buyer
    ON b.id = t.buy_or_sell_to
WHERE t.transaction_type = 'SELL'
  AND s.company_type IN ('RETAILER','DISTRIBUTOR')
  AND b.company_type IN ('RETAILER','DISTRIBUTOR')
GROUP BY t.company_id, s.name, s.company_type, b.company_type
ORDER BY seller_company_id, seller_company_name, seller_company_type, buyer_company_type
;
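
The sample rows in the question do not line up between the two tables (the transactions reference companies 5 and 23, while the companies shown are 26 to 29), so the sketch below runs the same kind of query against a small, entirely hypothetical data set in an in-memory SQLite database. The company names and quantities are made up, and the tables are reduced to the columns the query needs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, company_type TEXT, name TEXT);
CREATE TABLE inventory_transactions (
    id INTEGER PRIMARY KEY, company_id INTEGER, quantity INTEGER,
    buy_or_sell_to INTEGER, transaction_type TEXT
);
-- Hypothetical companies, not taken from the question.
INSERT INTO companies VALUES
    (1, 'RETAILER', 'R1'), (2, 'DISTRIBUTOR', 'D1'), (3, 'COMPANY', 'C1');
-- Each sale appears twice: a BUY row for the buyer, a SELL row for the seller.
INSERT INTO inventory_transactions VALUES
    (1, 1, 15, 2, 'BUY'),   (2, 2, 15, 1, 'SELL'),
    (3, 1, 5, 2, 'BUY'),    (4, 2, 5, 1, 'SELL'),
    (5, 2, 3, 1, 'BUY'),    (6, 1, 3, 2, 'SELL'),
    (7, 1, 100, 3, 'BUY'),  (8, 3, 100, 1, 'SELL'),
    (9, 1, 7, NULL, 'BUY');
""")

query = """
SELECT t.company_id, s.name, s.company_type, SUM(t.quantity), b.company_type
FROM inventory_transactions AS t
JOIN companies AS s ON s.id = t.company_id
JOIN companies AS b ON b.id = t.buy_or_sell_to
WHERE t.transaction_type = 'SELL'
  AND s.company_type IN ('RETAILER', 'DISTRIBUTOR')
  AND b.company_type IN ('RETAILER', 'DISTRIBUTOR')
GROUP BY t.company_id, s.name, s.company_type, b.company_type
ORDER BY t.company_id, b.company_type
"""
# Only SELL rows are read, so each sale is counted exactly once.
sales = conn.execute(query).fetchall()
for row in sales:
    print(row)
```

Seller and buyer type combinations with no sales do not appear in this result. The expected output in the question shows a row with quantity 0, which would need an outer join against the full set of combinations.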

grouping dataframe based on specific column value

I am working on a real-time project. I have a dataframe that looks like the one below.
| id | name | values|
| 101 | a | 13 |
| 101 | b | 14 |
| cv |
59 |
| 101 | c | 13 |
| 23 |
| 102 | a | 13 |
| 102 | b | 14 |
| cv |
56 |
| 102 | c | 17 |
| 23
I need the dataframe to look like the one below when the value is the same, such as 'cv':
| 101 | a | 13 |
| 101 | b | cv |
| 101 | c | 13 |
| 23 |
| 102 | a | 13 |
| 102 | b | cv |
| 102 | c | 17 |
23 |

How do I get around aggregate function error?

I have the following SQL to calculate a percentage of the total:
SELECT tblTourns_atp.ID_Ti,
Sum([FS_1]/(SELECT Sum(FSOF_1)
FROM stat_atp
WHERE stat_atp.ID_T = tblTourns_atp.ID_T)) AS S1_IP
FROM stat_atp
INNER JOIN tblTourns_atp ON stat_atp.ID_T = tblTourns_atp.ID_T
GROUP BY tblTourns_atp.ID_Ti
I'm getting the 'aggregate error' because it wants the ID_T fields either grouped or in an aggregate function. I've read loads of examples but none of them seem to apply when the offending field is the subject of 'WHERE'.
Tables and output as follows:
+----------+------+--------+--+---------------+-------+--+--------+--------+
| stat_atp | | | | tblTourns_atp | | | Output | |
+----------+------+--------+--+---------------+-------+--+--------+--------+
| ID_T | FS_1 | FSOF_1 | | ID_T | ID_Ti | | ID_Ti | S1_IP |
| 1 | 20 | 40 | | 1 | 1 | | 1 | 31.03% |
| 2 | 30 | 100 | | 2 | 1 | | 2 | 28.57% |
| 3 | 40 | 150 | | 3 | 1 | | 3 | 33.33% |
| 4 | 30 | 100 | | 4 | 2 | | | |
| 5 | 30 | 100 | | 5 | 2 | | | |
| 6 | 40 | 150 | | 6 | 2 | | | |
| 7 | 20 | 40 | | 7 | 3 | | | |
| 8 | 30 | 100 | | 8 | 3 | | | |
| 9 | 40 | 150 | | 9 | 3 | | | |
| 10 | 20 | 40 | | 10 | 3 | | | |
+----------+------+--------+--+---------------+-------+--+--------+--------+
Since you already have an inner join between the two tables, a separate subquery isn't required:
select t.id_ti, sum(s.fs_1)/sum(s.fsof_1) as pct
from tbltourns_atp t inner join stat_atp s on t.id_t = s.id_t
group by t.id_ti
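
As a rough check, the query can be run against the sample tables from the question in an in-memory SQLite database. One assumption to note: SQLite performs integer division on integer columns, so this sketch multiplies by 100.0 to force a floating-point percentage; other engines may not need that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stat_atp (ID_T INTEGER, FS_1 INTEGER, FSOF_1 INTEGER);
CREATE TABLE tblTourns_atp (ID_T INTEGER, ID_Ti INTEGER);
-- Rows copied from the question.
INSERT INTO stat_atp VALUES
    (1, 20, 40), (2, 30, 100), (3, 40, 150), (4, 30, 100), (5, 30, 100),
    (6, 40, 150), (7, 20, 40), (8, 30, 100), (9, 40, 150), (10, 20, 40);
INSERT INTO tblTourns_atp VALUES
    (1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2),
    (7, 3), (8, 3), (9, 3), (10, 3);
""")

query = """
SELECT t.ID_Ti, 100.0 * SUM(s.FS_1) / SUM(s.FSOF_1) AS pct
FROM tblTourns_atp AS t
INNER JOIN stat_atp AS s ON t.ID_T = s.ID_T
GROUP BY t.ID_Ti
ORDER BY t.ID_Ti
"""
# Round in Python so the values can be compared with the question's output.
percentages = [(id_ti, round(pct, 2)) for id_ti, pct in conn.execute(query)]
print(percentages)  # [(1, 31.03), (2, 28.57), (3, 33.33)]
```

These values match the Output columns in the question.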

Segregate rows according to their HEAD (parent) - sql

I have the following SQL table.
+----+--------+----------+--------+
| ID | TestNo | TestName | HeadID |
+----+--------+----------+--------+
| 1 | 21 | Comp-1 | null |
| 2 | 22 | C1 | 21 |
| 3 | 23 | C2 | 21 |
| 4 | 24 | C3 | 21 |
| 5 | 47 | Comp-2 | null |
| 6 | 25 | C4 | 47 |
| 7 | 26 | C1+ | 21 |
+----+--------+----------+--------+
I want to get all the child rows (according to their HeadID) below their head test.
select * from ranges order by HeadID
The ACTUAL OUTPUT I get from the above query:
+----+--------+----------+--------+
| ID | TestNo | TestName | HeadID |
+----+--------+----------+--------+
| 1 | 21 | Comp-1 | null |
| 5 | 47 | Comp-2 | null |
| 2 | 22 | C1 | 21 |
| 3 | 23 | C2 | 21 |
| 4 | 24 | C3 | 21 |
| 7 | 26 | C1+ | 21 |
| 6 | 25 | C4 | 47 |
+----+--------+----------+--------+
but my DESIRED OUTPUT is:
+----+--------+----------+--------+
| ID | TestNo | TestName | HeadID |
+----+--------+----------+--------+
| 1 | 21 | Comp-1 | null |
| 2 | 22 | C1 | 21 |
| 3 | 23 | C2 | 21 |
| 4 | 24 | C3 | 21 |
| 7 | 26 | C1+ | 21 |
| 5 | 47 | Comp-2 | null |
| 6 | 25 | C4 | 47 |
+----+--------+----------+--------+
How can I achieve this?
If you have only one level of children, then you can achieve this ordering like this:
SELECT *
FROM Ranges
ORDER BY
CASE WHEN HeadID IS NULL THEN TestNo ELSE HeadID END
,HeadID
,ID
;
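
The ordering can be tried out on the sample rows with an in-memory SQLite database, as sketched below. It relies on NULL sorting before other values in ascending order, which is the default in SQLite and SQL Server; PostgreSQL and Oracle sort NULLs last by default, so there the second sort key would need NULLS FIRST or an explicit CASE expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ranges (ID INTEGER, TestNo INTEGER, TestName TEXT, HeadID INTEGER);
-- Rows copied from the question.
INSERT INTO Ranges VALUES
    (1, 21, 'Comp-1', NULL), (2, 22, 'C1', 21), (3, 23, 'C2', 21),
    (4, 24, 'C3', 21), (5, 47, 'Comp-2', NULL), (6, 25, 'C4', 47),
    (7, 26, 'C1+', 21);
""")

query = """
SELECT ID, TestNo, TestName, HeadID
FROM Ranges
ORDER BY
    CASE WHEN HeadID IS NULL THEN TestNo ELSE HeadID END  -- family key
    , HeadID  -- NULL (the head row) sorts first within its family in SQLite
    , ID
"""
ordered_ids = [row[0] for row in conn.execute(query)]
print(ordered_ids)  # [1, 2, 3, 4, 7, 5, 6]
```

The first sort key gives a head row and its children the same value (the head's TestNo), which is what keeps each family together.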