Turn results of count distinct into something that can be aggregated - sql

I have a table like this:
+----------+--------------+-------------+
| category | sub_category | customer_id |
+----------+--------------+-------------+
| A | AB2 | A876 |
| A | AB2 | A876 |
| A | AA1 | A876 |
| A | AA1 | A876 |
| A | AC3 | A756 |
| B | AB2 | A876 |
| B | AA1 | A756 |
| B | AB7 | A908 |
| C | AA1 | A756 |
| C | AB7 | A908 |
| C | AC3 | A908 |
+----------+--------------+-------------+
And I want to count distinct customers so I can easily do something like:
SELECT category, sub_category, COUNT(DISTINCT customer_id) as count_of_customers
FROM tbl
GROUP BY category, sub_category
And I get a report that gives me distinct customers for each sub_category and category. But these numbers can no longer be aggregated as there needs to be de-duplication if I just need distinct customers by category only.
For example, customer_id = 'A876' will be counted twice in category = 'A' (once in sub_category = 'AB2' and once in sub_category = 'AA1') if I just sum the count_of_customers from my query result.
So here is the question: I would like to make these query results "aggregatable". Looking at the problem, it seems this just isn't possible, but I am wondering if there is some clever way of distributing these results across categories, so that in my reporting layer (like an Excel pivot table) I can get a result that counts 'A876' once in category = 'A' but counts it twice when I also include sub_category in the fields. Basically, converting the results into something summable.
I should mention that this is an overly simplified example. The solution will need to generalize across n different categories and sub_categories.
I am looking for an output that would easily allow me to get either of the following results in something similar to a pivot table (think tableau-like reporting tools):
+----------+--------------------+
| category | distinct_customers |
+----------+--------------------+
| A | 2 |
| B | 3 |
| C | 2 |
+----------+--------------------+
+--------------+--------------------+
| sub_category | distinct_customers |
+--------------+--------------------+
| AA1 | 2 |
| AB2 | 1 |
| AB7 | 1 |
| AC3 | 2 |
+--------------+--------------------+
My immediate thought is to assign weights to a customer_id depending on how many categories and sub_categories it occurs in but I don't know exactly how I'd go about doing this.

You can do exactly what you want -- assigning weights. But this still won't aggregate correctly. Assuming there are no duplicates:
select category, sub_category,
count(distinct customer_id),
sum(1.0 / num_cs) as weighted_customers
from (select t.*,
count(*) over (partition by customer_id) as num_cs
from t
) t
group by category, sub_category;
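This spreads each customer's weight over all of its rows, across both category and sub_category, so the weights sum to the overall distinct count. You can add category or sub_category to the partition by to make the weights sum correctly within just one or the other.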
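A minimal sketch of that adjustment, run against the question's sample rows through Python's sqlite3 module (SQLite is only a stand-in here, and it needs a build with window function support). Partitioning by customer_id and category is my choice for illustration: the pre-aggregated weights then sum to the distinct customer count per category, but summing the same weights per sub_category does not give distinct counts, which is the "won't aggregate correctly" caveat.

```python
import sqlite3

# Sample rows from the question (table name tbl, as in the question).
rows = [
    ("A", "AB2", "A876"), ("A", "AB2", "A876"),
    ("A", "AA1", "A876"), ("A", "AA1", "A876"),
    ("A", "AC3", "A756"),
    ("B", "AB2", "A876"), ("B", "AA1", "A756"), ("B", "AB7", "A908"),
    ("C", "AA1", "A756"), ("C", "AB7", "A908"), ("C", "AC3", "A908"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (category TEXT, sub_category TEXT, customer_id TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)

# Each row gets weight 1 / (number of rows for that customer in that category),
# so a customer's weights add up to exactly 1 inside every category.
weighted = conn.execute("""
    SELECT category, sub_category,
           COUNT(DISTINCT customer_id) AS count_of_customers,
           SUM(1.0 / num_cs)           AS weighted_customers
    FROM (SELECT t.*,
                 COUNT(*) OVER (PARTITION BY customer_id, category) AS num_cs
          FROM tbl t) t
    GROUP BY category, sub_category
    ORDER BY category, sub_category
""").fetchall()

# What a pivot table would do with the pre-aggregated rows: plain sums.
per_category = {}
per_sub_category = {}
for category, sub_category, _count, weight in weighted:
    per_category[category] = per_category.get(category, 0.0) + weight
    per_sub_category[sub_category] = per_sub_category.get(sub_category, 0.0) + weight

print(per_category)      # sums match the distinct customers per category
print(per_sub_category)  # sums do NOT match the distinct customers per sub_category
```

On this data the per-category sums come out as 2, 3 and 2, matching the desired report, while AA1 sums to 2.5 even though it has only 2 distinct customers.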

Related

MariaDB joining tables on themselves

OK, I've googled and I've tried, but mostly I've failed.
I've got a table with 5 columns
ID (just a primary key)
UserUUID
Category
Value
I can run a query that gets the rankings of all users for a specific category:
SELECT
RANK() OVER (PARTITION BY t1.cat ORDER BY value DESC) as rank,
t1.UUID, t1.cat, t1.value
FROM t1
WHERE t1.cat='Category1'
ORDER by t1.value DESC
So this outputs something like:
| 1 | sdc9c4-541 | cat1 | 16102 |
| 2 | sqdf5d-542 | cat1 | 7313 |
| 3 | sqsd5d-685 | cat1 | 7116 |
| 4 | s45sdf-213 | cat1 | 4158 |
.....
This works, but now I'm trying to get the reverse view of this.
So I'm trying to write a query that gets the rankings of a specific user for all categories.
The desired output should look something like:
| 1 | sdc9c4-541 | cat1 | 16102 |
| 37 | sdc9c4-541 | cat2 | 25 |
| 15 | sdc9c4-541 | cat3 | 2345 |
| 2 | sdc9c4-541 | cat4 | 912 |
This shows the rank, user, category and value, where the rank represents the user's ranking in that category in comparison with other users.
I've already messed around with subqueries, WITH clauses, variables and joins, but I can't get this result to come out.
Can anybody give me some pointers on which direction I need to look in to make this work?
Thanks in advance
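One possible direction, sketched under assumptions: compute the rank for every user in every category in a derived table, and only then filter on the user. Filtering in the same SELECT as the window function would rank the user against nobody, because WHERE is applied before RANK(). The sketch below runs on SQLite through Python's sqlite3 module as a stand-in for MariaDB; the column names follow the query in the question (UUID, cat, value), and the users u1 to u3 and their values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (id INTEGER PRIMARY KEY, UUID TEXT, cat TEXT, value INTEGER)"
)
# Hypothetical data: three users spread over three categories.
conn.executemany(
    "INSERT INTO t1 (UUID, cat, value) VALUES (?, ?, ?)",
    [
        ("u1", "cat1", 16102), ("u2", "cat1", 7313), ("u3", "cat1", 7116),
        ("u1", "cat2", 25), ("u2", "cat2", 90), ("u3", "cat2", 40),
        ("u1", "cat3", 2345), ("u2", "cat3", 5000),
    ],
)

# Rank everyone inside each category first, then keep one user's rows.
user_ranks = conn.execute("""
    SELECT rnk, UUID, cat, value
    FROM (SELECT RANK() OVER (PARTITION BY cat ORDER BY value DESC) AS rnk,
                 UUID, cat, value
          FROM t1) ranked
    WHERE UUID = ?
    ORDER BY cat
""", ("u1",)).fetchall()

for row in user_ranks:
    print(row)
```

With this made-up data, u1 is ranked 1st in cat1, 3rd in cat2 and 2nd in cat3.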

Complex nested aggregations to get order totals

I have a system to track orders and related expenditures. This is a Rails app running on PostgreSQL. 99% of my app gets by with plain old Rails Active Record calls, etc. This one is ugly.
The expenditures table looks like this:
+----+----------+-----------+------------------------+
| id | category | parent_id | note |
+----+----------+-----------+------------------------+
| 1 | order | nil | order with no invoices |
+----+----------+-----------+------------------------+
| 2 | order | nil | order with invoices |
+----+----------+-----------+------------------------+
| 3 | invoice | 2 | invoice for order 2 |
+----+----------+-----------+------------------------+
| 4 | invoice | 2 | invoice for order 2 |
+----+----------+-----------+------------------------+
Each expenditure has many expenditure_items, and the orders can be parents to the invoices. That table looks like this:
+----+----------------+-------------+-------+---------+
| id | expenditure_id | cbs_item_id | total | note |
+----+----------------+-------------+-------+---------+
| 1 | 1 | 1 | 5 | Fuit |
+----+----------------+-------------+-------+---------+
| 2 | 1 | 2 | 15 | Veggies |
+----+----------------+-------------+-------+---------+
| 3 | 2 | 1 | 123 | Fuit |
+----+----------------+-------------+-------+---------+
| 4 | 2 | 2 | 456 | Veggies |
+----+----------------+-------------+-------+---------+
| 5 | 3 | 1 | 34 | Fuit |
+----+----------------+-------------+-------+---------+
| 6 | 3 | 2 | 76 | Veggies |
+----+----------------+-------------+-------+---------+
| 7 | 4 | 1 | 26 | Fuit |
+----+----------------+-------------+-------+---------+
| 8 | 4 | 2 | 98 | Veggies |
+----+----------------+-------------+-------+---------+
I need to track a few things:
amounts left to be invoiced on orders (that's easy)
the above, but rolled up for each cbs_item_id (this is the ugly part)
The cbs_item_id is basically an accounting code to categorize the money spent etc. I have visualized what my end result would look like:
+-------------+----------------+-------------+---------------------------+-----------+
| cbs_item_id | expenditure_id | order_total | invoice_total | remaining |
+-------------+----------------+-------------+---------------------------+-----------+
| 1 | 1 | 5 | 0 | 5 |
+-------------+----------------+-------------+---------------------------+-----------+
| 1 | 2 | 123 | 60 | 63 |
+-------------+----------------+-------------+---------------------------+-----------+
| | | | Rollup for cbs_item_id: 1 | 68 |
+-------------+----------------+-------------+---------------------------+-----------+
| 2 | 1 | 15 | 0 | 15 |
+-------------+----------------+-------------+---------------------------+-----------+
| 2 | 2 | 456 | 174 | 282 |
+-------------+----------------+-------------+---------------------------+-----------+
| | | | Rollup for cbs_item_id: 2 | 297 |
+-------------+----------------+-------------+---------------------------+-----------+
order_total is the sum of total for all the expenditure_items of the given order (category = 'order'). invoice_total is the sum of total for all the expenditure_items of invoices with parent_id = expenditures.id. remaining is calculated as the difference (but never less than 0). In real terms, the idea is that you place an order for $1000 and $750 of invoices come in. I need to calculate the $250 left on the order (remaining), broken down by category (cbs_item_id). Then I need the roll-up of all the remaining values grouped by cbs_item_id.
So for each cbs_item_id I need to group by each order, find the total for the order, find the total invoiced against the order, then subtract the two (the result also can't be negative). It has to be on a per-order basis; the overall aggregate difference will not return the expected results.
In the end I'm looking for a result something like this:
+-------------+-----------+
| cbs_item_id | remaining |
+-------------+-----------+
| 1 | 68 |
+-------------+-----------+
| 2 | 297 |
+-------------+-----------+
I am guessing this might be a combination of GROUP BY and perhaps a sub query or even CTE (voodoo to me). My SQL skills are not that great and this is WAY above my pay grade.
Here is a fiddle for the data above:
http://sqlfiddle.com/#!17/2fe3a
Alternate fiddle:
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=e9528042874206477efbe0f0e86326fb
This query produces the result you are looking for:
SELECT cbs_item_id, sum(order_total - invoice_total) AS remaining
FROM (
SELECT cbs_item_id
, COALESCE(e.parent_id, e.id) AS expenditure_id -- ①
, COALESCE(sum(total) FILTER (WHERE e.category = 'order' ), 0) AS order_total -- ②
, COALESCE(sum(total) FILTER (WHERE e.category = 'invoice'), 0) AS invoice_total
FROM expenditures e
JOIN expenditure_items i ON i.expenditure_id = e.id
GROUP BY 1, 2 -- ③
) sub
GROUP BY 1
ORDER BY 1;
① Note how I assume a saner table definition with expenditures.parent_id being integer, and true NULL instead of the string 'nil'. This allows the simple use of COALESCE.
② About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
③ Using short syntax with ordinal numbers of SELECT list items. Example:
Select first row in each GROUP BY group?
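To check the two-level aggregation against the question's sample data, here is a sketch that runs on SQLite through Python's sqlite3 module instead of PostgreSQL. Two things are swapped in and are my assumptions, not part of the answer: SUM(CASE ...) replaces the aggregate FILTER clause for portability, and SQLite's two-argument MAX(x, 0) clamps each order's remainder at zero, which the question asks for (in PostgreSQL that would be GREATEST(x, 0)). As in note ①, parent_id is stored as a true NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE expenditures (
        id INTEGER PRIMARY KEY, category TEXT, parent_id INTEGER);
    CREATE TABLE expenditure_items (
        id INTEGER PRIMARY KEY, expenditure_id INTEGER,
        cbs_item_id INTEGER, total INTEGER);

    INSERT INTO expenditures VALUES
        (1, 'order', NULL), (2, 'order', NULL),
        (3, 'invoice', 2),  (4, 'invoice', 2);
    INSERT INTO expenditure_items VALUES
        (1, 1, 1, 5),  (2, 1, 2, 15), (3, 2, 1, 123), (4, 2, 2, 456),
        (5, 3, 1, 34), (6, 3, 2, 76), (7, 4, 1, 26),  (8, 4, 2, 98);
""")

remaining = conn.execute("""
    SELECT cbs_item_id,
           -- clamp per order, then roll up per cbs_item_id
           SUM(MAX(order_total - invoice_total, 0)) AS remaining
    FROM (
        SELECT i.cbs_item_id,
               -- invoices are folded into their parent order
               COALESCE(e.parent_id, e.id) AS order_id,
               SUM(CASE WHEN e.category = 'order'   THEN i.total ELSE 0 END) AS order_total,
               SUM(CASE WHEN e.category = 'invoice' THEN i.total ELSE 0 END) AS invoice_total
        FROM expenditures e
        JOIN expenditure_items i ON i.expenditure_id = e.id
        GROUP BY i.cbs_item_id, COALESCE(e.parent_id, e.id)
    ) sub
    GROUP BY cbs_item_id
    ORDER BY cbs_item_id
""").fetchall()

print(remaining)
```

On the sample data no order is over-invoiced, so the clamp changes nothing and the result is 68 for cbs_item_id 1 and 297 for cbs_item_id 2, as in the question's expected output.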
Can I get the total of all the remaining values for all rows, or do I need to wrap that into another sub-select?
There is a very concise option with GROUPING SETS:
...
GROUP BY GROUPING SETS ((1), ()) -- that's all :)
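For engines without GROUPING SETS, the same shape of result (one row per cbs_item_id plus one grand-total row) can be produced by combining two groupings with UNION ALL. The sketch below does that on SQLite through Python's sqlite3 module; the per_order table is hypothetical and simply holds the per-order remaining values from the question's visualization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table with the per-order "remaining" values from the question.
    CREATE TABLE per_order (cbs_item_id INTEGER, expenditure_id INTEGER, remaining INTEGER);
    INSERT INTO per_order VALUES (1, 1, 5), (1, 2, 63), (2, 1, 15), (2, 2, 282);
""")

totals = conn.execute("""
    SELECT cbs_item_id, remaining
    FROM (SELECT cbs_item_id, SUM(remaining) AS remaining   -- grouping set (1)
          FROM per_order
          GROUP BY cbs_item_id
          UNION ALL
          SELECT NULL, SUM(remaining)                       -- grouping set ()
          FROM per_order) u
    ORDER BY cbs_item_id IS NULL, cbs_item_id               -- grand total last
""").fetchall()

print(totals)
```

The grand-total row carries NULL in cbs_item_id, which is also how GROUPING SETS marks the empty grouping set.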
Related:
Converting rows to columns

Make a query making groups on the same result row

I have two tables. Like this.
select * from extrafieldvalues;
+----------------------------+
| id | value | type | idItem |
+----------------------------+
| 1 | 100 | 1 | 10 |
| 2 | 150 | 2 | 10 |
| 3 | 101 | 1 | 11 |
| 4 | 90 | 2 | 11 |
+----------------------------+
select * from items
+------------+
| id | name |
+------------+
| 10 | foo |
| 11 | bar |
+------------+
I need to make a query and get something like this:
+--------------------------------------+
| idItem | valtype1 | valtype2 | name |
+--------------------------------------+
| 10 | 100 | 150 | foo |
| 11 | 101 | 90 | bar |
+--------------------------------------+
The quantity of types of extra field values is variable, but every item ALWAYS uses every extra field.
If you have only two fields, then left join is an option for this:
select i.*, efv1.value as value_1, efv2.value as value_2
from items i left join
extrafieldvalues efv1
on efv1.iditem = i.id and
efv1.type = 1 left join
extrafieldvalues efv2
on efv2.iditem = i.id and
efv2.type = 2 ;
In terms of performance, two joins are probably faster than an aggregation, and they make it easier to bring in more columns from items. On the other hand, conditional aggregation generalizes more easily, and its performance changes little as more columns from extrafieldvalues are added to the select.
Use conditional aggregation:
select iditem,
max(case when type=1 then value end) as valtype1,
max(case when type=2 then value end) as valtype2,name
from extrafieldvalues a inner join items b on a.iditem=b.id
group by iditem,name
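Because the question says the number of types is variable, one way to generalize the conditional aggregation is to read the distinct types first and generate one MAX(CASE ...) column per type. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module, the database engine in the question is not stated, and the type values are assumed to be integers (they are cast with int() before being placed into the SQL text).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE extrafieldvalues (
        id INTEGER PRIMARY KEY, value INTEGER, type INTEGER, idItem INTEGER);
    INSERT INTO items VALUES (10, 'foo'), (11, 'bar');
    INSERT INTO extrafieldvalues VALUES
        (1, 100, 1, 10), (2, 150, 2, 10), (3, 101, 1, 11), (4, 90, 2, 11);
""")

# Step 1: find out which types exist right now.
types = [row[0] for row in conn.execute(
    "SELECT DISTINCT type FROM extrafieldvalues ORDER BY type")]

# Step 2: build one conditional-aggregation column per type.
pivot_columns = ",\n       ".join(
    f"MAX(CASE WHEN type = {int(t)} THEN value END) AS valtype{int(t)}"
    for t in types
)

query = f"""
SELECT idItem,
       {pivot_columns},
       name
FROM extrafieldvalues a
JOIN items b ON a.idItem = b.id
GROUP BY idItem, name
ORDER BY idItem
"""

pivoted = conn.execute(query).fetchall()
print(pivoted)
```

For the two types in the sample data this generates the same query as the hand-written one and returns one row per item with valtype1 and valtype2 side by side.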

PostgreSQL: Using the LEAST() command after GROUP BY to achieve first transactions

I am working with a magento table like this:
+-----------+--------------+------------+--------------+----------+-------------+
| date | email | product_id | product_type | order_id | qty_ordered |
+-----------+--------------+------------+--------------+----------+-------------+
| 2017/2/15 | x#y.com | 18W1 | custom | 12 | 1 |
+-----------+--------------+------------+--------------+----------+-------------+
| 2017/2/15 | x#y.com | 18W2 | simple | 17 | 3 |
+-----------+--------------+------------+--------------+----------+-------------+
| 2017/2/20 | z#abc.com | 22Y34 | simple | 119 | 1 |
+-----------+--------------+------------+--------------+----------+-------------+
| 2017/2/20 | z#abc.com | 22Y35 | custom | 31 | 2 |
+-----------+--------------+------------+--------------+----------+-------------+
I want to make a new view by grouping by email, and then taking the row with the LEAST of order_id only.
So my final table after doing this operation from above should look like this:
+-----------+--------------+------------+--------------+----------+-------------+
| date | email | product_id | product_type | order_id | qty_ordered |
+-----------+--------------+------------+--------------+----------+-------------+
| 2017/2/15 | x#y.com | 18W1 | custom | 12 | 1 |
+-----------+--------------+------------+--------------+----------+-------------+
| 2017/2/20 | z#abc.com | 22Y35 | custom | 31 | 2 |
+-----------+--------------+------------+--------------+----------+-------------+
I'm trying to use the following query (but it's not working):
SELECT * , (SELECT DISTINCT table.email, table.order_id,
LEAST (order_id) AS first_transaction_id
FROM
table
GROUP BY
email)
FROM table;
Would really love any help with this, thank you!
I think you want distinct on:
select distinct on (email) t.*
from t
order by email, order_id;
distinct on is a Postgres extension. It returns one row for each distinct combination of the keys in parentheses, chosen according to the order by clause. In this case, that is one row per email: the one with the smallest order_id, because order_id is the next sort key. The distinct on keys also need to be the leading keys in the order by.
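On engines without distinct on, a common equivalent is ROW_NUMBER() partitioned by the same key, keeping only the first row of each partition. The sketch below swaps in that technique and runs it on SQLite through Python's sqlite3 module with the question's sample rows; the table name t follows the answer, and all values are copied from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t (date TEXT, email TEXT, product_id TEXT,
                    product_type TEXT, order_id INTEGER, qty_ordered INTEGER)
""")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)", [
    ("2017/2/15", "x#y.com",   "18W1",  "custom", 12,  1),
    ("2017/2/15", "x#y.com",   "18W2",  "simple", 17,  3),
    ("2017/2/20", "z#abc.com", "22Y34", "simple", 119, 1),
    ("2017/2/20", "z#abc.com", "22Y35", "custom", 31,  2),
])

# Number the rows of each email by ascending order_id and keep row 1,
# i.e. the row with the smallest order_id per email.
first_orders = conn.execute("""
    SELECT date, email, product_id, product_type, order_id, qty_ordered
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY email ORDER BY order_id) AS rn
          FROM t) ranked
    WHERE rn = 1
    ORDER BY email
""").fetchall()

for row in first_orders:
    print(row)
```

Each email keeps exactly one full row, so the other columns (product_id, qty_ordered and so on) come from the same row as the smallest order_id rather than being aggregated independently.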

Create a pivot table from two tables based on dates

I have two MS Access tables sharing a one to many relationship. Their structures are like the following:
tbl_Persons
+----------+------------+-----------+
| PersonID | PersonName | OtherData |
+----------+------------+-----------+
| 1 | PersonA | etc. |
| 2 | PersonB | |
| 3 | PersonC | |
tbl_Visits
+----------+------------+------------+-----------------------
| VisitID | PersonID | VisitDate | dozens of other fields
+----------+------------+------------+-----------
| 1 | 1 | 09/01/13 |
| 2 | 1 | 09/02/13 |
| 3 | 2 | 09/03/13 |
| 4 | 2 | 09/04/13 | etc...
I wish to create a new table based on the VisitDate field, the column headings of which are Visit-n where n is 1 to the number of visits, Visit-n-Data1, Visit-n-Data2, Visit-n-Data3 etc.
MergedTable
+----------+----------+---------------+-----------------+----------+----------------+
| PersonID | Visit1 | Visit1Data1 | Visit1Data2... | Visit2 | Visit2Data1... |
+----------+----------+---------------+-----------
| 1 | 09/01/13 | | | 09/02/13 |
| 2 | 09/03/13 | | | 09/04/13 |
| 3 | etc. | |
I am really not sure how to do this. Whether SQL query or using DAO then looping through records and columns. It is essential that there is only 1 PersonID per row and all his data appears chronologically into columns.
Start off by ranking the visits with something like
SELECT PersonID, VisitID,
(SELECT COUNT(VisitID) FROM tbl_Visits AS C
WHERE C.PersonID = tbl_Visits.PersonID
AND C.VisitDate < tbl_Visits.VisitDate) AS RankNumber
FROM tbl_Visits
Use this query as the base for the 'pivot'. Note that RankNumber starts at 0 for a person's first visit.
Since a person may have several visits on the same day, the WHERE clause needs to be a bit more sophisticated (for example, breaking ties on VisitID). But I hope you get the basic concept.
Pivoting can then be done with multiple LEFT JOINs.
I have not tested this, so I can't vouch for its performance. This is easier to accomplish in SQL Server than in MS Access.
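The rank-then-join idea can be sketched end to end as follows. This runs on SQLite through Python's sqlite3 module, not on MS Access, so the syntax will need adapting (Access, for instance, wants nested parentheses around multiple joins). Assumptions of mine: the dates are read as MM/DD/YY and stored as ISO strings so that text comparison sorts them chronologically, ties on the same day are broken by VisitID, the rank is shifted to start at 1, and only the first two visits are pivoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_Persons (PersonID INTEGER PRIMARY KEY, PersonName TEXT);
    CREATE TABLE tbl_Visits (VisitID INTEGER PRIMARY KEY, PersonID INTEGER, VisitDate TEXT);

    INSERT INTO tbl_Persons VALUES (1, 'PersonA'), (2, 'PersonB'), (3, 'PersonC');
    INSERT INTO tbl_Visits VALUES
        (1, 1, '2013-09-01'), (2, 1, '2013-09-02'),
        (3, 2, '2013-09-03'), (4, 2, '2013-09-04');

    -- Rank = number of earlier visits by the same person, plus 1.
    -- Same-day visits are ordered by VisitID so ranks stay unique.
    CREATE VIEW ranked AS
    SELECT V.PersonID, V.VisitID, V.VisitDate,
           (SELECT COUNT(*) FROM tbl_Visits AS C
            WHERE C.PersonID = V.PersonID
              AND (C.VisitDate < V.VisitDate
                   OR (C.VisitDate = V.VisitDate AND C.VisitID < V.VisitID))
           ) + 1 AS RankNumber
    FROM tbl_Visits AS V;
""")

# One LEFT JOIN per visit column; persons without visits keep NULLs.
merged = conn.execute("""
    SELECT P.PersonID, R1.VisitDate AS Visit1, R2.VisitDate AS Visit2
    FROM tbl_Persons AS P
    LEFT JOIN ranked AS R1 ON R1.PersonID = P.PersonID AND R1.RankNumber = 1
    LEFT JOIN ranked AS R2 ON R2.PersonID = P.PersonID AND R2.RankNumber = 2
    ORDER BY P.PersonID
""").fetchall()

for row in merged:
    print(row)
```

Each additional visit column needs another LEFT JOIN on the next RankNumber, so the number of joins has to cover the maximum number of visits any person has.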