SQL: how to calculate a median not based on rows

I have a sample of cars in my table and I would like to calculate the median price for my sample with SQL. What is the best way to do it?
+-----+-------+----------+
| Car | Price | Quantity |
+-----+-------+----------+
| A   | 100   | 2        |
| B   | 150   | 4        |
| C   | 200   | 8        |
+-----+-------+----------+
I know that I can use percentile_cont (or percentile_disc) if my table is like this:
+-----+-------+
| Car | Price |
+-----+-------+
| A   | 100   |
| A   | 100   |
| B   | 150   |
| B   | 150   |
| B   | 150   |
| B   | 150   |
| C   | 200   |
| C   | 200   |
| C   | 200   |
| C   | 200   |
| C   | 200   |
| C   | 200   |
| C   | 200   |
| C   | 200   |
+-----+-------+
But in the real world, my first table has about 100 million rows and the second table would have about 3 billion rows (and moreover I don't know how to transform my first table into the second).

Here is a way to do this in SQL Server.
The first step is to calculate the indexes corresponding to the lower and upper bounds for the median: with an odd number of elements the lower and upper bounds are the same; otherwise they are the (x/2)-th and (x/2 + 1)-th values.
Then I compute the cumulative sum of the quantity and use that to pick the elements corresponding to the lower and upper bounds, as follows:
with median_dt as (
    select case when sum(quantity) % 2 = 0
                then sum(quantity) / 2
                else sum(quantity) / 2 + 1
           end as lower_limit,
           case when sum(quantity) % 2 = 0
                then sum(quantity) / 2 + 1
                else sum(quantity) / 2 + 1
           end as upper_limit
    from t
),
data as (
    select *, sum(quantity) over (order by price asc) as cum_sum
    from t
),
rnk_val as (
    select *
    from (
        select price, row_number() over (order by d.cum_sum asc) as rnk
        from data d
        join median_dt b on b.lower_limit <= d.cum_sum
    ) x
    where x.rnk = 1
    union all
    select *
    from (
        select price, row_number() over (order by d.cum_sum asc) as rnk
        from data d
        join median_dt b on b.upper_limit <= d.cum_sum
    ) x
    where x.rnk = 1
)
select avg(price) as median
from rnk_val
+--------+
| median |
+--------+
|    200 |
+--------+
db fiddle link
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c5cfa645a22aa9c135032eb28f1749f6
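The lower/upper-bound logic above can be sanity-checked outside the database as well; here is a minimal pure-Python sketch of the same calculation (the `weighted_median` helper name is made up for illustration):

```python
def weighted_median(rows):
    """rows: (price, quantity) pairs; returns the quantity-weighted median price.

    Mirrors the SQL: compute the positions of the lower and upper central
    elements, then walk the cumulative quantities to find the prices at
    those positions and average them.
    """
    rows = sorted(rows)                      # order by price, like the window frame
    total = sum(q for _, q in rows)
    if total % 2 == 0:
        lower, upper = total // 2, total // 2 + 1
    else:
        lower = upper = total // 2 + 1
    cum, lo_price, hi_price = 0, None, None
    for price, qty in rows:
        cum += qty                           # running sum(quantity) over (order by price)
        if lo_price is None and cum >= lower:
            lo_price = price
        if hi_price is None and cum >= upper:
            hi_price = price
    return (lo_price + hi_price) / 2

print(weighted_median([(100, 2), (150, 4), (200, 8)]))  # 200.0
```

With the sample data the total quantity is 14, so the 7th and 8th values (both 200) are averaged, matching the fiddle's result.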

This looks right on a few results, but try it on a larger set to double-check.
One caveat: the median price per car should be weighted by quantity (the number of cars at each price), not by price * quantity. You can compute it with a running total of quantities ordered by price (use a CTE or sub-query if you prefer; I'm querying table1 directly here).
Run this query, which looks for the first price at which the running quantity reaches half the grand total:
select min(price) as median
from (
select price,
sum(quantity) over (order by price) as rollsum,
sum(quantity) over () as total
from table1
)a
where rollsum >= total / 2.0
Correctly returns a value of $200. (Strictly this is the lower median; with an even total quantity you may want to average it with the next distinct price, as in the previous answer.)

Related

Check if Item has Data

I have a table that contains a whole lot of fields.
What I am trying to do is see if any items are missing certain fields.
Example of data:
+--------+----------+-------+
| ITEMNO | OPTFIELD | VALUE |
+--------+----------+-------+
| 0      | x        | 1     |
+--------+----------+-------+
| 0      | x        | 1     |
+--------+----------+-------+
| 0      | x        | 1     |
+--------+----------+-------+
| 0      | x        | 1     |
+--------+----------+-------+
| 0      | x        | 1     |
+--------+----------+-------+
There are 4 "OPTFIELD" values which I want to check that every "ITEMNO" has.
So the logic I want to apply is something along the lines of:
Show all items that do not have the "OPTFIELD" - "LABEL","PG4","PLINE","BRAND"
Is this even possible?
Your data makes no sense. From the description of your question, it looks like you want itemno that do not have all 4 optfields. For this, one method uses aggregation:
select itemno
from mytable
where optfield in ('LABEL', 'PG4', 'PLINE', 'BRAND')
group by itemno
having count(distinct optfield) < 4
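A quick way to sanity-check the aggregation approach, here with SQLite from Python and made-up sample rows (item 1 complete, item 2 missing BRAND, with a duplicated pair to show why counting distinct optfields is safer than count(*) when pairs can repeat, as they do in the question's data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (itemno integer, optfield text)")
conn.executemany(
    "insert into mytable values (?, ?)",
    [(1, "LABEL"), (1, "PG4"), (1, "PLINE"), (1, "BRAND"),   # complete item
     (2, "LABEL"), (2, "LABEL"), (2, "PG4"), (2, "PLINE")],  # BRAND missing, LABEL duplicated
)

# itemnos that do not have all 4 optfields; count(distinct ...) guards
# against duplicated (itemno, optfield) pairs inflating the count
rows = conn.execute("""
    select itemno
    from mytable
    where optfield in ('LABEL', 'PG4', 'PLINE', 'BRAND')
    group by itemno
    having count(distinct optfield) < 4
""").fetchall()
print(rows)  # [(2,)]
```

With plain count(*), item 2 would slip through here, since its four rows only cover three distinct optfields.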
On the other hand, if you want to exhibit all missing (itemno, optfield) tuples, then you can cross join the list of itemnos with a derived table of optfields, then use not exists:
select i.itemno, o.optfield
from (select distinct itemno from mytable) i
cross join (values ('LABEL'), ('PG4'), ('PLINE'), ('BRAND')) o(optfield)
where not exists (
select 1
from mytable t
where t.itemno = i.itemno and t.optfield = o.optfield
)
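The cross join / not exists variant can be exercised the same way; a SQLite sketch (SQLite takes the VALUES list as a CTE rather than an aliased derived table, and the sample rows are again made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (itemno integer, optfield text)")
conn.executemany(
    "insert into mytable values (?, ?)",
    [(1, "LABEL"), (1, "PG4"), (1, "PLINE"), (1, "BRAND"),
     (2, "LABEL"), (2, "PG4")],   # item 2 lacks PLINE and BRAND
)

# every (itemno, optfield) pair that is missing
rows = conn.execute("""
    with opts(optfield) as (values ('LABEL'), ('PG4'), ('PLINE'), ('BRAND'))
    select i.itemno, o.optfield
    from (select distinct itemno from mytable) i
    cross join opts o
    where not exists (
        select 1 from mytable t
        where t.itemno = i.itemno and t.optfield = o.optfield
    )
    order by i.itemno, o.optfield
""").fetchall()
print(rows)  # [(2, 'BRAND'), (2, 'PLINE')]
```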

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1  | 8091     | 0.9    |
| 1  | 8092     | 20     |
| 1  | 8095     | 0.22   |
| 1  | 8096     | 0.23   |
| 1  | 8098     | 0.23   |
| 2  | 8095     | 12     |
| 2  | 8096     | 18     |
| 2  | 8097     | 3      |
| 2  | 8098     | 0.25   |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1  | 1                             |
| 1  | 2                             |
| 1  | 1                             |
| 2  | 1                             |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap in the date_key column, the range must not carry over to the next value, as with ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with the classic gaps-and-islands technique: keep only the credit < 1 rows, then note that date_key minus row_number() (per id, in date_key order) is constant within each run of consecutive days, so it serves as a group key. Thereafter it is just a group by. (A plain running sum would merge runs separated by a date gap, such as ID 1's 8096 and 8098.)
select id, count(*) as range_days_credit_lt_1
from (select id, date_key,
             date_key - row_number() over (partition by id order by date_key) as grp
      from tbl
      where credit < 1
     ) t
group by id, grp
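To double-check the grouping logic against the question's sample data, here is a small SQLite harness; the date_key - row_number() difference is constant within each run of consecutive credit < 1 days, so it works as a per-run group key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (id integer, date_key integer, credit real)")
conn.executemany("insert into tbl values (?, ?, ?)", [
    (1, 8091, 0.9), (1, 8092, 20), (1, 8095, 0.22), (1, 8096, 0.23),
    (1, 8098, 0.23), (2, 8095, 12), (2, 8096, 18), (2, 8097, 3), (2, 8098, 0.25),
])

# count the length of each run of consecutive credit < 1 days per id
rows = conn.execute("""
    select id, count(*) as range_days_credit_lt_1
    from (select id, date_key,
                 date_key - row_number() over (partition by id order by date_key) as grp
          from tbl
          where credit < 1) t
    group by id, grp
    order by id, grp
""").fetchall()
print(rows)  # [(1, 1), (1, 2), (1, 1), (2, 1)]
```

This reproduces the desired output: the gap between 8096 and 8098 correctly splits ID 1's last two qualifying days into separate ranges.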
The key is to collapse each consecutive sequence and compute its length; I achieved this in a relatively clumsy way:
with t_test as
(
  select num, row_number() over (order by num) as rn
  from
  (
    select explode(array(1,3,4,5,6,9,10,15)) as num
  ) nums
)
select length(sign) + 1 as run_length
from
(
  select explode(continue_sign) as sign
  from
  (
    select split(concat_ws('', collect_list(if(d > 1, 'v', d))), 'v') as continue_sign
    from
    (
      select t0.num - t1.num as d
      from t_test t0
      join t_test t1 on t0.rn = t1.rn + 1
    ) diffs
  ) signs
) runs
Get the previous number b in the sequence for each original a;
Check whether a - b > 1, which signals a "gap", marked as 'v';
Merge all the a - b values into one string, split it on 'v', and compute each piece's length.
To get the ID column out as well, another string encoding the id would have to be considered.

PostgreSQL : SQL Request with a Group By and a Percentage on two differents tables

I'm currently stuck on a complex query (with a join):
I have this table "DATA":
order | product
----------------
1     | A
1     | B
2     | A
2     | D
3     | A
3     | C
4     | A
4     | B
5     | Y
5     | Z
6     | W
6     | A
7     | A
And this table "DICO":
order | couple | first | second
-------------------------------
1     | A-B    | A     | B
2     | A-D    | A     | D
3     | A-C    | A     | C
4     | A-B    | A     | B
5     | Y-Z    | Y     | Z
6     | W-A    | W     | A
I would like to obtain, on one line :
couple | count | total1stElem | %1stElem | total2ndElem | %2ndElem
------------------------------------------------------------------
A-B    | 2     | 6            | 33%      | 2            | 100%
A-D    | 1     | 6            | 16%      | 1            | 100%
A-C    | 1     | 6            | 16%      | 1            | 100%
Y-Z    | 1     | 1            | 100%     | 1            | 100%
W-A    | 1     | 1            | 100%     | 6            | 16%
Information (using the 1st line as an example):
total1stElem : count of ALL 'A' in table DATA (all occurrences of A in DATA)
total2ndElem : count of ALL 'B' in table DATA (all occurrences of B in DATA)
count : the number of 'A-B' occurrences in table DICO
%1stElem = ( count / total1stElem ) * 100
%2ndElem = ( count / total2ndElem ) * 100
I started from this query:
select couple, count(*),
sum(count(*)) over (partition by first) as total,
(count(*) * 1.0 / sum(count(*)) over (partition by first) ) as ratio
from dico1
group by couple, first ORDER BY ratio DESC;
And I want to do something like :
select couple, count(*) as COUNT,
count(*) over (partition by product #FROM DATA WHERE product = first#) as total1stElem,
(count(*) * 1.0 / sum(count(*)) over (partition by product #FROM DATA WHERE product = first#) as %1stElem
count(*) over (partition by product #FROM DATA WHERE product = second#) as total2ndElem,
(count(*) * 1.0 / sum(count(*)) over (partition by product #FROM DATA WHERE product = second#) as %2ndElem
from dico1
group by couple, first ORDER BY COUNT DESC;
I'm totally stuck on the join part of my query. Can somebody help me? I've been helped with this kind of query on Oracle, but unfortunately it's impossible to adapt the UNPIVOT and PIVOT functions to PostgreSQL.
I'd create CTEs that aggregate each table and count the occurrences you listed, and join dico's aggregation on data's aggregation twice, once for first and once for second:
WITH data_count AS (
SELECT product, COUNT(*) AS product_count
FROM data
GROUP BY product
),
dico_count AS (
SELECT couple, first, second, COUNT(*) AS dico_count
FROM dico
GROUP BY couple, first, second
)
SELECT couple,
dico_count,
data1.product_count AS total1stElem,
TRUNC(dico_count * 100.0 / data1.product_count) AS percent1stElem,
data2.product_count AS total2ndElem,
TRUNC(dico_count * 100.0 / data2.product_count) AS percent2ndElem
FROM dico_count dico
JOIN data_count data1 ON dico.first = data1.product
JOIN data_count data2 ON dico.second = data2.product
ORDER BY 1
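As a sanity check on the expected numbers, the same counts can be reproduced in a few lines of Python (sample rows copied from the question; integer // stands in for TRUNC):

```python
from collections import Counter

# DATA is (order, product); DICO is (couple, first, second)
data = [(1, 'A'), (1, 'B'), (2, 'A'), (2, 'D'), (3, 'A'), (3, 'C'),
        (4, 'A'), (4, 'B'), (5, 'Y'), (5, 'Z'), (6, 'W'), (6, 'A'), (7, 'A')]
dico = [('A-B', 'A', 'B'), ('A-D', 'A', 'D'), ('A-C', 'A', 'C'),
        ('A-B', 'A', 'B'), ('Y-Z', 'Y', 'Z'), ('W-A', 'W', 'A')]

product_count = Counter(p for _, p in data)    # like the data_count CTE
couple_count = Counter(c for c, _, _ in dico)  # like the dico_count CTE
first = {c: f for c, f, _ in dico}
second = {c: s for c, _, s in dico}

summary = []
for couple, n in couple_count.items():
    t1, t2 = product_count[first[couple]], product_count[second[couple]]
    # integer // matches the TRUNC(...) in the SQL
    summary.append((couple, n, t1, n * 100 // t1, t2, n * 100 // t2))

for row in summary:
    print(row)
```

This prints ('A-B', 2, 6, 33, 2, 100) down to ('W-A', 1, 1, 100, 6, 16), matching the table the question asks for.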

Find one single row for a column with a unique value using SQL

I have a table which contains data similar to this:
RowID | CustomerID | Quantity | Type     | .....
1     | 345        | 100      | Software | .....
2     | 1280       | 200      | Software | .....
3     | 456        | 20       | Hub      | .....
4     | 345        | 100      | Software | .....
5     | 345        | 180      | Monitor  | .....
6     | 23         | 15       | Router   | .....
7     | 1280       | 120      | Software | .....
8     | 345        | 5        | Mac      | .....
....  | ....       | ...      | .....    | .....
The table has hundreds of thousands of rows. As you can see, the CustomerID column has duplicates.
What I want to do is to find EXACTLY ONE row for each unique CustomerID and Type combination and with Quantity more than 10.
For example, for the above table, I want to get:
RowID | CustomerID | Quantity | Type     | .....
2     | 1280       | 200      | Software | .....
3     | 456        | 20       | Hub      | .....
4     | 345        | 100      | Software | .....
5     | 345        | 180      | Monitor  | .....
6     | 23         | 15       | Router   | .....
What I tried to do is:
select distinct CustomerID, Type from MyTable
where Quantity > 10
Which gives me:
CustomerID | Type
1280       | Software
456        | Hub
345        | Software
345        | Monitor
23         | Router
But I don't know how to select the other columns, because if I do:
select distinct CustomerID, Type, RowID, Quantity from MyTable
where Quantity > 10
It returns every row, because the RowID is unique.
I think maybe I should use a subquery iterating over the result of the above query. Can someone help me with this?
Use ROW_NUMBER with PARTITION BY. This lets you group all similar rows together and then query just the first row from each group. Note: an ORDER BY must be specified in the OVER clause even if you don't otherwise use the value, but here it is useful for pulling the combination with the highest quantity. If you also want the Quantity column, add it to the select list in the subquery.
select CustomerId, Type
from (
    select CustomerId,
           Type,
           row_number() over (partition by CustomerId, Type
                              order by Quantity desc) as rn
    from MyTable
    where Quantity > 10
) dta
where rn = 1
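A quick SQLite check of this pattern against the question's sample rows; RowID and Quantity are added to the inner select so the whole winning row comes back (the Quantity tie between RowIDs 1 and 4 for customer 345's Software rows means either may win):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table MyTable (RowID integer, CustomerID integer, Quantity integer, Type text)")
conn.executemany("insert into MyTable values (?, ?, ?, ?)", [
    (1, 345, 100, 'Software'), (2, 1280, 200, 'Software'), (3, 456, 20, 'Hub'),
    (4, 345, 100, 'Software'), (5, 345, 180, 'Monitor'), (6, 23, 15, 'Router'),
    (7, 1280, 120, 'Software'), (8, 345, 5, 'Mac'),
])

# keep the highest-Quantity row per (CustomerID, Type), Quantity > 10 only
rows = conn.execute("""
    select RowID, CustomerID, Quantity, Type
    from (select RowID, CustomerID, Quantity, Type,
                 row_number() over (partition by CustomerID, Type
                                    order by Quantity desc) as rn
          from MyTable
          where Quantity > 10) dta
    where rn = 1
    order by CustomerID
""").fetchall()
print(rows)
```

Five rows come back, one per unique (CustomerID, Type) combination, and the Mac row (Quantity 5) is filtered out.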
Something like this will work (unless you have more requirements that you didn't mention):
SELECT CustomerID, Type, SUM(Quantity) AS Quantity
FROM MyTable
GROUP BY CustomerID, Type
HAVING SUM(Quantity) > 10
You need to choose which one of the "duplicated" rows to retrieve.
I put "duplicated" in quotes because the rows are not technically duplicates:
+-------+------------+----------+----------+
| RowID | CustomerID | Type | Quantity |
+-------+------------+----------+----------+
| 1     | 345        | Software | 100      |
| 2     | 345        | Software | 200      |
| 3     | 345        | Software | 300      |
+-------+------------+----------+----------+
All of these are different rows because of the differing RowID and Quantity columns.
So you must specify which one of them you want to retrieve.
For this example I will use the row with the minimum RowID and Quantity.
To tell SQL to pick that one, I order the table by RowID and Quantity in ascending order and use a correlated subquery against the same table,
so I can pick up the first row, the one with the lowest RowID and Quantity for each CustomerID and Type.
+-------+------------+----------+----------+
| RowID | CustomerID | Type | Quantity |
+-------+------------+----------+----------+
| 1     | 345        | Software | 100      |
+-------+------------+----------+----------+
The SQL code for this is the following:
SELECT
*
FROM
MyTable originalTable
WHERE
originalTable.Quantity > 10 AND
originalTable.RowID =
(
SELECT TOP 1 orderedTable.RowID
FROM MyTable orderedTable
WHERE orderedTable.CustomerID = originalTable.CustomerID AND orderedTable.Type = originalTable.Type
ORDER BY orderedTable.RowID ASC, orderedTable.Quantity ASC
)
One way is to use the row_number window function: partition the data by CustomerID and Type, then keep only the first row in each partition.
WITH Uniq AS (
SELECT
CustomerID, Type, RowID, Quantity,
rn = ROW_NUMBER() OVER (PARTITION BY CustomerID, Type ORDER BY RowID)
FROM MyTable WHERE Quantity > 10
)
SELECT * FROM Uniq WHERE rn = 1;
SQL Fiddle
Or you could find a unique RowID (min or max) for each group of CustomerID and Type and use that as a source in a join, either as a common table expression or a derived table:
WITH Uniq AS (
SELECT MIN(RowID) RowID FROM MyTable WHERE Quantity > 10 GROUP BY CustomerID, Type
)
SELECT MyTable.* FROM MyTable JOIN Uniq ON MyTable.RowID = Uniq.RowID
Sample SQL Fiddle
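The MIN(RowID) variant can be exercised the same way; a SQLite sketch over the question's sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table MyTable (RowID integer, CustomerID integer, Quantity integer, Type text)")
conn.executemany("insert into MyTable values (?, ?, ?, ?)", [
    (1, 345, 100, 'Software'), (2, 1280, 200, 'Software'), (3, 456, 20, 'Hub'),
    (4, 345, 100, 'Software'), (5, 345, 180, 'Monitor'), (6, 23, 15, 'Router'),
    (7, 1280, 120, 'Software'), (8, 345, 5, 'Mac'),
])

# lowest RowID per (CustomerID, Type) among Quantity > 10 rows, joined back
rows = conn.execute("""
    with Uniq as (
        select min(RowID) as RowID
        from MyTable
        where Quantity > 10
        group by CustomerID, Type
    )
    select MyTable.RowID, MyTable.CustomerID, MyTable.Quantity, MyTable.Type
    from MyTable join Uniq on MyTable.RowID = Uniq.RowID
    order by MyTable.RowID
""").fetchall()
print(rows)  # [(1, 345, 100, 'Software'), (2, 1280, 200, 'Software'), (3, 456, 20, 'Hub'), (5, 345, 180, 'Monitor'), (6, 23, 15, 'Router')]
```

This returns RowID 1 rather than 4 for (345, Software); either satisfies the "exactly one row" requirement.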

how to get median for every record?

There's no median function in SQL Server, so I'm using this wonderful suggestion:
https://stackoverflow.com/a/2026609/117700
This computes the median over an entire dataset, but I need the median per record.
My dataset is:
+-----------+-------------+
| client_id | TimesTested |
+-----------+-------------+
| 214220    | 1           |
| 215425    | 1           |
| 212839    | 4           |
| 215249    | 1           |
| 210498    | 3           |
| 110655    | 1           |
| 110655    | 1           |
| 110655    | 12          |
| 215425    | 4           |
| 100196    | 1           |
| 110032    | 1           |
| 110032    | 1           |
| 101944    | 3           |
| 101232    | 2           |
| 101232    | 1           |
+-----------+-------------+
here's the query I am using:
select client_id,
(
SELECT
(
(SELECT MAX(TimesTested ) FROM
(SELECT TOP 50 PERCENT t.TimesTested
FROM counted3 t
where t.timestested>1
and CLIENT_ID=t.CLIENT_ID
ORDER BY t.TimesTested ) AS BottomHalf)
+
(SELECT MIN(TimesTested ) FROM
(SELECT TOP 50 PERCENT t.TimesTested
FROM counted3 t
where t.timestested>1
and CLIENT_ID=t.CLIENT_ID
ORDER BY t.TimesTested DESC) AS TopHalf)
) / 2 AS Median
) TotalAvgTestFreq
from counted3
group by client_id
but it is giving me funny data:
+-----------+------------------+
| client_id | median???????????|
+-----------+------------------+
| 100007 | 84 |
| 100008 | 84 |
| 100011 | 84 |
| 100014 | 84 |
| 100026 | 84 |
| 100027 | 84 |
| 100028 | 84 |
| 100029 | 84 |
| 100042 | 84 |
| 100043 | 84 |
| 100071 | 84 |
| 100072 | 84 |
| 100074 | 84 |
+-----------+------------------+
How can I get the median for every client_id?
I am currently trying to use this awesome query from Aaron's site:
select c3.client_id,(
SELECT AVG(1.0 * TimesTested ) median
FROM
(
SELECT o.TimesTested ,
rn = ROW_NUMBER() OVER (ORDER BY o.TimesTested ), c.c
FROM counted3 AS o
CROSS JOIN (SELECT c = COUNT(*) FROM counted3) AS c
where count>1
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2)
) a
from counted3 c3
group by c3.client_id
unfortunately, as Richardthekiwi points out:
it's for a single median whereas this question is about a median
per-partition
I would like to know how I can join it on counted3 to get the median per partition.
Note: If testFreq is an int or bigint type, you need to CAST it before taking an average, otherwise you'll get integer division, e.g. (2+5)/2 => 3 if 2 and 5 are the median records - e.g. AVG(Cast(testfreq as float)).
select client_id, avg(testfreq) median_testfreq
from
(
select client_id,
testfreq,
rn=row_number() over (partition by CLIENT_ID
order by testfreq),
c=count(testfreq) over (partition by CLIENT_ID)
from tbk
where timestested>1
) g
where rn in ((c+1)/2, (c+2)/2)
group by client_id;
The median is either the central record of an ODD number of rows or the average of the two central records of an EVEN number of rows. This is handled by the condition rn in ((c+1)/2, (c+2)/2), which with integer division picks either the one or the two records required.
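The per-partition median logic is easy to verify with SQLite's window functions; a sketch over the question's sample data, without the timestested > 1 filter so every client keeps all its rows (table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table counted3 (client_id integer, TimesTested integer)")
conn.executemany("insert into counted3 values (?, ?)", [
    (214220, 1), (215425, 1), (212839, 4), (215249, 1), (210498, 3),
    (110655, 1), (110655, 1), (110655, 12), (215425, 4), (100196, 1),
    (110032, 1), (110032, 1), (101944, 3), (101232, 2), (101232, 1),
])

# rn in ((c+1)/2, (c+2)/2) picks the middle row (odd c) or middle two (even c)
medians = dict(conn.execute("""
    select client_id, avg(1.0 * TimesTested) as median
    from (select client_id, TimesTested,
                 row_number() over (partition by client_id order by TimesTested) as rn,
                 count(*) over (partition by client_id) as c
          from counted3) g
    where rn in ((c + 1) / 2, (c + 2) / 2)
    group by client_id
""").fetchall())
print(medians[110655], medians[215425], medians[101232])  # 1.0 2.5 1.5
```

Client 110655 has values [1, 1, 12], so the middle row gives 1.0; 215425 has [1, 4], so the two central rows average to 2.5.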
try this:
select client_id,
(
SELECT
(
(SELECT MAX(testfreq) FROM
(SELECT TOP 50 PERCENT t.testfreq
FROM counted3 t
where t.timestested>1
and c3.CLIENT_ID=t.CLIENT_ID
ORDER BY t.testfreq) AS BottomHalf)
+
(SELECT MIN(testfreq) FROM
(SELECT TOP 50 PERCENT t.testfreq
FROM counted3 t
where t.timestested>1
and c3.CLIENT_ID=t.CLIENT_ID
ORDER BY t.testfreq DESC) AS TopHalf)
) / 2 AS Median
) TotalAvgTestFreq
from counted3 c3
group by client_id
I added the c3 alias to the outer CLIENT_ID references and the outer table.