I'm doing something wrong with calculating the median in Hive - sql

My Hive table currently looks like this:
Numbers
0
0
-0.12745098
-0.218905473
0.026011561
0.235294118
-0.028
-0.052356021
0.052753355
0.008032129
0.012768817
0.115384615
0.040816327
The type is DOUBLE_TYPE. I would like to calculate the median. I would expect the answer to be 0.008032129, since this is the 7th of the 13 observations when the numbers are sorted.
When I run this code (as suggested in "How to calculate median in Hive"):
select percentile_approx(Numbers, 0.5) AS Numbers
from tryout1
The answer I get is 0.0040160642570281121. This is unexpected, and it is not even one of the numbers in my list! Does anyone know why Hive gives me this number, and what I should fix to make it work? If you know an entirely different way to calculate the median, I am also very interested!

Indeed, the percentile_approx function in Hive does not perform well here.
Kudos to Liza for getting an approximate answer.
From my trials:
select percentile_approx(numbers , 0.5 , 10 ) as A_mdn from tryout1 ;
-0.007249852187499999
From Liza:
select (percentile(cast((numbers*1000000) as BIGINT), 0.5))/1000000 as A_mdn from tryout1;
0.008032

You can use the percentile function to compute the median. Since percentile only accepts integer types, try casting the whole column to INT or BIGINT and see whether you come close to the answer. Try this:
select percentile(cast(g_rek_brutowinst as BIGINT), 0.5) AS g_rek_brutowinst from tryout1
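An entirely different route, offered only as a sketch: rank the rows and pick the middle one (or average the two middle ones when the row count is even). This assumes a Hive version that supports windowing functions and reuses the table tryout1 and column numbers from the question; the aliases rn, cnt and t are made up for the example.

```sql
-- Exact median by ranking: no approximation and no cast to an integer type.
SELECT AVG(numbers) AS median
FROM (
  SELECT numbers,
         ROW_NUMBER() OVER (ORDER BY numbers) AS rn,   -- position in sorted order
         COUNT(*)     OVER ()                 AS cnt   -- total number of rows
  FROM tryout1
) t
-- Odd cnt: both expressions give the same middle row.
-- Even cnt: they give the two middle rows, which AVG then averages.
WHERE rn IN (FLOOR((cnt + 1) / 2), FLOOR((cnt + 2) / 2));
```

With the 13 rows from the question both expressions evaluate to 7, so the query returns the 7th sorted value, 0.008032129. Note that this sorts the whole table in a single window, so it may be slow on very large tables.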

how to calculate prevalence using sql code

I am trying to calculate prevalence in SQL and am stuck writing the code.
I want the calculation to be automated.
Using count, I have checked that the sample size is 1453477 and that the number of people who have the disease is 851451.
The formula for prevalence is (number of people who have the disease) / (sample size).
select (COUNT(condition_id)/COUNT(person_id)) as prevalence
from disease
where condition_id=12345;
When I run the code above, I get 1 as the output, whereas I am supposed to get 0.5858.
Can someone please help me out?
Thanks!
In your current query you count the number of rows in the disease table, once using the column condition_id, once using the column person_id. But the number of rows is the same - this is why you get 1 as a result.
I think you need to find the number of different values for these columns. This can be done using count distinct:
select (COUNT(DISTINCT condition_id)/COUNT(DISTINCT person_id)) as prevalence
from disease
where condition_id=12345;
You can cast with either
count(...)/count(...)::numeric(6,4) or
count(...)/count(...)::decimal
The important point is to apply the cast to the numerator or the denominator (here, the denominator). Do not apply it to the whole division, as in
(count(...)/count(...))::numeric(6,4), which performs the integer division first and only then casts the already truncated result.
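To make the difference concrete, here is a small sketch using the two counts from the question as literals. It assumes PostgreSQL, since the answers use the :: cast syntax; the column aliases are made up for the example.

```sql
-- Integer / integer truncates; the cast has to happen before the division.
SELECT 851451 / 1453477                  AS int_division,      -- 0
       851451 / 1453477::numeric         AS cast_denominator,  -- about 0.5858
       (851451 / 1453477)::numeric(6,4)  AS cast_after;        -- 0.0000
```

The :: operator binds more tightly than /, so in the second column only the denominator is cast and the division is carried out in numeric arithmetic.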
I am pretty sure that the logic that you want is something like this:
select avg( (condition_id = 12345)::int )
from disease;
Your version doesn't have the sample size, because you are filtering out people without the condition.
If you have duplicate people in the data, then this is a little more complicated. One method is:
select (count(distinct person_id) filter (where condition_id = 12345)::numeric /
        count(distinct person_id)
       )
from disease;

Round function query 2 to 3 arguments?

I am attempting to find the revenue per distinct user in this query but seem to be running into this error.
select concat('$',format(cast(round(sum(total)/count(distinct(customers))),2)
as int),N'N','en-US')
from table
My error:
The round function requires 2 to 3 arguments
I suspect you mean:
SELECT CONCAT('$',FORMAT(CAST(ROUND(SUM(Total)/COUNT(DISTINCT customers),2) AS int),N'N','en-US'))
FROM [table];
But, really, worry about the formatting of your values in your presentation layer (the FORMAT and CONCAT don't need to be there).
Also, why ROUND({expr},2) and then CAST({expr} AS int)? Why not ROUND({expr},0)?
For instance, pass 2 as the length argument to ROUND:
round(sum(total)/count(distinct(customers)),2)

Select SUM() SQL

I need to get the total from a column with SQL, but it's not working. Can anybody see what I am doing wrong?
SELECT a.Artikelnummer
,a.Artikelnamn
,a.Antalperpall
,COUNT(*) AS AntalArtiklar
,SUM(e.Antalpallar) AS TotalPall
,SUM(e.Antalperpall) AS TotalStyck
FROM Artikel AS a
INNER JOIN Evig AS e ON a.ArtikelnummerID = e.ArtikelnummerID
WHERE (e.Datum <= '{0}')
AND (a.Kundkund = '{1}')
AND (a.Artikelnamn = '{2}')
GROUP BY a.Artikelnummer
,a.Artikelnamn
,a.Antalperpall
SUM(e.Antalperpall) AS TotalStyck is the one that returns a strange value. What I want to do is take the integer value in each row and get a total from that.
OK, I went down to the basement and visited the server, and I found the problem. I needed to multiply by Antalpallar, like this: SUM(e.Antalperpall * ABS(e.Antalpallar)). But it is still not working, and I think it is because of the negative values.
So where there is a negative value in Antalpallar, like -1200 * -2, the result should be -2400, but I don't think it's doing that, or is it? It is stuff going in and out of a warehouse.
Anyhow, the final value of adding those together should be 14320, but I get one of 20,000-something, and without ABS() (or with it) a sum of 5,000-something.
Does anyone know how to write SUM(e.Antalperpall * ABS(e.Antalpallar)) to get the value I want?
You might want to start by ruling out strange values (characters) in Antalperpall. Use sum(cast(e.Antalperpall as Money)) and add a filter clause
where ISNUMERIC(e.Antalperpall) = 1. If there are any strange values in the field and you do not filter them out, you will get a conversion error.

Converting pounds to kilos in SQL

I am trying to pull data from a table with the filter weight < 25 kg, but my table stores weight in pounds. I tried the SQL below; can someone please tell me whether this is the right way to do it, or whether there is another way?
select * from dbo.abc
where (round((WEIGHT * 0.453592 ),0) < 25)
Your solution would work, but it's not sargable. A better solution would be to convert your 25 kg to pounds. That way, if you have an index on your WEIGHT column, the query analyzer could make use of it.
One additional note: Why round to 0 decimal places? You'll lose accuracy that way. Unless you have some requirement to do so, I'd drop the rounding. It's unnecessary overhead.
As other people mentioned, you don't want to convert weight as it will cause SQL Server not to use your index. So try this instead:
SELECT *
FROM dbo.abc
WHERE WEIGHT < ROUND(25/.453592,4)

In Oracle, find number which is larger than 80% of a set of a numbers

Assume I have a table with a column of integers in Oracle. There are a good amount of rows; somewhere in the millions. I want to write a query that gives me back an integer that is larger than 80% of all of the numbers in table. What is the best way to approach this?
If it matters, this is Oracle 10g r1.
Sounds like you want to use the PERCENTILE_DISC function if you want an actual value from the set, or PERCENTILE_CONT if you want an interpolated value for a particular percentile, say 80%:
SELECT PERCENTILE_DISC(0.8)
WITHIN GROUP(ORDER BY integer_col ASC)
FROM some_table
EDIT
If you use PERCENTILE_DISC, it will return an actual value from the dataset, so if you wanted a larger value, you'd want to increment that by 1 (for an integer column).
I think you could use the NTILE function to divide the input into 5 buckets, then select the MIN(Column) from the top bucket.
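A minimal sketch of that NTILE idea, reusing the placeholder names some_table and integer_col from the answer above; the aliases bucket and p80_threshold are made up for the example.

```sql
-- Split the sorted rows into 5 equal-sized buckets; bucket 5 holds the top 20%.
SELECT MIN(integer_col) AS p80_threshold
FROM (
  SELECT integer_col,
         NTILE(5) OVER (ORDER BY integer_col) AS bucket
  FROM some_table
)
WHERE bucket = 5;
```

This returns the smallest value in the top bucket. If that value also occurs in bucket 4 (ties across the boundary), it is not strictly larger than 80% of the rows, so you may still want to add 1 as suggested for PERCENTILE_DISC.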