Flaw in my logic of understanding the percentile() function in Hive - sql

Apologies for the rather basic question, however I have been struggling to understand and find any useful examples for a problem I have using the percentile() function in Hive.
Let's say I have a basic table:
Name | ID | Salary
Tom | 25 | 20,000
Jim | 01 | 25,000
Larry | 72 | 80,000
King | 05 | 32,000
and I want a percentile value for each row (calculated using the Salary column).
What I've tried to use is
Select
Name,
ID,
Salary,
percentile(Salary, array(0.25, 0.5, 0.75)) as percentile_value
group by
Name,
ID,
Salary
However, the output was just the exact Salary values, which leads me to believe that I have misunderstood how this function works. I was expecting something along the lines of
0.25
0.5
0.75
0.25
If someone can point me in the right direction or help me further understand this it would be very helpful.

I think it's working fine. As per the documentation:
Returns the exact pth percentile (or percentiles p1, p2, ..) of a column in the group.
You are using Salary both in percentile() and in the GROUP BY, so every group holds a single Salary value. That is like issuing percentile(constant_value, array(0.25, 0.5, 0.75)), which will always return [constant_value,constant_value,constant_value].
As far as I know, percentile works over a range of values, so each group should contain multiple different values. Your sample data has all unique values, so I created my own data and experimented. Let me know what you think :)
My code and data are below. I inserted multiple values with the same id to calculate proper percentiles.
create table tmp2(id int, name string, sal int);
insert into tmp2 values (25, 'Larry',55000);
insert into tmp2 values (25, 'Larry',5000);
insert into tmp2 values (25, 'Larry',125000);
insert into tmp2 values (5, 'Tim',125000);
Select id, percentile(sal, array(0.25, 0.5, 0.75)) as percentile_value from tmp2 group by id ;
Result -
id percentile_value
5 [125000.0,125000.0,125000.0]
25 [30000.0,55000.0,90000.0]
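The numbers in that result are consistent with linear interpolation between the two closest ranks. Below is a minimal Python sketch that reproduces them, assuming the position in the sorted values is p * (n - 1). The helper name interpolated_percentile is made up for illustration; this is not Hive's actual implementation.

```python
import math


def interpolated_percentile(values, p):
    """Percentile by linear interpolation at position p * (n - 1).

    Assumption: this rule reproduces the Hive output shown above;
    it is an illustration, not Hive's own code.
    """
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        # The position falls exactly on one element.
        return float(ordered[lower])
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Same data as tmp2 above, keyed by id.
groups = {25: [55000, 5000, 125000], 5: [125000]}
result = {
    group_id: [interpolated_percentile(sals, p) for p in (0.25, 0.5, 0.75)]
    for group_id, sals in groups.items()
}
print(result)
```

A group with a single value, which is what grouping by Salary produces, returns that value for every requested percentile.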

How to take average of two columns row by row in SQL?

I have a table match which looks like this (please see attached image). I wanted to retrieve a dataset that had a column of average values for home_goal and away_goal using this code
SELECT
m.country_id,
m.season,
m.home_goal,
m.away_goal,
AVG(m.home_goal + m.away_goal) AS avg_goal
FROM match AS m;
However, I got this error
column "m.country_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: m.country_id,
My question is: why was GROUP BY clause required? Why couldn't SQL know how to take average of two columns row by row?
Thank you.
Try this:
SELECT
m.country_id,
m.season,
m.home_goal,
m.away_goal,
(m.home_goal + m.away_goal) / 2.0 AS avg_goal
FROM match AS m;
You were asked for a GROUP BY because AVG(), much like SUM(), works on multiple values of one column; every selected column that is not inside an aggregate then has to be listed in the GROUP BY.
You are looking to average two distinct columns, which is a row-wise operation rather than a column-wise one.
how to take average of two columns row by row?
You don't use AVG() for this; it is an aggregate function, that operates over a set of rows. Here, it seems like you just want a simple math computation:
SELECT
m.country_id,
m.season,
m.home_goal,
m.away_goal,
(m.home_goal + m.away_goal) / 2.0 AS avg_goal
FROM match AS m;
Note the decimal denominator (2.0): this avoids integer division in databases that implement it.
AVG() in the query above calculates the average of a column's values across rows, not the average of two values in the same row. It is an aggregate function, and that’s why the GROUP BY clause is required.
In order to take the average of two columns in the same row, you need to add them and divide by 2.
Let's consider the following table:
CREATE TABLE Numbers([x] int, [y] int, [category] nvarchar(10));
INSERT INTO Numbers ([x], [y], [category])
VALUES
(1, 11, 'odd'),
(2, 22, 'even'),
(3, 33, 'odd'),
(4, 44, 'even');
Here is an example of using two aggregate functions - AVG and SUM - with GROUP BY:
SELECT
Category,
AVG(x) as avg_x,
AVG(x+y) as avg_xy,
SUM(x) as sum_x,
SUM(x+y) as sum_xy
FROM Numbers
GROUP BY Category
The result has two rows:
Category avg_x avg_xy sum_x sum_xy
even 3 36 6 72
odd 2 24 4 48
Please note that Category is available in the SELECT part because the results are grouped by it. If a GROUP BY is not specified, then the result would be 1 row and Category is not available (which value should be displayed if we have sums and averages for multiple rows with different categories?).
What you want is to compute a new column and for this you don't use aggregate functions:
SELECT
(x+y)/2 as avg_xy,
(x+y) as sum_xy
FROM Numbers
This returns all rows:
avg_xy sum_xy
6 12
12 24
18 36
24 48
If your columns are integers, don't forget to handle rounding if needed. For example: (CAST(x AS DECIMAL)+y)/2 as avg_xy
The simple arithmetic calculation:
(m.home_goal + m.away_goal) / 2.0
is not exactly equivalent to AVG(), because NULL values mess it up. Databases that support lateral joins provide a pretty easy (and efficient) way to actually use AVG() within a row.
The safe version looks like:
(coalesce(m.home_goal, 0) + coalesce(m.away_goal, 0)) /
nullif( (case when m.home_goal is not null then 1 else 0 end +
case when m.away_goal is not null then 1 else 0 end
), 0
)
Some databases have syntax extensions that allow the expression to be simplified.
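To see how NULLs change the result, here is a small sketch that runs both expressions with Python's built-in sqlite3 module. The table name match_demo and its three rows are made up for illustration, and the `* 1.0` factor is an addition to force non-integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical demo table; only the two goal columns matter here.
conn.execute("CREATE TABLE match_demo (home_goal INTEGER, away_goal INTEGER)")
conn.executemany(
    "INSERT INTO match_demo VALUES (?, ?)",
    [(1, 2), (3, None), (None, None)],
)

rows = conn.execute(
    """
    SELECT
        (home_goal + away_goal) / 2.0 AS naive_avg,
        (COALESCE(home_goal, 0) + COALESCE(away_goal, 0)) * 1.0 /
        NULLIF(
            (CASE WHEN home_goal IS NOT NULL THEN 1 ELSE 0 END
             + CASE WHEN away_goal IS NOT NULL THEN 1 ELSE 0 END),
            0
        ) AS safe_avg
    FROM match_demo
    ORDER BY rowid
    """
).fetchall()

# Each tuple is (naive_avg, safe_avg) for one row.
print(rows)
```

With one NULL the plain arithmetic returns NULL, while the safe version averages only the values that are present, which is how AVG() treats NULLs. When both values are NULL, both versions return NULL.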

Postgres - Changing values within columns

I have a query that returns a wide dataset with one row per student and multiple columns per 'score':
Student ID score1 score2 score3...
12345 101 102 103
67890 102 103 104
The scores are not actual scores, but instead are score ids that need to be translated to actual scores.
I would like to return the actual scores instead of the score ids. I know that I can just write a bunch of CASE statements that will do the translation for each column, but there are about 20 columns that need to be translated. I'm hoping that there is a more efficient way of doing this.
Cheers,
Jonathon
You probably want to make a scores table and then join to that. That removes the need to write an absurd CASE query.
CREATE TABLE code_scores (
ScoreID INT
, Value INT);
INSERT INTO code_scores (scoreid, value)
VALUES
(101, 100)
, (102, 99);
SELECT studentID, score1, value
FROM yourtable
INNER JOIN code_scores
on score1 = scoreID;
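The join above translates one column. For several score columns, one option is to join the lookup table once per column under a different alias. The sketch below tries this with Python's sqlite3 module on three columns; the table name student_scores and the values for ids 103 and 104 are invented for the demo, and the same SQL shape should carry over to Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE code_scores (score_id INTEGER PRIMARY KEY, value INTEGER);
    -- Values for 103 and 104 are made up for this demo.
    INSERT INTO code_scores VALUES (101, 100), (102, 99), (103, 98), (104, 97);
    CREATE TABLE student_scores (
        student_id INTEGER, score1 INTEGER, score2 INTEGER, score3 INTEGER
    );
    INSERT INTO student_scores VALUES
        (12345, 101, 102, 103),
        (67890, 102, 103, 104);
    """
)

# Build one LEFT JOIN per score column so the SQL is not typed 20 times.
score_columns = ["score1", "score2", "score3"]
select_parts = ", ".join(
    f"c{i}.value AS {col}" for i, col in enumerate(score_columns)
)
join_parts = "\n".join(
    f"LEFT JOIN code_scores AS c{i} ON c{i}.score_id = s.{col}"
    for i, col in enumerate(score_columns)
)
query = (
    f"SELECT s.student_id, {select_parts}\n"
    f"FROM student_scores AS s\n{join_parts}\n"
    "ORDER BY s.student_id"
)
rows = conn.execute(query).fetchall()
print(rows)
```

LEFT JOIN is used here so that a student row survives even when a score id has no entry in the lookup table; the untranslated column then comes back as NULL.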

SELECT on INNER JOIN taking 9 hours (and counting) to complete

I'm using a sqlite database that I have from the output of another script. I have a query that is taking a huge amount of time to complete. The samples table and multiclass table both have the same ~4,000,000 names. The multiclass table has one row for each name (4 million rows), and the samples table could have one or many rows for each name (>100 million rows). I am joining on the names and summing the count grouped by the tax_id, day, and sample that the names belong to. This query should return ~25000 rows.
Here is a toy version of the schema and query I'm using:
SQL Fiddle
SQLite (SQL.js) Schema Setup:
CREATE TABLE samples
(
name varchar(20),
day integer,
sample integer,
count integer
);
CREATE TABLE multiclass
(
name varchar(20),
tax_id varchar(20),
details varchar(30)
);
INSERT INTO samples
(name, day, sample, count)
VALUES
('seq1', 204, 37, 50),
('seq2', 205, 37, 50),
('seq2', 206, 37, 50),
('seq3', 204, 37, 50),
('seq4', 205, 37, 50),
('seq4', 206, 37, 50);
INSERT INTO multiclass
(name, tax_id, details)
VALUES
('seq1', 'Vibrio', 'unimportant'),
('seq2', 'Shewenella', 'still_unimportant'),
('seq3', 'Vibrio', 'also_unimportant'),
('seq4', 'Shewenella', 'doesntmatter');
Query 1:
SELECT tax_id, day, sample, SUM(count)
FROM samples INNER JOIN multiclass USING(name)
GROUP BY tax_id, day, sample
ORDER BY day, sample;
Results:
| tax_id | day | sample | SUM(count) |
|------------|-----|--------|------------|
| Vibrio | 204 | 37 | 100 |
| Shewenella | 205 | 37 | 100 |
| Shewenella | 206 | 37 | 100 |
I am very new to SQL and am not sure how to proceed. This is a query I would only need to execute once, so I'm not sure adding indexes to the table is appropriate.
Is there a different way to construct the query to make it run faster? Would adding indexes make sense or take too long? If it is taking 9 hours, is it likely to still be hung up on the SQL, or is something else going wrong?
Edit: updated question to include database schema and intended results. I am currently building an index on the samples.name column; it's been running for over 4 hours (using a node on a cluster environment with 60 GB of RAM and many CPUs).
This query:
SELECT tax_id, day, sample, SUM(count)
FROM samples INNER JOIN
multiclass
ON samples.name = multiclass.name
GROUP BY tax_id, day, sample
ORDER BY day, sample;
is pretty simple. An index on either samples(name) or multiclass(name) would normally be recommended.
However, there is a hint in your question that both tables contain 4 million rows, but you are only expecting 25,000. I suspect that you have duplicate names in each table. To determine the number of intermediate rows generated by the join, run this query:
select sum(s.cnt * m.cnt), max(s.cnt * m.cnt)
from (select name, count(*) as cnt from samples group by name
) s join
(select name, count(*) as cnt from multiclass group by name
) m
on s.name = m.name;
I am guessing that you will get a really large number, explaining why the query is taking so long.
Unfortunately, at this point, I don't have a real answer on how to solve the problem, because your question doesn't specify what you actually want the query to produce. However, aggregating the tables before joining them is likely to be one possible solution.
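As a rough illustration of aggregating before joining, the sketch below runs both forms against the toy schema from the question using Python's sqlite3 module. It only shows that the two queries agree on this data; whether the pre-aggregated form is faster on the real tables is an assumption that would need measuring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (name TEXT, day INTEGER, sample INTEGER, count INTEGER);
    CREATE TABLE multiclass (name TEXT, tax_id TEXT, details TEXT);
    INSERT INTO samples VALUES
        ('seq1', 204, 37, 50), ('seq2', 205, 37, 50), ('seq2', 206, 37, 50),
        ('seq3', 204, 37, 50), ('seq4', 205, 37, 50), ('seq4', 206, 37, 50);
    INSERT INTO multiclass VALUES
        ('seq1', 'Vibrio', 'unimportant'),
        ('seq2', 'Shewenella', 'still_unimportant'),
        ('seq3', 'Vibrio', 'also_unimportant'),
        ('seq4', 'Shewenella', 'doesntmatter');
    """
)

# Original form: join first, then aggregate.
direct = conn.execute(
    """
    SELECT tax_id, day, sample, SUM(count)
    FROM samples INNER JOIN multiclass USING (name)
    GROUP BY tax_id, day, sample
    ORDER BY day, sample, tax_id
    """
).fetchall()

# Alternative: collapse samples per (name, day, sample) before the join.
pre_aggregated = conn.execute(
    """
    SELECT m.tax_id, s.day, s.sample, SUM(s.total)
    FROM (SELECT name, day, sample, SUM(count) AS total
          FROM samples
          GROUP BY name, day, sample) AS s
    INNER JOIN multiclass AS m ON m.name = s.name
    GROUP BY m.tax_id, s.day, s.sample
    ORDER BY s.day, s.sample, m.tax_id
    """
).fetchall()

print(direct)
print(pre_aggregated)
```

The two forms only stay equivalent if multiclass has one row per name; duplicate names there would multiply the sums in both queries.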
The issue was the version of sqlite3 that was installed on the cluster I was using. The version on the cluster was 3.6.20. It seems incredible, but downloading the binary for 3.9.2 from the sqlite website and running the exact same query completed in less than 10 minutes.

Project multiple rows into a single row based on columns values in sql

I would like to know how to project multiple related rows into a single row. For example, a product that comes in multiple parts will have multiple SKUs, but I want to project the multiple parts into a single row.
I'm sure this is possible, but I'm struggling to define the query for the desired result.
Given the example dataset
I would like to project my result to the following
What ends up in the product code or product name columns is irrelevant, essentially I just need a single row to represent these two rows.
How would I achieve this?
It depends on the format of the data stored in ProductCode and ProductName.
Depending on that, you have to write appropriate expressions that extract all the useful data.
Then, of course, you have to decide which ID to keep for the new rows.
In my example I do a simple transformation with substr(…) to extract the necessary data, and I use max(ID) to choose the ID for each row.
Test data:
insert table1(CustId, ProductCode, ProductName)
values
(10, 'Prod1Part1', 'Product1 Part1'),
(10, 'Prod1Part2', 'Product1 Part2'),
(10, 'Prod1Part3', 'Product1 Part3'),
(10, 'Prod2Part1', 'Product2 Part1'),
(10, 'Prod2Part2', 'Product2 Part2')
;
A query:
SELECT
(SELECT
MAX(id)
FROM
table1
WHERE
SUBSTR(ProductCode, 1, 5) = NewProductCode) id,
CustId,
NewProductCode,
NewProductName
FROM
(SELECT DISTINCT
CustId, SUBSTR(ProductCode, 1, 5) NewProductCode,
substr(ProductName, 1, instr(ProductName, ' ')) NewProductName
FROM
table1) x
The output:
8 10 Prod1 Product1
10 10 Prod2 Product2
Is it clear? Ask me to improve the answer, if it's not.
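The same projection can also be written with a plain GROUP BY instead of a correlated subquery. The sketch below tries it with Python's sqlite3 module, assuming an auto-assigned integer id column. The ids come out as 3 and 5 rather than 8 and 10 because the demo table starts empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed layout: id is auto-assigned, starting at 1 in this demo.
    CREATE TABLE table1 (
        id INTEGER PRIMARY KEY,
        CustId INTEGER,
        ProductCode TEXT,
        ProductName TEXT
    );
    INSERT INTO table1 (CustId, ProductCode, ProductName) VALUES
        (10, 'Prod1Part1', 'Product1 Part1'),
        (10, 'Prod1Part2', 'Product1 Part2'),
        (10, 'Prod1Part3', 'Product1 Part3'),
        (10, 'Prod2Part1', 'Product2 Part1'),
        (10, 'Prod2Part2', 'Product2 Part2');
    """
)

# One output row per customer and product prefix, keeping the largest id.
rows = conn.execute(
    """
    SELECT MAX(id) AS max_id,
           CustId,
           SUBSTR(ProductCode, 1, 5) AS NewProductCode,
           SUBSTR(ProductName, 1, INSTR(ProductName, ' ') - 1) AS NewProductName
    FROM table1
    GROUP BY CustId,
             SUBSTR(ProductCode, 1, 5),
             SUBSTR(ProductName, 1, INSTR(ProductName, ' ') - 1)
    ORDER BY max_id
    """
).fetchall()
print(rows)
```

Subtracting 1 from the INSTR position drops the trailing space from the product name; the fixed prefix length of 5 is specific to this test data.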

Calculating Grades in SQL

Very simply, I need to find student grades using SQL.
If, for example, I have following table that define grades
Marks (int)
Grade (Char)
and the data like this:
Marks | Grade
__90 | A+
__80 | A
__70 | A-
__60 | B
__50 | C
__40 | D
Okay, having said that, if I have a student that gained marks 73, how do I calculate her grade using above gradings in SQL.
Thank you so much...
You want the highest value below or equal to your value; substitute your value for 73:
select top 1 Grade from TableName where Marks <= 73 order by Marks desc
Assuming your GradeCutoff table is created with something like:
CREATE TABLE GradeCutoff
( mark int
, grade char(3)
)
and you want to check @studentMark
SELECT grade
FROM GradeCutoff
WHERE mark =
( SELECT max(mark)
FROM GradeCutoff
WHERE @studentMark >= mark
)
;
Note: you may also have to add a (0, 'E') row in your cutoff table.
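Both lookups can be tried end to end with Python's sqlite3 module. This is a sketch under assumptions: the cutoff rows are taken from the question, the function names are made up, and LIMIT 1 stands in for TOP 1, which SQLite (like MySQL) does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE GradeCutoff (mark INTEGER, grade TEXT);
    INSERT INTO GradeCutoff VALUES
        (90, 'A+'), (80, 'A'), (70, 'A-'), (60, 'B'), (50, 'C'), (40, 'D');
    """
)


def grade_by_order(student_mark):
    """Highest cutoff at or below the mark, via ORDER BY ... LIMIT 1."""
    row = conn.execute(
        "SELECT grade FROM GradeCutoff WHERE mark <= ? "
        "ORDER BY mark DESC LIMIT 1",
        (student_mark,),
    ).fetchone()
    return row[0] if row else None


def grade_by_subquery(student_mark):
    """Same lookup, using a MAX(mark) subquery."""
    row = conn.execute(
        "SELECT grade FROM GradeCutoff WHERE mark = "
        "(SELECT MAX(mark) FROM GradeCutoff WHERE ? >= mark)",
        (student_mark,),
    ).fetchone()
    return row[0] if row else None


print(grade_by_order(73), grade_by_subquery(73))
```

A mark below the lowest cutoff returns no row (None here), which is why adding a (0, 'E') row to the cutoff table can be useful.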
I think you should define a UDF for this which takes the student's marks as a parameter and returns the grade according to the table given.
Then you can get the grade for any student from the student table as:
select studentID, getGrade(studentMarks) from student
Here getGrade(studentMarks) is the UDF and studentMarks is the column in the student table holding the marks (e.g. 73 in your case).
HINT: You need to use a CASE construct in the UDF to get the grade.
SELECT Grade FROM 'table name here' WHERE student_mark <= 79 AND student_mark >= 70 - in order to be more specific I would need to see the actual layout of the tables. But something to that effect would work.
If the marks are actually regular multiples of 10, you could look into SQL's MOD function.
Since I didn't find any way to do it using MySQL, I had to do some PHP programming to achieve the result. The goal is to get the closest value from gradings.
Okay suppose, we have the grades as defined in my question.
$MarksObt = 73 <- Marks obtained by the student:
Step 1: Get grades from mysql ordered by Marks in DESC order (DESC order is important, so that the closest cutoff at or below the student's marks is pushed first):
SELECT marks, grade FROM gradings ORDER BY marks DESC
Step 2: Create an array variable "$MinGrades". Loop through the result of above query. Suppose MySQL results are stored in "$Gradings" array. Now on each iteration, do the following:
Subtract the $Grading['marks'] From $MarksObt
If result is greater than or equal to 0, add the result to "$MinGrades" array
Step 3: When loop ends, the "$MinGrades" array's first element will be the closest value ... DONE
Below is the PHP code that implements the above:
$MinGrades = array();
foreach($Gradings as $Key=>$Grading){
$Subtract = $MarksObt - $Grading['marks'];
if( $Subtract >= 0 )
array_push($MinGrades, array($Key=>$Subtract));
}
$GradeKey = key($MinGrades[0]); // Get key of first element in the array
print $Gradings[$GradeKey]['grade'];
If you have some better approach, please mention here.
Thanks for your contribution...