MySQL Query - getting missing records when using group-by - sql

I have a query :
select score, count(1) as 'NumStudents' from testresults where testid = 'mytestid'
group by score order by score
where testresults table contains the performances of students in a test. A sample result looks like the following, assuming maximum marks of the test is 10.
score NumStudents
0 10
1 20
2 12
3 5
5 34
..
10 23
As you can see, this query does not return any records for scores that no student achieved. For example, nobody scored 4/10 in the test, so there is no record for score = 4 in the query output.
I would like to change the query so that these missing records are returned with 0 as the value of the NumStudents field, so that the final output has max + 1 records, one for each possible score.
Any ideas ?
EDIT:
The database contains several tests, and the maximum marks for a test is part of the test definition. So having a new table that stores all possible scores is not feasible: whenever I create a new test with a new max marks, I would need to make sure that table is updated to contain those scores as well.

SQL is good at working with sets of data values in the database, but not so good at sets of data values that are not in the database.
The best workaround is to keep one small table for the values you need to range over:
CREATE TABLE ScoreValues (score int);
INSERT INTO ScoreValues (score)
VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
Given your comment that you define the max marks of a test in another table, you can join to that table in the following way, as long as ScoreValues is sure to contain values at least as high as the greatest test's max marks:
SELECT v.score, COUNT(tr.score) AS 'NumStudents'
FROM ScoreValues v
JOIN Tests t ON (v.score <= t.maxmarks)
LEFT OUTER JOIN TestResults tr ON (v.score = tr.score AND t.testid = tr.testid)
WHERE t.testid = 'mytestid'
GROUP BY v.score;
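To check that the join above really returns the zero rows, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite is used only so the example is self-contained (the original targets MySQL), and the sample rows and the max marks of 5 are made up for illustration.

```python
import sqlite3


def score_histogram(conn, test_id):
    """Return (score, num_students) for every possible score of one test."""
    sql = """
        SELECT v.score, COUNT(tr.score) AS NumStudents
        FROM ScoreValues v
        JOIN Tests t ON v.score <= t.maxmarks
        LEFT OUTER JOIN TestResults tr
               ON v.score = tr.score AND t.testid = tr.testid
        WHERE t.testid = ?
        GROUP BY v.score
        ORDER BY v.score
    """
    return conn.execute(sql, (test_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ScoreValues (score INTEGER);
    CREATE TABLE Tests (testid TEXT, maxmarks INTEGER);
    CREATE TABLE TestResults (testid TEXT, score INTEGER);
    INSERT INTO ScoreValues VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
    -- Hypothetical test with max marks 5 and four results.
    INSERT INTO Tests VALUES ('mytestid', 5);
    INSERT INTO TestResults VALUES
        ('mytestid', 0), ('mytestid', 3), ('mytestid', 3), ('mytestid', 5);
""")

histogram = score_histogram(conn, "mytestid")
for score, num_students in histogram:
    print(score, num_students)
```

COUNT(tr.score) counts only non-NULL values, which is why scores without any matching result come out as 0 rather than 1.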

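The most obvious way would be to create a table named "Scores" and left outer join your table to it. Count a column from the joined table rather than COUNT(1), so that scores with no matching rows come out as 0 instead of 1.
SELECT s.score, COUNT(ts.score) AS scoreCount
FROM Scores AS s
LEFT OUTER JOIN testScores AS ts
ON s.score = ts.score
GROUP BY s.score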
If you don't want to create the table, you could use conditional aggregation, which returns a single row with one pair of columns per score:
SELECT
1 as score, SUM(CASE WHEN ts.score = 1 THEN 1 ELSE 0 END) AS scoreCount,
2 as score, SUM(CASE WHEN ts.score = 2 THEN 1 ELSE 0 END) AS scoreCount,
3 as score, SUM(CASE WHEN ts.score = 3 THEN 1 ELSE 0 END) AS scoreCount,
4 as score, SUM(CASE WHEN ts.score = 4 THEN 1 ELSE 0 END) AS scoreCount,
...
10 as score, SUM(CASE WHEN ts.score = 10 THEN 1 ELSE 0 END) AS scoreCount
FROM testScores AS ts

Does MySQL support set-returning functions? Recent releases of PostgreSQL have a function, generate_series(start, stop), that produces the value start on the first row, start+1 on the second, and so on up to stop on the last row. The advantage of this is that you can put this function in a subselect in the FROM clause and then join to it, instead of creating and populating a table and joining to that as suggested by le dorfier and Bill Karwin.
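Where no such function is available, a recursive common table expression can play the same role. The following is a minimal sketch, not a tested MySQL solution: it runs through Python's sqlite3 module so that it is self-contained, and MySQL accepts WITH RECURSIVE only from version 8.0 on. The table contents and the stop value of 4 are made up.

```python
import sqlite3

# The recursive CTE "series" yields 0, 1, ..., :stop and replaces a
# permanent lookup table of possible scores.
SERIES_SQL = """
    WITH RECURSIVE series(n) AS (
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM series WHERE n < :stop
    )
    SELECT s.n AS score, COUNT(tr.score) AS NumStudents
    FROM series s
    LEFT JOIN testresults tr
           ON tr.score = s.n AND tr.testid = :testid
    GROUP BY s.n
    ORDER BY s.n
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE testresults (testid TEXT, score INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO testresults VALUES
        ('mytestid', 0), ('mytestid', 2), ('mytestid', 2), ('other', 1);
""")

series_rows = conn.execute(
    SERIES_SQL, {"stop": 4, "testid": "mytestid"}
).fetchall()
print(series_rows)
```

In a real schema the stop value would come from the test definition's max marks instead of being passed in by hand.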

Just as a mental exercise I came up with this to generate a sequence in MySQL. It works as long as the length of the sequence is no greater than the square of the number of tables in all databases on the box. I wouldn't recommend it for production though ;)
SELECT @n:=@n+1 as n from (select @n:=-1) x, Information_Schema.Tables y, Information_Schema.Tables WHERE @n<20; /* sequence from 0 to 20 inclusive */

Related

how can I count some values for data in a table based on same key in another table in Bigquery?

I have one table like the one below. Each id is unique.
id      times_of_going_out
fef666  2
S335gg  1
9a2c50  1
and another table like the one below. In this second table the "id" is not unique: a single id can have several different "category_name" values.
id      category_name          city
S335gg  Games & Game Supplies  tk
9a2c50  Telephone Companies    os
9a2c50  Recreation Centers     ky
fef666  Recreation Centers     ky
I want to find the difference between the destinations (category_name) of people who go out often (times_of_going_out > 5) and people who don't go out often (times_of_going_out <= 5).
** Both tables are a small sample of large tables.
 ・ Where do people who go out twice often go?
 ・ Where do people who go out 6 times often go?
Thank you
The expected result could be something like
less than 5: top ten “category_name” for uid’s with "times_of_going_out" less than 5 times
more than 5: top ten “category_name” for uid’s with "times_of_going_out" more than 5 times
Steps:
1. Combine the data and aggregate the total times_of_going_out per category.
2. Create the categories that you need: less than or equal to 5, and more than 5. If you don't need "equal to 5", you can adjust the code.
3. Rank both categories using dense_rank(). This produces a rank within each category based on the total times_of_going_out.
4. Filter the cases so that only the top 10 values are kept for both categories.
with main as (
select
category_name,
sum(coalesce(times_of_going_out,0)) as total_time_per_category
from table1 as t1
left join table2 as t2
on t1.id = t2.id
group by 1
),
category as (
select
*,
if(total_time_per_category > 5, 'more than 5', 'less than equal to 5') as is_more_than_5_times
from main
),
ranking_ as (
select *,
case when
is_more_than_5_times = 'more than 5' then
dense_rank() over (partition by is_more_than_5_times order by total_time_per_category desc)
else NULL
end AS rank_more_than_5,
case when
is_more_than_5_times = 'less than equal to 5' then
dense_rank() over (partition by is_more_than_5_times order by total_time_per_category)
else NULL
end AS rank_less_than_equal_5
from category
)
select
is_more_than_5_times,
string_agg(category_name,',') as list
from ranking_
where rank_less_than_equal_5 <=10 or rank_more_than_5 <= 10
group by 1
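The query above buckets categories by their summed total. If the bucket is instead meant to describe each user, as the question's wording suggests, a different approach is to assign the bucket per id before counting. The sketch below is that alternative, not the answer's method; it runs on SQLite through Python's sqlite3 module rather than BigQuery, and the user 'zz0001' and its row are hypothetical additions to the question's sample so that both buckets are populated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT PRIMARY KEY, times_of_going_out INTEGER);
    CREATE TABLE table2 (id TEXT, category_name TEXT, city TEXT);
    INSERT INTO table1 VALUES
        ('fef666', 2), ('S335gg', 1), ('9a2c50', 1),
        ('zz0001', 6);  -- hypothetical frequent user
    INSERT INTO table2 VALUES
        ('S335gg', 'Games & Game Supplies', 'tk'),
        ('9a2c50', 'Telephone Companies', 'os'),
        ('9a2c50', 'Recreation Centers', 'ky'),
        ('fef666', 'Recreation Centers', 'ky'),
        ('zz0001', 'Recreation Centers', 'ky');  -- hypothetical row
""")

# Bucket each user first, then count distinct users per bucket and category.
BUCKET_SQL = """
    SELECT CASE WHEN t1.times_of_going_out > 5 THEN 'more than 5'
                ELSE 'less than equal to 5' END AS bucket,
           t2.category_name,
           COUNT(DISTINCT t1.id) AS num_users
    FROM table1 t1
    JOIN table2 t2 ON t1.id = t2.id
    GROUP BY bucket, t2.category_name
    ORDER BY bucket, num_users DESC, t2.category_name
"""

bucket_rows = conn.execute(BUCKET_SQL).fetchall()
for row in bucket_rows:
    print(row)
```

Keeping only the top ten categories per bucket would still need a ranking step such as the dense_rank() filter shown above.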

Creating row with different where

I have this code to get the number of users of all items in the list and the average level.
select itemId,count(c.characterid) as numberOfUse, avg(maxUpgrade) as averageLevel
from items i inner join characters c on i.characterId=c.characterId
where itemid in (22001,22002,22003,22004,22005,22006,22007,22008,22009,22010,22011,22012,22013,22014,22015,22016,22030,22031,22032,22033,22034,22035,22036,22037,22038,22039,22040,22041,22042,22050,22051,22052,22053,22054,22055,22056,22057,22058,22059,22060,22070,22071,22072,22073,22074,22075,22076,22077,22085,22086,22087,22091,22092)
and attached>0
group by itemId
This returns one row per item, with the rune id, the number of users, and the average level to which people upgraded it, across all players of the server.
I would like to add new columns for every 10 levels, so that I have stats per 10-level band and can see which item is used most depending on player level. The item level depends on the player level, so the way I select only a certain level is with WHERE itemid>0 and itemid<10; I do that for every 10 levels, copy the data, and paste it into a Google Sheet.
So I would like a result with columns:
itemid use_1-10 avg_level_1-10 use_11-20 avg_level_11-20 etc...
That way I could copy all the results at once and not have to repeat the same process 15 times.
If I am following this correctly, you can do conditional aggregation. Assuming that a "level" is stored in column level in table characters, you would do:
select i.itemId,
sum(case when c.level between 1 and 10 then 1 else 0 end) as use_1_10,
avg(case when c.level between 1 and 10 then maxUpgrade end) as avg_level_1_10,
sum(case when c.level between 11 and 20 then 1 else 0 end) as use_11_20,
avg(case when c.level between 11 and 20 then maxUpgrade end) as avg_level_11_20,
...
from items i
inner join characters c on i.characterId = c.characterId
where i.itemid in (...) and attached > 0
group by i.itemId
Note: consider prefixing column attached in the where clause with the table it belongs to, in order to avoid ambiguity.
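A small runnable sketch of the same conditional aggregation, using Python's sqlite3 module so that it is self-contained. The schema is an assumption: the question does not say which table holds maxUpgrade or attached, so both are placed in items here, and all sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema; column placement is a guess for illustration only.
    CREATE TABLE characters (characterId INTEGER, level INTEGER);
    CREATE TABLE items (itemId INTEGER, characterId INTEGER,
                        attached INTEGER, maxUpgrade INTEGER);
    INSERT INTO characters VALUES (1, 5), (2, 8), (3, 15);
    INSERT INTO items VALUES
        (22001, 1, 1, 4), (22001, 2, 1, 6), (22001, 3, 1, 10),
        (22002, 3, 1, 2), (22002, 1, 0, 9);  -- last row is not attached
""")

# One pair of columns per level band; rows outside a band contribute 0 to
# the SUM and NULL to the AVG, so they do not distort the average.
BANDS_SQL = """
    SELECT i.itemId,
           SUM(CASE WHEN c.level BETWEEN 1 AND 10 THEN 1 ELSE 0 END) AS use_1_10,
           AVG(CASE WHEN c.level BETWEEN 1 AND 10 THEN i.maxUpgrade END) AS avg_level_1_10,
           SUM(CASE WHEN c.level BETWEEN 11 AND 20 THEN 1 ELSE 0 END) AS use_11_20,
           AVG(CASE WHEN c.level BETWEEN 11 AND 20 THEN i.maxUpgrade END) AS avg_level_11_20
    FROM items i
    INNER JOIN characters c ON i.characterId = c.characterId
    WHERE i.itemId IN (22001, 22002) AND i.attached > 0
    GROUP BY i.itemId
    ORDER BY i.itemId
"""

band_rows = conn.execute(BANDS_SQL).fetchall()
for row in band_rows:
    print(row)
```

A band with no matching characters yields a use count of 0 and a NULL average, which arrives in Python as None.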

HIVE/Impala query: Count the number of rows between rows fulfilling certain conditions

I need to count the number of rows that fulfill certain conditions contained in intervals defined by other rows that fulfill other conditions. Examples: the number of rows N between 'Reference' having values 1 and 4 that fulfill the condition 'Other_condition' = b is N=1, the number of rows N between 'Reference' having values 2 and 5 that fulfill the condition 'Other_condition' = b is N=2 etc.
Date Reference Other_condition
20171111 1 a
20171112 2 a
20171113 3 b
20171114 4 b
20171115 5 b
I'm accessing the database through Hive/Impala SQL queries and unfortunately I have no idea where to start implementing such a window function. A half-pseudocode version of what I want would be something like:
SELECT COUNT (DISTINCT database.Date) AS counter, Other_condition, reference
FROM database
WHERE database.Other_condition = a AND database.Reference BETWEEN
(window function condition 1: database.Reference = 2) AND
(window function condition 2: database.Reference = 5)
GROUP BY counter
Your question is rather hard to follow. I understand the first part, which is the number of rows between "1" and "4".
Here is one method that should be pretty easy to generalize:
select (max(case when reference = 4 then seqnum end) -
max(case when reference = 1 then seqnum end)
) as num_rows_1_4
from (select t.*,
row_number() over (order by date) as seqnum
from t
) t;
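The query above gives the distance between the two reference rows. To also apply the 'Other_condition' filter from the question, one option is to look up the two boundary dates with conditional aggregation and count the matching rows strictly between them. This is a sketch of that variation, not the answer's method; it runs on SQLite through Python's sqlite3 module, and whether Hive or Impala accept the same join form has not been checked here. The helper name count_between is made up.

```python
import sqlite3


def count_between(conn, ref_lo, ref_hi, condition):
    """Count rows with the given condition strictly between two reference rows."""
    sql = """
        SELECT COUNT(*)
        FROM t
        CROSS JOIN (
            SELECT MAX(CASE WHEN reference = :lo THEN date END) AS lo_date,
                   MAX(CASE WHEN reference = :hi THEN date END) AS hi_date
            FROM t
        ) AS bounds
        WHERE t.other_condition = :cond
          AND t.date > bounds.lo_date
          AND t.date < bounds.hi_date
    """
    params = {"lo": ref_lo, "hi": ref_hi, "cond": condition}
    return conn.execute(sql, params).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (date TEXT, reference INTEGER, other_condition TEXT);
    -- Sample rows copied from the question.
    INSERT INTO t VALUES
        ('20171111', 1, 'a'), ('20171112', 2, 'a'), ('20171113', 3, 'b'),
        ('20171114', 4, 'b'), ('20171115', 5, 'b');
""")

n_1_4 = count_between(conn, 1, 4, "b")
n_2_5 = count_between(conn, 2, 5, "b")
print(n_1_4, n_2_5)
```

The boundary rows themselves are excluded, which matches the N=1 and N=2 values given in the question.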

Calculate percentages of columns in Oracle SQL

I have three columns, all consisting of 1's and 0's. For each of these columns, how can I calculate the percentage of people (one person is one row/ id) who have a 1 in the first column and a 1 in the second or third column in oracle SQL?
For instance:
id marketing_campaign personal_campaign sales
1 1 0 0
2 1 1 0
1 0 1 1
4 0 0 1
So in this case, of all the people who were subjected to a marketing_campaign, 50 percent were subjected to a personal campaign as well, but zero percent is present in sales (no one bought anything).
Ultimately, I want to find out the order in which people get to the sales moment. Do they first go from marketing campaign to a personal campaign and then to sales, or do they buy anyway regardless of these channels.
This is a fictional example, so I realize that in this example there are many other ways to do this, but I hope anyone can help!
The outcome that I'm looking for is something like this:
percentage marketing_campaign/ personal campaign = 50 %
percentage marketing_campaign/sales = 0%
etc (for all the three column combinations)
Use count, sum and case expressions, together with basic arithmetic operators +,/,*
COUNT(*) gives a total count of people in the table
SUM(column) gives a sum of 1 in given column
case expressions make it possible to implement more complex conditions
The common pattern is X / COUNT(*) * 100, which is used to calculate a given value as a percentage of the total ( val / total * 100% )
An example:
SELECT
-- percentage of people that have 1 in marketing_campaign column
SUM( marketing_campaign ) / COUNT(*) * 100 As marketing_campaign_percent,
-- percentage of people that have 1 in sales column
SUM( sales ) / COUNT(*) * 100 As sales_percent,
-- complex condition:
-- percentage of people (one person is one row/ id) who have a 1
-- in the first column and a 1 in the second or third column
COUNT(
CASE WHEN marketing_campaign = 1
AND ( personal_campaign = 1 OR sales = 1 )
THEN 1 END
) / COUNT(*) * 100 As complex_condition_percent
FROM table;
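The same pattern can be pointed at the figures the question asks for by dividing by the number of people in the marketing campaign instead of COUNT(*). A minimal sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than Oracle, the table name the_table is a placeholder, and each row of the question's sample is treated as a separate person. SQLite truncates integer division, hence the 100.0 factor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE the_table (id INTEGER, marketing_campaign INTEGER,
                            personal_campaign INTEGER, sales INTEGER);
    -- Sample rows copied from the question.
    INSERT INTO the_table VALUES
        (1, 1, 0, 0), (2, 1, 1, 0), (1, 0, 1, 1), (4, 0, 0, 1);
""")

# The denominator is the number of rows with marketing_campaign = 1.
PCT_SQL = """
    SELECT
        100.0 * SUM(CASE WHEN marketing_campaign = 1 AND personal_campaign = 1
                         THEN 1 ELSE 0 END) / SUM(marketing_campaign)
            AS pct_marketing_personal,
        100.0 * SUM(CASE WHEN marketing_campaign = 1 AND sales = 1
                         THEN 1 ELSE 0 END) / SUM(marketing_campaign)
            AS pct_marketing_sales
    FROM the_table
"""

pct_personal, pct_sales = conn.execute(PCT_SQL).fetchone()
print(pct_personal, pct_sales)
```

If no row has marketing_campaign = 1 the denominator is zero; SQLite then returns NULL, while other databases may raise an error, so a guard may be needed there.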
You can get your percentages like this :
SELECT COUNT(*),
ROUND(100*(SUM(personal_campaign) / sum(count(*)) over ()),2) perc_personal_campaign,
ROUND(100*(SUM(sales) / sum(count(*)) over ()),2) perc_sales
FROM (
SELECT ID,
CASE
WHEN SUM(personal_campaign) > 0 THEN 1
ELSE 0
end AS personal_campaign,
CASE
WHEN SUM(sales) > 0 THEN 1
ELSE 0
end AS sales
FROM the_table
WHERE ID IN
(SELECT ID FROM the_table WHERE marketing_campaign = 1)
GROUP BY ID
)
I have overcomplicated things a bit because your data is still unclear to me. The subquery ensures that all duplicates are cleaned up and that each person has only a 1 or a 0 in personal_campaign and sales.
About your second question :
Ultimately, I want to find out the order in which people get to the
sales moment. Do they first go from marketing campaign to a personal
campaign and then to sales, or do they buy anyway regardless of these
channels.
This is impossible to do in this state because your table has neither:
a unique row identifier that would preserve the order in which the rows were inserted, nor
a timestamp column that would tell when the rows were inserted.
Without one of these, the order of rows returned from your table is unpredictable or, if you prefer, purely random.

DB2 SQL filter query result by evaluating an ID which has two types of entries

After many attempts I have failed at this and am hoping someone can help. The query returns every entry a user makes when items are made in the factory against an order number. For example
Order Number Entry type Quantity
3000 1 1000
3000 1 500
3000 2 300
3000 2 100
4000 2 1000
5000 1 1000
What I want the query to do is filter the results like this:
If the order number has entries of both type 1 and type 2, return only the rows of type  1;
otherwise just return the rows for that order number, whatever their type.
So the above would end up:
Order Number Entry type Quantity
3000 1 1000
3000 1 500
4000 2 1000
5000 1 1000
Currently my query (DB2, in very basic terms) looks like this, and it was correct until a change request came through!
Select * from bookings where type=1 or type=2
thanks!
select * from bookings
left outer join (
select order_number,
max(case when type=1 then 1 else 0 end) +
max(case when type=2 then 1 else 0 end) as type_1_and_2
from bookings
group by order_number
) has_1_and_2 on
type_1_and_2 = 2 and
has_1_and_2.order_number = bookings.order_number
where
bookings.type = 1 or
has_1_and_2.order_number is null
Find all the orders that have both type 1 and type 2, and then join to that result.
If the row matched the join, only return it if it is type 1.
If the row did not match the join (has_1_and_2.order_number is null), return it no matter what the type is.
A "common table expression" [CTE] can often simplify your logic. You can think of it as a way to break a complex problem into conceptual steps. In the example below, you can think of g as the name of the result set of the CTE, which will then be joined to the bookings table.
WITH g as
( SELECT order_number, min(type) as low_type
FROM bookings
GROUP BY order_number
)
SELECT b.*
FROM g
JOIN bookings b ON g.order_number = b.order_number
AND g.low_type = b.type
The JOIN ON conditions will work so that if both types are present then low_type will be 1, and only that type of record will be chosen. If there is only one type it will be identical to low_type.
This should work fine as long as 1 and 2 are the only types allowed in the bookings table. If not then you can simply add a WHERE clause in the CTE and in the outer SELECT.
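To check the CTE against the sample data from the question, here is a minimal sketch that runs it through Python's sqlite3 module. SQLite stands in for DB2 only to keep the example self-contained, the column names order_number, type and quantity are assumed from the answer's queries, and the ORDER BY is added just to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (order_number INTEGER, type INTEGER, quantity INTEGER);
    -- Sample rows copied from the question.
    INSERT INTO bookings VALUES
        (3000, 1, 1000), (3000, 1, 500), (3000, 2, 300), (3000, 2, 100),
        (4000, 2, 1000), (5000, 1, 1000);
""")

# g holds the lowest type per order; joining back on it keeps only type 1
# rows when both types exist, and the single existing type otherwise.
LOW_TYPE_SQL = """
    WITH g AS (
        SELECT order_number, MIN(type) AS low_type
        FROM bookings
        GROUP BY order_number
    )
    SELECT b.order_number, b.type, b.quantity
    FROM g
    JOIN bookings b ON g.order_number = b.order_number
                   AND g.low_type = b.type
    ORDER BY b.order_number, b.quantity DESC
"""

filtered = conn.execute(LOW_TYPE_SQL).fetchall()
for row in filtered:
    print(row)
```

The two type 2 rows of order 3000 are dropped, while order 4000 keeps its only row even though it is type 2.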