SQL group by and count fixed column values

I'm facing a problem in a data importation script in SQL (MySQL) where I need to GROUP rows by type to COUNT how many rows there are of each type. So far, this isn't really a problem, because I know that I can do:
SELECT
    data.type,
    COUNT(data.type)
FROM data
GROUP BY data.type;
So, by doing that, I get this result:
| type | COUNT(data.type) |
|------|------------------|
| 0    | 1                |
| 1    | 46               |
| 2    | 35               |
| 3    | 423              |
| 4    | 64               |
| 5    | 36               |
| 9    | 1                |
I know that in the type column the values will always be in the range from 0 to 9, like the above result. So, I would like to list not only the existing values in the table content but the missing type values too, with their COUNT value set to 0.
Based on the above query result, the expected result would be:
| type | COUNT(data.type) |
|------|------------------|
| 0    | 1                |
| 1    | 46               |
| 2    | 35               |
| 3    | 423              |
| 4    | 64               |
| 5    | 36               |
| 6    | 0                |
| 7    | 0                |
| 8    | 0                |
| 9    | 1                |
As a trick, I could INSERT one row of each type before doing the GROUP/COUNT-1 on the table content, flagging some other column on INSERT to be able to DELETE these rows afterwards (a sketch of this workaround follows the list below). So, the steps of my importation script would change to:
TRUNCATE table; (I can't safely import new content if there is old data in the table)
INSERT "control" rows;
LOAD DATA INFILE INTO TABLE;
GROUP/COUNT-1 the table content;
DELETE "control" rows; (So I can still work with the table content)
Do any other jobs;
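A minimal sketch of that workaround, assuming a helper flag column named is_control that exists only for this purpose:
TRUNCATE TABLE data;
-- is_control is an assumed helper column, added just to mark the control rows
INSERT INTO data (type, is_control)
VALUES (0,1),(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1);
-- LOAD DATA INFILE ... INTO TABLE data ...;
SELECT data.type,
       COUNT(data.type) - 1  -- the -1 discounts the control row of each type
FROM data
GROUP BY data.type;
DELETE FROM data WHERE is_control = 1;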
But, I was looking for a cleaner way to reach the expected result. If possible, a single query, without a bunch of JOINs.
I would appreciate any suggestion or advice. Thank you very much!
EDIT
I would like to thank you for the answers about CREATEing a table to store all the types and JOINing against it. That really solves the problem. My approach solves it too, but it does so by storing the types, as you did.
So, I have "another" question, just a clarification, based on the received answers and my desired scope... is it possible to reach the expected result with some MySQL command that will not CREATE a new table and/or INSERT these types?
I don't actually see any problem in solving my question by storing the types... I just would like to find a simplified command... something like a 'best practice'... some kind of filter... as if I could run:
GROUP BY data.type(0,1,2,3,4,5,6,7,8,9)
and it could return these filtered values.
I am really interested in learning such a command, if it really exists/is possible.
And again, thank you very much!
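For the record: on MySQL 8.0+, a recursive CTE can generate the 0-9 range inline, so nothing needs to be CREATEd or INSERTed. A sketch:
WITH RECURSIVE types (type) AS (
    SELECT 0
    UNION ALL
    SELECT type + 1 FROM types WHERE type < 9
)
SELECT t.type,
       COUNT(d.type)
FROM types t
LEFT JOIN data d ON d.type = t.type
GROUP BY t.type
ORDER BY t.type;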

Let's assume that you have a types table with all the valid types:
SELECT t.type,
       COUNT(data.type)
FROM types t LEFT JOIN data
     ON data.type = t.type
GROUP BY t.type
ORDER BY t.type
You should include the explicit order by and not depend on the group by to produce results in a particular order.

The easiest way is to create a table of all type values and then join on that table when getting the count:
select t.type,
       count(d.type)
from types t
left join data d
    on t.type = d.type
group by t.type
See SQL Fiddle with demo
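If you don't have the types table yet, it could be as simple as this (a sketch; the single-column schema is an assumption):
create table types (type int primary key);
insert into types (type) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);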
Or you can use the following:
select t.type,
       count(d.type)
from
(
    select 0 as type union all
    select 1 union all
    select 2 union all
    select 3 union all
    select 4 union all
    select 5 union all
    select 6 union all
    select 7 union all
    select 8 union all
    select 9
) t
left join data d
    on t.type = d.type
group by t.type
See SQL Fiddle with Demo

One option would be to have a static numbers table with the values 0-9. I'm not sure this is the most elegant approach; if you were using SQL Server, I could think of another one.
Try something like this:
SELECT numbers.number,
       COUNT(data.type)
FROM numbers
LEFT JOIN data
    ON numbers.number = data.type
GROUP BY numbers.number;
And the SQL Fiddle.

Okay... I think I found it! Thank you all!!! I'm accepting my own answer.
I agree with @GordonLinoff's comment that the best practice is to store the type values and describe them, so you can keep a concise/understandable database and queries.
But, as far as I've learned, if some data might be irrelevant information, it is preferable to handle it some other way than storing it.
So, I developed this query:
SELECT
SUM(IF(data.type = 0, 1, 0)) AS `0`,
SUM(IF(data.type = 1, 1, 0)) AS `1`,
SUM(IF(data.type = 2, 1, 0)) AS `2`,
SUM(IF(data.type = 3, 1, 0)) AS `3`,
SUM(IF(data.type = 4, 1, 0)) AS `4`,
SUM(IF(data.type = 5, 1, 0)) AS `5`,
SUM(IF(data.type = 6, 1, 0)) AS `6`,
SUM(IF(data.type = 7, 1, 0)) AS `7`,
SUM(IF(data.type = 8, 1, 0)) AS `8`,
SUM(IF(data.type = 9, 1, 0)) AS `9`
FROM data;
Not the fastest, most optimized, or prettiest query, but for the data sizes I'll manage (fewer than 100,000 rows per importation) it "manually" does the GROUP/COUNT job, running in 0.13 sec on an ordinary developer machine.
It differs from my expected result just in the way rows and columns are arranged - instead of 10 rows with 2 columns, I get 1 row with 10 columns, each labeled with the matching type. Also, as we have a standard for the type values (and we'll not change it for sure) which gives each a name and description, I'm now able to use the type name as the column label, instead of joining to a table with the type info to select a third column in the result (which really is not that important, as this is an importation script based on some standards).
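For illustration, the labels could carry the standardized type names instead of the numbers (the names below are made up; the real ones come from our standard):
SELECT
    SUM(IF(data.type = 0, 1, 0)) AS `unknown`,  -- hypothetical name for type 0
    SUM(IF(data.type = 1, 1, 0)) AS `draft`,    -- hypothetical name for type 1
    SUM(IF(data.type = 2, 1, 0)) AS `approved`  -- hypothetical name for type 2
FROM data;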
Thank you all so much for the help!

Related

Is it possible to map values onto a table given corresponding row and column indices in SQL?

I have a SQL table in the form of:
| value | row_loc | column_loc |
|-------|---------|------------|
| a     | 0       | 1          |
| b     | 1       | 1          |
| c     | 1       | 0          |
| d     | 0       | 0          |
I would like to find a way to map it onto a table/grid, given the indices, using SQL. Something like:
| d | a |
| c | b |
(The context being, I would like to create a colour map with colours corresponding to values a, b, c, d, in the locations specified)
I would be able to do this iteratively in python, but cannot figure out how to do it in SQL, or if it is even possible! Any help or guidance on this problem would be greatly appreciated!
EDIT: a, b, c, d are examples of numeric values (which would not be able to be selected using named variables in practice, so I'm relying on selecting them based on location). Also worth noting: the number of rows and columns will always be the same. The value column is not the primary key of this table, so it is not necessarily unique; it is just a continuous value.
Yes, it is possible, assuming the number of columns is limited, since SQL supports only a fixed number of columns. The number of rows in the result set depends on the number of distinct row_loc values, so we have to group by row_loc. Then choose each value using a simple case.
with t (value, row_loc, column_loc) as (
    select 'a', 0, 1 from dual union all
    select 'b', 1, 1 from dual union all
    select 'c', 1, 0 from dual union all
    select 'd', 0, 0 from dual
)
select max(case column_loc when 0 then value else null end) as column0,
       max(case column_loc when 1 then value else null end) as column1
from t
group by row_loc
order by row_loc
I tested it on Oracle. I'm not sure what to do if multiple values match on the same coordinate; I chose max. For other vendors you could also utilize special clauses such as count ... filter (where ...), or Oracle's pivot clause could be used.
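In PostgreSQL, for instance, the same pivot could be written with the filter clause (a sketch over the same sample data):
with t (value, row_loc, column_loc) as (
    values ('a', 0, 1), ('b', 1, 1), ('c', 1, 0), ('d', 0, 0)
)
select row_loc,
       max(value) filter (where column_loc = 0) as column0,
       max(value) filter (where column_loc = 1) as column1
from t
group by row_loc
order by row_loc;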

SQL - Representing SUM's after using CASE to transform STR -> INT

I am sorry for what may be a long post in advance.
Background:
I am using Rational Team Concert (RTC), which stores work item data, in conjunction with Jazz Reporting Service to create reports. The Report Builder tool lets you write your own queries to pull data as a table, and has its own interface to represent the table as a graph.
There are not many options for graphing; the chart type defaults to a count, unless you specify that it should show a sum. In order to graph by sum, the data must be a number rather than a string. By default, the Report Builder assumes all variables in the SELECT statement are strings.
The data I will be using are a bunch of work items. Each work item is associated with a team (A, B) and has a work estimation number (count1, count2).
| Item # | Team | Work   |
|--------|------|--------|
| 123    | A    | count1 |
| 124    | A    | count2 |
| 125    | B    | count2 |
....
Problem:
Since the work estimation is entered as a Tag, the first step was to use a CASE WHEN block in the SELECT to transform count1 -> 1 and count2 -> 2 (the string tag into an actual number which can be summed). This resulted in a table with the numbers 1 and 2 in place of the typed tag (good so far).
| Item # | Team | Work |
|--------|------|------|
| 123    | A    | 1    |
| 124    | A    | 2    |
| 125    | B    | 2    |
....
The problem is that I am trying to graph by sum, which means getting the tool to identify the variables in the SELECT statement as numbers, yet for some reason any variable I declare in a SELECT statement is always viewed as a string. (The tool shows a table of the current columns, i.e. the variables in the SELECT, along with the variable type it identifies for each.)
Attempted Solutions:
The first query I did was to return a table of each work item with its team name and work estimate
SELECT T1.NAME,
(CASE WHEN T1.TAGs='count1' THEN 1 ELSE 2 END) AS WORK
FROM RIDW.VW_REQUEST T1
WHERE T1.PROJECT_ID = 73
Which resulted in
| Team | Work |
|------|------|
| A    | 1    |
| A    | 2    |
| B    | 2    |
....
but the tool still sees the numbers as strings. I then tried explicitly casting the CASE to an integer, but this resulted in the same issue:
...
CAST(CASE WHEN T1.TAGs='count1' THEN 1 ELSE 2 END AS Integer) AS WORK
...
Which again the tool still represents as a string.
Current Goal:
As I cannot confirm whether the tool has an underlying problem, compatibility issues with queries, etc., what I believe will work now is to return a table with 2 rows: the sum of the work for each team.
| Team   | Sum of 1's and 2's |
|--------|--------------------|
| Team A | SUM(1) + SUM(2)    |
| Team B | SUM(1) + SUM(2)    |
What I am having trouble with is using subqueries to SUM the data. When I try SUM(CASE WHEN ... END) AS TIME2, I get the error "Column modifiers AVG and SUM apply only to number attributes". This has me thinking that I need a subquery which returns the column after the CASE, and then SUM that, but I am sailing into uncharted waters and can't seem to get the syntax to work.
I understand that a post like this would be better off on the product help forum. I have tried asking around but cannot get any help. The solution I am proposing, returning the 2 row/column table, should bypass any issues the software may have, but I need help sub-querying the SUM when using a CASE.
I appreciate your time and help!
EDIT 1:
Below is the full query code, which performs the CASE correctly but still causes the tool to misinterpret the type:
SELECT
    T1.Name,
    CAST(CASE WHEN T1.TAGS='|release_points_1|' THEN 1 ELSE (CASE WHEN T1.TAGS='|release_points_2|' THEN 2 ELSE 0 END) END AS Integer) AS TAG
FROM RIDW.VW_REQUEST T1
WHERE T1.PROJECT_ID = 73
    AND (T1.ISSOFTDELETED = 0)
    AND (T1.REQUEST_ID <> -1 AND T1.REQUEST_ID IS NOT NULL)
This small adjustment to your current query should work:
SELECT
    T1.Name,
    SUM(CAST(CASE WHEN T1.TAGS='|release_points_1|' THEN 1 ELSE (CASE WHEN T1.TAGS='|release_points_2|' THEN 2 ELSE 0 END) END AS Integer)) AS TAG
FROM RIDW.VW_REQUEST T1
WHERE T1.PROJECT_ID = 73
    AND (T1.ISSOFTDELETED = 0)
    AND (T1.REQUEST_ID <> -1 AND T1.REQUEST_ID IS NOT NULL)
GROUP BY T1.Name

Transforming a 2 column SQL table into 3 columns, column 3 lagged on 2

Here's my problem: I want to write a query (that goes into a larger query) that takes a table like this:
| ID | DATE |
|----|------|
| A  | 1    |
| A  | 2    |
| A  | 3    |
| B  | 1    |
| B  | 2    |
and so on, and transforms it into:
| ID | DATE1 | DATE2 |
|----|-------|-------|
| A  | 1     | 2     |
| A  | 2     | 3     |
| A  | 3     | NOW   |
| B  | 1     | 2     |
| B  | 2     | NOW   |
Where the numbers are dates, and NOW() is always appended to the most recent date. Given free rein I would do this in Python, but unfortunately this goes into a larger query. We're using SyBase's SQL Anywhere 12, I think? I interact with the database using SQuirreL SQL.
I'm very stumped. I thought (SQL query to transform a list of numbers into 2 columns) would help, but I'm afraid I don't know enough to make it work. I was thinking of JOINing the table to itself, but I don't know how to SELECT for only the A-1-2 rows instead of the A-1-3 rows as well, for instance, or how to insert the NOW() value into it. Does anyone have any ideas?
I made an sqlfiddle.com example to outline a solution for your case. You mentioned dates but are using integers, so I chose an integer example; it can be modified. I wrote it in PostgreSQL, so the coalesce() function can be substituted with nvl() or similar. Also, the parameter '0' can be substituted with any value, including now(), but then you must change the data type of the "i" column to a date as well. Please let me know if you need further help on this.
select a.id, a.i, coalesce(min(b.i), '0')
from test a
left join test b on b.id = a.id and a.i < b.i
group by a.id, a.i
order by a.id, a.i
http://sqlfiddle.com/#!15/f1fba/6
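If the i column really were a date, the same pattern with now() would look like this (a sketch in PostgreSQL syntax, assuming i has a timestamp type):
select a.id,
       a.i as date1,
       coalesce(min(b.i), now()) as date2  -- now() fills the most recent row
from test a
left join test b on b.id = a.id and a.i < b.i
group by a.id, a.i
order by a.id, a.i;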

Access 2007 select first value of query results

I am running into a rather annoying thingy in Access (2007) and I am not sure if this is a feature or if I am asking for the impossible.
Although the actual database structure is more complex, my problem boils down to this:
I have a table with data about Units for specific years. This data comes from different sources and might overlap.
| Unit | IYR  | X1  | Source |
|------|------|-----|--------|
| A    | 2009 | 55  | 1      |
| A    | 2010 | 80  | 1      |
| A    | 2010 | 101 | 2      |
| A    | 2010 | 150 | 3      |
| A    | 2011 | 90  | 1      |
...
Now I would like the user to select certain sources, order them by priority and then extract one data value for each year.
For example, if the user selects source 1, 2 and 3 and orders them by (3, 1, 2), then I would like the following result:
| Unit | IYR  | X1  | Source |
|------|------|-----|--------|
| A    | 2009 | 55  | 1      |
| A    | 2010 | 150 | 3      |
| A    | 2011 | 90  | 1      |
I am able to order the initial table, based on a specific order. I do this with the following query
SELECT Unit, IYR, X1, Source
FROM TestTable
WHERE Source In (1,2,3)
ORDER BY Unit, IYR,
IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))
This gives me the following intermediate result:
| Unit | IYR  | X1  | Source |
|------|------|-----|--------|
| A    | 2009 | 55  | 1      |
| A    | 2010 | 150 | 3      |
| A    | 2010 | 80  | 1      |
| A    | 2010 | 101 | 2      |
| A    | 2011 | 90  | 1      |
Next step is to only get the first value of each year. I was thinking of using the following query:
SELECT X.Unit, X.IYR, first(X.X1) as FirstX1
FROM (...) AS X
GROUP BY X.Unit, X.IYR
Where (…) is the above query.
Now Access goes bananas. Whatever order I give to the intermediate results, the result of this query is:
| Unit | IYR  | X1  |
|------|------|-----|
| A    | 2009 | 55  |
| A    | 2010 | 80  |
| A    | 2011 | 90  |
In other words, for year 2010 it shows the value of source 1 instead of 3. It seems that Access does not care about the ordering of the nested query when it applies the FIRST() function and sticks to the original ordering of the data.
Is this a feature of Access or is there a different way of achieving the desired results?
Ps: Next step would be to use a self join to add the source column to the results again, but I first need to resolve above problem.
Rather than use FIRST(), it may be better to determine the MIN priority and then join back, e.g.:
SELECT t.UNIT,
       t.IYR,
       t.X1,
       t.Source,
       t.PrioritySource
FROM
    (SELECT Unit,
            IYR,
            X1,
            Source,
            SWITCH([Source]=3, 1,
                   [Source]=1, 2,
                   [Source]=2, 3) AS PrioritySource
     FROM TestTable
     WHERE Source In (1,2,3)) AS t
INNER JOIN
    (SELECT Unit,
            IYR,
            MIN(SWITCH([Source]=3, 1,
                       [Source]=1, 2,
                       [Source]=2, 3)) AS PrioritySource
     FROM TestTable
     WHERE Source In (1,2,3)
     GROUP BY Unit, IYR) AS MinPriority
    ON t.Unit = MinPriority.Unit
   AND t.IYR = MinPriority.IYR
   AND t.PrioritySource = MinPriority.PrioritySource
which will produce this result (note: I include Source and PrioritySource for demonstration purposes only):
| UNIT | IYR  | X1  | Source | PrioritySource |
|------|------|-----|--------|----------------|
| A    | 2009 | 55  | 1      | 2              |
| A    | 2010 | 150 | 3      | 1              |
| A    | 2011 | 90  | 1      | 2              |
Note the first subquery is there to handle the fact that Access won't let you join on a SWITCH().
Yes, FIRST() does use an arbitrary ordering. From the Access Help:
These functions return the value of a specified field in the first or last record, respectively, of the result set returned by a query. If the query does not include an ORDER BY clause, the values returned by these functions will be arbitrary because records are usually returned in no particular order.
I don't know whether FROM (...) AS X means you are using an ORDER BY inline (assuming that is actually possible) or a VIEW ('stored query object') here, but either way I assume the ORDER BY is being disregarded (because an ORDER BY should only apply to the final result).
The alternative is to use MIN() (or possibly MAX()).
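A minimal sketch of the MIN() direction, reusing the IIf() priority mapping from the question:
SELECT Unit, IYR,
       MIN(IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))) AS BestPriority
FROM TestTable
WHERE Source In (1,2,3)
GROUP BY Unit, IYR;
Joining this result back to TestTable on Unit, IYR, and the mapped priority recovers the X1 value, as the other answer demonstrates.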
This is the most concise way I have found to write such queries in Access that require pulling back all columns that correspond to the first row in a group of records that are ordered in a particular way.
First, I added a UniqueID to your table. In this case, it's just an AutoNumber field. You may already have a unique value in your table, in which case you can use that.
This will choose the row with a Source 3 first, then Source 1, then Source 2. If there is a tie, it picks the one with the higher X1 value. If there is a further tie, it is broken by the UniqueID value:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=
(SELECT TOP 1 [UniqueID] FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, UniqueID)
This yields:
| Unit | IYR  | X1  | Source | UniqueID |
|------|------|-----|--------|----------|
| A    | 2009 | 55  | 1      | 1        |
| A    | 2010 | 150 | 3      | 4        |
| A    | 2011 | 90  | 1      | 5        |
I recommend (1) you create an index on the IYR field -- this will dramatically increase your performance for this type of query, and (2) if you have a lot (>~100K) records, this isn't the best choice. I find it works quite well for tables in the 1-70K range. For larger datasets, I like to use my GroupIncrement function to partition each group (similar to SQL Server's ROW_NUMBER() OVER statement).
The Choose() function is a VBA function and may not be clear here. In your case, it sounds like there is some interactivity required. For that, you could create a second table called "Choices", like so:
| Rank | Choice |
|------|--------|
| 1    | 3      |
| 2    | 1      |
| 3    | 2      |
Then, you could substitute the following:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=(SELECT TOP 1 [UniqueID] FROM
[TestTable] t2 INNER JOIN [Choices] c
ON t2.Source=c.Choice
WHERE t.IYR=t2.IYR ORDER BY c.[Rank], t2.X1 DESC, t2.UniqueID);
Indexing Source on TestTable and Choice on the Choices table may be helpful here, too, depending on the number of choices required.
Q:
Can you get this to work without the need for a surrogate key? For example, what if the unique key is the composite of {Unit, IYR, X1, Source}?
A:
If you have a compound key, you can do it like this-- however I think that if you have a large dataset, it will totally kill the performance of the query. It may help to index all four columns, but I can't say for sure because I don't regularly use this method.
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.Unit & t.IYR & t.X1 & t.Source =
(SELECT TOP 1 Unit & IYR & X1 & Source FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, Unit, IYR)
In certain cases, you may have to coalesce some of the individual parts of the key as follows (though Access generally will coalesce values automatically):
t.Unit & CStr(t.IYR) & CStr(t.X1) & CStr(t.Source)
You could also use a query in your FROM statements instead of the actual table. The query itself would build a composite of the four fields used in the key, and then you'd use the new key name in the WHERE clause of the top SELECT statement, and in the SELECT TOP 1 [key] of the subquery.
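A sketch of that idea, building the composite key in a subquery so both the outer WHERE and the SELECT TOP 1 can reference a single field (CompKey is a made-up name):
SELECT t.* INTO [Chosen Rows]
FROM (SELECT TestTable.*, Unit & CStr(IYR) & CStr(X1) & CStr(Source) AS CompKey
      FROM TestTable) AS t
WHERE t.CompKey =
    (SELECT TOP 1 CompKey
     FROM (SELECT TestTable.*, Unit & CStr(IYR) & CStr(X1) & CStr(Source) AS CompKey
           FROM TestTable) AS t2
     WHERE t.IYR = t2.IYR
     ORDER BY Choose(t2.[Source],2,3,1), t2.X1 DESC, t2.CompKey);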
In general, though, I will either: (a) create a new table with an AutoNumber field, (b) add an AutoNumber field, (c) add an integer and populate it with a unique number using VBA - this is useful when you get a MaxLocks error when trying to add an AutoNumber, or (d) use an already indexed unique key.

How can I efficiently transfer data from a vertical database layout to a horizontal one

I want to transfer data from a vertical db layout like this:
| ID | Type | Value |
|----|------|-------|
| 1  | 10   | 111   |
| 1  | 14   | 222   |
| 2  | 10   | 333   |
| 2  | 25   | 444   |
to a horizontal one:
| ID | Type10 | Type14 | Type25 |
|----|--------|--------|--------|
| 1  | 111    | 222    |        |
| 2  | 333    |        | 444    |
Creating the layout is not a problem, but the database is rather large, with millions of entries, and queries get canceled if they take too much time.
How can this be done efficiently (so that the query is not canceled)?
with t as
(
    select 1 as ID, 10 as type, 111 as Value from dual
    union
    select 1, 14, 222 from dual
    union
    select 2, 10, 333 from dual
    union
    select 2, 25, 444 from dual
)
select ID,
       max(case when type = 10 then Value else null end) as Type10,
       max(case when type = 14 then Value else null end) as Type14,
       max(case when type = 25 then Value else null end) as Type25
from t
group by id
This returns what you want, and I think it is the better way.
Note that the max function is just there to satisfy the group by clause; any aggregate function can be used here (like sum, min, ...).
Break it up into smaller chunks and don't wrap the whole thing in a single transaction. First create the new table, then do groups of inserts from the old table into it. Insert by range of ID, for example, in chunks small enough that they won't overwhelm the database's log or take too long (see the sketch below).
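For instance, assuming the horizontal target table has already been created, each chunk could be one bounded INSERT ... SELECT (a sketch; the table and column names are assumptions):
INSERT INTO horizontal (id, type10, type14, type25)
SELECT id,
       MAX(CASE WHEN type = 10 THEN value END),
       MAX(CASE WHEN type = 14 THEN value END),
       MAX(CASE WHEN type = 25 THEN value END)
FROM vertical
WHERE id BETWEEN 1 AND 100000  -- advance this window on each batch
GROUP BY id;
COMMIT;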
The vertical table -- also known as the Entity-Attribute-Value anti-pattern -- always becomes a problem, sometimes very shortly after it is put into practice. If you haven't done so already, check out what Joe Celko has to say about this tactic, and you'll see even more proof of how troublesome this approach is. I'll stop there, since you're the smart person who knew to come to this site, and not the guilty but well-intentioned party who perpetrated the EAV table in your database.
The options for dealing with this type of table are not pretty, and, as you've stated, they get worse/slower as the amount of data needed for production queries grows:
1. Build a declared global temporary table (DGTT) that is not logged and preserves committed rows, and use it to stage the horizontal version of the EAV table contents. DGTTs are good for this kind of data shoveling because they do not incur any logging overhead. (A sketch of the declaration appears at the end of this answer.)
2. Employ the traditional CASE and MAX() groupings as shown in the previous recommendation. The problem is that the query changes every time a new TYPE is introduced into your EAV table.
3. Use DB2's SQL-XML publishing features to turn the vertical data into XML. Here's an example that works with the table and column names you provided:
WITH t(id, type, value) AS (
    VALUES (1,10,111), (1,14,222), (2,10,333), (2,25,444)
)
SELECT
    XMLSERIALIZE( CONTENT
        XMLELEMENT(NAME "outer",
            XMLATTRIBUTES(id AS "id"),
            XMLAGG(XMLELEMENT(NAME attr,
                XMLATTRIBUTES(type AS "typeid"), value) ORDER BY type)
        ) AS VARCHAR(1024)
    )
FROM t
GROUP BY id;
The benefit of the SQL-XML approach is that any new values handled by the EAV table will not require a rewrite to the SQL that pivots the values.
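For reference, the DGTT from option 1 could be declared roughly like this (a sketch; the column types are assumptions):
DECLARE GLOBAL TEMPORARY TABLE session.horizontal (
    id     INTEGER,
    type10 INTEGER,
    type14 INTEGER,
    type25 INTEGER
) ON COMMIT PRESERVE ROWS NOT LOGGED;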