I am new to SQL so don't know much about it. Please help.
I have a database something like this:
User Model
A X
A X
A X
B Y
C Y
C Y
D X
D X
E Z....
I want to calculate the frequency of each unique model with respect to unique users. That is, the output should look something like this:
Model Count
X 2
Y 2
Z 1
Since A and D use model X, X = 2. Similarly, B and C use model Y, so Y = 2, and the same goes for Z (only user E). How do I achieve this?
Use GROUP BY together with COUNT(DISTINCT ...), so that each user is counted only once per model:
select model,count(distinct user)
from tbl
group by model
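To check the query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The column name user_name is my own substitution (USER is a reserved word in some databases), and the in-memory table is only for illustration:

```python
import sqlite3

# In-memory database holding the sample rows from the question.
# "user_name" is an assumed column name, chosen to avoid the reserved word USER.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (user_name TEXT, model TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [("A", "X"), ("A", "X"), ("A", "X"), ("B", "Y"), ("C", "Y"),
     ("C", "Y"), ("D", "X"), ("D", "X"), ("E", "Z")],
)

# COUNT(DISTINCT ...) counts each user once per model, however many rows they have.
rows = conn.execute(
    "SELECT model, COUNT(DISTINCT user_name) "
    "FROM tbl GROUP BY model ORDER BY model"
).fetchall()
print(rows)  # [('X', 2), ('Y', 2), ('Z', 1)]
```

A plain COUNT(*) would instead return 5, 3 and 1 here, because it counts rows rather than users.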
Related
I have some data that looks like this, and identifies pairs that are related:
From_ID To_ID
A C
B C
D E
E D (note this is the same pair as above, in a different order)
E F
A F
G H
Using the logic of 'if x is paired with y, and y is paired with z, then x is paired with z', how can I run an SQL query to return all members of a group?
So for the table above I would like a set of results that identifies or returns two groups: 'A, B, C, D, E, F' and 'G, H', not fussy about how this is done.
It feels like some kind of iterative query but I really have no idea where to start with this so any pointers would be appreciated.
edit: could be run in SQL Developer or HiveQL.
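For what it's worth, the grouping described above is the "connected components" problem from graph theory. Outside the database it can be sketched with a union-find structure; the following is only an illustration in Python (the function name connected_groups is made up here), not a SQL Developer or HiveQL solution:

```python
from collections import defaultdict


def connected_groups(pairs):
    """Group IDs so that any two IDs linked by a chain of pairs share a group."""
    parent = {}

    def find(node):
        # Follow parent links to the representative of node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps chains short
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted(sorted(members) for members in groups.values())


# The pairs from the table above; ("E", "D") repeats ("D", "E") in reverse order.
pairs = [("A", "C"), ("B", "C"), ("D", "E"), ("E", "D"),
         ("E", "F"), ("A", "F"), ("G", "H")]
print(connected_groups(pairs))
# [['A', 'B', 'C', 'D', 'E', 'F'], ['G', 'H']]
```

The order of the IDs within a pair does not matter, so the duplicate reversed pair is harmless.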
Say I have some sorted result from a SQL query that looks like:
x y z
0 0 0
0 0 1
0 0 2
0 1 0
0 1 1
0 2 0
0 2 1
Here x, y and z are sort ranks. These sort ranks are always greater than 0 and smaller than 500 million.
Is there a way to combine the values from x, y and z into one "master" sort rank? Sorting the dataset using this "master" sort rank should result in the same ordering.
I'm thinking I can do something with bit shifting but I am not sure...
Assuming that every value in each of the three columns is between 1 and 500 million, you could use the following formula to generate a unique rank:
z + (500 x 10^6)*y + (500 x 10^6)*(500 x 10^6)*x
To generate this rank you could use the following query:
SELECT
x, y, z,
z + (500 * 1000000)*y + (500 * 1000000)*(500 * 1000000)*x AS master_rank
FROM yourTable;
The reason this works can be seen by examining, say, the z and y columns. The value of z varies by less than 500 million, while the y term changes in steps of 500 million, so a difference in z can never outweigh a difference in y. The same logic applies between y and x. This approach is similar to using a bit mask, on a larger scale.
Note that I assume that your version of SQL can tolerate numbers this large. If it can't, then you might want to consider another approach here, possibly just ordering by the three columns as @Gordon mentioned in his answer. Besides this, having 1 bil x 1 bil records would make for a very large table and would have other problems.
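As a quick sanity check of the formula, here is a small Python sketch (Python integers have arbitrary precision, so nothing overflows). The helper name master_rank and the sample values are my own; the check confirms that sorting by the combined value matches sorting by (x, y, z), and shows how large the value can get:

```python
from itertools import product

BASE = 500 * 1_000_000  # every rank is assumed to be smaller than this


def master_rank(x, y, z):
    # Same formula as above: x is the most significant "digit", z the least.
    return z + BASE * y + BASE * BASE * x


# A few sample ranks, including the largest value allowed by the assumption.
rows = list(product([0, 1, 2, BASE - 1], repeat=3))

by_columns = sorted(rows)  # equivalent to ORDER BY x, y, z
by_master = sorted(rows, key=lambda r: master_rank(*r))
print(by_columns == by_master)  # True

# The largest possible combined rank is BASE**3 - 1, about 1.25 x 10^26,
# which does not fit in a signed 64-bit integer.
largest = master_rank(BASE - 1, BASE - 1, BASE - 1)
fits_in_int64 = largest < 2**63
print(fits_in_int64)  # False
```

So with ranks up to 500 million, a 64-bit integer column is not wide enough for the combined value; a wider numeric type would be needed.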
Do you mean something like this?
order by x * 10000 + y * 100 + z
(You would adjust the numbers for the width you need.)
I'm not sure why you would want to do that instead of:
order by x, y, z
If you do combine into a single value, be careful about integer overflow.
I'm doing a query on a complex db:
SELECT *
FROM
table1,
table2,
table3,
table4,
table5,
table6,
table7,
table8
WHERE
a = b and
c = d and
e = d and
(
(strfldvar = 'BROKEN_ARROW' AND x = g)
OR (strfldvar = 'BROKEN_BOX' AND y = g)
) and
f = h and
i = j
This only works when strfldvar = 'BROKEN_BOX', not when strfldvar = 'BROKEN_ARROW'. When I replace
(
(strfldvar = 'BROKEN_ARROW' AND x = g)
OR (strfldvar = 'BROKEN_BOX' AND y = g)
) and
with either "x = g and" or "y = g and", it works fine as two separate queries. The error message for the case strfldvar = 'BROKEN_ARROW' is:
ORA-01013: user requested cancel of current operation
Before this error message appears, the computer goes into deep thought for, I guess, 2 minutes.
What am I doing wrong here?
FYI, I looked at the field names of the two separate runs and they appear identical. I mean the schema of the output looks the same for both, but I'm not 100% sure they are the same, if that matters.
Thanks for your help
When strfldvar = 'BROKEN_ARROW' AND x = g (or if strfldvar is not BROKEN_ARROW or BROKEN_BOX), the y = g part is not evaluated, which seems to be causing the query to run for longer than you expect - until it's eventually killed by you, your client or resource limits. I suspect that's the only join condition for whichever table y is from, so you end up with a cartesian product.
When strfldvar = 'BROKEN_BOX' then both x = g and y = g will be evaluated, so you wouldn't get the same cartesian product, against either of the tables providing x and y.
If you are essentially deciding which table to include in the query based on that flag, then you'll need to redesign this: possibly with a union of two queries, one which joins on x and the other on y; or with separate queries where you decide which to run; or maybe even with outer joins. But it depends on what you're really trying to do and what the data looks like. The code you have shown is too generic to guess what will be appropriate.
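To illustrate the cartesian product and the union rewrite, here is a small sketch using Python's sqlite3 module. The tables items, arrows and boxes and their columns are entirely hypothetical, since the real schema isn't shown; only the shape of the WHERE clause follows the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: the flag and g live in one table, x and y in two others.
    CREATE TABLE items  (g INTEGER, strfldvar TEXT);
    CREATE TABLE arrows (x INTEGER, arrow_name TEXT);
    CREATE TABLE boxes  (y INTEGER, box_name TEXT);
    INSERT INTO items  VALUES (1, 'BROKEN_ARROW'), (2, 'BROKEN_BOX');
    INSERT INTO arrows VALUES (1, 'arrow-1'), (2, 'arrow-2'), (3, 'arrow-3');
    INSERT INTO boxes  VALUES (1, 'box-1'), (2, 'box-2'), (3, 'box-3');
""")

# OR version: whichever table is not constrained by the matching branch
# contributes every one of its rows to the result.
or_rows = conn.execute("""
    SELECT * FROM items, arrows, boxes
    WHERE (strfldvar = 'BROKEN_ARROW' AND x = g)
       OR (strfldvar = 'BROKEN_BOX'   AND y = g)
""").fetchall()

# UNION ALL version: each branch joins only the table it actually needs.
union_rows = conn.execute("""
    SELECT g, strfldvar, arrow_name AS detail
    FROM items JOIN arrows ON x = g
    WHERE strfldvar = 'BROKEN_ARROW'
    UNION ALL
    SELECT g, strfldvar, box_name
    FROM items JOIN boxes ON y = g
    WHERE strfldvar = 'BROKEN_BOX'
""").fetchall()

print(len(or_rows), len(union_rows))  # 6 2
```

With three rows per side table the blow-up is only a factor of three; with large tables the same effect could plausibly explain a query that runs until it is cancelled.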
I'm really hoping I can describe this question in an understandable way. This is a puzzle that I have not been able to begin to solve even though I (mostly) understand it. I'm just not sure where to start, and I'm really hoping someone out there can get me headed in the right direction.
I have a LARGE table of data. It describes relationships between objects. Let's say the Y-axis has items numbered 1-1000, and the X-axis has items 1-1000 also. If item #234 on the Y-axis is related to item #791 on X, there will be a mark in the table where the row and column cross. In some industries this is referred to as a Truth Table. One can, at a glance, see how many items in a system relate to each other. The marks in the table can help to identify trends and patterns.
Here's some other helpful stuff about the nature of the table:
The full range of the number of relationships (r) for each item on either axis can be 1 <= r <= axisTotal.
The X and Y axis will share common items, but each axis will also have items that the other axis does not.
Each item will only exist once per axis. It can be on X and Y, but it would only be on each one 1 time.
The total number of items on each axis will most likely NOT be equal. Each axis could have from 50 to 1000's of items.
The end result is that this is going to be a report that needs to be printed. We have successfully printed a table that had about 100-150 items on each axis on an 11in X 17in piece of paper. Any more than that and it begins to be so small it's unreadable.
What I am trying to do is split the super large tables into smaller tables, but related points need to stay together. If I grab item 1-100 on X then I would need each item they relate to from Y.
I've generated a number of these tables and, while the number of relationships CAN be arbitrary, I have never seen an item relate to all other items. So in real practice the range is more like 1 <= r <= (10% * axisTotal). If an item's relationships exceed this range, it can be split up into multiple tables, but that is not optimal at all.
At the end of the day I think we, and our clients, would be happy if a 1000x1000 item table was split into 8 to 10 printed pages of smaller, related tables.
Any guidance would be a great help! Thanks.
---EDIT---
One other thing worth noting, there will be no empty rows or columns in the table. Every item on both the x and y axis will relate to at least 1 item on the opposite axis.
---EDIT---
Here is an example of a small truth table like the one I'm describing (image not included). Every row and column has at least one relationship.
---EDIT---
May 18th, 2011
For what it's worth, I was moving pretty well on this project and then got pulled off for a couple of weeks. So it's going to be a little while before I get back to this problem. But it is one that I will have to solve soon.
---EDIT---
July 11th, 2011
Bummer. Well, looks like I'm not going to be able to solve this problem right now. I was really hoping to be able to figure this out. Through discussion we decided to present the truth table in an Excel spreadsheet as an add-on resource to the main report. Excel 2007 and later will handle 1000's of columns which will more than suffice. Plus, we added some VBA which allows the viewer to double click on the column titles. This action will reduce the rows to only ones where there are interactions. Then it removes empty columns. In this way they can see a small sub-table based on the item they want to view, and can print it if they want.
This isn't an answer, I just want to try to visualize your data a little better. Does it kind of look like this?
Alice Bob Charlie ... Zelda
Shoes X X
Hats X X
Gloves X
...
Pants X
EDIT
Is it a requirement to show the data in tabular format? Or could you just list each out? Something like:
Alice
Shoes
Bob
Hats
Pants
Charlie
Shoes
Gloves
Zelda
Hats
Or the other way:
Shoes
Alice
Charlie
Hats
Bob
Zelda
Gloves
Charlie
Pants
Bob
EDIT 2
Okay, I've made another larger truth table to hopefully get a better understanding of how you want to split things up:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 x x x x
2 x x x x x x
3 x x x x
4 x x x
5 x x x
6 x x x
7 x x x
8 x x x
For argument's sake let's just say that you can only fit 4 rows on a page (because I don't feel like typing out a giant table this early in the morning), so we're going to split this into two pages. First, it is important to show every row, right? Second, do you need to show columns that never have a value? For instance, Y and Z never have a value for rows 1 through 8 in this table; can they be excluded from the report or do they still need to be there? Third, is the order of the rows important?
If it's not important to show completely empty columns, then we could remove 10 columns from the table above and compress it down to:
A B C E F H I L M O P Q R U V W
1 x x x x
2 x x x x x x
3 x x x x
4 x x x
5 x x x
6 x x x
7 x x x
8 x x x
Then, if row order isn't important, you can compress it further by taking an optimum row arrangement (not necessarily shown here). The two tables below have been further compressed to 11 and 10 columns:
A B C F H I M P Q R U
1 x x x x
2 x x x x x x
5 x x x
7 x x x
A E H I L M O P U W
3 x x x x
4 x x x
6 x x x
8 x x x
Am I going down a completely wrong path here? These are all just questions to help me better understand your data and output requirements.
Also, in all seriousness, is it an option to get larger printers/plotters? Or is it an option to just generate a PDF and use Acrobat's tile printing option?
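The column-dropping step above can be sketched roughly as follows. This is only an illustration in Python with made-up marks (the function name split_into_pages and the data are hypothetical); it pages the rows in their given order and does not try to find an optimum row arrangement:

```python
def split_into_pages(marks, rows_per_page):
    """Split rows into pages, keeping only columns that have a mark on that page.

    marks maps each row label to the set of column labels it is related to.
    """
    row_labels = list(marks)
    pages = []
    for start in range(0, len(row_labels), rows_per_page):
        page_rows = row_labels[start:start + rows_per_page]
        used_columns = set()
        for row in page_rows:
            used_columns |= marks[row]  # columns with at least one mark on this page
        pages.append((page_rows, sorted(used_columns)))
    return pages


# Hypothetical marks; in practice these would come from the relationship data.
marks = {
    1: {"A", "B"},
    2: {"B", "C"},
    3: {"E", "F"},
    4: {"F"},
}
pages = split_into_pages(marks, rows_per_page=2)
print(pages)
# [([1, 2], ['A', 'B', 'C']), ([3, 4], ['E', 'F'])]
```

Reordering the rows before paging, so that rows sharing many columns land on the same page, is where further compression would come from.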
Last year I read an article at the Computational Biology PLoS journal (www.ploscompbiol.org), that seems related to your problem.
In short, it describes a new approach for the case where we already have a set of proteins and tabular data about their one-to-one interactions, and we want to group them so that interaction inside a group and interaction between two groups is either maximized or (this is the innovative idea) minimized.
If we plot the start data table with black for high and white for low interaction it looks randomly gray. The result table, after the calculations and rearranging is done (so grouped items are placed near one another), looks more like orthogonal areas of black and white.
The article: Protein Interaction Networks—More Than Mere Modules,
where there are also references to other older techniques for grouping this kind of data.
I'm not sure how to express this problem, so my apologies if it's already been addressed.
I have business rules summarized as a table of outputs given two inputs. For each of five possible values on one axis, and each of five values on another axis, there is a single output. There are ten distinct possibilities in these 25 cells, so it's not the case that each input pair has a unique output.
I have encoded these rules in TSQL with nested CASE statements, but it's hard to debug and modify. In C# I might use an array literal. I'm wondering if there's an academic topic which relates to converting logical rules to matrices and vice versa.
As an example, one could translate this trivial matrix:
    A   B   C
--  --  --  --
X   1   1   0
Y   0   1   0
...into rules like so:
if B OR (A and X) then 1 else 0
...or, in verbose SQL:
CASE WHEN FieldABC = 'B' THEN 1
     WHEN FieldABC = 'A' AND FieldXY = 'X' THEN 1
     ELSE 0
END
I'm looking for a good approach for larger matrices, especially one I can use in SQL (MS SQL 2K8, if it matters). Any suggestions? Is there a term for this type of translation, with which I should search?
Sounds like a lookup into a 5x5 grid of data, with the inputs on the axes and the output in each cell:
      Y=1  Y=2  Y=3  Y=4  Y=5
x=1    A    A    D    B    A
x=2    B    A    A    B    B
x=3    C    B    B    B    B
x=4    C    C    C    D    D
x=5    C    C    C    C    C
You can store this in a table of x,y,outvalue triplets and then just do a look up on that table.
SELECT OUTVALUE FROM BUSINESS_RULES WHERE X = @X AND Y = @Y;
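As a rough end-to-end sketch of the lookup-table idea, here it is with Python's sqlite3 module standing in for SQL Server (so ? placeholders take the place of the @X and @Y parameters). The table and column names follow the query above; the lookup helper is made up for illustration:

```python
import sqlite3

# The 5x5 grid from above: one string per x value, one letter per y value.
GRID = [
    "A A D B A",
    "B A A B B",
    "C B B B B",
    "C C C D D",
    "C C C C C",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business_rules ("
    "x INTEGER, y INTEGER, outvalue TEXT, PRIMARY KEY (x, y))"
)

# Flatten the grid into (x, y, outvalue) triplets, one row per cell.
triplets = [
    (x, y, value)
    for x, row in enumerate(GRID, start=1)
    for y, value in enumerate(row.split(), start=1)
]
conn.executemany("INSERT INTO business_rules VALUES (?, ?, ?)", triplets)


def lookup(x, y):
    """Return the output for an (x, y) input pair, or None if no rule exists."""
    row = conn.execute(
        "SELECT outvalue FROM business_rules WHERE x = ? AND y = ?", (x, y)
    ).fetchone()
    return row[0] if row else None


print(lookup(1, 3))  # D
print(lookup(3, 1))  # C
```

Changing a rule then becomes an UPDATE on one row rather than an edit to nested CASE logic, and the primary key on (x, y) guarantees a single output per input pair.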