I have some data that looks like this, and identifies pairs that are related:
From_ID To_ID
A C
B C
D E
E D (note this is the same pair as above, in a different order)
E F
A F
G H
Using the logic of 'if x is paired with y, and y is paired with z, then x is paired with z', how can I run an SQL query to return all members of a group?
So for the table above I would like a set of results that identifies or returns two groups: 'A, B, C, D, E, F' and 'G, H', not fussy about how this is done.
It feels like some kind of iterative query but I really have no idea where to start with this so any pointers would be appreciated.
edit: could be run in SQL Developer or HiveQL.
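One way to picture the 'if x is paired with y, and y is paired with z' rule is as reachability in an undirected graph, which a recursive common table expression can compute. The sketch below is only an illustration: it runs the query in SQLite through Python's sqlite3 module, the table name `pairs` and the helper `find_groups` are made up for this example, and the recursive syntax would need adapting for Oracle (and may not be available in Hive at all). Each group is labelled by its smallest member.

```python
import sqlite3

# Sample data from the question (table name "pairs" is an assumption).
PAIRS = [("A", "C"), ("B", "C"), ("D", "E"), ("E", "D"),
         ("E", "F"), ("A", "F"), ("G", "H")]

GROUP_QUERY = """
WITH RECURSIVE
edges(a, b) AS (
    -- Treat every pair as undirected by adding the reversed pair.
    SELECT From_ID, To_ID FROM pairs
    UNION
    SELECT To_ID, From_ID FROM pairs
),
reach(node, member) AS (
    -- Every node reaches itself ...
    SELECT a, a FROM edges
    UNION  -- UNION (not UNION ALL) drops duplicates, so the recursion stops
    -- ... and everything reachable from something it already reaches.
    SELECT r.node, e.b FROM reach AS r JOIN edges AS e ON e.a = r.member
)
SELECT node, MIN(member) AS group_id FROM reach GROUP BY node
"""

def find_groups(pairs):
    """Return the related groups as sorted lists of IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pairs (From_ID TEXT, To_ID TEXT)")
    conn.executemany("INSERT INTO pairs VALUES (?, ?)", pairs)
    rows = conn.execute(GROUP_QUERY).fetchall()
    conn.close()
    groups = {}
    for node, group_id in rows:
        groups.setdefault(group_id, []).append(node)
    return sorted(sorted(members) for members in groups.values())

groups = find_groups(PAIRS)
print(groups)
```

Note that the intermediate `reach` relation holds one row per (node, reachable node) pair, so it grows quadratically with group size; for large groups an iterative label-propagation approach is likely to scale better.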
What is the complexity (Big O notation) of a CUBE operation in SQL (Microsoft) or Oracle?
e.g.
SELECT x1, x2, SUM(x3)
FROM xyz
GROUP BY CUBE (x1, x2)
The complexity is:
2^c * n log(n)
where:
c = number of columns in the cube
n = number of rows in the table
The 2^c is for all combinations of the columns. n log(n) is for the aggregation operator -- which is generally equivalent to a sort in the absence of an index.
Because c is rarely large (even c = 10 already yields 2^10 = 1,024 column combinations, which generates a lot of rows), we can treat 2^c as a constant and say the operation is essentially n log(n).
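To make the 2^c factor concrete, here is a small sketch that enumerates the grouping sets of a cube and aggregates once per set. The helper names (`grouping_sets`, `cube_sum`) and the sample rows are invented for illustration. It uses hash aggregation, which is roughly linear per grouping set, rather than the sort-based aggregation assumed in the n log(n) estimate, and a real engine may share work between grouping sets.

```python
from collections import defaultdict
from itertools import combinations

def grouping_sets(columns):
    """All 2^c subsets of the cube columns, from most to least detailed."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def cube_sum(rows, columns, measure):
    """Naive GROUP BY CUBE: one aggregation pass per grouping set."""
    result = {}
    for kept in grouping_sets(columns):
        totals = defaultdict(int)
        for row in rows:
            # Columns outside the current grouping set are rolled up to None,
            # mirroring the NULLs that CUBE produces for subtotal rows.
            key = tuple(row[c] if c in kept else None for c in columns)
            totals[key] += row[measure]
        result.update(totals)
    return result

# Hypothetical contents for table xyz.
rows = [
    {"x1": "a", "x2": "p", "x3": 1},
    {"x1": "a", "x2": "q", "x3": 2},
    {"x1": "b", "x2": "p", "x3": 3},
]
cube = cube_sum(rows, ["x1", "x2"], "x3")
for key in sorted(cube, key=str):
    print(key, cube[key])
```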
I'm not sure how to express this problem, so my apologies if it's already been addressed.
I have business rules summarized as a table of outputs given two inputs. For each of five possible values on one axis, and each of five values on another axis, there is a single output. There are ten distinct possibilities in these 25 cells, so it's not the case that each input pair has a unique output.
I have encoded these rules in TSQL with nested CASE statements, but it's hard to debug and modify. In C# I might use an array literal. I'm wondering if there's an academic topic which relates to converting logical rules to matrices and vice versa.
As an example, one could translate this trivial matrix:
     A   B   C
--  --  --  --
X    1   1   0
Y    0   1   0
...into rules like so:
if B OR (A and X) then 1 else 0
...or, in verbose SQL:
CASE WHEN FieldABC = 'B' THEN 1
     WHEN FieldABC = 'A' AND FieldXY = 'X' THEN 1
     ELSE 0
END
I'm looking for a good approach for larger matrices, especially one I can use in SQL (MS SQL 2K8, if it matters). Any suggestions? Is there a term for this type of translation, with which I should search?
Sounds like a lookup into a 5x5 grid of data, with the inputs on the two axes and the output in each cell:
      Y=1  Y=2  Y=3  Y=4  Y=5
x=1    A    A    D    B    A
x=2    B    A    A    B    B
x=3    C    B    B    B    B
x=4    C    C    C    D    D
x=5    C    C    C    C    C
You can store this in a table of (x, y, outvalue) triplets and then just do a lookup on that table.
SELECT OUTVALUE FROM BUSINESS_RULES WHERE X = @X AND Y = @Y;
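As a rough end-to-end illustration of the lookup-table idea, the sketch below loads the 5x5 grid above into an in-memory SQLite table and queries it with bound parameters. The column types, the primary key on (X, Y), and the `lookup` helper are assumptions made for this example; in SQL Server the same table would be queried with T-SQL variables instead of `?` placeholders.

```python
import sqlite3

# One string per x value (x = 1..5); each character is the output for y = 1..5.
GRID = ["AADBA", "BAABB", "CBBBB", "CCCDD", "CCCCC"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BUSINESS_RULES ("
    " X INTEGER, Y INTEGER, OUTVALUE TEXT, PRIMARY KEY (X, Y))"
)
conn.executemany(
    "INSERT INTO BUSINESS_RULES (X, Y, OUTVALUE) VALUES (?, ?, ?)",
    [(x, y, out)
     for x, row in enumerate(GRID, start=1)
     for y, out in enumerate(row, start=1)],
)

def lookup(x, y):
    """Return the rule output for (x, y), or None if no rule is stored."""
    row = conn.execute(
        "SELECT OUTVALUE FROM BUSINESS_RULES WHERE X = ? AND Y = ?", (x, y)
    ).fetchone()
    return row[0] if row else None

print(lookup(1, 3), lookup(4, 5))
```

Changing a rule then becomes an UPDATE on one row rather than an edit to nested CASE logic, and the primary key guarantees at most one output per input pair.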
I have a table with multiple indexes, several of which duplicate the same columns:
Index 1 columns: X, B, C, D
Index 2 columns: Y, B, C, D
Index 3 columns: Z, B, C, D
I'm not very knowledgeable on indexing in practice, so I'm wondering if somebody can explain why X, Y and Z were paired with these same columns. B is an effective date. C is a semi-unique key ID for this table for a specific effective date B. D is a sequence that identifies the priority of this record for the identifier C.
Why not just create 6 indexes, one for each X, Y, Z, B, C, D?
I want to add an index to another column T, but in some contexts I'll only be querying on T alone while in others I will also be specifying the B, C and D columns... so should I create just one index like above or should I create one for T and one for (T, B, C, D)?
I've not had as much luck as expected when googling for comprehensive coverage of indexing. Are there any resources where I can get a thorough explanation and lots of examples of B-tree indexing?
The rule with indexing is that an index can be used to filter on any list of columns that constitute a prefix of the columns used for that index.
In other words, we can use Index 1 when we filter on X and B, or X, B and C, or just X, or all four.
However, we cannot use the index to filter "in the middle". This is because indexes work not entirely unlike concatenating the values of those columns for each row, and sorting the result. If we know what the thing we're looking for begins with, we can figure out where in the index to look - just like when doing binary search.
That's why a single index is no good: if we need to filter on B, C, D, and one of X, Y and Z, we need three indexes. An index on (X, Y) is no good for filtering on just Y, because the prefix of the values we're looking for (the X) is not known.
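The "concatenate and sort" picture can be sketched with a sorted list of tuples standing in for an index on (X, B, C, D). This is only a toy model, not how any particular database stores its B-trees, and the sample values and function names are invented. A known prefix lets a binary search jump to the right spot; a filter on B alone has no starting point and must look at every entry.

```python
from bisect import bisect_left

# Toy "index" on (X, B, C, D): the rows' key tuples kept in sorted order.
index = sorted([
    (1, 20, 7, 1), (1, 20, 7, 2), (1, 21, 3, 1),
    (2, 20, 7, 1), (2, 22, 5, 1), (3, 20, 9, 1),
])

def prefix_lookup(index, prefix):
    """Find entries starting with `prefix` via binary search plus a short walk."""
    start = bisect_left(index, prefix)  # a shorter tuple sorts before its extensions
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break  # sorted order: no later entry can match
        matches.append(entry)
    return matches

def filter_on_b_only(index, b):
    """Filtering 'in the middle': nothing to search on, so scan everything."""
    return [entry for entry in index if entry[1] == b]

print(prefix_lookup(index, (1,)))      # filter on X
print(prefix_lookup(index, (1, 20)))   # filter on X and B
print(filter_on_b_only(index, 20))     # filter on B alone: full scan
```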
As Daniel mentioned, a covering index is a possible explanation for repeating B, C, and D: even if D is never filtered on, it may be the case that we need exactly the columns which you see in your indexes, and we can then just read the columns from the index instead of just using the index to locate the row.
One reason for having B, C and D in those indexes might be to have a covering index for frequently used queries. You will have a covering index when the index itself contains all the required data fields for a particular query.
A covering index can dramatically speed up data retrieval, since only the index pages, not the data pages, will be used to retrieve the data.
Below is an example query where index 1 would be a covering index:
SELECT B, C, D FROM table WHERE X = '10'
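One way to see the effect, at least in SQLite, is to ask the planner how it intends to run the query. The sketch below is an assumption-laden illustration: the table name `records`, the extra column `E`, and the index name `index_1` are made up, and other databases report covering behaviour differently (for example as an index-only scan). In the SQLite versions I am aware of, the plan text mentions a covering index when the query needs nothing beyond the indexed columns.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the human-readable detail text from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # detail is the last column

conn = sqlite3.connect(":memory:")
# Hypothetical table: E is a column that is NOT part of the index.
conn.execute("CREATE TABLE records (X TEXT, B TEXT, C INTEGER, D INTEGER, E TEXT)")
conn.execute("CREATE INDEX index_1 ON records (X, B, C, D)")

# Needs only X, B, C, D: can be answered from the index alone.
covered_plan = query_plan(conn, "SELECT B, C, D FROM records WHERE X = '10'")
# Also needs E: the index locates the rows, but the table must still be read.
uncovered_plan = query_plan(conn, "SELECT B, C, D, E FROM records WHERE X = '10'")

print(covered_plan)
print(uncovered_plan)
```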
You should create the index on (T, B, C, D).
Let's say you have two fields with an index in a table: A and B. When you create a separate index on each one of the columns, and have a query such as:
SELECT * FROM table WHERE A = 10 AND B = 20
What happens is either:
1) The DB creates two intermediate result-sets, one with rows where A = 10, and another one with rows where B = 20. It then has to merge these two result-sets into one (and also check for duplicate rows).
2) The DB creates one result-set with rows where A = 10. It then has to go manually through all of the rows in this intermediate result-set and check in each one whether B = 20.
However when you know that index B depends on index A, and your query uses A before B, you can create one index for both of the columns: (A, B)
What this means is that the DB will first find all rows where A = 10, but because B is part of the same index, it can use the same index information to narrow the result-set to rows where B is also 20. It doesn't have to make two intermediate result-sets and merge them, or use only one of the indexes and do a manual scan for the other.
There might be other ways that the DB deals with these situations as well, it largely depends on an implementation.
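The difference between the three approaches can be sketched by counting how many index entries each one has to touch. This is a simplified model using Python dictionaries in place of real indexes, with made-up data; actual optimizers weigh many more factors.

```python
from collections import defaultdict

# Hypothetical table: 2000 rows with two indexed columns A and B.
rows = [{"id": i, "A": i % 50, "B": i % 40} for i in range(2000)]

index_a = defaultdict(list)   # single-column index on A
index_b = defaultdict(list)   # single-column index on B
index_ab = defaultdict(list)  # composite index on (A, B)
for row in rows:
    index_a[row["A"]].append(row["id"])
    index_b[row["B"]].append(row["id"])
    index_ab[(row["A"], row["B"])].append(row["id"])

def merge_two_indexes(a, b):
    """Option 1: build both intermediate result-sets, then intersect them."""
    ids_a, ids_b = index_a[a], index_b[b]
    return sorted(set(ids_a) & set(ids_b)), len(ids_a) + len(ids_b)

def one_index_then_filter(a, b):
    """Option 2: use the index on A, then check B row by row."""
    ids_a = index_a[a]
    return [i for i in ids_a if rows[i]["B"] == b], len(ids_a)

def composite_index(a, b):
    """Index on (A, B): go straight to the matching entries."""
    ids = index_ab[(a, b)]
    return list(ids), len(ids)

for strategy in (merge_two_indexes, one_index_then_filter, composite_index):
    ids, touched = strategy(10, 20)
    print(strategy.__name__, "matches:", len(ids), "entries touched:", touched)
```

All three return the same rows; only the amount of work differs.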
The indexes in the form (X, B, C, D) can be used to optimize queries like:
... WHERE X rel sthg (possibly ORDER BY B, C, D)
... WHERE X = sthg AND B rel sthg (possibly ORDER BY C, D)
... WHERE X = sthg AND B = sthg AND C rel sthg (possibly ORDER BY D)
etc., where rel is an arbitrary relational operator (<, >, =, <=, >=) and sthg is a value or expression. The last two in particular, and the sorting variants, wouldn't be optimized by the "single column indexes" variant.
OTOH, it cannot optimize a query
... WHERE B = sthg
because it starts in the middle of the index; here, the single column index would work.
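To picture why an index on (X, B, C, D) helps with a query shaped like `WHERE X = sthg AND B rel sthg`, and why the rows then come back already sorted, here is a toy model that treats the index as a sorted list of tuples. The data and the `range_scan` helper are invented for illustration; real B-tree indexes are more elaborate, but the ordering argument is the same.

```python
from bisect import bisect_left, bisect_right

# Toy index on (X, B, C, D), kept in sorted order like index entries.
index = sorted([
    (1, 5, 2, 1), (1, 3, 9, 2), (1, 3, 1, 7),
    (2, 4, 4, 4), (1, 8, 0, 0), (2, 1, 1, 1),
])

def range_scan(index, x, b_low, b_high):
    """Model of: WHERE X = x AND B >= b_low AND B <= b_high."""
    # First entry with X = x and B >= b_low.
    start = bisect_left(index, (x, b_low))
    # Just past the last entry with X = x and B <= b_high
    # (infinity sorts after any C value).
    stop = bisect_right(index, (x, b_high, float("inf")))
    # The slice is contiguous and already ordered by B, C, D,
    # so an ORDER BY on those columns needs no extra sort.
    return index[start:stop]

print(range_scan(index, 1, 3, 5))
```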
For a resource where you can get a thorough explanation and lots of examples regarding indexes on Oracle (and any other Oracle-related issue), you should visit and bookmark askTom.
Suppose we have the contents of tables x and y in two dataframes in R. Which is the suggested way to perform an operation like the following in sql:
Select x.X1, x.X2, y.X3
into z
from x inner join y on x.X1 = y.X1
I tried the following in R. Is there a better way?
Thank you
x<-data.frame(cbind('X1'=c(5,9,7,6,4,8,3,1,10,2),'X2'=c(5,9,7,6,4,8,3,1,10,2)^2))
y<-data.frame(cbind('X1'=c(9,5,8,2),'X3'=c('nine','five','eight','two')))
z<-cbind(x[which(x$X1 %in% (y$X1)), c(1:2)][order(x[which(x$X1 %in% (y$X1)), c(1:2)]$X1),],y[order(y$X1),2])
This was already answered on Stack Overflow.
Beyond merge, if you're more comfortable with SQL you should check out the sqldf package, which allows you to run SQL queries on data frames.
library(sqldf)
z <- sqldf("SELECT X1, X2, X3 FROM x JOIN y USING(X1)")
That said, you will be better off learning the base R functions (merge, intersect, union, etc.) in the long run.
Ok, it was easy
merge(x,y)