Combining many sort ranks into one master sort rank - sql

Say I have some sorted result from a SQL query that looks like:
x y z
0 0 0
0 0 1
0 0 2
0 1 0
0 1 1
0 2 0
0 2 1
Where x, y and z are sort ranks. These sort ranks are always greater than 0 and smaller than 500 million.
Is there a way to combine the values from x, y and z into one "master" sort rank? Sorting the dataset using this "master" sort rank should result in the same ordering.
I'm thinking I can do something with bit shifting but I am not sure...

Assuming that every value in each of the three columns is between 1 and 500 million, you could use the following formula to generate a unique rank:
z + (500 x 10^6)*y + (500 x 10^6)*(500 x 10^6)*x
To generate this rank you could use the following query:
SELECT
    x, y, z,
    z + (500 * 1000000)*y + (500 * 1000000)*(500 * 1000000)*x AS master_rank
FROM yourTable;
The reason this works can be seen by examining, say, the z and y columns. The largest value the z term can contribute is 500 million, which can never exceed the smallest nonzero contribution of the y term, also 500 million; and since the question states every rank is strictly below 500 million, z's contribution is in fact always smaller. The same logic applies across the whole formula. This approach is similar to using a bit mask, just on a larger scale.
Note that I assume your version of SQL can tolerate numbers this large. If it can't, then you might want to consider another approach here, possibly just ordering by the three columns as @Gordon mentioned in his answer. Besides this, having 1 billion x 1 billion records would make for a very large table and would have other problems.
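As a quick sanity check, here is a small Python sketch (my addition, not part of the original answer) confirming that the combined rank sorts the same way as ordering by (x, y, z); the sample rows are the ones from the question, and BASE only needs to be larger than any individual rank value:

BASE = 500 * 1000000   # larger than any rank, per the question's constraint

def master_rank(x, y, z):
    # same formula as the SQL above
    return z + BASE * y + BASE * BASE * x

# rows from the question's example, already in (x, y, z) order
rows = [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0), (0, 2, 1)]
assert sorted(rows, key=lambda r: master_rank(*r)) == sorted(rows)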

Do you mean something like this?
order by x * 10000 + y * 100 + z
(You would adjust the numbers for the width you need.)
I'm not sure why you would want to do that instead of:
order by x, y, z
If you do combine into a single value, be careful about integer overflow.


Finding the index of max value of columns in numpy array but removing the previous max

I have an array with N rows and M columns.
I would like to run through all the columns, finding the index of the row which contains the max value of the column. However, each row should be selected only once.
For instance, let's consider a matrix
1 1
2 2
The output should be [1, 0]: row 1 (value 2) holds the max value of column 0; then, moving to column 1, row 1 is out of consideration, so row 0 holds the highest remaining cell.
Indeed, this can be solved easily with a nested for loop, something like:
removed_rows = []
for i in range(nb_columns):
    index_max = 0
    value_max = A[0, i]
    for j in range(nb_rows):
        if j in removed_rows:
            continue
        else:
            if value_max < A[j, i]:
                index_max = j
                value_max = A[j, i]
    removed_rows.append(index_max)
However, it seems slow for a huge matrix. Is there a way to do it faster (with numpy)?
Many thanks
This might not be very fast as it still loops through the columns, which I think is unavoidable due to the constraint, but it should be faster than your solution as it finds the maximum's index with argmax:
import numpy as np

out = []
mm = A.min() - 1
for j in range(A.shape[1]):
    idx = np.argmax(A[:, j])
    # replace the entire row with mm
    # so the next `argmax` will ignore this row
    A[idx] = mm
    out.append(idx)
The above takes about 640 µs on 100 x 100 arrays, and 18 ms on 1k x 1k arrays. Your code did not finish on a 1k x 1k array within a reasonable time on my system.
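A minimal usage sketch (my addition, assuming numpy is available) of the loop above on the 2x2 example from the question; note that the loop overwrites rows of the array, so it works on a copy here:

import numpy as np

A = np.array([[1, 1],
              [2, 2]])
B = A.copy()            # the loop destroys its input, so keep A intact

out = []
mm = B.min() - 1
for j in range(B.shape[1]):
    idx = np.argmax(B[:, j])
    B[idx] = mm         # knock this row out of later argmax calls
    out.append(int(idx))

print(out)              # [1, 0], matching the expected output in the question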

Prolog: how to optimize this code (solving the 123456789=100 puzzle)

So there was a puzzle:
This equation is incomplete: 1 2 3 4 5 6 7 8 9 = 100. One way to make
it accurate is by adding seven plus and minus signs, like so: 1 + 2 +
3 – 4 + 5 + 6 + 78 + 9 = 100.
How can you do it using only 3 plus or minus signs?
I'm quite new to Prolog. I solved the puzzle, but I wonder how to optimize it.
makeInt(S,F,FinInt):-
    getInt(S,F,0,FinInt).

getInt(Start, Finish, Acc, FinInt):-
    0 =< Finish - Start,
    NewAcc is Acc*10 + Start,
    NewStart is Start + 1,
    getInt(NewStart, Finish, NewAcc, FinInt).
getInt(Start, Finish, A, A):-
    0 > Finish - Start.

itCounts(X,Y,Z,Q):-
    member(XLastDigit,[1,2,3,4,5,6]),
    FromY is XLastDigit+1,
    numlist(FromY, 7, ListYLastDigit),
    member(YLastDigit, ListYLastDigit),
    FromZ is YLastDigit+1,
    numlist(FromZ, 8, ListZLastDigit),
    member(ZLastDigit, ListZLastDigit),
    FromQ is ZLastDigit+1,
    member(YSign,[-1,1]),
    member(ZSign,[-1,1]),
    member(QSign,[-1,1]),
    0 is XLastDigit + YSign*YLastDigit + ZSign*ZLastDigit + QSign*9,
    makeInt(1, XLastDigit, FirstNumber),
    makeInt(FromY, YLastDigit, SecondNumber),
    makeInt(FromZ, ZLastDigit, ThirdNumber),
    makeInt(FromQ, 9, FourthNumber),
    X is FirstNumber,
    Y is YSign*SecondNumber,
    Z is ZSign*ThirdNumber,
    Q is QSign*FourthNumber,
    100 =:= X + Y + Z + Q.
Not sure this counts as an optimization. The code is just shorter:
sum_123456789_eq_100_with_3_sum_or_sub(L) :-
    append([G1,G2,G3,G4], [0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9]),
    maplist([X]>>(length(X,N), N>0), [G1,G2,G3,G4]),
    maplist([G,F]>>(member(Op, [0'+,0'-]), F=[Op|G]), [G2,G3,G4], [F2,F3,F4]),
    append([G1,F2,F3,F4], L),
    read_term_from_codes(L, T, []),
    100 is T.
It took me a while, but I got what your code is doing. It's something like this:
itCounts(X,Y,Z,Q) :-    % generate X, Y, Z, and Q s.t. X+Y+Z+Q=100, etc.
    generate X as a list of digits
    do the same for Y, Z, and Q
    pick the signs for Y, Z, and Q
    convert all those lists of digits into numbers
    verify that, with the signs, they add to 100.
The inefficiency here is that the testing is all done at the last minute. You can improve the efficiency if you can throw out some possible solutions as soon as you pick one of your numbers, that is, testing earlier.
itCounts(X,Y,Z,Q) :-    % generate X, Y, Z, and Q s.t. X+Y+Z+Q=100, etc.
    generate X as a list of digits, and convert it to a number
    if it's so big or small the rest can't possibly bring the sum back to 100, fail
    generate Y as a list of digits, convert it to a number, and pick its sign
    if it's so big or so small the rest can't possibly bring the sum to 100, fail
    do the same for Z
    do the same for Q
Your function is running pretty fast already, even if I search all possible solutions. It only picks 6 X's; 42 Y's; 224 Z's; and 15 Q's. I don't think optimizing will be worth your while.
But if you really wanted to: I tested this by putting a testing function immediately after selecting an X. It reduced the 6 X's to 3 (all before finding the solution); 42 Y's to 30; 224 Z's to 184; and 15 Q's to 11. I believe we could reduce it further by testing immediately after a Y is picked, to see whether X + YSign*Y is already so large or so small that there can be no solution.
In Prolog programs that are more computationally intensive, moving parts of the 'test' earlier in 'generate and test' algorithms can help a lot.
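To make the "test earlier" idea concrete, here is a small Python sketch (my addition, not the original Prolog) of the same generate-and-test search; it splits the digit string into four numbers and prunes as soon as the first number alone makes 100 unreachable, before any signs are tried:

from itertools import combinations, product

DIGITS = "123456789"

def solve():
    solutions = []
    # three cut points split the digit string into four numbers
    for cuts in combinations(range(1, 9), 3):
        parts, prev = [], 0
        for c in list(cuts) + [9]:
            parts.append(int(DIGITS[prev:c]))
            prev = c
        first, rest = parts[0], parts[1:]
        # "test earlier": the remaining terms can shift the total by at most
        # sum(rest), so fail here before generating any sign combination
        if abs(100 - first) > sum(rest):
            continue
        for signs in product((1, -1), repeat=3):
            if first + sum(s * n for s, n in zip(signs, rest)) == 100:
                solutions.append((first, list(zip(signs, rest))))
    return solutions

print(solve())   # includes 123 - 45 - 67 + 89 = 100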

Shuffle data in a repeatable way (ability to get the same "random" order again)

This is the opposite of what most "random order" questions are about.
I want to select data from a database in random order. But I want to be able to repeat certain selects, getting the same order again.
Current (random) select:
SELECT custId, rand() as random from
(
SELECT DISTINCT custId FROM dummy
)
Using this, every key/row gets a random number. Ordering those ascending results in a random order.
But I want to repeat this select, getting the very same order again. My idea is to calculate a random number (r) once per session (e.g. "4") and use this number to shuffle the data in some way.
My first idea:
SELECT custId, custId * 4 as random from
(
SELECT DISTINCT custId FROM dummy
)
(in real life "4" would be something like 4005226664240702)
This results in a different number for each line but the same ones every run. By changing "r" to 5 all numbers will change.
The problem is: multiplication is not sufficient here. It just increases the numbers but keeps the order the same. Therefore I need some other kind of arithmetic function.
More abstract
Starting with my data (A-D). k is the key and r is the random number currently used:
k r
A = 1 4
B = 2 4
C = 3 4
D = 4 4
Doing some calculation using k and r in every line I want to get something like:
k r
A = 1 4 --> 12
B = 2 4 --> 13
C = 3 4 --> 11
D = 4 4 --> 10
The numbers can be whatever they want, but when I order them ascending I want to get a different order than the initial one. In this case D, C, A, B.
Setting r to 7 should result in a different order (C, A, B, D):
k r
A = 1 7 --> 56
B = 2 7 --> 78
C = 3 7 --> 23
D = 4 7 --> 80
Every time I use r = 7 should result in the same numbers => same order.
I'm looking for a mathematical function to do the calculation with k and r. Seeding the RAND() function is not suitable because it's not supported by some of the databases we have to support.
Please note that r is already a randomly generated number
Background
One Table - Two data consumers. One consumer will get random 5% of the table, the other one the other 95%. They don't just get the data but a generated SQL. So there are two SQL's which must not select the same data twice but still random.
You could try to implement a multiply-with-carry pseudorandom number generator. The C version goes like this (source: Wikipedia):
m_w = <choose-initializer>;    /* must not be zero, nor 0x464fffff */
m_z = <choose-initializer>;    /* must not be zero, nor 0x9068ffff */

uint get_random()
{
    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + m_w;  /* 32-bit result */
}
In SQL, you could create a table Random, with two columns to contain w and z, and one ID column to identify each session. Perhaps your vendor supports variables and you need not bother with the table.
Nonetheless, even if we use a table, we immediately run into trouble because ANSI SQL doesn't support unsigned INTs. In SQL Server I could switch to BIGINT; I'm unsure whether your vendor supports that.
CREATE TABLE Random (ID INT, [w] BIGINT, [z] BIGINT)
Initialize a new session, say number 3, by inserting 1 into z and the seed into w:
INSERT INTO Random (ID, w, z) VALUES (3, 8921, 1);
Then each time you wish to generate a new random number, do the computations:
UPDATE Random
SET
    z = (36969 * (z % 65536) + z / 65536) % 4294967296,
    w = (18000 * (w % 65536) + w / 65536) % 4294967296
WHERE ID = 3
(Note how I have replaced the bitwise operators with div and mod operations and how, after computing, you need to take mod 4294967296 to stay within the proper 32-bit unsigned int range.)
And select the new value:
SELECT (z * 65536 + w) % 4294967296
FROM Random
WHERE ID = 3
SQLFiddle demo
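For reference, here is a small Python sketch (my addition) of the same multiply-with-carry arithmetic in the div/mod form used by the UPDATE above, seeded with the example session values (w = 8921, z = 1); the same seed always reproduces the same sequence:

M = 4294967296  # 2**32

def mwc(w, z):
    while True:
        z = (36969 * (z % 65536) + z // 65536) % M
        w = (18000 * (w % 65536) + w // 65536) % M
        yield (z * 65536 + w) % M   # same expression as the final SELECT

gen = mwc(w=8921, z=1)
print(next(gen))   # first value for this seed
print(next(gen))   # second value; rerunning with the same seed gives the same pair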
Not sure if this applies outside SQL Server, but typically when you use a RAND() function, you can specify a seed. Every time you specify the same seed, the randomization will be the same.
So, it sounds like you just need to store the seed number and use that each time to get the same set of random numbers.
MSDN Article on RAND
Each vendor has solved this in its own way. Creating your own implementation will be hard, since random number generation is difficult.
Oracle
dbms_random can be initialized with a seed: http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_random.htm#i998255
SQL Server
First call to RAND() can provide a seed: http://technet.microsoft.com/en-us/library/ms177610.aspx
MySQL
First call to RAND() can provide a seed: http://dev.mysql.com/doc/refman/4.1/en/mathematical-functions.html#function_rand
PostgreSQL
Use SET SEED or SELECT setseed() : http://www.postgresql.org/docs/8.3/static/sql-set.html

Approaches to converting a table of possibilities into logical statements

I'm not sure how to express this problem, so my apologies if it's already been addressed.
I have business rules summarized as a table of outputs given two inputs. For each of five possible values on one axis, and each of five values on another axis, there is a single output. There are ten distinct possibilities in these 25 cells, so it's not the case that each input pair has a unique output.
I have encoded these rules in TSQL with nested CASE statements, but it's hard to debug and modify. In C# I might use an array literal. I'm wondering if there's an academic topic which relates to converting logical rules to matrices and vice versa.
As an example, one could translate this trivial matrix:
     A  B  C
    -- -- --
X    1  1  0
Y    0  1  0
...into rules like so:
if B OR (A and X) then 1 else 0
...or, in verbose SQL:
CASE WHEN FieldABC = 'B' THEN 1
     WHEN FieldABC = 'A' AND FieldXY = 'X' THEN 1
     ELSE 0
END
I'm looking for a good approach for larger matrices, especially one I can use in SQL (MS SQL 2K8, if it matters). Any suggestions? Is there a term for this type of translation, with which I should search?
Sounds like a lookup into a 5x5 grid of data. The inputs are on the axes and the output is in each cell:
       Y=1  Y=2  Y=3  Y=4  Y=5
x=1     A    A    D    B    A
x=2     B    A    A    B    B
x=3     C    B    B    B    B
x=4     C    C    C    D    D
x=5     C    C    C    C    C
You can store this in a table of x,y,outvalue triplets and then just do a look up on that table.
SELECT OUTVALUE FROM BUSINESS_RULES WHERE X = @X AND Y = @Y;
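The same idea sketched outside SQL (my addition, in Python): keep the rules as data rather than nested conditionals, and make each evaluation a plain lookup. The grid is the example grid above:

# business rules as (x, y) -> output data, mirroring the 5x5 grid above
rules = {
    (1, 1): 'A', (1, 2): 'A', (1, 3): 'D', (1, 4): 'B', (1, 5): 'A',
    (2, 1): 'B', (2, 2): 'A', (2, 3): 'A', (2, 4): 'B', (2, 5): 'B',
    (3, 1): 'C', (3, 2): 'B', (3, 3): 'B', (3, 4): 'B', (3, 5): 'B',
    (4, 1): 'C', (4, 2): 'C', (4, 3): 'C', (4, 4): 'D', (4, 5): 'D',
    (5, 1): 'C', (5, 2): 'C', (5, 3): 'C', (5, 4): 'C', (5, 5): 'C',
}

def out_value(x, y):
    return rules[(x, y)]

print(out_value(3, 4))   # 'B', the same answer the SQL lookup returns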

Treatment of error values in the SQL standard

I have a question about the SQL standard which I'm hoping a SQL language lawyer can help with.
Certain expressions just don't work. 62 / 0, for example. The SQL standard specifies quite a few expressions that can go wrong in similar ways. Lots of languages deal with these expressions using special exceptional flow control, or bottom pseudo-values.
I have a table, t, with (only) two columns, x and y each of type int. I suspect it isn't relevant, but for definiteness let's say that (x,y) is the primary key of t. This table contains (only) the following values:
x y
7 2
3 0
4 1
26 5
31 0
9 3
What behavior is required by the SQL standard for SELECT expressions operating on this table which may involve division(s) by zero? Alternatively, if no one behavior is required, what behaviors are permitted?
For example, what behavior is required for the following select statements?
The easy one:
SELECT x, y, x / y AS quot
FROM t
A harder one:
SELECT x, y, x / y AS quot
FROM t
WHERE y != 0
An even harder one:
SELECT x, y, x / y AS quot
FROM t
WHERE x % 2 = 0
Would an implementation (say, one that failed to realize, on a more complex version of this query, that the restriction could be moved inside the extension) be permitted to produce a division-by-zero error in response to this query because, say, it attempted to divide 3 by 0 as part of the extension before performing the restriction and realizing that 3 % 2 = 1? This could become important if, for example, the extension was over a small table, but the result, when joined with a large table and restricted on the basis of data in the large table, ended up restricting away all of the rows which would have required division by zero.
If t had millions of rows, and this last query were performed by a table scan, would an implementation be permitted to return the first several million results before discovering a division by zero near the end when encountering one even value of x with a zero value of y? Would it be required to buffer?
There are even worse cases; ponder this one, which, depending on the semantics, can ruin boolean short-circuiting or require four-valued boolean logic in restrictions:
SELECT x, y
FROM t
WHERE ((x / y) >= 2) AND ((x % 2) = 0)
If the table is large, this short-circuiting problem can get really crazy. Imagine the table had a million rows, one of which had a 0 divisor. What would the standard say is the semantics of:
SELECT CASE
WHEN EXISTS
(
SELECT x, y, x / y AS quot
FROM t
)
THEN 1
ELSE 0
END AS what_is_my_value
It seems like this value should probably be an error since it depends on the emptiness or non-emptiness of a result which is an error, but adopting those semantics would seem to prohibit the optimizer from short-circuiting the table scan here. Does this existence query require proving the existence of one non-bottoming row, or also the non-existence of a bottoming row?
I'd appreciate guidance here, because I can't seem to find the relevant part(s) of the specification.
All implementations of SQL that I've worked with treat a division by 0 as an immediate NaN or #INF. The division is supposed to be handled by the front end, not by the implementation itself. The query should not bottom out, but the result set needs to return NaN in this case. Therefore, it's returned at the same time as the result set, and no special warning or message is brought up to the user.
At any rate, to properly deal with this, use the following query:
select
    x, y,
    case y
        when 0 then null
        else x / y
    end as quot
from
    t
To answer your last question, this statement:
SELECT x, y, x / y AS quot
FROM t
Would return this:
x y quot
7 2 3.5
3 0 NaN
4 1 4
26 5 5.2
31 0 NaN
9 3 3
So, your exists would find all the rows in t, regardless of what their quotient was.
Additionally, I was reading over your question again and realized I hadn't discussed where clauses (for shame!). The where clause, or predicate, should always be applied before the columns are calculated.
Think about this query:
select x, y, x/y as quot from t where x%2 = 0
If we had a record (3, 0), the where condition is applied first and checks whether 3 % 2 = 0. It does not hold, so that record isn't included in the column calculations and is left right where it is.
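The same two points sketched in Python (my addition), mirroring the guarded division of the CASE expression and the fact that the predicate is applied before the column calculation:

rows = [(7, 2), (3, 0), (4, 1), (26, 5), (31, 0), (9, 3)]   # the table t from the question

def quot(x, y):
    # plays the role of: case y when 0 then null else x / y end
    return None if y == 0 else x / y

# the WHERE x % 2 = 0 filter runs first, so (3, 0) and (31, 0) never reach the division
result = [(x, y, quot(x, y)) for (x, y) in rows if x % 2 == 0]
print(result)   # [(4, 1, 4.0), (26, 5, 5.2)]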