Intersection between two tables in DolphinDB server - sql

I'm trying to use the intersection function in DolphinDB as follows:
n=1000000
ID=rand(100, n)
dates=2017.08.07..2017.08.11
date=rand(dates, n)
x=rand(10.0, n)
t=table(ID, date, x)
dbDate = database(, VALUE, 2017.08.07..2017.09.11)
dbID = database(, RANGE, 0 50 100)
db = database("dfs://compodb", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID).append!(t)
dfsTable=loadTable("dfs://compodb","pt")
A = select * from dfsTable where date = 2017.08.07
B = select * from dfsTable where date = 2017.08.08
intersection(A[`x],B[`x])
But I am getting the error:
The both arguments for 'bitAnd'(&) must be integers
Apparently something doesn’t work in this query... any idea?

This document section says this about how to create a vector:
A vector from a table column. For example, trades.qty indicates column qty from table trades.
And it looks like intersection is an alias for &, which for vectors is treated as bitAnd, as said here:
Arguments
Set Operation: X and Y are sets.
Bit Operation: X and Y are equal sized vectors, or Y is a scalar.
So you need to convert vector to set with set(A[`x]) function.

Related

How to convert the following if conditions to Linear integer programming constraints?

These are the conditions:
if(x > 0)
{
y >= a;
z <= b;
}
It is quite easy to convert the conditions into Linear Programming constraints if x were binary variable. But I am not finding a way to do this.
You can do this in 2 steps
Step 1: Introduce a binary dummy variable
Since x is continuous, we can introduce a binary 0/1 dummy variable. Let's call it x_positive
if x>0 then we want x_positive =1. We can achieve that via the following constraint, where M is a very large number.
x < x_positive * M
Note that this forces x_positive to become 1, if x is itself positive. If x is negative, x_positive can be anything. (We can force it to be zero by adding it to the objective function with a tiny penalty of the appropriate sign.)
Step 2: Use the dummy variable to implement the next 2 constraints
In English: if x_positive = 1, then y >= a
However, if x_positive = 0, y can be anything (y > -inf)
y > a - M (1 - x_positive)
Similarly,
if x_positive = 1, then z <= b
z <= b + M * (1 - x_positive)
Both the linear constraints above will kick in if x>0 and will be trivially satisfied if x <=0.

Generate normally distributed series using BIgQuery

Is there a way to generate normally distributed series in BQ? ideally specifying the mean and sd of the distribution.
I found a way using Marsaglia polar method , but it is not ideal for I do not want polar coordinates of the distribution but to generate an array that follows the parameters specified for it to be normally distributed.
Thank you in advance.
This query gives you the euclidean coordinates of the normal distribution centred in 0. You can adjust both the mean (mean variable) or the sd (variance variable) and the x-axis values (GENERATE_ARRAY(beginning,end,step)) :
CREATE TEMPORARY FUNCTION normal(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
var mean=0;
var variance=1;
var x0=1/(Math.sqrt(2*Math.PI*variance));
var x1=-Math.pow(x-mean,2)/(2*Math.pow(variance,2));
return x0*Math.pow(Math.E,x1);
""";
WITH numbers AS
(SELECT x FROM UNNEST(GENERATE_ARRAY(-10, 10,0.5)) AS x)
SELECT x, normal(x) as normal
FROM numbers;
For doing that, I used "User Defined Funtions" [1]. They are used when you want to have another SQL expression or when you want to use Java Script (as I did).
NOTE: I used the probability density function of the normal distribution, if you want to use another you'd need to change variables x0,x1 and the return (I wrote them separately so it's clearer).
Earlier answers give the probability distribution function of a normal rv. Here I modify previous answers to give a random number generated with the desired distribution, in BQ standard SQL, using the 'polar coordinates' method. The question asks not to use polar coordinates, which is an odd request, since polar coordinates are not use in the generation of the normally distributed random number.
CREATE TEMPORARY FUNCTION rnorm ( mu FLOAT64, sigma FLOAT64 ) AS
(
(select mu + sigma*(sqrt( 2*abs(
log( RAND())
)
)
)*cos( 2*ACOS(-1)*RAND())
)
)
;
select
num ,
rnorm(-1, 5.3) as RAND_NORM
FROM UNNEST(GENERATE_ARRAY(1, 17) ) AS num
The easiest way to do it in BQ is by creating a custom function:
CREATE OR REPLACE FUNCTION
`your_project.functions.normal_distribution_pdf`
(x ANY TYPE, mu ANY TYPE, sigma ANY TYPE) AS (
(
SELECT
safe_divide(1,sigma * power(2 * ACOS(-1),0.5)) * exp(-0.5 * power(safe_divide(x-mu,sigma),2))
)
);
Next you only need to apply the function:
with inputs as (
SELECT 1 as x, 0 as mu, 1 as sigma
union all
SELECT 1.5 as x, 1 as mu, 2 as sigma
union all
SELECT 2 as x , 2 as mu, 3 as sigma
)
SELECT x,
`your_project.functions.normal_distribution_pdf`(x, mu, sigma) as normal_pdf
from
inputs

How do I convert an object from an external collection of items to an Integer type

How do I convert object data type to an integer?
I am trying to do a calculation on my application, reading the OPC item’s value. Objects are used to identify the OPC items within a server.
Here is what I am trying to do:
Itemvalues(0) * 1000 + itemvalues(1)
Itemvalues is the OPC item’s value.
Itemvalues is an object data type and it can contain any data type. But you have to convert it.
(*) is multiplication.
1000 is an integer
(+) is addition
Below is the code I tried:
Dim y As Object
Dim yR As Integer
Dim z As Object
Dim zR As Integer
Dim x = 1000
yR = CInt(y)
zR = CInt(z)
y = itemValues(1).Value
z = itemValues(2).Value
itemValues(1).Value = yR * x + zR
But it is displaying 0:
Which is the wrong calculation, and it is because the default value of Object is Nothing (a null reference).. How do I calculate this?
Your question is not entirely clear, but there is an obvious logic problem in your code.
You are assigning the values of yR and zR from the variables y and z before you are assigning anything to y and z. So these values will always be 0.
Swap the assignment order like this:
y = itemValues(1).Value
z = itemValues(2).Value
yR = CInt(y)
zR = CInt(z)
There are other issues in your code but this is probably your immediate problem.

Shuffle data in a repeatable way (ability to get the same "random" order again)

This is the opposite of what most "random order" questions are about.
I want to select data from a database in random order. But I want to be able to repeat certain selects, getting the same order again.
Current (random) select:
SELECT custId, rand() as random from
(
SELECT DISTINCT custId FROM dummy
)
Using this, every key/row gets a random number. Ordering those ascending results in a random order.
But I want to repeat this select, getting the very same order again. My idea is to calculate a random number (r) once per session (e.g. "4") and use this number to shuffle the data in some way.
My first idea:
SELECT custId, custId * 4 as random from
(
SELECT DISTINCT custId FROM dummy
)
(in real life "4" would be something like 4005226664240702)
This results in a different number for each line but the same ones every run. By changing "r" to 5 all numbers will change.
The problem is: multiplication is not sufficient here. It just increases the numbers but keeps the order the same. Therefore I need some other kind of arithmetic function.
More abstract
Starting with my data (A-D). k is the key and r is the random number currently used:
k r
A = 1 4
B = 2 4
C = 3 4
D = 4 4
Doing some calculation using k and r in every line I want to get something like:
k r
A = 1 4 --> 12
B = 2 4 --> 13
C = 3 4 --> 11
D = 4 4 --> 10
The numbers can be whatever they want, but when I order them ascending I want to get a different order than the initial one. In this case D, C, A, B, E.
Setting r to 7 should result in a different order (C, A, B, D):
k r
A = 1 7 --> 56
B = 2 7 --> 78
C = 3 7 --> 23
D = 4 7 --> 80
Every time I use r = 7 should result in the same numbers => same order.
I'm looking for a mathematical function to do the calculation with k and r. Seeding the RAND() function is not suitable because it's not supported by some databases we support
Please note that r is already a randomly generated number
Background
One Table - Two data consumers. One consumer will get random 5% of the table, the other one the other 95%. They don't just get the data but a generated SQL. So there are two SQL's which must not select the same data twice but still random.
You could try and implement the Multiply-With-Carry PseudoRandomNumberGenerator. The C version goes like this (source: Wikipedia):
m_w = <choose-initializer>; /* must not be zero, nor 0x464fffff */
m_z = <choose-initializer>; /* must not be zero, nor 0x9068ffff */
uint get_random()
{
m_z = 36969 * (m_z & 65535) + (m_z >> 16);
m_w = 18000 * (m_w & 65535) + (m_w >> 16);
return (m_z << 16) + m_w; /* 32-bit result */
}
In SQL, you could create a table Random, with two columns to contain w and z, and one ID column to identify each session. Perhaps your vendor supports variables and you need not bother with the table.
Nonetheless, even if we use a table, we immediately run into trouble cause ANSI SQL doesn't support unsigned INTs. In SQL Server I could switch to BIGINT, unsure if your vendor supports that.
CREATE TABLE Random (ID INT, [w] BIGINT, [z] BIGINT)
Initialize a new session, say number 3, by inserting 1 into z and the seed into w:
INSERT INTO Random (ID, w, z) VALUES (3, 8921, 1);
Then each time you wish to generate a new random number, do the computations:
UPDATE Random
SET
z = (36969 * (z % 65536) + z / 65536) % 4294967296,
w = (18000 * (w % 65536) + w / 65536) % 4294967296
WHERE ID = 3
(Note how I have replaced bitwise operands with div and mod operations and how, after computing, you need to mod 4294967296 to stay within the proper 32 bits unsigned int range.)
And select the new value:
SELECT(z * 65536 + w) % 4294967296
FROM Random
WHERE ID = 3
SQLFiddle demo
Not sure if this applies in non-SQL Server, but typically when you use a RAND() function, you can specify a seed. Everytime you specify the same seed, the randomization will be the same.
So, it sounds like you just need to store the seed number and use that each time to get the same set of random numbers.
MSDN Article on RAND
Each vendor has solved this in its own way. Creating your own implementation will be hard, since random number generation is difficult.
Oracle
dbms_random can be initialized with a seed: http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_random.htm#i998255
SQL Server
First call to RAND() can provide a seed: http://technet.microsoft.com/en-us/library/ms177610.aspx
MySql
First call to RAND() can provide a seed: http://dev.mysql.com/doc/refman/4.1/en/mathematical-functions.html#function_rand
Postgresql
Use SET SEED or SELECT setseed() : http://www.postgresql.org/docs/8.3/static/sql-set.html

Treatment of error values in the SQL standard

I have a question about the SQL standard which I'm hoping a SQL language lawyer can help with.
Certain expressions just don't work. 62 / 0, for example. The SQL standard specifies quite a few ways in which expressions can go wrong in similar ways. Lots of languages deal with these expressions using special exceptional flow control, or bottom psuedo-values.
I have a table, t, with (only) two columns, x and y each of type int. I suspect it isn't relevant, but for definiteness let's say that (x,y) is the primary key of t. This table contains (only) the following values:
x y
7 2
3 0
4 1
26 5
31 0
9 3
What behavior is required by the SQL standard for SELECT expressions operating on this table which may involve division(s) by zero? Alternatively, if no one behavior is required, what behaviors are permitted?
For example, what behavior is required for the following select statements?
The easy one:
SELECT x, y, x / y AS quot
FROM t
A harder one:
SELECT x, y, x / y AS quot
FROM t
WHERE y != 0
An even harder one:
SELECT x, y, x / y AS quot
FROM t
WHERE x % 2 = 0
Would an implementation (say, one that failed to realize on a more complex version of this query that the restriction could be moved inside the extension) be permitted to produce a division by zero error in response to this query, because, say it attempted to divide 3 by 0 as part of the extension before performing the restriction and realizing that 3 % 2 = 1? This could become important if, for example, the extension was over a small table but the result--when joined with a large table and restricted on the basis of data in the large table--ended up restricting away all of the rows which would have required division by zero.
If t had millions of rows, and this last query were performed by a table scan, would an implementation be permitted to return the first several million results before discovering a division by zero near the end when encountering one even value of x with a zero value of y? Would it be required to buffer?
There are even worse cases, ponder this one, which depending on the semantics can ruin boolean short-circuiting or require four-valued boolean logic in restrictions:
SELECT x, y
FROM t
WHERE ((x / y) >= 2) AND ((x % 2) = 0)
If the table is large, this short-circuiting problem can get really crazy. Imagine the table had a million rows, one of which had a 0 divisor. What would the standard say is the semantics of:
SELECT CASE
WHEN EXISTS
(
SELECT x, y, x / y AS quot
FROM t
)
THEN 1
ELSE 0
END AS what_is_my_value
It seems like this value should probably be an error since it depends on the emptiness or non-emptiness of a result which is an error, but adopting those semantics would seem to prohibit the optimizer for short-circuiting the table scan here. Does this existence query require proving the existence of one non-bottoming row, or also the non-existence of a bottoming row?
I'd appreciate guidance here, because I can't seem to find the relevant part(s) of the specification.
All implementations of SQL that I've worked with treat a division by 0 as an immediate NaN or #INF. The division is supposed to be handled by the front end, not by the implementation itself. The query should not bottom out, but the result set needs to return NaN in this case. Therefore, it's returned at the same time as the result set, and no special warning or message is brought up to the user.
At any rate, to properly deal with this, use the following query:
select
x, y,
case y
when 0 then null
else x / y
end as quot
from
t
To answer your last question, this statement:
SELECT x, y, x / y AS quot
FROM t
Would return this:
x y quot
7 2 3.5
3 0 NaN
4 1 4
26 5 5.2
31 0 NaN
9 3 3
So, your exists would find all the rows in t, regardless of what their quotient was.
Additionally, I was reading over your question again and realized I hadn't discussed where clauses (for shame!). The where clause, or predicate, should always be applied before the columns are calculated.
Think about this query:
select x, y, x/y as quot from t where x%2 = 0
If we had a record (3,0), it applies the where condition, and checks if 3 % 2 = 0. It does not, so it doesn't include that record in the column calculations, and leaves it right where it is.