SQL where/or confusion - sql

I'm running sql statements on a huge db for the first time and I have code as such.
Select x, sum(y), sum(z) from db
where n = 'xxx' or n = 'yyy' and m = int
group by x
Now if I do this
Select x, sum(y), sum(z) from db
where n = 'xxx' and m = int
group by x
Select x, sum(y), sum(z) from db
where n = 'yyy' and m = int
group by x
When I manually add the grouped values from the two result sets together, I get different results than from the single query, and the separate queries appear to be more accurate.
E.g. the result for row 1 in the first query is about 20 million, while adding the row 1 values from the second block of code gives about 18 million. Not sure what the issue is...?

It's best to use parentheses when ORs are combined with ANDs.
select x, sum(y), sum(z) from db
where (n = 'xxx' or n = 'yyy') and m = int
group by x
In SQL, an AND takes precedence over an OR.
So this:
where n = 'xxx' or n = 'yyy' and m = int
Is actually processed as:
where n = 'xxx' or (n = 'yyy' and m = int)
And that gets the n that are 'xxx' with any m.
Anyway, Gordon has a point.
Using an IN for this is better. Even if it's only 2.
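The difference between the three forms can be checked on a tiny table. The following is a minimal sketch using Python's built-in sqlite3 module rather than the asker's database; the table contents and the helper name `total` are made up for illustration, and `m = 1` stands in for `m = int`.

```python
import sqlite3

# In-memory stand-in for the asker's table; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db (x TEXT, y INTEGER, n TEXT, m INTEGER)")
conn.executemany(
    "INSERT INTO db VALUES (?, ?, ?, ?)",
    [
        ("a", 10, "xxx", 1),
        ("a", 20, "xxx", 2),  # n matches, m does not
        ("a", 30, "yyy", 1),
        ("a", 40, "yyy", 2),  # n matches, m does not
    ],
)


def total(where):
    """Return sum(y) over the rows selected by the given WHERE clause."""
    return conn.execute(f"SELECT sum(y) FROM db WHERE {where}").fetchone()[0]


# AND binds tighter than OR, so every 'xxx' row is included whatever m is.
unparenthesized = total("n = 'xxx' OR n = 'yyy' AND m = 1")
# Parentheses (or IN) apply the m filter to both values of n.
parenthesized = total("(n = 'xxx' OR n = 'yyy') AND m = 1")
with_in = total("n IN ('xxx', 'yyy') AND m = 1")

print(unparenthesized, parenthesized, with_in)  # 60 40 40
```

The unparenthesized total is larger because it also counts the 'xxx' row with the wrong m, which matches the inflated figure described in the question.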

Use in. Your code doesn't really make sense:
where n in ('xxx', 'yyy') and m = int
This query:
where n = 'xxx' or 'yyy' and m = int
should return an error in SQL Server, because of the dangling 'yyy'. MySQL accepts this syntax. In that database, it would be processed as:
where n = 'xxx' or 'yyy' and m = int
-- AND has higher precedence than `or`
where n = 'xxx' or ('yyy' and m = int)
-- `'yyy'` is converted to an integer
where n = 'xxx' or (0 and m = int)
-- which is treated as a boolean
where n = 'xxx' or (false and m = int)
-- which is grouped like this
where n = 'xxx' or (false and (m = int))
-- which is equivalent to
where n = 'xxx'

Related

postgresql Multiple identical conditions are unified into one parameter

I have a SQL query that needs to convert a string column to an array, and I have to filter on that array. The SQL looks like this:
select
parent_line,
string_to_array(parent_line, '-')
from
bx_crm.department
where
status = 0 and
'851' = ANY(string_to_array(parent_line, '-')) and
array_length(string_to_array(parent_line, '-'), 1) = 5;
parent_line is a varchar(50) column; the data in it looks like 0-1-851-88.
Questions:
string_to_array(parent_line, '-') appears many times in my SQL.
How many times is string_to_array(parent_line, '-') calculated for each row: once or three times?
How can I turn string_to_array(parent_line, '-') into a parameter, so that my SQL ends up looking like this:
depts = string_to_array(parent_line, '-')
select
parent_line,
depts
from
bx_crm.department
where
status = 0 and
'851' = ANY(depts) and
array_length(depts, 1) = 5;
Postgres supports lateral joins, which can simplify this logic:
select parent_line, v.parents, status, ... other columns ...
from bx_crm.department d cross join lateral
(values (string_to_array(parent_line, '-'))) v(parents)
where d.status = 0 and
cardinality(v.parents) = 5 and
'851' = any(v.parents)
Use a derived table:
select *
from (
select parent_line,
string_to_array(parent_line, '-') as parents,
status,
... other columns ...
from bx_crm.department
) x
where status = 0
and cardinality(parents) = 5
and '851' = any(parents)
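The idea behind both answers is to name the split result once and then filter on that name. As a rough analogy only, the same shape can be sketched in Python; the sample rows and the function name `filter_departments` are hypothetical, and this says nothing about how many times Postgres actually evaluates the expression.

```python
# Hypothetical sample rows standing in for bx_crm.department.
rows = [
    {"parent_line": "0-1-851-88-90", "status": 0},  # kept
    {"parent_line": "0-1-851-88", "status": 0},     # only 4 parts
    {"parent_line": "0-1-852-88-90", "status": 0},  # no '851'
    {"parent_line": "0-1-851-88-91", "status": 1},  # wrong status
]


def filter_departments(rows):
    """Split parent_line once per row, then filter on the named result."""
    # The "derived table": each row gains a 'parents' column computed once.
    derived = ({**row, "parents": row["parent_line"].split("-")} for row in rows)
    return [
        row["parent_line"]
        for row in derived
        if row["status"] == 0
        and len(row["parents"]) == 5
        and "851" in row["parents"]
    ]


matches = filter_departments(rows)
print(matches)  # ['0-1-851-88-90']
```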

Record type comparison with different numbers of columns isn't failing

Why does the following query not trigger a "cannot compare record types with different numbers of columns" error in PostgreSQL 11.6?
with
s AS (SELECT 1)
, main AS (
SELECT (a) = (b) , (a) = (a), (b) = (b), a, b -- I expect (a) = (b) fails
FROM s
, LATERAL (select 1 as x, 2 as y) AS a
, LATERAL (select 5 as x) AS b
)
select * from main;
While this one does:
with
x AS (SELECT 1)
, y AS (select 1, 2)
select (x) = (y) from x, y;
See the note in the docs on row comparison:
Errors related to the number or types of elements might not occur if the comparison is resolved using earlier columns.
In this case, because a.x=1 and b.x=5, it returns false without ever noticing that the number of columns doesn't match. Change them to match, and you will get the same exception (which is also why the 2nd query does have that exception).
testdb=# with
s AS (SELECT 1)
, main AS (
SELECT a = b , (a) = (a), (b) = (b), a, b -- I expect (a) = (b) fails
FROM s
, LATERAL (select 5 as x, 2 as y) AS a
, LATERAL (select 5 as x) AS b
)
select * from main;
ERROR: cannot compare record types with different numbers of columns
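The behaviour described above can be modelled with a toy comparison function. This is only a sketch of the idea in the quoted note (compare column by column, complain about the shape only if no earlier column settled the answer); it is not PostgreSQL's actual implementation, and the names `rows_equal` and `RowSizeMismatch` are made up here.

```python
class RowSizeMismatch(Exception):
    """Stand-in for PostgreSQL's 'different numbers of columns' error."""


def rows_equal(left, right):
    """Compare two rows left to right, like the toy model described above."""
    for l_val, r_val in zip(left, right):
        if l_val != r_val:
            # An earlier column already decides the result, so the
            # mismatch in column count is never noticed.
            return False
    if len(left) != len(right):
        raise RowSizeMismatch(
            "cannot compare record types with different numbers of columns"
        )
    return True


# a = (1, 2), b = (5): the first column differs, so the result is False.
print(rows_equal((1, 2), (5,)))  # False

# a = (5, 2), b = (5): the first column matches, so the shape gets checked.
try:
    rows_equal((5, 2), (5,))
except RowSizeMismatch as exc:
    print("error:", exc)
```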

Nested conditions in sql

I have the where condition in the sql:
WHERE
( Spectrum.access.dim_member.centene_ind = 0 )
AND
(
Spectrum.access.Client_List_Groups.Group_Name IN ( 'Centene Health Plan Book of Business' )
AND
Spectrum.access.dim_member.referral_route IN ( 'Claims Data' )
AND
***(
Spectrum.access.fact_task_metrics.task = 'Conduct IHA'
AND
Spectrum.access.fact_task_metrics.created_by_name <> 'BMU, BMU'
AND
Spectrum.access.fact_task_metrics.created_date BETWEEN '01/01/2015 00:0:0' AND '06/30/2015 00:0:0'
)***
AND
***(
Spectrum.access.fact_outreach_metrics.outreach_type IN ( 'Conduct IHA' )
AND
(
Spectrum.dbo.ufnTruncDate(Spectrum.access.fact_outreach_metrics.metric_date) >= Spectrum.access.fact_task_metrics.metric_date
OR
Spectrum.access.fact_outreach_metrics.metric_date >= Spectrum.access.fact_task_metrics.created_date
)
)***
AND
Spectrum.access.fact_outreach_metrics.episode_seq = 1
AND
Spectrum.access.dim_member.reinstated_date Is Null
)
I have marked two of the conditions in the above code.
The 1st condition has 2 AND operators.
The 2nd condition has an AND and an OR operator.
Question 1: Does removing the outer brackets "(" in the 1st condition impact the results?
Question 2: Does removing the outer brackets "(" in the 2nd condition impact the results?
After removing the outer bracket the filters will look like:
Spectrum.access.dim_member.referral_route IN ( 'Claims Data' )
AND
Spectrum.access.fact_task_metrics.task = 'Conduct IHA'
AND
Spectrum.access.fact_task_metrics.created_by_name <> 'BMU, BMU'
AND
Spectrum.access.fact_task_metrics.created_date BETWEEN '01/01/2015 00:0:0' AND '06/30/2015 00:0:0'
AND
Spectrum.access.fact_outreach_metrics.outreach_type IN ( 'Conduct IHA' )
AND
(
Spectrum.dbo.ufnTruncDate(Spectrum.access.fact_outreach_metrics.metric_date) >= Spectrum.access.fact_task_metrics.metric_date
OR
Spectrum.access.fact_outreach_metrics.metric_date >= Spectrum.access.fact_task_metrics.created_date
)
AND
Spectrum.access.fact_outreach_metrics.episode_seq = 1
Appreciate your help.
Regards,
Jude
The order of operations dictates that AND is processed before OR when both appear within the same set of parentheses.
WHERE (A AND B) OR (C AND D)
Is equivalent to:
WHERE A AND B OR C AND D
But the example below:
WHERE (A OR B) AND (C OR D)
Is not equivalent to:
WHERE A OR B AND C OR D
Which really becomes:
WHERE A OR (B AND C) OR D
Technically, you should be able to safely remove the parentheses in question for both of your examples. With the AND operator, you are adding all of your conditions together to form one large condition. When using OR, you should place the parentheses carefully so that the groups are properly segmented.
Take the following examples into consideration:
a) where y = 1 AND n = 2 AND x = 3 or x = 5
b) where y = 1 AND n = 2 AND (x = 3 or x = 5)
c) where (y = 1 AND n = 2 AND x = 3) or x = 5
In example A, the intended outcome is unclear; because AND binds tighter than OR, it is evaluated the same way as example C.
In example B, the intended outcome states that all of the conditions must be met and X can be either 3 or 5.
In example C, the intended outcome states that either Y=1, N=2 and X=3 OR x=5. As long as X = 5, it doesn't matter what Y and N equal.
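The equivalences claimed above can be checked by brute force over every combination of truth values. The sketch below uses Python, whose `and` also binds tighter than `or`; it is two-valued, so it ignores SQL's NULL handling, and the helper name `equivalent` is made up for illustration.

```python
from itertools import product


def equivalent(f, g, arity):
    """True if f and g agree on every combination of True/False inputs."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=arity)
    )


# (A AND B) OR (C AND D)  vs  A AND B OR C AND D
and_groups_same = equivalent(
    lambda a, b, c, d: (a and b) or (c and d),
    lambda a, b, c, d: a and b or c and d,
    4,
)

# (A OR B) AND (C OR D)  vs  A OR B AND C OR D
or_groups_same = equivalent(
    lambda a, b, c, d: (a or b) and (c or d),
    lambda a, b, c, d: a or b and c or d,
    4,
)

# A AND (B AND C)  vs  A AND B AND C: the case of the marked conditions,
# where a group is joined to its neighbours only by AND.
nested_and_same = equivalent(
    lambda a, b, c: a and (b and c),
    lambda a, b, c: a and b and c,
    3,
)

print(and_groups_same, or_groups_same, nested_and_same)  # True False True
```

Note that the second marked condition in the question keeps its inner parentheses around the OR; only the outer pair, which sits between ANDs, is being removed.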

Theta Notation for N to the Power of Log Manipulation

In asymptotic notation for order of growth, is the form
Theta(N ^ ( ( LOGb( a / b) + 1 ) ) )
Equivalent to
Theta(N ^ (LOGb( a ) ) ) ??
Where LOGb(a) means LOG a to base b.
Since log(a/b) = log a - log b and LOGb(b) = 1, we have LOGb(a/b) + 1 = LOGb(a) - 1 + 1 = LOGb(a). No mention of asymptotics is necessary; this equality is exact for all a, b > 0 with b ≠ 1.
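Written out step by step in LaTeX, using only the quotient rule for logarithms and the fact that the logarithm of the base is 1, the exponents agree exactly:

```latex
\[
\log_b\!\left(\frac{a}{b}\right) + 1
  = \log_b a - \log_b b + 1
  = \log_b a - 1 + 1
  = \log_b a,
\qquad\text{so}\qquad
N^{\log_b(a/b) + 1} = N^{\log_b a}.
\]
```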

PIG query to pivot rows and columns with count of like rows

Trying to create SQL or PIG queries that will yield a count of each distinct value per type.
In other words, given this table:
Type: Value:
A x
B y
C y
B y
C z
A x
A z
A z
A x
B x
B z
B x
C x
I want to get the following results:
Type: x: y: z:
A 3 0 2
B 2 2 1
C 1 1 1
Additionally, a table of averages as a result would be helpful too
Type: x: y: z:
A 0.60 0.00 0.40
B 0.40 0.40 0.20
C 0.33 0.33 0.33
EDIT 4
I am a nooby at PIG, but reading 8 different stack overflow posts I came up with this.
When I use this PIG query
A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;
xx = distinct x;
y = foreach A GENERATE id_resp_h;
yy = distinct y;
yyy = group yy all;
zz = GROUP A BY (id_orig_h, id_resp_h);
B = CROSS xx, yy;
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;
dump G;
dump I;
Sort of works.......
I get this:
(0,x,y,z)
(A,3,0,2)
(B,2,2,1)
(C,1,1,1)
I can import this into a text file, strip the "(" and ")", and use it as a CSV with the schema as the first line. This sort of works, but it is SO SLOW. I would like a nicer, faster, cleaner way of doing this. If anyone out there knows of a way, please let me know.
The best I can think of would work only with Oracle, and although it wouldn't provide you with a column for each value, it would present you the data like this:
A x=3,y=3,z=3
B x=4,y=3
C y=3,z=2
of course if you have 900 values it would show:
A x=3,y=6,...,ff=12
etc...
I'm not able to add comment so I can't ask you if oracle is ok. Anyway here's the query that would achieve that:
SELECT type, vals FROM
(SELECT type, SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) vals, seq,
MAX(seq) OVER (partition by type) max_seq
FROM
(SELECT type, value, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(SELECT type, value, COUNT(*) OCC
FROM tableName
GROUP BY type, value))
START WITH seq=1
CONNECT by PRIOR
seq+1=seq
AND PRIOR
type=type)
WHERE seq = max_seq;
For the average, you need to compute the per-type total before everything else; here's the code:
SELECT * FROM
(SELECT type,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) vals,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || (OCC / TOT), ','),2) average,
seq, MAX(seq) OVER (partition by type) max_seq
FROM
(SELECT type, value, TOT, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(
SELECT type, value, TOT, COUNT(*) OCC
FROM (SELECT type, value, COUNT(*) OVER (partition by type) TOT
FROM tableName)
GROUP BY type, value, TOT
))
START WITH seq=1
CONNECT by PRIOR
seq+1=seq
AND PRIOR
type=type)
WHERE seq = max_seq;
You can do this using the vector operation UDF's in Brickhouse ( http://github.com/klout/brickhouse ) Consider that each 'value' is a dimension in a very high dimension space. You can interpret a single value instance as a vector in that dimension, with value 1. In Hive, we would represent such a vector simply as a map with a string as the key, and an int or other numeric as the value.
What you want to create is a vector which is the sum of all vectors, grouped by type. The query would be:
SELECT type,
union_vector_sum( map( value, 1 ) ) as vector
FROM table
GROUP BY type;
Brickhouse even has a normalize function, which will produce your 'averages':
SELECT type,
vector_normalize(union_vector_sum( map( value, 1 ) ))
as normalized_vector
FROM table
GROUP BY type;
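To make the "sum of vectors grouped by type" idea concrete, here is a minimal plain-Python sketch that does not use Brickhouse or Hive. It treats each row as a map {value: 1}, adds the maps per type, and then normalizes. The `normalize` helper is an assumption: it divides by the sum of the components, which is what the averages in the question need, and has not been checked against Brickhouse's own definition of vector_normalize.

```python
from collections import Counter, defaultdict

# The sample table from the question, as (type, value) pairs.
rows = [
    ("A", "x"), ("B", "y"), ("C", "y"), ("B", "y"), ("C", "z"),
    ("A", "x"), ("A", "z"), ("A", "z"), ("A", "x"), ("B", "x"),
    ("B", "z"), ("B", "x"), ("C", "x"),
]

# Sum the one-entry vectors {value: 1} per type.
sums = defaultdict(Counter)
for type_, value in rows:
    sums[type_][value] += 1


def normalize(vector):
    """Divide each component by the sum of all components (assumed L1 norm)."""
    total = sum(vector.values())
    return {key: count / total for key, count in vector.items()}


for type_ in sorted(sums):
    print(type_, dict(sums[type_]), normalize(sums[type_]))
```

A Counter reports 0 for a value that never occurred with a type, so a missing key such as y for type A corresponds to the 0 cell in the desired pivot.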
Updated code according to Edit#3 in question:
A = load '/path/to/input/file' using AvroStorage();
B = group A by (type, value);
C = foreach B generate flatten(group) as (type, value), COUNT(A) as count;
-- Now get all the values.
M = foreach A generate value;
-- Left Outer Join all the values with C, so that every type has exactly same number of values associated
N = join M by value left outer, C by value;
O = foreach N generate
C::type as type,
M::value as value,
(C::count is null ? 0 : C::count) as count; --count = 0 means value was not associated with the type
P = group O by type;
Q = foreach P {
R = order O by value asc; --Ordered by value, so values counts are ordered consistently in all the rows.
generate group as type, flatten(R.count);
}
Please note that I did not execute the code above. These are just the representational steps.