PIG query to pivot rows and columns with count of like rows - sql

I'm trying to create SQL or Pig queries that will yield counts of distinct values, broken down by type.
In other words, given this table:
Type: Value:
A x
B y
C y
B y
C z
A x
A z
A z
A x
B x
B z
B x
C x
I want to get the following results:
Type: x: y: z:
A 3 0 2
B 2 2 1
C 1 1 1
Additionally, a table of averages as a result would be helpful too
Type: x: y: z:
A 0.60 0.00 0.40
B 0.40 0.40 0.20
C 0.33 0.33 0.33
EDIT 4
I am a newbie at Pig, but after reading 8 different Stack Overflow posts I came up with this.
When I use this Pig query:
A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;
xx = distinct x;
y = foreach A GENERATE id_resp_h;
yy = distinct y;
yyy = group yy all;
zz = GROUP A BY (id_orig_h, id_resp_h);
B = CROSS xx, yy;
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;
dump G;
dump I;
It sort of works. I get this:
(0,x,y,z)
(A,3,0,2)
(B,2,2,1)
(C,1,1,1)
I can import that into a text file, strip the "(" and ")", and use it as a CSV with the schema as the first line. This sort of works, but it is SO SLOW. I would like a nicer, faster, cleaner way of doing this. If anyone out there knows of a way, please let me know.
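For the SQL half of the question, the standard trick is conditional aggregation: one SUM over a CASE expression per pivot column, in a single pass over the table. This is a sketch, not the asker's setup — the table/column names just mirror the example data, and it is shown in Python with sqlite3 so it can be run as-is:

```python
import sqlite3

# Sample data copied from the question's Type/Value table.
rows = [('A','x'),('B','y'),('C','y'),('B','y'),('C','z'),('A','x'),
        ('A','z'),('A','z'),('A','x'),('B','x'),('B','z'),('B','x'),('C','x')]

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE tablex (type TEXT, value TEXT)")
con.executemany("INSERT INTO tablex VALUES (?, ?)", rows)

# Pivot by conditional aggregation: one SUM(CASE ...) column per known value.
counts = con.execute("""
    SELECT type,
           SUM(CASE WHEN value = 'x' THEN 1 ELSE 0 END) AS x,
           SUM(CASE WHEN value = 'y' THEN 1 ELSE 0 END) AS y,
           SUM(CASE WHEN value = 'z' THEN 1 ELSE 0 END) AS z
    FROM tablex
    GROUP BY type
    ORDER BY type
""").fetchall()

# Averages: AVG of the same 0/1 expression is count divided by rows-per-type.
averages = con.execute("""
    SELECT type,
           ROUND(AVG(CASE WHEN value = 'x' THEN 1.0 ELSE 0.0 END), 2) AS x,
           ROUND(AVG(CASE WHEN value = 'y' THEN 1.0 ELSE 0.0 END), 2) AS y,
           ROUND(AVG(CASE WHEN value = 'z' THEN 1.0 ELSE 0.0 END), 2) AS z
    FROM tablex
    GROUP BY type
    ORDER BY type
""").fetchall()

print(counts)    # [('A', 3, 0, 2), ('B', 2, 2, 1), ('C', 1, 1, 1)]
print(averages)
```

The limitation is the same one the cross/join Pig dance works around: the set of pivot columns (x, y, z) must be known when the query is written.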

The best I can think of works only with Oracle, and although it wouldn't give you a column for each value, it would present the data like this:
A x=3,z=2
B x=2,y=2,z=1
C x=1,y=1,z=1
Of course, if you had 900 values it would show:
A x=3,y=6,...,ff=12
etc...
I'm not able to add comments, so I can't ask you whether Oracle is OK. Anyway, here's the query that would achieve that:
SELECT type, vals FROM
  (SELECT type,
          SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','), 2) vals,
          seq,
          MAX(seq) OVER (PARTITION BY type) max_seq
   FROM
     (SELECT type, value, OCC,
             ROW_NUMBER() OVER (PARTITION BY type ORDER BY type, value) seq
      FROM
        (SELECT type, value, COUNT(*) OCC
         FROM tableName
         GROUP BY type, value))
   START WITH seq = 1
   CONNECT BY PRIOR seq + 1 = seq
          AND PRIOR type = type)
WHERE seq = max_seq;
For the averages, the per-type total has to be computed in the innermost query before all the rest; here's the code:
SELECT * FROM
  (SELECT type,
          SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','), 2) vals,
          SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || (OCC / TOT), ','), 2) average,
          seq,
          MAX(seq) OVER (PARTITION BY type) max_seq
   FROM
     (SELECT type, value, TOT, OCC,
             ROW_NUMBER() OVER (PARTITION BY type ORDER BY type, value) seq
      FROM
        (SELECT type, value, TOT, COUNT(*) OCC
         FROM (SELECT type, value, COUNT(*) OVER (PARTITION BY type) TOT
               FROM tableName)
         GROUP BY type, value, TOT))
   START WITH seq = 1
   CONNECT BY PRIOR seq + 1 = seq
          AND PRIOR type = type)
WHERE seq = max_seq;

You can do this using the vector operation UDFs in Brickhouse ( http://github.com/klout/brickhouse ). Consider each 'value' to be a dimension in a very high-dimensional space. A single value instance can be interpreted as a vector in that dimension, with magnitude 1. In Hive, we would represent such a vector simply as a map with a string as the key and an int or other numeric as the value.
What you want to create is a vector which is the sum of all the vectors, grouped by type. The query would be:
SELECT type,
union_vector_sum( map( value, 1 ) ) as vector
FROM table
GROUP BY type;
Brickhouse even has a normalize function, which will produce your 'averages':
SELECT type,
vector_normalize(union_vector_sum( map( value, 1 ) ))
as normalized_vector
FROM table
GROUP BY type;
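Brickhouse itself can't run outside Hive, but the map-as-sparse-vector idea is easy to sketch in plain Python — the function below is my own analogue of the normalize step, not Brickhouse's API:

```python
from collections import Counter, defaultdict

# Sample (type, value) rows from the question.
rows = [('A','x'),('B','y'),('C','y'),('B','y'),('C','z'),('A','x'),
        ('A','z'),('A','z'),('A','x'),('B','x'),('B','z'),('B','x'),('C','x')]

# union_vector_sum analogue: sum the one-hot map(value, 1) vectors per type.
vectors = defaultdict(Counter)
for typ, value in rows:
    vectors[typ][value] += 1     # each row contributes the sparse vector {value: 1}

# vector_normalize analogue: divide each component by the vector's L1 norm.
def normalize(vec):
    total = sum(vec.values())
    return {k: v / total for k, v in vec.items()}

normalized = {typ: normalize(vec) for typ, vec in vectors.items()}
print(dict(vectors['A']))   # {'x': 3, 'z': 2}
print(normalized['A'])      # {'x': 0.6, 'z': 0.4}
```

Note that, as in the Hive map representation, absent dimensions simply don't appear as keys rather than showing up as explicit zeros.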

Updated code according to Edit#3 in question:
A = load '/path/to/input/file' using AvroStorage();
B = group A by (type, value);
C = foreach B generate flatten(group) as (type, value), COUNT(A) as count;
-- Now get all the distinct values.
M = foreach A generate value;
M = distinct M;
-- Left outer join the values with C, so that every type ends up associated with exactly the same set of values.
N = join M by value left outer, C by value;
O = foreach N generate
    C::type as type,
    M::value as value,
    (C::count is null ? 0 : C::count) as count; -- count = 0 means the value was not associated with the type
P = group O by type;
Q = foreach P {
    R = order O by value asc; -- ordered by value, so value counts are ordered consistently in all rows
    generate group as type, flatten(R.count);
};
Please note that I did not execute the code above. These are just the representational steps.
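As a cross-check of the logic (not of the Pig itself), the same steps — count per (type, value), zero-fill the missing pairs, and flatten the counts in a fixed value order — can be exercised in Python against the question's sample data:

```python
from collections import Counter

# Sample (type, value) rows from the question.
rows = [('A','x'),('B','y'),('C','y'),('B','y'),('C','z'),('A','x'),
        ('A','z'),('A','z'),('A','x'),('B','x'),('B','z'),('B','x'),('C','x')]

# Like C above: count per (type, value) group.
pair_counts = Counter(rows)

# Like the distinct/join steps: every type gets an entry for every distinct value.
types = sorted({t for t, _ in rows})
values = sorted({v for _, v in rows})   # fixed order, like "order O by value asc"

# Like Q: one row per type, counts flattened in consistent value order,
# with 0 where a (type, value) pair never occurred.
pivot = [(t, *[pair_counts.get((t, v), 0) for v in values]) for t in types]
print(pivot)   # [('A', 3, 0, 2), ('B', 2, 2, 1), ('C', 1, 1, 1)]
```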


Related

Cosmos DB select with where clause with max date

I want to select the item with the max date.
I have this select:
select d.MaxDate from (select max(c.ChangedDateTime) MaxDate FROM c WHERE c.IsLatest = true) d
and the result is:
[
{
"MaxDate": "2020-07-16 12:23:57"
}
]
And now I want to select the row with the max date:
select * FROM c WHERE c.IsLatest = true
AND c.ChangedDateTime = (select d.MaxDate from (select max(c.ChangedDateTime) MaxDate
FROM c WHERE c.IsLatest = true) d)
The result is empty; it should return one row with the date 2020-07-16 12:23:57.
When I do a select like this:
select * from c where c.IsLatest = true AND c.ChangedDateTime = '2020-07-16 12:23:57'
it returns exactly the one row I want, so I think there is something wrong with the subselect, because it returns an array containing an object: [{"MaxDate": "2020-07-16 12:23:57"}]
How about just sorting in descending order and selecting one row?
SELECT *
FROM c
WHERE c.IsLatest = true
ORDER BY c.ChangedDateTime DESC
OFFSET 0 LIMIT 1;
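This can't be run against Cosmos DB here, but the two approaches — a scalar MAX subquery versus sorting descending with a row limit — can be compared on an analogous SQLite table (schema and data are made up for illustration; note the DESC, without which you would get the oldest row instead of the newest):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE c (id INTEGER, ChangedDateTime TEXT, IsLatest INTEGER)")
con.executemany("INSERT INTO c VALUES (?, ?, ?)", [
    (1, '2020-07-14 08:00:00', 1),
    (2, '2020-07-16 12:23:57', 1),
    (3, '2020-07-20 09:00:00', 0),   # newer, but not IsLatest
])

# Scalar subquery form: compare against MAX(...) directly.
via_subquery = con.execute("""
    SELECT id FROM c
    WHERE IsLatest = 1
      AND ChangedDateTime = (SELECT MAX(ChangedDateTime) FROM c WHERE IsLatest = 1)
""").fetchall()

# ORDER BY ... DESC LIMIT 1 form: the DESC is what makes this the max, not the min.
via_order_by = con.execute("""
    SELECT id FROM c
    WHERE IsLatest = 1
    ORDER BY ChangedDateTime DESC
    LIMIT 1
""").fetchall()

print(via_subquery, via_order_by)   # [(2,)] [(2,)]
```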

Record type comparison with different numbers of columns isn't failing

Why does the following query not trigger a "cannot compare record types with different numbers of columns" error in PostgreSQL 11.6?
with
s AS (SELECT 1)
, main AS (
SELECT (a) = (b) , (a) = (a), (b) = (b), a, b -- I expect (a) = (b) fails
FROM s
, LATERAL (select 1 as x, 2 as y) AS a
, LATERAL (select 5 as x) AS b
)
select * from main;
While this one does:
with
x AS (SELECT 1)
, y AS (select 1, 2)
select (x) = (y) from x, y;
See the note in the docs on row comparison
Errors related to the number or types of elements might not occur if the comparison is resolved using earlier columns.
In this case, because a.x=1 and b.x=5, the comparison returns false on the first column without ever noticing that the numbers of columns don't match. Change them so the first columns are equal, and you will get the same exception (which is also why the 2nd query does raise it: its first columns are both 1, so the comparison has to look beyond them).
testdb=# with
s AS (SELECT 1)
, main AS (
SELECT a = b , (a) = (a), (b) = (b), a, b -- I expect (a) = (b) fails
FROM s
, LATERAL (select 5 as x, 2 as y) AS a
, LATERAL (select 5 as x) AS b
)
select * from main;
ERROR: cannot compare record types with different numbers of columns

SQL where/or confusion

I'm running SQL statements on a huge DB for the first time, and I have code such as:
Select x, sum(y), sum(z) from db
where n = 'xxx' or n = 'yyy' and m = int
group by x
Now if I do this
Select x, sum(y), sum(z) from db
where n = 'xxx' and m = int
group by x
Select x, sum(y), sum(z) from db
where n = 'yyy' and m = int
group by x
And then manually adding the grouped values from the two result sets together, I get different results, with the separated queries being more accurate.
E.g. the result for row 1 in the first query will be 20 million, while adding the row 1 values together from the second block of code gives something like 18 million. Not sure what the issue is...?
It's best to use parentheses when ORs are used with ANDs.
select x, sum(y), sum(z) from db
where (n = 'xxx' or n = 'yyy') and m = int
group by x
In SQL, an AND takes precedence over an OR.
So this:
where n = 'xxx' or n = 'yyy' and m = int
Is actually processed as:
where n = 'xxx' or (n = 'yyy' and m = int)
And that gets the n that are 'xxx' with any m.
Anyway, Gordon has a point: using IN for this is better, even if there are only 2 values.
Use in. Your code doesn't really make sense:
where n in ('xxx', 'yyy') and m = int
This query:
where n = 'xxx' or 'yyy' and m = int
should return an error in SQL Server because of the dangling 'yyy'. MySQL accepts this syntax; in that database, it would be processed as:
where n = 'xxx' or 'yyy' and m = int
-- AND has higher precedence than `or`
where n = 'xxx' or ('yyy' and m = int)
-- `'yyy'` is converted to an integer
where n = 'xxx' or (0 and m = int)
-- which is treated as a boolean
where n = 'xxx' or (false and m = int)
-- which is grouped like this
where n = 'xxx' or (false and (m = int))
-- which is equivalent to
where n = 'xxx'
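The precedence rules above are easy to verify on a toy table. A SQLite sketch (column names follow the question; the data and the literal m = 1 are made up):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE db (n TEXT, m INTEGER, y INTEGER)")
con.executemany("INSERT INTO db VALUES (?, ?, ?)", [
    ('xxx', 1, 10),   # matches m = 1
    ('xxx', 2, 20),   # wrong m: kept only by the unparenthesized version
    ('yyy', 1, 40),
    ('yyy', 2, 80),   # wrong m: excluded either way
])

# AND binds tighter than OR: this is n='xxx' OR (n='yyy' AND m=1),
# so every 'xxx' row is counted regardless of m.
unparenthesized = con.execute(
    "SELECT SUM(y) FROM db WHERE n = 'xxx' OR n = 'yyy' AND m = 1").fetchone()[0]

# Parenthesized: (n='xxx' OR n='yyy') AND m=1 — the m filter applies to both.
parenthesized = con.execute(
    "SELECT SUM(y) FROM db WHERE (n = 'xxx' OR n = 'yyy') AND m = 1").fetchone()[0]

print(unparenthesized, parenthesized)   # 70 50
```

The gap between the two sums (70 vs 50) is exactly the 'xxx' row with the wrong m — the same kind of inflation the asker saw in the combined query.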

Nested conditions in sql

I have the where condition in the sql:
WHERE
( Spectrum.access.dim_member.centene_ind = 0 )
AND
(
Spectrum.access.Client_List_Groups.Group_Name IN ( 'Centene Health Plan Book of Business' )
AND
Spectrum.access.dim_member.referral_route IN ( 'Claims Data' )
AND
***(
Spectrum.access.fact_task_metrics.task = 'Conduct IHA'
AND
Spectrum.access.fact_task_metrics.created_by_name <> 'BMU, BMU'
AND
Spectrum.access.fact_task_metrics.created_date BETWEEN '01/01/2015 00:0:0' AND '06/30/2015 00:0:0'
)***
AND
***(
Spectrum.access.fact_outreach_metrics.outreach_type IN ( 'Conduct IHA' )
AND
(
Spectrum.dbo.ufnTruncDate(Spectrum.access.fact_outreach_metrics.metric_date) >= Spectrum.access.fact_task_metrics.metric_date
OR
Spectrum.access.fact_outreach_metrics.metric_date >= Spectrum.access.fact_task_metrics.created_date
)
)***
AND
Spectrum.access.fact_outreach_metrics.episode_seq = 1
AND
Spectrum.access.dim_member.reinstated_date Is Null
)
I have marked two of the conditions in the above code.
The 1st condition have 2 AND operators.
The 2nd condition has an AND and an OR operator.
Question 1: Does removing the outer parentheses in the 1st condition impact the results?
Question 2: Does removing the outer parentheses in the 2nd condition impact the results?
After removing the outer parentheses, the filters will look like:
Spectrum.access.dim_member.referral_route IN ( 'Claims Data' )
AND
Spectrum.access.fact_task_metrics.task = 'Conduct IHA'
AND
Spectrum.access.fact_task_metrics.created_by_name <> 'BMU, BMU'
AND
Spectrum.access.fact_task_metrics.created_date BETWEEN '01/01/2015 00:0:0' AND '06/30/2015 00:0:0'
AND
Spectrum.access.fact_outreach_metrics.outreach_type IN ( 'Conduct IHA' )
AND
(
Spectrum.dbo.ufnTruncDate(Spectrum.access.fact_outreach_metrics.metric_date) >= Spectrum.access.fact_task_metrics.metric_date
OR
Spectrum.access.fact_outreach_metrics.metric_date >= Spectrum.access.fact_task_metrics.created_date
)
AND
Spectrum.access.fact_outreach_metrics.episode_seq = 1
Appreciate your help.
Regards,
Jude
Operator precedence dictates that AND will be processed before OR when these expressions are evaluated within the same set of parentheses.
WHERE (A AND B) OR (C AND D)
Is equivalent to:
WHERE A AND B OR C AND D
But the example below:
WHERE (A OR B) AND (C OR D)
Is not equivalent to:
WHERE A OR B AND C OR D
Which really becomes:
WHERE A OR (B AND C) OR D
Technically, you should be able to safely remove the parentheses in question in both of your examples. With AND, you are simply combining all of your conditions into one large condition. When an OR clause is involved, you should place the parentheses carefully so that the groups are properly segmented.
Take the following examples into consideration:
a) where y = 1 AND n = 2 AND x = 3 or x = 5
b) where y = 1 AND n = 2 AND (x = 3 or x = 5)
c) where (y = 1 AND n = 2 AND x = 3) or x = 5
In example A, the intended outcome is unclear (as written, it evaluates the same way as example C).
In example B, the intended outcome states that all of the conditions must be met and X can be either 3 or 5.
In example C, the intended outcome states that either Y=1, N=2, and X=3, OR X=5. As long as X=5, it doesn't matter what Y and N equal.

Counting characters in an Access database column using SQL

I have the following table
col1 col2 col3 col4
==== ==== ==== ====
1233 4566 ABCD CDEF
1233 4566 ACD1 CDEF
1233 4566 D1AF CDEF
I need to count the characters in col3, so from the data in the previous table it would be:
char count
==== =====
A 3
B 1
C 2
D 3
F 1
1 2
Is this possible to achieve by using SQL only?
At the moment I am thinking of passing a parameter in to SQL query and count the characters one by one and then sum, however I did not start the VBA part yet, and frankly wouldn't want to do that.
This is my query at the moment:
PARAMETERS X Long;
SELECT First(Mid(TEST.col3,X,1)) AS [col3 Field], Count(Mid(TEST.col3,X,1)) AS Dcount
FROM TEST
GROUP BY Mid(TEST.col3,X,1)
HAVING (((Count(Mid([TEST].[col3],[X],1)))>=1));
Ideas and help are much appreciated, as I don't usually work with Access and SQL.
You can accomplish your task in pure Access SQL by using a Numbers table. In this case, the Numbers table must contain integer values from 1 to some number larger than the longest string of characters in your source data. In this example, the strings of characters to be processed are in [CharacterData]:
CharacterList
-------------
GORD
WAS
HERE
and the [Numbers] table is simply
n
--
1
2
3
4
5
If we use a cross join to extract the characters (eliminating any empty strings that result from n exceeding Len(CharacterList))...
SELECT
Mid(cd.CharacterList, nb.n, 1) AS c
FROM
CharacterData cd,
Numbers nb
WHERE Mid(cd.CharacterList, nb.n, 1) <> ""
...we get ...
c
--
G
W
H
O
A
E
R
S
R
D
E
Now we can just wrap that in an aggregation query
SELECT c AS Character, COUNT(*) AS CountOfCharacter
FROM
(
SELECT
Mid(cd.CharacterList, nb.n, 1) AS c
FROM
CharacterData cd,
Numbers nb
WHERE Mid(cd.CharacterList, nb.n, 1) <> ""
)
GROUP BY c
which gives us
Character CountOfCharacter
--------- ----------------
A 1
D 1
E 2
G 1
H 1
O 1
R 2
S 1
W 1
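The same cross-join-against-a-Numbers-table technique carries over to other engines. Here it is reproduced in SQLite via Python, with the same [CharacterData] and [Numbers] contents (SQLite's SUBSTR plays the role of Access's Mid):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE CharacterData (CharacterList TEXT)")
con.executemany("INSERT INTO CharacterData VALUES (?)",
                [('GORD',), ('WAS',), ('HERE',)])
con.execute("CREATE TABLE Numbers (n INTEGER)")
con.executemany("INSERT INTO Numbers VALUES (?)", [(i,) for i in range(1, 6)])

# Cross join the strings against n = 1..5, slice one character per n,
# drop the empty slices where n exceeds the string length, then aggregate.
result = con.execute("""
    SELECT SUBSTR(cd.CharacterList, nb.n, 1) AS c, COUNT(*) AS CountOfCharacter
    FROM CharacterData cd, Numbers nb
    WHERE SUBSTR(cd.CharacterList, nb.n, 1) <> ''
    GROUP BY c
    ORDER BY c
""").fetchall()

print(result)
# [('A', 1), ('D', 1), ('E', 2), ('G', 1), ('H', 1), ('O', 1), ('R', 2), ('S', 1), ('W', 1)]
```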
Knowing that col3 has a fixed length of 4, this problem is quite easy.
Assume there is a view V with four columns, one for each character position in col3:
V(c1, c2, c3, c4)
Unfortunately, I'm not familiar with Access-specific SQL, but this is the general SQL statement you would need:
SELECT c, COUNT(*) FROM
(
SELECT c1 AS c FROM V
UNION ALL
SELECT c2 FROM V
UNION ALL
SELECT c3 FROM V
UNION ALL
SELECT c4 FROM V
)
GROUP BY c
It's a shame that you don't want to consider using VBA; you don't need as much as you might think:
Public charCounts As Dictionary
Sub LoadCounts(s As String)
If charCounts Is Nothing Then Init
Dim length As Integer, i As Variant
length = Len(s)
For i = 1 To length
Dim currentChar As String
currentChar = Mid(s, i, 1)
If Not charCounts.Exists(currentChar) Then charCounts(currentChar) = 0
charCounts(currentChar) = charCounts(currentChar) + 1
Next
End Sub
Sub Init()
Set charCounts = New Scripting.Dictionary
charCounts.CompareMode = TextCompare 'for case-insensitive comparisons; otherwise use BinaryCompare
End Sub
Then, you execute the query once:
SELECT LoadCounts(col3)
FROM Table1
Finally, you read out the values in the Dictionary:
Dim key As Variant
For Each key In charCounts
Debug.Print key, charCounts(key)
Next
Note that between query executions you have to call Init to clear out the old values.
Please try this; I hope it will work. Note that the CTE numbers the rows of the table itself, so it only produces enough position indexes when the table has at least as many rows as the longest name (SQL Server syntax; char and count are bracketed because they are reserved words):
with cte as
(
select row_number() over(order by (select null)) as i from Charactor_Count
)
select substring(name, i, 1) as [char], count(*) as [count]
from Charactor_Count, cte
where cte.i <= len(Charactor_Count.name)
group by substring(name, i, 1)
order by substring(name, i, 1)