Perhaps this is a pedantic distinction, but if we take the SQL grammar, in the following case from BigQuery:
query_statement:
    query_expr

query_expr:
    [ WITH [ RECURSIVE ] { non_recursive_cte | recursive_cte }[, ...] ]
    { select | ( query_expr ) | set_operation }
    [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
    [ LIMIT count [ OFFSET skip_rows ] ]

select:
    SELECT
        [ { ALL | DISTINCT } ]
        [ AS { STRUCT | VALUE } ]
        select_list
    [ FROM from_clause[, ...] ]
    [ WHERE bool_expression ]
    [ GROUP BY { expression [, ...] | ROLLUP ( expression [, ...] ) } ]
    [ HAVING bool_expression ]
    [ QUALIFY bool_expression ]
    [ WINDOW window_clause ]
It is easy to see, for example, that the WHERE clause would be the section:
[ WHERE bool_expression ]
However, what is considered the "select clause"? Is it the wrapper that contains everything? Or is it just the part that goes from SELECT up until the FROM section? Is there a way to distinguish the two?
For a specific example, how would the following be defined:
SELECT 1
FROM x
UNION ALL
SELECT 2
With the three components labeled with:
[ (full)
{ (SELECT until end or set/op)
< (SELECT until FROM):
[
{<SELECT 1> FROM x}
UNION ALL
{<SELECT 2>}
]
The SELECT statement is the whole statement from the word SELECT until the end.
The SELECT clause is from the word SELECT until the word FROM
I have two datasets that I'm trying to consolidate to represent all of the unique touch points for a given user. I've gotten as far as using ARRAY_AGG to aggregate everything down to a single session identifier, but now I want to consolidate the identifiers themselves and am stuck.
The source data looks like this:
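| Session_GUID | User_GUID | Interaction_GUID |
|--------------|-----------|------------------|
| Session_1    | User_1    | Interact_A       |
| Session_1    | User_1    | Interact_B       |
| Session_1    | User_2    | Interact_C       |
| Session_2    | User_2    | Interact_D       |
| Session_3    | User_3    | Interact_C       |
| Session_4    | User_4    | Interact_E       |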
And I've aggregated it down with a simple
SELECT
    Session_GUID,
    ARRAY_AGG(DISTINCT User_GUID) AS User_GUID_Array,
    ARRAY_AGG(DISTINCT Interaction_GUID) AS Interaction_GUID_Array
FROM
    source_table
GROUP BY
    Session_GUID
Which gets me here:
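| Session   | User_GUID_Array    | Interaction_GUID_Array                 |
|-----------|--------------------|----------------------------------------|
| Session_1 | [ User_1, User_2 ] | [ Interact_A, Interact_B, Interact_C ] |
| Session_2 | [ User_2 ]         | [ Interact_D ]                         |
| Session_3 | [ User_3 ]         | [ Interact_C ]                         |
| Session_4 | [ User_4 ]         | [ Interact_E ]                         |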
Now I'd like to aggregate everything based on matches in either of the two arrays.
So from the above, Session_1 and Session_2 get grouped together based on User_GUID matches, and Session_3 gets added too based on Interaction_GUID matches.
This seems like it should be do-able based on some sort of "do another ARRAY_AGG if these intersect/overlap conditions are met" logic. But I'm at the limits of my SQL knowledge and haven't been able to figure it out.
The end result I'm looking for is this:
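| Session_Array                       | User_GUID_Array            | Interaction_GUID_Array                             |
|-------------------------------------|----------------------------|----------------------------------------------------|
| [ Session_1, Session_2, Session_3 ] | [ User_1, User_2, User_3 ] | [ Interact_A, Interact_B, Interact_C, Interact_D ] |
| [ Session_4 ]                       | [ User_4 ]                 | [ Interact_E ]                                     |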
Grouping transitively on matches in more than one column usually calls for a recursive CTE, and in this case the match is an array intersection rather than simple equality. One way to accomplish this is with a user-defined table function (UDTF) that maintains two arrays of arrays, one for each column. As each row goes through, the function checks whether it has seen any of the row's values before (in either of the two structures). If it has seen at least one value in one of the groups, it returns that group's number; otherwise it starts a new group. A CTE can then use those group numbers for the array_union_agg aggregation.
This approach only works for small partitions. In this example the partition is the constant 1, which means the entire table is a single partition. If the table is large, the UDTF will run out of memory, so in practice you need a partition key such as a date or an ID of some sort that keeps each partition small (a few thousand rows, perhaps).
create or replace function GROUP_ANY(ARR1 array, ARR2 array)
returns table(GROUP_NUMBER float)
language javascript
as
$$
{
    initialize: function (argumentInfo, context) {
        this.isInitialized = false;
        this.groupNumber = 0;   // index of the most recently created group
        this.arr1 = [1];        // placeholder; slot 0 is overwritten by the first row
        this.arr2 = [1];
    },
    processRow: function (row, rowWriter, context) {
        var arraysIntersect;
        var g;
        if (!this.isInitialized) {
            // The first row of the partition seeds group 0.
            this.isInitialized = true;
            this.arr1[0] = row.ARR1;
            this.arr2[0] = row.ARR2;
        } else {
            arraysIntersect = false;
            // Look for the first existing group that shares a value in either array.
            for (g = 0; g <= this.groupNumber; g++) {
                if (arraysOverlap(this.arr1[g], row.ARR1) || arraysOverlap(this.arr2[g], row.ARR2)) {
                    // Fold this row's values into the matching group.
                    this.arr1[g] = this.arr1[g].concat(row.ARR1);
                    this.arr2[g] = this.arr2[g].concat(row.ARR2);
                    arraysIntersect = true;
                    break;
                }
            }
            if (!arraysIntersect) {
                // No overlap with any known group: start a new one.
                this.arr1.push(row.ARR1);
                this.arr2.push(row.ARR2);
                this.groupNumber++;
            }
        }
        if (arraysIntersect) {
            rowWriter.writeRow({GROUP_NUMBER: g});
        } else {
            rowWriter.writeRow({GROUP_NUMBER: this.groupNumber});
        }

        // True if the two arrays have at least one element in common.
        function arraysOverlap(arr1, arr2) {
            return arr1.some(r => arr2.includes(r));
        }
    },
    finalize: function (rowWriter, context) {/*...*/},
}
$$;
create or replace table T1(Session_GUID string, User_GUID string, Interaction_GUID string);
insert into T1 (Session_GUID, User_GUID, Interaction_GUID) values
('Session_1', 'User_1', 'Interact_A'),
('Session_1', 'User_1', 'Interact_B'),
('Session_1', 'User_2', 'Interact_C'),
('Session_2', 'User_2', 'Interact_D'),
('Session_3', 'User_3', 'Interact_C'),
('Session_4', 'User_4', 'Interact_E'),
('Session_5', 'User_5', 'Interact_F'),
('Session_6', 'User_4', 'Interact_G'),
('Session_7', 'User_6', 'Interact_E'),
('Session_8', 'User_8', 'Interact_H')
;
with SESSIONS as
(
    select Session_GUID
          ,array_unique_agg(User_GUID) USER_GUID
          ,array_unique_agg(Interaction_GUID) INTERACTION_GUID
    from T1
    group by Session_GUID
), GROUPS as
(
    select * from SESSIONS, table(group_any(USER_GUID, INTERACTION_GUID)
        over (partition by 1 order by SESSION_GUID))
)
select array_agg(SESSION_GUID) SESSION_GUIDS
      ,array_union_agg(USER_GUID) USER_GUIDS
      ,array_union_agg(INTERACTION_GUID) INTERACTION_GUIDS
from GROUPS
group by GROUP_NUMBER
;
Output:
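| SESSION_GUIDS                             | USER_GUIDS                       | INTERACTION_GUIDS                                          |
|-------------------------------------------|----------------------------------|------------------------------------------------------------|
| [ "Session_5" ]                           | [ "User_5" ]                     | [ "Interact_F" ]                                           |
| [ "Session_1", "Session_2", "Session_3" ] | [ "User_1", "User_2", "User_3" ] | [ "Interact_A", "Interact_B", "Interact_C", "Interact_D" ] |
| [ "Session_8" ]                           | [ "User_8" ]                     | [ "Interact_H" ]                                           |
| [ "Session_4", "Session_6", "Session_7" ] | [ "User_4", "User_6" ]           | [ "Interact_E", "Interact_G" ]                             |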
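One caveat worth checking against your own data: as far as I can tell from reading processRow, each row is assigned to the first group it overlaps, and two existing groups are never merged when a later row turns out to connect them, so the result can depend on row order. If fully transitive grouping matters, a union-find (disjoint-set) structure is one way to get it. What follows is only a sketch in plain JavaScript, run outside Snowflake on the un-aggregated rows; groupSessions and the row shape { session, user, interaction } are names made up for illustration, not part of the answer above.

```javascript
// Sketch: group sessions that share any user or interaction, transitively.
// groupSessions and the row shape are hypothetical, chosen for this example.
function groupSessions(rows) {
  const parent = new Map(); // union-find forest over string keys

  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    let root = x;
    while (parent.get(root) !== root) root = parent.get(root);
    // Path compression: point every node on the walk straight at the root.
    while (parent.get(x) !== root) {
      const next = parent.get(x);
      parent.set(x, root);
      x = next;
    }
    return root;
  };

  const union = (a, b) => {
    const ra = find(a);
    const rb = find(b);
    if (ra !== rb) parent.set(ra, rb);
  };

  // Prefix keys so a session, a user and an interaction with the same
  // text can never be mistaken for one another.
  for (const r of rows) {
    union("s:" + r.session, "u:" + r.user);
    union("s:" + r.session, "i:" + r.interaction);
  }

  // Collect the members of each connected component.
  const groups = new Map();
  for (const r of rows) {
    const root = find("s:" + r.session);
    if (!groups.has(root)) {
      groups.set(root, { sessions: new Set(), users: new Set(), interactions: new Set() });
    }
    const g = groups.get(root);
    g.sessions.add(r.session);
    g.users.add(r.user);
    g.interactions.add(r.interaction);
  }

  return [...groups.values()].map((g) => ({
    sessions: [...g.sessions].sort(),
    users: [...g.users].sort(),
    interactions: [...g.interactions].sort(),
  }));
}

// Session c shares a user with a and an interaction with b,
// so all three sessions belong to one group.
const bridged = groupSessions([
  { session: "a", user: "U1", interaction: "I1" },
  { session: "b", user: "U2", interaction: "I2" },
  { session: "c", user: "U1", interaction: "I2" },
]);
console.log(bridged.length); // prints 1
```

Because every union happens before any group is read back, the order of the input rows does not change which sessions end up together. The whole partition still has to fit in memory, so the size caveat above applies here as well.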
In SQL Server, a function is defined as follows:
-- Transact-SQL Scalar Function Syntax
CREATE [ OR ALTER ] FUNCTION [ schema_name. ] function_name
( [ { @parameter_name [ AS ][ type_schema_name. ] parameter_data_type
[ = default ] [ READONLY ] }
[ ,...n ]
]
)
RETURNS return_data_type
[ WITH <function_option> [ ,...n ] ]
[ AS ]
BEGIN
function_body
RETURN scalar_expression
END
[ ; ]
Where return_data_type can be text, a table (with a slightly different syntax), or almost any other data type.
Is it possible to retrieve the return data type without running the query?
I know it's possible to do using sp_describe_first_result_set, but this executes the query and looks at the response. Edit: I was wrong. It is done through static analysis, but has a number of limitations associated with it.
As mentioned in comments, you can use sp_describe_first_result_set.
Or you can use the query from the linked duplicate and extend it with INFORMATION_SCHEMA.ROUTINE_COLUMNS:
SELECT r.ROUTINE_NAME AS FunctionName,
r.DATA_TYPE AS FunctionReturnType,
rc.COLUMN_NAME AS ColumnName,
rc.DATA_TYPE AS ColumnType
FROM INFORMATION_SCHEMA.ROUTINES r
LEFT JOIN INFORMATION_SCHEMA.ROUTINE_COLUMNS rc
  ON rc.TABLE_SCHEMA = r.ROUTINE_SCHEMA AND rc.TABLE_NAME = r.ROUTINE_NAME
WHERE r.ROUTINE_TYPE = 'FUNCTION'
ORDER BY r.ROUTINE_NAME, rc.ORDINAL_POSITION;
That will give you the return information for both scalar-valued functions and table-valued functions, including schema information for the TVF result set.
I have trouble understanding the CREATE TABLE syntax as shown on MSDN.
I guess that [] means that something is optional and | means an alternative, so
CREATE TABLE
[ database_name . [ schema_name ] . | schema_name . ] table_name
means that you can actually use:
1. CREATE TABLE table_name
2. CREATE TABLE database_name.schema_name.table_name
3. CREATE TABLE database_name.table_name
4. CREATE TABLE schema_name.table_name
But what about {} or (), as in the following?
CREATE TABLE
    [ database_name . [ schema_name ] . | schema_name . ] table_name
    [ AS FileTable ]
    ( { <column_definition>
        | <computed_column_definition>
        | <column_set_definition>
        | [ <table_constraint> ]
        | [ <table_index> ]
      [ ,...n ] }
      [ PERIOD FOR SYSTEM_TIME ( system_start_time_column_name
           , system_end_time_column_name ) ]
    )
    [ ON { partition_scheme_name ( partition_column_name )
           | filegroup
           | "default" } ]
    [ TEXTIMAGE_ON { filegroup | "default" } ]
    [ FILESTREAM_ON { partition_scheme_name
           | filegroup
           | "default" } ]
    [ WITH ( <table_option> [ ,...n ] ) ]
[ ; ]
First of all you should check Transact-SQL Syntax Conventions:
| (vertical bar): Separates syntax items enclosed in brackets or braces. You can use only one of the items.
[ ] (brackets): Optional syntax items. Do not type the brackets.
{ } (braces): Required syntax items. Do not type the braces.
Now, as for creating table you could use:
CREATE TABLE table_name
CREATE TABLE database_name..table_name
CREATE TABLE database_name.schema_name.table_name
CREATE TABLE schema_name.table_name
So your CREATE TABLE database_name.table_name is incorrect. You have to use the second form, database_name..table_name, instead. When you leave the schema out between the two dots, the table is created inside your default schema (most likely dbo).
As for the second question, how to read the MSDN documentation, probably the best way is a visual one.
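If it helps to see the bracket and bar rules applied mechanically, here is a rough sketch that expands the name part of the grammar into every form it allows. The helpers lit, seq, alt and opt are hypothetical names invented for this example; they are not part of any SQL Server tooling.

```javascript
// Tiny combinators mirroring the syntax conventions (hypothetical helpers):
//   lit(s)      a literal piece of text
//   seq(a, b)   a followed by b
//   alt(a, b)   a | b   (exactly one of the alternatives)
//   opt(a)      [ a ]   (either nothing or a)
// Each combinator returns a function that produces every string it can match.
const lit = (s) => () => [s];
const seq = (...parts) => () =>
  parts.reduce((acc, p) => acc.flatMap((a) => p().map((b) => a + b)), [""]);
const alt = (...choices) => () => choices.flatMap((c) => c());
const opt = (p) => alt(lit(""), p);

// [ database_name . [ schema_name ] . | schema_name . ] table_name
const tableName = seq(
  opt(
    alt(
      seq(lit("database_name."), opt(lit("schema_name")), lit(".")),
      lit("schema_name.")
    )
  ),
  lit("table_name")
);

const forms = tableName();
for (const f of forms) {
  console.log("CREATE TABLE " + f);
}
```

The expansion yields exactly four names: table_name, database_name..table_name, database_name.schema_name.table_name and schema_name.table_name. There is no way to produce database_name.table_name, because the second dot sits outside the optional [ schema_name ].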
In the MSDN Library or on the TechNet website, Microsoft tends to use a pseudo-syntax to explain how to use T-SQL statements with all available options. Here is a sample taken from the TechNet page on UPDATE STATISTICS:
UPDATE STATISTICS table_or_indexed_view_name
[
{
{ index_or_statistics__name }
| ( { index_or_statistics_name } [ ,...n ] )
}
]
[ WITH
[
FULLSCAN
| SAMPLE number { PERCENT | ROWS }
| RESAMPLE
| <update_stats_stream_option> [ ,...n ]
]
[ [ , ] [ ALL | COLUMNS | INDEX ]
[ [ , ] NORECOMPUTE ]
] ;
<update_stats_stream_option> ::=
[ STATS_STREAM = stats_stream ]
[ ROWCOUNT = numeric_constant ]
[ PAGECOUNT = numeric_contant ]
How do I properly read such a description and quickly figure out what is required, what is optional, and a clean way to write my query?
You should refer to the Transact-SQL Syntax Conventions article.
The first table in that article explains pretty much everything.
In your example we can see the following:
UPDATE STATISTICS table_or_indexed_view_name
UPDATE STATISTICS is the keyword used
table_or_indexed_view_name is the name of the table or the view to update statistics for
[
{
{ index_or_statistics__name }
| ( { index_or_statistics_name } [ ,...n ] )
}
]
This part is optional [ ], but if supplied, you have to give either a single statistics name { index_or_statistics__name } or (after the |) a parenthesized list of statistics names separated by commas ( { index_or_statistics_name } [ ,...n ] ).
[ WITH
[
FULLSCAN
| SAMPLE number { PERCENT | ROWS }
| RESAMPLE
| <update_stats_stream_option> [ ,...n ]
]
[ [ , ] [ ALL | COLUMNS | INDEX ]
[ [ , ] NORECOMPUTE ]
] ;
This is optional too []. If used, you must begin with WITH, and you can then choose one of 4 options:
1. FULLSCAN
2. SAMPLE number { PERCENT | ROWS }, where you have to supply the number and choose either PERCENT or ROWS
3. RESAMPLE
4. <update_stats_stream_option> [ ,...n ], which is a list separated by commas
Then you can choose one of ALL, COLUMNS or INDEX, preceded by a comma if you have used one of the options above.
Lastly, you can add NORECOMPUTE, again with a comma before it if you have used any other option before it.
<update_stats_stream_option> ::=
[ STATS_STREAM = stats_stream ]
[ ROWCOUNT = numeric_constant ]
[ PAGECOUNT = numeric_contant ]
This is the list of predefined options you may use wherever <update_stats_stream_option> appears (option 4 above).
Anything between square brackets [...] is optional.
Anything separated by the pipe | symbol is a one-or-the-other choice.
In your above example, you could read it as
UPDATE STATISTICS table_or_indexed_view_name
[ optionally specify an index as well]
[ optionally specify options using WITH
If you use WITH then you can follow it with one of the following keywords
FULLSCAN
OR SAMPLE number { PERCENT | ROWS }
OR RESAMPLE
].. and so on
I'm trying to document some SQL and want to get the terminology right. Suppose you write SQL like this:
select child.ID, parent.ID
from hierarchy child
inner join hierarchy parent
on child.parentId = parent.ID
Then you have one actual table ('hierarchy') which you are giving two names ('parent' and 'child'). My question is about how you refer to the logical entity of a table with a name.
What would you write in the blank here for the name?
"This query uses one table (hierarchy) but two _ (child and parent)"
[edit] left a previous draft in the question. now corrected.
I believe this is called a SELF JOIN. A and B (or "child" and "parent", I think you have a typo in your question) are called ALIASes or TABLE ALIASes.
The concept is a self join. However, the a is a syntax error. The table is hierarchy, the alias is child.
I would call each part of a self join an instance.
In the SQL Server docs, the term is table_source :
Specifies a table, view, or derived table source, with or without an alias, to use in the Transact-SQL statement
In the BNF grammar, it's:
<table_source> ::=
{
table_or_view_name [ [ AS ] table_alias ] [ <tablesample_clause> ]
[ WITH ( < table_hint > [ [ , ]...n ] ) ]
| rowset_function [ [ AS ] table_alias ]
[ ( bulk_column_alias [ ,...n ] ) ]
| user_defined_function [ [ AS ] table_alias ] [ (column_alias [ ,...n ] ) ]
| OPENXML <openxml_clause>
| derived_table [ AS ] table_alias [ ( column_alias [ ,...n ] ) ]
| <joined_table>
| <pivoted_table>
| <unpivoted_table>
| @variable [ [ AS ] table_alias ]
| @variable.function_call ( expression [ ,...n ] ) [ [ AS ] table_alias ] [ (column_alias [ ,...n ] ) ]
}
'child', 'parent'
The term used in the SQL-92 Standard spec is "correlation name", being a type of "identifier".
'hierarchy'
The term used in the SQL-92 Standard spec is "table".
Hence the answer to your (edited) question is:
This query uses one table (hierarchy)
but two correlation names (child and
parent).