Hive order by not visible column - hive

Let's say I have table test with column a,b and c and test2 with same column. Can I create a view of table test and test 2 joined together and ordered by field c from table test without showing it in final output? In my case:
CREATE VIEW AS test_view AS
SELECT a,b FROM (SELECT * FROM test ORDER BY c)
JOIN test2 ON test.a =test2.a;
Ok I test it and it is not possible because shuffle phase so maybe there is another solution to somehow do it? Table are too big to do broadcast join.
Of course I can do
CREATE VIEW AS test_view AS
SELECT a,b,c FROM test
JOIN test2 ON test.a =test2.a
ORDER BY c
and then
CREATE VIEW AS final_view AS
SELECT a,b FROM test_view;
But this solution is very not optimal
Any idea?

Is this what you are looking at?
CREATE VIEW AS test_view AS
SELECT a,b FROM
(SELECT * FROM
test t1 JOIN test2 t2
ON test.a =test2.a
ORDER BY t1.c
) abc;

Consider view the same as a table on the upper level. Select from it returns not ordered dataset, only order by (or distribute+sort) in the upper query guarantees order. If you materialize this view as a table sorted, the result of select from it is not guaranteed to be sorted because the table is being read in parallel and rows returned not in the "original order". Classical theory says that ordered tables is a violation of 1st NF.
Ordered view cannot be created in a database that conforms SQL:2003.
Order by inside view makes no sense. Hence order by invisible column in a view makes no sense as well. Use ORDER BY in the upper query instead. Only order by (or distribute+sort) in the upper query guarantees order.

I end up with
CREATE VIEW AS test_view AS
SELECT a,b,c FROM test
JOIN test2 ON test.a =test2.a
ORDER BY c
and then
CREATE VIEW AS final_view AS
SELECT a,b FROM test_view;
It may not be optimal but it is only guarant way to get everything ordered and in my case (about 4-5 joins) it is much easier to get it on first sight. Of course we can also create test_view as temp view but this is just polishing.
But maybe some of You have some other working solution - feel free to post it:)

Related

Can I leverage BigQuery (BQ) partition via a join?

I am a Tableau designer, and we are building some views that get filtered by category a lot. Because of this, we tried to create a category_id that would serve as partition. The problem seems to be that if I filter data category only, the partition doesn't get used and the total table GB and cost gets hit.
Our team is trying to see if this could be minimized by using a nested query as follows:
SELECT *
FROM table a
INNER JOIN (
SELECT DISTINCT category_id, category
FROM table
) b
ON a.category_id = b.category_id
WHERE b.category = 'Category A'
The idea is that we could show the user b.category, they select it in Tableau and then the inner join would kick off the partition and limit the bytes returned. When I try this in the BQ interface, the estimated returned size comes back the same.
You'll need to filter on the partitioned field before you make the inner join.
I haven't used tableau before so don't know if this is possible but just an idea. You could create a parameter which is set by the chosen category in tableau, which could be referenced in the where statement of the partitioned table?
SELECT *
FROM table a
INNER JOIN (
SELECT DISTINCT category_id, category
FROM table
Where category = #chosen_category
) b
ON a.category_id = b.category_id;
When you say that your attempts to filter only by category, the partition isn't used, have you actually tested querying the table from the console to test whether the partition is being used or not. If it isn't then you need to look at the partition, but if it is, then you would need to take another look at your Tableau query.
VizQL (Viz query language) is Tableau's sql parser that converts your Tableau viz into SQL for execution, so whilst you cannot really modify the outgoing SQL, you can at least capture it and test which enables you to identify poor performing calculations and/or vizzes, as well as optimise the backend for the queries that Tableau will send.
I've written an article about this here: https://datawonders.atlassian.net/wiki/spaces/TABLEAU/pages/1290600449/Let+s+Talk+Errors+Tuning+6+minute+read
The thing about Tableau is that it treats the source as a derived table, with all filters being placed at the upper-level of the query immediately before the stream,
so your query:
Select *
From table a
Join (
Select Distinct Category_ID, Category
From table
)b On a.category_id = b.category_id
Where b.category = 'Category A'
Will actually look like this (assuming you just select everything):
Select a1.*
From (
Select *
From table a
Join (
Select Distinct Category_ID, Category
From table
)b On a.category_id = b.category_id
)a1
Where a1.category = <your selected category>
So you can see from here that being two-levels deep, your Category table just won't be hit, instead everything shall be read into the spool, the join taking place in tempdb, and only the complete set is filtered immediately before streaming to Tableau.
Bad, underperforming sql it most certainly is.
And this is where the relational method of v2020.2 comes into play, as this has been designed to treat each table as a separate exclusive entity, joins are only made at execution time, so you could build a view that uses data from table a where you are using table b to provide the filtering.
As an alternative, and my preferred overall method is to switch entirely to Custom SQL, utilising this with parameters, as this will enable you to craft and test your own sql to create your own high-performance, low-loading query, but as parameters are parsed before the query is executed, you can place the filtering deep down in the query without the need for a secondary look-up table or filtered derived statement - a select distinct as you are currently using it is still going to produce a large plan, as unless the category column is indexed, the engine shall still need to read every record from the table.
So using parameters, your new query will look something like:
Select a1.*
From (
Select *
From table a
Join lookup_table b On On a.category_id = b.category_id
And b.category = <parameters.pCategory>
)a1
(I've placed the filter condition directly onto the join as this can improve performance in some circumstances, though this actually shouldn't make much difference)
And when used in conjunction with the Set parameter action, you can now use parameters as in/out updateable variables which shall update as the user interacts directly with the viz, instead of the user needing to manually update as they go. If you haven't used these before, I wrote an article about it here: https://community.tableau.com/s/news/a0A4T00000313S0UAI/psst-have-you-had-a-go-with-variables-in-tableau-yet
Steve

BigQuery: Select column if it exists, else put NULL?

I am updating a daily dashboard. Assume everyday I have two tables TODAY and BEFORE_TODAY. What I have been doing daily was something like:
SELECT a, b FROM TODAY
UNION ALL
SELECT a,b FROM BEFORE_TODAY;
TODAY table is generated daily and is appended to all the data before it. Now, I need to generate a new column say c and in order to UNION ALL the two, I need that to be available on BEFORE_TODAY as well.
How can I add a conditional statement to BEFORE_TODAY to check if I have a c column and use that column, else use NULL instead of it.
Something is wrong with your data model. You should be putting the data into separate partitions of the same table. Then you would have no problems. You could just query the master table for the partitions you want. The column would appear in the history, with a NULL value.
That is the right solution. You can create a hacked solution, assuming that your tables have a primary key. For instance, if a, b is unique on each row, you could do:
select t.a, t.b, t.c
from today t
union all
select b.a, b.b,
(select c -- not qualified on purpose
from before_today b2
where b2.a = b.a and b2.b = b.b
) as
from before_today b cross join
(select null as c) c;
If b.c does not exist, then the subquery returns c.c. If it does exist this it returns b2.c from the subquery.
Although by combining dynamic SQL and INFORMATION_SCHEMA, you can write a script to achieve what you told, this doesn't seems to be the right thing to do.
As part of your data model planning, you should add column c to BEFORE_TODAY ahead of time. The newly added column on existing rows will always have NULL value. Then you can add column to TODAY and reference column c as normal.

SQL SELECT query where the IDs were already found

I have 2 tables:
Table A has 3 columns (for example) with opportunity sales header data:
OPP_ID, CLOSE_DTTM, STAGE
Table B has 3 columns with the individual line items for the Opportunities:
OPP_LINE_ID, OPP_ID, AMOUNT_USD
I have a select statement that correctly parses through Table A and returns a list of Opportunities. What I would like to do is, without joining the data, to have a SELECT statement that will get data from Table B but only for the OPP_IDs that were found in my first query.
The result should be 2 views/resultset (one for each select query) and not just 1 combined view where Table B is joined to Table A.
The reason why I want to keep them separate is because I will have to perform a few manipulations to the result from table B and i don't want the result from table A affected.
Subquery is all what you need
SELECT OPP_ID, CLOSE_DTTM, STAGE
From table a
where a.opp_id IN (Select opp_id from table b)
Presuming you're using this in some client side data access library that represents B's data in some 2 dimensional collection and you want to manipulate it without affecting/ having A's data present in that collection:
Identify the records in A:
SELECT * FROM a WHERE somecolumn = 'somevalue'
Identify the records in B that relate to A, but don't return A's data:
SELECT b.* FROM a JOIN b ON a.opp_id = b.opp_id WHERE a.somecolumn = 'somevalue'
Just because JOIN is used doesn't mean your end-consuming program has to know about A's data. You could also use IN, like the other answer does, but internally the database will rewrite them to be the same thing anyway
I tend to use exists for this type of query:
select b.*
from b
where exists (select 1 from a where a.opp_id = b.opp_id);
If you want two results sets, you need to run two queries. It is unclear what the second query is, perhaps the first query on A.

I would like a simple example of a sub-query using T-SQL 2008

Can anyone give me a good example of a subquery using TSQL 2008?
Maximilian Mayer believes that, due to referencing MS documentation, my assertion that there is a difference between a subquery and a subSelect is incorrect. Frankly, I'd consider MSDN's "Subquery Fundamentals" a better choice. Quote:
You are making distinctions between terms that actually mean the same.
O RLY?
A subQUERY...
IE:
WHERE id IN (SELECT n.id FROM TABLE n)
OR id = (SELECT MAX(m.id) FROM TABLE m)
OR EXISTS(SELECT 1/0 FROM TABLE) --won't return a math error for division by zero
...affects the WHERE or HAVING clauses -- the filteration of data -- for a SELECT, INSERT, UPDATE or DELETE statement. The value from a subquery is never directly visible in the SELECT clause.
A subSELECT...
IE:
SELECT t.column,
(SELECT x.col FROM TABLE x) AS col2
FROM TABLE t
...does not affect the filteration of data in the main query, and the value is exposed directly in the SELECT clause. But it's only one value - you can't return two or more columns into a single column in the outer query.
A subselect is a consistent means of performing a LEFT JOIN in ANSI-89 join syntax - if there is no supporting row, the column will be null. Additionally, a non-correlated subselect will return the same value for every row of the main query.
Correlation
If a subquery or subselect is correlated, that query runs once for every record of the main query returned -- which doesn't scale well as the number of rows in the result set increases.
Derived Table/Inline View
IE:
SELECT x.*,
y.max_date,
y.num
FROM TABLE x
JOIN (SELECT t.id,
t.num,
MAX(t.date) AS max_date
FROM TABLE t
GROUP BY t.id, t.num) y ON y.id = x.id
...is a JOIN to a derived table (AKA inline view).
"Inline view" is a better term, because that is all that happens when you reference a non-materialized view -- a view is just a prepared SQL statement. There's no performance or efficiency difference if you create a view with a query like the one in the example, and reference the view name in place of the SELECT statement within the brackets of the JOIN. The example has the same information as a correlated subquery, but the performance benefit of using a join and none of the subquery detriments. And you can return more than one column, because it is a view/derived table.
Conclusion
It should be obvious why I and others make distinctions. The concept of relying on the word "subquery" to categorize any SELECT statement that isn't the main clause is fatality flawed, because it's also a specific case under a categorization of the same word (IE: subquery-subselect, subquery-subquery, subquery-join...). Now think of helping someone who says "I've got a problem with a subquery..."
Maximilian Mayer's idea of "official" documentation was written by technical writers, who often have no experience in the subject and are only summarizing what they've been told to from knowledgeable people who have simplified things. Ultimately, it's just text on a page or screen -- like what you're reading now -- and the decision is up to you if the details I've laid out make sense to you.
For variety's sake, here's one in the where clause:
select
a.firstname,
a.lastname
from
employee a
where
a.companyid in (
select top 10
c.companyid
from
company c
where
c.num_employees > 1000
)
...returns all employees in the top ten companies with over 1000 employees.
SELECT
*,
(SELECT TOP 1 SomeColumn FROM dbo.SomeOtherTable)
FROM
dbo.MyTable
SELECT a.*, b.*
FROM TableA AS a
INNER JOIN
(
SELECT *
FROM TableB
) as b
ON a.id = b.id
Thats a normal subquery, running once for the whole result set.
On the other hand
SELECT a.*, (SELECT b.somecolumn FROM TableB AS b WHERE b.id = a.id)
FROM TableA AS a
is a correlated subquery, running once for every row in the result set.

Best way to perform dynamic subquery in MS Reporting Services?

I'm new to SQL Server Reporting Services, and was wondering the best way to do the following:
Query to get a list of popular IDs
Subquery on each item to get properties from another table
Ideally, the final report columns would look like this:
[ID] [property1] [property2] [SELECT COUNT(*)
FROM AnotherTable
WHERE ForeignID=ID]
There may be ways to construct a giant SQL query to do this all in one go, but I'd prefer to compartmentalize it. Is the recommended approach to write a VB function to perform the subquery for each row? Thanks for any help.
I would recommend using a SubReport. You would place the SubReport in a table cell.
Depending on how you want the output to look, a subreport could do, or you could group on ID, property1, property2 and show the items from your other table as detail items (assuming you want to show more than just count).
Something like
select t1.ID, t1.property1, t1.property2, t2.somecol, t2.someothercol
from table t1 left join anothertable t2 on t1.ID = t2.ID
#Carlton Jenke I think you will find an outer join a better performer than the correlated subquery in the example you gave. Remember that the subquery needs to be run for each row.
Simplest method is this:
select *,
(select count(*) from tbl2 t2 where t2.tbl1ID = t1.tbl1ID) as cnt
from tbl1 t1
here is a workable version (using table variables):
declare #tbl1 table
(
tbl1ID int,
prop1 varchar(1),
prop2 varchar(2)
)
declare #tbl2 table
(
tbl2ID int,
tbl1ID int
)
select *,
(select count(*) from #tbl2 t2 where t2.tbl1ID = t1.tbl1ID) as cnt
from #tbl1 t1
Obviously this is just a raw example - standard rules apply like don't select *, etc ...
UPDATE from Aug 21 '08 at 21:27:
#AlexCuse - Yes, totally agree on the performance.
I started to write it with the outer join, but then saw in his sample output the count and thought that was what he wanted, and the count would not return correctly if the tables are outer joined. Not to mention that joins can cause your records to be multiplied (1 entry from tbl1 that matches 2 entries in tbl2 = 2 returns) which can be unintended.
So I guess it really boils down to the specifics on what your query needs to return.
UPDATE from Aug 21 '08 at 22:07:
To answer the other parts of your question - is a VB function the way to go? No. Absolutely not. Not for something this simple.
Functions are very bad on performance, each row in the return set executes the function.
If you want to "compartmentalize" the different parts of the query you have to approach it more like a stored procedure. Build a temp table, do part of the query and insert the results into the table, then do any further queries you need and update the original temp table (or insert into more temp tables).