Let's say I have a calculated column:
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I perform the calculation only once and access its value multiple times within the query? The map being calculated is the same; only different keys are accessed for different columns, so performing the same calculation repeatedly is a waste of resources. This example is deliberately oversimplified, but the point is that I want to know how to avoid this kind of redundancy in Hive in general.
In general, use subqueries; they are calculated once.
select map_col["k1"] as col1,
map_col["k2"] as col2,
map_col["k3"] as col3
from
(
select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
) s;
You can also materialize a query into a table to reuse the dataset across different queries or workflows.
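A minimal sketch of that idea; the staging and source table names here are assumptions:
-- materialize the computed map once
create table map_stage as
select str_to_map("k1:1,k2:2,k3:3") as map_col
from some_source_table;
-- later queries read the stored map instead of recomputing it
select map_col["k1"] as col1,
map_col["k2"] as col2,
map_col["k3"] as col3
from map_stage;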
For background, I am querying a large database with hundreds of attributes per table and millions of rows. I'm trying to figure out which columns have no values at all.
I can do the following:
SELECT
COUNT(DISTINCT col1) as col1,
COUNT(DISTINCT col2) as col2,
...
COUNT(DISTINCT coln) as coln
FROM table;
Whenever a count of 1 is returned, I know that column has no values, great. The issue is that this is incredibly tedious to retype hundreds of times. How can I do this in a more efficient manner? I only have a fundamental understanding of SQL, and the limited capabilities of Athena make this more difficult for me. Thank you
Edit: Just to clarify, the reason the count needs to be 1 is that this database uses empty strings rather than NULLs.
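One way to cut down the retyping, sketched here with placeholder database and table names, is to have Athena generate the column list from its information_schema and paste the output into the final query:
-- emits one COUNT(DISTINCT ...) line per column; trim the final comma
SELECT 'COUNT(DISTINCT ' || column_name || ') AS ' || column_name || ','
FROM information_schema.columns
WHERE table_schema = 'your_database'
  AND table_name = 'your_table';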
In MS SQL Server 2016, I have a view that contains a query that looks something like this:
SELECT Col1, Col2, MAX(Col3) FROM TableA
WHERE Col1 = ???
GROUP BY Col1, Col2
I use this view all over the place, and as a result, most of my queries seem very inefficient, having to do an Index Scan on TableA on almost every SELECT in my system.
What I'd like to know is whether there a way to store that MAX(Col3) so that it is computed during INSERTs and UPDATEs instead of during SELECTs?
Here are some thoughts and why I rejected them:
I don't think I can use a clustered index on the view, because "MAX(Col3)" is not deterministic.
I don't see how I can use a filtered view.
I don't want to use triggers.
I would start with an index on tableA(col1, col2, col3).
Depending on the size of the table and the number of matches on col1, this should be quite reasonable performance-wise.
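A minimal sketch of that suggested index; the index name and schema prefix are assumptions:
CREATE NONCLUSTERED INDEX IX_TableA_Col1_Col2_Col3
    ON dbo.TableA (Col1, Col2, Col3);
With the grouping columns and Col3 all in the index key, the MAX(Col3) for each (Col1, Col2) pair can be answered from the index alone, without scanning the base table.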
Otherwise, you might need a summary table that is maintained using triggers.
Given a Hive table with 1000 columns:
col1, col2, ..., col1000
The source table contains billions of rows, and the size is about 1PB.
I only need to query 3 columns:
select col1, col2, sum(col3) as col3
from myTable
group by
col1, col2
Would it be advisable to do a subquery first and then feed the result into the GROUP BY aggregation, so that much smaller data is sent to the group-by stage? I'm not sure if Hive automatically takes care of this.
select col1, col2, sum(col3) as col3
from
(select col1, col2, col3
from myTable
) a
group by
col1, col2
Behind the scenes it shouldn't really matter whether you use a subquery or not, but you can look at the explain plan of each query to see if you notice any differences between them.
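For instance, prefixing each variant with EXPLAIN shows the plan Hive compiles:
-- compare this output against the plan for the subquery variant
explain
select col1, col2, sum(col3) as col3
from myTable
group by col1, col2;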
The ideal situation would be for your table to be stored in a columnar format. If a lot of queries like this will be run in the future, I would make sure the table is stored as Parquet files, which use columnar storage and will give you excellent query performance.
If it isn't in this format, you can create a new table with a CREATE TABLE ... AS SELECT statement:
create table yourNewParquetTable stored as parquet as select * from yourOldTable;
In general, there is no reason to use a subquery in this situation. You basically have two situations:
First, Hive could store/fetch all the columns together. In that case, Hive needs to read all the data in all columns either for the subquery or for the aggregation.
Otherwise, Hive could store/fetch only the columns you need. In that case, Hive would do so for either version.
That said, there is a reason to avoid the subquery in some databases. MySQL materializes subqueries, meaning they are stored as if they were temporary tables. This is unnecessary overhead and a good reason to avoid unnecessary subqueries in MySQL. Hive does not do that: it compiles the query into a data flow and executes the data flow.
I have a table in an Oracle DB that stores transaction batches uploaded by users. A new upload mechanism has been implemented and I would like to compare its results. A single batch was uploaded using the original mechanism and then the new mechanism. I am trying to find unique rows (i.e., rows that existed in the first upload but are missing or different in the second, and rows that are missing from the first upload but present or different in the second). I am dealing with a huge data set (over a million records), which makes this analysis very difficult.
I have tried several approaches:
SELECT col1, col2 ...
FROM table
WHERE upload_id IN (first_upload_ID, second_upload_id)
GROUP BY col1, col2..
HAVING COUNT(*) = 1;
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id;
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID;
Both of these approaches returned several hundred thousand rows, making the results difficult to analyze.
Does anyone have any suggestions on how to approach or simplify this problem? Could I do a self join on several columns that are unique for each upload? If so, what would that self join look like?
Thank you for the help.
One method that might be useful is to calculate a hash of each record and run a match based on that. It doesn't have to be some super-secure SHA-whatever, just the regular Oracle Ora_Hash(), as long as the chance of hash collisions is small enough. Ora_Hash ought to be sufficient with a max_bucket value of 4,294,967,295.
I'd just run joins between the two sets of hashes. Hash joins (as in the join mechanism) are very efficient.
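A rough sketch of that hash comparison; the table name, the bind variables, the delimiter, and the concatenated column list (which must cover every uploaded-data column) are all assumptions:
-- rows from the first upload whose hash finds no match in the second
SELECT a.row_hash
FROM  (SELECT ORA_HASH(col1 || '|' || col2 || '|' || col3) AS row_hash
       FROM   uploads
       WHERE  upload_id = :first_upload_id) a
LEFT JOIN
      (SELECT ORA_HASH(col1 || '|' || col2 || '|' || col3) AS row_hash
       FROM   uploads
       WHERE  upload_id = :second_upload_id) b
  ON  a.row_hash = b.row_hash
WHERE b.row_hash IS NULL;
Running the same query with the two upload ids swapped gives the differences in the other direction.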
Alternatively, you could join the two data sets in their entirety. As long as you're using equi-joins and only projecting the identifying rowids from the data sets, it would be broadly equivalent performance-wise, because hashes would be computed on the join columns while only the rowids would have to be stored alongside them, keeping the hash table size small. The tricky part there is in dealing with nulls in the join.
When doing a join like this, make sure not to include the upload-id column or any audit data added to the uploaded data; restrict the joins to the columns that contain the uploaded data itself. With that restriction, the MINUS approach should work well too.
I want to query some data in multiple tables as if it were one table. There are about 10 tables, all with different columns but with 5 matching columns. Ideally I would re-design the tables so that the shared columns go into one table and I could then create relations between the other tables. Unfortunately this is not an option as I can’t change the existing tables.
What would be the best approach for accessing and indexing the data? I was thinking of creating a view or stored procedure with UNION ALLs e.g.
SELECT COL1, COL2, COL3 FROM TABLE1
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE2
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE3
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE4
But then how would I index this? Would a view or stored procedure be best? Or perhaps a completely different approach?
SQL Server has indexed views but it doesn't support UNIONs in the view.
If you have to query data over all tables as if it were one, and you can't change the current schema, then I'd suggest using triggers to maintain another single table that can be indexed etc.
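A hypothetical sketch of that trigger approach; all object names are assumptions, only INSERT is handled here, and UPDATE/DELETE would need similar triggers:
-- single combined table over the shared columns (three shown, as in the question)
CREATE TABLE dbo.Combined (
    SourceTable sysname NOT NULL,
    COL1 int, COL2 int, COL3 int
);
GO
-- keep dbo.Combined in sync when rows are inserted into TABLE1
CREATE TRIGGER trg_TABLE1_Insert ON dbo.TABLE1
AFTER INSERT
AS
BEGIN
    INSERT INTO dbo.Combined (SourceTable, COL1, COL2, COL3)
    SELECT 'TABLE1', COL1, COL2, COL3 FROM inserted;
END;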
A view with UNION ALL would do the job. You should index the individual tables that participate in the view. This can actually benefit performance (the optimizer may figure out that some tables do not need to be scanned for particular queries).
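For instance, a minimal sketch with assumed object names; the view stays a plain view while the base tables carry the indexes:
CREATE VIEW dbo.AllTables AS
SELECT COL1, COL2, COL3 FROM TABLE1
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE2;
GO
-- index each participating table on the columns you filter by
CREATE NONCLUSTERED INDEX IX_TABLE1_COL1 ON TABLE1 (COL1) INCLUDE (COL2, COL3);
CREATE NONCLUSTERED INDEX IX_TABLE2_COL1 ON TABLE2 (COL1) INCLUDE (COL2, COL3);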
A stored proc means complications because it is not so easy to use in a FROM clause (though it is still possible with OPENQUERY, e.g. SELECT * FROM OPENQUERY(YourServer, 'SET FMTONLY OFF; EXEC stored_proc')). It performs better when generating your results involves some procedural logic (e.g. a complicated report).