Index data that exists in multiple tables

I want to query some data in multiple tables as if it were one table. There are about 10 tables, all with different columns but with 5 matching columns. Ideally I would re-design the tables so that the shared columns go into one table and I could then create relations between the other tables. Unfortunately this is not an option as I can’t change the existing tables.
What would be the best approach for accessing and indexing the data? I was thinking of creating a view or stored procedure with UNION ALLs e.g.
SELECT COL1, COL2, COL3 FROM TABLE1
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE2
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE3
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE4
But then how would I index this? Would a view or stored procedure be best? Or perhaps a completely different approach?

SQL Server has indexed views, but it doesn't support UNIONs in an indexed view.
If you have to query data over all tables as if it were one, and you can't change the current schema, then I'd suggest using triggers to maintain another single table that can be indexed etc.
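A minimal sketch of that approach (all table, trigger, and column names here are illustrative; you'd also need UPDATE and DELETE triggers to keep the copy fully in sync):

-- consolidated table holding the 5 shared columns plus the source table name
CREATE TABLE CombinedData (
    SourceTable sysname NOT NULL,
    Col1 int, Col2 int, Col3 int, Col4 int, Col5 int
);

-- one trigger per source table keeps CombinedData in sync on insert
CREATE TRIGGER trg_Table1_Insert ON Table1
AFTER INSERT
AS
BEGIN
    INSERT INTO CombinedData (SourceTable, Col1, Col2, Col3, Col4, Col5)
    SELECT 'Table1', Col1, Col2, Col3, Col4, Col5
    FROM inserted;
END;

You can then index CombinedData like any ordinary table.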

A view with UNION ALL would do the job. You should index the individual tables that participate in the view. That can actually benefit performance (the optimizer may figure out that some tables do not need to be scanned for particular queries).
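A sketch of that setup (this assumes your queries filter on COL1; the view name, index names, and INCLUDE columns are illustrative):

CREATE VIEW dbo.AllData AS
SELECT COL1, COL2, COL3 FROM TABLE1
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE2;

-- index the participating tables, not the view itself
CREATE INDEX IX_TABLE1_COL1 ON TABLE1 (COL1) INCLUDE (COL2, COL3);
CREATE INDEX IX_TABLE2_COL1 ON TABLE2 (COL1) INCLUDE (COL2, COL3);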
A stored proc means complications, because it is not so easy to use one in a FROM clause (though still possible with OPENQUERY, e.g. SELECT * FROM OPENQUERY(YourServer, 'SET FMTONLY OFF; EXEC stored_proc')). It performs better when generating your results involves some procedural logic (e.g. a complicated report).

Related

How to avoid evaluating the same calculated column in a Hive query repeatedly

Let's say I have a calculated column:
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I 'fix' the column calculation only once and access its value multiple times in the query? The map being calculated is the same, only different keys are being accessed for different columns. Performing the same calculation repeatedly is a waste of resources. This example is purposely made too simple, but the point is I want to know how to avoid this kind of redundancy in Hive in general.
In general, use subqueries; they are calculated once.
select map_col["k1"] as col1,
map_col["k2"] as col2,
map_col["k3"] as col3
from
(
select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
)s;
You can also materialize a query into a table, to reuse the dataset across different queries or workflows.
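For example, a sketch of materializing the computed map (the table name here is illustrative):

-- compute the map once and store it
create table map_col_materialized as
select str_to_map("k1:1,k2:2,k3:3") as map_col;

-- later queries read the stored result instead of recomputing it
select map_col["k1"] as col1,
       map_col["k2"] as col2,
       map_col["k3"] as col3
from map_col_materialized;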

Compute aggregate on insert and update rather than on select

In MS Sql Server 2016, I have a view that contains a query that looks something like this:
SELECT Col1, Col2, MAX(Col3) FROM TableA
WHERE Col1 = ???
GROUP BY Col1, Col2
I use this view all over the place, and as a result, most of my queries seem very inefficient, having to do an Index Scan on TableA on almost every SELECT in my system.
What I'd like to know is whether there is a way to store that MAX(Col3) so that it is computed during INSERTs and UPDATEs instead of during SELECTs.
Here are some thoughts and why I rejected them:
I don't think I can use a clustered index on the view, because SQL Server doesn't allow MAX() aggregates in an indexed view.
I don't see how I can use a filtered view.
I don't want to use triggers.
I would start with an index on TableA(Col1, Col2, Col3).
Depending on the size of the table and the number of matches on Col1, this should be quite reasonable performance-wise.
Otherwise, you might need a summary table that is maintained using triggers.
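A sketch of the suggested index (names taken from the question; the index name is illustrative):

-- covering index: the WHERE clause seeks on Col1, and Col2/Col3
-- are then available for the GROUP BY and MAX without touching the table
CREATE INDEX IX_TableA_Col1_Col2_Col3 ON TableA (Col1, Col2, Col3);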

What's the advised way to group by 2 attributes from a Hive table with 1000 columns?

Given a Hive table with 1000 columns:
col1, col2, ..., col1000
The source table contains billions of rows, and the size is about 1PB.
I only need to query 3 columns,
select col1, col2, sum(col3) as col3
from myTable
group by
col1, col2
Would it be advisable to do a subquery first, and then send the result to the GROUP BY aggregation, so that much smaller files go to the GROUP BY? I'm not sure if Hive automatically takes care of this.
select col1, col2, sum(col3) as col3
from
(select col1, col2, col3
from myTable
) a
group by
col1, col2
Behind the scenes it shouldn't really matter whether you use a sub-query or not, but you can look at the explain plan of each query to see if you notice any differences between them.
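For example, to compare the plans, just prefix each version with EXPLAIN:

explain
select col1, col2, sum(col3) as col3
from myTable
group by col1, col2;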
The ideal situation would be for your table to be stored in a columnar format, so if a lot of queries like this will be run in the future, I would ensure that your table is stored as Parquet files, which use columnar storage and will give you excellent query performance.
If it isn't in this format, you can create a new table using a CREATE TABLE ... AS SELECT statement:
create table yourNewParquetTable stored as parquet as select * from yourOldTable;
In general, there is no reason to use a subquery in this situation. You basically have two situations:
First, Hive could store/fetch all the columns together. In that case, Hive needs to read all the data in all columns either for the subquery or for the aggregation.
Otherwise, Hive could store/fetch only the columns you need. In that case, Hive would do so for either version.
That said, there is a reason to avoid the subquery in some databases. MySQL materializes subqueries, meaning they are stored as if they were temporary tables. This is unnecessary overhead and a good reason to avoid unnecessary subqueries in MySQL. Hive does not do that: it compiles the query into a data flow and executes the data flow.

Way to get all table and column names from a T-SQL query

I have a very large query (it uses more than 100 tables and thousands of columns), and I need to get a list of all columns and tables used in this query. Maybe there is already some software or script to achieve that?
For example:
Query:
SELECT t1.Col1, t1.Col2, t1.Col3, t2.Col4, t2.Coln
FROM TableName1 t1
JOIN TableName2 t2 ON t1.Col1 = t2.Col4
Output should be:
TableName1
Col1
Col2
Col3
TableName2
Col4
Coln
I agree with the comments asking why you need such information and why a query would need to query so many tables. That aside, this is still solvable.
Some low-tech ways would be:
If all tables referenced in the query are schema-qualified (like dbo.table_name), which they should be for best performance, search for your schema in a text editor, or write a small shell script to parse the query text for dbo.
Extract the execution plan of the query and pull the list of referenced objects out of its XML.
But this is what I would do. This is probably the simplest and most reliable. Put the query into a stored procedure, install the procedure, and check its dependencies.
In SQL Server Management Studio, if you right-click a stored procedure, there is a 'View Dependencies' menu item which will show you which tables it relies on. If you need a more programmatic answer, you can extract the dependencies from DMVs (sys.dm_*).
select re.referenced_schema_name + '.' + re.referenced_entity_name as table_name
from sys.dm_sql_referenced_entities('dbo.PROC_NAME', 'OBJECT') re
where re.referenced_minor_id = 0;
This also returns a row for any functions the query calls. If needed, I'm sure you can filter those out by joining to a DMV that breaks down object types.
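If you also need the column list from the question, the same DMF exposes column-level references via referenced_minor_name; a sketch (this assumes the procedure is installed as dbo.PROC_NAME):

select re.referenced_schema_name + '.' + re.referenced_entity_name as table_name,
       re.referenced_minor_name as column_name
from sys.dm_sql_referenced_entities('dbo.PROC_NAME', 'OBJECT') re
where re.referenced_minor_id > 0  -- minor_id > 0 means a column-level reference
order by table_name, column_name;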

Unique rows in duplicate data contained in one table

I have a table in an Oracle DB that stores transaction batches uploaded by users. A new upload mechanism has been implemented and I would like to compare its results. A single batch was uploaded using the original mechanism and then the new mechanism. I am trying to find unique rows (i.e. rows that existed in the first upload but did not exist or are different in the second upload, or rows that are non-existent in the first upload but do exist or are different in the second). I am dealing with a huge data set (over a million records) and that makes this analysis very difficult.
I have tried several approaches:
SELECT col1, col2 ...
FROM table
WHERE upload_id IN (first_upload_ID, second_upload_id)
GROUP BY col1, col2..
HAVING COUNT(*) = 1;
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id;
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID;
Both of these approaches returned several hundred thousand rows, making the results difficult to analyze.
Does anyone have any suggestions on how to approach/simplify this problem? Could I do a self-join on several columns that are unique for each upload? If so, what would that self-join look like?
Thank you for the help.
One method that might be useful is to calculate a hash of each record and run a match based on that. It doesn't have to be some super-secure SHA-whatever, just the regular Oracle ORA_HASH(), as long as you get a pretty small chance of hash collisions. ORA_HASH ought to be sufficient with a max_bucket_size of 4,294,967,295.
I'd just run joins between the two sets of hashes. Hash joins (as in the join mechanism) are very efficient.
Alternatively you could join the two data sets in their entirety, and as long as you're using equi-joins and only projecting the identifying ROWIDs from the data sets, it would be broadly equivalent performance-wise, because hashes would be computed on the join columns but only the ROWIDs would have to be stored as well, keeping the hash table size small. The tricky part there is dealing with NULLs in the join.
When doing a join like this, make sure not to include the column containing the upload ID, or any audit data added to the uploaded data; restrict the join to the columns that contain the uploaded data itself. With that restriction, the MINUS approach should work well too.
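A minimal sketch of the hash comparison (the table name batch_table, the '|' delimiter, the NVL defaults, and the three-column list are all placeholders; extend the concatenation to every uploaded-data column):

-- rows from the first upload with no matching row in the second;
-- swap the upload IDs to compare in the other direction
SELECT f.*
FROM  (SELECT t.*,
              ORA_HASH(NVL(TO_CHAR(col1), '~') || '|' ||
                       NVL(TO_CHAR(col2), '~') || '|' ||
                       NVL(TO_CHAR(col3), '~')) AS row_hash
       FROM batch_table t
       WHERE upload_id = :first_upload_id) f
LEFT JOIN
      (SELECT ORA_HASH(NVL(TO_CHAR(col1), '~') || '|' ||
                       NVL(TO_CHAR(col2), '~') || '|' ||
                       NVL(TO_CHAR(col3), '~')) AS row_hash
       FROM batch_table
       WHERE upload_id = :second_upload_id) s
  ON f.row_hash = s.row_hash
WHERE s.row_hash IS NULL;

The delimiter and NVL defaults are there so that adjacent values (and NULLs) can't collapse into the same concatenated string before hashing.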