Compute aggregate on insert and update rather than on select - sql

In MS SQL Server 2016, I have a view that contains a query that looks something like this:
SELECT Col1, Col2, MAX(Col3) FROM TableA
WHERE Col1 = ???
GROUP BY Col1, Col2
I use this view all over the place, and as a result, most of my queries seem very inefficient, having to do an Index Scan on TableA on almost every SELECT in my system.
What I'd like to know is whether there is a way to store that MAX(Col3) so that it is computed during INSERTs and UPDATEs instead of during SELECTs.
Here are some thoughts and why I rejected them:
I don't think I can use a clustered index on the view: indexed views don't allow MAX (SQL Server only permits SUM and COUNT_BIG as aggregates in an indexed view).
I don't see how I can use a filtered view.
I don't want to use triggers.

I would start with an index on TableA(Col1, Col2, Col3).
Depending on the size of the table and the number of matches on Col1, this should be quite reasonable performance-wise.
Otherwise, you might need a summary table that is maintained using triggers.
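A minimal sketch of that index (the index name is made up):
CREATE NONCLUSTERED INDEX IX_TableA_Col1_Col2_Col3
    ON TableA (Col1, Col2, Col3);
With this in place, the view's GROUP BY Col1, Col2 with MAX(Col3) can be answered with a seek on Col1 followed by a scan of just the matching index range, instead of a scan of the whole table.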

Related

How to avoid evaluating the same calculated column in a Hive query repeatedly

Let's say I have a calculated column:
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I 'fix' the column calculation only once and access its value multiple times in the query? The map being calculated is the same, only different keys are being accessed for different columns. Performing the same calculation repeatedly is a waste of resources. This example is purposely made too simple, but the point is I want to know how to avoid this kind of redundancy in Hive in general.
In general, use subqueries; they are calculated once.
select map_col["k1"] as col1,
       map_col["k2"] as col2,
       map_col["k3"] as col3
from
(
    select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
) s;
You can also materialize a query into a table to reuse the dataset across different queries or workflows.
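As a rough sketch of that materialization (the table name, storage format, and source table here are made-up placeholders):
create table parsed_kv stored as orc as
select str_to_map("k1:1,k2:2,k3:3") as map_col
from some_source_table;
Any later query can then read map_col from parsed_kv without recomputing str_to_map.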

What's the advised way to group by 2 attributes from a Hive table with 1000 columns?

Given a Hive table with 1000 columns:
col1, col2, ..., col1000
The source table contains billions of rows, and the size is about 1PB.
I only need to query 3 columns:
select col1, col2, sum(col3) as col3
from myTable
group by
col1, col2
Would it be advisable to do a subquery first, and then send the result to the GROUP BY aggregation, so that much smaller files are sent to the group-by stage? I'm not sure if Hive automatically takes care of this.
select col1, col2, sum(col3) as col3
from
(select col1, col2, col3
from myTable
) a
group by
col1, col2
Behind the scenes it shouldn't really matter if you do a sub-query or not, but you can look at the explain plan of each query to see if you notice any differences between them.
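For instance, a quick sketch of how to inspect a plan (using the question's table name as-is):
explain
select col1, col2, sum(col3) as col3
from myTable
group by col1, col2;
Run the same EXPLAIN on the subquery version; if the two plans match, Hive has simply merged the subquery away.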
The ideal situation would be for your table to be stored in a columnar format. If a lot of queries like this will be run in the future, I would make sure the table is stored as Parquet files, which use columnar storage and will give you excellent query performance.
If it isn't in this format, you can create a new table using a CREATE TABLE ... AS SELECT (CTAS) statement.
create table yourNewParquetTable stored as parquet as select * from yourOldTable;
In general, there is no reason to use a subquery here. You basically have two cases:
First, Hive could store/fetch all the columns together. In that case, Hive needs to read all the data in all columns either for the subquery or for the aggregation.
Otherwise, Hive could store/fetch only the columns you need. In that case, Hive would do so for either version.
That said, there is a reason to avoid the subquery in some databases. MySQL materializes subqueries -- meaning they are stored as if they were temporary tables. This is unnecessary overhead and a good reason to avoid unnecessary subqueries in MySQL. Hive does not do that: it compiles the query into a data flow and executes the data flow.

Oracle using wrong index for select query?

So I have a table with two indexes:
Index_1: Column_A, Column_B
Index_2: Column_A, Column_B, Column_C
I am running a select query:
select * from table
where (Column_A, Column_B, Column_C)
in (('1','2','3'), ('4','5','6'), ...);
When using EXPLAIN PLAN FOR in SQL Developer, it seems to be using the first index instead of the second, despite the second matching the columns in my query.
Why is this? And is it hindering performance?
Expanding on my comment, although we can't analyze Oracle's query planning without knowing anything about the data or seeing the actual plan, the three-column index is not necessarily better suited for your query than is the two-column index, at least if the base table has additional columns (which you are selecting) beyond those three.
Oracle is going to need to read the base table anyway to get the other columns. Supposing that the values in column_C are not too correlated with the values in column_A and column_B, the three-column index will be a lot larger than the two-column index. Using the two-column index may therefore involve reading fewer blocks overall, especially if that index is relatively selective.
Oracle has a very good query planner. If it has good table statistics to work with then it is probably choosing a good plan. Even without good statistics it will probably do a fine job for a query as simple as yours.
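If the statistics are stale, regathering them is a cheap first step; a minimal sketch, with placeholder schema and table names:
exec dbms_stats.gather_table_stats(ownname => 'MYSCHEMA', tabname => 'MY_TABLE', cascade => true);
The cascade => true argument regathers statistics for the table's indexes as well.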
I had a similar problem: according to the explain plan, Oracle used an index with 2 columns, while my query selects 20 columns from the table and has a WHERE clause on 5 columns, as below:
from tab1
where col1 = 'A'
and col2 = 'b'
and col3 = 'c'
and col4 = 'd'
and col5 >= 10
Index1: col1, col2
Index2: col1, col2, col3, col4, col5
If I add a hint to use index2, the query executes much faster than with index1... what can be done so that Oracle chooses index2?
I tried to make sure statistics are gathered, but the system still picks index1 as the best one to use.
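For reference, this is the sort of hint being described (the alias t is a placeholder; the hint names the index directly):
select /*+ index(t index2) */ *
from tab1 t
where col1 = 'A'
  and col2 = 'b'
  and col3 = 'c'
  and col4 = 'd'
  and col5 >= 10;
Note that a hint pins the plan even when the data changes, so getting accurate statistics (including histograms on the filtered columns) is usually preferable to hard-coding the index.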

SQL distinct on multiple columns but not all the columns

I am trying to write a merge query to read some data from table A and merge it into table B.
While reading from A, I want to make sure that the rows are distinct based on a few columns, so I have a query something like:
select distinct col1, col2, col3 from A.
However, I want the uniqueness to be on the (col1, col2) pair. The above query would still get me rows which are unique for (col1, col2, col3) but not for (col1, col2).
Is there any other way to achieve this, or a way to use something like
select distinct(col1,col2),col3 from A.
Also,
I expect that A will have many records, ~10-20 million. Moreover, I am planning to iterate over multiple such tables A [each of which is itself generated on the fly from a few joins]. I hope the solution's performance won't be any worse, but in either case I am looking for a solution that works first; then I can run the queries in parts.
The version is Oracle 10g.
Thanks!
Update: I am wondering whether GROUP BY would invoke a sorting operation, which would be more expensive than using DISTINCT?
Update: Answering one of the questions below: With (A, B, C) and (A, B, D) rows, what is your expected result?
=> Assuming I want distinct on (A, B), it is fine to return either (A, B, C) or (A, B, D), but not both.
If there is more than one distinct value in col3 for a specific (col1, col2) pair, you'll have to choose one of them. I've used MAX here (but you can use any of Oracle's aggregate functions):
select col1, col2, max(col3)
from A
group by col1, col2
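A tiny worked example of the asker's (A, B, C) / (A, B, D) scenario (the literals are just illustrative data):
select col1, col2, max(col3) as col3
from (
    select 'A' as col1, 'B' as col2, 'C' as col3 from dual
    union all
    select 'A', 'B', 'D' from dual
)
group by col1, col2;
-- returns a single row: A, B, D ('D' wins under string ordering)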

Index data that exists in multiple tables

I want to query some data in multiple tables as if it were one table. There are about 10 tables, all with different columns but with 5 matching columns. Ideally I would re-design the tables so that the shared columns go into one table and I could then create relations between the other tables. Unfortunately this is not an option as I can’t change the existing tables.
What would be the best approach for accessing and indexing the data? I was thinking of creating a view or stored procedure with UNION ALLs e.g.
SELECT COL1, COL2, COL3 FROM TABLE1
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE2
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE3
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE4
But then how would I index this? Would a view or stored procedure be best? Or perhaps a completely different approach?
SQL Server has indexed views, but it doesn't support UNION in an indexed view.
If you have to query data over all tables as if it were one, and you can't change the current schema, then I'd suggest using triggers to maintain another single table that can be indexed etc.
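A rough sketch of that trigger-based approach, assuming a hypothetical consolidated table named CombinedTable holding the 5 shared columns; each source table would get triggers like this one:
CREATE TRIGGER TR_TABLE1_Sync ON TABLE1
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- copy newly inserted rows into the consolidated, indexable table
    INSERT INTO CombinedTable (COL1, COL2, COL3, SourceTable)
    SELECT COL1, COL2, COL3, 'TABLE1' FROM inserted;
END;
UPDATE and DELETE triggers would be needed as well to keep the copy consistent, which is why this is usually a last resort.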
A view with UNION ALL would do the job. You should index the individual tables that participate in the view. This can actually benefit performance (the optimizer may figure out that some tables do not need to be scanned for particular queries).
A stored proc means complications, because it is not so easy to use one in a FROM clause (though it is still possible with OPENQUERY, e.g. SELECT * FROM OPENQUERY(YourServer, 'SET FMTONLY OFF; EXEC stored_proc')). A proc performs better when generating your results involves some procedural logic (e.g. a complicated report).
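For completeness, a minimal sketch of the view-plus-base-table-indexes approach from the answer above (the view and index names are made up, and only two of the ten tables are shown):
CREATE VIEW dbo.CombinedView AS
SELECT COL1, COL2, COL3 FROM TABLE1
UNION ALL
SELECT COL1, COL2, COL3 FROM TABLE2;

-- index the participating base tables, not the view
CREATE NONCLUSTERED INDEX IX_TABLE1_COL1 ON TABLE1 (COL1) INCLUDE (COL2, COL3);
CREATE NONCLUSTERED INDEX IX_TABLE2_COL1 ON TABLE2 (COL1) INCLUDE (COL2, COL3);
Queries that filter the view on COL1 can then seek each underlying index rather than scanning the base tables.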