Oracle query optimization on a partitioned table - sql

I have a table in Oracle with around 55 million records, partitioned on a date column.
The table stores around 600,000 records per day, based on some position.
Now, some analytic functions are used in one SELECT query in a procedure, e.g. LEAD, LAG, ROW_NUMBER() OVER (PARTITION BY col1, date ORDER BY col1, date), and the query is taking too much time due to the PARTITION BY and ORDER BY clauses on the date column.
Is there any alternative way to optimize the query?

Have you considered using a materialized view where you store the results of your analytical functions?
More information about MVs
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm
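A minimal sketch of that idea, assuming a hypothetical table named trades with columns pos_dt, col1, and val (substitute your own schema):

```sql
-- Hypothetical sketch: trades, pos_dt, col1 and val are placeholders
-- for your own table and columns. The expensive analytic functions are
-- computed once and stored, so readers query the MV instead.
CREATE MATERIALIZED VIEW mv_trades_analytics
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT pos_dt,
       col1,
       val,
       LAG(val)        OVER (PARTITION BY col1, pos_dt ORDER BY col1, pos_dt) AS prev_val,
       LEAD(val)       OVER (PARTITION BY col1, pos_dt ORDER BY col1, pos_dt) AS next_val,
       ROW_NUMBER()    OVER (PARTITION BY col1, pos_dt ORDER BY col1, pos_dt) AS rn
FROM trades;
```

Since the table gains roughly 600,000 rows per day, you could refresh the MV once after each daily load (e.g. with DBMS_MVIEW.REFRESH), and the procedure then reads precomputed prev_val/next_val/rn columns instead of re-sorting 55 million rows.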

Related

Optimize SELECT MAX(timestamp) query

I would like to run this query about once every 5 minutes to be able to run an incremental query to MERGE to another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the query? Or is it just the columnar nature of BigQuery will make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS view within your dataset.
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds metadata at the rate of one record per partition. It is therefore far smaller than your table, so querying it is an easy way to cut your query costs (and it is also much faster to query).
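To connect this back to the incremental MERGE the question describes, a hedged sketch (table names, the id key, and the watermark handling are all placeholders; in a real pipeline you would persist the watermark between runs and filter the source with the previous run's value, and LAST_MODIFIED_TIME only approximates the newest row timestamp):

```sql
-- Hypothetical sketch using BigQuery scripting. Reads the freshness
-- watermark from cheap partition metadata instead of scanning myTable.
DECLARE watermark TIMESTAMP DEFAULT (
  SELECT MAX(LAST_MODIFIED_TIME)
  FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
  WHERE TABLE_NAME = 'myTable'
);

MERGE `project.dataset.target` AS t
USING (
  SELECT *
  FROM `project.dataset.myTable`
  WHERE timestamp > watermark   -- in practice: the PREVIOUS run's watermark
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.timestamp = s.timestamp
WHEN NOT MATCHED THEN
  INSERT ROW;
```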

COUNT(*) queries in Hive behaving differently when ran with UNION ALL

I ran two queries to get count of records for two different dates from a Hive managed table partitioned on process date field.
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01' --returned 2 million
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02' --returned 3 million
But if I ran the below query with a UNION ALL clause, the counts returned are different from that of above mentioned individual queries.
SELECT '2018-01-01', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01'
UNION ALL
SELECT '2018-01-02', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02'
What can be the root cause for this difference?
One of our teammates helped us identify the issue.
When we run a single COUNT() query, the query is not physically executed against the table; rather, the count is taken from table statistics.
One remedy is to collect the statistics on the table again; then COUNT() on a single table will reflect the actual count.
Regards,
Anoop
I too faced a similar issue with count(*) returning an incorrect count. I added the statements below to my code, and the counts are consistent now.
For a non-partitioned table, use:
ANALYZE TABLE your_table_name COMPUTE STATISTICS;
For a partitioned table, analyze the recently added partition by specifying the partition value:
ANALYZE TABLE your_table_name
PARTITION(your_partition_name=your_partition_value)
COMPUTE STATISTICS;
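Alternatively, Hive can be told to stop answering COUNT queries from statistics altogether, at the cost of a real scan. A sketch, reusing the table from the question:

```sql
-- Force Hive to compute counts by scanning the data rather than
-- answering from (possibly stale) table/partition statistics.
SET hive.compute.query.using.stats=false;

SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01';
```

This avoids the inconsistency without re-analyzing after every load, but makes every COUNT pay the scan cost, so keeping statistics fresh is usually the better long-term fix.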

ROWNUM or ROWID in google bigquery

Porting some stuff to BigQuery, I have come across an issue.
We have a bunch of data with no unique key value. Unfortunately, some report logic requires a unique value for each row.
So in systems like Oracle I would just use the ROWNUM or ROWID pseudocolumns.
In Vertica, which doesn't have those pseudocolumns, I would use ROW_NUMBER() OVER(). But in BigQuery that fails with the error:
'dataset:bqjob_r79e7b4147102bdd7_0000016482b3957c_1': Resources exceeded during query execution: The query could not be executed in the allotted memory.
OVER() operator used too much memory..
The value does not have to be persistent, just unique within the query results.
I would like to avoid extract-process-reload if possible.
So is there any way to assign a unique value to query result rows in BigQuery SQL?
Edit: Sorry, I should have clarified. I am using standard SQL, not legacy.
For ROW_NUMBER() OVER() to scale, you'll need to use PARTITION.
See https://stackoverflow.com/a/16534965/132438
#standardSQL
SELECT *
, FORMAT('%i-%i-%i', year, month, ROW_NUMBER() OVER(PARTITION BY year, month)) id
FROM `publicdata.samples.natality`

Can you query a HANA analytical view as if it's the original table?

Suppose you have a HANA analytical view, but you have no access to the table that was the origin of the analytical view.
The analytical view has pre-aggregated columns, but you need the columns without the pre-aggregation, otherwise the query will get the wrong result.
For example, suppose there are integer columns Price and Profit, and your query has SELECT SUM(PRICE*PROFIT). With the regular table, PRICE*PROFIT would be calculated on each row, and those results would then be aggregated into SUM(PRICE*PROFIT). But with the analytical view's pre-aggregation, you end up getting SUM(PRICE)*SUM(PROFIT), which is not the same as SUM(PRICE*PROFIT).
Yes, if there is another column with a unique value per row, it can be added to the query, and you can get multiple rows from the analytical view that will aggregate as needed. And you can do a SELECT * to get all rows without pre-aggregation, but that doesn't allow including the SUM(PRICE*PROFIT).
In my case, my program has no idea which columns would hold the unique value needed to do the aggregated calculation correctly.
Is there any way to query an analytical view as if it was its original table?
I have a solution: do a SELECT * as a subquery; then you have the original table to query from:
SELECT "State", SUM("Price" * "Profit") AS PxP FROM
(
SELECT * FROM AVTable
)
GROUP BY "State"

Select record with max online per day, ordered by date

I need help with SQL:
I need to get the max online of each day, grouped by day
(http://prntscr.com/a7j2sm)
My SQL select:
SELECT id, date, MAX(online)
FROM `record_online_1`
GROUP BY DAY(date)
and the result - http://prntscr.com/a7j3sp
This result is incorrect: the max online is correct, but the date and id of that top online are incorrect. I don't have any ideas how to solve this issue.
UPD: using MySQL (MariaDB)
When you use an aggregate function, every item in the SELECT list that isn't part of an aggregate function must be included in the GROUP BY clause. In T-SQL, you simply cannot execute the above query unless you also GROUP BY "id", for example. Some database systems let you forgo this rule, but they aren't smart enough to know which id to bring back to you. You should only rely on that when, for example, all "ids" for that group are the same.
So what should you do? Do this in two steps. Step one, find the max values. You will lose the ID and DATETIME data.
SELECT DATE(date) AS Date, MAX(online) AS MaxOnline
FROM `record_online_1` GROUP BY DATE(date)
The above will get you a list of dates with the max for each day. INNER JOIN this to the original "record_online_1" table, joining specifically on the date and max value. You can use a CTE, temp table, subquery, etc to do this.
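A sketch of that join step (assuming ties on the max value per day are acceptable, or absent; otherwise multiple rows per day can come back):

```sql
-- Join each row back to its day's maximum so the matching id and
-- full datetime are recovered alongside the max online value.
SELECT r.id, r.date, r.online
FROM `record_online_1` r
INNER JOIN (
    SELECT DATE(date) AS d, MAX(online) AS max_online
    FROM `record_online_1`
    GROUP BY DATE(date)
) m
  ON DATE(r.date) = m.d
 AND r.online = m.max_online
ORDER BY r.date;
```

Note the grouping is on DATE(date), the full calendar date, rather than DAY(date), which returns only the day of month and would lump together, say, Jan 1 and Feb 1.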
EDIT: I found an answer that is more eloquent than my own.