Teradata logs: statistics/partitioning usage - sql

Does anyone know whether it is possible to find in the Teradata logs which statistics, or even better which partition ranges, were used by a particular query?
For example, in the table definition we have a date range:
PARTITION BY (
  RANGE_N(TransactionDate BETWEEN DATE '2012-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH)
)
So the question is: is it possible to see which ranges were used by a particular query? I believe not, but maybe there is still some way to do it?
I tried to analyze some of the DBC DBQL tables, but with no results.
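One way to check this, as a hedged sketch: Teradata's EXPLAIN output spells out partition elimination in plain text (phrases such as "a single partition of" or "n partitions of" the table), and if DBQL explain logging is enabled (BEGIN QUERY LOGGING WITH EXPLAIN ON ...), similar text is captured per query. The table name below is hypothetical, and the DBQL view/column names should be verified against your release:
EXPLAIN
SELECT *
FROM MyDb.Transactions  -- hypothetical table using the RANGE_N scheme above
WHERE TransactionDate BETWEEN DATE '2020-01-01' AND DATE '2020-03-31';

-- With explain logging enabled, search the captured text afterwards:
SELECT QueryID, ExplainText
FROM DBC.QryLogExplainV
WHERE ExplainText LIKE '%partition%';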

Related

Querying a partitioned BigQuery table across multiple far-apart _PARTITIONDATE days?

I have a number of very big tables that are partitioned by _PARTITIONDATE, which I'd like to query regularly in an efficient way. Each time I run the query I only need to search across a small number of dates, but these dates change every run and may be months or years apart from one another.
To capture these dates I could do _PARTITIONDATE >= '2015-01-01', but this makes the queries run very slowly, as there are millions of rows in each partition. I could also do _PARTITIONDATE BETWEEN '2015-01-01' AND '2017-01-01', but the exact date range changes every run. What I'd like to do is something like _PARTITIONDATE IN ("2015-03-10", "2016-01-24", "2016-03-22", "2017-06-14") so that the query only runs on the dates provided, which from my testing appears to work.
The problem I'm running into is that the list of dates changes every time, requiring me to join in the list of dates from a temp table first. A filter like source._PARTITIONDATE IN (datelist.date) does not work: it raises an error if that is the only WHERE condition when querying a partition-required table.
Any advice on ways I might get this to work, or another approach to querying specific partitions that aren't back to back, without having to scan the whole table?
I've been reading through the BigQuery documentation but I don't see an answer to this question. I do see it says that the following "doesn't limit the scanned partitions, because it uses table values, which are dynamic." So possibly what I'm trying to do is impossible with the current BQ limitations?
_PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)
A script is a possible solution:
DECLARE max_date DEFAULT (SELECT MAX(...) FROM ...);
SELECT .... FROM ... WHERE _PARTITIONDATE = max_date;
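As a hedged sketch of how that could extend to a changing list of dates (my_dataset.date_list and my_dataset.big_table are hypothetical names): materializing the dates into a scripting variable makes the partition filter a constant at execution time, though whether pruning actually kicks in on an array variable is worth verifying with the bytes-processed estimate.
DECLARE wanted_dates ARRAY<DATE> DEFAULT (
  SELECT ARRAY_AGG(d) FROM my_dataset.date_list  -- hypothetical table of wanted dates
);

SELECT *
FROM my_dataset.big_table
WHERE _PARTITIONDATE IN UNNEST(wanted_dates);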

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling to efficiently query the latest partition of a table, using a date or datetime field. My first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But according to BigQuery's processing estimate, that scans the entire table and does not use the partitions. Google's documentation even states that this happens. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date, the query will not return any results and my automated processes will fail. If I include an offset, like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimate of the bytes to be processed for the active query does not match what is actually processed, unless I'm misinterpreting those numbers. Below is a screenshot of the mismatching values.
[Screenshot: BigQuery console showing apparently mismatching processed bytes]
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post, the estimate seems to work (see the sketch below), but it is not clear why. However, the actual bytes processed after running the query are no different than in the case that filters on the latest partition in the WHERE clause.
Using the same declared max_date on a table that is both partitioned and clustered, the estimate works only when filtering on the partition, but fails if I include a filter on the cluster.
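For reference, the DECLARE variant from the first scenario looks roughly like this (dataset and table names are hypothetical; note that the subquery inside the DECLARE itself still scans the table):
DECLARE max_date DATE DEFAULT (
  SELECT MAX(observation_date) FROM my_dataset.my_table
);

SELECT *
FROM my_dataset.my_table
WHERE observation_date = max_date;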
After some iterations I got an answer from Google; although it doesn't resolve the issue, it acknowledges that it happens.
Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.
They made notes to try to improve this in the future, but for now we have to deal with it. I hope this helps somebody who was trying to do the same thing.

How should a Hive table be created when we need to query daily data?

I want to understand how I should design a table when the query against it looks something like this:
SELECT * FROM table_name WHERE date_column > date_sub(current_date, 2) AND date_column < current_date;
Note: my intention is to select each day's data specifically.
So how should I design my table for better results? I think partitioning by date will give too many partitions and lead to a performance bottleneck, and I'm not sure whether bucketing works here. Please suggest an approach, with some explanation.
If the daily volume of data is just not enough to justify a partition per day, consider partitioning by yyyyMM (year and month). In that case, your query changes to:
SELECT * FROM table_name
WHERE my_partition_col IN (date_format(current_date, 'yyyyMM'), date_format(date_sub(current_date, 2), 'yyyyMM'))
  AND date_column > date_sub(current_date, 2) AND date_column < current_date;
This optimizes both the storage and the performance requirements.
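A minimal sketch of what that monthly layout could look like (table name, column types and storage format are assumptions, not from the original post):
CREATE TABLE sales_monthly (
  id          BIGINT,
  date_column DATE,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (my_partition_col STRING)  -- holds yyyyMM values such as '201701'
STORED AS ORC;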
You should partition by date.
You are correct that this will create a lot of partitions. Within Hive, each date will be a separate directory under the table's location, and yes, Hive will need to maintain all of that, but that's exactly what Hive is best at.
Note: my intention is to select each day's data specifically.
Since this is your intention, you'll get the best performance with daily partitions.
Other sorts of queries, running across multiple dates, may result in the performance bottleneck you're expressing concern about. But if that occurs, you could consider creating a different table to address that use case.
For your primary, current use case, daily partitions are the solution.
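A minimal sketch of the daily-partitioned design, with assumed names and storage format:
CREATE TABLE sales_daily (
  id     BIGINT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (date_column DATE)
STORED AS ORC;

-- A filter on the partition column reads only the matching partitions:
SELECT *
FROM sales_daily
WHERE date_column > date_sub(current_date, 2)
  AND date_column < current_date;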

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. First, the statement above applies to any data type, not only dates. Also, the word "many" is relative to the number of records in the table: if the optimizer decides that the query will return a large share of all the records in the table, it may decide that a full table scan is faster than using the index. In your situation, that translates to: how many records fall in 2017, out of all the records in the table? This calculation gives you the cardinality of your query, which in turn gives you an idea of whether an index will be faster or not.
Now, if you decide based on the above that an index would be faster, the next step is knowing how to build it. For the optimizer to use an index, it must match the condition you're filtering on. You are not comparing dates in your query; you are only comparing the year part, so a plain index on the date column will not be used. You need an index on the year part, i.e. create the index on the same expression you use in the condition.
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
Applying a function to the column can also cause slowness with this number of records involved. Not sure if a function-based index can help you here, but you can try (see the sketch below).
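A hedged sketch of such a function-based index (the table and index names are hypothetical, and as noted above this requires privileges on the vendor's database):
CREATE INDEX idx_events_year
  ON events_table (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));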
Have you tried adding a year column to the table? If not, try adding one and populating it with the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time, though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider partitioning the table for data this big. For starters, see the link below:
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow?
For starters, I'll agree with Mitch Wheat that you should try PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, could you maintain a 'shadow' in another database? You would create a table there holding all the date values and the PK of the original table, query it to find the relevant PK values, and then JOIN those back to your original table to find whatever you need. The biggest problem with this is keeping the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)', then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
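A rough sketch of that shadow idea, with all names hypothetical and the sync step left out (shown in one schema for brevity; across databases you'd go through a database link):
CREATE TABLE shadow_events (
  event_pk            NUMBER PRIMARY KEY,
  performed_date_time DATE   NOT NULL
);
CREATE INDEX idx_shadow_performed ON shadow_events (performed_date_time);

-- Find the relevant keys cheaply, then join back to the vendor table:
SELECT t.*
FROM vendor_table t
     JOIN shadow_events s ON s.event_pk = t.event_pk
WHERE s.performed_date_time >= DATE '2017-01-01'
  AND s.performed_date_time <  DATE '2018-01-01';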
This may be useful (you avoid functions, a cause of context switching, and if you have an index on your date field, it could be used):
-- Assumes PERFORMED_DATE_TIME holds pure dates (no time component);
-- otherwise the equality join below will miss rows.
with
dt as
(
  select
    to_date('01/01/2017', 'DD/MM/YYYY') as d1,
    to_date('31/12/2017', 'DD/MM/YYYY') as d2
  from dual
),
dates as
(
  select
    dt.d1 + rownum - 1 as d
  from dt
  connect by dt.d1 + rownum - 1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Move the date literals to the right-hand side and compare the raw column directly:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.

Generating dates in Hive SQL

I'd like to create a table that contains all of the dates (inclusive) between the min and max dates from another table. Below is the simple query to get those dates:
-- Get the min and max dates from the table
select min(date(sale_date)) as min_date,
max(date(sale_date)) as max_date
from TABLE;
I've spent the last hour googling this problem and have found attempts at doing it in MySQL and Oracle SQL, but I've been unable to convert those to Hive SQL. If anyone has any idea how to do this, please let me know. Thanking you in advance.
OK, this isn't my answer; a colleague was able to answer it. Still, I think it's important that I show my colleague's solution for your future benefit. It assumes that you've created a table that contains the min date and max date.
CREATE TABLE TABLE_2
STORED AS AVRO
LOCATION 'xxxxxx'
AS
-- space(n) yields a string of n spaces; splitting it on ' ' gives n+1 empty
-- tokens, and posexplode() numbers them 0..n, producing one row per day offset
SELECT date_add(t.min_date, pe.i) AS date_key
FROM TABLE_1 t
LATERAL VIEW
posexplode(split(space(datediff(t.max_date, t.min_date)), ' ')) pe AS i, x;
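For completeness, a sketch of how TABLE_1 (the min/max table the solution assumes) could be created from the query at the top, with TABLE still standing in for the real source table:
CREATE TABLE TABLE_1 AS
SELECT min(date(sale_date)) AS min_date,
       max(date(sale_date)) AS max_date
FROM TABLE;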