Using Hive, how do I query data that is split across multiple partitions?

From a table partitioned on a date field (a new partition is generated every day), I need to extract records spanning the last three months. This means I need to query every partition in that window, using a filter like where date < 'today' and date >= 'today - 90 days'.
I think that this query would not be very efficient.
Is there a better way of accessing data that is spread across multiple partitions?
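For reference, a minimal sketch of such a query (HiveQL; my_table and the date partition column name are illustrative). When the filter sits on the partition column, Hive prunes partitions at planning time and scans only the roughly 90 daily partitions in range rather than the whole table. Recent Hive versions constant-fold current_date; on older versions, substitute literal date strings:
select *
from my_table
where `date` >= date_sub(current_date, 90)  -- assumes `date` is the partition column
  and `date` < current_date;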

Related

SQL - Returning max count, after breaking down a day into hourly rows

I need to write a SQL query that returns the highest order count in a given hourly range. The problem is that my table just logs orders as they come in and has no identifier that separates one hour from the next.
So basically, I need to find the highest number of orders in any given hour between 7/08/2022 and 7/15/2022, from a table that does not distinguish distinct hour sets and logs orders as they come.
I have tried to use a query that combines MAX(), COUNT(), and DATETIME(), but to no avail.
Can I please receive some help?
I've had to tackle this kind of measurement in the past.
Here's what I did for 15 minute intervals:
My datetime column is named datreg in my database log area.
cast(round(floor(cast(datreg as float(53))*24*4)/(24*4),5) as smalldatetime)
I multiply by 4 in this formula to get four intervals per hour across the 24-hour period. For you it would look like this to get just hourly intervals:
cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime)
This is a little piece of magic when it comes to dashboards and reports.
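Putting it together, a minimal sketch of the full query (SQL Server syntax; the Orders table name is an assumption, datreg is the datetime column as above): bucket each order into its hour, count per bucket, and keep the busiest hour in the requested range:
select top (1)
    cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime) as hour_bucket,
    count(*) as order_count
from Orders  -- assumed table name
where datreg >= '2022-07-08'
  and datreg < '2022-07-16'  -- end-exclusive so all of 7/15 is included
group by cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime)
order by order_count desc;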

Query takes a long time to execute even though an index is created

I have a table DB_TBL into which historical data is inserted from a tool on the 5th of every month. As it is historical data, the number of rows inserted is very large; for example, the count for this month is 1,626,521.
The table contains 50 columns, but I am interested in only 5 of them. A front-end GUI uses these 5 columns, and I provide a filter to fetch one month of data at a time, for example the previous month as closure_date between 01.01.2020 and 31.01.2020.
Select Client_key, Portfolia_Name, Closure_date, Currency_Type, Balance
from DB_TBL
where CLOSURE_DATE between to_date('01.01.2020', 'dd.mm.yyyy')
                       and to_date('31.01.2020', 'dd.mm.yyyy');
This query is very slow and takes more than 30 minutes to run. The problem is critical because the GUI faces the same issue. I have created an index on the Closure_date column, but it still does not improve performance.
Is there any way I can improve query performance? Maybe I could create a virtual-column-based index, or a view?
First, I would write the query using date literals:
Select Client_key, Portfolia_Name, Closure_date, Currency_Type, Balance
from DB_TBL
where CLOSURE_DATE between date '2020-01-01' and date '2020-01-31'
For this query, you want an index on db_tbl(closure_date). If you already have this index, then presumably you have such a large volume of data that it takes a lot of time to return the data. You can change the index to a covering index by including the rest of the columns in the index (after CLOSURE_DATE, which should be first).
That said, if you want all data from January 2020, I would recommend:
where CLOSURE_DATE >= date '2020-01-01' and
CLOSURE_DATE < date '2020-02-01'
This will return all rows, even those that have a time component. This is particularly important in Oracle, because the date data type can have a time component -- even though user interfaces often show only the date.
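Going back to the covering index suggestion, a minimal sketch (Oracle syntax; the index name is illustrative), with CLOSURE_DATE first and the other four selected columns included so the query can be answered from the index alone:
create index db_tbl_closure_cov_ix  -- illustrative name
    on db_tbl (closure_date, client_key, portfolia_name, currency_type, balance);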

What should my Hive partitioning strategy and view strategy be so that queries can run efficiently and return results within 10 seconds?

My use case: I have two data sources:
1. Source1 (as speed layer)
2. Hive external table on top of S3 (as batch layer)
I am using Presto for querying data from both the data sources by using view.
I want to create a view that unions data from both sources, like: "create view test as select * from Source1.table union all select * from hive.table"
We keep 24 hours of data in Source1; after 24 hours that data is migrated to S3 via Hive.
The columns of the Source1 table are: timestamp, logtype, company, category.
Users will query data using a timestamp range (the last 15/30 minutes, last x hours, last x days, last x months, etc.).
example: "select * from test where timestamp > (now() - interval '15' minute)","select * from test where timestamp > (now() - interval '12' hour)", "select * from test where timestamp > (now() - interval '1' day)"
To satisfy these queries I need to partition the Hive table appropriately. The user should also not be aware of the underlying strategy, i.e. when querying the last x minutes of data, they should not care whether Presto reads it from Source1 or from Hive.
What should my Hive partitioning strategy and view strategy be so that queries can run efficiently and return results within 10 seconds?
For Hive, the partition column should be one that is queried in filters.
In your case that is timestamp. However, if you partition directly on timestamp, a partition would be created for every second (or millisecond), depending on the data in the column.
A better solution is to derive columns like year, month, day, and hour from the timestamp and use these as partition columns, as in the sketch below.
The same strategy works for Kudu, but be advised it can create hot-spotting: all newly arriving records land in the same (most recent) partition, which limits insert (and possibly query) performance.
To overcome this, add one more column as a hash partition alongside the timestamp-derived columns,
e.g. year, month, day, hour, logtype.
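A minimal sketch of the Hive side (HiveQL; the table name, file format, and S3 location are assumptions, the columns follow the question), partitioned by the timestamp-derived columns so Presto can prune to the partitions covered by the user's time range:
create external table logs (  -- table name is illustrative
    `timestamp` timestamp,
    logtype string,
    company string,
    category string
)
partitioned by (year int, month int, day int, hour int)
stored as orc                 -- file format is an assumption
location 's3://my-bucket/logs/';  -- illustrative location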

Creating a calculated column (not aggregate) that changes value based on context in SSAS Tabular DAX

Data: I have a single row that represents an annual subscription to a product. It has an overall startDate and endDate; there is also a third date, startDate + 1 month, called endDateNew. I also have a non-related date table (called Table X).
Output I'm looking for: I need a new column called Categorisation that returns 'New' if the date selected in Table X is between startDate and endDateNew, and 'Existing' if the date is between startDate and endDate.
Problem: The column seems to evaluate immediately, without taking into account the date context from the non-related date table. I half expected this in Visual Studio (where it assumes the context is all records?), but when previewing in Excel the same value carries through.
The bit that is working: I have an aggregate (an active subscriber count) that correctly counts the subscription as active over the months selected in Table X.
The SQL equivalent on an individual date:
case
    when '2015-10-01' between startDate and endDateNew then 'New'
    when '2015-10-01' < endDate then 'Existing'
end as Category
where the value would be calculated for each date in Table X
Thanks!
Ross
Calculated columns are only evaluated at model refresh/process time. This is by design. There is no way to make a calculated column change based on run-time changes in filter context from a pivot table.
Ross,
Calculated columns work differently than in Excel. Optimally, the value is known when the record is first added to the model.
Your example is kind of similar to a slowly changing dimension.
There are several possible solutions. Here are two and a half:
Full process on the last 32 days of data every time you process the subscriptions table (which may be unacceptably inefficient).
OR
Create a new table 'Subscription scd' with the primary key from the subscriptions table and your single calculated column of 'Subscription Age in Days'. Like an outrigger. This table could be reprocessed more efficiently than reprocessing the subscriptions table, so process the subscriptions table as incrementals only and do a full process on this table for the data within the last 32 days instead.
OR
Decide which measures are interesting within the 'new/existing' context and write explicit measures for them, using a dynamic filter on the date column inside the measures (see the sketch after this list),
e.g. define
'Sum of Sales - New Subscriptions',
'Sum of Sales - Existing Subscriptions',
'Distinct Count of New Subscriptions - Last 28 Days', etc.
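A minimal sketch of one such measure (DAX; the Subscriptions table and SalesAmount column are assumptions, the date columns follow the question). Unlike a calculated column, this is evaluated at query time against whatever date is selected in Table X:
Sum of Sales - New Subscriptions :=
CALCULATE (
    SUM ( Subscriptions[SalesAmount] ),  -- assumed fact table and column
    FILTER (
        Subscriptions,
        MAX ( 'Table X'[Date] ) >= Subscriptions[startDate]
            && MAX ( 'Table X'[Date] ) <= Subscriptions[endDateNew]
    )
)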

Nondeterministic functions in SQL partitioning functions

How are non-deterministic functions used in SQL partitioning functions and are they useful?
MsSql allows non-deterministic functions in partitioning functions:
CREATE PARTITION FUNCTION MyArchive(datetime)
AS RANGE LEFT FOR VALUES (GETDATE() - 10)
GO
Does that mean records older than 10 days are automatically moved to the archive (first) partition? Of course not.
The GETDATE() expression is evaluated once, when the partition function is created, and the resulting date is stored as a fixed boundary.
Let's say you set up the above schema on 2000-01-11, which makes the delimiting date 2000-01-01.
When you query for data with dates lower than the initial delimiting date (the boundary value, 2000-01-01), only the archive partition is used.
When you query for data with dates higher than the current day minus 10 days (GETDATE() - 10), only the current partition is used.
All other queries use both partitions, i.e. queries for data with dates lower than the current date minus 10 days but higher than the delimiting date (2000-01-01).
This means that with each passing day, the range of dates for which both partitions are used grows. You would have been better off setting the boundary to the delimiting date deterministically.
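A minimal sketch of the deterministic alternative (T-SQL; same function name as above), pinning the boundary to an explicit date instead of letting GETDATE() be evaluated once and silently frozen at creation time:
CREATE PARTITION FUNCTION MyArchive(datetime)
AS RANGE LEFT FOR VALUES ('2000-01-01')  -- explicit, stable boundary
GO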
I don't foresee any scenario where this is useful.