Hive GROUP BY slower with partition

I partition my data in Hive based on a column value (date), so each date has its own directory under /warehouse. Right now I have about 240 dates and a total of 70 million records evenly distributed across the dates.
I also created another table containing the same data, but without partitions.
When I query both tables with the same queries, the partitioned table does not always out-perform the unpartitioned one. More specifically, the partitioned table is slower when the query contains a GROUP BY.
select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27';
This took 22.146 seconds, and the count is 7427366.
select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27';
This took 22.723 seconds, and also returns 7427366 for count.
However, when a GROUP BY is added, the partitioned table performs worse than the un-partitioned one.
select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;
This took 39.733 seconds and 26,724 rows were returned.
select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;
This took 76.648 seconds and 26,724 rows were returned.
Why is the partitioned table slower in this scenario?
EDIT
This is how I create my partitioned table:
CREATE TABLE all_ads_from_csv_partitioned3(
id STRING,
...
)
PARTITIONED BY(datedecoded STRING)
STORED AS ORC;
And under /warehouse/tablespace/managed/hive/partitioned_table/ there are 240 directories (240 partitions), each with the format /warehouse/tablespace/managed/hive/partitioned_table/dated='the partitioned date', and each partition contains roughly 10 buckets.

Related

I don't understand how to do this task in SQL

There is a table with two fields: Id and Timestamp.
Id is an increasing sequence: each insertion of a new record into the table generates ID(n) = ID(n-1) + 1. Timestamp is a timestamp that, when a record is inserted retroactively, can take any value less than the maximum timestamp of all previous records.
Retroactive insertion is the operation of inserting a record into the table for which
ID(n) > ID(n-1)
Timestamp(n) < max(Timestamp(1)..Timestamp(n-1))
Example of a table:
ID | Timestamp
1  | 2016.09.11
2  | 2016.09.12
3  | 2016.09.13
4  | 2016.09.14
5  | 2016.09.09
6  | 2016.09.12
7  | 2016.09.15
IDs 5 and 6 were inserted retroactively (their timestamps are lower than the maximum timestamp of the records inserted before them).
I need a query that will return a list of all ids that fit the definition of insertion retroactively. How can I do this?
It can be rephrased as:
Find every entry for which, in the same table, there is an entry with a lesser id (a previous entry) and a greater timestamp.
It can be achieved using a WHERE EXISTS clause:
SELECT t.id, t.timestamp
FROM tbl t
WHERE EXISTS (
SELECT 1
FROM tbl t2
WHERE t.id > t2.id
AND t.timestamp < t2.timestamp
);
Fiddle for MySQL. It should work with any DBMS, since it's standard SQL syntax.
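An equivalent formulation (a sketch, assuming MySQL 8+ or any DBMS with window functions) compares each row's timestamp against the maximum timestamp of all rows with a smaller id:
SELECT id, timestamp
FROM (
  SELECT id, timestamp,
         -- highest timestamp among rows with a smaller id; NULL for the first row
         MAX(timestamp) OVER (ORDER BY id
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max_ts
  FROM tbl
) t
WHERE timestamp < prev_max_ts;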

Insert latest records efficiently in hive

I have around 90 tables in Hive; groups of 10 are combined using UNION ALL into 9 master tables.
These 90 base tables receive new rows every 15 minutes, and we need to bring the newly inserted rows into the master tables every 15 minutes.
Checking the IDs with "not in" is consuming some time.
I have a timestamp column as well; getting the data based on that is also taking time.
Is there an efficient way of achieving this: "inserting newly added records in the base tables into the masters every 15 minutes"?
I can think of two options.
Option 1 - You can create a new table to keep the max timestamp for each master/stage combination. The table should look like this:
masters, stages, mxts
master1, stage1, 2021-01-01 12:30:30
...
Then use it in SQL similar to this:
select * from staging_table_1 s
join maxtimestamp m on s.timestamp > m.mxts and m.stages = 'stage1' and m.masters = 'master1'
union all
select * from staging_table_2 s
join maxtimestamp m on s.timestamp > m.mxts and m.stages = 'stage2' and m.masters = 'master1'
And then insert the max timestamp into the new table every day after the load.
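A rough sketch of that daily refresh (the table names maxtimestamp, staging_table_1/2 and the timestamp column name are placeholders following the SQL above):
-- rebuild the tracking table from the freshly loaded staging tables
-- `timestamp` is quoted because it is a reserved word in recent Hive versions
insert overwrite table maxtimestamp
select * from (
  select 'master1' as masters, 'stage1' as stages, max(`timestamp`) as mxts from staging_table_1
  union all
  select 'master1' as masters, 'stage2' as stages, max(`timestamp`) as mxts from staging_table_2
) t;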
Option 2 - Add a new column to the master table, called record_created_by, to keep track of which stage created the data.
Your insert statement would then be like this:
select s.*, 'master1~stage1' as record_created_by from staging_table_1 s
join (select max(timestamp) mxts from master where record_created_by = 'master1~stage1') mx on s.timestamp > mx.mxts
union all
select s.*, 'master1~stage2' as record_created_by from staging_table_2 s
join (select max(timestamp) mxts from master where record_created_by = 'master1~stage2') mx on s.timestamp > mx.mxts
Please note that your first-time insert statement would be the same SQL as above, but without the timestamp part. If you have multiple stages, you can add them to the UNION in the same way.
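For reference, that first-time load could look roughly like this (a sketch; the master table is assumed to be simply named master, as in the subqueries above, and the staging table names follow the same placeholders):
insert into table master
select * from (
  select s.*, 'master1~stage1' as record_created_by from staging_table_1 s
  union all
  select s.*, 'master1~stage2' as record_created_by from staging_table_2 s
) x;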
The first option is considerably faster, but you need to create and maintain an extra table.

Partition by week/month/quarter/year to get over the partition limit?

I have 32 years of data that I want to put into a partitioned table. However BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I'm getting an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?
Instead of partitioning by day, you could partition by week/month/year.
In my case each year contains around 3 GB of data, so I'll get the most benefit from clustering if I partition by year.
For this, I'll create a year date column, and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB
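If yearly partitions were too coarse, the same pattern would work at month granularity; a sketch reusing the table names from the question (32 years of daily data become roughly 384 monthly partitions, well under the limit):
CREATE TABLE `deleting.month_partition`
PARTITION BY FlightDate_month
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) AS FlightDate_month
FROM `flights.original`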
As an alternative example, I created a NOAA GSOD summary table clustered by station name; instead of partitioning by day, I didn't partition it at all.
Let's say I want to find the hottest days since 1980 for all stations with a name like SAN FRAN%:
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (without clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438

SQL Huge Read Only Table Performance Filter and Ordering

I have a table with 1 billion rows that holds possible solutions to a goal setting program.
The combination of each column's value creates a successful goal path. I want to filter records to show the top 10 rows that are ordered by the choice of the user. Someone may want the lowest possible retirement age, then lowest deposit amount. Someone else may want the highest possible survival chance, then highest ending balance, ...
Here are my columns:
age tinyint
retirement_age tinyint
retirement_length tinyint
survival smallint
deposit int
balance_start int
balance_end int
SLOW 10 MIN QUERY:
select top(10) age,retirement_age,retirement_length,survival,deposit,balance_start,balance_end
from TABLE
where
age >= 30
and survival >= 8000 --OUT OF 10000
and balance_start <= 20000
and retirement_age >= 60
and retirement_age <= 75
and retirement_length >= 10
and retirement_length <= 25
and deposit >= 1000
and deposit <= 20000
ORDER BY -- (COLUMN ORDER PREFERENCES UNKNOWN)
retirement_age,
deposit,
retirement_length desc,
balance_end desc,
age desc,
survival desc
That query takes 10 min.
All of the records are generated once, so there is no more writing/updating to the database. I was thinking I should index each column, but have not done so. The database is 30GB right now, but space is not an issue.
I have run the Estimated Execution plan:
select: 0%
parallelism: 0%
sort: 23%
table scan: 77%
Have you tried creating an index like
CREATE INDEX IX_TABLE ON [TABLE]
(age,survival,balance_start,retirement_age,retirement_length,deposit)
INCLUDE (balance_end)
The order of the index fields (age,survival,balance_start,retirement_age,retirement_length,deposit) will make a difference if not all the fields are used in the WHERE clause, so make sure to put them in order of most used.
Also, the order of the included columns does not make any difference.
Seeing as the table values will not change, you can create more than one such index to improve the performance of other queries that do not use all the fields in the WHERE clause.
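For example, a second covering index leading with different columns (hypothetical; which columns to put first depends on what the other queries filter on most) might look like:
-- hypothetical second index for queries that filter mainly on retirement_age and deposit
CREATE INDEX IX_TABLE_2 ON [TABLE]
(retirement_age, deposit, retirement_length, survival, age, balance_start)
INCLUDE (balance_end)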
I ended up creating separate indexes on each of the columns in my WHERE and ORDER BY clauses, with the default order:
CREATE INDEX IX_age ON TABLE (age desc)
CREATE INDEX IX_retirement_age ON TABLE (retirement_age)
CREATE INDEX IX_retirement_length ON TABLE (retirement_length desc)
CREATE INDEX IX_survival ON TABLE (survival desc)
CREATE INDEX IX_deposit ON TABLE (deposit)
CREATE INDEX IX_balance_start ON TABLE (balance_start)
CREATE INDEX IX_balance_end ON TABLE (balance_end desc)

SQL Query for count of records matching day in a date range?

I have a table with records that look like this:
CREATE TABLE sample (
ix int unsigned auto_increment primary key,
start_active datetime,
last_active datetime
);
I need to know how many records were active on each of the last 30 days. The days should also be sorted incrementing so they are returned oldest to newest.
I'm using MySQL and the query will be run from PHP but I don't really need the PHP code, just the query.
Here's my start:
SELECT COUNT(1) cnt, DATE(?each of last 30 days?) adate
FROM sample
WHERE adate BETWEEN start_active AND last_active
GROUP BY adate;
Do an outer join.
No table? Make a table. I always keep a dummy table around just for this.
create table artificial_range(
id int not null primary key auto_increment,
name varchar( 20 ) null ) ;
-- or whatever your database requires for an auto increment column
insert into artificial_range( name ) values ( null );
-- create one row.
insert into artificial_range( name ) select name from artificial_range;
-- you now have two rows
insert into artificial_range( name ) select name from artificial_range;
-- you now have four rows
insert into artificial_range( name ) select name from artificial_range;
-- you now have eight rows
--etc.
insert into artificial_range( name ) select name from artificial_range;
-- you now have 1024 rows, with ids 1-1024
Now make it convenient to use, and limit it to 30 days, with a view:
Edit: JR Lawhorne notes:
You need to change "date_add" to "date_sub" to get the previous 30 days in the created view.
Thanks JR!
create view each_of_the_last_30_days as
select date_sub( now(), interval (id - 1) day ) as adate
from artificial_range where id < 32;
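A quick sanity check (hypothetical test query) is to list what the view produces; it should return 31 values, one per day from 30 days ago up to now:
select adate from each_of_the_last_30_days order by adate;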
Now use this in your query (I haven't actually tested your query, I'm just assuming it works correctly):
Edit: I should be joining the other way:
SELECT COUNT(a.ix) cnt, b.adate -- count matched rows, so days with no active records show 0
FROM each_of_the_last_30_days b
left outer join sample a
on ( b.adate BETWEEN a.start_active AND a.last_active)
GROUP BY b.adate;
SQL is great at matching sets of values that are stored in the database, but it isn't so great at matching sets of values that aren't in the database. So one easy workaround is to create a temp table containing the values you need:
CREATE TEMPORARY TABLE days_ago (d SMALLINT);
INSERT INTO days_ago (d) VALUES
(0), (1), (2), ... (29), (30);
Now you can compare a date that is d days ago to the span between start_active and last_active of each row. Count how many matching rows in the group per value of d and you've got your count.
SELECT CURRENT_DATE - INTERVAL d DAY AS adate, COUNT(sample.ix) AS cnt
FROM days_ago
LEFT JOIN sample ON (CURRENT_DATE - INTERVAL d DAY BETWEEN start_active AND last_active)
GROUP BY d
ORDER BY d DESC; -- oldest to newest
Another note: you can't use column aliases defined in your select-list in expressions until you get to the GROUP BY clause. Actually, in standard SQL you can't use them until the ORDER BY clause, but MySQL supports using aliases in GROUP BY and HAVING clauses as well.
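Applied to the query above, that means this MySQL-only variant is accepted, while a strictly standard engine would reject the alias in the GROUP BY:
SELECT CURRENT_DATE - INTERVAL d DAY AS adate, COUNT(sample.ix) AS cnt
FROM days_ago
LEFT JOIN sample ON (CURRENT_DATE - INTERVAL d DAY BETWEEN start_active AND last_active)
GROUP BY adate   -- select-list alias in GROUP BY: MySQL extension
ORDER BY adate;  -- alias in ORDER BY: allowed by standard SQL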
Turn the date into a unix timestamp (i.e., seconds) in your query, and then just check that the difference is <= the number of seconds in a month.
You can find more information here:
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_unix-timestamp
If you need help with the query please let me know, but MySQL has nice functions for dealing with datetime.
[Edit] Since I was confused as to the real question, I need to finish the lawn but before I forget I want to write this down.
To get a count per day, you will want your WHERE clause to limit to the past 30 days as described above, but you will also need to group by day: convert each start to a calendar day, group by that, and count the rows in each group.
This assumes that each use will be limited to one day, if the start and end dates can span several days then it will be trickier.
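A rough sketch of that approach (assuming, as noted, that each record's activity falls within a single day):
-- count records whose start_active falls within the last ~30 days, grouped per calendar day
SELECT DATE(start_active) AS adate, COUNT(*) AS cnt
FROM sample
WHERE UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(start_active) <= 30 * 24 * 60 * 60
GROUP BY DATE(start_active)
ORDER BY adate;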