Hive: count elements of the max partition column

I'm struggling with a query that looks simple but is causing me a lot of trouble.
SELECT COUNT(*) FROM mytable where partition_column IN (SELECT MAX(partition_column) FROM mytable )
mytable is a 2 TB external Hive table, partitioned by the column partition_column. This query takes 10 minutes to run.
When I run the two queries separately:
SELECT MAX(partition_column) FROM mytable
> 2020-06-29
SELECT COUNT(*) FROM mytable where partition_column = '2020-06-29'
it works fine and runs very quickly.
Am I missing something?
Thank you
I'm on Hive 1.2.1 and Hadoop 2.7.3

It looks like the subquery is what takes so long. Since you are filtering on the same table and column that the subquery scans, the reducer step takes a long time to process, which makes the whole query slow.
You could improve the query by introducing a CTE, which creates a temporary result set. Something like this:
WITH MY_CTE_SUBQUERY AS (
    SELECT MAX(partition_column) AS max_pc FROM mytable
)
SELECT COUNT(*)
FROM mytable
WHERE partition_column IN (SELECT max_pc FROM MY_CTE_SUBQUERY);
More on Hive CTEs in the official documentation.
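If the CTE alone still does not speed things up on Hive 1.2, a hedged two-step workaround builds on the two fast queries the asker already has: run the MAX first, then pass its result into the count through a hiveconf variable (max_pc is an illustrative name, not from the original answer):

-- step 1: find the latest partition (fast in the asker's own test)
SELECT MAX(partition_column) FROM mytable;

-- step 2: substitute the result when launching the count, e.g.
--   hive --hiveconf max_pc='2020-06-29' -f count.hql
SELECT COUNT(*) FROM mytable
WHERE partition_column = '${hiveconf:max_pc}';

The substitution happens before planning, so the second query filters on a literal partition value, just like the fast manual query.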


postgres jsonb_object_keys distinct or group by extremely slow

Database version is PostgreSQL 11.16
My table has 424,868 records with a JSON field. When I do:
SELECT jsonb_object_keys(raw_json) FROM table;
it returns the result within a second. I need to remove duplicate keys, though, so when I do:
SELECT DISTINCT jsonb_object_keys(raw_json) FROM table;
my database CPU rises to 100% and it takes 15 minutes to get a result. I tried a solution with GROUP BY:
select array_agg(json_keys),id from (
select jsonb_object_keys(raw_json) as json_keys, id from table) a group by a.id
Same result.
For debugging I did this:
select count(*) from (SELECT jsonb_object_keys(raw_json) as k from table) test
and it returned 41,633,935 keys.

Is there a way to eliminate a subquery from this Hive query?

Edit: I am using Apache Hive (version 3.1.0.3.1.5.0-152)
When I run the following query:
insert into delta_table (select * from batch_table where loaddate=(select max(loaddate) from batch_table));
I get this error:
Unsupported SubQuery Expression 'loaddate': Only SubQuery expressions
that are top level conjuncts are allowed
We have a table that is written to in daily batches, with a column loaddate that is unique for each batch. The purpose of the query is to get all the records from the most recent batch without knowing what its load date is.
I suspect the issue is that I am using a subquery inside a subquery. Is there a way to change this query to do the same thing, but without the last subquery?
It depends on which version of Hive you have, but you can use the WITH clause to avoid the second subquery:
with max_load as (
    select max(loaddate) as loaddate from batch_table
)
insert into delta_table
select a.*
from batch_table a
join max_load m on a.loaddate = m.loaddate;
Joining to the single-row CTE keeps the max-date comparison out of the WHERE clause entirely, so the restriction on subquery expressions never applies.
It turned out the error was because the table had been created incorrectly, and for some reason that caused the query to fail. I recreated the table and it now works.
An analytic function plus a filter will be more efficient than a self-join or subquery, which needs one more table scan to find the max date:
insert into delta_table
select col1, col2, ... coln --list columns here
from
(
    select t.*, rank() over (order by loaddate desc) as rnk
    from batch_table t
) s
where rnk = 1;
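A closely related single-scan variant (my sketch, not from the original answers) uses a windowed MAX instead of RANK; table and column names are the asker's:

insert into delta_table
select col1, col2, ... coln --list columns here
from
(
    select t.*, max(loaddate) over () as max_loaddate
    from batch_table t
) s
where loaddate = max_loaddate;

Like the RANK version, this reads batch_table only once and keeps every row from the latest batch, including ties.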

BigQuery: select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds the date that all the up-to-date tables are guaranteed to have and selects that date's data. However, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make the select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accommodate that.
With BigQuery scripting (in beta as of posting), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune the scan to only the latest day of data.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
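Adapted to the wildcard tables in the question (a sketch using the asker's table prefix), the same pattern would be:

DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `project.content_owner_asset_metadata_*`);

SELECT * FROM `project.content_owner_asset_metadata_*`
WHERE _PARTITIONTIME = max_date;

One caveat: this computes a single global maximum across all matched tables, so a table whose latest partition is older contributes no rows. It does not return each table's own latest partition, which is what the answer below addresses.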
With INFORMATION_SCHEMA.PARTITIONS (in preview as of posting), this can be achieved by joining to the PARTITIONS view as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)

How to add an index to a column with duplicate values to make a query faster in PostgreSQL?

Here's the table in PostgreSQL:
name, ts, value
A, 2017-05-28, 1
A, 2017-05-27, 5
A, 2017-05-26, 2
...
B, 2017-05-28, 9
B, 2017-05-28, 12
...
The table will have over 10 million rows. I'm trying to execute select count(distinct(name)) from "table"; and it takes over 240 seconds without finishing. Could anyone suggest a way to optimise this, such as partitioning (as in Hive) or adding an index (which I thought had to be unique, but name is duplicated across many records)? Thanks!
For some reason, Postgres does not optimize count(distinct name) very well. (Intriguingly, Hive -- which has a very different optimizer -- has a similar problem.)
Try running the query this way:
select count(*)
from (select distinct name
from t
) t;
I don't think an index will help, but you can always try using one on t(name).
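If you do try the index, a minimal sketch (the table is named t as in the answer; the index does not need to be unique):

-- plain btree index on name
create index idx_t_name on t (name);
-- refresh planner statistics so the new index is considered
analyze t;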

How do I calculate a moving average using MySQL?

I need to do something like:
SELECT value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
Except in addition to value_column1, I also need to retrieve a moving average of the previous 20 values of value_column1.
Standard SQL is preferred, but I will use MySQL extensions if necessary.
This is just off the top of my head, and I'm on the way out the door, so it's untested. I also can't imagine that it would perform very well on any kind of large data set. I did confirm that it at least runs without an error though. :)
SELECT
value_column1,
(
SELECT
AVG(value_column1) AS moving_average
FROM
Table1 T2
WHERE
(
SELECT
COUNT(*)
FROM
Table1 T3
WHERE
date_column1 BETWEEN T2.date_column1 AND T1.date_column1
) BETWEEN 1 AND 20
)
FROM
Table1 T1
Tom H's approach will work. You can simplify it like this if you have an identity column:
SELECT T1.id, T1.value_column1, AVG(T2.value_column1) AS moving_average
FROM table1 T1
INNER JOIN table1 T2 ON T2.id BETWEEN T1.id - 19 AND T1.id
GROUP BY T1.id, T1.value_column1
I realize that this answer is about 7 years too late. I had a similar requirement and thought I'd share my solution in case it's useful to someone else.
There are some MySQL extensions for technical analysis that include a simple moving average. They're really easy to install and use: https://github.com/mysqludf/lib_mysqludf_ta#readme
Once you've installed the UDF (per instructions in the README), you can include a simple moving average in a select statement like this:
SELECT TA_SMA(value_column1, 20) AS sma_20 FROM table1 ORDER BY datetime_column1
When I had a similar problem, I ended up using temp tables for a variety of reasons, but it made this a lot easier! What I did looks very similar to what you're doing, as far as the schema goes.
Make the schema something like ID identity, start_date, end_date, value. When you select, do a subselect avg of the previous 20 based on the identity ID.
Only do this if you find yourself already using temp tables for other reasons though (I hit the same rows over and over for different metrics, so it was helpful to have the small dataset).
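A sketch of that shape (table and column names are illustrative, following the schema described above; a plain work table is used because MySQL cannot reference a TEMPORARY table twice in the same query):

-- work table with an identity column, per the schema described above
CREATE TABLE tmp_metrics (
    id INT AUTO_INCREMENT PRIMARY KEY,
    start_date DATETIME,
    end_date DATETIME,
    value DECIMAL(10, 2)
);

-- subselect AVG of the previous 20 rows based on the identity id
SELECT m1.id, m1.value,
       (SELECT AVG(m2.value)
        FROM tmp_metrics m2
        WHERE m2.id BETWEEN m1.id - 19 AND m1.id) AS moving_average
FROM tmp_metrics m1;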
My solution adds a row number to the table. The following example code may help:
set @MA_period = 5;
select id1, tmp1.date_time, tmp1.c, avg(tmp2.c)
from
    (select @b := @b + 1 as id1, date_time, c
     from websource.EURUSD, (select @b := 0) bb
     order by date_time asc) tmp1,
    (select @a := @a + 1 as id2, date_time, c
     from websource.EURUSD, (select @a := 0) aa
     order by date_time asc) tmp2
where id1 > @MA_period and id1 >= id2 and id2 > (id1 - @MA_period)
group by id1
order by id1 asc, id2 asc;
In my experience, MySQL as of 5.5.x tends not to use indexes on dependent selects, whether a subquery or a join. This can have a very significant impact on performance where the dependent select criteria change on every row.
Moving average is an example of a query that falls into this category. Execution time may increase with the square of the rows. To avoid this, choose a database engine that can perform indexed look-ups on dependent selects. I find Postgres works effectively for this problem.
In MySQL 8, a window function frame can be used to obtain the averages.
SELECT value_column1, AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING) as ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
This calculates the average of the current row and 19 preceding rows.
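If the first 19 rows should not report a partial average, a variant of my own (not from the original answer) uses a named window and nulls them out by counting the rows actually in the frame:

SELECT value_column1,
       CASE WHEN COUNT(*) OVER w = 20
            THEN AVG(value_column1) OVER w
       END AS ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
WINDOW w AS (ORDER BY datetime_column1 ROWS 19 PRECEDING)
ORDER BY datetime_column1;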