Hive table with multiple partitions - hive

I have a table (data_table) with multiple partition columns year/month/monthkey.
Directories look something like year=2017/month=08/monthkey=2017-08/files.parquet
Which of the below queries would be faster?
select count(*) from data_table where monthkey='2017-08'
or
select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'
I think the initial time taken by hadoop take to find the required directories in the first case would be more. But want to confirm

Finding the relevant partitions is a metastore operation and not a file system operation.
It is done by querying the metasore and not by scanning the directories.
The metasore query of the first use-case will most likely be faster than the second use-case but in any case we are talking here on fractions of a second.
Demo
create external table t100k(i int)
partitioned by (x int,y int,xy string)
;
explain dependency select count(*) from t100k where xy='100-1000';
The query that was issued against the metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))
explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';
The query that was issued against the metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100)
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )

Since comment will change the formatting, hence posting here.
Kindly accept #Dudu's reply. Please execute the below on metastore DB (mysql in my case):
mysql> select part_id, location, tbl_id, part_name from PARTITIONS as P inner join SDS as S on P.SD_ID = S.SD_ID where P.TBL_ID = 472;
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| part_id | location | tbl_id | part_name |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
| 7 | hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01 | 472 | year=2011/month=1/year_month=2011-01 |
| 9 | hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01 | 472 | year=2012/month=1/year_month=2012-01 |
+---------+-------------------------------------------------------------------------+--------+--------------------------------------+
2 rows in set (0.00 sec)
The location from both the queries will pull data from same hdfs directory.
The only difference in speed will be from the metastore DB query that is already explained in Dudu's answer.

Related

T-SQL subselect statement is returning all rows instead of limiting to 1 based on subselect

I am trying to return just the first row where the BLOCK_STOP_ORDER = 2. What is wrong with my SQL? Why isn't WHERE SCHEDULE.BLOCK_STOP_ORDER = (SELECT MIN(S1.BLOCK_STOP_ORDER....
working? When I run the subselect on its own it returns the value '2' - doesn't that mean it should then limit the query result to only the row(s) where BLOCK_STOP_ORDER = 2?
SELECT ROUTE.ROUTE_ABBR, SCHEDULE.ROUTE_DIRECTION_ID, SCHEDULE.PATTERN_ID, SCHEDULE.BLOCK_STOP_ORDER,
SCHEDULE.SCHEDULED_TIME, GEO_NODE.GEO_NODE_ABBR, TRIP.TRIP_SEQUENCE AS TPST
FROM SCHEDULE
INNER JOIN GEO_NODE ON SCHEDULE.GEO_NODE_ID = GEO_NODE.GEO_NODE_ID
INNER JOIN ROUTE ON SCHEDULE.ROUTE_ID = ROUTE.ROUTE_ID
INNER JOIN TRIP ON SCHEDULE.TRIP_ID = TRIP.TRIP_ID
WHERE (SCHEDULE.CALENDAR_ID = '120221024') AND ROUTE.ROUTE_ABBR = '001'
AND SCHEDULE.ROUTE_DIRECTION_ID = '2' AND SCHEDULE.PATTERN_ID = '270082'
AND TRIP.TRIP_SEQUENCE = '18600'
AND SCHEDULE.BLOCK_STOP_ORDER =
(SELECT MIN(S1.BLOCK_STOP_ORDER)
FROM SCHEDULE S1
WHERE SCHEDULE.CALENDAR_ID = S1.CALENDAR_ID
AND SCHEDULE.ROUTE_ID = S1.ROUTE_ID
AND SCHEDULE.ROUTE_DIRECTION_ID = S1.ROUTE_DIRECTION_ID
AND SCHEDULE.PATTERN_ID = S1.PATTERN_ID
AND SCHEDULE.SCHEDULED_TIME = S1.SCHEDULED_TIME
AND SCHEDULE.GEO_NODE_ID = S1.GEO_NODE_ID
AND SCHEDULE.BLOCK_STOP_ORDER = S1.BLOCK_STOP_ORDER
AND SCHEDULE.TRIP_ID = S1.TRIP_ID
)
GROUP BY ROUTE.ROUTE_ABBR, SCHEDULE.ROUTE_DIRECTION_ID,
SCHEDULE.PATTERN_ID, SCHEDULE.SCHEDULED_TIME,
GEO_NODE.GEO_NODE_ABBR, SCHEDULE.BLOCK_STOP_ORDER, TRIP.TRIP_SEQUENCE
ORDER BY ROUTE.ROUTE_ABBR, SCHEDULE.ROUTE_DIRECTION_ID, TRIP.TRIP_SEQUENCE
Results:
ROUTE_ABBR
ROUTE_DIRECTION_ID
PATTERN_ID
BLOCK_STOP_ORDER
SCHEDULED_TIME
GEO_NODE_ABBR
TPST
001
2
270082
2
18600
1251
18600
001
2
270082
3
18600
1346
18600
001
2
270082
5
18720
1123
18600
001
2
270082
6
18720
11372
18600
001
2
270082
4
18720
1570
18600
001
2
270082
8
18780
11373
18600
This is probably better solved with the row_number() windowing function:
SELECT *
FROM (
SELECT DISTINCT r.ROUTE_ABBR, s.ROUTE_DIRECTION_ID, s.PATTERN_ID, s.BLOCK_STOP_ORDER,
s.SCHEDULED_TIME, g.GEO_NODE_ABBR, t.TRIP_SEQUENCE AS TPST,
row_number() over (order by SCHEDULE.BLOCK_STOP_ORDER) rn
FROM SCHEDULE s
INNER JOIN GEO_NODE g ON s.GEO_NODE_ID = g.GEO_NODE_ID
INNER JOIN ROUTE r ON s.ROUTE_ID = r.ROUTE_ID
INNER JOIN TRIP t ON s.TRIP_ID = t.TRIP_ID
WHERE s.CALENDAR_ID = '120221024' AND r.ROUTE_ABBR = '001'
AND s.ROUTE_DIRECTION_ID = '2' AND s.PATTERN_ID = '270082'
AND t.TRIP_SEQUENCE = '18600'
) t1
WHERE rn=1
ORDER BY t1.ROUTE_ABBR, t1.ROUTE_DIRECTION_ID, t1.TRIP_SEQUENCE
The problem with the original is the name SCHEDULE. For the full version of the query, the subquery is matching the name in the nested select with the instance of the table from the outer select. This correlates the results of the inner table with the outer, so only the item from that row of the outer table is eligible.
When you run the inner query by itself, separate from the outer query, there is only the one instance of the table. In that situation the WHERE conditions are matching the table to itself — they are always true — and you just get the smallest value of all the rows: 2.
This is why you should ALWAYS give ALL the tables in your queries an alias, and ONLY reference them by that alias (as I did in my answer). Do this, and the MIN() version can work... but will still be slower and more code than using row_number().
Finally, the use of DISTINCT / GROUP BY with every SELECT column is usually an indicator you don't fully understand the JOIN relationships used in the query, and in at least one case the join conditions are not sufficiently selective. I'd hesitate to move a query like that to production, even if it seems to be working, though I confess most of us have done it at some point anyway.

Group By Dynamic Ranges in SQL (cockroachdb/postgres)

I have a query that looks like
select s.session_id, array_agg(sp.value::int8 order by sp.value::int8) as timestamps
from sessions s join session_properties sp on sp.session_id = s.session_id
where s.user_id = '6f129b1c-43a6-4871-86f6-1749bfe1a5af' and sp.key in ('SleepTime', 'WakeupTime') and value != 'None' and value::int8 > 0
group by s.session_id
The result would look like
f321c813-7927-47aa-88c3-b3250af34afa | {1588499070,1588504354}
f38a8841-c402-433d-939d-194eca993bb6 | {1588187599,1588212803}
2befefaf-3b31-46c9-8416-263fa7b9309d | {1589912247,1589935771}
3da64787-65cd-4305-b1ac-1393e2fb11a9 | {1589741569,1589768453}
537e69aa-c39d-484d-9108-2f2cd956d4ee | {1588100398,1588129026}
5a9470ff-f930-491f-a57d-8c089e535d53 | {1589140368,1589165092}
The first column is a unique id and the second column is from and to timestamps.
Now I have a third table which has some timeseries data
records
------------------------
timestamp | name | value
Is it possible to find avg(value) from from records in group of session_ids over the from and to timestamps.
I could run a for loop in the application and do a union to get the desired result. But I was wondering if that is possible in postgres or cockroachdb
I wouldn't aggregate the two values but use two joins to find them. That way you can be sure which value belongs to which property.
Once you have that, you can join that result to your records table.
with ranges as (
select s.session_id, st.value as from_value, wt.value as to_value
from sessions s
join session_properties st on sp.session_id = s.session_id and st.key = 'SleepTime'
join session_properties wt on wt.session_id = s.session_id and wt.key = 'WakeupTime'
where s.user_id = '6f129b1c-43a6-4871-86f6-1749bfe1a5af'
and st.value != 'None' and wt.value::int8 > 0
and wt.value != 'None' and wt.value::int8 > 0
)
select ra.session_id, avg(rc.value)
from records rc
join ranges ra
on ra.from_value >= rc.timewstamp
and rc.timestamp < ra.to_value
group by ra.session_id;

SQL - How to find entries where a value is missing between two rows

We are using Presto SQL at my job. I have spent hours trying to search for the answer to this question but can't find an answer and it's quite difficult to search for. Solving this issue opens the door to fixing a lot of problems.
I need to write a query that tries to find all entries where REQUEST_CANCEL & CHARGED exist but CANCEL_ACCOUNT is missing.
CHARGED & CANCEL_ACCOUNT should always come after REQUEST_CANCEL.
Table Name: CUSTOMER_INFO
|DATE_TIME|CUST_ID |ACTION |
|20180726 |1234 |CHARGED |
|20180726 |1234 |CANCEL_ACCOUNT|
|20180726 |1234 |REQUEST_CANCEL|
All these values exist in the same table. Here's what I have so far.
SELECT *
FROM
(SELECT *
FROM CUSTOMER_INFO
WHERE
DATE_TIME = 20180726
AND ACTION = REQUEST_CANCEL) as a
JOIN
(SELECT *
FROM CUSTOMER_INFO
WHERE
DATE_TIME = 20180726
AND ACTION = CHARGED) as b
ON a.CUST_ID = b.CUST_ID
WHERE
a.TIME < b.TIME
Let me explain it in a way that makes sense.
A = REQUEST_CANCEL
B = CANCEL_ACCOUNT
C = CHARGED
How do you query for when A and C exist but B is missing. The sequence needs to be exact A > B > C. It's essentially querying for something that doesn't exist between two values that do exist. In my current query, B can be returned between the two values and that's NOT what I want.
I think you're searching for NOT EXISTS and a corellated subquery.
SELECT *
FROM (SELECT *
FROM customer_info
WHERE action = 'REQUEST_CANCEL') rc
INNER JOIN (SELECT *
FROM customer_info
WHERE action = 'CHARGED') c
ON c.cust_id = rc.cust_id
AND c.date_time >= rc.date_time
WHERE NOT EXISTS (SELECT *
FROM customer_info ca
WHERE ca.cust_id = rc.cust_id
AND ca.action = 'CANCEL_ACCOUNT'
AND ca.date_time >= rc.date_time
AND ca.date_time <= c.date_time);
Use group by and having:
select cust_id
from customer_info ci
where date_time = 20180726 and
action in ('REQUEST_CANCEL', 'CHARGED', 'CANCEL_ACCOUNT')
group by cust_id
having sum(case when action = 'REQUEST_CANCEL' then 1 else 0 end) > 0 and
sum(case when action = 'CHARGED' then 1 else 0 end) > 0 and
sum(case when action = 'CANCEL_ACCOUNT' then 1 else 0 end) = 0 ;
Each sum() counts the number of matching records for the customer with that action. The > 0 says that one exists. The = 0 says that none exist.
The database doesn't matter for this logic. Here is a SQL Fiddle using MySQL.

Query table with all names in one statement

I have a table that I cannot alter and I am trying to wrap my head around an appropriate way to query this table in one statement based on one given value.
Roughly here's what the table looks like:
What I'm after is a way to query the table using only the SubBody code in order to get all names associated with it.
for example,
<query> where SubBody = '1001'
returns
| HName | HSName | BName | BSName |
+-----------------------------------+
| Toys | Sport | Ball | Baseball |
SELECT Head.Name as HName ,
SubHead.Name as HSName ,
Body.Name as BName ,
SubBody.Name as BSName
FROM yourTable as SubBody
JOIN yourTable as Body
ON SubBody.Body = Body.Body
AND Body.SubBody IS NULL
JOIN yourTable as SubHead
ON Body.SubHead = SubHead.SubHead
AND SubHead.Body IS NULL
JOIN yourTable as Head
ON SubHead.Head = Head.Head
AND Head.SubHead IS NULL
WHERE SubBody.SubBody = '1001'
Although you can express the logic as joins, for some reason correlated subqueries come to mind first:
select (select th.name from t th where th.head = t.head and th.subhead is null) as hname,
(select ts.name from t ts where ts.head = t.head and ts.subhead = t.subhead and t.body is null) as sname,
(select tb.name from t tb where tb.head = t.head and tb.subhead = t.subhead and tb.body = t.body and tb.subbody is null) as bname,
t.name as bsname
from t
where t.subbody = 1001
This answer is close to what Juan Carlos has posted. Which one you like depends a bit on your style.
WITH BaseRecord AS(
SELECT *
FROM ToyTable
WHERE SubBody = '1001'
)
SELECT
h.Name AS HName,
hs.Name AS HSName,
b.Name AS BName,
bs.Name AS BSName
FROM BaseRecord bs
INNER JOIN ToyTable b ON bs.Body = b.Body AND ISNULL(b.SubBody,'') = ''
INNER JOIN ToyTable hs ON bs.SubHead = hs.SubHead AND ISNULL(hs.Body,'') = ''
INNER JOIN ToyTable h ON bs.Head = h.Head AND ISNULL(h.SubHead,'') = ''
I'm not sure whether your "empty" cells contain nulls or empty strings. Either way, both are taken into account here.

Difference Max and Min from Different Dates

I'm going to try to explain this the best I can.
The code below does the following:
Finds a service address from the ServiceLocation table.
Finds a service type (electric or water).
Finds how many days in the past to pull data.
Once it has this, it calculates the "daily usage" by subtracting the max meter read for a day from the minimum meter read for a day.
(MAX(mr.Reading) - MIN(mr.Reading)) AS 'DaytimeUsage'
However, what I'm missing is the max reading from the day prior and the minimum reading from the current day. Mathematically, this should look something like this:
MAX(PriorDayReading) - MIN(ReadDateReading)
Essentially, if it goes back 5 days it should kick out a table that reads as follows:
Service Location | Read Date | Usage |
123 Main St | 4/20/15 | 12 |
123 Main St | 4/19/15 | 8 |
123 Main St | 4/18/15 | 6 |
123 Main St | 4/17/15 | 10 |
123 Main St | 4/16/15 | 11 |
Where "Usage" is the 'DaytimeUsage' + usage that I'm missing (and the question above). For example, 4/18/15 would be the 'DaytimeUsage' in the query below PLUS the the difference between the MAX read from 4/17/15 and the MIN read from 4/18/15.
I'm not sure how to accomplish this or if it is possible.
SELECT
A.ServiceAddress AS 'Service Address',
convert(VARCHAR(10),A.ReadDate,101) AS 'Date',
SUM(A.[DaytimeUsage]) AS 'Usage'
FROM
(
SELECT
sl.location_addr AS 'ServiceAddress',
convert(VARCHAR(10),mr.read_date,101) AS 'ReadDate',
(MAX(mr.Reading) - MIN(mr.Reading)) AS 'DaytimeUsage'
FROM
DimServiceLocation AS sl
INNER JOIN FactBill AS fb ON fb.ServiceLocationKey = sl.ServiceLocationKey
INNER JOIN FactMeterRead as mr ON mr.ServiceLocationKey = sl.ServiceLocationKey
INNER JOIN DimCustomer AS c ON c.CustomerKey = fb.CustomerKey
WHERE
c.class_name = 'Tenant'
AND sl.ServiceLocationKey = #ServiceLocation
AND mr.meter_type = #ServiceType
GROUP BY
sl.location_addr,
convert(VARCHAR(10),
mr.read_date,101)
) A
WHERE A.ReadDate >= GETDATE()-#Days
GROUP BY A.ServiceAddress, convert(VARCHAR(10),A.ReadDate,101)
ORDER BY convert(VARCHAR(10),A.ReadDate,101) DESC
It seems like you could solve this by just calculating the difference between the MAX of yesterday & today, however this is how I would approach it. Join to the same table again for the previous day relative to any given day, and select the Max/Min for that too within your inner query. Also if you place the date in the inner query where clause the data set you return will be quicker & smaller.
SELECT
A.ServiceAddress AS 'Service Address',
convert(VARCHAR(10),A.ReadDate,101) AS 'Date',
SUM(A.[TodayMax]) - SUM(A.[TodayMin]) AS 'Usage',
SUM(A.[TodayMax]) - SUM(A.[YesterdayMax]) AS 'Usage with extra bit you want'
FROM
(
SELECT
sl.location_addr AS 'ServiceAddress',
convert(VARCHAR(10),mr.read_date,101) AS 'ReadDate',
MAX(mrT.Reading) AS 'TodayMax',
MIN(mrT.Reading) AS 'TodayMin',
MAX(mrY.Reading) AS 'YesterdayMax',
MIN(mrY.Reading) AS 'YesterdayMin',
FROM
DimServiceLocation AS sl
INNER JOIN FactBill AS fb ON fb.ServiceLocationKey = sl.ServiceLocationKey
INNER JOIN FactMeterRead as mrT ON mrT.ServiceLocationKey = sl.ServiceLocationKey
INNER JOIN FactMeterRead as mrY ON mrY.ServiceLocationKey = s1.ServiceLocationKey
AND mrY.read_date = mrT.read_date -1)
INNER JOIN DimCustomer AS c ON c.CustomerKey = fb.CustomerKey
WHERE
c.class_name = 'Tenant'
AND sl.ServiceLocationKey = #ServiceLocation
AND mr.meter_type = #ServiceType
AND convert(VARCHAR(10), mrT.read_date,101) >= GETDATE()-#Days
GROUP BY
sl.location_addr,
convert(VARCHAR(10),
mr.read_date,101)
) A
GROUP BY A.ServiceAddress, convert(VARCHAR(10),A.ReadDate,101)
ORDER BY convert(VARCHAR(10),A.ReadDate,101) DESC
You can use the APPLY operator if you are above sql server 2005. Here is a link to the documentation. https://technet.microsoft.com/en-us/library/ms175156(v=sql.105).aspx The APPLY operation comes in two forms OUTER APPLY AND CROSS APPLY - OUTER works like a left join and CROSS works like an inner join. They let you run a query once for each row returned. I setup my own sample of what you were trying to do, here it is and I hope it helps.
http://sqlfiddle.com/#!6/fdb3f/1
CREATE TABLE SequencedValues (
Location varchar(50) NOT NULL,
CalendarDate datetime NOT NULL,
Reading int
)
INSERT INTO SequencedValues (
Location,
CalendarDate,
Reading
)
SELECT
'Address1',
'4/20/2015',
10
UNION SELECT
'Address1',
'4/19/2015',
9
UNION SELECT
'Address1',
'4/19/2015',
20
UNION SELECT
'Address1',
'4/19/2015',
25
UNION SELECT
'Address1',
'4/18/2015',
8
UNION SELECT
'Address1',
'4/17/2015',
7
UNION SELECT
'Address2',
'4/20/2015',
100
UNION SELECT
'Address2',
'4/20/2015',
111
UNION SELECT
'Address2',
'4/19/2015',
50
UNION SELECT
'Address2',
'4/19/2015',
65
SELECT DISTINCT
sv.Location,
sv.CalendarDate,
sv_dayof.MINDayOfReading,
sv_daybefore.MAXDayBeforeReading
FROM SequencedValues sv
OUTER APPLY (
SELECT MIN(sv_dayof_inside.Reading) AS MINDayOfReading
FROM SequencedValues sv_dayof_inside
WHERE sv.Location = sv_dayof_inside.Location
AND sv.CalendarDate = sv_dayof_inside.CalendarDate
) sv_dayof
OUTER APPLY (
SELECT MAX(sv_daybefore_max.Reading) AS MAXDayBeforeReading
FROM SequencedValues sv_daybefore_max
WHERE sv.Location = sv_daybefore_max.Location
AND sv_daybefore_max.CalendarDate IN (
SELECT TOP 1 sv_daybefore_inside.CalendarDate
FROM SequencedValues sv_daybefore_inside
WHERE sv.Location = sv_daybefore_inside.Location
AND sv.CalendarDate > sv_daybefore_inside.CalendarDate
ORDER BY sv_daybefore_inside.CalendarDate DESC
)
) sv_daybefore
ORDER BY
sv.Location,
sv.CalendarDate DESC
I'm not sure I full understood your db structure but I may have a solution so feel free to edit my answer to adapt or correct any mistake.
The idea is to use two aliases for the table FactMeterRead. mrY (Y as yesterday) and mrT (T as Today). And differentiate them with a read_date restriction.
However I didn't understand enough your tables to write a fully functional query. I hope you will get the idea anyway with this example.
SELECT
sl.location_addr AS 'ServiceAddress',
convert(VARCHAR(10),mrT.read_date,101) AS 'ReadDate',
(MAX(mrY.Reading) - MIN(mrT.Reading)) AS 'DaytimeUsage'
FROM
DimServiceLocation AS sl
INNER JOIN FactMeterRead as mrY ON mrY.ServiceLocationKey = sl.ServiceLocationKey
INNER JOIN FactMeterRead as mrT ON mrT.ServiceLocationKey = sl.ServiceLocationKey
WHERE mrY.read_date=DATE_SUB(mrT.read_date,1 DAY)