Window, Partition by in HIVE to get average 7-day temperatures


I have a dataset that has multiple temperature readings per day. I am looking to return the hottest 7-day period by average temperature.
DROP TABLE IF EXISTS oshkosh;
CREATE EXTERNAL TABLE IF NOT EXISTS oshkosh(year STRING, month STRING, day STRING, time STRING, temp FLOAT, dewpoint FLOAT, humidity INT, sealevel FLOAT, visibility FLOAT, winddir STRING, windspeed FLOAT, gustspeed FLOAT, precip FLOAT, events STRING, condition STRING, winddirdegrees INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/maria_dev/final/Oshkosh' tblproperties ("skip.header.line.count"="1");
SELECT o.theDate
,AVG(o.temp) over (order by o.theDate range INTERVAL 6 DAY preceding) AS Average
FROM
(SELECT CAST(to_date(from_unixtime(UNIX_TIMESTAMP(CONCAT(year,'-',IF(LENGTH(month)=1,CONCAT(0,month),month),'-',IF(LENGTH(day)=1,CONCAT(0,day),day)),'yyyy-MM-dd'))) as timestamp) as theDate
,temp AS temp
FROM oshkosh
WHERE temp != -9999) as o
This returns the error:
Error while compiling statement: FAILED: ParseException line 2:38 cannot recognize input near 'range' 'INTERVAL' '6' in window_value_expression
I'm not sure whether I want o.theDate to be a timestamp, because INTERVAL 6 DAY preceding may not land on a new day: there are 28 temperature readings for the first day of the dataset and 44 for the second, and the count varies from day to day.

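The parser in your Hive version is rejecting the INTERVAL literal inside RANGE. Order by a numeric value instead and give the offset in the same units (seconds here). Because theDate is truncated to midnight, every reading from the same day shares one key, so the varying number of readings per day is not a problem. Try:
SELECT
o.theDate,
-- 518400 s = 6 days: the current day plus the 6 preceding days
AVG(o.temp) over (order by unix_timestamp(o.theDate) range between 518400
preceding and current row) AS Average
FROM
(
SELECT
CAST(to_date(CONCAT(year,'-',IF(LENGTH(month)=1,CONCAT(0,month),month),
'-',IF(LENGTH(day)=1,CONCAT(0,day),day))) AS TIMESTAMP) as theDate,
temp AS temp
FROM oshkosh
WHERE temp != -9999
) as o
To find the hottest period, order the result by Average descending and keep the first row.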
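As a rough illustration of what a RANGE frame keyed on seconds selects, here is a minimal Python sketch. The sample readings and the helper name `trailing_average` are hypothetical, and the day-number key merely stands in for `unix_timestamp(theDate)`. It shows that an offset of 6 × 86400 seconds over midnight-aligned keys covers seven calendar days, however many readings each day has.

```python
from datetime import date

SECONDS_PER_DAY = 86400

def trailing_average(readings, window_days=7):
    """Map each day to the mean of all readings in that day and the
    preceding window_days - 1 days. A day may appear many times."""
    offset = (window_days - 1) * SECONDS_PER_DAY  # 518400 for 7 days
    # Midnight-aligned seconds key (stand-in for unix_timestamp(theDate)).
    keyed = [(d.toordinal() * SECONDS_PER_DAY, t) for d, t in readings]
    result = {}
    for d, _ in readings:
        key = d.toordinal() * SECONDS_PER_DAY
        # Same rule as RANGE BETWEEN offset PRECEDING AND CURRENT ROW.
        in_frame = [t for k, t in keyed if key - offset <= k <= key]
        result[d] = sum(in_frame) / len(in_frame)
    return result

# Hypothetical data: two readings on Jan 1, then one per day for Jan 2..8.
readings = [(date(2020, 1, 1), 10.0), (date(2020, 1, 1), 20.0)]
readings += [(date(2020, 1, n), 30.0) for n in range(2, 9)]

averages = trailing_average(readings)
hottest_end = max(averages, key=averages.get)  # last day of the hottest window
```

Note that, like the SQL, this averages individual readings rather than daily means, so days with more readings carry more weight.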

Related

Hive Average over Fixed Date Interval for Each Entry

I'm trying to get a 7-day average for each user in Hive, and I'm wondering why this isn't working:
SELECT
date,
user_id,
AVG(value) OVER (
PARTITION BY user_id
ORDER BY unix_timestamp(ftime,'yyyyMMddHH')
RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW) AS value
FROM TABLE
It fails with:
Error in semantic analysis: line 1:205 Invalid Function 'yyyyMMddHH'

Repeat a record for every day between two dates BigQuery?

I am attempting to produce a table of historical unfulfilled units. Currently, the database captures fulfillment date and order date for a record.
CREATE TABLE `input_table`
(order_name STRING,
line_item_id STRING,
order_date DATE,
fulfillment_date DATE)
Sample Record:
order_name: ABC
line_item_id: 123456
order_date: 2017-04-19
fulfillment_date: 2017-04-25
I want to produce a table that shows the fulfillment status by day, starting with the order date and ending with the date prior to the fulfillment date of each line item. For the sample record above, output_table would contain one row per day from 2017-04-19 through 2017-04-24.
Ultimately, this would allow me to query the count of unfulfilled line items each day:
SELECT
date,
count(line_item_id) AS unfulfilled_line_items
FROM
`output_table`
GROUP BY 1
Indicating the fulfillment status is not strictly necessary, considering it would only include dates in which the status was unfulfilled.
While I could do something like this:
with days as (SELECT
*
FROM
UNNEST(GENERATE_DATE_ARRAY('2017-01-01', CURRENT_DATE(), INTERVAL 1 day)) AS day)
SELECT
*
FROM
`input_table`
JOIN days
ON 1=1
AND order_date <= day
AND fulfillment_date > day
..the operation is fairly expensive.
Is there a better way of going about this?

I want to produce a table that shows the fulfillment status by day, starting with the order date and ending with the date prior to the fulfillment date of each line item
Consider below
select date, order_name, line_item_id, 'unfulfilled' fulfillment_status
from `project.dataset.table`,
unnest(generate_date_array(order_date, fulfillment_date - 1)) date
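Applied to the sample record in your question, this returns one row per day from 2017-04-19 through 2017-04-24, each with fulfillment_status 'unfulfilled'.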
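The per-row expansion can be sketched outside BigQuery as well. The Python below is only an illustration of the same date arithmetic, using the sample record from the question. The helper name `unfulfilled_days` is made up for this example, and a missing fulfillment date is not handled.

```python
from datetime import date, timedelta

def unfulfilled_days(order_date, fulfillment_date):
    """List each date from order_date up to the day before fulfillment_date.
    Returns an empty list if the item was fulfilled on the order date."""
    span = (fulfillment_date - order_date).days
    return [order_date + timedelta(days=i) for i in range(span)]

# Sample record from the question.
days = unfulfilled_days(date(2017, 4, 19), date(2017, 4, 25))
rows = [(d.isoformat(), "ABC", "123456", "unfulfilled") for d in days]

for row in rows:
    print(row)
```

Each row only generates the days it actually needs, which is why this is cheaper than joining every order against a full calendar.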

How to get last one month's data from a table based on current month and year?

I am having a problem with some Hive code.
My source table is partitioned by month, year and day. I came up with the following code to get the data I need. The logic is: if the current month is 01, change the month to 12 and the year to yr - 1;
otherwise change the month to mth - 1 and keep the year as is.
set hivevar:yr=2019;
set hivevar:mth=03;
set hivevar:dy=29;
SELECT *
FROM table
WHERE
month = case when cast('${mth}' as int) = 01 then 12 else cast((cast('${mth}' as int) - 1) as string) end
AND year = case when cast('${mth}' as int) = 01 then cast((cast('${yr}' as int) - 1) as string) else '${yr}' end;
It is not working: my SELECT * returns no rows. Please help.

From what I understand, you are trying to get the previous month's data for a given date. If so, you can use the built-in date functions to do it.
select *
from table
where concat_ws('-',year,month,day) >= add_months(date_add(concat_ws('-','${yr}','${mth}','${dy}'),1-${dy}), -1)
and concat_ws('-',year,month,day) < date_add(concat_ws('-','${yr}','${mth}','${dy}'),1-${dy})
The solution assumes year, month and day are stored in the formats yyyy, MM and dd. If not, adjust them as needed.
Also, you should consider storing the date as a column even though the table is partitioned by year, month and day.
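The bounds that the two date expressions produce can be checked with a small Python sketch. This is only an illustration of the arithmetic, not Hive code, and the function name is hypothetical. It also shows the zero-padded month value; if the partitions really are stored as 'MM', a computed '2' would never match '02', which is one plausible reason for an empty result.

```python
from datetime import date, timedelta

def previous_month_bounds(yr, mth, dy):
    """Return (start, end) of the previous month: start inclusive, end exclusive."""
    first_of_current = date(yr, mth, dy).replace(day=1)     # date_add(d, 1 - dy)
    last_of_previous = first_of_current - timedelta(days=1)
    return last_of_previous.replace(day=1), first_of_current  # add_months(..., -1)

start, end = previous_month_bounds(2019, 3, 29)

# Zero-padded partition values, assuming year and month are stored as 'yyyy' and 'MM'.
partition_year = f"{start.year:04d}"
partition_month = f"{start.month:02d}"
print(start, end, partition_year, partition_month)
```

The January case rolls over to December of the previous year without any special handling.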

Averaging event start time from DateTime column

I'm calculating average start times from events that run late at night and may not start until the next morning.
2018-01-09 00:01:38.000
2018-01-09 23:43:22.000
Currently, all I can produce is an average of 11:52:30.0000000.
I would like the result to be about 23:52.
The set of times being averaged will not remain static, since this event runs daily and new data arrives every day. I will likely average the 10 most recent records.

It would help to see the SQL you're running, but you probably just need to format your output properly, with something like this:
FORMAT(cast(<your column> as time), N'hh\:mm(24h)')

The following will both compute the average across the datetime field and also return the result as a 24hr time notation only.
SELECT CAST(CAST(AVG(CAST(<YourDateTimeField_Here> AS FLOAT)) AS DATETIME) AS TIME) [AvgTime] FROM <YourTableContaining_DateTime>

The following will calculate the average time of day, regardless of what day that is.
--SAMPLE DATA
create table #tmp_sec_dif
(
sample_date_time datetime
)
insert into #tmp_sec_dif
values ('2018-01-09 00:01:38.000')
, ('2018-01-09 23:43:22.000')
--ANSWER
declare @avg_sec_dif int
set @avg_sec_dif =
(select avg(a.sec_dif) as avg_sec_dif
from (
--put the value in terms of seconds away from 00:00:00
--where 23:59:00 would be -60 and 00:01:00 would be 60
select iif(
datepart(hh, sample_date_time) < 12 --is it morning?
, datediff(s, '00:00:00', cast(sample_date_time as time)) --if morning
, datediff(s, '00:00:00', cast(sample_date_time as time)) - 86400 --if evening
) as sec_dif
from #tmp_sec_dif
) as a
)
select cast(dateadd(s, @avg_sec_dif, '00:00:00') as time) as avg_time_of_day
The output would be an answer of 23:52:30.0000000
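To see the arithmetic behind that result, here is a minimal Python sketch of the same noon-split idea, using the two sample timestamps. The function name is hypothetical, and unlike the integer AVG in the T-SQL it rounds to the nearest second.

```python
from datetime import datetime, time

SECONDS_PER_DAY = 86400

def average_time_of_day(stamps):
    """Average times of day, counting times from noon onward as seconds before midnight."""
    offsets = []
    for ts in stamps:
        secs = ts.hour * 3600 + ts.minute * 60 + ts.second
        # Morning: positive seconds after midnight.
        # Afternoon and evening: negative seconds before midnight.
        offsets.append(secs if ts.hour < 12 else secs - SECONDS_PER_DAY)
    # Wrap the (possibly negative) mean back into a single day.
    mean = round(sum(offsets) / len(offsets)) % SECONDS_PER_DAY
    return time(mean // 3600, (mean % 3600) // 60, mean % 60)

stamps = [datetime(2018, 1, 9, 0, 1, 38), datetime(2018, 1, 9, 23, 43, 22)]
avg = average_time_of_day(stamps)
print(avg)
```

The two offsets are 98 and -998 seconds, so the mean is -450 seconds, i.e. 7.5 minutes before midnight.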

This code lets you define the hour at which a new day starts; for example, 18 means 6 pm. The time calculation is then based on seconds elapsed since 6 pm.
-- Defines the hour of the day when a new day starts
DECLARE @DayDivision INT = 18
IF OBJECT_ID(N'tempdb..#StartTimes') IS NOT NULL DROP TABLE #StartTimes
CREATE TABLE #StartTimes(
start DATETIME NOT NULL
)
INSERT INTO #StartTimes
VALUES
('2018-01-09 00:01:38.000')
,('2018-01-09 23:43:22.000')
SELECT
-- 3. Add the number of seconds to a day starting at the
-- day division hour, then extract the time portion
CAST(DATEADD(SECOND,
-- 2. Average number of seconds
AVG(
-- 1. Get the number of seconds from the day division point (@DayDivision)
DATEDIFF(SECOND,
CASE WHEN DATEPART(HOUR,start) < @DayDivision THEN
SMALLDATETIMEFROMPARTS(YEAR(DATEADD(DAY,-1,start)),MONTH(DATEADD(DAY,-1,start)),DAY(DATEADD(DAY,-1,start)),@DayDivision,0)
ELSE
SMALLDATETIMEFROMPARTS(YEAR(start),MONTH(start),DAY(start),@DayDivision,0)
END
,start)
)
,'01 jan 1900 ' + CAST(@DayDivision AS VARCHAR(2)) + ':00') AS TIME) AS AverageStartTime
FROM #StartTimes

Conversion failed when converting date and/or time from character string in SQL

I have the following columns in my table:
year decimal(4,0)
month decimal(2,0)
day decimal(2,0)
and I am trying to convert them as below:
SELECT (CAST (CAST(year AS varchar(4))
+CAST(month AS varchar(2))
+CAST(day AS varchar(2)
) AS date ) ) AS xDate
FROM table
ORDER BY xDate DESC
But I am getting this error:
Conversion failed when converting date and/or time from character string.

Your approach does not take into account that month or day can be a single-digit value. If you had a row like this:
year month day
---- ----- ---
2014 7 18
your method would concatenate the parts as 2014718 and the subsequent attempt to convert that into a date would result in the error in question.
You probably meant to combine the parts like this: 20140718, which would convert to date without issues. To me, the easiest way to get to that format from numerical parts is to go like this:
calculate year * 10000 + month * 100 + day;
convert the previous result to char(8);
convert the string to date.
So, in Transact-SQL it would be like this:
CAST(CAST(year * 10000 + month * 100 + day AS char(8)) AS date)
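The difference between the two concatenations is easy to check outside the database. The Python sketch below is only an illustration of the arithmetic, with a hypothetical helper name, using the example row above.

```python
from datetime import datetime

def parts_to_date(year, month, day):
    """Pack numeric parts into yyyymmdd, then parse that 8-digit string."""
    packed = year * 10000 + month * 100 + day   # 2014, 7, 18 -> 20140718
    return datetime.strptime(str(packed), "%Y%m%d").date()

# Plain string concatenation loses the leading zero of the month.
naive = str(2014) + str(7) + str(18)   # 7 characters, not a valid yyyymmdd
fixed = parts_to_date(2014, 7, 18)
print(naive, fixed)
```

Multiplying by powers of ten reserves two digits each for month and day, so no explicit padding is needed. Impossible dates such as 30 February still fail to parse, which is the desired behaviour.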
On a different note, I cannot possibly know whether you really need to store your dates split like that, of course, but if you do insist on doing so, at least consider using integer type(s) for the parts. I realise that decimal(2,0) may serve as some kind of constraint for you that makes sure that you cannot have month or day values with more than 2 digits, but it still does not protect you from having invalid months or days. And another major point is decimal(2,0) requires more storage space than even int, let alone smallint or tinyint.
So, this would seem fine to me
year int,
month int,
day int
but if you are into saving the storage space as much as possible, you could also try this:
year smallint,
month tinyint,
day tinyint
Finally, to make sure you cannot have invalid values in those columns, you could add a check constraint like below:
ALTER TABLE tablename
ADD CONSTRAINT CK_tablename_yearmonthday CHECK
(ISDATE(CAST(year * 10000 + month * 100 + day AS char(8))) = 1)
;

This issue occurs because you have invalid data in your table (there is nothing wrong with your query). Check the following in your table:
1) Values in the day column must not exceed the number of days in the associated month (e.g. month 2 must not have day 30 or 31).
2) Values in the month column must not be greater than 12 or equal to 0.
