Given a table with entries at irregular timestamps, "breaks" must be inserted at regular 5 minute intervals (the associated data can/will be NULL).
I was thinking of getting the start time and making a subquery with a window function that adds 5 minute intervals to the start time, but the only way I could think of to increment the values was row_number.
WITH data as (
    select id, data,
           cast(date_and_time as double) * 1000 as time_milliseconds
    from t1
), -- original data
start_times as (
    select id, MIN(cast(date_and_time as double) * 1000) as start_time
    from t1
    GROUP BY id
), -- first timestamp for each id
boundries as (
    SELECT t1.id,
           (row_number() OVER (PARTITION BY t1.id ORDER BY t1.date_and_time) - 1) * 300000 + start_times.start_time as boundry
    from t1
    INNER JOIN start_times ON start_times.id = t1.id
) -- increment the number of 5 min intervals added on each row and later full join the boundries table with the original data
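For completeness, the full join I was planning to do afterwards would look roughly like this sketch (not a working solution, just to show the intent):
-- sketch: rows present only in boundries end up with NULL data
SELECT COALESCE(d.id, b.id)                     AS id,
       d.data                                   AS data,              -- NULL for the inserted breaks
       COALESCE(d.time_milliseconds, b.boundry) AS time_milliseconds
FROM data d
FULL OUTER JOIN boundries b
  ON b.id = d.id
 AND b.boundry = d.time_milliseconds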
However, this limits me to the number of rows present for an id in the original data table; if the timestamps are spread out, there are not enough rows to cover the number of 5 minute intervals that need to be added.
sample data:
initial data:
| id | value | timestamp    |
|----|-------|--------------|
| 1  | 3     | 12:00:01.011 |
| 1  | 4     | 12:03:30.041 |
| 1  | 5     | 12:12:20.231 |
| 1  | 3     | 15:00:00.312 |
data after my query:
| id | value | timestamp (UNIX) |
|----|-------|------------------|
| 1  | 3     | 12:00:01         |
| 1  | 4     | 12:03:30         |
| 1  | NULL  | 12:05:01         | <-- Data from "boundries"
| 1  | NULL  | 12:10:01         | <-- Data from "boundries"
| 1  | 5     | 12:12:20         |
| 1  | NULL  | 12:15:01         | <-- Data from "boundries"
| 1  | NULL  | 12:20:01         | <-- Data from "boundries"
| 1  | 3     | 15:00:00         | <-- Jumping directly to 15:00:00 (WRONG! :( I need to insert more 5 min breaks here)
I was thinking of creating a temporary table inside Hive and filling it with x rows representing 5 minute intervals from the start time to the end time of the data table, but I couldn't find any way of accomplishing that.
Is there any way of using "for loops"? Any suggestions would be appreciated.
Thanks
You can try calculating the difference between the current timestamp and the next one, dividing by 300 to get the number of ranges, producing a string of spaces with length = num_ranges, and exploding it to generate rows.
Demo:
with your_table as (--initial data example
select stack (3,
1,3 ,'2020-01-01 12:00:01.011',
1,4 ,'2020-01-01 12:03:30.041',
1,5 ,'2020-01-01 12:20:20.231'
) as (id ,value ,ts )
)
select id ,value, ts, next_ts,
diff_sec,num_intervals,
from_unixtime(unix_timestamp(ts)+h.i*300) new_ts, coalesce(from_unixtime(unix_timestamp(ts)+h.i*300),ts) as calculated_timestamp
from
(
select id ,value ,ts, next_ts, (unix_timestamp(next_ts)-unix_timestamp(ts)) diff_sec,
floor((unix_timestamp(next_ts)-unix_timestamp(ts))/300 --diff in seconds/5 min
) num_intervals
from
(
select id ,value ,ts, lead(ts) over(order by ts) next_ts
from your_table
) s
)s
lateral view outer posexplode(split(space(cast(s.num_intervals as int)),' ')) h as i,x --this will generate rows
Result:
id value ts next_ts diff_sec num_intervals new_ts calculated_timestamp
1 3 2020-01-01 12:00:01.011 2020-01-01 12:03:30.041 209 0 2020-01-01 12:00:01 2020-01-01 12:00:01
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:03:30 2020-01-01 12:03:30
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:08:30 2020-01-01 12:08:30
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:13:30 2020-01-01 12:13:30
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:18:30 2020-01-01 12:18:30
1 5 2020-01-01 12:20:20.231 \N \N \N \N 2020-01-01 12:20:20.231
Additional rows were added. I left all intermediate columns for debugging purposes.
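If you only need the final shape (NULL value on the generated break rows), a possible last step is the sketch below, based on the same technique; note that I also partitioned the lead() by id so that several ids do not interleave:
select id,
       case when coalesce(h.i, 0) = 0 then value end as value,   -- generated break rows get NULL
       coalesce(from_unixtime(unix_timestamp(ts) + h.i * 300), ts) as calculated_timestamp
from
(
 select id, value, ts,
        floor((unix_timestamp(next_ts) - unix_timestamp(ts)) / 300) num_intervals
 from
 (
  select id, value, ts, lead(ts) over(partition by id order by ts) next_ts
  from your_table
 ) s
) s
lateral view outer posexplode(split(space(cast(s.num_intervals as int)), ' ')) h as i, x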
A recursive query could be helpful here, but Hive does not support recursive CTEs.
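Just to illustrate the idea, in a dialect that does support recursion (PostgreSQL, for example) the interval generation could be sketched like this; it will not run on Hive:
-- Sketch only (PostgreSQL syntax): one row per 5 minute boundary
-- between each id's first and last timestamp
WITH RECURSIVE bounds AS (
    SELECT id, MIN(ts) AS boundary, MAX(ts) AS max_ts
    FROM your_table
    GROUP BY id
    UNION ALL
    SELECT id, boundary + INTERVAL '5 minutes', max_ts
    FROM bounds
    WHERE boundary + INTERVAL '5 minutes' < max_ts
)
SELECT id, boundary FROM bounds ORDER BY id, boundary;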
You may consider creating the table outside of Hive or writing a UDF.
Either way, this query can be expensive, and the use of materialized views/tables is recommended depending on your query frequency.
The example shows a UDF inbetween created using pyspark to run the query. It:
1. generates the values in between the min and max timestamp from the dataset, using CTEs and the UDF to create a temporary table intervals
2. generates all possible intervals using an expensive cross join in possible_records
3. uses a left join to retrieve the records with actual values (for demonstration purposes I've represented the timestamp value as just the time string)
The code below shows how it was evaluated using Hive.
Example Code
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType,ArrayType
inbetween = lambda min_value,max_value : [*range(min_value,max_value,5*60)]
udf_inbetween = udf(inbetween,ArrayType(IntegerType()))
sqlContext.udf.register("inbetween",udf_inbetween)
sqlContext.sql("""
WITH max_timestamp(t) as (
select max(timestamp) as t from initial_data2
),
min_timestamp(t) as (
select min(timestamp) as t from initial_data2
),
intervals as (
select explode(inbetween(unix_timestamp(mint.t),unix_timestamp(maxt.t))) as interval_time FROM
min_timestamp mint, max_timestamp maxt
),
unique_ids as (
select distinct id from initial_data2
),
interval_times as (
select interval_time from (
select
cast(from_unixtime(interval_time) as timestamp) as interval_time
from
intervals
UNION
select distinct d.timestamp as interval_time from initial_data2 d
)
order by interval_time asc
),
possible_records as (
select
distinct
d.id,
i.interval_time
FROM
interval_times i, unique_ids d
)
select
p.id,
d.value,
split(cast(p.interval_time as string)," ")[1] as timestamp
FROM
possible_records p
LEFT JOIN
initial_data2 d ON d.id = p.id and d.timestamp = p.interval_time
ORDER BY p.id, p.interval_time
""").show(20)
Output
+---+-----+---------+
| id|value|timestamp|
+---+-----+---------+
| 1| 3| 12:00:01|
| 1| 4| 12:03:30|
| 1| null| 12:05:01|
| 1| null| 12:10:01|
| 1| 5| 12:12:20|
| 1| null| 12:15:01|
| 1| null| 12:20:01|
| 1| null| 12:25:01|
| 1| null| 12:30:01|
| 1| null| 12:35:01|
| 1| null| 12:40:01|
| 1| null| 12:45:01|
| 1| null| 12:50:01|
| 1| null| 12:55:01|
| 1| null| 13:00:01|
| 1| null| 13:05:01|
| 1| null| 13:10:01|
| 1| null| 13:15:01|
| 1| null| 13:20:01|
| 1| null| 13:25:01|
+---+-----+---------+
only showing top 20 rows
Data Prep to replicate
from pyspark.sql import Row

raw_data1 = [
{"id":1,"value":3,"timestam":"12:00:01"},
{"id":1,"value":4,"timestam":"12:03:30"},
{"id":1,"value":5,"timestam":"12:12:20"},
{"id":1,"value":3,"timestam":"15:00:00"},
]
raw_data = [*map(lambda entry : Row(**entry),raw_data1)]
initial_data = sqlContext.createDataFrame(raw_data,schema="id int, value int, timestam string ")
initial_data.createOrReplaceTempView('initial_data')
sqlContext.sql("create or replace temp view initial_data2 as select id,value,cast(timestam as timestamp) as timestamp from initial_data")
Related
I have a table like this:
| EventID | EventTime                  | AttrA | AttrB |
|---------|----------------------------|-------|-------|
| 1       | 2022-10-01 00:00:01.000000 | null  | null  |
| 1       | 2022-10-01 00:00:02.000000 | a     | null  |
| 1       | 2022-10-01 00:00:03.000000 | b     | 1     |
| 1       | 2022-10-01 00:00:04.000000 | null  | null  |
| 2       | 2022-10-01 00:01:01.000000 | aa    | 11    |
| 2       | 2022-10-01 00:01:02.000000 | bb    | null  |
| 2       | 2022-10-01 00:01:03.000000 | null  | null  |
| 2       | 2022-10-01 00:01:04.000000 | aa    | 22    |
and I want to jump across the records to return the first and last non-null AttrA and AttrB values for each EventID based on the EventTime. Each EventID can have multiple records, so we can't know where the non-nulls may be. So the desired results would be:
| EventID | FirstAttrA | LastAttrA | FirstAttrB | LastAttrB |
|---------|------------|-----------|------------|-----------|
| 1       | a          | b         | 1          | 1         |
| 2       | aa         | aa        | 11         | 22        |
What I did is to add row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC) and then again DESC, and then have multiple CTEs like this:
WITH enhanced_table AS
(
SELECT
eventID,
attrA,
attrB,
row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC) as rn,
row_number() OVER (PARTITION BY event_id ORDER BY event_time DESC) as reversed_rn
),
first_events_with_attrA AS
(
SELECT
eventID,
FIRST(attrA) OVER (PARTITION BY eventID ORDER BY rn ASC) AS url
FROM enhanced_table
WHERE attrA IS NOT NULL
)...
But I need one CTE which scans the table again for each case I want (for this example, 4 CTEs in total). It works, but it is slow.
Is there a way to grab the values I am interested in in a more efficient way?
No need to build row numbers; you can directly use the native Spark SQL functions FIRST and LAST with isIgnoreNull set to True to achieve the intended results.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# `sql` used below is an existing SparkSession / SQLContext
s = StringIO("""
EventID,EventTime,AttrA,AttrB
1,2022-10-01 00:00:01.000000,,
1,2022-10-01 00:00:02.000000,a,
1,2022-10-01 00:00:03.000000,b,1
1,2022-10-01 00:00:04.000000,,
2,2022-10-01 00:01:01.000000,aa,11
2,2022-10-01 00:01:02.000000,bb,
2,2022-10-01 00:01:03.000000,,
2,2022-10-01 00:01:04.000000,aa,22
"""
)
inp_schema = StructType([
StructField('EventID',IntegerType(),True)
,StructField('EventTime',StringType(),True)
,StructField('AttrA',StringType(),True)
,StructField('AttrB',DoubleType(),True)
]
)
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df,schema=inp_schema)\
.withColumn('AttrA',F.when(F.isnan(F.col('AttrA')),None).otherwise(F.col('AttrA')))\
.withColumn('AttrB',F.when(F.isnan(F.col('AttrB')),None).otherwise(F.col('AttrB')))
sparkDF.show(truncate=False)
+-------+--------------------------+-----+-----+
|EventID|EventTime |AttrA|AttrB|
+-------+--------------------------+-----+-----+
|1 |2022-10-01 00:00:01.000000|null |null |
|1 |2022-10-01 00:00:02.000000|a |null |
|1 |2022-10-01 00:00:03.000000|b |1.0 |
|1 |2022-10-01 00:00:04.000000|null |null |
|2 |2022-10-01 00:01:01.000000|aa |11.0 |
|2 |2022-10-01 00:01:02.000000|bb |null |
|2 |2022-10-01 00:01:03.000000|null |null |
|2 |2022-10-01 00:01:04.000000|aa |22.0 |
+-------+--------------------------+-----+-----+
First & Last
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
EventID,
FIRST(AttrA,True) as First_AttrA,
LAST(AttrA,True) as Last_AttrA,
FIRST(AttrB,True) as First_AttrB,
LAST(AttrB,True) as Last_AttrB
FROM INPUT
GROUP BY 1
""").show()
+-------+-----------+----------+-----------+----------+
|EventID|First_AttrA|Last_AttrA|First_AttrB|Last_AttrB|
+-------+-----------+----------+-----------+----------+
| 1| a| b| 1.0| 1.0|
| 2| aa| aa| 11.0| 22.0|
+-------+-----------+----------+-----------+----------+
I have 2 tables as follows.
I need to join these 2 tables to get the table below.
I am trying different joins but not getting the expected results. Could you please help me get the desired table?
Really appreciate your help.
Thanks...
Hope this solution can help you (I used SQL Server syntax):
SELECT isnull(date1,date2) as Date3, ISNULL(RM, 0 ),ISNULL(KM, 0 )
FROM table1
FULL JOIN table2
ON table1.Date1 = table2.Date2
order by Date3;
[RESULT]:
[EDIT]:
Live demo
create table Table1 (DATE1 date, RM int);
INSERT INTO Table1 VALUES ('1/4/2020' , 1);
INSERT INTO Table1 VALUES ('2/1/2020' , 4);
INSERT INTO Table1 VALUES ('2/10/2020' , 4);
GO
3 rows affected
create table Table2 (DATE2 date, KM int);
INSERT INTO Table2 VALUES ('2/2/2020' , 1);
INSERT INTO Table2 VALUES ('2/10/2020' , 3);
INSERT INTO Table2 VALUES ('3/5/2020' , 2);
GO
3 rows affected
select * from Table1;
GO
DATE1 | RM
:--------- | -:
2020-01-04 | 1
2020-02-01 | 4
2020-02-10 | 4
select * from Table2;
GO
DATE2 | KM
:--------- | -:
2020-02-02 | 1
2020-02-10 | 3
2020-03-05 | 2
SELECT isnull(date1,date2) as Date3, ISNULL(RM, 0 ),ISNULL(KM, 0 )
FROM table1
FULL JOIN table2
ON table1.Date1 = table2.Date2
order by Date3;
GO
Date3 | (No column name) | (No column name)
:--------- | ---------------: | ---------------:
2020-01-04 | 1 | 0
2020-02-01 | 4 | 0
2020-02-02 | 0 | 1
2020-02-10 | 4 | 3
2020-03-05 | 0 | 2
db<>fiddle here
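Side note: the two ISNULL columns show up as "(No column name)" above; if you want named columns, simply alias them, for example:
SELECT ISNULL(date1, date2) AS Date3,
       ISNULL(RM, 0) AS RM,
       ISNULL(KM, 0) AS KM
FROM table1
FULL JOIN table2
ON table1.Date1 = table2.Date2
order by Date3;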
I don't know Scala, but in PySpark you can do the following:
df1.join(df2, 'DATE', 'full').fillna(0)
Essentially you do a full join and fill all the NULLs with 0.
For Hive SQL, I guess it would be something like:
SELECT COALESCE(table1.Date, table2.Date) AS Date,
CASE WHEN (table1.RM IS NOT NULL) THEN table1.RM ELSE 0 END AS RM,
CASE WHEN (table2.KM IS NOT NULL) THEN table2.KM ELSE 0 END AS KM
FROM table1
FULL JOIN table2
ON table1.Date = table2.Date
I have created two initial dataframes named df_rm and df_km as a source for your data.
df_rm looks like this:
+---------+---+
| date| rm|
+---------+---+
| 1/4/2020| 1|
| 2/1/2020| 4|
|2/10/2020| 4|
+---------+---+
df_km:
+---------+---+
| date| km|
+---------+---+
| 2/2/2020| 1|
|2/10/2020| 3|
| 3/5/2020| 2|
+---------+---+
Now, first we can do an outer join, then replace the null values with some value, in this case 0.
import org.apache.spark.sql.functions.{col, when}

df_km.join(right = df_rm, Seq("date"), joinType = "outer")
.withColumn("rm",when(col("rm").isNull,0).otherwise(col("rm")))
.withColumn("km",when(col("km").isNull,0).otherwise(col("km")))
.show()
Which outputs like this:
+---------+---+---+
| date| km| rm|
+---------+---+---+
| 3/5/2020| 2| 0|
| 2/2/2020| 1| 0|
| 2/1/2020| 0| 4|
| 1/4/2020| 0| 1|
|2/10/2020| 3| 4|
+---------+---+---+
I am struggling to find the right way to write a SELECT query that produces a count per id for each unique date. I have a Log table like this:
id| DateTime
1|23-03-2019 18:27:45|
1|23-03-2019 18:27:45|
2|23-03-2019 18:27:50|
2|23-03-2019 18:27:51|
2|23-03-2019 18:28:01|
3|23-03-2019 18:33:15|
1|24-03-2019 18:13:18|
2|23-03-2019 18:27:12|
2|23-03-2019 15:27:46|
3|23-03-2019 18:21:58|
3|23-03-2019 18:21:58|
4|24-03-2019 10:11:14|
What I have tried:
select id, count(cast(DateTime as DATE)) as Counts from Logs group by id
It produces the proper counts per id, like:
id|count
1 | 2|
2 | 3|
3 | 1|
1 | 1|
2 | 2|
3 | 2|
4 | 1|
What I want is to add the DateTime column cast as a date:
id|count|Date
1 | 2| 23-03-2019
2 | 3| 23-03-2019
3 | 1| 23-03-2019
1 | 1| 24-03-2019
2 | 2| 24-03-2019
3 | 2| 24-03-2019
4 | 1| 24-03-2019
However, I get an error saying
Column 'Logs.DateTime' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
when I try
select id, count(cast(DateTime as DATE)) as Counts from Logs group by id
You need to add cast(DateTime as DATE) to the GROUP BY as well:
select id,cast(DateTime as DATE) as dateval, count(cast(DateTime as DATE)) as Counts
from Logs
group by id,cast(DateTime as DATE)
I have two tables.
T1
--------------------------
|IDT1|DESCR | VALUE |
--------------------------
| 1|TEST 1 | 100|
| 2|TEST 2 | 80|
--------------------------
T2
-----------
|IDT2|IDT1|
-----------
| 1| 1|
| 2| 1|
| 3| 2|
-----------
The field T2.IDT1 is a foreign key to T1.IDT1.
I need to omit the duplicate values from the T1 table (only), as in the second row of the result below.
----------------------------
|IDT1|DESCR |IDT2| VALUE|
----------------------------
| 1|TEST 1 | 1| 100|
| | | 2| |
| 2|TEST 2 | 3| 80|
----------------------------
I am using firebird 2.5.
I'm not familiar with Firebird, but if this were an Oracle DB, you could try this:
select
t1.idt1,
t1.descr,
t2.idt2,
t1.value
from (
select
t2.idt2 idt2,
case
when lag(t2.idt1) over (order by t2.idt1, t2.idt2) = t2.idt1 then null
else t2.idt1
end idt1
from t2
) t2
left outer join t1
on t1.idt1 = t2.idt1
order by 3;
You can test that here: SQL Fiddle
I have a problem creating a query for Postgres (strictly speaking, it's Redshift).
The table data is below.
The table is partitioned by user_id and ordered by created_at desc.
data
user_id| x | y | min | created_at
-------+---+---+------+---------------------
1| 1 | 1 | 1 | 2015-01-15 17:26:53
1| 1 | 1 | 2 | 2015-01-15 17:26:54
1| 1 | 1 | 3 | 2015-01-15 17:26:55
1| 2 | 1 | 10 | 2015-01-16 02:46:21
1| 1 | 1 | 15 | 2015-01-16 02:46:22
1| 3 | 3 | 11 | 2015-01-16 03:01:44
1| 3 | 3 | 2 | 2015-01-16 03:02:06
2| 1 | 1 | 3 | 2015-01-16 03:02:12
2| 2 | 1 | 4 | 2015-01-16 03:02:15
2| 2 | 1 | 7 | 2015-01-16 03:02:18
and what I want is below.
Ideal result:
user_id| x | y | sum_min |
-------+---+---+----------+
1| 1 | 1 | 6 |
1| 2 | 1 | 10 |
1| 1 | 1 | 15 |
1| 3 | 3 | 13 |
2| 1 | 1 | 3 |
2| 2 | 1 | 11 |
If I simply group by user_id, x, y, the result will be:
user_id| x | y | sum_min |
-------+---+---+----------+
1| 1 | 1 | 21 |
:| : | : | : |
this is not good for me:(
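(For reference, the simple aggregation I mean is just this:)
select user_id, x, y, sum(min) as sum_min
from data              -- the table shown above
group by user_id, x, y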
try this
with cte as (
    select user_id, x, y, created_at,
           sum(min) over (partition by user_id, x, y, replace order by user_id) sum_min
    from (
        select user_id, x, y, min, replace(created_at::date::text, '-', ''), created_at
        from usr order by created_at
    ) t order by created_at
)
select user_id, x, y, sum_min from cte
group by sum_min, user_id, x, y
order by user_id
Maybe try grouping it by the creation date as well:
select user_id, x, y, sum(min), created_at::date from test
group by user_id, x, y, created_at::date
order by user_id, x, y, created_at
It seems that what you want to do is to calculate an aggregate function over a cluster of records ordered on a column, where a cluster is based on the same values in three columns and is separated from other clusters only by those three column values. That is not possible in standard SQL because the order of records is not relevant to any of the SQL commands. The fact that you order by date does not change that: SQL commands simply do not support this kind of stratification.
The only option that I am aware of is to create a plpgsql function with a cursor on your data relation (presumably a view, but it would work equally well with a table). You iterate over all the records in the relation and, for each cluster encountered, sum up the min values and output a new record with the clustering columns and the summed value.
CREATE FUNCTION sum_clusters()
RETURNS TABLE (user_id int, x int, y int, sum_min int) AS $$
DECLARE
    data_row data%ROWTYPE;
    -- iterate over the rows in time order
    cur CURSOR FOR SELECT * FROM data ORDER BY created_at;
    cur_user integer;
    cur_x integer;
    cur_y integer;
    running_sum integer;
BEGIN
    OPEN cur;
    FETCH NEXT FROM cur INTO data_row;
    LOOP
        IF NOT FOUND THEN
            EXIT;
        END IF;
        -- start a new cluster with the current row
        cur_user := data_row.user_id;
        cur_x := data_row.x;
        cur_y := data_row.y;
        running_sum := data_row.min;
        LOOP
            FETCH NEXT FROM cur INTO data_row;
            IF NOT FOUND THEN
                EXIT;
            END IF;
            IF (data_row.user_id = cur_user) AND (data_row.x = cur_x) AND (data_row.y = cur_y) THEN
                running_sum := running_sum + data_row.min;
            ELSE
                EXIT;  -- cluster changed; the fetched row starts the next cluster
            END IF;
        END LOOP;
        -- emit one row per cluster
        user_id := cur_user;
        x := cur_x;
        y := cur_y;
        sum_min := running_sum;
        RETURN NEXT;
    END LOOP;
    CLOSE cur;
    RETURN;
END;
$$ LANGUAGE plpgsql;
That is a lot of code and not particularly fast, but it should work.
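Assuming the relation really is named data, as in the cursor above, you would then call the function like this:
SELECT * FROM sum_clusters();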