Validating date values using Hive functions

I have a string column whose values are dates. Some are valid (yyyy-MM-dd) and some are not. How do I separate the valid values from the invalid ones using only Hive? I cannot use a custom UDF or Spark, so it has to be done with built-in Hive functions.
select * from date_test;
+-------------------+--+
| date_test.mydate  |
+-------------------+--+
| 2018-12-13        |  => valid
| 2018-13-12        |  => invalid
| 2018-04-31        |  => invalid
+-------------------+--+
select mydate, to_date(mydate) from date_test;
+-------------+-------------+--+
| mydate      | _c1         |
+-------------+-------------+--+
| 2018-12-13  | 2018-12-13  |
| 2018-13-12  | 2019-01-12  |  => to_date() casts it to a valid date
| 2018-04-31  | 2018-05-01  |  => to_date() casts it to a valid date
+-------------+-------------+--+

I have sort of managed to get it, but I am open to other better approaches.
-- valid date values
select
mydate,
to_date(mydate)
from
date_test
where
mydate = to_date(mydate);
+-------------+-------------+--+
| mydate      | _c1         |
+-------------+-------------+--+
| 2018-12-13  | 2018-12-13  |
+-------------+-------------+--+
-- invalid date values
select
mydate,
to_date(mydate)
from
date_test
where
mydate <> to_date(mydate);
+-------------+-------------+--+
| mydate      | _c1         |
+-------------+-------------+--+
| 2018-13-12  | 2019-01-12  |
| 2018-04-31  | 2018-05-01  |
+-------------+-------------+--+
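If to_date() alone feels too permissive, another Hive-only option is to pre-filter with a regular expression and keep the round-trip comparison for dates that are shaped correctly but impossible (such as 2018-04-31). This is a minimal sketch against the same date_test table, assuming valid values always use the yyyy-MM-dd shape; note that strings to_date() cannot parse at all come back as NULL, which neither = nor <> will match, so the regex and NULL tests catch those too:

-- valid rows: correct shape and unchanged by the to_date() round trip
select mydate
from date_test
where mydate rlike '^\\d{4}-\\d{2}-\\d{2}$'
  and mydate = to_date(mydate);

-- invalid rows: wrong shape, unparseable, or silently adjusted by to_date()
select mydate
from date_test
where not (mydate rlike '^\\d{4}-\\d{2}-\\d{2}$')
   or to_date(mydate) is null
   or mydate <> to_date(mydate);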

Related

BigQuery - Get most recent data for each individual user

I wonder if anyone here can help with a BigQuery piece I am working on.
This will need to pull the most recent gplus/currents activity for each individual user in the domain.
I have tried the following query, but this pulls every activity for every user:
SELECT
TIMESTAMP_MICROS(time_usec) AS date,
email,
event_type,
event_name
FROM
`bqadminreporting.adminlogtracking.activity`
WHERE
record_type LIKE 'gplus'
ORDER BY
email ASC;
I have tried to use DISTINCT, but I still get multiple entries for the same user. Ideally, I need to do this looking back over 90 days (so between today and 90 days ago, get the most recent activity for each user, if that makes sense).
EDIT:
Example data and expected output.
Fields: There are over 500 fields, I have just listed the relevant ones
+-------------------------------+---------+----------+
| Field name                    | Type    | Mode     |
+-------------------------------+---------+----------+
| time_usec                     | INTEGER | NULLABLE |
| email                         | STRING  | NULLABLE |
| event_type                    | STRING  | NULLABLE |
| event_name                    | STRING  | NULLABLE |
| record_type                   | STRING  | NULLABLE |
| gplus                         | RECORD  | NULLABLE |
| gplus.log_event_resource_name | STRING  | NULLABLE |
| gplus.attachment_type         | STRING  | NULLABLE |
| gplus.plusone_context         | STRING  | NULLABLE |
| gplus.post_permalink          | STRING  | NULLABLE |
| gplus.post_resource_name      | STRING  | NULLABLE |
| gplus.comment_resource_name   | STRING  | NULLABLE |
| gplus.post_visibility         | STRING  | NULLABLE |
| gplus.user_type               | STRING  | NULLABLE |
| gplus.post_author_name        | STRING  | NULLABLE |
+-------------------------------+---------+----------+
Output from my query: This is the output I get when running my query above.
+-----+--------------------------------+------------------+----------------+----------------+
| Row | date                           | email            | event_type     | event_name     |
+-----+--------------------------------+------------------+----------------+----------------+
| 1   | 2020-01-30 07:10:19.088 UTC    | user1#domain.com | post_change    | create_post    |
| 2   | 2020-03-03 08:47:25.086485 UTC | user1#domain.com | coment_change  | create_comment |
| 3   | 2020-03-23 09:10:09.522 UTC    | user1#domain.com | post_change    | create_post    |
| 4   | 2020-03-23 09:49:00.337 UTC    | user1#domain.com | plusone_change | remove_plusone |
| 5   | 2020-03-23 09:48:10.461 UTC    | user1#domain.com | plusone_change | add_plusone    |
| 6   | 2020-01-30 10:04:29.757005 UTC | user1#domain.com | coment_change  | create_comment |
| 7   | 2020-03-28 08:52:50.711359 UTC | user2#domain.com | coment_change  | create_comment |
| 8   | 2020-11-08 10:08:09.161325 UTC | user2#domain.com | coment_change  | create_comment |
| 9   | 2020-04-21 15:28:10.022683 UTC | user3#domain.com | coment_change  | create_comment |
| 10  | 2020-03-28 09:37:28.738863 UTC | user4#domain.com | coment_change  | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+
Desired result: Only 1 row of data per user, showing only the most recent event.
+-----+--------------------------------+------------------+----------------+----------------+
| Row | date                           | email            | event_type     | event_name     |
+-----+--------------------------------+------------------+----------------+----------------+
| 1   | 2020-03-23 09:49:00.337 UTC    | user1#domain.com | plusone_change | remove_plusone |
| 2   | 2020-11-08 10:08:09.161325 UTC | user2#domain.com | coment_change  | create_comment |
| 3   | 2020-04-21 15:28:10.022683 UTC | user3#domain.com | coment_change  | create_comment |
| 4   | 2020-03-28 09:37:28.738863 UTC | user4#domain.com | coment_change  | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+
Use array_agg:
select
email,
array_agg(STRUCT(TIMESTAMP_MICROS(time_usec) as date, event_type, event_name) ORDER BY time_usec desc LIMIT 1)[OFFSET(0)].*
from `bqadminreporting.adminlogtracking.activity`
where
record_type LIKE 'gplus'
and time_usec > unix_micros(timestamp_sub(current_timestamp(), interval 90 day))
group by email
order by email
Test example:
with mytable as (
select timestamp '2020-01-30 07:10:19.088 UTC' as date, 'user1#domain.com' as email, 'post_change' as event_type, 'create_post' as event_name union all
select timestamp '2020-03-03 08:47:25.086485 UTC', 'user1#domain.com', 'coment_change', 'create_comment' union all
select timestamp '2020-03-23 09:10:09.522 UTC', 'user1#domain.com', 'post_change', 'create_post' union all
select timestamp '2020-03-23 09:49:00.337 UTC', 'user1#domain.com', 'plusone_change', 'remove_plusone' union all
select timestamp '2020-03-23 09:48:10.461 UTC', 'user1#domain.com', 'plusone_change', 'add_plusone' union all
select timestamp '2020-01-30 10:04:29.757005 UTC', 'user1#domain.com', 'coment_change', 'create_coment' union all
select timestamp '2020-03-28 08:52:50.711359 UTC', 'user2#domain.com', 'coment_change', 'create_coment' union all
select timestamp '2020-11-08 10:08:09.161325 UTC', 'user2#domain.com', 'coment_change', 'create_coment' union all
select timestamp '2020-04-21 15:28:10.022683 UTC', 'user3#domain.com', 'coment_change', 'create_coment' union all
select timestamp '2020-03-28 09:37:28.738863 UTC', 'user4#domain.com', 'coment_change', 'create_coment'
)
select
email,
array_agg(STRUCT(date, event_type, event_name) ORDER BY date desc LIMIT 1)[OFFSET(0)].*
from mytable
group by email
If you want all columns from the most recent row, you can use this BigQuery syntax:
select array_agg(t order by date desc limit 1)[ordinal(1)].*
from mytable t
group by t.email;
If you want specific columns, then Sergey's solution might be simpler.
An alternative way to solve your problem is to find each user's latest timestamp in a derived table and join back to it:
select t.*
from mytable t, (
  select email, max(date) as max_dt
  from mytable
  group by email
) m
where t.email = m.email
  and t.date = m.max_dt

How to concat two fields and use the result in WHERE clause?

I have to get the oldest record for each External Id, based on the date and time information.
Data
Id | External Id | Date                | Time
1  | 1000        | 2020-08-18 00:00:00 | 02:30:22
2  | 1000        | 2020-08-12 00:00:00 | 12:45:51
3  | 1556        | 2020-08-17 00:00:00 | 10:09:01
4  | 1919        | 2020-08-14 00:00:00 | 18:19:18
5  | 1919        | 2020-08-14 00:00:00 | 11:45:21
6  | 1919        | 2020-08-14 00:00:00 | 15:54:15
Expected result
Id | External Id | Date                | Time
2  | 1000        | 2020-08-12 00:00:00 | 12:45:51
3  | 1556        | 2020-08-17 00:00:00 | 10:09:01
5  | 1919        | 2020-08-14 00:00:00 | 11:45:21
I'm currently doing this
SELECT *
FROM RUN AS T1
WHERE CONCAT(T1.DATE, T1.TIME) = (
SELECT MIN(CONCAT(T2.DATE, T2.TIME))
FROM RUN AS T2
WHERE T2.EXTERNAL_ID = T1.EXTERNAL_ID
)
Is this a correct way to do it?
Thank you, regards
Update 1: data types
The DATE column is datetime.
The TIME column is varchar.
You can use a window function such as DENSE_RANK()
SELECT ID, External_ID, Date, Time
FROM
(
SELECT DENSE_RANK() OVER (PARTITION BY External_ID ORDER BY Date, Time) AS dr,
r.*
FROM run r
) AS q
WHERE dr = 1

Arranging the data on the basis of column value

I have a table which has the below structure.
+-----------------------+--------------+--------+
| timeStamp             | value        | type   |
+-----------------------+--------------+--------+
| '2010-01-14 00:00:00' | '11787.3743' | 'mean' |
| '2018-04-03 14:19:21' | '9.9908'     | 'std'  |
| '2018-04-03 14:19:21' | '11787.3743' | 'min'  |
+-----------------------+--------------+--------+
Now I want to write a select query where I can fetch the data on the basis of type.
+-----------------------+--------------+--------------+----------+
| timeStamp             | mean_type    | min_type     | std_type |
+-----------------------+--------------+--------------+----------+
| '2010-01-14 00:00:00' | '11787.3743' |              |          |
| '2018-04-03 14:19:21' |              |              | '9.9908' |
| '2018-04-03 14:19:21' |              | '11787.3743' |          |
+-----------------------+--------------+--------------+----------+
How can I do this in a Postgres DB by writing a query? I also want to get the data at 10-minute intervals only.
Use CASE ... WHEN ...:
with my_table(timestamp, value, type) as (
values
('2010-01-14 00:00:00', 11787.3743, 'mean'),
('2018-04-03 14:19:21', 9.9908, 'std'),
('2018-04-03 14:19:21', 11787.3743, 'min')
)
select
timestamp,
case type when 'mean' then value end as mean_type,
case type when 'min' then value end as min_type,
case type when 'std' then value end as std_type
from my_table;
      timestamp      | mean_type  |  min_type  | std_type
---------------------+------------+------------+----------
 2010-01-14 00:00:00 | 11787.3743 |            |
 2018-04-03 14:19:21 |            |            |   9.9908
 2018-04-03 14:19:21 |            | 11787.3743 |
(3 rows)
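The CASE pivot above does not address the 10-minute-interval part of the question. As a minimal sketch, assuming the real table is called my_table, that its timestamp column is an actual timestamp type (called event_time below as a placeholder for your column name), and that one pivoted row per 10-minute window is wanted, you can truncate each timestamp to the start of its window and aggregate:

select
  to_timestamp(floor(extract(epoch from event_time) / 600) * 600) as bucket_start, -- start of the 10-minute window
  max(case type when 'mean' then value end) as mean_type,
  max(case type when 'min'  then value end) as min_type,
  max(case type when 'std'  then value end) as std_type
from my_table
group by bucket_start
order by bucket_start;

Here max() simply collapses the values of each type within a window; pick a different aggregate if several readings of the same type can fall into the same 10 minutes. On PostgreSQL 14 or later, date_bin('10 minutes', event_time, timestamp '2001-01-01') computes the same bucket more directly.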

How to convert from DAYOFWEEK and WEEKOFYEAR to date in Hive

I'm trying to GROUP BY to count events over weeks in Hive. What I'd like to get out is the date for each Saturday of the year (the output only needs to return results for weeks where we have data) and the number of events that occurred over the entire preceding week (ie, the num_events column should be the total number of events from Sunday through Saturday).
Example Desired Output:
+------------+------------+
| ymd_date   | num_events |
+------------+------------+
| 2016-01-09 | 42         |
| 2016-01-16 | 500        |
| 2016-01-23 | 1090       |
| .          | .          |
| .          | .          |
| .          | .          |
| 2016-12-31 | 23125      |
+------------+------------+
But I'm not sure how to convert from WEEKOFYEAR to get the date for each Saturday.
What I Have So Far:
SELECT
concat_ws('-', cast(YEAR(FROM_UNIXTIME(time)) as string),
lpad(cast(MONTH(FROM_UNIXTIME(time)) as string), 2, '0'),
cast(WEEKOFYEAR(FROM_UNIXTIME(time)) as string)) as ymd_date,
COUNT(*) as num_events
FROM
mytable
GROUP BY
concat_ws('-', cast(YEAR(FROM_UNIXTIME(time)) as string),
lpad(cast(MONTH(FROM_UNIXTIME(time)) as string), 2, '0'),
cast(WEEKOFYEAR(FROM_UNIXTIME(time)) as string))
ORDER BY
ymd_date
Example Current Output:
+------------+------------+
| ymd_date   | num_events |
+------------+------------+
| 2016-01-1  | 42         |
| 2016-01-2  | 500        |
| 2016-01-3  | 1090       |
| .          | .          |
| .          | .          |
| .          | .          |
| 2016-12-52 | 23125      |
+------------+------------+
I think what I have so far is just about there, but the date (the ymd_date column) shows the year-month-weekofyear instead of year-month-day.
Any ideas on how to produce the yyyy-mm-dd for each week?
date_sub(next_day(from_unixtime(time),'SAT'),7)
See Hive Operators and User-Defined Functions (UDFs) in the Hive wiki.
select date_sub(next_day(from_unixtime(time),'SAT'),7) as ymd_date
,count(*) as num_events
from mytable
group by date_sub(next_day(from_unixtime(time),'SAT'),7)
order by ymd_date
hive> select date_sub(next_day(from_unixtime(unix_timestamp()),'SAT'),7);
OK
2016-12-17
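One nuance with next_day(): it returns the first matching day strictly after its argument, so date_sub(next_day(d,'SAT'),7) maps a Saturday to itself and Sunday through Friday to the previous Saturday, i.e. it labels Saturday-to-Friday windows by their opening Saturday. If you specifically want Sunday-through-Saturday weeks labelled by their closing Saturday, as the question describes, a sketch using the same mytable and time column is to shift the input back one day before calling next_day():

select next_day(date_sub(from_unixtime(time), 1), 'SAT') as ymd_date
      ,count(*) as num_events
from mytable
group by next_day(date_sub(from_unixtime(time), 1), 'SAT')
order by ymd_date;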

Repeat row number of times based on column value

I have a table which looks like this:
| name | start_date | duration_day |
========================================
| A    | 2015-01-01 | 3            |
| B    | 2015-01-02 | 2            |
And now I want to get an output like so:
| name | date       |
=====================
| A    | 2015-01-01 |
| A    | 2015-01-02 |
| A    | 2015-01-03 |
| B    | 2015-01-02 |
| B    | 2015-01-03 |
How can I do this in PostgreSQL?
Borrowing from Abelisto's answer, you can generate a series from the duration_day value with the generate_series() table function in the row source list. The function uses the duration_day value from my_table through an implicit lateral join.
SELECT name, start_date + n AS date
FROM my_table, generate_series(0, duration_day - 1) AS x(n);
Or, with the set-returning function directly in the select list:
select
name,
start_date + generate_series(0, duration_day - 1)
from
your_table;