Hive - unable to compare two date columns in the same table

I am trying to compare two string columns that have date values in them.
Below is an example dataset
id start_dt end_dt
1 2019-10-10 2019-10-10
2 2019-10-20 2020-01-01
3 2019-01-01 2020-01-01
I want to eliminate records where start_dt and end_dt are equal. I tried all of the inequality checks below:
select * from test where to_date(start_dt) <> to_date(end_dt)
select * from test
where to_date(from_unixtime(from_unixtimestamp(start_dt,'yyyy-mm-dd')))
<> to_date(from_unixtime(from_unixtimestamp(end_dt,'yyyy-mm-dd')))
But none of them worked for the inequality check, even though the same expressions work for equality.
Expected output
id start_dt end_dt
2 2019-10-20 2020-01-01
3 2019-01-01 2020-01-01
Any help would be highly appreciated.

Since start_dt and end_dt are string columns holding yyyy-MM-dd values, you can cast them directly to the date type and keep only the non-matching rows.
Try this query:
select * from test where date(start_dt) <> date(end_dt);
We are simply casting to the date type and comparing in the WHERE clause.
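As a side note: because the values are already zero-padded yyyy-MM-dd strings, they compare correctly as plain strings, so this cast-free variant should behave the same (a minimal sketch, assuming every row really follows that format):
-- string equality/inequality matches date equality/inequality for canonical yyyy-MM-dd values
select * from test where start_dt <> end_dt;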

Related

Can I have a column that contains two different data types (date and text) under one common data type (date) in PostgreSQL?

I am new to PostgreSQL. In the table below, end_date was a date column that contained 9999-01-01 as a value. I replaced 9999-01-01 with "Never Ending" because that date didn't look good for presentation purposes. I want to convert my end_date column back from text to the date data type without losing the "Never Ending" text value.
I have following data:
event_id(int) event_name(varchar) start_date(date) end_date(varchar)
1 abc 2020-07-28 2020-07-29
2 efg 2020-08-01 2020-09-01
3 xyz 2021-06-01 Never Ending
The desired output:
event_id(int) event_name(varchar) start_date(date) end_date(date)
1 abc 2020-07-28 2020-07-29
2 efg 2020-08-01 2020-09-01
3 xyz 2021-06-01 Never Ending
The query that converted the data type from date to text:
select event_id, event_name, start_date,
case
when end_date = '9999-01-01' then 'Never Ending'
else CAST (end_date AS text)
end as end_date
from event
I would use nulls instead of converting everything to text, so:
select
event_id
, event_name
, start_date
, case when end_date = '9999-01-01' then null else end_date end as end_date
from event
Also, you don't need all those parentheses :)
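If the goal is still to get the column back to a real date type while keeping "Never Ending" purely for display, a minimal sketch along the same lines (assuming end_date is currently stored as text and only ever holds a date, '9999-01-01', or the literal 'Never Ending'; the view name is made up):
-- convert the text column back to date, mapping both sentinel values to NULL
alter table event
    alter column end_date type date
    using nullif(nullif(end_date, 'Never Ending'), '9999-01-01')::date;

-- re-apply the label only when presenting the data
create view event_display as
select event_id,
       event_name,
       start_date,
       coalesce(end_date::text, 'Never Ending') as end_date
from event;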

Find missing record between date range

At the end of an enormous stored procedure (in SQL Server), I've created two CTEs: one with some date ranges (with 6-month intervals) and one with some records.
Let's assume I have date ranges in table B from 2020-01-01 back to 2010-01-01 (in 6-month intervals)
Start End
----------------------
2020-01-01 | 2020-07-01
... ...
other years here
... ...
2010-01-01 | 2010-07-01
and on table A this situation:
Name Date
-----------------
John 2020-01-01
John 2019-01-01
John 2018-07-01
... ...
Rob 2020-01-01
Rob 2019-07-01
Rob 2018-07-01
... ...
I'm trying to generate a recordset like this:
Name MissingDate
-----------------
John 2019-07-01
... ...
John 2010-01-01
Rob 2019-01-01
... ...
Rob 2010-01-01
I've got the flu and barely know who I am at this moment; I hope this was clear, and if anyone could help me with this I would really appreciate it.
If you want missing dates (which appear to be by month), then generate all available dates and take out the ones you have.
with cte as (
    -- [end] is bracketed because END is a reserved word in T-SQL
    select [start], [end]
    from dateranges
    union all
    select dateadd(month, 1, [start]), [end]
    from cte
    where [start] < [end]
)
select n.name, cte.[start]
from cte cross join
     (select distinct name from tablea) n left join
     tablea a
     on a.date = cte.[start] and a.name = n.name
where a.date is null;
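To try the whole pattern end to end, here is a self-contained sketch with hypothetical inline CTEs standing in for table B (dateranges) and table A (tablea). Adjacent ranges share their boundary date, so DISTINCT is added to avoid reporting that date twice:
-- hypothetical sample data standing in for the two existing CTEs
with dateranges ([start], [end]) as (
    select cast('2019-01-01' as date), cast('2019-07-01' as date)
    union all
    select cast('2019-07-01' as date), cast('2020-01-01' as date)
), tablea (name, [date]) as (
    select 'John', cast('2019-01-01' as date)
    union all
    select 'Rob', cast('2019-07-01' as date)
), cte as (
    -- expand each range into one row per month
    select [start], [end] from dateranges
    union all
    select dateadd(month, 1, [start]), [end] from cte where [start] < [end]
)
select distinct n.name, cte.[start] as MissingDate
from cte cross join
     (select distinct name from tablea) n left join
     tablea a
     on a.[date] = cte.[start] and a.name = n.name
where a.[date] is null
order by n.name, MissingDate;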

Sum of item count in an SQL query based on DATE_TRUNC

I've got a table which contains event status data, similar to this:
ID Time Status
------ -------------------------- ------
357920 2019-12-25 09:31:38.854764 1
362247 2020-01-02 09:31:42.498483 1
362248 2020-01-02 09:31:46.166916 1
362249 2020-01-02 09:31:47.430933 1
362300 2020-01-03 09:31:46.932333 1
362301 2020-01-03 09:31:47.231288 1
I'd like to construct a query which returns the number of successful events each day, so:
Time Count
-------------------------- -----
2019-12-25 00:00:00.000000 1
2020-01-02 00:00:00.000000 3
2020-01-03 00:00:00.000000 2
I've stumbled across this SO answer to a similar question, but the answer there is for all the data returned by the query, whereas I need the sum grouped by date range.
Also, I cannot use BETWEEN to select a specific date range, since this query is for a Grafana dashboard, and the date range is determined by the dashboard's UI. I'm using Postgres for the SQL dialect, in case that matters.
You need to remove the time component from the timestamp. In most databases, you can do this by converting to a date:
select cast(time as date) as dte,
sum(case when status = 1 then 1 else 0 end) as num_successful
from t
group by cast(time as date)
order by dte;
This assumes that 1 means "successful".
The cast() does not work in all databases. Other alternatives are things like trunc(time), date_trunc('day', time), date_trunc(time, day) -- and no doubt many others.
In Postgres, I would phrase this as:
select date_trunc('day', time) as dte,
count(*) filter (where status = 1) as num_successful
from t
group by dte
order by dte;
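Since the panel's time range comes from the Grafana UI rather than from a hardcoded BETWEEN, here is a sketch of the panel query, assuming Grafana's PostgreSQL data source, where the $__timeFilter(column) macro expands to the dashboard's currently selected range:
-- $__timeFilter(time) is rewritten by Grafana into
-- time BETWEEN <panel start> AND <panel end>
select date_trunc('day', time) as "time",
       count(*) filter (where status = 1) as num_successful
from t
where $__timeFilter(time)
group by 1
order by 1;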
How about this:
SELECT date(Time), sum(status)
FROM t
GROUP BY date(Time)
ORDER BY min(Time)
This also assumes that status is only ever 0 or 1, so that the sum equals the count of successful events.

Getting the days between two dates for multiple IDs

This is a follow-up question from How to get a list of months between 2 given dates using a query?, really. (I suspect it's because I don't quite understand the logic behind connect by level clauses!)
What I have is a list of data like so
ID | START_DATE | END_DATE
1 | 01-JAN-2018 | 20-JAN-2018
2 | 13-FEB-2018 | 20-MAR-2018
3 | 01-MAR-2018 | 07-MAR-2018
and what I want to try and get is a list with all the days between the start and end date for each ID.
So for example I want a list which gives
ID | DATE
1 | 01-JAN-2018
1 | 02-JAN-2018
1 | 03-JAN-2018
...
1 | 19-JAN-2018
1 | 20-JAN-2018
2 | 13-FEB-2018
2 | 14-FEB-2018
2 | 15-FEB-2018
...
etc.
What I've tried to do is adapt one of the answers from the above link as follows
select id
, trunc(start_date+((level-1)),'DD')
from (
select id
, start_date
, end_date
from blah
)
connect by level <= ((trunc(end_date,'DD')-trunc(start_date,'DD'))) + 1
which gives me what I want, but also a whole host of duplicate dates, as if it's doing a cartesian join. Is there something simple I need to add to fix this?
I like recursive CTEs:
with cte (id, dte, end_date) as (
    -- the column list is required for recursive WITH in Oracle
    select id, start_date, end_date
    from blah
    union all
    select id, dte + 1, end_date
    from cte
    where dte < end_date
)
select *
from cte
order by id, dte;
This is ANSI standard syntax and works in several other databases.
The hierarchical query you were trying to do needs to include id = prior id in the connect-by clause, but as that causes loops with multiple source rows you also need to include a call to a non-deterministic function, such as dbms_random.value:
select id, start_date + level - 1 as day
from blah
connect by level <= end_date - start_date + 1
and prior id = id
and prior dbms_random.value is not null
With your sample data in a CTE, that gets 63 rows back:
with blah (ID, START_DATE, END_DATE) as (
select 1, date '2018-01-01', date '2018-01-20' from dual
union all select 2, date '2018-02-13', date '2018-03-20' from dual
union all select 3, date '2018-03-01', date '2018-03-07' from dual
)
select id, start_date + level - 1 as day
from blah
connect by level <= end_date - start_date + 1
and prior id = id
and prior dbms_random.value is not null;
ID DAY
---------- ----------
1 2018-01-01
1 2018-01-02
1 2018-01-03
...
1 2018-01-19
1 2018-01-20
2 2018-02-13
2 2018-02-14
...
3 2018-03-05
3 2018-03-06
3 2018-03-07
You don't need to trunc() the dates unless they have non-midnight times, which seems unlikely in this case, and even then it might not be necessary if only the end-date has a later time (like 23:59:59).
A recursive CTE is more intuitive in many ways though, at least once you understand the basic idea of them; so I'd probably use Gordon's approach too. There can be differences in performance and whether they work at all for large amounts of data (or generated rows), but for a lot of data it's worth comparing different approaches to find the most suitable/performant anyway.

PostgreSQL query for multiple update

I have a table in which I have 4 columns: emp_no, desig_name, from_date and to_date:
emp_no desig_name from_date to_date
1001 engineer 2004-08-01 00:00:00
1001 sr.engineer 2010-08-01 00:00:00
1001 chief.engineer 2013-08-01 00:00:00
So my question is: how do I update the first row's to_date column to one day before the from_date of the second row, and do the same for the second row as well?
After update it should look like:
emp_no desig_name from_date to_date
1001 engineer 2004-08-01 00:00:00 2010-07-31 00:00:00
1001 sr.engineer 2010-08-01 00:00:00 2013-07-31 00:00:00
1001 chief.engineer 2013-08-01 00:00:00
You can calculate the "next" date using the lead() function.
This calculated value can then be used to update the table (this assumes a unique key column, promotion_id here, to join the calculated rows back on):
with calc as (
select promotion_id,
emp_no,
from_date,
lead(from_date) over (partition by emp_no order by from_date) as next_date
from emp
)
update emp
set to_date = c.next_date - interval '1' day
from calc c
where c.promotion_id = emp.promotion_id;
As you can see, getting that value is quite easy, and storing derived information is very often not a good idea. You might want to consider a view that calculates this information on the fly, so you don't need to update your table each time you insert a new row.
SQLFiddle example: http://sqlfiddle.com/#!15/31665/1
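As suggested above, the derived to_date can also live in a view instead of being stored; a minimal sketch (the view name is made up):
-- hypothetical view deriving to_date from the next designation's from_date
create view emp_designations as
select emp_no,
       desig_name,
       from_date,
       lead(from_date) over (partition by emp_no order by from_date)
           - interval '1' day as to_date
from emp;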