Join tables on dates, with dirty date field - sql

In AWS Athena, I am trying to join two tables in the db using the date, but one of the tables (table2) is not clean, and contains values that are not dates, as shown below.
| table2.date |
| ---- |
|6/02/2021|
|9/02/2021|
|1431 BEL & 1628 BEL."|
|15/02/2021|
|and failed to ....|
|18/02/2021|
|19/02/2021|
I am not able to have any influence in cleaning this table up.
My current query is:
SELECT *
FROM table1
LEFT JOIN table2
ON table1.operation_date = cast(date_parse(table2."date",'%d/%m/%Y') as date)
LIMIT 10;
I've tried using regexp_like(col, '[a-z]'), but this still leaves the values that are numerical but not dates.
How do I get the query to ignore the values that are not dates?

You can wrap the conversion expression in the try function, which resolves to NULL when the conversion fails.
select
try(date_parse(col, '%d/%m/%Y'))
from(values
('6/02/2021'),
('9/02/2021'),
('1431 BEL & 1628 BEL.'),
('15/02/2021'),
('and failed to ....'),
('18/02/2021'),
('19/02/2021')
) as t(col)
| # | _col0 |
| ---- | ---- |
| 1 | 2021-02-06 00:00:00.000 |
| 2 | 2021-02-09 00:00:00.000 |
| 3 | |
| 4 | 2021-02-15 00:00:00.000 |
| 5 | |
| 6 | 2021-02-18 00:00:00.000 |
| 7 | 2021-02-19 00:00:00.000 |
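Applied to the join in the question, the same idea would presumably read `ON table1.operation_date = try(cast(date_parse(table2."date", '%d/%m/%Y') as date))`. The sketch below mimics that behaviour locally with Python's sqlite3 module, which is only a stand-in for Athena. The helper `try_parse_date` and the sample `table1` rows are assumptions made for illustration; they are not part of the original question or answer.

```python
import sqlite3
from datetime import datetime


def try_parse_date(text):
    """Return an ISO date string for a d/m/Y value, or None if it does not parse."""
    try:
        return datetime.strptime(text, "%d/%m/%Y").date().isoformat()
    except (TypeError, ValueError):
        # Mirrors try(...): a failed conversion becomes NULL instead of an error.
        return None


conn = sqlite3.connect(":memory:")
# Hypothetical helper registered as a SQL function (stand-in for try(date_parse(...))).
conn.create_function("try_parse_date", 1, try_parse_date)

conn.execute("CREATE TABLE table1 (operation_date TEXT)")
conn.execute('CREATE TABLE table2 ("date" TEXT)')
# table1 rows are made up for this sketch.
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("2021-02-06",), ("2021-02-15",), ("2021-02-20",)],
)
# table2 rows are the dirty values from the question.
conn.executemany(
    "INSERT INTO table2 VALUES (?)",
    [
        ("6/02/2021",),
        ("9/02/2021",),
        ("1431 BEL & 1628 BEL.",),
        ("15/02/2021",),
        ("and failed to ....",),
        ("18/02/2021",),
        ("19/02/2021",),
    ],
)

# Rows whose "date" value does not parse yield NULL and so never match the join.
rows = conn.execute(
    """
    SELECT table1.operation_date, table2."date"
    FROM table1
    LEFT JOIN table2
      ON table1.operation_date = try_parse_date(table2."date")
    ORDER BY table1.operation_date
    """
).fetchall()

for row in rows:
    print(row)
```

Because a NULL never compares equal to anything, the non-date rows simply drop out of the join without any separate filtering step.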

Related

running tally in SQL

Need help tallying fork truck training completions at work. Here is an example of the tables I have, and the table I need to create:
table 1:
| date | is_work_day |
| ---- | ---- |
| 2023-01-25 | 1 |
| 2023-01-26 | 1 |
| 2023-01-27 | 1 |
| 2023-01-28 | 0 |
| 2023-01-29 | 1 |
| 2023-01-30 | 0 |
table 2:
| employee_id | training_passed | test_date |
| ---- | ---- | ---- |
| 001 | 1 | 2023-01-25 |
| 002 | 1 | 2023-01-26 |
| 003 | 0 | 2023-01-26 |
| 004 | 1 | 2023-01-26 |
| 005 | 0 | 2023-01-27 |
| 006 | 1 | 2023-01-29 |
need table:
| date | cumulative_passed_training |
| ---- | ---- |
| 2023-01-26 | 2 |
| 2023-01-27 | 2 |
| 2023-01-29 | 3 |
The table should count the total passed trainings, but only starting on 2023-01-26 and should only show dates that are work days. Any help would be greatly appreciated.
I think I need to JOIN the two tables, and then SUM the training_passed column, but am unsure how to get it to start at a certain date, and how to make it only show work days on the final table.
JOIN on the date column and add the passed-test check as a JOIN condition. Also GROUP BY the date so you can count the passes for each day:
select t1.date, count(t2.employee_id)
from table1 t1
join table2 t2 on t1.date = t2.test_date
and t2.training_passed = 1
group by t1.date
It would make no difference if you put the condition
t2.training_passed = 1
in a WHERE clause instead of the INNER JOIN.
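The query above returns per-day counts. To get the cumulative table the question asks for, one possible approach is a LEFT JOIN (so work days with no passes are kept), a filter on work days and the start date, and a running SUM as a window function. The sketch below runs against SQLite through Python purely as a local stand-in; the question does not name its database, so the exact syntax is an assumption, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (date TEXT, is_work_day INTEGER)")
conn.execute(
    "CREATE TABLE table2 (employee_id TEXT, training_passed INTEGER, test_date TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("2023-01-25", 1),
        ("2023-01-26", 1),
        ("2023-01-27", 1),
        ("2023-01-28", 0),
        ("2023-01-29", 1),
        ("2023-01-30", 0),
    ],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [
        ("001", 1, "2023-01-25"),
        ("002", 1, "2023-01-26"),
        ("003", 0, "2023-01-26"),
        ("004", 1, "2023-01-26"),
        ("005", 0, "2023-01-27"),
        ("006", 1, "2023-01-29"),
    ],
)

query = """
WITH daily AS (
    -- One row per work day on or after the start date, with that day's passes.
    SELECT t1.date AS date, COUNT(t2.employee_id) AS passed
    FROM table1 t1
    LEFT JOIN table2 t2
      ON t2.test_date = t1.date
     AND t2.training_passed = 1
    WHERE t1.is_work_day = 1
      AND t1.date >= '2023-01-26'
    GROUP BY t1.date
)
-- Running total of the daily pass counts.
SELECT date, SUM(passed) OVER (ORDER BY date) AS cumulative_passed_training
FROM daily
ORDER BY date
"""

tally = conn.execute(query).fetchall()
for row in tally:
    print(row)
```

Here the passed-test condition has to stay in the ON clause: moving it to WHERE would turn the LEFT JOIN back into an inner join and drop 2023-01-27, which has no passes.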

Access Query: Match Two FKs, Select Record with Max (Latest) Time, Return 3d Field From Record

I have an Access table (Logs) like this:
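| pk | modID | relID | DateTime | TxType |
| ---- | ---- | ---- | ---- | ---- |
| 1 | 1234 | 22.3 | 10/1/22 04:00 | 1 |
| 2 | 1234 | 23.1 | 10/10/22 06:00 | 1 |
| 3 | 1234 | 23.1 | 10/11/22 07:00 | 2 |
| 4 | 1234 | 23.1 | 10/12/22 08:00 | 3 |
| 5 | 4321 | 22.3 | 10/2/22 06:00 | 7 |
| 6 | 4321 | 23.1 | 10/10/22 06:00 | 1 |
| 7 | 4321 | 23.1 | 10/11/22 07:30 | 3 |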
Trying to write a query as part of a function that searches this table:
for all records matching a given modID and relID (e.g. 1234 and 23.1),
picks the most recent one (the MAX of DateTime),
returns the TxType for that record.
However, a bit new to Access and its query structure is vexing me. I landed on this but because I have to include a Total/Aggregate function for TxType I had to either choose Group By (not what I want) or Last (closer, but returns junk results). The SQL for my query is currently:
SELECT Last(Logs.TxType) AS LastOfTxType, Max(Logs.DateTime) AS MaxOfDateTime
FROM Logs
GROUP BY Logs.dmID, Logs.relID
HAVING (((Logs.dmID)=[EnterdmID]) AND ((Logs.relID)=[EnterrelID]));
It returns the TxType field when I pass it the right parameters, but not the correct record - I would like to be rid of the Last() bit but if I remove it Access complains that I don't have it as part of an aggregate function.
Anyone that can point me in the right direction here?
Have you tried this?
SELECT TOP 1 TxType
FROM Logs
WHERE (((Logs.dmID)=[EnterdmID]) AND ((Logs.relID)=[EnterrelID]))
ORDER BY DateTime DESC;
That will give you the latest single data row based on your DateTime field and other criteria.
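The "filter, sort by time descending, take the first row" pattern can be tried locally with the sketch below. It uses SQLite through Python as a stand-in for Access, so `LIMIT 1` replaces `TOP 1`. The function name `latest_tx_type` is hypothetical, the column is called modID as in the sample table, and the timestamps are rewritten in ISO form on the assumption that the sample dates are month/day/year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Logs ("
    "pk INTEGER PRIMARY KEY, modID INTEGER, relID TEXT, DateTime TEXT, TxType INTEGER)"
)
# Sample rows from the question; ISO timestamps so text ordering is chronological.
conn.executemany(
    "INSERT INTO Logs VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1234, "22.3", "2022-10-01 04:00", 1),
        (2, 1234, "23.1", "2022-10-10 06:00", 1),
        (3, 1234, "23.1", "2022-10-11 07:00", 2),
        (4, 1234, "23.1", "2022-10-12 08:00", 3),
        (5, 4321, "22.3", "2022-10-02 06:00", 7),
        (6, 4321, "23.1", "2022-10-10 06:00", 1),
        (7, 4321, "23.1", "2022-10-11 07:30", 3),
    ],
)


def latest_tx_type(connection, mod_id, rel_id):
    """Return TxType of the most recent matching record, or None if there is none."""
    row = connection.execute(
        """
        SELECT TxType
        FROM Logs
        WHERE modID = ? AND relID = ?
        ORDER BY DateTime DESC
        LIMIT 1
        """,
        (mod_id, rel_id),
    ).fetchone()
    return row[0] if row else None


print(latest_tx_type(conn, 1234, "23.1"))
```

No aggregate function is involved, which is why the Last() and Group By problem from the question does not arise.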

How to go between a set of dates and times

I have a set of data where one column is date and time. I have been asked for all the data in the table between two dates and, within those dates, only within a certain time window. For example, I want data between 01/02/2019 and 10/02/2019 and within the times 12:00 AM to 07:00 AM. (My real date ranges span a number of months; I am just using these dates as an example.)
I can cast the date and time into two different columns to separate them out as shown below:
select
name
,dateandtimetest
,cast(dateandtimetest as date) as JustDate
,cast(dateandtimetest as time) as JustTime
INTO #Test01
from [dbo].[TestTable]
I put this into a test table so that I could see whether I could use BETWEEN on the JustTime column, because I know I can do the BETWEEN on the dates without a problem. My idea was to get them done in two separate tables and perform an inner join to get the results I need:
select *
from #Test01
WHERE justtime between '00:00' and '05:00'
The above code will not give me the data I need. I have been racking my brain for this so any help would be much appreciated!
The test table I am using to try and get the correct code is shown below:
|Name | DateAndTimeTest |
| ---- | ---- |
|Lauren | 2019-02-01 04:14:00 |
|Paul | 2019-02-02 08:20:00 |
|Bill | 2019-02-03 12:00:00 |
|Graham | 2019-02-05 16:15:00 |
|Amy | 2019-02-06 02:43:00 |
|Jordan | 2019-02-06 03:00:00 |
|Sid | 2019-02-07 15:45:00 |
|Wes | 2019-02-18 01:11:00 |
|Adam | 2019-02-11 11:11:00 |
|Rhodesy | 2019-02-11 15:16:00 |
I have now tried and got the data to show me information between the times on one date using the below code, but now I would need to make this piece of code run for every date over a 3 month period
select *
from dbo.TestTable
where DateAndTimeTest between '2019-02-11 00:00:00' and '2019-02-11 08:30:00'
You can use SQL similar to the following:
select *
from dbo.TestTable
where (CAST(DateAndTimeTest as date) between '2019-02-11' AND '2019-02-11') AND
(CAST(DateAndTimeTest as time) between '00:00:00' and '08:30:00')
The above query returns all records where the DateAndTimeTest value falls in the date range 2019-02-11 to 2019-02-11 and the time is between 12:00 AM and 8:30 AM.
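To see the two-part filter working over a multi-day range, here is a small sketch using the question's sample rows and its example window (01/02/2019 to 10/02/2019, 12:00 AM to 07:00 AM). It runs on SQLite through Python as a stand-in for SQL Server, so `date()` and `time()` take the place of `CAST(... AS date)` and `CAST(... AS time)`; that substitution is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (Name TEXT, DateAndTimeTest TEXT)")
conn.executemany(
    "INSERT INTO TestTable VALUES (?, ?)",
    [
        ("Lauren", "2019-02-01 04:14:00"),
        ("Paul", "2019-02-02 08:20:00"),
        ("Bill", "2019-02-03 12:00:00"),
        ("Graham", "2019-02-05 16:15:00"),
        ("Amy", "2019-02-06 02:43:00"),
        ("Jordan", "2019-02-06 03:00:00"),
        ("Sid", "2019-02-07 15:45:00"),
        ("Wes", "2019-02-18 01:11:00"),
        ("Adam", "2019-02-11 11:11:00"),
        ("Rhodesy", "2019-02-11 15:16:00"),
    ],
)

# The date part and the time part are filtered independently, so the same
# time window applies to every day in the date range.
matches = conn.execute(
    """
    SELECT Name, DateAndTimeTest
    FROM TestTable
    WHERE date(DateAndTimeTest) BETWEEN ? AND ?
      AND time(DateAndTimeTest) BETWEEN ? AND ?
    ORDER BY DateAndTimeTest
    """,
    ("2019-02-01", "2019-02-10", "00:00:00", "07:00:00"),
).fetchall()

for row in matches:
    print(row)
```

Widening the date bounds to cover three months needs no other change, since the time condition is evaluated per row rather than per date.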

is there any optimised way to write SQL query to find difference between two data-sets?

Following is the query and sample data-set (actual data-set is huge and residing in HDFS)
I am trying to find out the diff in data-set 1 with following query.
Is there any better way to achieve this without using join if possible?
SELECT
dt1.name,
dt1.code,
dt1.day
FROM
dt1
LEFT OUTER JOIN dt2 ON (dt1.name = dt2.name AND dt1.code = dt2.code AND dt1.day = dt2.day)
WHERE
dt2.name IS NULL AND dt2.code IS NULL AND dt2.day IS NULL
following is the data set
Data SET 1
name code day
a 1001 2019-01-01
a 1002 2019-01-02
a 1003 2019-01-01
b 2001 2019-01-01
b 2002 2019-01-02
b 2003 2019-01-03
find out name-code combo of data-set 1 which is not found in data-set 2 for a given day
Data SET 2
name code day
a 1001 2019-01-01
b 1002 2019-01-01
a 1003 2019-01-01
d 2001 2019-01-01
e 2002 2019-01-01
b 2003 2019-01-01
Use Dataset.except (or exceptAll if your data contains duplicate rows that must be kept):
val result = dt1.except(dt2) // dt1 and dt2 must have the same columns
Warning: make sure both datasets have the same column order. Columns are matched by position, so a mismatch can silently produce a wrong result instead of raising an exception.
In Spark SQL the equivalent is the EXCEPT set operator; whether Impala or Hive supports it depends on the version you are running.
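The set-difference idea can be checked on the sample data with plain SQL. The sketch below uses SQLite through Python only as a small local stand-in (not Spark or Hive) and compares an EXCEPT query with the LEFT JOIN anti-join from the question; both should return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("dt1", "dt2"):
    conn.execute(f"CREATE TABLE {table} (name TEXT, code INTEGER, day TEXT)")

conn.executemany(
    "INSERT INTO dt1 VALUES (?, ?, ?)",
    [
        ("a", 1001, "2019-01-01"),
        ("a", 1002, "2019-01-02"),
        ("a", 1003, "2019-01-01"),
        ("b", 2001, "2019-01-01"),
        ("b", 2002, "2019-01-02"),
        ("b", 2003, "2019-01-03"),
    ],
)
conn.executemany(
    "INSERT INTO dt2 VALUES (?, ?, ?)",
    [
        ("a", 1001, "2019-01-01"),
        ("b", 1002, "2019-01-01"),
        ("a", 1003, "2019-01-01"),
        ("d", 2001, "2019-01-01"),
        ("e", 2002, "2019-01-01"),
        ("b", 2003, "2019-01-01"),
    ],
)

# Set difference: rows of dt1 that have no identical row in dt2.
except_rows = conn.execute(
    """
    SELECT name, code, day FROM dt1
    EXCEPT
    SELECT name, code, day FROM dt2
    ORDER BY name, code, day
    """
).fetchall()

# The anti-join from the question, for comparison.
anti_join_rows = conn.execute(
    """
    SELECT dt1.name, dt1.code, dt1.day
    FROM dt1
    LEFT OUTER JOIN dt2
      ON dt1.name = dt2.name AND dt1.code = dt2.code AND dt1.day = dt2.day
    WHERE dt2.name IS NULL
    ORDER BY dt1.name, dt1.code, dt1.day
    """
).fetchall()

for row in except_rows:
    print(row)
```

One difference to keep in mind: EXCEPT removes duplicate rows from its result, while the anti-join keeps every unmatched dt1 row, so the two only agree when dt1 has no duplicates.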

Postgres count items by interval

I am trying to get the count of items within a given interval, with no start or stop times specified. I imagine it can be done with window functions, but I am not sure how to go about it.
The problem is as follows: I would like to get the number of times people log in to a website within an arbitrary interval, say 20 minutes.
Example A
1. 2015-06-24 23:00:00
2. 2015-06-24 23:45:00
3. 2015-06-25 00:00:00
4. 2015-06-25 00:15:00
5. 2015-06-25 00:17:00
6. 2015-06-25 00:21:00
In the above example I would highlight items (2,3), (3,4,5), (4,5,6), (5,6). The output I would like is:
start,end,count
2015-06-24 23:45:00,2015-06-25 00:00:00,2
2015-06-25 00:00:00,2015-06-25 00:17:00,3
2015-06-25 00:15:00,2015-06-25 00:21:00,3
Also, only keep the rows where count >= 2; otherwise everything would be a valid grouping.
Is a window function the way I should go, or a CTE, or is there another practice to adopt?
Try this query with self join:
select a.id, a.log_at, max(b.log_at), count(1)
from logs a
join logs b on b.log_at >= a.log_at and b.log_at <= a.log_at+ '20 m'::interval
group by 1, 2
having count(1) > 1
order by 1
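The self join above can be tried on the six timestamps from Example A. The sketch below uses SQLite through Python as a local stand-in for Postgres, so `datetime(a.log_at, '+20 minutes')` replaces the interval arithmetic. The `logs(id, log_at)` layout follows the answer's query and is otherwise an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, log_at TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [
        (1, "2015-06-24 23:00:00"),
        (2, "2015-06-24 23:45:00"),
        (3, "2015-06-25 00:00:00"),
        (4, "2015-06-25 00:15:00"),
        (5, "2015-06-25 00:17:00"),
        (6, "2015-06-25 00:21:00"),
    ],
)

# Each row "a" opens a 20-minute window; rows "b" are the logins inside it.
windows = conn.execute(
    """
    SELECT a.id, a.log_at, MAX(b.log_at), COUNT(*)
    FROM logs a
    JOIN logs b
      ON b.log_at >= a.log_at
     AND b.log_at <= datetime(a.log_at, '+20 minutes')
    GROUP BY a.id, a.log_at
    HAVING COUNT(*) > 1
    ORDER BY a.id
    """
).fetchall()

for row in windows:
    print(row)
```

Every login starts its own window, so the (5,6) pair also appears in the result with a count of 2; a stricter HAVING threshold or a later filtering step would be needed to drop windows contained in an earlier one.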
You can get per-day groups with counts using a query like:
SELECT MIN(last_seen_at), MAX(last_seen_at), COUNT(*)
FROM user_kinds
GROUP BY DATE(last_seen_at)
ORDER BY DATE(last_seen_at) DESC LIMIT 5;
Which on my sample data set yields a result like:
2015-06-26 00:12:30.476548 | 2015-06-26 22:06:25.134322 | 69
2015-06-25 00:46:03.392651 | 2015-06-25 23:49:46.616964 | 14
2015-06-24 14:22:33.578176 | 2015-06-24 23:39:01.32241 | 10
2015-06-23 01:42:53.438663 | 2015-06-23 20:12:21.864601 | 2
(5 rows)