REGEX Date Match Format - sql

I currently have a dataset with varying date entries (and a mixture of string entries) that I need to parse. There are a few formats: 'M/DD/YY', 'M/D/YY', 'MM/DD/YY', 'MM/D/YY', 'MM/DD/YYYY', and so on. I could use some support with improving my regex to handle the varying formats and possible text entered in the date field.
My current Postgres query breaks out other entries into another column and reformats the date. Although I've increased the year to 4 digits rather than 2, I believe the issue may live somewhere in the 'YYYY-MM-DD' formatting, or my query does not properly accommodate the additional formats.
CASE WHEN date ~ '^\\\\d{1,2}/\\\\d{1,2}/\\\\d{4}$' THEN TO_DATE(date::date, 'YYYY-MM-DD')
ELSE NULL END AS x_date,
CASE WHEN NOT date ~ '^\\\\d{1,2}/\\\\d{1,2}/\\\\d{4}$' AND date <> '' THEN date
ELSE NULL END AS x_date_text
Values in any of the date formats should be reformatted accordingly, and non-date values should be moved over to the other column.

Based on your list of formats, I believe that just two regexes should be enough to check the values:
'^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$' would map to date format 'MM/DD/YYYY'
'^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$' would map to 'MM/DD/YY'
You can use a CASE construct to check the value against the regex and apply the proper mask when using TO_DATE().
However, since you need to split the data over two columns, you would need to tediously repeat the CASE expression twice, one for each column.
One way to simplify the solution (and to make it easier to maintain afterwards) would be to use a CTE to list the regexes and the associated date format. You can LEFT JOIN the CTE with the table.
Consider the following query:
WITH vars AS (
SELECT '^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$' AS reg, 'MM/DD/YYYY' AS format
UNION ALL SELECT '^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$', 'MM/DD/YY'
)
SELECT
CASE WHEN vars.reg IS NOT NULL THEN TO_DATE(t.date, vars.format) END x_date,
CASE WHEN vars.reg IS NULL THEN t.date END x_date_text
FROM
mytable t
LEFT JOIN vars ON t.date ~ vars.reg
If more regex/format pairs are needed, you just have to expand the CTE. Just pay attention to the fact that the regexes should be mutually exclusive (i.e. no two regexes should be able to match the same value), else the LEFT JOIN will return duplicated records in the result.
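To make the lookup-table idea concrete outside the database, here is a minimal Python sketch. The names PATTERNS and split_value are hypothetical, and the strptime formats %m/%d/%Y and %m/%d/%y are stand-ins for the TO_DATE masks, so treat this as an illustration of the pairing logic rather than the Postgres query itself.

```python
import re
from datetime import datetime

# Each regex is paired with the format used to parse the values it matches.
# The two patterns are mutually exclusive (4-digit vs 2-digit year).
PATTERNS = [
    (re.compile(r"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$"), "%m/%d/%Y"),
    (re.compile(r"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$"), "%m/%d/%y"),
]

def split_value(value):
    """Return (parsed_date, leftover_text), mirroring x_date / x_date_text.

    Both are None for an empty string."""
    for pattern, fmt in PATTERNS:
        if pattern.match(value):
            try:
                return datetime.strptime(value, fmt).date(), None
            except ValueError:
                # Right shape but not a real calendar date, e.g. 2/30/2019.
                return None, value
    return None, (value or None)

for raw in ["6/3/19", "01/15/2019", "oops", ""]:
    print(repr(raw), split_value(raw))
```

Adding a new format then means adding one more (regex, format) pair, which is the same maintenance argument made for the CTE.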

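While the regex by @GMB ensures format validity, it passes many invalid dates, and with the liberal to_date conversion in Postgres this could introduce errors and/or confusion. (The rollover is version-dependent: as far as I know, PostgreSQL 10 and later raise an error for out-of-range fields instead of rolling them over.) Run the following to see the liberal conversion: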
set datestyle = 'ISO';
select dd,'01/' || dd || '/2019' mmddyyyy, to_date ( '01/' || dd || '/2019', 'mm/dd/yyyy')
from ( select generate_series( 0,40)::text dd) d;
select mm , mm ||'/01/2019' mmddyyyy, to_date ( mm ||'/01/2019', 'mm/dd/yyyy')
from ( select generate_series( 0,40)::text mm) d;
If that liberal date conversion is acceptable - great. But if not, we can tighten it down considerably (although still not 100% valid results). Let's break the format down:
For date formats mm/dd/yyyy or mm/dd/yy, the breakdown is:
MM   valid 1 - 12
     valid characters: optional 0 followed by 1-9
                       1 followed by 0-2
     regex (0?[1-9]|1[0-2])
DD   valid 1 - 31 (sort of)
     day 31 for April, June, Sep, Nov also evaluates as valid but becomes
     day 1 of May, July, Oct, Dec respectively
     days 29-31 of Feb also evaluate as valid but become days
     1-3 of March in non-leap years, and days 30-31 become days 1-2 of March in leap years
     valid characters: optional 0 followed by 1-9
                       1-2 followed by 0-9
                       3 followed by 0-1
     regex (0?[1-9]|[1-2][0-9]|3[0-1])
YEAR intended 1900 - 2999 (no ancient history); the 4-digit pattern below actually admits 1000 - 2999
     valid characters: 1-2 followed by 0-9,0-9,0-9 (4-digit year)
                       0-9,0-9 (2-digit year)
     regex [12][0-9]{3} or [0-9]{2}
Now putting that together we get.
-- setup
drop table if exists my_dates;
create table my_dates(test_date text, status text);
insert into my_dates (test_date, status)
values ('01/15/2019', 'valid')
, ('12/25/0001', 'invalid year < 1900')
, ('12/01/2020', 'valid')
, ('oops', 'yea a date NOT')
, ('6/3/19', 'valid')
, ('2/29/2019', 'valid sort of, Postgres liberal evaluation of to_date')
, ('2/30/2019', 'valid sort of, Postgres liberal evaluation of to_date')
, ('2/31/2019', 'valid sort of, Postgres liberal evaluation of to_date')
, ('2/29/2020', 'valid')
, ('14/29/2020', 'invalid month 14')
, ('01/32/2019', 'invalid day 32')
, ('04/31/2019', 'valid sort of, Postgres liberal evaluation of to_date')
;
-- as query
set datestyle = 'ISO';
with patterns (pat, fmt) as (values ('^(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])/[12][0-9]{3}$'::text, 'mm/dd/yyyy')
, ('^(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])/[0-9]{2}$'::text, 'mm/dd/yy')
)
select to_date(test_date, fmt),status, test_date, pat, fmt
from my_dates
left join patterns on test_date ~ pat;
------------------------------------------------------------------
-- function accessible from SQL
create or replace function parse_date(check_date_in text)
returns date
language sql
as $$
with patterns (pat, fmt) as (values ('^(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])/[12][0-9]{3}$'::text, 'mm/dd/yyyy')
, ('^(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])/[0-9]{2}$'::text, 'mm/dd/yy')
)
select to_date(check_date_in, fmt)
from patterns
where check_date_in ~ pat;
$$;
--- test function
select test_date, parse_date(test_date), status from my_dates;
-- use demo
select * from my_dates
where parse_date(test_date) >= date '2020-01-02';
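The "still not 100% valid" caveat can be checked quickly outside the database. The following Python sketch reuses the 4-digit-year pattern from above and compares it with a strict calendar parse; TIGHT, is_real_date and the sample list are illustrative names and data, and Python's strptime is used here only as a strict reference, not as a model of to_date.

```python
import re
from datetime import datetime

# Same shape as the 4-digit-year pattern used in the answer.
TIGHT = re.compile(r"^(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])/[12][0-9]{3}$")

def is_real_date(text):
    """True only if text is a genuine calendar date in m/d/yyyy form."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False

samples = ["01/15/2019", "2/29/2019", "2/29/2020",
           "04/31/2019", "14/29/2020", "01/32/2019"]

# Values the regex accepts even though no such calendar day exists.
slipped_through = [s for s in samples if TIGHT.match(s) and not is_real_date(s)]
print(slipped_through)  # ['2/29/2019', '04/31/2019']
```

The regex rejects month 14 and day 32, but month-length and leap-year rules are beyond what a pattern of this size can express, so a final strict parse is still needed if those cases matter.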

Related

How to convert a date to a string

I want to get only the 'date hours:minutes:seconds' from the Date column
Date
10/11/22 12:14:01,807000000
11/12/22 13:15:46,650000000
29/12/22 14:30:46,501000000
and I want to get a string column with date hours:minutes:seconds
Date_string
10/11/22 12:14:01
11/12/22 13:15:46
29/12/22 14:30:46
I tried this code but it doesn't work:
select*, TO_CHAR(extract(hour from (Date)))||':'||TO_CHAR(extract(minute from (Date)))||':'||TO_CHAR(extract(second from (Date))) as Date_string
from table;
If this is a date column, you could use to_char directly:
SELECT m.*, TO_CHAR(my_date_column, 'dd/mm/yy hh24:mi:ss')
FROM mytable m
You can use the REGEXP_SUBSTR function to get the part of the string to the left of the comma.
SELECT REGEXP_SUBSTR (Date_string, '[^,]+', 1, 1)
AS left_part
FROM Table1;
where [^,]+ matches a run of characters that are NOT commas, searching from position 1
and returning the first occurrence (the part on the left)
Result:
LEFT_PART
10/11/22 12:14:01
11/12/22 13:15:46
29/12/22 14:30:46
reference:
https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm
Just do it with the TO_DATE() and TO_CHAR() function pair, both operating on the Oracle date format strings:
Building the scenario:
-- your input ..
WITH indata(dt) AS (
SELECT '10/11/22 12:14:01,807000000' FROM dual UNION ALL
SELECT '11/12/22 13:15:46,650000000' FROM dual UNION ALL
SELECT '29/12/22 14:30:46,501000000' FROM dual
)
-- end of your input. Real query starts here.
-- Change following comma to "WITH" ..
,
-- Now convert to TIMESTAMP(9) ...
as_ts AS (
SELECT
TO_TIMESTAMP(dt ,'DD/MM/YY HH24:MI:SS,FF9') AS ts
FROM indata
)
SELECT
ts
, CAST(ts AS TIMESTAMP(0)) AS recast -- note: this is rounded
, TO_CHAR(ts,'DD/MM/YY HH24:MI:SS') AS reformatted -- this is truncated
FROM as_ts
Result:
TS                            RECAST              REFORMATTED
10-NOV-22 12.14.01.807000000  10-NOV-22 12.14.02  10/11/22 12:14:01
11-DEC-22 13.15.46.650000000  11-DEC-22 13.15.47  11/12/22 13:15:46
29-DEC-22 14.30.46.501000000  29-DEC-22 14.30.47  29/12/22 14:30:46
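The difference between the rounded RECAST column and the truncated REFORMATTED column can be reproduced with a small Python sketch. The function names are hypothetical, and the half-up rounding rule is an assumption chosen to match the output shown above, not a statement about Oracle internals. The fraction is split off by hand because Python's %f directive accepts at most six digits.

```python
from datetime import datetime, timedelta

FMT = "%d/%m/%y %H:%M:%S"

def parse_parts(raw):
    """Split 'DD/MM/YY HH:MI:SS,FFFFFFFFF' into whole seconds and a fraction."""
    stamp, _, fraction = raw.partition(",")
    whole = datetime.strptime(stamp, FMT)
    frac = float("0." + fraction) if fraction else 0.0
    return whole, frac

def truncated(raw):
    """Drop the fractional seconds, like TO_CHAR without an FF element."""
    whole, _ = parse_parts(raw)
    return whole.strftime(FMT)

def rounded(raw):
    """Round to the nearest second (assumed half-up), like the CAST above."""
    whole, frac = parse_parts(raw)
    if frac >= 0.5:
        whole += timedelta(seconds=1)
    return whole.strftime(FMT)

raw = "10/11/22 12:14:01,807000000"
print(truncated(raw))  # 10/11/22 12:14:01
print(rounded(raw))    # 10/11/22 12:14:02
```

If the goal is simply to cut the string at the seconds, truncation is the behaviour that matches the requested output.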
Going by what you have in your question, it appears that the data in the field Date is a timestamp. This isn't a problem, but the names of the table (TABLE) and field (Date) present some challenges.
In Oracle, TABLE is a reserved word - so to use it as the name of a table it must be quoted by putting it inside double-quotes, as "TABLE". Similarly, Date is a mixed-case identifier and must likewise be quoted (e.g. "Date") every time it's used.
Given the above your query becomes:
SELECT TO_CHAR("Date", 'DD/MM/YY HH24:MI:SS') AS FORMATTED_DATE
FROM "TABLE"
and produces the desired results.
Generally, it's best in Oracle to avoid using reserved words as identifiers, and to allow the database to convert all names to upper case - if you do that you don't have to quote the names, and you can refer to them by upper or lower case as the database automatically converts all unquoted names to upper case internally.

retrieve different format of date values with time from string (t-sql)

I have a requirement where I have to pull the date/time value from a string, but the problem is that the values can be in different formats, which makes the substring logic more complicated.
Here's what I came up with, but is there any other method where I could simply retrieve dates of different formats with time and convert them all to a single format?
IF OBJECT_ID('tempdb..#temp') IS NOT NULL
DROP TABLE #temp
CREATE TABLE #temp (
comments varchar(500)
)
insert into #temp (comments)
(
select 'Mailed on 1/1/22 at 5 pm'
union
select 'Mailed on 01/2/2222 # 6 am'
union
select 'Mailed on 01/2/22 in night'
union
select 'Mailed on 1/02/2222 at 4 pm'
union
select 'Mailed on 1/1/2222 at 4 pm'
);
select *
from #temp
cross apply (select PATINDEX('%Mailed On%',comments) as start_pos) as start_pos
cross apply (select case when substring(comments,patindex('%Mailed On%',comments)+9,11) like '%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%' then 1
when substring(comments,patindex('%Mailed On%',comments)+9,8) like '%[0-9][0-9]/[0-9]/[0-9][0-9]%' then 2
when substring(comments,patindex('%Mailed On%',comments)+9,10) like '%[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%' then 3
when substring(comments,patindex('%Mailed On%',comments)+9,9) like '%[0-9][0-9]/[0-9][0-9]/[0-9][0-9]%' then 4
when substring(comments,patindex('%Mailed On%',comments)+9,9) like '%[0-9]/[0-9]/[0-9][0-9][0-9][0-9]%' then 5
when substring(comments,patindex('%Mailed On%',comments)+9,7) like '%[0-9]/[0-9]/[0-9][0-9]%' then 6 else null end as substr) as substr
--cross apply (select case when substring(authcomments,start_pos + 9, 11) like '%[1-9]/[0123][0-9]/[0-9][0-9][0-9][0-9]%' then 1 else null end as substr) as substr
cross apply (select case when substr = 1 then substring(comments,patindex('%Mailed On%',comments)+9,11)
when substr = 2 then substring(comments,patindex('%Mailed On%',comments)+9,8)
when substr = 3 then substring(comments,patindex('%Mailed On%',comments)+9,10)
when substr = 4 then substring(comments,patindex('%Mailed On%',comments)+9,9)
when substr = 5 then substring(comments,patindex('%Mailed On%',comments)+9,9)
when substr = 6 then substring(comments,patindex('%Mailed On%',comments)+9,7)
else null end as maileddate
) as maileddate
@user1672315,
Sometimes you get stuff like this and in order to fix it so that you can get the dates and times to store in a table or whatever, ya gotta do what ya gotta do to get it and, contrary to the comments, it certainly CAN be done in SQL. It's just not that difficult. Ya just gotta know some of the "gazintas" ;)
So, using the readily consumable test data that you were nice enough to provide, run the following code against it...
SELECT t.*
,TheDateAndTime = DATEADD(hh,ca4.cHour,ca3.cDate)
FROM #temp t
CROSS APPLY(VALUES(SUBSTRING(comments,PATINDEX('%[0-9]%',comments),500))) ca1(DT)
CROSS APPLY(VALUES(SUBSTRING(ca1.dt,PATINDEX('% [0-9]%',ca1.dt),500))) ca2(TM)
CROSS APPLY(VALUES(TRY_CONVERT(DATETIME,SUBSTRING(ca1.DT,1,PATINDEX('%[0-9] %',ca1.DT))))) ca3(cDate)
CROSS APPLY(VALUES(IIF(ca2.TM LIKE '%night%',23,DATEPART(hh,TRY_CONVERT(DATETIME,ca2.TM)))))ca4(cHour)
;
... and see that you CAN do it in SQL... BUT, see the warnings below the results.
You also need to figure out what hour "night" is going to be assigned. I assigned "23" as the hour.
Results are as follows:
I'm thinking that your "2222" years are in error, though. :D
One thing I do agree on is that the format needs to be somewhat consistent. No code in the world, Python or otherwise, will be able to distinguish between a mm-dd-yy and dd-mm-yy format when dd and mm are both less than 13. The code I posted assumes (m)m-(d)d-yy and is based on the current LANGUAGE and DATEFORMAT that I'm using. It WILL return NULLs where the mm part isn't between 1 and 12 or if the dd part isn't between 1 and 31 or if the date is an "illegal date" like 2/29/2021, etc, though.
It also assumes that the format will always contain the numeric date as the first set of numeric values it comes across and that the time will always be the last thing in the string. We can add more checks, if needed but, like I said, unless mm is >=13, it cannot (nor can anything else) determine if it should be mm-dd-yy or dd-mm-yy because there's simply no other information in the string to indicate which format is being used. You MUST check your date format to use this, as well. If the strings are supposed to be in the dd-mm-yy format, we may have to make a change (although I believe SQL server will auto-magically accommodate that if the DATEFORMAT matches the intention of the string).
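The ambiguity described above is easy to demonstrate with a short Python sketch. Everything here is illustrative: DATE_TOKEN and extract_date are hypothetical names, the day_first flag stands in for the DATEFORMAT setting, and mapping two-digit years to 2000-2099 is an assumption rather than SQL Server's cutoff rule.

```python
import re
from datetime import date

# First n/n/yy or n/n/yyyy token in the text, not embedded in a longer digit run.
DATE_TOKEN = re.compile(r"(?<!\d)(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})(?!\d)")

def extract_date(comment, day_first=False):
    """Return the first date found in comment, or None if absent or illegal."""
    m = DATE_TOKEN.search(comment)
    if m is None:
        return None
    first, second, year = (int(g) for g in m.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years mean 2000-2099
    month, day = (second, first) if day_first else (first, second)
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13 or 2/29/2021

print(extract_date("Mailed on 3/4/22 at 5 pm"))                  # 2022-03-04
print(extract_date("Mailed on 3/4/22 at 5 pm", day_first=True))  # 2022-04-03
```

Both calls succeed on the same string and return different dates, which is why the expected order has to be decided up front rather than inferred from the data.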

How to convert an int to DateTime in BigQuery

I have an INT64 column called "Date" which contains many different numbers like: "20210209" or "20200305". I want to turn those numbers into a date with this format: MM-YYYY (so in these cases, 02-2021 and 03-2020). Ultimately I want to sum all the data in each month together. The problem is that BigQuery can't convert INT64 to date, only to strings. I'm not sure if I should convert to a string and then to a date or if there is a better way.
Although converting to a string and then to a date both works and is very concise, over large enough numbers of rows (which may be the case in BigQuery) you may be better off using integer maths and using DATE(year, month, day)...
https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions#date
SELECT
DATE(
DIV( 20210209 , 10000), -- Which gives 2021
DIV(MOD(20210209, 10000), 100), -- Which gives 02
MOD(20210209, 100) -- Which gives 09
)
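The same integer arithmetic, written as a small Python sketch for anyone who wants to check the DIV/MOD steps by hand. The function names int_to_date and month_label are made up for this example; the MM-YYYY label is the format asked for in the question.

```python
from datetime import date

def int_to_date(yyyymmdd):
    """Split an integer such as 20210209 into year, month and day."""
    year, rest = divmod(yyyymmdd, 10000)  # 20210209 -> (2021, 209)
    month, day = divmod(rest, 100)        # 209 -> (2, 9)
    return date(year, month, day)

def month_label(yyyymmdd):
    """Format the month as MM-YYYY, e.g. for grouping rows by month."""
    return int_to_date(yyyymmdd).strftime("%m-%Y")

print(int_to_date(20210209))  # 2021-02-09
print(month_label(20200305))  # 03-2020
```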
You can convert the value to a string and use parse_date():
select parse_date('%Y%m%d', cast(20210209 as string))
Another option
select date,
regexp_replace('' || date, r'(\d{4})(\d{2})(\d{2})', r'\2-\1') as MM_YYYY
from your_table
if applied to sample data in your question - output is
date      MM_YYYY
20210209  02-2021
20200305  03-2020
Yet another option
select date,
format_date('%m-%Y', parse_date('%Y%m%d', '' || date)) as MM_YYYY
from your_table
with same output

How to compare date to format date on oracle

How can we compare a date to a format in Oracle?
Something like this: if MyDate is on format DD MONTH YYYY THEN /....
elsif MyDate is on format YYYY-MONTH-DD Then...
EDIT: My dates are in varchar2 and I want to keep them that way. I just want to know how to write a regex that would represent, for example, 10 October 2010.
Is it possible? If so, what would the regex look like?
Echoing what was mentioned in the comments to your question, best practice would be to have an actual DATE type field instead of VARCHAR2, and if you needed specific display formats, store those in another field as a format pattern. That said, you can use REGEXP_LIKE to check the format using the patterns in the below example.
with dateinfo as (
select 1 as id, '2018-MARCH-10' as dtString from dual
union all
select 2 as id, '10 MARCH 2018' as dtString from dual )
select id, dtString,
case
when regexp_like(dtString, '^[0-9]{4}-[a-zA-Z]{3,}-[0-9]{1,2}$') then 'format1'
when regexp_like(dtString, '^[0-9]{1,2} [a-zA-Z]{3,} [0-9]{4}$') then 'format2'
else 'no format'
end as dtFormat
from dateinfo;
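For experimenting with the patterns before putting them into REGEXP_LIKE, a shape-only check along the same lines can be sketched in Python. FORMATS and classify are hypothetical names, and the patterns only test the layout of the string; they do not verify that the month name is real.

```python
import re

# Pattern per layout, tried in order; the first match wins.
FORMATS = [
    ("format1", re.compile(r"^[0-9]{4}-[a-zA-Z]{3,}-[0-9]{1,2}$")),  # YYYY-MONTH-DD
    ("format2", re.compile(r"^[0-9]{1,2} [a-zA-Z]{3,} [0-9]{4}$")),  # DD MONTH YYYY
]

def classify(text):
    """Return the name of the first layout that text matches."""
    for name, pattern in FORMATS:
        if pattern.match(text):
            return name
    return "no format"

for value in ["2018-MARCH-10", "10 October 2010", "10/10/2010"]:
    print(value, "->", classify(value))
```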

Date arithmetic in SQL on DB2/ODBC

I'm building a query against a DB2 database, connecting through the IBM Client Access ODBC driver. I want to pull fields that are less than 6 days old, based on the field 'a.ofbkddt'... the problem is that this field is not a date field, but rather a DECIMAL field, formatted as YYYYMMDD.
I was able to break down the decimal field by wrapping it in a call to char(), then using substr() to pull the year, month and day fields. I then formatted this as a date, and called the days() function, which gives a number that I can perform arithmetic on.
Here's an example of the query:
select
days( current date) -
days( substr(char(a.ofbkddt),1,4) concat '-' -- YYYY-
concat substr(char(a.ofbkddt),5,2) concat '-' -- MM-
concat substr(char(a.ofbkddt),7,2) ) as difference, -- DD
a.ofbkddt as mydate
from QS36F.ASDF a
This yields the following:
difference mydate
2402 20050402
2025 20060306
...
4 20110917
3 20110918
2 20110919
1 20110920
This is what I expect to see... however when I use the same logic in the where clause of my query:
select
days( current date) -
days( substr(char(a.ofbkddt),1,4) concat '-' -- YYYY-
concat substr(char(a.ofbkddt),5,2) concat '-' -- MM-
concat substr(char(a.ofbkddt),7,2) ) as difference, -- DD
a.ofbkddt as mydate
from QS36F.ASDF a
where
(
days( current date) -
days( substr(char(a.ofbkddt),1,4) concat '-' -- YYYY-
concat substr(char(a.ofbkddt),5,2) concat '-' -- MM
concat substr(char(a.ofbkddt),7,2) ) -- DD
) < 6
I don't get any results back from my query, even though it's clear that I am getting date differences of as little as 1 day (obviously less than the 6 days that I'm requesting in the where clause).
My first thought was that the return type of days() might not be an integer, causing the comparison to fail... according to the documentation for days() found at http://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/db2/rbafzmst02.htm, it returns a bigint. I cast the difference to integer, just to be safe, but this had no effect.
You're going about this backwards. Rather than using a function on every single value in the table (so you can compare it to the date), you should pre-compute the difference in the date. It's costing you resources to run the function on every row - you'd save a lot if you could just do it against CURRENT_DATE (it'd maybe save you even more if you could do it in your application code, but I realize this might not be possible). Your dates are in a sortable format, after all.
The query looks like so:
SELECT ofbkddt as myDate
FROM QS36F.ASDF
WHERE ofbkddt > ((int(substr(char(current_date - 6 days, ISO), 1, 4)) * 10000) +
(int(substr(char(current_date - 6 days, ISO), 6, 2)) * 100) +
(int(substr(char(current_date - 6 days, ISO), 9, 2))))
Which, when run against your sample datatable, yields the following:
myDate
=============
20110917
20110918
20110919
20110920
You might also want to look into creating a calendar table, and add these dates as one of the columns.
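The "compute the cutoff once" idea is also straightforward to do in application code, as hinted above. Here is a minimal Python sketch; cutoff_as_int is a hypothetical helper, the rows list is the sample data from the question, and the run date of 2011-09-21 is an assumption inferred from the differences shown there.

```python
from datetime import date, timedelta

def cutoff_as_int(today, days_back=6):
    """Return the date days_back days before today as a YYYYMMDD integer."""
    cutoff = today - timedelta(days=days_back)
    return cutoff.year * 10000 + cutoff.month * 100 + cutoff.day

# Sample values from the question; 2011-09-21 is the assumed run date.
rows = [20050402, 20060306, 20110917, 20110918, 20110919, 20110920]
limit = cutoff_as_int(date(2011, 9, 21))  # 20110915
recent = [r for r in rows if r > limit]
print(recent)  # [20110917, 20110918, 20110919, 20110920]
```

Because YYYYMMDD integers sort in date order, the comparison can run directly against the stored DECIMAL column, and the resulting integer can be passed to the query as a parameter.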
What if you try a common table expression?
WITH A AS
(
select
days( current date) -
days( substr(char(a.ofbkddt),1,4) concat '-' -- YYYY-
concat substr(char(a.ofbkddt),5,2) concat '-' -- MM-
concat substr(char(a.ofbkddt),7,2) ) as difference, -- DD
a.ofbkddt as mydate
from QS36F.ASDF a
)
SELECT
*
FROM
a
WHERE
difference < 6
Does your data have some nulls in a.ofbkddt? Maybe this is causing some funny behaviour in how db2 is evaluating the less than operation.