Spark SQL change date format - sql

I'm trying to change the date format of my data from (11 20, 2014) to 2014-11-20.
I tried this:
df.withColumn("newDate", to_date(col("reviewTime"),("mm dd, yyyy")))
Because days with single digits appear as 1, 2, 8 instead of 01, 02, 08, I got this message:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '09 1, 2014' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Caused by: DateTimeParseException: Text '09 1, 2014' could not be parsed at index 3
Is there a way to fix this?
Thanks!

Some of your date rows were written by an old Spark version. You should add a Spark configuration:
spark.sql.parquet.int96RebaseModeInRead = "LEGACY"
or
spark.sql.parquet.int96RebaseModeInRead = "CORRECTED"
depending on your requirements; the error message explains the difference between the two options.
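Note that the error itself points at a different setting, spark.sql.legacy.timeParserPolicy. As a minimal sketch (whether each key is settable at runtime depends on your Spark version), either configuration can be set straight from a Spark SQL session:
SET spark.sql.legacy.timeParserPolicy = LEGACY;
-- or, for the parquet rebase mode mentioned above:
SET spark.sql.parquet.int96RebaseModeInRead = CORRECTED;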

You can use the format (M d, yyyy) to deal with it.
Example (Scala Spark):
Seq(
  "(11 20, 2014)",
  "(1 3, 2013)",
  "(2 20, 2012)",
  "(4 22, 2014)"
).toDF("ugly_date")
  .withColumn("date", to_date($"ugly_date", "(M d, yyyy)"))
  .show(false)
Output:
+-------------+----------+
|ugly_date |date |
+-------------+----------+
|(11 20, 2014)|2014-11-20|
|(1 3, 2013) |2013-01-03|
|(2 20, 2012) |2012-02-20|
|(4 22, 2014) |2014-04-22|
+-------------+----------+
For more information about Datetime Patterns see https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
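If you are working from plain SQL rather than the DataFrame API, the same pattern string should work with to_date there too; a quick sketch:
SELECT to_date('(11 20, 2014)', '(M d, yyyy)') AS date;
-- expected: 2014-11-20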
UPD: screenshot with the result (environment: Spark 3.1.2, Scala 2.12.10, running on Zeppelin 0.9.0).

Related

How to parse json data in a column with Druid SQL?

I'm trying to parse json data in a column with Druid SQL in Superset SQL lab. My table looks like this:
id | json_scores
0  | {"foo": 20, "bar": 10}
1  | {"foo": 30, "bar": 10}
I'm looking for something similar to json_extract in MySQL e.g.
SELECT *
FROM my_table
WHERE json_extract(json_scores, '$.foo') > 10;
Druid doesn't support a json_extract function. Druid supports only ANSI SQL-92, which does not understand JSON as a data type.
Supported data types are listed on this page: https://docs.imply.io/latest/druid/querying/sql-data-types/
You can use any of the expressions listed here: https://druid.apache.org/docs/latest/misc/math-expr.html#string-functions
In your case, consider using regexp_extract:
regexp_extract(json_scores, '(?<=\"foo\":\s)(\d+)(?=,)', 0) AS foo,
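Putting that together with the original filter, a sketch (the CAST is needed because regexp_extract returns a string; table and column names as in the question):
SELECT *
FROM my_table
WHERE CAST(regexp_extract(json_scores, '(?<=\"foo\":\s)(\d+)(?=,)', 0) AS BIGINT) > 10;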

How to fix Dynamic Sql Error: -104; Token Unknown - using a subquery in the where clause to filter on max date

I am trying to query a firebird database for the first time, and I keep getting a
Dynamic SQL Error -104; Token Unknown.
line 11, column 30; AS
[SQLState:42000, ISC error code:335544634]
Error Code: 335544634
It seems to be a problem with the WHERE clause. CERT_DATE is a TIMESTAMP datatype.
I did try casting as TIMESTAMP, thinking that could be the error, but I got the same results. Any help would be greatly appreciated. Thanks!
SELECT
  EVENTS.DONE_BY_FNAME,
  CERTS.CERT_NUM,
  CERTS.CERT_DATE,
  CERTS.GAGE_SN,
  EVENTS.VENDOR
FROM EVENTS INNER JOIN CERTS ON CERTS.EVENT_NUM = EVENTS.EVENT_NUM
WHERE CERTS.CERT_DATE =
  (SELECT MAX(Z.CERT_DATE)
   FROM CERTS AS Z
   WHERE Z.EVENT_NUM = CERTS.EVENT_NUM
  )
Expected Results
DONE_BY_FNAME | CERT_NUM | CERT_DATE             | GAGE_SN | VENDOR
GRES          | 12308    | 2019-01-14 00:00:00.0 | AI0186  | NW WELDERS
The error indicates you are using Firebird 1.5 or lower. The problem is that Firebird 1.5 and lower do not support AS to define table aliases.
In InterBase 6, Firebird 1 and Firebird 1.5, the FROM clause was defined as (from InterBase 6.0 Language Reference):
FROM <tableref> [, <tableref> …]
<tableref> = <joined_table> | table | view | procedure
[(<val> [, <val> …])] [alias]
As you can see, this syntax does not allow AS before the alias (otherwise it would have been [[AS] alias]). This support for the optional AS token was added in Firebird 2.0 as part of the Derived Tables support.
As a short term solution, replace CERTS AS Z with CERTS Z.
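With that change applied, the query from the question becomes:
SELECT
  EVENTS.DONE_BY_FNAME,
  CERTS.CERT_NUM,
  CERTS.CERT_DATE,
  CERTS.GAGE_SN,
  EVENTS.VENDOR
FROM EVENTS INNER JOIN CERTS ON CERTS.EVENT_NUM = EVENTS.EVENT_NUM
WHERE CERTS.CERT_DATE =
  (SELECT MAX(Z.CERT_DATE)
   FROM CERTS Z
   WHERE Z.EVENT_NUM = CERTS.EVENT_NUM
  )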
You should really upgrade though: Firebird 1.5 is no longer supported (support was stopped almost 10 years ago!) and contains known security issues that were fixed in later Firebird releases.

Extracting Values from Array in Redshift SQL

I have some arrays stored in Redshift table "transactions" in the following format:
id, total, breakdown
1, 100, [50,50]
2, 200, [150,50]
3, 125, [15, 110]
...
n, 10000, [100,900]
Since this format is useless to me, I need to do some processing on this to get the values out. I've tried using regex to extract it.
SELECT regexp_substr(breakdown, '\[([0-9]+),([0-9]+)\]')
FROM transactions
but I get an error returned that says
Unmatched ( or \(
Detail:
-----------------------------------------------
error: Unmatched ( or \(
code: 8002
context: T_regexp_init
query: 8946413
location: funcs_expr.cpp:130
process: query3_40 [pid=17533]
--------------------------------------------
Ideally I would like to get x and y as their own columns so I can do the appropriate math. I know I can do this fairly easy in python or PHP or the like, but I'm interested in a pure SQL solution - partially because I'm using an online SQL editor (Mode Analytics) to plot it easily as a dashboard.
Thanks for your help!
If breakdown really is an array you can do this:
select id, total, breakdown[1] as x, breakdown[2] as y
from transactions;
If breakdown is not an array but e.g. a varchar column, you can cast it into an array if you replace the square brackets with curly braces:
select id, total,
(translate(breakdown, '[]', '{}')::integer[])[1] as x,
(translate(breakdown, '[]', '{}')::integer[])[2] as y
from transactions;
You can try this:
SELECT REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '') AS x,
       REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '') AS y
FROM transactions;
I tried this with a Redshift db and it worked for me.
Detailed Explanation:
SPLIT_PART(breakdown,',',1) will give you [50.
SPLIT_PART(breakdown,',',2) will give you 50].
REPLACE(SPLIT_PART(breakdown,',',1),'[','') will replace the [ and will give just 50.
REPLACE(SPLIT_PART(breakdown,',',2),']','') will replace the ] and will give just 50.
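Since the extracted values are still text, you would cast them before doing any math on them; a sketch using Redshift's :: cast syntax:
SELECT id, total,
       REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '')::int AS x,
       REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '')::int AS y
FROM transactions;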
I know it's an old post, but if someone needs a much easier way:
select json_extract_array_element_text('[100,101,102]', 2);
Output: 102
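Applied to the table above, a sketch (note the element index is zero-based; this assumes breakdown is stored as text containing a valid JSON array):
SELECT id, total,
       json_extract_array_element_text(breakdown, 0)::int AS x,
       json_extract_array_element_text(breakdown, 1)::int AS y
FROM transactions;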

NHibernate varying the way it creates DateTime based queries based on the day of the month

I am getting some weird behaviour in how NHibernate sends queries to the database for DateTimeOffset fields. Given the following example:
DateTimeOffset? myDate = new DateTimeOffset(2012, 3, 17, 0, 0, 0, new TimeSpan(1, 0, 0));
var test = HibernateSession.Query<ExampleEntity>().Where(c => c.DateTimeOffsetField > myDate).ToList();
DateTimeOffset? myDate2 = new DateTimeOffset(2012, 3, 1, 0, 0, 0, new TimeSpan(1, 0, 0));
var test2 = HibernateSession.Query<ExampleEntity>().Where(c => c.DateTimeOffsetField > myDate2).ToList();
Using NHibernate Profiler to look at the SQL generated, the first query shows up as
exampleentity0_.[DateTimeOffsetField] > '17/03/2012 00:00:00 +01:00' /* #p0 */
the second as
exampleentity0_.[DateTimeOffsetField] > '2012-01-02T23:00:00.00' /* #p0 */
Notice the different formatting on the dates? If the day of the month is greater than 12 it uses the first format; if it is less than 12 it uses the second. This causes errors when we have dates in the first format, because SQL Server cannot convert the string to a valid date: it looks for month 17 (as in this example). This is driving me nuts!!
Has anyone seen this behaviour before?
Is it possible to tell NHibernate to always use the yyyy-MM-dd format?
Thanks
Tom
p.s. using FluentNHibernate for the mapping and configuration. An example of the mapping would be
Map(a => a.DateTimeOffsetField).Not.Nullable();
...i.e. nothing unusual.
I have "fixed" this issue by changing the sql server instance to use a British time culture so that is accepts both values created by NHibernate... still be interested to know why these differences occur though.

django order by date in datetime / extract date from datetime

I have a model with a datetime field and I want to show the most viewed entries for the day today.
I thought I might try something like dt_published__date to extract the date from the datetime field but obviously it didn't work.
popular = Entry.objects.filter(type='A', is_public=True).order_by('-dt_published__date', '-views', '-dt_written', 'headline')[0:5]
How can I do this?
AFAIK the __date syntax is not supported yet by Django. There is a ticket open for this.
If your database has a function to extract date part then you can do this:
popular = Entry.objects.filter(**conditions).extra(
    select={'custom_dt': 'to_date(dt_published)'}).order_by('-custom_dt')
In newer Django it works out of the box (tested on 3.2 with MySQL 5.7).
Dataset
[
{ "id": 82148, "paid_date": "2019-09-30 20:51:11"},
{ "id": 82315, "paid_date": "2019-09-30 00:00:00"},
]
Query
Payment.objects.filter(order_id=135342).order_by('paid_date__date', 'id').values_list('id', 'paid_date__date')
Results
<QuerySet [(82148, datetime.date(2019, 9, 30)), (82315, datetime.date(2019, 9, 30))]>