Spark SQL is interpreting a datetime.date object as a mathematical formula or integer in a SQL statement - sql

I've encountered a problem in Spark SQL. It is interpreting a datetime.date object as a mathematical formula, or integer, in a SQL statement I am writing.
from datetime import date, datetime

currentDateAndTime = datetime.now()
current_month = currentDateAndTime.strftime("%m")
current_year = currentDateAndTime.strftime("%Y")
first_day_of_month = date(int(current_year), int(current_month), 1)
print(first_day_of_month)
type(first_day_of_month)
and you get:
2022-10-01
datetime.date
Then when I do
df = spark.sql("""
SELECT * FROM table_A
WHERE IncidentCreatedDate < {}
""".format(first_day_of_month))
I get an error that says AnalysisException: cannot resolve '(table_A.IncidentCreatedDate < ((2022 - 10) - 1))' due to data type mismatch: differing types in '(table_A.IncidentCreatedDate < ((2022 - 10) - 1))' (date and int);...
(There might be typos in the above because I had to retype everything on another laptop; the original is on my work laptop, and they don't like me sending anything from that laptop anywhere else.)

PySpark doesn't support prepared statements.
format will replace the placeholder, but string literals must be in single quotes, so simply add them:
df = spark.sql("""
SELECT * FROM table_A
WHERE IncidentCreatedDate < '{}'
""".format(first_day_of_month))

Related

Pass list of dates to SQL WHERE statement in PySpark

I'm in the process of converting some SAS code to PySpark; we previously used a macro variable for the WHERE statement in this code. In adapting it to PySpark, I'm trying to pass a list of dates to the WHERE statement, but I keep getting errors. I want the SQL code to pull all data from those 3 months. Any pointers?
month_list = ['202107', '202108', '202109']
sql_query = """ (SELECT *
FROM Table_Blah
WHERE (to_char(DateVariable,'yyyymm') IN '{}')
) as table1""".format(month_list)
Pass the list as a tuple to get the right SQL syntax:
month_list = ['202107', '202108', '202109']
sql_query = """ (SELECT *
FROM Table_Blah
WHERE (to_char(DateVariable,'yyyymm') IN {})
) as table1""".format(tuple(month_list))
And you don't need apostrophes around the IN clause; formatting the tuple already produces quoted values.
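To see why, here is a quick illustration of what the tuple formatting produces (plain Python, nothing Spark-specific assumed):

month_list = ['202107', '202108', '202109']
print("WHERE (to_char(DateVariable,'yyyymm') IN {})".format(tuple(month_list)))
# WHERE (to_char(DateVariable,'yyyymm') IN ('202107', '202108', '202109'))

One caveat: a single-element list formats as ('202107',) with a trailing comma, which is not valid SQL, so this trick only works for lists of two or more values.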

Issue formatting into human time

SELECT
prefix_grade_items.itemname AS Course,
prefix_grade_items.grademax,
ROUND(prefix_grade_grades_history.finalgrade, 0) AS finalgrade,
prefix_user.firstname,
prefix_user.lastname,
prefix_user.username,
prefix_grade_grades_history.timemodified
FROM
prefix_grade_grades_history
INNER JOIN prefix_user ON prefix_grade_grades_history.userid = prefix_user.id
INNER JOIN prefix_grade_items ON prefix_grade_grades_history.itemid =
prefix_grade_items.id
WHERE (prefix_grade_items.itemname IS NOT NULL)
AND (prefix_grade_items.itemtype = 'mod' OR prefix_grade_items.itemtype = 'manual')
AND (prefix_grade_items.itemmodule = 'quiz' OR prefix_grade_items.itemmodule IS NULL)
AND (prefix_grade_grades_history.timemodified IS NOT NULL)
AND (prefix_grade_grades_history.finalgrade > 0)
AND (prefix_user.deleted = 0)
ORDER BY course
Currently I am trying to polish this query. The problem I am having is converting the time queried from timemodified into human-readable time; it comes out as epoch time. I have been attempting to use expressions such as FROM_UNIXTIME(timestamp,'%a - %D %M %y %H:%i:%s') AS timestamp. For reference, this is an ad-hoc query to a Moodle server backed by MariaDB. My desired result is that nothing would change as far as the results we are getting, except that the time would be in a month/day/year format instead of the current format.
I have converted the timestamp into a custom date format using the below command in my select query.
DATE_FORMAT(FROM_UNIXTIME(`timestamp`), "%b-%d-%y")
As you mention in your question with FROM_UNIXTIME(timestamp,'%a - %D %M %y %H:%i:%s'), it is indeed possible to include a second argument to specify the time/date format you want the UNIX timestamp converted to.
That's the bit that looks like: '%a - %D %M %y %H:%i:%s' - this particular format string will give you an output that looks something like this: Fri - 24th January 20 14:17:09, which as you stated isn't quite what you were looking for, but we can fix that!
For example, the statement below will return the human-readable date (according to the value returned in the timestamp) in the month/day/year form you specified as the goal in your question, and would look similar to this: Jan/01/20
FROM_UNIXTIME(timestamp, '%b/%d/%y')
If you instead wish to use a 4-digit year you can replace the lowercase %y with a capital %Y.
Additionally, if a numeric month is preferred you can use %m in place of %b.
For a more comprehensive reference on the available specifiers that can be used to build up the format string, this page has a handy table.
So putting it all together in the specific context of your original SQL query, using FROM_UNIXTIME to get the human-readable date (along with a suitable format string to specify the output format) may look something like this:
SELECT
prefix_grade_items.itemname AS Course,
prefix_grade_items.grademax,
ROUND(prefix_grade_grades_history.finalgrade, 0) AS finalgrade,
prefix_user.firstname,
prefix_user.lastname,
prefix_user.username,
FROM_UNIXTIME(prefix_grade_grades_history.timemodified, '%b/%d/%Y') AS grademodified
FROM
prefix_grade_grades_history
INNER JOIN prefix_user ON prefix_grade_grades_history.userid = prefix_user.id
INNER JOIN prefix_grade_items ON prefix_grade_grades_history.itemid = prefix_grade_items.id
WHERE (prefix_grade_items.itemname IS NOT NULL)
AND (prefix_grade_items.itemtype = 'mod' OR prefix_grade_items.itemtype = 'manual')
AND (prefix_grade_items.itemmodule = 'quiz' OR prefix_grade_items.itemmodule IS NULL)
AND (prefix_grade_grades_history.timemodified IS NOT NULL)
AND (prefix_grade_grades_history.finalgrade > 0)
AND (prefix_user.deleted = 0)
ORDER BY course
NOTE: I ended up specifying an alias for the timemodified column, calling it instead grademodified. This was done as without an alias the column name ends up getting a little busy :)
Hope that is helpful to you! :)

Writing where query using pyspark on SQL table

I'm querying a SQL table using PySpark.
I have a SQL table with two columns (value, isDelayed), where "value" is of double type and "isDelayed" holds 0 or 1. How do I write a PySpark aggregation query that gives the sum of "value" when "isDelayed" is 1?
I've already tried below code which is giving an error
def __main__(self, data):
    delayedData = data.where(col('isDelayed').cast('int')==='1')
    groupByIsDelayed = delayedData.agg(sum(total))
    return groupByIsDelayed
I'm getting
"Syntax Error: invalid syntax"
on below line
delayedData = data.where(col('isDelayed').cast('int')==='1')
Replace data.where(col('isDelayed').cast('int')==='1') with data.where(col('isDelayed').cast('int') == 1):
== only (the equality operator in Python is two = signs, not three)
1 without quotes (because you are comparing an int, not a string)
or
data.where("isDelayed=1")

SQLDF in R - Problems with date format

I am trying to apply some transformations to a data.frame using the sqldf function in R, but I get some weird outputs. (I tried to apply some SQL date transformations in the query, but had no success.)
First the data.frame has the following format (all columns are of character class):
But after filtering the data.frame with 'sqldf':
sqldf("SELECT BP_OR, N_orden_OR, Tipo_ordenOR, N_lineaOR, OLUSER, OLPID,
(select Fecha_aprobac from aprob_or
order by Fecha_aprobac desc limit 1) AS LastOfFecha_aprobac,
Estadp_sig, Estado_ultimo
FROM aprob_or
WHERE (((OLPID)='X43008'))
GROUP BY BP_OR, N_orden_OR, Tipo_ordenOR, N_lineaOR, OLUSER, OLPID, Estadp_sig, Estado_ultimo
")
I got the following format for the column LastOfFecha_aprobac:
Those 'P5' values are what I don't understand. I formatted the SQL code with some parameters to change the date format, but it persisted.
Do you have a better idea to figure that out?

SparkSQL errors when using SQL DATE function

In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I manually built by reading a csv file and converting the columns into the right data type.
Specifically, the table I'm talking about is the LINEITEM table from the [TPC-H specification][1]. Unlike what the specification states, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.
In my single scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
When I submit the packaged jar using spark-submit, I get the following error:
Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but `;' found
When I omit the semicolon and do the same thing, I get the following error:
Exception in thread "main" java.util.NoSuchElementException: key not found: date
Spark version is 1.4.0.
Does anyone have an idea what's the problem with these queries?
[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf
SQL queries passed to SQLContext.sql shouldn't be delimited using a semicolon - this is the source of your first problem.
The DATE UDF expects a date in the YYYY-MM-DD form, so DATE('1998-12-01 00:00:00') evaluates to null. As long as the timestamp can be cast to DATE, the correct query string looks like this:
"SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
DATE is a Hive UDF. It means you have to use HiveContext, not a standard SQLContext - this is the source of your second problem.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // where sc is a SparkContext
In Spark >= 1.5 it is also possible to use to_date function:
import org.apache.spark.sql.functions.{lit, to_date}
df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))
Please try the Hive function CAST(expression AS toDatatype)
It changes an expression from one datatype to another.
e.g. CAST('2016-06-17 00.00.000' AS DATE) will convert a String to a Date
In your case
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE CAST(l.shipdate AS DATE) <= CAST('1998-12-01 00:00:00' AS DATE)")
Supported datatype conversions are as listed in Hive Casting Dates
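As a side note, on modern Spark (2.0 or later, where SparkSession has replaced HiveContext) the same CAST-based filter works out of the box; a PySpark sketch, assuming the lineitem temporary view is registered:

res = spark.sql(
    "SELECT * FROM lineitem l "
    "WHERE CAST(l.shipdate AS DATE) <= CAST('1998-12-01' AS DATE)"
)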