SparkSQL errors when using SQL DATE function

In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I manually built by reading a CSV file and converting the columns into the right data type.
Specifically, the table I'm talking about is the LINEITEM table from the [TPC-H specification][1]. Unlike what is stated in the specification, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.
In my single Scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
When I submit the packaged jar using spark-submit, I get the following error:
Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but `;' found
When I omit the semicolon and do the same thing, I get the following error:
Exception in thread "main" java.util.NoSuchElementException: key not found: date
Spark version is 1.4.0.
Does anyone have an idea what's wrong with these queries?
[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf

SQL queries passed to SQLContext.sql shouldn't be delimited with a semicolon; this is the source of your first problem.
The DATE UDF expects a date in the YYYY-MM-DD form, so DATE('1998-12-01 00:00:00') evaluates to null. Since a timestamp can be cast to DATE, a correct query string looks like this:
"SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
DATE is a Hive UDF. It means you have to use a HiveContext, not a standard SQLContext; this is the source of your second problem.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // where sc is a SparkContext
In Spark >= 1.5 it is also possible to use the to_date function:
import org.apache.spark.sql.functions.{lit, to_date}
df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))

Please try the Hive function CAST(expression AS toDatatype).
It converts an expression from one datatype to another.
e.g. CAST('2016-06-17 00.00.000' AS DATE) will convert a String to a Date.
In your case
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE CAST(l.shipdate as DATE) <= CAST('1998-12-01 00:00:00' AS DATE);")
Supported datatype conversions are listed in Hive Casting Dates.

Related

Flink Window Aggregation using TUMBLE failing on TIMESTAMP

We have a table A in the database. We are loading that table into Flink using the Flink SQL JdbcCatalog.
Here is how we are loading the data:
val catalog = new JdbcCatalog("my_catalog", "database_name", username, password, url)
streamTableEnvironment.registerCatalog("my_catalog", catalog)
streamTableEnvironment.useCatalog("my_catalog")
val query = "select timestamp, count from A"
val sourceTable = streamTableEnvironment.sqlQuery(query)
streamTableEnvironment.createTemporaryView("innerTable", sourceTable)
val aggregationQuery = "select window_end, sum(count) from TABLE(TUMBLE(TABLE innerTable, DESCRIPTOR(timestamp), INTERVAL '10' minutes)) group by window_end"
It throws the following error:
Exception in thread "main" org.apache.flink.table.api.ValidationException: SQL validation failed. The window function TUMBLE(TABLE table_name, DESCRIPTOR(timecol), datetime interval[, datetime interval]) requires the timecol is a time attribute type, but is TIMESTAMP(6).
In short, we want to apply a windowing aggregation on an already existing column. How can we do that?
Note: this is batch processing.
Timestamp columns used as time attributes in Flink SQL must be either TIMESTAMP(3) or TIMESTAMP_LTZ(3), and the column must also be marked as ROWTIME.
Add this line to your code
sourceTable.printSchema();
and check the result. The column should be marked as ROWTIME as shown below.
(
`deviceId` STRING,
`dataStart` BIGINT,
`recordCount` INT,
`time_Insert` BIGINT,
`time_Insert_ts` TIMESTAMP(3) *ROWTIME*
)
You can find my sample below.
Table tableCpuDataCalculatedTemp = tableEnv.fromDataStream(streamCPUDataCalculated, Schema.newBuilder()
.column("deviceId", DataTypes.STRING())
.column("dataStart", DataTypes.BIGINT())
.column("recordCount", DataTypes.INT())
.column("time_Insert", DataTypes.BIGINT())
.column("time_Insert_ts", DataTypes.TIMESTAMP(3))
.watermark("time_Insert_ts", "time_Insert_ts")
.build());
The watermark method is what marks the column as ROWTIME.

How to query a database copied from clipboard using Pandas and DuckDB

I'm trying to test a simple SQL query which should do something like this:
import duckdb
import pandas as pd
df_test = pd.read_clipboard()
duckdb.query("SELECT * FROM df_test").df()
That works, but I can't get the following query to work.
select count(df_test) as cnt,
year(a) as yr
from df_test
where d = "outcome1"
group by yr
However, I get this as an exception, presumably from DuckDB.
BinderException: Binder Error: No function matches the given name and argument types 'year(VARCHAR)'. You might need to add explicit type casts.
Candidate functions:
year(TIMESTAMP WITH TIME ZONE) -> BIGINT
year(DATE) -> BIGINT
year(TIMESTAMP) -> BIGINT
year(INTERVAL) -> BIGINT
I was only using Pandas as it seemed to be the easiest way to convert a csv file (using pd.read_clipboard) into DuckDB.
Any ideas?
(I'm using a mac by the way).
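Following the hint in the error message, one fix is to cast the VARCHAR column to DATE (or TIMESTAMP) before calling year. Below is a minimal sketch, assuming the column a holds ISO-formatted date strings and d is the outcome column from the query above (both column names are taken from that query); count(*) replaces count(df_test), and the double quotes around outcome1 are swapped for single quotes, since double quotes denote identifiers in SQL:
import duckdb
import pandas as pd

df_test = pd.read_clipboard()

# cast a to DATE so year() has a matching overload; use single quotes for the
# string literal, because double quotes mean identifiers in SQL
query = """
select count(*) as cnt,
       year(cast(a as date)) as yr
from df_test
where d = 'outcome1'
group by yr
"""
duckdb.query(query).df()
If the strings in a are not in ISO format, a parsing function such as DuckDB's strptime would be needed instead of the plain cast.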

Spark SQL is interpreting a datetime.date object as a mathematical formula or integer in statement

I've encountered a problem in Spark SQL. It is interpreting a datetime.date object as a mathematical formula, or integer, in a SQL statement I am writing.
from datetime import datetime, date

currentDateAndTime = datetime.now()
current_month = currentDateAndTime.strftime("%m")
current_year = currentDateAndTime.strftime("%Y")
first_day_of_month = date(int(current_year), int(current_month), 1)
print(first_day_of_month)
type(first_day_of_month)
and you get:
2022-10-01
datetime.date
Then when I do
df = spark.sql("""
SELECT * FROM table_A
WHERE IncidentCreatedDate < {}
""".format(first_day_of_month))
I get an error that says AnalysisException: cannot resolve '(table_A.IncidentCreatedDate < ((2022 - 10) - 1' due to data type mismatch: differing types in '(tableA.IncidentCreatedDate < ((2022 - 10 - 1))' (date and int).;......
(There might be a typo in everything above because I had to type everything out on another laptop, since the other one is my work laptop and they don't like me sending anything from that laptop to anywhere else.)
PySpark doesn't support prepared statements.
format will replace the placeholder, but string literals must be in single quotes, so simply add them:
df = spark.sql("""
SELECT * FROM table_A
WHERE IncidentCreatedDate < '{}'
""".format(first_day_of_month))

How to resolve this SQL error with schema_of_json

I need to find out the schema of a given JSON file. I see SQL has a schema_of_json function,
and something like this works flawlessly
> SELECT schema_of_json('[{"col":0}]');
ARRAY<STRUCT<`col`: BIGINT>>
But if I run it against a column from my table, I get the following error:
> SELECT schema_of_json(Transaction) as json_data from table_name;
Error in SQL statement: AnalysisException: cannot resolve 'schemaofjson(`Transaction`)' due to data type mismatch: The input json should be a string literal and not null; however, got `Transaction`.; line 1 pos 7;
Transaction is one of the columns in my table, and after checking it manually I can attest that it is of String type (JSON).
The SQL statement is supposed to give me the schema of the JSON; how do I do that?
After looking further into the documentation, it is clear that foldable means a static literal, so JSON coming from a table column won't work.
For a minimal reproducible example, here you go:
SELECT schema_of_json(CAST('{ "a": "b" }' AS STRING))
As soon as the cast is introduced in the above statement, schema_of_json will fail. It needs a static JSON string as its input.
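A common workaround is sketched below in PySpark, assuming a DataFrame df holding the table and that all rows of Transaction share the same JSON structure: pull a single sample value out of the column and pass it to schema_of_json as a literal, which satisfies the foldable requirement, then apply the inferred schema to the whole column with from_json.
from pyspark.sql import functions as F

# grab one non-null JSON value; this is a plain Python string, so lit()
# turns it into the static/foldable input that schema_of_json requires
sample = df.filter(F.col("Transaction").isNotNull()).select("Transaction").first()[0]

# evaluate schema_of_json once against the sample to get a DDL schema string
schema_ddl = spark.range(1).select(F.schema_of_json(F.lit(sample))).first()[0]
print(schema_ddl)

# parse every row of the column using the inferred schema
parsed = df.withColumn("Transaction_parsed", F.from_json("Transaction", schema_ddl))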

Scio saveAsTypedBigQuery write to a partition for SCollection of Typed Big Query case class

I'm trying to write a SCollection to a partition in Big Query using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.
How can I write to a partition? I don't see any options to specify partitions via either saveAsTypedBigQuery method so I was trying the Legacy SQL table decorators.
See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to manually create the table. BQ IO cannot create a table and partition it.
Additionally, the 'no table decorators' part was a complete ruse. It's the alphanumeric part I was missing, since ISO_LOCAL_DATE formats the date with dashes ("2017-06-21") while BASIC_ISO_DATE produces the purely alphanumeric "20170621".
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)