SparkSQL cannot run a simple SQL query

I am working with a simple SparkSQL query:
SELECT
*,
(DATE + DURATION) AS EXPIRY_MONTH
FROM
loan
WHERE
EXPIRY_MONTH >= 12
where the first 10 lines of the loan table are the following:
"loan_id";"account_id";"date";"amount";"duration";"payments";"status"
5314;1787;930705;96396;12;8033.00;"B"
5316;1801;930711;165960;36;4610.00;"A"
6863;9188;930728;127080;60;2118.00;"A"
5325;1843;930803;105804;36;2939.00;"A"
7240;11013;930906;274740;60;4579.00;"A"
6687;8261;930913;87840;24;3660.00;"A"
7284;11265;930915;52788;12;4399.00;"A"
6111;5428;930924;174744;24;7281.00;"B"
7235;10973;931013;154416;48;3217.00;"A"
This query works as intended with SQLite (meaning that the column EXPIRY_MONTH is added and the data are filtered on the condition EXPIRY_MONTH >= 12), but not with SparkSQL (Spark 3.1.0).
Specifically, the SparkSQL engine throws an error because the EXPIRY_MONTH column does not exist.
How can I fix this query without resorting to subqueries?
What is the reason for this behaviour, and for the difference between SparkSQL and more standard SQL?

You are not able to run this query because the WHERE clause is logically evaluated before the SELECT list, so the alias EXPIRY_MONTH does not exist yet when the filter runs; SQLite resolving the alias there is a non-standard convenience.
What you can do is repeat the same expression you use to create the column inside the WHERE clause, which lets you run the query without using a subquery.
SELECT
*,
(DATE + DURATION) AS EXPIRY_MONTH
FROM
loan
WHERE
(DATE + DURATION) >= 12
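The difference is easy to reproduce: SQLite resolves SELECT-list aliases inside WHERE as a non-standard extension, while engines that follow the standard evaluation order (SparkSQL among them) require the expression to be repeated. A minimal sketch using Python's built-in sqlite3, with a cut-down loan table and a threshold chosen so it actually filters the toy rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loan (loan_id INT, date INT, duration INT)")
con.executemany("INSERT INTO loan VALUES (?, ?, ?)",
                [(5314, 930705, 12), (5316, 930711, 36)])

# SQLite resolves the SELECT alias inside WHERE -- a non-standard extension:
alias_rows = con.execute("""
    SELECT *, (date + duration) AS expiry_month
    FROM loan
    WHERE expiry_month >= 930718
""").fetchall()

# Portable form (what SparkSQL requires): repeat the expression in WHERE.
portable_rows = con.execute("""
    SELECT *, (date + duration) AS expiry_month
    FROM loan
    WHERE (date + duration) >= 930718
""").fetchall()

assert alias_rows == portable_rows  # same result, only one form is portable
```

Both statements return the single row whose computed expiry_month (930747) clears the threshold; only the second form runs on SparkSQL.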

Using date functions in where clause of SQL

My question is regarding particular use case of SQL.
I am trying out Querybook and Presto-style SQL for a work project.
I found that Querybook follows a different dialect of SQL, namely Presto.
I want to write a SQL query which can:
give all entries from a table where the created_at_unix_timestamp > unix_timestamp_of_past_hour
There are some functions like FROM_UNIXTIME and TO_UNIXTIME that I'm experimenting with, but all of the examples I've found use them in the first part of the query, the SELECT list, where we describe the fields we want from a table.
Is it possible to write a query like this?
select *
from table
where to_unixtime(field_in_table) > some_unix_timestamp_value_calculated_at_runtime
I am not able to find any documentation around this.
An update:
when I try this, it gives an error
Final Update:
It's working with this syntax:
select * from events.table
where event_name = 'some_event_name'
  and consumer_timestamp > (CAST(to_unixtime(CAST(LOCALTIMESTAMP AS timestamp)) AS BIGINT) - 3590)
order by consumer_timestamp DESC
limit 10
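Underneath the Presto-specific functions, the working query just compares a stored unix timestamp against a threshold computed at run time, which is plain integer arithmetic. A small sketch of the same pattern using Python's sqlite3, with hypothetical table and column names mirroring the question:

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_name TEXT, consumer_timestamp INT)")
now = int(time.time())
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("some_event_name", now - 10),     # within the past hour
                 ("some_event_name", now - 7200)])  # two hours old

# Equivalent of: consumer_timestamp > to_unixtime(localtimestamp) - 3600
recent = con.execute(
    "SELECT * FROM events WHERE consumer_timestamp > ?", (now - 3600,)
).fetchall()
```

Only the row from the past hour survives the filter; the two-hour-old row is excluded.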

Optimization on large tables

I have the following query that joins two large tables. I am trying to join on patient_id, restricted to pairs of records whose dates are no more than 30 days apart.
select * from
chairs c
join data id
on c.patient_id = id.patient_id
and to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') >= 0
and to_date (c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') < 30
Currently, this query takes 2 hours to run. What indexes can I create on these tables for this query to run faster?
I will take a shot in the dark, because, as others said, it depends on the table structure, the indices, and the output of the planner.
The most obvious thing here is that, as long as it is possible, you want to represent dates as a date datatype instead of strings. That is the first and most important change you should make here. No index can save you if you keep transforming strings, because very likely the problem is not the patient_id, it's your date calculation.
Other than that, forcing hash joins on the patient_id and then doing the filtering could help if for some reason the planner decided to use nested loops for that condition. But that is for after you have fixed your date representation AND you still have a problem AND you see that the planner uses nested loops on that attribute.
Some observations if you are stuck with string fields for the dates:
YYYYMMDD date strings are ordered and can be used for <,> and =.
Building comparison strings from the data in chairs to JOIN against data will make good use of an index on data such as (patient_id, from_date).
So my suggestion would be to write expressions that build the date strings you want to use in the JOIN. Or to put it another way: do not transform the child table data from a string to something else.
Example expression that takes 30 days off a string date and returns a string date:
select to_char(to_date('20200112', 'YYYYMMDD') - INTERVAL '30 DAYS','YYYYMMDD')
Untested:
select * from
chairs c
join data id
on c.patient_id = id.patient_id
and id.from_date between to_char(to_date(c.from_date, 'YYYYMMDD') - INTERVAL '30 DAYS','YYYYMMDD')
and c.from_date
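The two properties this rewrite relies on (that YYYYMMDD strings compare in the same order as the dates they encode, and that the 30-day offset can be computed as a string) can be checked directly. A sketch using Python's datetime, mimicking the to_char/to_date expression above:

```python
from datetime import date, timedelta

def minus_days(yyyymmdd: str, days: int) -> str:
    # Equivalent of: to_char(to_date(s, 'YYYYMMDD') - INTERVAL 'n DAYS', 'YYYYMMDD')
    d = date(int(yyyymmdd[:4]), int(yyyymmdd[4:6]), int(yyyymmdd[6:]))
    return (d - timedelta(days=days)).strftime("%Y%m%d")

# 30 days before 2020-01-12 is 2019-12-13, as a string:
assert minus_days("20200112", 30) == "20191213"

# Lexicographic order on YYYYMMDD strings matches date order:
assert ("20191213" < "20200112") == (date(2019, 12, 13) < date(2020, 1, 12))
```

Because the string order matches the date order, a BETWEEN on the raw child-table strings can use an index, which is the whole point of the rewrite.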
For this query:
select *
from chairs c
join data id
  on c.patient_id = id.patient_id and
     to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') >= 0 and
     to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') < 30;
You should start with indexes on (patient_id, from_date) -- you can put them in both tables.
The date comparisons are problematic. Storing the values as actual dates can help. But it is not a 100% solution because comparison operations are still needed.
Depending on what you are actually trying to accomplish there might be other ways of writing the query. I might encourage you to ask a new question, providing sample data, desired results, and a clear explanation of what you really want. For instance, this query is likely to return a lot of rows. And that just takes time as well.
Your query has a non-sargable predicate because it applies functions to the columns, which must be evaluated row by row. You need to discard such functions and replace them with direct access to the columns. As an example:
SELECT *
FROM chairs AS c
JOIN data AS id
ON c.patient_id = id.patient_id
AND c.from_date >= id.from_date AND c.from_date < id.from_date + INTERVAL '30 days'
It will run faster with these two indexes:
CREATE INDEX X_SQLpro_001 ON chairs (patient_id, from_date);
CREATE INDEX X_SQLpro_002 ON data (patient_id, from_date);
Also, try to avoid SELECT * and list only the necessary columns.
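The sargability point can be observed directly in a query planner. A sketch using SQLite's EXPLAIN QUERY PLAN (the index name and table layout are illustrative, mirroring the suggestion above, not the asker's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (patient_id INT, from_date TEXT)")
con.execute("CREATE INDEX x_data ON data (patient_id, from_date)")

# Sargable: bare indexed columns compared to constants -> index search.
good = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM data WHERE patient_id = 1 AND from_date >= '20200101'
""").fetchall()

# Non-sargable: a function wrapped around the column -> full table scan.
bad = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM data WHERE CAST(patient_id AS TEXT) = '1'
""").fetchall()

# The plan detail string is the last column of each row:
assert any("INDEX x_data" in row[-1] for row in good)
assert all("INDEX" not in row[-1] for row in bad)
```

The same principle is what the answers above are driving at: keep the column bare on one side of the comparison and precompute everything else.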

Sybase IQ - Pull the last 30 days of data where the most recent date is not today

Objective: Pull the last 30 days of data from a table that has a variable end date
Background: I have a table that contains purchase information, but this table is only updated every two weeks, so there's a lag in the data. Some days it can be 14 days behind, others 13 or 15.
My table contains a DATE_KEY column which joins to the DATE_DIM table on this key, and that is where I pull my date field from. I would use GETDATE or CURRENT_DATE, but this is not appropriate in my case due to the lag.
I am using Sybase IQ, and I believe I can't use a select statement in the WHERE clause to compare dates; I got the following error:
Feature, scalar value subquery (at line 63) outside of a top level SELECT list, is not supported.
This is what I was trying to do
WHERE
TIME.[DAY] >= DATEADD(dd,-30,( SELECT
MAX([TIME1].[DAY])
FROM DB.DATE_DIM TIME1
JOIN DB.PURCHASES PURC
ON TIME1.KEY = PURC.KEY))
How can I pull the most recent 30 days of data given the constraints above?
According to the Sybase IQ documentation, you can use a comparison to a subquery, so you could add a join to DATE_DIM in the main FROM clause and then compare it to a subquery similar to yours, just with the DATEADD moved inside it. In the following code, I assume the alias for DATE_DIM in the main FROM clause is TIME0.
WHERE
TIME0.[DAY] >= (SELECT DATEADD(dd,-30, MAX([TIME1].[DAY]))
FROM DB.DATE_DIM TIME1
JOIN DB.PURCHASES PURC
ON TIME1.KEY = PURC.KEY
)
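The working shape, a top-level comparison against a scalar subquery that contains the whole DATEADD expression, is not Sybase-specific. A sketch of the same pattern in Python's sqlite3, with a toy purchases table standing in for the DATE_DIM/PURCHASES join and SQLite's date(..., '-30 days') standing in for DATEADD:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (day TEXT)")
con.executemany("INSERT INTO purchases VALUES (?)",
                [("2024-01-01",), ("2024-02-10",), ("2024-02-20",)])

# Equivalent of: day >= (SELECT DATEADD(dd, -30, MAX(day)) ...)
within_30 = con.execute("""
    SELECT day FROM purchases
    WHERE day >= (SELECT date(MAX(day), '-30 days') FROM purchases)
    ORDER BY day
""").fetchall()
```

The latest row is 2024-02-20, so the subquery yields 2024-01-21 and only the two February rows survive, regardless of what today's date is.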

WHERE condition on new created Column in Impala

In my table, I have time information in UNIX time, which I have converted to the proper time format using the following expression in Impala:
cast(ts DIV 1000 as TIMESTAMP) as NewTime
Now I want to apply a WHERE clause to the newly created column "NewTime" to select the data from a particular time period, but I am getting the following error:
"Could not resolve column/field reference: NewTime".
How can I apply a WHERE clause to the newly created column in Impala?
Thanks.
You can calculate it in an inner subquery and then use it for filtering.
select NewTime
from
(select cast(ts DIV 1000 as TIMESTAMP) as NewTime, ... from table) subq
where
subq.NewTime > now()
You can also use a CTE, as Gordon said.
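The CTE form works the same way as the derived table: the expression gets a name in one scope, and the outer WHERE filters on that name. A sketch in Python's sqlite3 (SQLite's syntax for the millisecond-to-timestamp conversion differs from Impala's, but the shape is the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (ts INTEGER)")  # unix time in milliseconds
con.executemany("INSERT INTO t VALUES (?)",
                [(1700000000000,), (1000,)])

# Compute the column once in a CTE, then filter on it by name:
rows = con.execute("""
    WITH sub AS (SELECT datetime(ts / 1000, 'unixepoch') AS NewTime FROM t)
    SELECT NewTime FROM sub WHERE NewTime > '2001-01-01'
""").fetchall()
```

Only the 2023 row passes the filter; the 1970 row is dropped, and NewTime is visible to WHERE because it was named one scope further in.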

Use DataStudio to specify the date range for a custom query in BigQuery, where the date range influences operators in the query

I currently have a DataStudio dashboard connected to a BigQuery custom query.
That BQ query has a hardcoded date range and the status of one of the columns (New_or_Relicensed) can change dynamically for a row, based on the dates specified in the range. I would like to be able to alter that range from DataStudio.
I have tried:
simply connecting the DS dashboard to the custom query in BQ and then introducing a date range filter, but as you can imagine, that does not work because it operates on an already hard-coded date range.
reviewing similar answers, but their problem doesn't appear to be quite the same, e.g. BigQuery Data Studio Custom Query.
Here is the query I have in BQ:
SELECT t0.New_Or_Relicensed, t0.Title_Category FROM (WITH
report_range AS
(
SELECT
TIMESTAMP '2019-06-24 00:00:00' AS start_date,
TIMESTAMP '2019-06-30 00:00:00' AS end_date
)
SELECT
schedules.schedule_entry_id AS Schedule_Entry_ID,
schedules.schedule_entry_starts_at AS Put_Up,
schedules.schedule_entry_ends_at AS Take_Down,
schedule_entries_metadata.contract AS Schedule_Entry_Contract,
schedules.platform_id AS Platform_ID,
platforms.platform_name AS Platform_Name,
titles_metadata.title_id AS Title_ID,
titles_metadata.name AS Title_Name,
titles_metadata.category AS Title_Category,
IF (other_schedules.schedule_entry_id IS NULL, "new", "relicensed") AS New_Or_Relicensed
FROM
report_range, client.schedule_entries AS schedules
JOIN client.schedule_entries_metadata
ON schedule_entries_metadata.schedule_entry_id = schedules.schedule_entry_id
JOIN
client.platforms
ON schedules.platform_id = platforms.platform_id
JOIN
client.titles_metadata
ON schedules.title_id = titles_metadata.title_id
LEFT OUTER JOIN
client.schedule_entries AS other_schedules
ON schedules.platform_id = other_schedules.platform_id
AND other_schedules.schedule_entry_ends_at < report_range.start_date
AND schedules.title_id = other_schedules.title_id
WHERE
((schedules.schedule_entry_starts_at >= report_range.start_date AND
schedules.schedule_entry_starts_at <= report_range.end_date) OR
(schedules.schedule_entry_ends_at >= report_range.start_date AND
schedules.schedule_entry_ends_at <= report_range.end_date))
) AS t0 LIMIT 100;
Essentially, I would like to be able to set the start_date and end_date from Google Data Studio, and have those dates incorporated into the report_range that then influences the operations in the rest of the query (which assign a schedule entry as new or relicensed).
Have you looked at using the Custom Query interface of the BigQuery connector in Data Studio to define start_date and end_date as parameters as part of a filter?
Your query would need a little re-work...
A custom query can use the @DS_START_DATE and @DS_END_DATE parameters as part of a filter on the creation date column of a table. The records produced by the query will then be limited to the date range selected by the report user, reducing the number of records returned and resulting in a faster query.
Resources:
Introducing BigQuery parameters in Data Studio
https://www.blog.google/products/marketingplatform/analytics/introducing-bigquery-parameters-data-studio/
Running parameterized queries
https://cloud.google.com/bigquery/docs/parameterized-queries
I had a similar issue where I wanted to incorporate a 30-day look-back before the start (@DS_START_DATE). In this case I was using Google Analytics UA session data and a table suffix in my WHERE clause. I was able to calculate a date RELATIVE to the built-in Data Studio "string" dates by using the following:
...
WHERE
_table_suffix BETWEEN
CAST(FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', @DS_START_DATE), INTERVAL 30 DAY)) AS STRING)
AND
CAST(FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', @DS_END_DATE), INTERVAL 0 DAY)) AS STRING)
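The look-back arithmetic above (parse the %Y%m%d parameter, subtract 30 days, format back to %Y%m%d for the suffix comparison) can be sanity-checked outside BigQuery. A sketch in Python, with '20190624' standing in for the value of the DS_START_DATE parameter:

```python
from datetime import datetime, timedelta

def lookback(ds_date: str, days: int) -> str:
    # PARSE_DATE('%Y%m%d', ...) -> DATE_SUB(..., INTERVAL n DAY) -> FORMAT_DATE
    d = datetime.strptime(ds_date, "%Y%m%d")
    return (d - timedelta(days=days)).strftime("%Y%m%d")

# 30 days before the report start, as a table-suffix string:
assert lookback("20190624", 30) == "20190525"
```

Since _table_suffix values and the formatted dates are both %Y%m%d strings, the BETWEEN in the query compares them in correct chronological order.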