I am using Spark SQL 2.4.3 and I am trying to create a time series of month dates like:
2000-01-01, 2000-02-01, 2000-03-01, etc.
with the date on the first of every month, using the following code:
scala> spark.sql("select sequence(to_date('2000-01-01'), to_date('2001-01-01'), interval 1 month) as interval_date").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|interval_date |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[2000-01-01, 2000-02-01, 2000-03-01, 2000-03-31, 2000-04-30, 2000-05-31, 2000-06-30, 2000-07-31, 2000-08-31, 2000-09-30, 2000-11-01, 2000-12-01, 2001-01-01]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
but as you can see I get strange dates like:
2000-03-31, 2000-04-30, 2000-05-31, 2000-06-30, etc
which is quite unexpected. Any idea how to fix the problem?
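One possible workaround (a sketch, not necessarily the only fix) is to avoid the month interval entirely: generate an integer sequence and shift the start date with add_months through the transform higher-order function that Spark 2.4 supports. Run through spark.sql the same way as above, this yields the thirteen first-of-month dates from 2000-01-01 to 2001-01-01:
select transform(sequence(0, 12), m -> add_months(to_date('2000-01-01'), m)) as interval_date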
Related
I have a table partitioned by year, month and day. I am trying to find an optimized way to read the data for the last n days using parameters. The only way I can do this at the moment is by specifying each combination of year, month, and day individually, which is very problematic if we have to read a lot of data, say for 1 month.
Below is a sample example.
select count(*) from table
where (year = 2021 and month = 7 and day = 5)
or (year = 2021 and month = 7 and day = 4)
or (year = 2021 and month = 7 and day = 3)
I am interested in knowing the following.
Can I use CASE WHEN in the WHERE clause without impacting performance? For example, will the query below read the same amount of data as the one above?
select count(*) from table
where year = 2021 and month = 7 and (case when day between 4 and 7 then 1 else 0 end) = 1
How does partitioning work behind the scenes? I believe the query gets converted into a map-reduce job before execution. Will both of the queries mentioned above be converted to the same map-reduce job?
Can I use functions like CASE WHEN freely with partitioned columns in the WHERE clause, and will the Hive query engine be able to interpret the function and scan only the appropriate partitions?
Is there any built-in function in Hive to know which partitions are being hit by the query? If not, is there any workaround? Is there any way to know the same in Presto?
Partition pruning works fine with queries like this; the logic is the same as in your CASE expression:
where concat(year, '-', month, '-', day) >= '2021-07-04'
  and concat(year, '-', month, '-', day) <= '2021-07-07'
See this answer.
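Note that if the month and day partition values are stored without leading zeros (7 rather than 07), the plain string comparison can misfire; a zero-padded variant (just a sketch, assuming integer-typed partition columns) would be:
where concat(year, '-', lpad(month, 2, '0'), '-', lpad(day, 2, '0')) >= '2021-07-04'
  and concat(year, '-', lpad(month, 2, '0'), '-', lpad(day, 2, '0')) <= '2021-07-07'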
How to check how partition pruning works: use EXPLAIN DEPENDENCY or EXPLAIN EXTENDED. See this answer.
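For example (a sketch; the exact output format depends on the Hive version), EXPLAIN DEPENDENCY prints JSON that lists the partitions the query will actually read:
explain dependency
select count(*) from table
where concat(year, '-', lpad(month, 2, '0'), '-', lpad(day, 2, '0')) >= '2021-07-04'
  and concat(year, '-', lpad(month, 2, '0'), '-', lpad(day, 2, '0')) <= '2021-07-07';
-- inspect the input_partitions list in the output: only the pruned-in partitions should appear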
In my DB I have dates that I can filter like this:
select *
where
a.y=2021 and a.m=2 and a.d=7
However, if I run this query tomorrow I'll have to go there and change it manually.
Is there a way to do this automatically, so that if I run the query tomorrow I'll get d=8, the day after d=9, and so on?
I tried to use getdate but I get the following error:
SQL Error [6]: Query failed (#20210207_153809_06316_2g4as): line 2:7: Function 'getdate' not registered
I also don't know if that is the right solution. Does anybody know how to fix that?
You can use NOW() to get the current date and time, and YEAR, MONTH, and DAY to extract the parts of the date:
SELECT *
FROM your_table a
WHERE a.y = YEAR(NOW()) AND a.m = MONTH(NOW()) AND a.d = DAY(NOW())
The best solution is to have a date column in your data. Then you can just use:
where datecol = current_date
Or whatever your particular database uses for the current date.
Absent that, you have to split the current date into parts. In Standard SQL, this looks like:
where y = extract(year from current_date) and
m = extract(month from current_date) and
d = extract(day from current_date)
That said, date functions notoriously vary among databases, so the exact syntax depends on your database.
For instance, a common way to write this in SQL Server would be:
where y = year(getdate()) and
m = month(getdate()) and
d = day(getdate())
I want to add any number of days to a given date; for example, I want to add a day to today's date.
I have one dataframe like this:
------------
| date |
------------
|2020-10-01|
------------
I would like to get a dataframe like this:
------------
| date |
------------
|2020-10-02|
------------
The real code is embedded in a complex SQL query, so a valid solution must use only SQL statements.
I have tried this code, which tries to get the day after today, but it does not work because of the difference between the date and int types. I think I am looking for something similar to Python's timedelta, but in pyspark-sql:
spark.sql(f"SELECT to_date(now()) + 1")
The error:
cannot resolve '(to_date(current_timestamp()) + 1)' due to data type mismatch: differing types in '(to_date(current_timestamp()) + 1)' (date and int)
After a while of searching I found a function that solves the problem:
spark.sql("SELECT date_add(to_date(now()),1)").show()
Documentation:
date_add(Column start, int days)
Returns the date that is days days after start
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
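Applied to the example dataframe above (a sketch, assuming the dataframe has been registered as a temporary view named df_view with createOrReplaceTempView), the same function shifts the column value instead of now():
SELECT date_add(date, 1) AS date  -- adds one day to each value in the date column
FROM df_view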
What is the syntax of create a table with interval data type in Hive? I tried something like:
CREATE TABLE t1 (c1 interval year to month);
But it doesn't work, and I can't find any documentation from Apache Hive.
So far I haven't found a way to do it directly, and I'm going to check with some of the Hive developers to see if this is a bug. The actual data types are interval_day_time and interval_year_month, as shown in the workaround below. This still doesn't solve the problem of how to create the table with these types directly.
create table test_interval
as
select interval '1' day as day_interval,
interval '1' month as month_interval;
describe test_interval;
+-----------------+----------------------+----------+--+
| col_name | data_type | comment |
+-----------------+----------------------+----------+--+
| day_interval | interval_day_time | |
| month_interval | interval_year_month | |
+-----------------+----------------------+----------+--+
2 rows selected (0.048 seconds)
The interval types (YEAR TO MONTH and DAY TIME) are supported only in query expressions and predicates. Interval types are not supported as column data types in tables.
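For example (a sketch with a hypothetical events table and event_ts column), an interval literal can still be used in a predicate even though it cannot be a column type:
select *
from events                                             -- hypothetical table
where event_ts > current_timestamp - interval '7' day   -- interval used in an expression/predicate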
Not sure about pure Spark, but in Databricks, as of 2022, I can specify INTERVAL DAY or INTERVAL MONTH in CREATE TABLE. Other options such as MINUTE, YEAR, etc, work, too, though they converge to one of those two.
This is not well documented, though, and the error message you get when you specify just INTERVAL on its own is misleading.
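Based on the behaviour described above (a sketch, not verified against every Databricks runtime), a table definition along these lines is accepted there:
CREATE TABLE t1 (
  c1 INTERVAL DAY,    -- stored as a day-time interval
  c2 INTERVAL MONTH   -- stored as a year-month interval
);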
I have a column (ROW_UPDATE_TIME) in a table that stores the timestamp of when an update happens in this table.
I'd like to know how to find the rows whose timestamp is today.
This is what I'm using now, but I don't think it's a pretty solution:
SELECT
*
FROM
TABLE
WHERE
ROW_UPDATE_TIME BETWEEN (CURRENT TIMESTAMP - 1 DAY) AND (CURRENT TIMESTAMP + 1 DAY);
Is there a better solution, for example ROW_UPDATE_TIME = CURRENT DATE, or something like that?
Found it:
SELECT
*
FROM
TABLE
WHERE
DATE(ROW_UPDATE_TIME) = CURRENT DATE;
The first version you provided will not return the results you expect, because the result will also include timestamps from yesterday or tomorrow, depending on the hour you run it.
Use the query below to get the results from today:
SELECT
*
FROM
table
WHERE
row_update_time
BETWEEN TIMESTAMP(CURRENT_DATE,'00:00:00')
AND TIMESTAMP(CURRENT_DATE,'23:59:59')
Avoid applying a function to a column you compare in the WHERE clause (DATE(row_update_time) = CURRENT_DATE). That will cause the optimizer to run the function against each row just to locate the data you need, which could slow down the query dramatically. Try running EXPLAIN against the two versions and you will see what I mean.
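Along the same lines (a sketch reusing the TIMESTAMP(date, time) construction from the query above), a half-open range also keeps the column untouched and additionally covers fractional seconds after 23:59:59:
SELECT
*
FROM
table
WHERE
row_update_time >= TIMESTAMP(CURRENT_DATE, '00:00:00')              -- start of today
AND row_update_time < TIMESTAMP(CURRENT_DATE + 1 DAY, '00:00:00')   -- start of tomorrow (exclusive)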