I am building an SQL select statement in Kotlin and I can create a simple select statement as:
import org.jooq.*
import org.jooq.impl.DSL
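import java.text.SimpleDateFormat
import java.time.Instant
import java.util.Date
import java.util.TimeZone
// "MM" is the month; lowercase "mm" would be read as minutes
val format = SimpleDateFormat("yyyy-MM-dd")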
format.timeZone = TimeZone.getTimeZone("UTC")
val defaultValueLower = format.parse("1970-01-01")
val defaultValueUpper = Date.from(Instant.now())
val create = DSL.using(SQLDialect.POSTGRES)
val query: Query = create.select()
.from(DSL.table("datasets"))
.orderBy(DSL.field("id").desc())
.limit(DSL.inline(100))
I am using jOOQ.
Now I want to add a where clause that restricts the timestamp column to a date range. I have the corresponding Date objects (defaultValueLower and defaultValueUpper above), but I am not sure how to use them within this query builder.
I have a DataFrame and I want to dynamically pass the column names through widgets into a select statement in my Databricks notebook. How can I do it?
I am using the below code
df1 = spark.sql("select * from tableraw")
where df1 has columns "tablename" and "layer"
df = df1.select("tablename", "layer")
Now, our requirement is to use the values of the widgets to select those columns, something like:
df = df1.select(dbutils.widgets.get("tablename"), dbutils.widgets.get("datalayer"))
Python / Scala
Create Widgets
%python
dbutils.widgets.text(name = "pythonTextWidget", defaultValue = "columnName")
dbutils.widgets.dropdown(name = "pythonDropdownWidget", defaultValue = "col1", choices = ["col1", "col2", "col3"])
%scala
dbutils.widgets.text("scalaTextWidget", "columnName")
dbutils.widgets.dropdown("scalaDropdownWidget", "col1", Seq("col1", "col2", "col3"))
Extract Value from Widgets
%python
textColumn = dbutils.widgets.get("pythonTextWidget")
dropdownColumn = dbutils.widgets.get("pythonDropdownWidget")
%scala
val textColumn = dbutils.widgets.get("scalaTextWidget")
val dropdownColumn = dbutils.widgets.get("scalaDropdownWidget")
Use Value to select Column
%python
from pyspark.sql.functions import col
df.select(col(textColumn), col(dropdownColumn))
%scala
import org.apache.spark.sql.functions.col
df.select(col(textColumn), col(dropdownColumn))
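A text widget accepts arbitrary input, so it can be worth checking the value against the DataFrame's columns before passing it to select. The following is a minimal sketch in plain Python, assuming a hypothetical helper called resolve_columns (it is not part of Databricks or PySpark) and a made-up third column loaded_at. In a notebook, the two arguments would come from dbutils.widgets.get(...) and df1.columns.

```python
def resolve_columns(requested, available):
    """Return the requested column names in order, or raise if any is unknown.

    `requested` stands in for values read with dbutils.widgets.get(...);
    `available` stands in for df1.columns. Both are plain lists of strings here.
    """
    cleaned = [name.strip() for name in requested]
    unknown = [name for name in cleaned if name not in available]
    if unknown:
        raise ValueError(f"Unknown column(s): {unknown}; available: {list(available)}")
    return cleaned


# Stand-ins for the widget values and for df1.columns from the question.
widget_values = ["tablename", " layer "]
columns = ["tablename", "layer", "loaded_at"]  # "loaded_at" is a hypothetical extra column

selected = resolve_columns(widget_values, columns)
print(selected)
```

In the notebook, the validated names could then be passed on as df1.select(*selected).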
SQL
Widgets in SQL work slightly differently from those in Python/Scala in the sense that you cannot use them to select a column. However, widgets can be used to dynamically adjust filters.
Create Widget
%sql CREATE WIDGET text sqlTextWidget DEFAULT "ACTIVE"
%sql CREATE WIDGET DROPDOWN sqlDropdownWidget DEFAULT "ACTIVE" CHOICES SELECT DISTINCT Status FROM <databaseName>.<tableName> WHERE Status IS NOT NULL
Apply Widget Value to Filter Statement
%sql SELECT * FROM <databaseName>.<tableName> WHERE Status = getArgument("sqlTextWidget")
More background can be found in the Databricks documentation on widgets.
I am trying to use the pandas.read_sql function.
I created a ClickHouse table and filled it:
create table regions
(
date DateTime Default now(),
region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;
insert into regions (region) values ('Asia'), ('Europe')
Then the Python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:@localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As a result I expected to get a DataFrame with columns date and region, but all I get is an empty DataFrame:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
UPD: It turned out that specifying clickhouse+native solves the problem.
Can it be solved without +native?
There is an old issue: https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. There is also a hint that suggests appending FORMAT TabSeparatedWithNamesAndTypes to the query, so the initial query would look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes
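With this format, ClickHouse puts a line of column names and a line of column types before the data rows. That presumably explains the empty DataFrame above, where the first data row ended up as the column header. The following standard-library sketch shows how such a payload splits apart. The sample text is made up to match the regions table, and parse_tsv_with_names_and_types is a hypothetical helper, not part of clickhouse-sqlalchemy.

```python
import csv
import io


def parse_tsv_with_names_and_types(payload):
    """Split a TabSeparatedWithNamesAndTypes payload into names, types and rows.

    Values are returned as raw strings; ClickHouse backslash escapes inside
    values are not decoded in this sketch.
    """
    reader = csv.reader(io.StringIO(payload, newline=""),
                        delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = list(reader)
    if len(lines) < 2:
        raise ValueError("expected a names line and a types line")
    return lines[0], lines[1], lines[2:]


# Made-up response body for "select * from regions FORMAT TabSeparatedWithNamesAndTypes".
sample = (
    "date\tregion\n"
    "DateTime\tString\n"
    "2021-01-08 09:24:33\tAsia\n"
    "2021-01-08 09:24:33\tEurope\n"
)

names, types, rows = parse_tsv_with_names_and_types(sample)
print(names, types, rows)
```

The names and rows could then be handed to pandas, for example pd.DataFrame(rows, columns=names).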
I'm trying to create a table in Spark (Scala) and then insert values from two existing DataFrames, but I got this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable `stat_type_predicate_percentage`, ErrorIfExists
Here is the code:
case class stat_type_predicate_percentage (type1: Option[String], predicate: Option[String], outin: Option[Int], percentage: Option[Float])
object LoadFiles1 {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "LoadFiles1")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
//statistics
val create = spark.sql("CREATE TABLE stat_type_predicate_percentage (type1 String, predicate String, outin INT, percentage FLOAT) USING hive")
val insert1 = spark.sql("INSERT INTO stat_type_predicate_percentage SELECT types.type, res.predicate, 0, 1.0*COUNT(subject)/(SELECT COUNT(subject) FROM MappingBasedProperties AS resinner WHERE res.predicate = resinner.predicate) FROM MappingBasedProperties AS res, MappingBasedTypes AS types WHERE res.subject = types.resource GROUP BY res.predicate,types.type")
val select = spark.sql("SELECT * from stat_type_predicate_percentage" )
}
}
How should I solve it?
You have to enable Hive support in your SparkSession:
val spark = SparkSession
.builder()
.appName("JOB2")
.master("local")
.enableHiveSupport()
.getOrCreate()
This problem may be twofold.
For one, you might want to do what @Tanjin suggested in the comments, and it might work afterwards: try adding .config("spark.sql.catalogImplementation", "hive") to your SparkSession.builder.
But if you actually want to use an existing Hive instance with its own metadata, which you can then query from outside your job, or if you want to use tables that already exist there, you should add hive-site.xml to your configuration.
This configuration file contains properties you probably want, such as hive.metastore.uris, which lets your context add a new table that is saved in the metastore. It also lets the job read tables from your Hive instance, because the metastore records the tables and their locations.
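As a rough sketch, the same properties can also be set on the session builder instead of through hive-site.xml. It is written in PySpark here; the Scala builder offers the same appName, config, enableHiveSupport and getOrCreate methods. The helper names, the metastore URI and the warehouse path are hypothetical placeholders, and setting hive.metastore.uris directly on the builder is an assumption that may need adjusting for a given deployment.

```python
def hive_session_options(metastore_uri, warehouse_dir):
    """Collect the settings discussed above into a plain dict."""
    return {
        "spark.sql.catalogImplementation": "hive",
        "hive.metastore.uris": metastore_uri,
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(app_name, options):
    """Create a Hive-enabled SparkSession from the given options.

    pyspark is imported lazily so the module loads without Spark installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Placeholder values; replace with the URI and path of the real Hive instance.
options = hive_session_options("thrift://localhost:9083", "/tmp/spark-warehouse")
print(options)
```

Calling build_session("LoadFiles1", options) would then return the session to use for the CREATE TABLE and INSERT statements.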
I'm trying to write an SCollection to a partition in BigQuery using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.
How can I write to a partition? I don't see any options to specify partitions via either saveAsTypedBigQuery method so I was trying the Legacy SQL table decorators.
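See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to create the table manually. BigQueryIO cannot create a table and partition it.
Additionally, the "table decorators cannot be used" part of the message was a red herring. The alphanumeric requirement was what I was missing: ISO_LOCAL_DATE produces 2017-06-21, with hyphens, whereas BASIC_ISO_DATE produces 20170621.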
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)
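The difference between the two formatters is easy to see in isolation. A small illustration in Python, assuming a hypothetical helper named partition_table_spec; isoformat corresponds to ISO_LOCAL_DATE and the %Y%m%d pattern to BASIC_ISO_DATE:

```python
from datetime import date


def partition_table_spec(table, day):
    """Append a partition decorator of the form $YYYYMMDD to a table spec."""
    return f"{table}${day.strftime('%Y%m%d')}"


day = date.fromisoformat("2017-06-21")

# Hyphenated form, which made the table ID non-alphanumeric.
print("test.test$" + day.isoformat())         # test.test$2017-06-21

# Compact form accepted as a partition decorator.
print(partition_table_spec("test.test", day))  # test.test$20170621
```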
In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I manually built by reading a csv file and converting the columns into the right data type.
Specifically, the table I'm talking about is the LINEITEM table from the [TPC-H specification][1]. Contrary to the specification, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.
In my single scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
When I submit the packaged jar using spark-submit, I get the following error:
Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but `;' found
When I omit the semicolon and do the same thing, I get the following error:
Exception in thread "main" java.util.NoSuchElementException: key not found: date
Spark version is 1.4.0.
Does anyone have an idea what's the problem with these queries?
[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf
SQL queries passed to SQLContext.sql shouldn't be terminated with a semicolon; this is the source of your first problem.
The DATE UDF expects a date in the YYYY-MM-DD form, and DATE('1998-12-01 00:00:00') evaluates to null. As long as the timestamp can be cast to DATE, the correct query string looks like this:
"SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
DATE is a Hive UDF. This means you have to use HiveContext, not a standard SQLContext; this is the source of your second problem.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // where sc is a SparkContext
In Spark >= 1.5 it is also possible to use the to_date function:
import org.apache.spark.sql.functions.{lit, to_date}
df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))
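The two string-level fixes from this answer (no trailing semicolon, and a date literal in YYYY-MM-DD form) can also be applied before the query ever reaches Spark. A minimal sketch in plain Python; to_date_literal and strip_terminator are hypothetical helper names, not Spark functions:

```python
from datetime import datetime


def to_date_literal(timestamp_text):
    """Reduce 'YYYY-MM-DD HH:MM:SS' to the 'YYYY-MM-DD' form the DATE UDF expects."""
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return parsed.date().isoformat()


def strip_terminator(sql):
    """Drop trailing whitespace and semicolons from a single SQL statement."""
    return sql.rstrip().rstrip(";").rstrip()


raw = "SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('{cutoff}');"
query = strip_terminator(raw.format(cutoff=to_date_literal("1998-12-01 00:00:00")))
print(query)
```

The resulting string matches the corrected query shown above and could be passed to sqlContext.sql as is.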
Please try the Hive function CAST (expression AS toDatatype).
It converts an expression from one datatype to another,
e.g. CAST ('2016-06-17 00.00.000' AS DATE) will convert a String to a Date.
In your case:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE CAST(l.shipdate as DATE) <= CAST('1998-12-01 00:00:00' AS DATE)")
Supported datatype conversions are listed in Hive Casting Dates.