Flink Window Aggregation using TUMBLE failing on TIMESTAMP - sql

We have a table A in a database and load it into Flink through the Flink SQL JdbcCatalog.
Here is how we load the data:
val catalog = new JdbcCatalog("my_catalog", "database_name", username, password, url)
streamTableEnvironment.registerCatalog("my_catalog", catalog)
streamTableEnvironment.useCatalog("my_catalog")
val query = "select timestamp, count from A"
val sourceTable = streamTableEnvironment.sqlQuery(query)
streamTableEnvironment.createTemporaryView("innerTable", sourceTable)
val aggregationQuery = "select window_end, sum(count) from TABLE(TUMBLE(TABLE innerTable, DESCRIPTOR(timestamp), INTERVAL '10' minutes)) group by window_end"
It throws the following error:
Exception in thread "main" org.apache.flink.table.api.ValidationException: SQL validation failed. The window function TUMBLE(TABLE table_name, DESCRIPTOR(timecol), datetime interval[, datetime interval]) requires the timecol is a time attribute type, but is TIMESTAMP(6).
In short, we want to apply a windowed aggregation on an already existing column. How can we do that?
Note: this is a batch job.

Timestamp columns used as time attributes in Flink SQL must be either TIMESTAMP(3) or TIMESTAMP_LTZ(3).

The column must be TIMESTAMP(3) or TIMESTAMP_LTZ(3), and it must also be marked as ROWTIME.
Add this line to your code
sourceTable.printSchema();
and check the result. The column should be marked as ROWTIME as shown below.
(
`deviceId` STRING,
`dataStart` BIGINT,
`recordCount` INT,
`time_Insert` BIGINT,
`time_Insert_ts` TIMESTAMP(3) *ROWTIME*
)
You can find my sample below.
Table tableCpuDataCalculatedTemp = tableEnv.fromDataStream(streamCPUDataCalculated, Schema.newBuilder()
.column("deviceId", DataTypes.STRING())
.column("dataStart", DataTypes.BIGINT())
.column("recordCount", DataTypes.INT())
.column("time_Insert", DataTypes.BIGINT())
.column("time_Insert_ts", DataTypes.TIMESTAMP(3))
.watermark("time_Insert_ts", "time_Insert_ts")
.build());
The watermark method is what marks the column as ROWTIME.
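To make concrete what the TUMBLE query in the question computes once the time column is accepted, here is a minimal plain-Java sketch with no Flink involved. It assumes epoch-aligned windows with zero offset and an exclusive window end; the class and method names are hypothetical, and the rows are made-up sample data.

```java
import java.util.Map;
import java.util.TreeMap;

final class TumbleSketch {
    // End (exclusive) of the fixed-size window containing tsMillis.
    // Assumption: windows are aligned to the epoch with no offset.
    static long windowEnd(long tsMillis, long sizeMillis) {
        long windowStart = tsMillis - Math.floorMod(tsMillis, sizeMillis);
        return windowStart + sizeMillis;
    }

    // Rough equivalent of: SELECT window_end, SUM(count) ... GROUP BY window_end.
    // Each row is {timestampMillis, count}.
    static Map<Long, Long> sumPerWindow(long[][] rows, long sizeMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] row : rows) {
            sums.merge(windowEnd(row[0], sizeMillis), row[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        // Made-up rows at minute 0, 5 and 10.
        long[][] rows = { {0L, 1L}, {5 * 60 * 1000L, 2L}, {10 * 60 * 1000L, 3L} };
        System.out.println(sumPerWindow(rows, tenMinutes));
    }
}
```

The first two rows fall into the same ten-minute window and are summed together, while the row at exactly minute 10 starts the next window.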

BigQuery UPDATE of column within nest from GoogleAnalytics export

I have a website's Google Analytics tracking data exported in BigQuery to a ga_sessions_* table. It's the standard setup of any GA export to BQ I've come across.
For simplicity's sake I'll refer to only ga_sessions_20220801.
This table has the following fields:
- visitorId INTEGER NULLABLE
- visitNumber INTEGER NULLABLE
- visitId INTEGER NULLABLE
- visitStartTime INTEGER NULLABLE
- date STRING NULLABLE
- totals RECORD NULLABLE
- trafficSource RECORD NULLABLE
- device RECORD NULLABLE
- geoNetwork RECORD NULLABLE
- customDimensions RECORD REPEATED
- hits RECORD REPEATED
- fullVisitorId STRING NULLABLE
- userId STRING NULLABLE
- clientId STRING NULLABLE
- channelGrouping STRING NULLABLE
- socialEngagementType STRING NULLABLE
- privacyInfo RECORD NULLABLE
Field hits is a repeated field i.e. it contains multiple records for each table record.
My question
How can I execute an UPDATE statement on the field hits.sourcePropertyInfo.sourcePropertyTrackingId, where sourcePropertyInfo is another nested record inside hits (but not a repeated one)?
My attempts
update `my_project.my_dataset.ga_sessions_20220801`
set hits.sourcePropertyInfo.sourcePropertyTrackingId = 'some value'
where 1=1
Cannot access field sourcePropertyInfo on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [2:10]
update `my_project.my_dataset.ga_sessions_20220801`
set (select sourcePropertyInfo.sourcePropertyDisplayName from unnest(hits)) = 'some value'
where 1=1
Syntax error: Expected keyword DELETE or keyword INSERT or keyword UPDATE but got keyword SELECT at [2:6]
The following is my only attempt that was not stopped by syntax. I ended up trying to re-create the whole nest by listing each field. But still returns an error at runtime due to the record being repeated.
update `my_project.my_dataset.ga_sessions_20220801`
set hits = ARRAY(
SELECT AS STRUCT
hits.hitNumber, hits.time, hits.hour, hits.minute, hits.isSecure, hits.isInteraction, hits.isEntrance, hits.isExit, hits.referer, hits.page,
(SELECT AS STRUCT hits.transaction.transactionId, hits.transaction.transactionRevenue, hits.transaction.transactionTax, hits.transaction.transactionShipping, hits.transaction.affiliation,
hits.transaction.currencyCode,
hits.transaction.localTransactionRevenue, hits.transaction.localTransactionTax, hits.transaction.localTransactionShipping, hits.transaction.transactionCoupon)
as transaction,
hits.item, hits.contentInfo, hits.appInfo, hits.exceptionInfo, hits.eventInfo, hits.product, hits.promotion, hits.promotionActionInfo, hits.refund, hits.eCommerceAction, hits.experiment, hits.publisher, hits.customVariables, hits.customDimensions, hits.customMetrics, hits.type, hits.social, hits.latencyTracking,
(SELECT AS STRUCT 'some value' as sourcePropertyDisplayName, hits.sourcePropertyInfo.sourcePropertyTrackingId)
as sourcePropertyInfo,
hits.contentGroup, hits.dataSource, hits.publisher_infos, hits.uses_transient_token
FROM gas.hits
)
where 1=1
Scalar subquery produced more than one element
You're very close with the third query. You do need to recreate the hits column, but in a way that preserves the original data structure.
In the query below, I unnest all the hit rows and replace the sourcePropertyInfo struct in each one.
Then, because the hits were unnested, I use array_agg to gather them back into an array.
update `my_project.my_dataset.ga_sessions_20220801`
set hits =
  (
    select array_agg(t)
    from (
      select
        hit.* replace(
          struct(
            hit.sourcePropertyInfo.sourcePropertyDisplayName,
            'some value' as sourcePropertyTrackingId
          ) as sourcePropertyInfo
        )
      from unnest(hits) as hit
    ) as t
  )
where true
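The same unnest, replace, re-aggregate idea can be sketched outside SQL. The Java below is only an illustration of the logic, not of any BigQuery API: Hit and SourcePropertyInfo are hypothetical, heavily trimmed stand-ins for the GA export structs, and withTrackingId is a made-up helper name.

```java
import java.util.ArrayList;
import java.util.List;

final class RebuildHits {
    // Hypothetical stand-in for the nested (non-repeated) struct.
    static final class SourcePropertyInfo {
        final String displayName;
        final String trackingId;

        SourcePropertyInfo(String displayName, String trackingId) {
            this.displayName = displayName;
            this.trackingId = trackingId;
        }
    }

    // Hypothetical stand-in for one element of the repeated hits field.
    static final class Hit {
        final long hitNumber;
        final SourcePropertyInfo sourcePropertyInfo;

        Hit(long hitNumber, SourcePropertyInfo sourcePropertyInfo) {
            this.hitNumber = hitNumber;
            this.sourcePropertyInfo = sourcePropertyInfo;
        }
    }

    // Walk every element (unnest), copy it with only the nested tracking id
    // swapped (replace), and collect the copies into a list again (array_agg).
    static List<Hit> withTrackingId(List<Hit> hits, String newTrackingId) {
        List<Hit> rebuilt = new ArrayList<>(hits.size());
        for (Hit hit : hits) {
            SourcePropertyInfo info =
                new SourcePropertyInfo(hit.sourcePropertyInfo.displayName, newTrackingId);
            rebuilt.add(new Hit(hit.hitNumber, info));
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(1, new SourcePropertyInfo("Site A", "UA-old-1")));
        hits.add(new Hit(2, new SourcePropertyInfo("Site A", "UA-old-1")));
        for (Hit hit : withTrackingId(hits, "some value")) {
            System.out.println(hit.hitNumber + " " + hit.sourcePropertyInfo.trackingId);
        }
    }
}
```

The point of the design is that the repeated field is never updated in place: each element is rebuilt with one nested value changed, and the whole collection is assigned back.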

Using Array(Tuple(LowCardinality(String), Int32)) in ClickHouse

I have a table
CREATE TABLE table (
id Int32,
values Array(Tuple(LowCardinality(String), Int32)),
date Date
) ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date)
but when executing the request
SELECT count(*)
FROM table
WHERE (arrayExists(x -> ((x.1) = toLowCardinality('pattern')), values) = 1)
I get an error
Code: 49. DB::Exception: Received from clickhouse:9000. DB::Exception: Cannot capture column 3 because it has incompatible type: got String, but LowCardinality(String) is expected..
If I replace the column 'values'
values Array(Tuple(String, Int32))
then the request is executed without errors.
What could be the problem when using Array(Tuple(LowCardinality(String), Int32))?
Until this is fixed (see bug 7815), you can use this workaround:
SELECT uniqExact((id, date)) AS count
FROM table
ARRAY JOIN values
WHERE values.1 = 'pattern'
When there is more than one Array column, you can use this approach:
SELECT uniqExact((id, date)) AS count
FROM
(
SELECT
id,
date,
arrayJoin(values) AS v,
arrayJoin(values2) AS v2
FROM table
WHERE v.1 = 'pattern' AND v2.1 = 'pattern2'
)
values Array(Tuple(LowCardinality(String), Int32)),
Do not use Tuple here. It brings only drawbacks.
It is still x2 files on disk.
It gives a twofold slowdown when you extract only one tuple element:
https://gist.github.com/den-crane/f20a2dce94a2926a1e7cfec7cdd12f6d
Use two separate arrays instead:
valuesS Array(LowCardinality(String)),
valuesI Array(Int32)

Spark SQL push down query issue with max function in MS SQL

I want to run the aggregate function MAX on the ID column of a table that resides in MS SQL. I am using Spark SQL 1.6 with the JDBC pushdown-query approach, because I don't want Spark to pull all the data over and compute MAX(ID) on the Spark side. When I execute the code below I get the exception shown, whereas SELECT * FROM in the same code works as expected.
Code:
def getMaxID(sqlContext: SQLContext,tableName:String) =
{
val pushdown_query = s"(SELECT MAX(ID) FROM ${tableName}) as t"
val maxID = sqlContext.read.jdbc(url = getJdbcProp(sqlContext.sparkContext).toString, table = pushdown_query, properties = getDBConnectionProperties(sqlContext.sparkContext))
.head().getLong(0)
maxID
}
Exception:
Exception in thread "main" java.sql.SQLException: No column name was specified for column 1 of 't'.
at net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java:372)
at net.sourceforge.jtds.jdbc.TdsCore.tdsErrorToken(TdsCore.java:2988)
at net.sourceforge.jtds.jdbc.TdsCore.nextToken(TdsCore.java:2421)
at net.sourceforge.jtds.jdbc.TdsCore.getMoreResults(TdsCore.java:671)
at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:505)
at net.sourceforge.jtds.jdbc.JtdsPreparedStatement.executeQuery(JtdsPreparedStatement.java:1029)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:124)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
This exception is not related to Spark. SQL Server requires every column of a derived table to have a name, so you have to provide an alias for the column:
val pushdown_query = s"(SELECT MAX(ID) AS max_id FROM ${tableName}) as t"

Cast to long datatype - BigQuery

BigQuery and SQL noob here. I was going through the data types BigQuery supports. I have a column in Bigtable of type bytes whose original data type is a Scala Long; my application code converted it to bytes and stored it in Bigtable. I am trying CAST(itemId AS integer) (where itemId is the column name) in the BigQuery UI, but the output is 0 instead of the actual value. I have no idea how to do this. If someone could point me in the right direction, I would greatly appreciate it.
EDIT: Adding more details
Sample itemId is 190007788462
The following code writes itemId to Bigtable; I have included only the relevant method. It uses the HBase client to write to Bigtable.
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
def toPut(key: String, itemId: Long): Put = {
val TrxColumnFamily = Bytes.toBytes("trx")
val ItemIdColumn = Bytes.toBytes("itemId")
new Put(Bytes.toBytes(key))
.addColumn(TrxColumnFamily,
ItemIdColumn,
Bytes.toBytes(itemId))
}
The following is the resulting entry in Bigtable:
ROW COLUMN+CELL
foo column=trx:itemId, value=\x00\x00\x00\xAFP]F\xAA
The following is the relevant Scala code that reads the entry from Bigtable. This works correctly. Result is an org.apache.hadoop.hbase.client.Result.
private def getItemId(row: Result): Long = {
val key = Bytes.toString(row.getRow)
val TrxColumnFamily = Bytes.toBytes("trx")
val ItemIdColumn = Bytes.toBytes("itemId")
val itemId =
Bytes.toLong(row.getValue(TrxColumnFamily, ItemIdColumn))
itemId
}
The getItemId function above correctly returns itemId. That's because Bytes.toLong is part of org.apache.hadoop.hbase.util.Bytes which correctly casts the Byte string to Long.
I am using the BigQuery UI with CAST(itemId AS integer) because BigQuery doesn't have a Long data type. This casts the itemId byte string to an integer incorrectly, and the resulting value is 0.
Is there any way I can have a Bytes.toLong equivalent from hbase-client in BigQuery UI? If not is there any other way I can go about this issue?
Try this:
SELECT CAST(CONCAT('0x', TO_HEX(itemId)) AS INT64) AS itemId
FROM YourTable;
It converts the bytes into a hex string, then casts that string into an INT64. Note that the query uses standard SQL, as opposed to legacy SQL. If you want to try it with some sample data, you can run this query:
WITH `YourTable` AS (
SELECT b'\x00\x00\x00\xAFP]F\xAA' AS itemId UNION ALL
SELECT b'\xFA\x45\x99\x61'
)
SELECT CAST(CONCAT('0x', TO_HEX(itemId)) AS INT64) AS itemId
FROM YourTable;
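For checking the SQL result against what the application wrote, here is a minimal Java sketch of the same conversion using only the standard library. It relies on the HBase client's Bytes.toBytes(long) producing 8 big-endian bytes; the class and method names are hypothetical, and viaHex simply mirrors the hex-string route of the query above.

```java
import java.nio.ByteBuffer;

final class BytesToLong {
    // Same layout as an 8-byte big-endian long (ByteBuffer defaults to big-endian).
    static byte[] toBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Direct decode of exactly 8 big-endian bytes.
    static long toLong(byte[] bytes) {
        if (bytes.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + bytes.length);
        }
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Mirrors the SQL: bytes -> hex string -> integer.
    static long viaHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return Long.parseUnsignedLong(hex.toString(), 16);
    }

    public static void main(String[] args) {
        // The cell value shown in the question: \x00\x00\x00\xAFP]F\xAA
        byte[] cell = {0, 0, 0, (byte) 0xAF, 'P', ']', 'F', (byte) 0xAA};
        System.out.println(toLong(cell));
        System.out.println(viaHex(cell));
        // Round trip of the sample itemId from the question.
        System.out.println(toLong(toBytes(190007788462L)));
    }
}
```

Both routes should agree for any 8-byte value, which makes the sketch a quick sanity check for a few rows before trusting the query on the whole table.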

SparkSQL errors when using SQL DATE function

In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I manually built by reading a csv file and converting the columns into the right data type.
Specifically, the table I'm talking about is the LINEITEM table from the [TPC-H specification][1]. Contrary to the specification, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.
In my single scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
When I submit the packaged jar using spark-submit, I get the following error:
Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but but `;' found
When I omit the semicolon and do the same thing, I get the following error:
Exception in thread "main" java.util.NoSuchElementException: key not found: date
Spark version is 1.4.0.
Does anyone have an idea what's the problem with these queries?
[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf
SQL queries passed to SQLContext.sql shouldn't be terminated with a semicolon. This is the source of your first problem.
The DATE UDF expects a date in the YYYY-MM-DD form, and DATE('1998-12-01 00:00:00') evaluates to null. As long as the timestamp can be cast to DATE, the correct query string looks like this:
"SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
DATE is a Hive UDF. It means you have to use HiveContext not a standard SQLContext - this is the source of your second problem.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // where sc is a SparkContext
In Spark >= 1.5 it is also possible to use to_date function:
import org.apache.spark.sql.functions.{lit, to_date}
df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))
Please try the Hive function CAST(expression AS toDatatype).
It converts an expression from one datatype to another,
e.g. CAST('2016-06-17 00.00.000' AS DATE) will convert a String to a Date.
In your case:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE CAST(l.shipdate as DATE) <= CAST('1998-12-01 00:00:00' AS DATE)")
Supported datatype conversions are listed in the Hive documentation under Casting Dates.