BigQuery Error on query Tableau wants to execute - google-bigquery

I'm using Tableau Desktop 2022.3 with a Google Analytics 4 export to BigQuery, and I get an error in Tableau when I try to access the table in BigQuery.
This is the query Tableau wants to run:
SELECT `events_20221124`.`device`.`category` AS `device_category`, CAST(CAST(`events_20221124`.`event_date` AS TIMESTAMP) AS DATE) AS `event_date`, `events_20221124`.`event_name` AS `event_name`, `events_20221124`.`event_params`.`value`.`string_value` AS `event_params_value_string_value`, CAST(`events_20221124`.`event_timestamp` AS STRING) AS `event_timestamp`, `events_20221124`.`user_id` AS `user_id` FROM `project.analytics_12345678`.`events_20221124` `events_20221124`
And this is the error BigQuery returns:
Cannot access field value on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>> at [1:230]

Related

Undefined function: 'timestampdiff'. This function is neither a registered temporary function nor a permanent function registered in the database

I am using Power Query on a table that comes from Databricks. I used the date expression Date.From([Date1]) - [Date2], where Date1 is an arbitrary fixed date and Date2 is a column in the table.
The M code I used:
= Table.AddColumn(#"Renamed Columns", "Age", each Date.From(#date(2024,12,31)) - [#"Date"], type duration)
And here is the error I got:
org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Undefined function: 'timestampdiff'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
This expression works fine on tables loaded from a folder or from SharePoint, just not on tables from Databricks. Is there an alternative way to calculate a date/time difference in Power Query for Databricks sources?

Athena returns more rows than are actually present in the data source in S3

I have stored a data source in S3. When I query it in Athena and count the total number of rows, I get more rows than are present in the CSV file stored in S3.
I have also given Athena a separate path for query results, different from the S3 folder that holds the data source.
Why is Athena returning extra rows, with unknown values in them, and so creating discrepancies in the data?
Here is the code I used to create the table in Athena:
athena_client.start_query_execution(QueryString='create database cms_data',ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
# Tables created for Athena
context = {'Database': 'cms_data'}
athena_client.start_query_execution(QueryString='''CREATE EXTERNAL TABLE IF NOT EXISTS `cms_data`.`mpf_data` (
`State` String,
`County` String,
`Org_Name` String,
`Contract_ID` String,
`Plan_ID` double,
`Segment_ID` double,
`Plan_Type_Desc` String,
`Contract_Year` double,
`Category_Name` String,
`Service_Name` String,
`Limit_Flag` double,
`Authorization_Flag` double,
`Referral_Flag` double,
`Network_Description` String,
`Cost_Share` String )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://cms-dashboard-automation/MPF_Data/'
TBLPROPERTIES ('has_encrypted_data'='false');
''',QueryExecutionContext = context,ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
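The question does not show the CSV itself, so the actual cause is not known. Two things that commonly inflate row counts with a purely delimiter-based reader such as LazySimpleSerDe are a header line that gets counted as data and quoted fields that contain line breaks or commas. A minimal Python sketch, using invented sample data, compares a naive line count with a quote-aware parse:

```python
import csv
import io

# Invented sample: one header line and two records, one of which has a
# quoted field containing both a comma and a line break.
sample = 'State,Org_Name\nTX,"Acme Health,\nInc."\nCA,Beta Care\n'

# A reader that treats every newline as a row terminator sees 4 "rows".
naive_rows = sample.splitlines()

# A quote-aware CSV parser sees the header plus 2 data rows.
parsed = list(csv.reader(io.StringIO(sample)))
data_rows = parsed[1:]

print(len(naive_rows), len(data_rows))
```

If the header turns out to be the only difference, the table property 'skip.header.line.count'='1' is the usual way to exclude it. Whether either cause applies here is an assumption that would need checking against the real file.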

Cast to long datatype - BigQuery

BigQuery and SQL noob here. I was going through the data types BigQuery supports here. I have a column in Bigtable of type bytes whose original data type is a Scala Long. My application code converted it to bytes and stored it in Bigtable. In the BigQuery UI I am trying CAST(itemId AS integer) (where itemId is the column name), but the result is 0 instead of the actual value. I have no idea how to do this. If someone could point me in the right direction I would greatly appreciate it.
EDIT: Adding more details
Sample itemId is 190007788462
Following is the code that writes itemId to Bigtable. I have included only the relevant method. It uses the HBase client to write to Bigtable.
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
def toPut(key: String, itemId: Long): Put = {
  val TrxColumnFamily = Bytes.toBytes("trx")
  val ItemIdColumn = Bytes.toBytes("itemId")
  // Bytes.toBytes(Long) stores the value as 8 big-endian bytes.
  new Put(Bytes.toBytes(key))
    .addColumn(TrxColumnFamily, ItemIdColumn, Bytes.toBytes(itemId))
}
Following is the entry in Bigtable produced by the code above:
ROW COLUMN+CELL
foo column=trx:itemId, value=\x00\x00\x00\xAFP]F\xAA
Following is the relevant Scala code that reads the entry from Bigtable. This works correctly. Result is an org.apache.hadoop.hbase.client.Result.
private def getItemId(row: Result): Long = {
  val key = Bytes.toString(row.getRow)
  val TrxColumnFamily = Bytes.toBytes("trx")
  val ItemIdColumn = Bytes.toBytes("itemId")
  val itemId =
    Bytes.toLong(row.getValue(TrxColumnFamily, ItemIdColumn))
  itemId
}
The getItemId function above correctly returns itemId. That's because Bytes.toLong is part of org.apache.hadoop.hbase.util.Bytes which correctly casts the Byte string to Long.
I am using a BigQuery UI similar to this one, with CAST(itemId AS integer) because BigQuery doesn't have a Long data type. This casts the itemId byte string to an integer incorrectly, and the resulting value is 0.
Is there any way I can have a Bytes.toLong equivalent from hbase-client in BigQuery UI? If not is there any other way I can go about this issue?
Try this:
SELECT CAST(CONCAT('0x', TO_HEX(itemId)) AS INT64) AS itemId
FROM YourTable;
It converts the bytes into a hex string, then casts that string into an INT64. Note that the query uses standard SQL, as opposed to legacy SQL. If you want to try it with some sample data, you can run this query:
WITH `YourTable` AS (
SELECT b'\x00\x00\x00\xAFP]F\xAA' AS itemId UNION ALL
SELECT b'\xFA\x45\x99\x61'
)
SELECT CAST(CONCAT('0x', TO_HEX(itemId)) AS INT64) AS itemId
FROM YourTable;
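To check the expected result outside BigQuery, the same decoding can be sketched in Python. This assumes the column holds exactly what HBase's Bytes.toBytes(Long) writes, that is, eight bytes in big-endian two's complement. The helper name decode_hbase_long is made up for this illustration.

```python
import struct

def decode_hbase_long(raw: bytes) -> int:
    """Decode 8 big-endian bytes as a signed 64-bit integer (hypothetical helper)."""
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes, got {len(raw)}")
    # '>q' means big-endian, signed, 64 bits.
    (value,) = struct.unpack(">q", raw)
    return value

# The cell value shown in the question.
sample = b"\x00\x00\x00\xAFP]F\xAA"

print(decode_hbase_long(sample))
# Same idea as the SQL above: bytes -> hex string -> integer.
print(int(sample.hex(), 16))
```

The two routes agree for non-negative values. The hex route reads the bytes as unsigned, so a value with the top bit set would differ from the signed Long that Bytes.toLong returns.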

How do I update a nested record in BigQuery using DML syntax?

I've got the following BigQuery schema, and I'm trying to update the event_dim.date field:
I tried the following query using standard SQL and the new BigQuery DML:
UPDATE `sara-bigquery.examples.app_events_20170113`
SET event_dim.date = '20170113'
WHERE true
But got this error:
Error: Cannot access field date on a value with type ARRAY<STRUCT<name STRING, params ARRAY<STRUCT<key STRING,
value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, ...>>>, timestamp_micros INT64, ...>> at [2:15]
I'm able to select the nested field with this query:
SELECT x.date FROM `sara-bigquery.examples.app_events_20170113`,
UNNEST(event_dim) x
But can't figure out the correct UPDATE syntax.
That query failed because event_dim is an array of structs. This should do the trick:
UPDATE `sara-bigquery.examples.app_events_20170113`
SET event_dim = ARRAY(
SELECT AS STRUCT * REPLACE('20170113' AS date) FROM UNNEST(event_dim)
)
WHERE true
Check out the docs on how arrays are handled in Standard SQL for more details.
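The reason the rewrite works can be illustrated outside SQL. In this minimal Python sketch a row is modelled as a dict whose event_dim is a list of dicts (the field values and the helper name replace_date are invented). There is no single event_dim.date to assign to, so each element is rebuilt with date replaced and the whole list is assigned back.

```python
def replace_date(row: dict, new_date: str) -> dict:
    """Return a copy of row with 'date' replaced in every event_dim element."""
    # Comparable in spirit to:
    #   SET event_dim = ARRAY(SELECT AS STRUCT * REPLACE(new_date AS date)
    #                         FROM UNNEST(event_dim))
    new_events = [{**event, "date": new_date} for event in row["event_dim"]]
    return {**row, "event_dim": new_events}

row = {
    "user": "u1",
    "event_dim": [
        {"name": "session_start", "date": "20170112"},
        {"name": "first_open", "date": "20170112"},
    ],
}

updated = replace_date(row, "20170113")
print([event["date"] for event in updated["event_dim"]])
```

All other fields of each element are carried over unchanged, which is what * REPLACE does for the struct.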

Hive - dynamic partitions: Long loading times with a lot of partitions when updating table

I run Hive via AWS EMR and have a jobflow that parses log data frequently into S3. I use dynamic partitions (date and log level) for my parsed Hive table.
Now that I have several gigabytes of data and a lot of partitions, the step where Hive loads the data into the table after parsing is done takes forever.
Loading data to table default.logs partition (dt=null, level=null)
...
Loading partition {dt=2013-08-06, level=INFO}
Loading partition {dt=2013-03-12, level=ERROR}
Loading partition {dt=2013-08-03, level=WARN}
Loading partition {dt=2013-07-08, level=INFO}
Loading partition {dt=2013-08-03, level=ERROR}
...
Partition default.logs{dt=2013-03-05, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 1905, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=ERROR} stats: [num_files: 1, num_rows: 0, total_size: 4338, raw_data_size: 0]
Partition default.logs{dt=2013-03-06, level=INFO} stats: [num_files: 1, num_rows: 0, total_size: 828250, raw_data_size: 0]
...
Partition default.logs{dt=2013-08-14, level=INFO} stats: [num_files: 5, num_rows: 0, total_size: 626629, raw_data_size: 0]
Partition default.logs{dt=2013-08-14, level=WARN} stats: [num_files: 4, num_rows: 0, total_size: 4405, raw_data_size: 0]
Is there a way to overcome this problem and reduce the loading times for this step?
I have already tried archiving old logs to Glacier via a bucket lifecycle rule, hoping that Hive would skip loading the archived partitions. However, since the file paths stay visible in S3, Hive recognizes the archived partitions anyway, so no performance is gained.
Update 1
The data is loaded by simply inserting it into the dynamically partitioned table
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs ;
from one table that contains the unparsed logs
CREATE EXTERNAL TABLE new_logs (
dt STRING,
time STRING,
thread STRING,
level STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
value STRING,
exception STRING,
version STRING
)
PARTITIONED BY (
server STRING,
app STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'org.maz.hadoop.mapred.LogFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://my-log/logs/${LOCATION}' ;
into the new (parsed) table
CREATE EXTERNAL TABLE logs (
time STRING,
thread STRING,
logger STRING,
identity STRING,
message STRING,
logtype STRING,
logsubtype STRING,
node STRING,
storageallocationstatus STRING,
nodelist STRING,
userid STRING,
nodeid STRING,
path STRING,
datablockid STRING,
hash STRING,
size STRING,
exception STRING,
value STRING,
server STRING,
app STRING,
version STRING
)
PARTITIONED BY (
dt STRING,
level STRING
)
ROW FORMAT
DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://my-log/parsed-logs' ;
The input format (LogFileInputFormat) is responsible for parsing log entries into the desired log format.
Update 2
When I try the following
INSERT INTO TABLE logs PARTITION (dt, level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, dt, level
FROM new_logs
WHERE dt > 'some old date';
Hive still loads all partitions in logs. If, on the other hand, I use static partitioning like
INSERT INTO TABLE logs PARTITION (dt='some date', level)
SELECT time, thread, logger, identity, message, logtype, logsubtype, node, storageallocationstatus, nodelist, userid, nodeid, path, datablockid, hash, size, value, exception, server, app, version, level
FROM new_logs
WHERE dt = 'some date';
Hive only loads the partitions concerned, but then I need to create one query for each date I think might be present in new_logs. Usually new_logs only contains log entries from today and yesterday, but it might contain older entries as well.
Static partitioning is my solution of choice at the moment, but aren't there any other (better) solutions to my problem?
During this slow phase, Hive takes the files it built for each partition and moves them from a temporary directory to a permanent directory. You can see this step in the output of "explain extended", where it appears as a Move Operator.
So for each partition it's one move and an update to the metastore. I don't use EMR but I presume this act of moving files to S3 has high latency for each file it needs to move.
What's not clear from what you wrote is whether you're doing a full load each time you run. For example why do you have a 2013-03-05 partition? Are you getting new log data that contains this old date? If this data is already in your logs table you should modify your insert statement like
SELECT fields
FROM new_logs
WHERE dt > 'date of last run';
This way you'll only get a few partitions and only a few files to move. It's still wasteful to scan all the extra data in new_logs, but you can solve that by partitioning new_logs.
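The reasoning can be sketched in Python with invented sample rows and a hypothetical last_run watermark. Since there is one move and one metastore update per partition, the cost of the load step follows the number of distinct (dt, level) pairs among the selected rows.

```python
def partitions_touched(rows, last_run=None):
    """Return the distinct (dt, level) pairs among rows newer than last_run."""
    return {
        (r["dt"], r["level"])
        for r in rows
        # ISO-formatted dates compare correctly as plain strings.
        if last_run is None or r["dt"] > last_run
    }

# Invented rows standing in for new_logs.
rows = [
    {"dt": "2013-03-05", "level": "INFO"},
    {"dt": "2013-03-06", "level": "ERROR"},
    {"dt": "2013-08-13", "level": "INFO"},
    {"dt": "2013-08-14", "level": "INFO"},
    {"dt": "2013-08-14", "level": "WARN"},
]

print(len(partitions_touched(rows)))                         # no filter
print(len(partitions_touched(rows, last_run="2013-08-13")))  # with watermark
```

This only models which partitions receive rows. It does not say anything about how a particular Hive or EMR version behaves during the load step.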
AWS has improved Hive partition recovery time by more than an order of magnitude on EMR 3.2.x and above.
We have a Hive table with more than 20,000 partitions on S3. With earlier versions of EMR, recovery took about 80 minutes; with 3.2.x/3.3.x it now takes under 5 minutes.