I could not find an answer to the following question. I have been testing Spotfire with the BigQuery connector over the last week, using the Simba JDBC driver. It works well: ~1 min per GB of data imported to memory. This is fast thanks to the BigQuery Storage API implementation ⚡
What I cannot understand are the following two points:
Which versions of the Simba driver and Spotfire have this high-throughput feature enabled?
Can I configure this feature? Adding EnableHighThroughputAPI=1 to the connection string should toggle use of the BigQuery Storage API. In Spotfire, however, changing this parameter from 1 to 0 makes no difference: the Storage API is always used. Is it internally overridden by Spotfire?
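For reference, here is how I am passing the flag outside Spotfire, with plain JDBC (a minimal sketch; the project, service account, and key path are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class BigQueryJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Simba BigQuery JDBC URL; EnableHighThroughputAPI=1 is supposed to
        // toggle use of the BigQuery Storage API for large result sets.
        String url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
                + "ProjectId=my-project;"                                          // placeholder
                + "OAuthType=0;"                                                   // service-account auth
                + "OAuthServiceAcctEmail=svc@my-project.iam.gserviceaccount.com;"  // placeholder
                + "OAuthPvtKeyPath=/path/to/key.json;"                             // placeholder
                + "EnableHighThroughputAPI=1;";                                    // the flag in question
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Driver: " + conn.getMetaData().getDriverVersion());
        }
    }
}
```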
Expert insight would be appreciated; I could not find any documentation on this (I am using 11.2.0 HF-002 and the Simba JDBC 4.2 driver).
thanks!
Related
I am quite puzzled by the BigQuery connector in Spotfire. It is taking an extremely long time to import my dataset in-memory.
My configuration: Spotfire on an AWS Windows instance (8 vCPU, 32 GB RAM); the dataset is 50 GB / >100M rows on BigQuery.
Yes, I should use in-database for such a large dataset, push the queries to BigQuery, and use Spotfire only for display, but that is not my question today 😋
Today I am trying to understand how the import works and why it takes so long. This import job started 21 hours ago and is still not finished. The server's resources are barely used (CPU, disk, network).
Testing done:
I tried importing data from Redshift and it was much faster (14 min for 22 GB).
I checked the resources used during import:
Network: Redshift ~370 Mb/s; BigQuery ~8 Mb/s (in bursts, for 30 min)
CPU: Redshift ~25%; BigQuery <5%
RAM: Redshift and BigQuery both ~27 GB
Disk write: Redshift ~30 MB/s; BigQuery ~5 MB/s
I really don't understand what Spotfire is actually doing all this time while importing the dataset from BigQuery into memory. There seems to be no use of server resources, and there is no indication of status apart from the elapsed time.
Do any Spotfire experts have insight into what's happening? Is the BigQuery connector actually not meant for in-memory analysis, and if so, what is the implementation's limiting factor?
Thanks!
😇
We had an issue which is fixed in the Spotfire versions below:
TS 10.10.3 LTS HF-014
TS 11.2.0 HF-002
Please also vote and comment on the idea of using the Storage API when extracting data from BigQuery:
https://ideas.tibco.com/ideas/TS-I-7890
Thanks,
Thomas Blomberg
Senior Product Manager TIBCO Spotfire
@Tarik, you need to install the above hotfix at your end.
You can download the latest hotfix from the link: https://community.tibco.com/wiki/list-hotfixes-tibco-spotfire-clients-analyst-web-player-consumerbusiness-author-and-automation
An update after more testing. Thanks to @Thomas and @Manoj for their very helpful support. Here are the results:
I updated Spotfire to 11.2.0 HF-002 and it fixed the issue with bringing data in-memory from BigQuery 👌. Using (Data > Add Data...), the data throughput was still very low, though: ~13 min/GB, with network throughput coming in bursts of 8 Mb/s.
As suggested by Thomas on TIBCO Ideas, I installed the Simba JDBC driver, and the data throughput improved dramatically to ~50 s/GB using (Data > Information Designer). The catch, of course, is that you need access to the server to install it. The network throughput was roughly 200 Mb/s. I am not sure what the limiting factor is (Spotfire configuration, the Simba driver, or BigQuery).
Using the Redshift connector against a Redshift cluster with the same data, connecting via (Data > Information Designer), I get a data import throughput of ~30 s/GB with a network throughput of 380 Mb/s.
So my recommendation is to use the latest Simba driver along with the Information Designer to get the best "in-memory" data import throughput when connecting to medium-sized datasets in BigQuery (10-30 GB). This yields a data import throughput of ~1 min/GB.
It's still not clear what makes the Redshift connection faster, though, or whether there is a faster method to import data from GCP/BigQuery into Spotfire 🤷♂️
Any comments or suggestions are welcome!
Tarik
I have a use case where we need to periodically load a BigQuery table into a cache and support SQL queries from there. I'm researching Apache Ignite and think it could be a good fit for our use case. It's just not clear to me yet how I can get auto-load from BigQuery. By "auto-load" I mean keeping Apache Ignite updated with the BigQuery table data while keeping the updates transparent to applications. In most cases, our BigQuery tables are updated by other scheduled jobs/queries at intervals ranging from 5 minutes to 1 month.
I'm new to Ignite, and I guess my questions are as follows:
Is this a feature already supported in Ignite? (I couldn't find one)
Or are there any existing plugins for it? (I couldn't find any)
How would I implement an auto-loading cache for BigQuery using Ignite?
You can do a one-time load with a CacheStore / loadCache(), but doing this every few minutes is infeasible. You may wish to design a BigQuery-to-Apache-Ignite streamer instead, if BigQuery supports pushing deltas.
If Google BigQuery doesn't expose its changelog to CDC tools, then find another way to capture those updates and stream them to Ignite via its IgniteDataStreamer API. There should be a way to capture the changes with some pub/sub mechanism.
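A minimal sketch of the Ignite side of that streamer, assuming the changed rows are already being captured somewhere (the cache name, key/value types, and payloads are hypothetical):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class BigQueryToIgniteStreamer {
    public static void main(String[] args) {
        // Start (or connect to) an Ignite node with the default config.
        try (Ignite ignite = Ignition.start()) {
            // Hypothetical cache holding the BigQuery rows; keys are row ids.
            ignite.getOrCreateCache("bqCache");

            try (IgniteDataStreamer<Long, String> streamer =
                     ignite.dataStreamer("bqCache")) {
                // Let re-loaded rows replace stale entries.
                streamer.allowOverwrite(true);

                // In a real setup this loop would be fed by whatever mechanism
                // captures the BigQuery changes (e.g. a pub/sub subscription).
                for (long id = 0; id < 1_000; id++) {
                    streamer.addData(id, "row-payload-" + id);
                }
            } // close() flushes any buffered entries
        }
    }
}
```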
I am currently trying to set up a Presto connection through Hue on an AWS EMR cluster (release 5.24.0).
By default, AWS sets up the connection using the jdbc interface.
The problem with this interface is that, for some reason, it loads only the first 1000 records when performing a query of the type
SELECT * FROM table
This is probably due to a current Hue limitation that hardcodes the fetch size to 1000 rows with no way to edit it.
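For context, in plain JDBC the fetch size is only a batching hint, not a row cap, so the 1000-row result presumably means Hue stops after the first batch instead of iterating further (that part is an assumption about Hue's implementation). A minimal sketch of the distinction; the host and table names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder Presto JDBC URL for an EMR master node.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:presto://emr-master:8889/hive/default", "hadoop", null);
             Statement stmt = conn.createStatement()) {

            // A hint for rows per round trip -- NOT a cap on total rows.
            stmt.setFetchSize(1000);

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
                long rows = 0;
                // Iterating past the first batch transparently fetches more.
                while (rs.next()) {
                    rows++;
                }
                System.out.println("total rows: " + rows);
            }
        }
    }
}
```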
According to Hue's documentation, however, it is advisable to use sqlalchemy as the interface instead of jdbc whenever possible. So I changed the Presto interpreter settings and installed pyhive as documented in the guide.
The fetch size problem is gone, but even performing a
SELECT COUNT(1) FROM table
can take several minutes. Moreover, from the Hue UI, the query seems to be completed, when it is still running if I look at the Presto UI.
What is worse is that if I submit a new query in the Hue Presto editor, the new query is submitted, but the previous one keeps running.
Has anyone experienced similar issues? Does changing some Hue/Presto/Hive settings improve anything?
I get an error when using the following JDBC driver to retrieve BigQuery data in KNIME.
The error message appears in the Database Connection Table Reader node as follows:
Execute failed: " Simba BigQueryJDBCDriver 100033" Error getting job status.
However, this only occurs after consecutively running a couple of similar data flows that use the BigQuery driver in KNIME.
Google searches turned up no extra information. I have already updated the driver and KNIME to the latest versions, and I also tried rerunning the flow on a different system, without success.
Are there quotas/limits attached to using this specific driver?
Hope someone is able to help!
I found this issue tracker; it seems that you opened it, and there's already interaction with BigQuery's engineering team. I suggest following the discussion there and subscribing to it to stay updated, as you'll receive e-mails about its progress.
Regarding your question about the driver's limits: the quotas and limits that you usually have in BigQuery apply to the Simba driver too (i.e. the concurrent queries limit, execution time limit, maximum response size, etc.).
Hope it helps.
Just discovered that a new query limit was set at the company's Group level due to some internal miscommunication. Sorry for the bother, and thanks for the feedback!
I am sending data to Pub/Sub, from where I am trying to create a Dataflow job to put the data into BigQuery.
I have a column of unique values in the data on which I want to run HLL_COUNT.INIT.
Is there an equivalent method on the Dataflow side so that I can directly store the HLL sketch of the column in BigQuery?
No, Dataflow doesn't have support for the BigQuery HLL sketch format, but it is clearly something that would be useful. I created a feature request for it in the Dataflow issue tracker: https://issuetracker.google.com/62153424.
Update: A BigQuery-compatible implementation of HyperLogLog++ has been open-sourced at github.com/google/zetasketch, and a design doc (docs.google.com/document/d/…) about integrating it into Apache Beam has been sent out to dev@beam.apache.org.
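For anyone landing here later: that integration shipped as Beam's zetasketch extension. A minimal sketch, assuming Beam 2.16+ with the beam-sdks-java-extensions-zetasketch artifact on the classpath (the input values here are placeholders; in the real job they would come from Pub/Sub):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.zetasketch.HllCount;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class HllSketchExample {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder input collection of the "unique values" column.
        PCollection<String> ids = p.apply(Create.of("user-1", "user-2", "user-1"));

        // Build a BigQuery-compatible HLL++ sketch over the whole collection,
        // equivalent in spirit to HLL_COUNT.INIT on the BigQuery side.
        PCollection<byte[]> sketch = ids.apply(HllCount.Init.forStrings().globally());

        p.run().waitUntilFinish();
    }
}
```

The resulting byte[] sketches can be written to a BYTES column in BigQuery and then merged or read there with HLL_COUNT.MERGE_PARTIAL / HLL_COUNT.EXTRACT.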