Spotfire and BigQuery - google-bigquery

I am quite puzzled by the BigQuery connector in Spotfire. It is taking an extremely long time to import my dataset in-memory.
My configuration: Spotfire on an AWS Windows instance (8 vCPU, 32 GB RAM); dataset: 50 GB, >100M rows in BigQuery.
Yes, I should use in-database analysis for such a large dataset, push the queries down to BigQuery, and use Spotfire only for display, but that is not my question today 😋
Today I am trying to understand how the import works and why it is taking so long. This import job started 21 hrs ago and is still not finished. The resources of the server are barely used (CPU, disk, network).
Testing done:
I tried importing data from Redshift and it was much faster (14 min for 22 GB).
I checked the resources used during import: network throughput (Redshift ~370 Mbps, BQ ~8 Mbps for 30 min), CPU (Redshift ~25%, BQ <5%), RAM (Redshift & BQ ~27 GB), disk write (Redshift ~30 MB/s, BQ ~5 MB/s).
I really don't understand what Spotfire is actually doing for all this time while importing the dataset from BQ into memory. There seems to be no use of server resources, and there is no indication of status apart from the elapsed time.
Do any Spotfire experts have insights into what's happening? Is the BigQuery connector simply not meant to be used for in-memory analysis, and what is the limiting factor in its implementation?
Thanks!
😇

We had an issue which is fixed in the Spotfire versions below:
TS 10.10.3 LTS HF-014
TS 11.2.0 HF-002
Please also vote and comment on the idea of using the Storage API when extracting data from BigQuery:
https://ideas.tibco.com/ideas/TS-I-7890
Thanks,
Thomas Blomberg
Senior Product Manager TIBCO Spotfire

#Tarik, you need to install the above hotfix at your end.
You can download the latest hotfix from the link: https://community.tibco.com/wiki/list-hotfixes-tibco-spotfire-clients-analyst-web-player-consumerbusiness-author-and-automation

An update after more testing. Thanks to #Thomas and #Manoj for their very helpful support. Here are the results:
I updated Spotfire to 11.2.0 HF-002 and it fixed the issue with bringing data in-memory from BigQuery 👌. Using (Data > Add Data...), the data throughput was still very low, though: ~13 min/GB, with the network throughput coming in bursts of ~8 Mbps.
As suggested by Thomas in TIBCO Ideas, I installed the Simba JDBC driver and the data throughput improved dramatically to ~50 s/GB using (Data > Information Designer). The catch, of course, is that you need access to the server to install it. The network throughput was roughly 200 Mbps. I am not sure what the limiting factor is (Spotfire configuration, the Simba driver, or BigQuery).
Using the Redshift connector against a Redshift cluster holding the same data, again through (Data > Information Designer), I get a data import throughput of ~30 s/GB with a network throughput of ~380 Mbps.
So my recommendation is to use the latest Simba driver along with the Information Designer to get the best in-memory data import throughput when connecting to medium-sized datasets in BigQuery (10-30 GB). This gives a data import throughput of ~1 min/GB.
It's still not clear what makes the Redshift connection faster, or whether there is a faster method to import data from GCP/BigQuery into Spotfire 🤷‍♂️
Any comments or suggestions are welcome!
Tarik

Related

Suggestion for Non-Analytical Distributed Processing Frameworks

Can someone please suggest a tool, framework, or service to perform the task below faster?
Input: a CSV file with over a million rows, consisting of an identifier and several image columns.
Objective: check whether any of the image columns in a row meets the minimum resolution, and create a new boolean column for every row with the result:
True - if any image in the row meets the minimum resolution
False - if no image in the row meets the minimum resolution
Current implementation: a Python script with pandas and multiprocessing running on a large VM (60-core CPU), which takes about 4-5 hours. Since this is a periodic task, we schedule and manage it with Cloud Workflow and a Celery backend.
Note: we are looking to cut costs, since the server is only needed for about 4-6 hrs a day; keeping a 60-core CPU running 24/7 would waste a lot of resources.
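For reference, a minimal pandas sketch of the per-row check described above (assuming the image columns hold local file paths and using Pillow; the column names and resolution threshold are hypothetical):
import pandas as pd
from PIL import Image
MIN_W, MIN_H = 800, 600                         # hypothetical minimum resolution
IMAGE_COLS = ["image_1", "image_2", "image_3"]  # hypothetical column names
def meets_min_resolution(path):
    try:
        with Image.open(path) as img:
            w, h = img.size
        return w >= MIN_W and h >= MIN_H
    except OSError:
        return False  # missing or unreadable image counts as not meeting the resolution
def check_row(row):
    # True if any image column in the row meets the minimum resolution
    return any(meets_min_resolution(p) for p in row[IMAGE_COLS].dropna())
df = pd.read_csv("input.csv")
df["meets_min_resolution"] = df.apply(check_row, axis=1)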
Options Explored:
We have ruled out Cloud Run due to its memory, CPU, and timeout limitations.
Apache Beam with Cloud Dataflow: there seems to be less support for non-analytical workloads, and the DataFrame implementation in Apache Beam still looks buggy.
Spark on Dataproc seems geared toward analytical workloads, and a serverless option would be much preferred.
Which direction should I be looking into?

Spotfire - BigQuery - Simba JDBC

I could not find an answer to the following question. I have been testing Spotfire with the BigQuery connector over the last week, using the Simba JDBC driver. It works well: ~1 min per GB of data imported into memory. This is fast thanks to the BigQuery Storage API implementation ⚡
What I cannot figure out are the following two points:
Which versions of the Simba driver and Spotfire have this high-throughput feature enabled?
Can I configure this feature? Adding EnableHighThroughputAPI=1 to the connection string is supposed to toggle BQ Storage API use, but in Spotfire, changing this parameter from 1 to 0 makes no difference - the Storage API is always used. Is it internally overridden by Spotfire?
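For context, a Simba BigQuery JDBC connection string carrying this flag looks roughly like the sketch below (the project and service-account values are placeholders, and the exact property list depends on your driver version and authentication setup):
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project;OAuthType=0;OAuthServiceAcctEmail=my-sa@my-project.iam.gserviceaccount.com;OAuthPvtKeyPath=/path/to/key.json;EnableHighThroughputAPI=1;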
An expert's insight would be appreciated - I could not find any documentation on this (I am using 11.2.0 HF-002 and the Simba JDBC 4.2 driver).
thanks!

Google Dataflow instance and BigQuery cost considerations

I am planning to spin up a Dataflow instance on Google Cloud Platform to run some experiments. I want to get familiar with Apache Beam and experiment with using it to pull data from BigQuery, run some ETL jobs (in Python) and streaming jobs, and finally store the results back in BigQuery.
However, I am also concerned about sending my company's GCP bill through the roof. What are the main cost considerations, and are there any methods to estimate what the cost will be, so I don't get an earful from my boss?
Any help would be greatly appreciated, thanks!
You can use the GCP pricing calculator to get an estimate of the price of the job.
One of the most important resources on the Dataflow side is CPU hours. To limit CPU hours, you can cap the number of machines using the maxNumWorkers option in your pipeline.
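For example, a minimal sketch with the Beam Python SDK, where the option is spelled max_num_workers (the project, region, and bucket values are placeholders):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project
    region="us-central1",                # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
    max_num_workers=4,                   # caps autoscaling, limiting CPU hours
)
pipeline = beam.Pipeline(options=options)
# ... apply your transforms here, then run:
# pipeline.run().wait_until_finish()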
Here are more pipeline options that you can set when running your Dataflow job: https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
For BQ, you can do a similar estimate using the calculator.

Loading huge data from BigQuery into Python/pandas/Dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a very large table in BigQuery (assume 20 million rows inserted per day). I want to have around 20 million rows of data, with around 50 columns, in Python/pandas/Dask to do some analysis. I have tried the bqclient, pandas-gbq, and BQ Storage API methods, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do this? Is there any Google service available for a similar job?
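For reference, the Storage API path mentioned above looks roughly like the sketch below (the project and query are placeholders; it assumes a recent google-cloud-bigquery client plus the google-cloud-bigquery-storage and pyarrow packages):
from google.cloud import bigquery
client = bigquery.Client(project="my-project")           # placeholder project
sql = "SELECT * FROM `my-project.my_dataset.big_table`"  # placeholder query
# create_bqstorage_client=True streams the result via the BigQuery Storage API
df = client.query(sql).to_dataframe(create_bqstorage_client=True)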
Instead of querying, you can always export the data to Cloud Storage -> download it locally -> load it into your Dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it parsing that data into memory?
Since you can make SQLAlchemy support BigQuery ( https://github.com/mxmzdlv/pybigquery ), you could try to use dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. In case BigQuery is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
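A rough sketch of that read_sql_table approach, assuming the pybigquery SQLAlchemy dialect is installed (project, dataset, table, and column names are placeholders):
import dask.dataframe as dd
ddf = dd.read_sql_table(
    "my_table",                          # placeholder table name
    "bigquery://my-project/my_dataset",  # pybigquery SQLAlchemy URI (placeholder)
    index_col="id",                      # a numeric column used to split into partitions
    npartitions=20,                      # partitions are loaded in parallel
)
result = ddf.groupby("some_column").size().compute()  # placeholder analysis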
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to pandas.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to export. You can do a SELECT and store the result in the intermediate table.
Export the intermediate table to Google Cloud Storage, in JSON/Avro/Parquet format.
Download your exported data and load it into your Python app.
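A rough sketch of these steps using the google-cloud-bigquery client (project, dataset, table, and bucket names are placeholders):
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client(project="my-project")  # placeholder project
# 1. Store the query result in an intermediate table
dest = bigquery.TableReference.from_string("my-project.my_dataset.intermediate_table")
client.query(
    "SELECT * FROM `my-project.my_dataset.big_table`",  # placeholder SELECT
    job_config=bigquery.QueryJobConfig(destination=dest, write_disposition="WRITE_TRUNCATE"),
).result()
# 2. Export the intermediate table to Google Cloud Storage as Parquet
client.extract_table(
    dest,
    "gs://my-bucket/export/data-*.parquet",  # placeholder bucket
    job_config=bigquery.ExtractJobConfig(destination_format="PARQUET"),
).result()
# 3. Download the files (e.g. gsutil -m cp gs://my-bucket/export/*.parquet /my/local/dir/)
#    and load them locally
df = pd.read_parquet("/my/local/dir/")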
Besides downloading the data to your local machine, you can do the processing with PySpark and Spark SQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster, load the data from Google Cloud Storage into Spark, and do the analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
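As a rough illustration, reading the table into Spark on Dataproc might look like this, assuming the spark-bigquery connector is available on the cluster (the table name and the analysis step are placeholders):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("bq-analysis").getOrCreate()
df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.big_table")  # placeholder table
      .load())
df.groupBy("some_column").count().show()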
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!
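A quick sketch with the dask-bigquery package mentioned above (project, dataset, and table names are placeholders):
import dask_bigquery
ddf = dask_bigquery.read_gbq(
    project_id="my-project",  # placeholder project
    dataset_id="my_dataset",  # placeholder dataset
    table_id="my_table",      # placeholder table
)
print(ddf.head())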

Very low throughput when using JdbcIO on Google Dataflow

I'd like to load data into a Google Cloud SQL instance via Google Dataflow.
Since there seems to be no built-in sink for Cloud SQL, I decided to use org.apache.beam.sdk.io.jdbc.JdbcIO.
But the throughput into Cloud SQL is very low (about 6 records/sec).
I suspected that the Cloud SQL instance spec was too low, but there was no improvement when I upgraded it.
In the Dataflow logs, there are many entries like the ones below:
Proposing dynamic split of work unit my-project;2017-06-27_02_58_19-14077185378147382467;6703504927792172410 at {"fractionConsumed":0.9669782519340515}
Rejecting split request because custom reader returned null residual source.
What is happening? And how can I improve the performance?
It's resolved!
When generating the connection string, add the following:
JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver", "jdbc:mysql://google/mydatabase?cloudSqlInstance=myproject:region:instance-name&socketFactory=com.google.cloud.sql.mysql.SocketFactory&rewriteBatchedStatements=true")
Adding "rewriteBatchedStatements=true", it's worked.
The throughput improved to 2000/sec about!
Notice: it workes only when using mysql, perhaps.
Rejecting split request because custom reader returned null residual source.
Whatever custom source you implemented doesn't appear to support dynamic rebalancing.
I suspected that the Cloud SQL instance spec was too low, but there was no improvement when I upgraded it.
Are you sure it's the throughput to Cloud SQL that is the issue? Have you measured the performance of your source and proven it is the bottleneck?
I'd like to load data into a Google Cloud SQL instance via Google Dataflow.
Generally, I wouldn't recommend this. Cloud SQL is a single-machine database, so I suspect you don't get much benefit, and perhaps even a performance penalty, from using a horizontally scalable method like Dataflow. You should be able to ingest into Cloud SQL just as fast using a single VM instance to load the data.