R server 9.0.1 still fails to rxImport huge data - microsoft-r

I am eager to use Microsoft R server to deal with real big data in hadoop compute context. It works well to deal with not large data.
But it fails when data is large with the following error:

This is a known bug when trying to import large data from HDFS into a single data-frame. Currently, you can work around this issue by copying the data from HDFS into your local file system, and then importing it into a data-frame directly.
Thank you,
Kirill

Related

Spotfire and BigQuery

I am quite puzzled by BigQuery connector on Spotfire. It is taking !extremely! long time to import my dataset in-memory.
my configuration: spotfire on AWS windows instance (8vCPU - 32Go RAM). dataset 50Go >100M rows on BigQuery.
Yes - I should use in-database for such large dataset and push the queries to BigQuery and use Spotfire only for display, but that is not my question today 😋
Today i am trying to understand how the import works and why it is taking so long. this import job started 21hrs ago and it is still not finished. The resources of the server are barely used (CPU, Disk, Network).
Testing done:
I tried importing data from Redshift and it was much faster (14min for 22Go)
I checked resources used during import: network speed (Redshift ~ 370Mbs, BQ ~ 8Mbs for 30min), CPU (Redshift ~ 25%, BQ < 5%), RAM (Redshift & BQ ~ 27Go), Disk write (Redshift 30Mbs, BQ 5MBs)
I really don't understand what is Spotfire actually doing for all this time while importing dataset from BQ in memory. There seems to be no use of server resources and there is no indication of status apart from time running.
Any Spotfire experts have any insights on what's happening? Is the connector to BigQuery actually not to be used for In-memory analysis - what is the actual implementation limiting factor?
Thanks!
😇
We had an issue which is fixed in the Spotfire versions below:
TS 10.10.3 LTS HF-014
TS 11.2.0 HF-002
Please also vote and comment on the idea of using the Storage API when extracting data from BigQuery:
https://ideas.tibco.com/ideas/TS-I-7890
Thanks,
Thomas Blomberg
Senior Product Manager TIBCO Spotfire
#Tarik, you need to install the above hotfix at your end.
You can download the latest hotfix from the link: https://community.tibco.com/wiki/list-hotfixes-tibco-spotfire-clients-analyst-web-player-consumerbusiness-author-and-automation
An update after more testing. Thanks #Thomas and #Manoj's very helpful support. Here are the results:
I updated spotfire version to 11.2.0 HF002 and it fixed the issue with bringing data in-memory with BigQuery 👌 - Using (Data > Add Data...), the data throughput was very low though ~ 13min/Go. The network throughput was doing burst of 8Mbs.
As suggested in tibco ideas by Thomas, I installed Simba JDBC driver and the data throughput improved drammatically to ~ 50s/Go using (Data > Information Designer). The issue off course is that you need access to the server to install it. the network throughput was roughly 200Mbs. I am not sure what is the limiting factor (Spotfire config, Samba Driver or BigQuery).
Using Redshift connector to a Redshift cluster with the same data and connecting using (Data > Information Designer), I get to a data import throughput of ~ 30s/Go with a network throughput of 380Mbs.
So my recommandation is to use the latest simba driver along with the Information Designer to get the best "in-memory" data import throughput when connecting to medium size dataset in BigQuery (10-30Go). This leads to a data import throughput of 1min/Go.
It's not clear what makes Redshift connection faster though and if there is faster method to import data from GCP/BigQuery to Spotfire 🤷‍♂️
Any comments or suggestions are welcome!
Tarik

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a large large table in BigQuery (assume inserting 20 million rows per day). I want to have around 20 million rows of data with around 50 columns in python/pandas/dask to do some analysis. I have tried using bqclient, panda-gbq and bq storage API methods but it takes 30 min to have 5 millions rows in python. Is there any other way to do so? Even any Google service available to do similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for big-query to process your query? Is it the download of data> What is your bandwidth, what fraction do you use? Is it parsing of that data into memory?
Since you can make SQLAlchemy support big-query ( https://github.com/mxmzdlv/pybigquery ), you could try to use dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. In case big-query is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to
Pandas.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery
dataset. That way, network overhead is minimized.
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to
export. You can do select and store to the intermediate table.
Export the intermediate table to Google Cloud Storage, to JSON/Avro/Parquet format.
Download your exported data and load to your python app.
Besides downloading the data to your local machine, you can leverage the processing using PySpark and SparkSQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster and load the data from Google Cloud Storage to Spark, and do analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!

File structure of Apache Beam DynamicDestinations write to BigQuery

I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps including writing prepared data to Google Cloud Storage (as files in JSON format) and then loading the files to BQ via load jobs.
The problem is that export process has ended with out of memory error at the last step (loading files from Google Storage to BQ). But there are prepared files with all of the data in GCS remaining. There are 3 directories in BigQueryWriteTemp location:
And there a lot of files with not obvious names:
The question is what is the storage structure of the files? How can I match the files with tables (table names) they prepared for? How can I use the files to continue export process from load jobs step? Can I use some piece of Beam code for that?
These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:
This is an implementation detail that you can not rely on, in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
It is not possible to match these files with the tables they are intended for - that was stored in the internal state of the pipeline that has failed.
There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.
For debugging OOM issues, I suggest using a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice reducing its memory usage.

Data Loss while transferring from Cassandra to SQL using Talend

We are using Talend open studio for Data Transfer from Cassandra to SQL.
While reading data using Talend job, sometimes we face data loss. And we are unable to find any Error for the same. Even Cassandra System/ Debug Logs are showing very limited information. Is there any setting that we can configure on Cassandra or in Talend Open studio by which this Data loss can be avoided?
Note: We are dealing 5M records/Hour and we are missing approximately 1% of data loss. This is not a consistent issue but intermittent one.
In this sort of situation I have written some java routines within talend that post out to elasticsearch. Depending on the talend version you have, this comes with talend. And makes log based analysis very easy on large datasets using the Elastic and Kibana to do this. But the key is to log success and failures out using tjavarow using java routines which makes it far easier.

Beam - Handling failures during huge data load for bigquery

I have recently started with Apache beam. I am sure I am missing something here. I have a requirement to load from a very huge database to bigquery. These tables are huge. I have written sample beam jobs to load minimal rows from simple tables.
How would I able to load n number of rows from tables using JDBCIO? Is there anyway that i can load these data in batches as we do in conventional data migration jobs.?
Can I do batch read from a database and write in batches to bigquery?
Also i have seen that, the suggested approach to load the data to bigquery is by adding the files to the data store buckets. But, in automated environment, the requirement is to write it as a dataflow job to load from db and write it to bigquery. What should my design approach to solve this issue using apache beam?
Please help.!
It looks[1] like BigQueryIO will write batches of data if it comes from a bounded PCollection (otherwise it uses streaming inserts). It also appears to bound the size of each file and batch, so I don't think you'll need to do any manual batching.
I'd just read from your database via JDBCIO, transform it if needed, and write it to BigQueryIO.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java