Data Loss while transferring from Cassandra to SQL using Talend

We are using Talend Open Studio for data transfer from Cassandra to SQL.
While reading data with the Talend job, we sometimes see data loss, and we are unable to find any error for it. Even the Cassandra system/debug logs show very limited information. Is there any setting we can configure in Cassandra or in Talend Open Studio to avoid this data loss?
Note: We are handling 5M records/hour and losing approximately 1% of the data. This is not a consistent issue but an intermittent one.

In this sort of situation I have written Java routines within Talend that post out to Elasticsearch. Depending on the Talend version you have, Elasticsearch ships with Talend, and it makes log-based analysis of large datasets very easy with Elasticsearch and Kibana. The key is to log both successes and failures from a tJavaRow component using Java routines, which makes tracking down missing records far easier.
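As a rough illustration of that pattern (not tied to any particular Talend version; the Elasticsearch URL, index name, and field names below are hypothetical), a small helper class callable from a tJavaRow component could post one status document per row:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Hypothetical helper a Talend Java routine could call from tJavaRow
    // to record per-row transfer status in Elasticsearch.
    public class TransferAudit {

        // Elasticsearch endpoint and index name are assumptions; adjust to your cluster.
        private static final String ES_URL = "http://localhost:9200/talend-transfer-audit/_doc";

        public static void logRow(String rowKey, boolean success, String error) {
            try {
                // Naive JSON assembly for the sketch; a JSON library would be safer.
                String json = String.format(
                    "{\"rowKey\":\"%s\",\"success\":%b,\"error\":\"%s\",\"ts\":%d}",
                    rowKey, success, error == null ? "" : error.replace("\"", "'"),
                    System.currentTimeMillis());

                HttpURLConnection conn = (HttpURLConnection) new URL(ES_URL).openConnection();
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "application/json");
                conn.setDoOutput(true);
                try (OutputStream os = conn.getOutputStream()) {
                    os.write(json.getBytes(StandardCharsets.UTF_8));
                }
                conn.getResponseCode(); // force the request; ignore the response body
                conn.disconnect();
            } catch (Exception e) {
                // Never let audit logging break the main transfer flow.
                System.err.println("audit log failed: " + e.getMessage());
            }
        }
    }

In the tJavaRow itself you would then call something like TransferAudit.logRow(input_row.id, true, null) on the success path and pass the exception message on the reject path; comparing the success counts in Kibana against the SQL row counts makes the missing 1% much easier to localize.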

Related

Accessing Spark Streaming Data pipelines. What option works best?

I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics and creating a streaming dataframe, which is then cleaned and printed to the console. I need this data to be integrated with existing Python scripts that do all the data operations with Pandas. I have considered the following options:
Write streaming data to local storage (e.g. Hive tables).
Use the Spark Structured Streaming ForeachBatch sink.
I want to mention that the data is to be read at a certain interval, and there will be a real-time data dashboard built on this data in the future.
Please advise which will be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save the data to Hive each time before accessing the newly streamed data through your Python scripts, the newly added Hive partitions need to be refreshed (MSCK REPAIR TABLE) each time as well:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of using Hive for the real-time scenarios you mention:
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Spark Structured Streaming, on the other hand, looks like the better choice for a near-real-time experience:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
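As a minimal sketch of the foreachBatch option (written in Java here; the same foreachBatch hook exists in PySpark 2.4+), each micro-batch arrives as an ordinary bounded Dataset that can be handed to any batch-style sink your Python scripts can read from. The Kafka broker address, topic name, and output paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath:

    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class ForeachBatchExample {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("foreach-batch-sink")
                    .getOrCreate();

            // Hypothetical Kafka source; broker address and topic are placeholders.
            Dataset<Row> stream = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "events")
                    .load()
                    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

            // Each micro-batch is a regular (bounded) Dataset, so it can be written
            // anywhere a batch job could write: Parquet files, a JDBC table, etc.
            StreamingQuery query = stream.writeStream()
                    .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
                        @Override
                        public void call(Dataset<Row> batch, Long batchId) {
                            batch.write()
                                 .mode("append")
                                 .parquet("/tmp/stream-output"); // downstream scripts read from here
                        }
                    })
                    .option("checkpointLocation", "/tmp/stream-checkpoint")
                    .start();

            query.awaitTermination();
        }
    }

Because each batch is written like a normal batch job, the existing Pandas scripts can simply pick up the Parquet output (or a table written the same way) on their own schedule.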

File structure of Apache Beam DynamicDestinations write to BigQuery

I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps including writing prepared data to Google Cloud Storage (as files in JSON format) and then loading the files to BQ via load jobs.
The problem is that the export process ended with an out-of-memory error at the last step (loading the files from Google Cloud Storage into BQ), but the prepared files with all of the data remain in GCS. There are 3 directories in the BigQueryWriteTemp location:
And there are a lot of files with non-obvious names:
The question is: what is the storage structure of these files? How can I match the files with the tables (table names) they were prepared for? How can I use the files to continue the export process from the load-jobs step? Can I use some piece of Beam code for that?
These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:
This is an implementation detail that you cannot rely on, in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
It is not possible to match these files with the tables they are intended for - that was stored in the internal state of the pipeline that has failed.
There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.
For debugging OOM issues, I suggest using a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice on reducing its memory usage.
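For reference, a hedged sketch of how those two flags can be supplied as pipeline options when launching the job (the project, temp location, and bucket paths are placeholders):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class HeapDumpEnabledPipeline {
        public static void main(String[] args) {
            // All values below are placeholders; the two debug flags are the ones from the answer.
            DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(
                    "--runner=DataflowRunner",
                    "--project=my-project",
                    "--tempLocation=gs://my_bucket/temp/",
                    "--dumpHeapOnOOM=true",
                    "--saveHeapDumpsToGcsPath=gs://my_bucket/heap-dumps/")
                .as(DataflowPipelineOptions.class);

            Pipeline pipeline = Pipeline.create(options);
            // ... add the same transforms as the failing export job here ...
            pipeline.run();
        }
    }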

Beam - Handling failures during huge data load for BigQuery

I have recently started with Apache Beam, and I am sure I am missing something here. I have a requirement to load data from a very large database into BigQuery. These tables are huge. I have written sample Beam jobs to load a minimal number of rows from simple tables.
How would I be able to load n rows from these tables using JdbcIO? Is there any way I can load this data in batches, as we do in conventional data migration jobs?
Can I do a batch read from a database and write in batches to BigQuery?
Also, I have seen that the suggested approach to load data into BigQuery is by adding the files to Cloud Storage buckets. But in an automated environment, the requirement is to write it as a Dataflow job that loads from the DB and writes to BigQuery. What should my design approach be to solve this using Apache Beam?
Please help!
It looks[1] like BigQueryIO will write batches of data if it comes from a bounded PCollection (otherwise it uses streaming inserts). It also appears to bound the size of each file and batch, so I don't think you'll need to do any manual batching.
I'd just read from your database via JdbcIO, transform it if needed, and write it to BigQueryIO, roughly as sketched below.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
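A minimal sketch of that shape, assuming a MySQL source; the table, column names, BigQuery destination, and schema are placeholders, not your actual data model:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.sql.ResultSet;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class JdbcToBigQuery {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Placeholder BigQuery schema matching the query below.
            TableSchema schema = new TableSchema().setFields(Arrays.asList(
                    new TableFieldSchema().setName("id").setType("INTEGER"),
                    new TableFieldSchema().setName("name").setType("STRING")));

            p.apply("ReadFromJdbc", JdbcIO.<TableRow>read()
                    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.jdbc.Driver", "jdbc:mysql://db-host:3306/sourcedb")
                            .withUsername("user")
                            .withPassword("password"))
                    .withQuery("SELECT id, name FROM source_table")
                    .withCoder(TableRowJsonCoder.of())
                    // Map each JDBC result row to a BigQuery TableRow.
                    .withRowMapper((ResultSet rs) -> new TableRow()
                            .set("id", rs.getLong("id"))
                            .set("name", rs.getString("name"))))
             .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.target_table")
                    .withSchema(schema)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

            p.run().waitUntilFinish();
        }
    }

Because JdbcIO produces a bounded PCollection, BigQueryIO will use file loads rather than streaming inserts here, and it handles splitting the data into files and load batches internally.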

Very low throughput by using JdbcIO on Google Dataflow

I'd like to load data into a Google Cloud SQL instance via Google Dataflow.
Since there is no built-in sink for Cloud SQL, I decided to use org.apache.beam.sdk.io.jdbc.JdbcIO.
But the throughput into Cloud SQL is very low (about 6 records/sec).
I suspected that the Cloud SQL instance's specs were too low, but there was no improvement after upgrading it.
In the Dataflow logs there are many entries like this:
Proposing dynamic split of work unit my-project;2017-06-27_02_58_19-14077185378147382467;6703504927792172410 at
{"fractionConsumed":0.9669782519340515}
Rejecting split request because custom reader returned null residual source.
What is happening, and how can I improve the performance?
It's resolved!
When generating the connection string, I added the following:
JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver", "jdbc:mysql://google/mydatabase?cloudSqlInstance=myproject:region:instance-name&socketFactory=com.google.cloud.sql.mysql.SocketFactory&rewriteBatchedStatements=true")
Adding "rewriteBatchedStatements=true" made it work.
The throughput improved to about 2000 records/sec!
Note: this probably only works with MySQL.
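For context, a hedged sketch of how that configuration might sit inside a JdbcIO.write() (the table, columns, and element type are placeholders). With rewriteBatchedStatements=true the MySQL driver can rewrite the batched inserts JdbcIO issues into multi-row statements, which is where the speedup comes from:

    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Assumes a PCollection<KV<Integer, String>> named `rows` already exists in the pipeline;
    // the destination table and columns are placeholders.
    public class WriteToCloudSql {
        static void write(PCollection<KV<Integer, String>> rows) {
            rows.apply(JdbcIO.<KV<Integer, String>>write()
                    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.jdbc.Driver",
                            "jdbc:mysql://google/mydatabase"
                                + "?cloudSqlInstance=myproject:region:instance-name"
                                + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"
                                + "&rewriteBatchedStatements=true"))
                    .withStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")
                    // Bind each element's fields to the prepared statement parameters.
                    .withPreparedStatementSetter((element, statement) -> {
                        statement.setInt(1, element.getKey());
                        statement.setString(2, element.getValue());
                    }));
        }
    }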
Rejecting split request because custom reader returned null residual source.
Whatever custom source you implemented doesn't appear to support dynamic rebalancing.
I suspected that the Cloud SQL instance's specs were too low, but there was no improvement after upgrading it.
Are you sure it's the throughput to Cloud SQL that is the issue? Have you measured the performance of your source and proven that the write to Cloud SQL, not the read, is the bottleneck?
I'd like to load data into a Google Cloud SQL instance via Google Dataflow.
Generally, I wouldn't recommend this. Cloud SQL is a single-machine database, so I suspect you won't get much benefit, and it may even be a performance negative, from using a horizontally scalable method like Dataflow. You should be able to ingest into Cloud SQL just as fast using a single VM instance to load the data.

BigQuery best approach for ETL (external tables and views vs Dataflow)

CSV files get uploaded to an FTP server (to which I don't have SSH access) on a daily basis, and I need to generate weekly data that merges those files with transformations. That data would go into a history table in BQ and a CSV file in GCS.
My approach goes as follows:
Create a Linux VM and set up a cron job that syncs the files from the FTP server to a GCS bucket (I'm using GCSFS)
Use an external table in BQ for each category of CSV files
Create views with complex queries that transform the data
Use another cron job to create a table with the historic data and also the CSV file on a weekly basis.
My idea is to remove as many intermediate processes as I can and to make the implementation as easy as possible, including using Dataflow for the ETL, but I have some questions first:
What's the problem with my approach in terms of efficiency and money?
Is there anything Dataflow can provide that my approach can't?
Any ideas about other approaches?
BTW, I ran into one problem that might be fixable by parsing the CSV files myself rather than using external tables: invalid characters, like the null char. If I parse the files myself I can get rid of them, whereas with an external table there is a parsing error.
Your ETL will probably be simplified by a Google Dataflow batch pipeline. Upload your files to the GCS bucket. For the transformation, use pipeline transforms to strip null values and invalid characters (or whatever your need is). On the transformed dataset, apply your complex queries: group it by key, aggregate it (sum or combine), and, if you need side inputs, Dataflow also provides the ability to merge other datasets into the current one. Finally, the transformed output can be written to BQ, or you can write your own custom implementation for writing the results.
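To make the grouping and aggregating part concrete, here is a minimal sketch under assumed inputs: `lines` is a PCollection<String> of already-cleaned CSV lines, with a category in column 0 and a numeric amount in column 2 (a hypothetical layout):

    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Aggregates cleaned CSV lines into per-category totals for the weekly output.
    public class WeeklyAggregation {
        static PCollection<KV<String, Double>> totalsPerCategory(PCollection<String> lines) {
            return lines
                .apply("ToKeyValue", MapElements
                    .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.doubles()))
                    .via((String line) -> {
                        String[] f = line.split(",", -1);
                        // Column 0 = category, column 2 = amount (hypothetical CSV layout).
                        return KV.of(f[0].trim(), Double.parseDouble(f[2].trim()));
                    }))
                .apply("SumPerCategory", Sum.doublesPerKey());
        }
    }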
So Dataflow gives your solution a lot of flexibility: you can branch the pipeline and work differently on each branch with the same dataset. Regarding cost, if you run your batch job with three workers, which is the default, it should not be very costly; and if you just want to concentrate on your business logic and not worry about the rest, Dataflow is pretty interesting and very powerful if used wisely.
Dataflow helps you keep everything on a single plate and manage it effectively. Go through its pricing and determine whether it is the best fit for you (your problem is completely solvable with Dataflow). Your approach is not bad, but it needs extra maintenance of all those pieces.
Hope this helps.
Here are a few thoughts.
If you are working with a very low volume of data, your approach may work just fine. If you are working with more data and need several VMs, Dataflow can automatically scale the number of workers your pipeline uses up and down to help it run more efficiently and save costs.
Also, is your Linux VM always running, or does it only spin up when you run your cron job? A batch Dataflow job only runs when it is needed, which also helps to save on costs.
In Dataflow you could use TextIO to read each line of the file in and add your custom parsing logic.
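A rough sketch of that idea (the bucket path, BigQuery table, schema, and CSV layout are all placeholders), with the null-character cleanup you mentioned done inside the parsing DoFn:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class CsvToBigQuery {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Placeholder schema for a two-column CSV.
            TableSchema schema = new TableSchema().setFields(Arrays.asList(
                    new TableFieldSchema().setName("category").setType("STRING"),
                    new TableFieldSchema().setName("value").setType("STRING")));

            p.apply("ReadCsv", TextIO.read().from("gs://my-bucket/csv/*.csv"))
             .apply("ParseAndClean", ParDo.of(new DoFn<String, TableRow>() {
                 @ProcessElement
                 public void processElement(ProcessContext c) {
                     // Strip null chars (the characters the external table choked on).
                     String line = c.element().replace("\u0000", "");
                     String[] fields = line.split(",", -1);
                     if (fields.length < 2) {
                         return; // silently drop malformed rows in this sketch
                     }
                     c.output(new TableRow()
                             .set("category", fields[0].trim())
                             .set("value", fields[1].trim()));
                 }
             }))
             .apply("WriteToBq", BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.history_table")
                    .withSchema(schema)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

            p.run();
        }
    }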
You mention that you have a cron job which puts the files into GCS. Dataflow can read from GCS, so it would probably be simplest to keep that process around and have your dataflow job read from GCS. Otherwise you would need to write a custom source to read from your FTP server.
Here are some useful links:
https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling