I'd like to load data into a Google Cloud SQL instance via Google Dataflow.
Since there is no built-in sink for Cloud SQL, I decided to use org.apache.beam.sdk.io.jdbc.JdbcIO.
However, the write throughput into Cloud SQL is very low (about 6 records/sec).
I suspected that the Cloud SQL instance was underpowered, but upgrading it did not improve anything.
The Dataflow logs contain many entries like the following:
Proposing dynamic split of work unit my-project;2017-06-27_02_58_19-14077185378147382467;6703504927792172410 at
{"fractionConsumed":0.9669782519340515}
Rejecting split request because custom reader returned null residual source.
What is happening, and how can I improve the performance?
Resolved!
When building the connection string, add the following:
JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver", "jdbc:mysql://google/mydatabase?cloudSqlInstance=myproject:region:instance-name&socketFactory=com.google.cloud.sql.mysql.SocketFactory&rewriteBatchedStatements=true")
Adding rewriteBatchedStatements=true did the trick.
The throughput improved to about 2,000 records/sec!
Note: rewriteBatchedStatements is a MySQL JDBC driver option, so this probably only helps when using MySQL.
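For context, here is a minimal sketch of a complete JdbcIO write step using that connection string. It follows the standard JdbcIO.write() pattern; the table, columns, credentials, and input data are made up for illustration.

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class WriteToCloudSql {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(Arrays.asList(KV.of(1, "alice"), KV.of(2, "bob")))
            .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of())))
     .apply(JdbcIO.<KV<Integer, String>>write()
         .withDataSourceConfiguration(
             JdbcIO.DataSourceConfiguration.create(
                 "com.mysql.jdbc.Driver",
                 "jdbc:mysql://google/mydatabase"
                     + "?cloudSqlInstance=myproject:region:instance-name"
                     + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"
                     + "&rewriteBatchedStatements=true")  // the option that fixed the throughput
                 .withUsername("user")        // hypothetical credentials
                 .withPassword("password"))
         .withStatement("INSERT INTO mytable (id, name) VALUES (?, ?)")  // hypothetical table
         .withPreparedStatementSetter(
             (element, query) -> {
               query.setInt(1, element.getKey());
               query.setString(2, element.getValue());
             }));

    p.run().waitUntilFinish();
  }
}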
Rejecting split request because custom reader returned null residual source.
Whatever custom source you implemented doesn't appear to support dynamic rebalancing.
I suspected that the Cloud SQL instance was underpowered, but upgrading it did not improve anything.
Are you sure it's the throughput to Cloud SQL that is the issue? Have you measured the performance of your source to confirm where the bottleneck actually is?
I'd like to load data into a Google Cloud SQL instance via Google Dataflow.
Generally, I wouldn't recommend this. Cloud SQL is a single-machine database, so I suspect you don't get much benefit, and it may even be a performance negative, from using a horizontally scalable method like Dataflow. You should be able to ingest into Cloud SQL just as fast by loading the data from a single VM instance.
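As a rough illustration of the single-VM approach, here is a hedged sketch that bulk-loads a CSV into Cloud SQL over plain JDBC with LOAD DATA LOCAL INFILE. The file path, table, and credentials are made up, and allowLoadLocalInfile must be permitted on both the driver and the instance.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SingleVmBulkLoad {
  public static void main(String[] args) throws Exception {
    // Hypothetical Cloud SQL connection string; note allowLoadLocalInfile=true on the driver side.
    String url = "jdbc:mysql://google/mydatabase"
        + "?cloudSqlInstance=myproject:region:instance-name"
        + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"
        + "&allowLoadLocalInfile=true";

    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         Statement stmt = conn.createStatement()) {
      // One bulk statement instead of millions of single-row inserts.
      stmt.execute(
          "LOAD DATA LOCAL INFILE '/tmp/mytable.csv' "
              + "INTO TABLE mytable "
              + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
    }
  }
}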
Related
I have some app data currently stored in Splunk, but I'm looking for a way to feed the Splunk data directly into BigQuery. My goal is to analyze the app data in BigQuery and perhaps build Data Studio dashboards on top of it.
I know there are a lot of third-party connectors that could help with this, but I'm looking for a solution that uses features of Splunk or BigQuery to connect the two, without relying on third-party connectors.
Based on your comment that you're interested in egressing data from Splunk into BigQuery with custom software, I would suggest using the REST API each tool exposes on its respective side.
You don't indicate whether this is a one-time or a recurring task, and that may affect where you want the software performing this operation to run. If it's a one-time thing and you have a decent internet connection yourself, you may just want to write a console application that runs from your own machine to perform the migration. If it's a recurring operation, you might instead look at the various "serverless" hosting options out there (e.g. Azure Functions, Google Cloud Functions, or AWS Lambda). Beyond the development experience itself, note that you may have to pay an egress bandwidth cost on top of the normal service charges for each of these.
Beyond that, you need to decide whether it makes more sense to do a bulk export from Splunk to some external file that you load into Google Drive and then import into BigQuery, or whether to download the records as paged data over HTTPS so you can perform some ETL on top of them (e.g. replace nulls with empty strings, update Datetime types to match Google's exacting standards, etc.). If you go the latter route, it looks as though this is the documentation you'd use on the Splunk side, and on the BigQuery side you can use either Google's newer, higher-performance Storage Write API or their legacy streaming API to ingest the data. Either option has SDKs across a variety of languages (e.g. C#, Go, Ruby, Node.js, Python, etc.), though only the legacy streaming API supports plain HTTP REST calls.
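To make the "paged data over HTTPS" route concrete, here is a hedged Java sketch that pulls results from Splunk's export endpoint and streams them into BigQuery with the legacy insertAll API. The Splunk host, credentials, search query, dataset, table, and single-column mapping are all assumptions for illustration; check Splunk's REST API documentation for the exact export parameters.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class SplunkToBigQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical Splunk management endpoint and search.
    URL exportUrl = new URL("https://splunk.example.com:8089/services/search/jobs/export"
        + "?search=" + URLEncoder.encode("search index=app_data", "UTF-8")
        + "&output_mode=json");
    HttpURLConnection conn = (HttpURLConnection) exportUrl.openConnection();
    conn.setRequestProperty("Authorization", "Basic "
        + Base64.getEncoder().encodeToString("user:password".getBytes(StandardCharsets.UTF_8)));

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("my_dataset", "app_events");  // hypothetical dataset/table

    InsertAllRequest.Builder batch = InsertAllRequest.newBuilder(table);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // The export endpoint streams one JSON object per line; a real implementation would parse
        // each line, map its fields onto the BigQuery schema, and flush in batches (e.g. every 500 rows).
        Map<String, Object> row = new HashMap<>();
        row.put("raw_event", line);
        batch.addRow(row);
      }
    }

    InsertAllResponse response = bigquery.insertAll(batch.build());
    if (response.hasErrors()) {
      System.err.println("Some rows failed to insert: " + response.getInsertErrors());
    }
  }
}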
Beyond that, don't forget OAuth2 to authenticate on both sides of the operation, though this is typically abstracted away by the SDKs offered by either party, so it's rarely something you have to handle directly.
I expect to have thousands of sensors sending telemetry data at 10 FPS, with around 1 KB of binary data per frame, through IoT Core, meaning I'll receive it via Pub/Sub. I'd like to get that data into BigQuery; no processing is needed.
Since Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
Question is, what's my best alternative?
I've thought about a Cloud Run service running an Express app that accepts the data from Pub/Sub, uses a global variable to accumulate around 500 rows in RAM, and then dumps them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain anything from accumulating rows, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method, as you have already mentioned. The insert() method streams data one row at a time, regardless of how many rows you accumulate.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python, and Go (in preview) client libraries.
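If you can move the ingestion path to one of those languages, appending to a table's default stream looks roughly like this with the Java client's JsonStreamWriter. The project, dataset, table, and schema are made up, and the exact builder signature may differ between client versions, so treat this as a sketch rather than canonical usage.

import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import java.util.Base64;
import org.json.JSONArray;
import org.json.JSONObject;

public class StorageWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical project, dataset, table, and schema.
    TableName table = TableName.of("my-project", "telemetry", "frames");
    TableSchema schema = TableSchema.newBuilder()
        .addFields(TableFieldSchema.newBuilder()
            .setName("device_id")
            .setType(TableFieldSchema.Type.STRING)
            .setMode(TableFieldSchema.Mode.REQUIRED)
            .build())
        .addFields(TableFieldSchema.newBuilder()
            .setName("payload")
            .setType(TableFieldSchema.Type.BYTES)
            .build())
        .build();

    // Appending to the table's default stream gives at-least-once delivery,
    // comparable to the legacy streaming API but cheaper.
    try (JsonStreamWriter writer = JsonStreamWriter.newBuilder(table.toString(), schema).build()) {
      JSONArray rows = new JSONArray();
      rows.put(new JSONObject()
          .put("device_id", "sensor-42")
          .put("payload", Base64.getEncoder().encodeToString("frame-bytes".getBytes())));

      ApiFuture<AppendRowsResponse> future = writer.append(rows);
      future.get();  // blocking keeps the sketch short; handle the future asynchronously in production
      System.out.println("Rows appended to the default stream.");
    }
  }
}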
Batch Ingestion
If your requirement is to load large, bounded data sets that don't have to be processed in real time, prefer batch loading. BigQuery batch load jobs are free: you only pay for storing and querying the data, not for loading it. Refer to the quotas and limits for batch load jobs here. Some more key points on batch load jobs, quoted from this article, follow below; a small sketch of submitting a load job with the Java client comes after them.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.
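As mentioned above, here is a minimal sketch of submitting a batch load job from Cloud Storage with the BigQuery Java client. The dataset, table, bucket, and file format are made up for illustration.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BatchLoadSketch {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical target table and GCS URI.
    TableId tableId = TableId.of("telemetry", "frames");
    String sourceUri = "gs://my-bucket/frames/*.json";

    LoadJobConfiguration config = LoadJobConfiguration.newBuilder(tableId, sourceUri)
        .setFormatOptions(FormatOptions.json())  // newline-delimited JSON
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();

    // Load jobs are free; you only pay for storage and queries.
    Job job = bigquery.create(JobInfo.of(config));
    job = job.waitFor();
    if (job.getStatus().getError() != null) {
      System.err.println("Load failed: " + job.getStatus().getError());
    } else {
      System.out.println("Load completed.");
    }
  }
}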
I need to extract data from MongoDB and load it into a database. The MongoDB deployment is a replica set with two secondaries, and I read the data from a secondary.
I set up a simple transformation (MongoDB Input into a Dummy step) to test the speed, but the MongoDB read speed is usually only around 200 rows/s.
I think that is too slow.
Can anyone share their experience? What is a typical speed for the MongoDB Input step, and how can I optimize it?
Thanks!
I haven't worked with MongoDB lately, but I'd suggest checking a few things:
Are you running the transformation from a local machine or a remote server? If remote, check the network speed while the transformation runs, since you may be pulling the data over a slow network link.
We usually configure a Carte server and run the transformation on the database server to improve performance.
Check the commit size; you may need to tune it for the stream you are working with. Increasing it usually helps.
See if this helps: https://help.pentaho.com/Documentation/6.1/0L0/0Y0/090/020#OptimizeDS
I'm developing an app that tracks a mobile device live, and I need some advice. The application must send the location to a web service, which in turn records the received data in a database.
What would be, in your opinion, the best way to store the location values?
I'm new to big data, and I'm afraid simple SQL requests won't be able to handle the work properly. I imagine that if there are a lot of users and each user sends a request every second, I'll have issues with the database.
Any advice? Thank you very much.
I think you could have a look at geospatial queries in MongoDB, if you choose to go ahead with MongoDB.
Refer here
And here
The design of the database will depend on the nature of your queries (essentially the read and write patterns).
Worth having a look into
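If you do go with MongoDB, a geospatial query looks roughly like this with the Java driver. The database, collection, field names, and coordinates are made up, and the location field needs a 2dsphere index.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.geojson.Point;
import com.mongodb.client.model.geojson.Position;
import java.util.Arrays;
import org.bson.Document;

public class GeoQuerySketch {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> locations =
          client.getDatabase("tracking").getCollection("locations");

      // Geospatial queries require a 2dsphere index on the GeoJSON location field.
      locations.createIndex(Indexes.geo2dsphere("location"));

      // Insert a sample GeoJSON point (longitude first, then latitude).
      locations.insertOne(new Document("userId", "user-1")
          .append("location", new Document("type", "Point")
              .append("coordinates", Arrays.asList(-6.59, 34.02))));

      // Find all points within 500 meters of a reference position.
      for (Document doc : locations.find(
          Filters.near("location", new Point(new Position(-6.59, 34.02)), 500.0, 0.0))) {
        System.out.println(doc.toJson());
      }
    }
  }
}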
Working at Cintric, we landed on using Elasticsearch. We process billions of location points in real time and provide advanced analytics to our users.
We started with MongoDB and ran into a lot of trouble, eventually leading to a painful migration.
Our stack currently has mobile devices dump location updates into AWS Kinesis, which are then processed by AWS Lambda handlers, and then dumped into elasticsearch. We're able to serve, process and store 300 million requests/month for only a few hundred dollars/month. Analytics for our dashboard add additional cost but for your needs I would highly recommend checking out your options on AWS.
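For a sense of what the middle of that pipeline can look like, here is a hedged sketch of an AWS Lambda handler in Java that reads location records from Kinesis and indexes them into Elasticsearch. The domain, index name, and record format are assumptions, and the Elasticsearch client API varies between versions.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import java.nio.charset.StandardCharsets;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class LocationIngestHandler implements RequestHandler<KinesisEvent, Void> {
  // Hypothetical Elasticsearch endpoint.
  private final RestHighLevelClient es = new RestHighLevelClient(
      RestClient.builder(new HttpHost("my-es-domain.example.com", 443, "https")));

  @Override
  public Void handleRequest(KinesisEvent event, Context context) {
    for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
      // Assume each Kinesis record carries one location update as a JSON document.
      String json = StandardCharsets.UTF_8.decode(record.getKinesis().getData()).toString();
      try {
        es.index(new IndexRequest("locations").source(json, XContentType.JSON),
            RequestOptions.DEFAULT);
      } catch (Exception e) {
        context.getLogger().log("Failed to index record: " + e.getMessage());
      }
    }
    return null;
  }
}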
I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs, not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built-in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it's now possible to use a micro-batching tactic in a streaming pipeline. Whichever way you decide to go, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are two different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one, you need to include withTriggeringFrequency and withNumFileShards. For the triggering frequency, in my experience it's better to use minutes, and the exact value will depend on the amount of throughput; if you receive quite a lot, try to keep it small, as I have seen "stuck" errors when increasing it too much. The number of shards mostly affects your GCS billing: too many shards will create more files per table per x minutes.
If your input data size is not that big, streaming inserts can work really well and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. You can also add a withFailedInsertRetryPolicy such as InsertRetryPolicy.retryTransientErrors() so that no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible).
You can check your jobs in BigQuery and validate that everything is working. Keep in mind the BigQuery job policies (I think it's 1,000 jobs per table) when you are trying to define the triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).