Export Pub/Sub data to object storage using Scio - spotify-scio

I am trying to export Cloud Pub/Sub streams to Cloud Storage, as described in the Spotify post "Reliable export of Cloud Pub/Sub streams to Cloud Storage" or the Google post "Simple backup and replay of streaming events using Cloud Pub/Sub, Cloud Storage, and Cloud Dataflow".
Pub/Sub creates an unbounded PCollection (or SCollection in Scio), but saveAsTextFile requires a bounded collection.
Is there any way to overcome this?

The new dynamic IO module should support saving unbounded collections to files.
However, note that the approach in the Spotify article doesn't use Dataflow, since it has a lot of custom logic for SLA/bucketing/reliability reasons. So YMMV.
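For reference, here is a minimal sketch of what that could look like with Scio's dynamic file IO (com.spotify.scio.io.dynamic). The subscription, output path, window size, and destination function are all illustrative, and the exact method names and signatures may differ between Scio versions:

```scala
import com.spotify.scio._
import com.spotify.scio.io.dynamic._
import org.joda.time.Duration

object PubsubToGcs {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.pubsubSubscription[String](args("subscription")) // unbounded SCollection[String]
      .withFixedWindows(Duration.standardMinutes(10))   // apply windowing before writing files
      .saveAsDynamicTextFile(args("output")) { _ =>
        // Destination function: here everything goes to a single sub-directory,
        // but a timestamp-based function would give date-partitioned output.
        "events"
      }

    sc.run()
  }
}
```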

Related

Difference between BigQuery API and BigQuery Storage API?

What is the difference between the BigQuery API Client Libraries and BigQuery Storage API Client Libraries?
In the Overview section of the BigQuery Storage Read API documentation, it says:
The BigQuery Storage Read API provides fast access to BigQuery-managed storage by using an rpc-based protocol.
Is the BigQuery Storage API just faster because it uses RPC?
Yes, you are correct: it is fast because it uses RPC. Also, as stated in this documentation,
The Storage Read API does not provide functionality related to managing BigQuery resources such as datasets, jobs, or tables.
Basically, you would want to use the BigQuery Storage API on top of the BigQuery API when your operation requires scanning large volumes of managed data, since it exposes high-throughput data reading for consumers. Otherwise, the BigQuery API is enough for interactions with core resources such as datasets, tables, jobs, and routines.
For further reading, see the documentation on some of the key features of the Storage Write API and the Storage Read API.
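To make the distinction concrete, here is a rough sketch (Java BigQuery client used from Scala) of the kind of work that goes through the standard BigQuery API: listing datasets and running query jobs are resource and job operations that the Storage Read API does not cover. The public dataset in the query is just an example:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration}

object BigQueryApiExample {
  def main(args: Array[String]): Unit = {
    // The standard BigQuery API manages resources (datasets, tables, jobs) and runs queries.
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Listing datasets is only possible through the BigQuery API, not the Storage Read API.
    bigquery.listDatasets().iterateAll().forEach(d => println(d.getDatasetId))

    // Running a query is also a BigQuery API operation (it creates a query job).
    val config = QueryJobConfiguration.newBuilder(
      "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10"
    ).build()
    bigquery.query(config).iterateAll().forEach(row => println(row.get("name").getStringValue))
  }
}
```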

Pull data from HTTP request API to Google Cloud

I have an app that sends me data from an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I'm not able to find out how to do it properly.
So far I have used Node.js on my own server to get the data using POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to Pub/Sub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery; there you can use either streaming inserts (which incur costs) or batch load jobs (which are free).
You can also log to Stackdriver and, from there, filter and export to BigQuery (direct options for this already exist in GCP; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and perform batch load jobs by calling the BigQuery API directly. Note that there is a limit on the number of batch loads you can run per day against a particular table (1,000 per day).
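As an illustration of the first option, here is a rough sketch of a streaming insert with the Java BigQuery client used from Scala; the dataset, table, and column names are hypothetical and would need to match your own schema:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}
import scala.jdk.CollectionConverters._

object StreamingInsertExample {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Hypothetical dataset/table; the JSON payload from the API is mapped
    // to a row whose keys match the table's column names.
    val tableId = TableId.of("my_dataset", "events")
    val row = Map[String, AnyRef]("user_id" -> "123", "payload" -> """{"a": 1}""").asJava

    val response = bigquery.insertAll(
      InsertAllRequest.newBuilder(tableId).addRow(row).build()
    )
    if (response.hasErrors) println(s"Insert errors: ${response.getInsertErrors}")
  }
}
```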
There is a page in the official documentation that lists all the possibilities of loading data to BigQuery.
For simplicity, you can just send data from your local data source using the Google Cloud client libraries for BigQuery. Here is a guide on how to do that, as well as a relevant code example.
But my honest recommendation is to send the data to Google Cloud Storage and, from there, load it into BigQuery. That way the whole process will be more stable.
You can check all the options from the first link that I've posted and choose what you think that will fit best with your workflow.
Keep in mind the limitations of this process.
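If you go the Cloud Storage route, a batch load job can also be created directly through the client libraries. The sketch below (Java BigQuery client used from Scala) assumes a hypothetical bucket and a table loaded from newline-delimited JSON:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

object GcsLoadJobExample {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // Hypothetical bucket and table: load newline-delimited JSON files from GCS.
    val config = LoadJobConfiguration
      .newBuilder(
        TableId.of("my_dataset", "events"),
        "gs://my-bucket/events/*.json",
        FormatOptions.json()
      )
      .build()

    val job = bigquery.create(JobInfo.of(config))
    job.waitFor() // block until the load job completes
  }
}
```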

Using Google Cloud ecosystem vs building your own microservice architecture

Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files to Cloud Storage, then Dataflow enriches, transforms and aggregates the data, and the results are finally stored in BigQuery or Cloud SQL.
I have a couple of questions to help me have a better understanding.
Suppose you are building a big data product using the Google services.
When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before it is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has been completely processed and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing; for example, one might parse a file and another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google services ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Store?
Have you looked at Cloud Pub/Sub (a topic subscription/publication service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
How are front-end clients expected to know when a file has been completely processed and is ready? Do they need to poll data from somewhere?
For example, if you are using the Dataflow API to process the file, your Dataflow pipeline can publish progress and status updates to a topic. Your front end (e.g. on App Engine) just needs to subscribe to that topic and receive the updates.
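As a rough sketch of that idea, a pipeline step (or any backend component) could publish a status message to a topic with the Pub/Sub Java client used from Scala, and the front end would subscribe to the same topic. The project, topic, attribute, and file names below are made up for illustration:

```scala
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

object StatusPublisher {
  def main(args: Array[String]): Unit = {
    // Hypothetical project/topic; the frontend subscribes to the same topic.
    val topic = TopicName.of("my-project", "file-processing-status")
    val publisher = Publisher.newBuilder(topic).build()
    try {
      val message = PubsubMessage.newBuilder()
        .putAttributes("fileId", "upload-123")
        .setData(ByteString.copyFromUtf8("""{"status": "DONE"}"""))
        .build()
      publisher.publish(message).get() // block until the publish succeeds
    } finally {
      publisher.shutdown()
    }
  }
}
```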
1)
Dataflow does not offer inspection of intermediate results. If a frontend wants progress information about an element being processed in a Dataflow pipeline, custom progress reporting will need to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting records to it at various points in the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. Then a frontend can query for those results. (I would avoid overwriting old rows personally, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step.
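A minimal sketch of that branching, assuming Scio on Dataflow and an illustrative progress table and schema (the BigQuery sink method name varies slightly across Scio versions):

```scala
import com.google.api.services.bigquery.model.{TableFieldSchema, TableRow, TableSchema}
import com.spotify.scio.bigquery._
import com.spotify.scio.values.SCollection
import scala.jdk.CollectionConverters._

object ProgressReporting {
  // Each element is both recorded in a progress table and passed on to the
  // next processing step. Table name and schema are illustrative.
  def withProgress(parsed: SCollection[String]): SCollection[String] = {
    val schema = new TableSchema().setFields(List(
      new TableFieldSchema().setName("element_id").setType("STRING"),
      new TableFieldSchema().setName("status").setType("STRING")
    ).asJava)

    // Side branch: write a progress row per element.
    parsed
      .map(id => new TableRow().set("element_id", id).set("status", "PHASE-1 DONE"))
      .saveAsBigQuery("my-project:monitoring.progress", schema)

    // Main branch: continue the pipeline with the same elements.
    parsed
  }
}
```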
2)
Is your microservice architecture using a "pipes and filters" pipeline-style approach? I.e. each service reads from a source (Kafka/RabbitMQ), writes data out, and then the next service consumes it?
Probably the best approach is to set up a few different Dataflow pipelines, output their results using a Pub/Sub or Kafka sink, and have the next pipeline consume that Pub/Sub sink (see the sketch below). You may also wish to sink the results to another location like BigQuery/GCS, so that you can query them again if you need to.
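For example, with Scio the hand-off between two pipelines might look roughly like this; the subscription and topic names are illustrative, parseAndEnrich is a stand-in for whatever the first pipeline does, and the exact Pub/Sub sink method differs between Scio versions:

```scala
import com.spotify.scio._

object EnrichPipeline {
  // Hypothetical transformation applied by the first pipeline.
  def parseAndEnrich(raw: String): String = raw.trim.toUpperCase

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Pipeline A: read raw events, transform them, and publish the results to
    // a Pub/Sub topic that the next pipeline (a separate Dataflow job) consumes.
    sc.pubsubSubscription[String]("projects/my-project/subscriptions/raw-events")
      .map(parseAndEnrich)
      .saveAsPubsub("projects/my-project/topics/enriched-events")

    sc.run()
  }
}
```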
There is also an option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be set up with several Cloud Functions.

Loading data from a Google Persistent Disk into BigQuery?

What's the recommended way of loading data into BigQuery that is currently located on a Google Persistent Disk? Are there any special tools or best practices for this particular use case?
Copy the data to GCS (Google Cloud Storage), then point BigQuery to load it from GCS.
There's currently no direct connection between a persistent disk and BigQuery. You could send the data straight to BigQuery with the bq CLI, but that makes everything slower if you ever need to retry.

TBs of data need to be uploaded to BigQuery

We have TBs of data that need to be uploaded to BigQuery. I remember one of the videos from Felipe Hoffa mentioning that we can send a hard drive overnight to Google and they can take care of it. Can the Google BigQuery team provide more info on whether this is possible?
This is the Offline Import mechanism from Google Cloud Storage. You can read about it here:
https://developers.google.com/storage/docs/early-access
Basically, you'd use this mechanism to import to Google Cloud Storage, then run BigQuery import jobs to import to BigQuery from there.
Depending on how many TB of data you are importing, you might just be better off uploading directly to Google Cloud Storage. Gsutil and other tools can do resumable uploads.
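If you prefer to do the upload programmatically rather than with gsutil, the Cloud Storage client's WriteChannel uploads in chunks over a resumable session, so very large files don't need to fit in memory. A rough sketch (Java client used from Scala, with hypothetical bucket, object, and file names):

```scala
import com.google.cloud.storage.{BlobInfo, StorageOptions}
import java.io.FileInputStream
import java.nio.ByteBuffer

object ResumableUpload {
  def main(args: Array[String]): Unit = {
    val storage = StorageOptions.getDefaultInstance.getService

    // Hypothetical bucket/object; the WriteChannel streams the file in chunks
    // using a GCS resumable upload session.
    val blob = BlobInfo.newBuilder("my-bucket", "exports/data-part-0001.csv").build()
    val writer = storage.writer(blob)
    val in = new FileInputStream("/data/data-part-0001.csv").getChannel
    try {
      val buffer = ByteBuffer.allocate(8 * 1024 * 1024)
      while (in.read(buffer) >= 0) {
        buffer.flip()
        writer.write(buffer)
        buffer.clear()
      }
    } finally {
      writer.close()
      in.close()
    }
  }
}
```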
If you are talking about 100s of TB or more, you might want to talk to a Google Cloud Support person about your scenarios in detail. They may be able to help you optimize your usage of BigQuery and Cloud Storage.