best practice for ingesting Pub/Sub payloads into BigQuery - google-bigquery

I am looking for best practices on how to ingest various payloads coming from a single subscription (one topic) into BigQuery. The data can be ingested in raw format but needs to be structured into a more tabular form and mapped to a published layer for business consumption. I have been googling options but need guidance on designing a real-time Pub/Sub ingestion-to-publication solution.
Should I simply dump all the data in raw form, keeping a structure similar to the message, as a string or JSON, or should I structure the attributes into columns/structs/arrays? What are the pros and cons?
Does it make sense to split the subscription into multiple filtered subscriptions and map them to multiple tables?

Case 1: You know the possible columns (JSON keys) you can receive
I would structure the Pub/Sub message into columns (and if your JSON is pretty nested, take advantage of BigQuery's nested structures).
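For Case 1, a minimal sketch of what a nested schema could look like with the google-cloud-bigquery Python client; the project, dataset, table, and field names are hypothetical and only illustrate RECORD/REPEATED fields:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical schema: top-level scalars plus a nested, repeated "items" record.
    schema = [
        bigquery.SchemaField("event_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
        bigquery.SchemaField(
            "items",
            "RECORD",
            mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INT64"),
            ],
        ),
    ]

    table = bigquery.Table("my_project.my_dataset.events", schema=schema)
    table = client.create_table(table)  # creates the table with nested columns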
Case 2: You don't know the possible columns you can receive or suspect they will frequently change
Separate it into columns in your target BigQuery tables as much as you can, and dump the rest as a raw string in a column that you can always parse later with BigQuery SQL.
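For Case 2, here is a sketch of the "parse later" pattern, assuming a hypothetical raw table my_dataset.raw_events with a single STRING column payload holding the Pub/Sub message body:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Pull typed columns out of the raw JSON string at query time.
    sql = """
    SELECT
      JSON_VALUE(payload, '$.user_id')        AS user_id,
      JSON_VALUE(payload, '$.event_type')     AS event_type,
      TIMESTAMP(JSON_VALUE(payload, '$.ts'))  AS event_ts,
      payload                                 AS raw_payload  -- keep the original around
    FROM `my_dataset.raw_events`
    """

    for row in client.query(sql).result():
        print(row.user_id, row.event_type, row.event_ts)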
Also, in case you haven't seen it already: you can now push messages straight from Pub/Sub to BigQuery (without going through Dataflow, which is great at parsing Pub/Sub messages but can be costly): https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics
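If you go the direct Pub/Sub-to-BigQuery route, a rough sketch of creating a BigQuery subscription with the Python client might look like the following; the project, topic, subscription, and table names are placeholders, so check the current client docs before relying on this:

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    topic_path = subscriber.topic_path("my-project", "my-topic")
    subscription_path = subscriber.subscription_path("my-project", "my-bq-subscription")

    # Messages published to the topic are written straight into the BigQuery table.
    bigquery_config = pubsub_v1.types.BigQueryConfig(
        table="my-project.my_dataset.raw_events",
        write_metadata=True,  # also store message_id, publish_time, attributes, etc.
    )

    with subscriber:
        subscriber.create_subscription(
            request={
                "name": subscription_path,
                "topic": topic_path,
                "bigquery_config": bigquery_config,
            }
        )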

Related

API which provides data from Elasticsearch and not SQL

I have a system with large datasets where I want quick searches, and Elasticsearch is suitable for this. The data resides in SQL and is synced to ES, so there is an obvious small delay in this sync.
Some consumers of this data can work with slightly stale data, for example the UI API that end users use to browse the dataset. A delay of 3-4 seconds is acceptable there, so an API handler that reads from ES is perfect.
Then there are consumers of this data (bots) that want to work with real-time data. For almost the same requirements, should I create another API, just like the one for the UI consumer, that gets its data from SQL?
What is the usual best practice followed here? I'm assuming this is a very common use case.
You should probably stick to creating just a single API and use a query string parameter to decide which of the two data sources to use. This will result in less code to maintain.
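A minimal sketch of that idea using FastAPI; the endpoint, parameter name, and the two query functions are hypothetical, just to show routing on a source parameter:

    from fastapi import FastAPI

    app = FastAPI()

    def query_elasticsearch(q):
        # Hypothetical: search the (slightly stale, ~3-4 s behind) ES index.
        return []

    def query_sql(q):
        # Hypothetical: query the source-of-truth SQL database.
        return []

    @app.get("/items")
    def get_items(q: str, source: str = "search"):
        # One endpoint, one codebase; callers pick freshness with ?source=realtime
        # (SQL) or the default ?source=search (Elasticsearch).
        if source == "realtime":
            return query_sql(q)
        return query_elasticsearch(q)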

data exchange format to use with Apache Kafka that provides schema validation

What is the best message format to use with Apache Kafka so that producers and consumers can define a contract, validate data, and serialize/deserialize it? For example, in XML we have XSD, but in JSON there is no universal schema. I read about using Apache Avro, but I'm not sure how fast it will be, as I can't afford more than 5 to 6 ms for schema validation and deserialization. Any inputs please?
We will be processing thousands of transactions per second and the SLA for each transaction is 150 ms, so I am looking for something very fast.
Avro is often quoted as being slow(er) and adding overhead compared to other binary formats, but I believe that applies to the use case of not using a Schema Registry; with a Registry, the schema is excluded from the actual payload.
Alternatively, you can use Protobuf or Thrift if you absolutely want a schema; however, I don't think serializers for those formats are as readily available, from what I've seen. Plus, the schemas need to be passed between your clients if they are not committed to a central location.
I can confidently say that Avro should be fine for starting out, though, and the Registry is definitely useful, and not just for Kafka use cases.
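A rough sketch of the Avro-plus-Schema-Registry approach with the confluent-kafka Python client; the broker and registry URLs, topic, and schema are placeholders, so treat this as an outline rather than a drop-in config:

    from confluent_kafka import SerializingProducer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.avro import AvroSerializer

    # Hypothetical contract for the messages on this topic.
    schema_str = """
    {
      "type": "record",
      "name": "Transaction",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }
    """

    registry = SchemaRegistryClient({"url": "http://localhost:8081"})
    avro_serializer = AvroSerializer(registry, schema_str)

    producer = SerializingProducer(
        {
            "bootstrap.servers": "localhost:9092",
            "value.serializer": avro_serializer,  # validates and serializes against the schema
        }
    )

    producer.produce(topic="transactions", value={"id": "tx-1", "amount": 9.99})
    producer.flush()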

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV).
JOB 2 -- Use the stored transformed data from the delimited file produced by JOB 1 as the source, transform it for BigQuery, and send it.
I am using a delimited text/CSV file as intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary store? Will a relational database help, are delimited files good enough, or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a secure cloud architecture.
Please share your views on this.
In data warehousing architecture, it's usually good practice to keep the staging layer persistent. Among other things, this gives you the ability to trace data lineage back to the source, lets you reload your final model from the staging point when business rules change, and gives a full picture of the transformation steps the data went through all the way from landing to reporting.
I'd also consider changing your design to keep the staging layer persistent under its own dataset in BigQuery rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not end-user reporting, you will mostly be paying only for storage.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend storing the files you're using to load the table rather than deleting them. Sooner or later there will be questions or failures in your final report, and you'll likely need to trace back to the source for investigation.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
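A sketch of the Cloud Storage-to-BigQuery load step with the Python client; the bucket, file, and table names here are made up for illustration:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,      # assuming the file has a header row
        autodetect=True,          # or pass an explicit schema=[...]
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(
        "gs://my-staging-bucket/transformed/orders_2024-01-01.csv",
        "my_project.staging.orders",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish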
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially allows you to reshape the data (e.g. convert it to arrays), filter out columns/rows, and perform the transformation in one single SQL statement.

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate Google data/storage option to use as a location to stream raw JSON events into.
The events are generated by users in response to very large email broadcasts, so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation of these events will probably only be around 1 KB each.
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts, however it seems to have no concept of auto-incrementing id columns and also suggests that its query model is best-suited for aggregate queries rather than narrow-result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-identifier and possibly a datestamp column makes this feel like a 'tabular' use case, and therefore I'm feeling the NoSQL options might also be inappropriate.
If anybody has a better suggestion I would love to get some input.
We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option?
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
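A minimal sketch of the streaming-insert side with the Python client, assuming a hypothetical ingestion-time partitioned table my_dataset.raw_events with event_ts and payload columns; BigQuery assigns the partition from the ingestion time, and the timestamp column is only your own ordering key:

    import datetime
    import json
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my_project.my_dataset.raw_events"

    event = {"user": "abc", "action": "open"}
    rows = [
        {
            # Your own ordering key; BigQuery picks the partition by ingestion time.
            "event_ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "payload": json.dumps(event),
        }
    ]

    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")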
As far as Cloud Bigtable goes, as noted by Les in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See: https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as a key here, although you would want to do some work, e.g. add a hash or other uniquifier, to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes would be going to the same server).
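To illustrate that row-key advice, here is a sketch using the google-cloud-bigtable client where a short hash prefix is placed in front of the timestamp so sequential writes spread across nodes; the project, instance, table, and column-family names are placeholders:

    import datetime
    import hashlib
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("events")

    def make_row_key(event_id: str) -> bytes:
        # The hash prefix spreads otherwise sequential timestamps across the
        # key space, avoiding the hotspotting described above.
        prefix = hashlib.md5(event_id.encode()).hexdigest()[:4]
        ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H%M%S%f")
        return f"{prefix}#{ts}#{event_id}".encode()

    row = table.direct_row(make_row_key("evt-123"))
    row.set_cell("event", "payload", b'{"user": "abc", "action": "open"}')
    row.commit()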
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

incremental updates documentation is not clear enough

I have a database where I need to keep up with changes on Wikidata, and while I was looking for ways to do it, I found these three:
RSS
API Call
Socket.IO
I would like to know if there are other ways, and which one is the best or recommended by Wikidata.
The answer depends on how up to date you need to keep your database.
As up to date as possible
If you need to keep your database as up to date with Wikidata as possible, then you will probably want to use a combination of the solutions that you have found.
Socket.IO will provide you with a stream of what has changed, but will not necessarily give you all of the information that you need.
(Note: there is an IRC stream that would allow you to do the same thing)
Based on the data provided by the stream you can then make calls to the Wikidata API retrieving the new data.
Of course this could result in lots of API calls, so make sure you batch them, and also don't retrieve updates immediately in case lots of changes occur in a row.
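For the API-call side, here is a sketch of batching entity IDs into wbgetentities requests (up to 50 IDs per call for normal users) with the requests library; the IDs are just examples:

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def fetch_entities(entity_ids):
        """Fetch changed entities in batches of 50 (the API limit for normal users)."""
        results = {}
        for i in range(0, len(entity_ids), 50):
            batch = entity_ids[i:i + 50]
            resp = requests.get(
                WIKIDATA_API,
                params={
                    "action": "wbgetentities",
                    "ids": "|".join(batch),
                    "format": "json",
                },
                timeout=30,
            )
            resp.raise_for_status()
            results.update(resp.json().get("entities", {}))
        return results

    # Example: IDs collected from the change stream.
    entities = fetch_entities(["Q42", "Q64"])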
Daily or Weekly
As well as the 3 options you have listed above you also have the Database dumps!
https://www.wikidata.org/wiki/Wikidata:Database_download
The JSON & RDF dumps are generally recommended. The JSON dump contains the data exactly as it is stored. These dumps are made weekly.
The XML dumps are not guaranteed to have the same JSON format as the JSON dumps, as they use the internal serialization format. However, daily XML dumps are provided.
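For the dump route, a sketch of streaming through the weekly JSON dump line by line without loading it all into memory; the dump is one large JSON array with one entity per line, and the file name follows the usual latest-all.json.gz naming, though you should check the download page for current paths:

    import gzip
    import json

    # Weekly JSON dump: one entity per line inside a big JSON array.
    with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip()
            if line in ("[", "]"):  # skip the surrounding array brackets
                continue
            entity = json.loads(line.rstrip(","))  # non-final lines end with a comma
            print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))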