What is the purpose of a U-SQL Reducer? - azure-data-lake

I haven't been able to find any documentation or samples for the use of Reducers in U-SQL.
How is a Reducer different from an Applier? From the function signatures, they both appear to receive one row at a time.
My use case is in the following question:
Azure Data Lake Analytics: Combine overlapping time duration using U-SQL
I have achieved this functionality with an Applier.
How can a reducer be more useful for this use case?

Documentation for the reducer is here: https://msdn.microsoft.com/en-US/library/azure/mt621336.aspx
It is basically a custom rowset-level aggregator, so it can process an ordered set of rows within a key.
In most cases, using Windowing expressions or user defined aggregators is preferable.
Can you share your solution on the other thread?
UPDATED: You can find a sample for a reducer here: https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
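For intuition only, here is a rough sketch in Python (not U-SQL) of the kind of per-key, ordered interval merging that a reducer performs over each key's rowset; the column names (user, start, end) are illustrative assumptions, not the actual schema from the linked post.

```python
from itertools import groupby
from operator import itemgetter

def merge_overlapping(rows):
    """Merge overlapping [start, end] intervals for one key, assuming rows sorted by start."""
    merged = []
    for row in rows:
        if merged and row["start"] <= merged[-1]["end"]:
            # Overlaps the previous interval: extend it.
            merged[-1]["end"] = max(merged[-1]["end"], row["end"])
        else:
            merged.append(dict(row))
    return merged

# Toy rowset: a reducer receives the rows for each key (here "user") as one group.
rows = [
    {"user": "a", "start": 1, "end": 5},
    {"user": "a", "start": 3, "end": 9},
    {"user": "a", "start": 12, "end": 15},
    {"user": "b", "start": 2, "end": 4},
]

rows.sort(key=itemgetter("user", "start"))
for user, group in groupby(rows, key=itemgetter("user")):
    print(user, merge_overlapping(group))
```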

Related

What is the best way to add new data to BigQuery through BigQuery API?

I'm using Django as my backend framework to connect my web app to BigQuery. I call the BigQuery API in views.py to fetch data from BigQuery. So far in my research, I have found two ways to add data to BigQuery from Django:
Using the insert_rows_json() method, where I just need the data in JSON format and it appends the rows to the BigQuery table.
Using the to_gbq() method, where the data needs to be in a pandas DataFrame and I can pass the parameter if_exists="replace" to update existing tables in BigQuery.
Currently, for adding new data, I would use method 1 and for other operations such as updating and deleting, I would use method 2.
My question: Is it better if I use method 2 for all of my operations, or should I just stick to using method 1 for adding new data and method 2 for other operations?
Or perhaps is there any other way that is more efficient, so the web app runs even faster?
Quoted from this doc:
For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. The tabledata.insertAll method is still fully supported.
You can try the BigQuery Storage Write API instead of the legacy insert_rows_json() method for streaming data into BigQuery. It has lower pricing and more robust features, including exactly-once delivery semantics. If you still need the legacy streaming insert_rows_json() method, it is still fully supported by Google Cloud.
Use the insert_rows_json() method for streaming data into BigQuery, because it is a recommended method and is maintained by Google Cloud.
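A minimal sketch of the streaming-insert path, assuming the google-cloud-bigquery client library and placeholder project, dataset, table, and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Placeholder fully-qualified table ID.
table_id = "my-project.my_dataset.events"

rows_to_insert = [
    {"user_id": 1, "action": "signup"},
    {"user_id": 2, "action": "login"},
]

# Streaming insert: appends rows; returns a list of per-row errors (empty on success).
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print("Encountered errors while inserting rows:", errors)
```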
You can also UPDATE and DELETE table data using DML queries via BigQuery client libraries. But, there are some limitations in BigQuery when doing UPDATE and DELETE queries immediately after streaming inserts.
Rows that were written to a table recently by using streaming (the tabledata.insertAll method or the Storage Write API) cannot be modified with UPDATE, DELETE, or MERGE statements. The recent writes are those that occur within the last 30 minutes. All other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements. The streamed data can take up to 90 minutes to become available for copy operations.
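For illustration, a hedged sketch of issuing UPDATE and DELETE DML through the same client library (subject to the streaming-buffer limitation quoted above); the table and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# DML runs as a regular query job; .result() waits for completion.
update_job = client.query(
    """
    UPDATE `my-project.my_dataset.events`
    SET action = 'logout'
    WHERE user_id = 2
    """
)
update_job.result()

delete_job = client.query(
    "DELETE FROM `my-project.my_dataset.events` WHERE user_id = 1"
)
delete_job.result()
```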
If you still want to use the to_gbq() method for updating and deleting the table, you can use it. Refer here to find the differences between the pandas-gbq and google-cloud-bigquery libraries.
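And a sketch of the DataFrame path via pandas-gbq, again with placeholder names; if_exists controls whether an existing table is replaced or appended to:

```python
import pandas as pd
import pandas_gbq

df = pd.DataFrame({"user_id": [1, 2], "action": ["signup", "login"]})

# if_exists can be "fail", "replace", or "append".
pandas_gbq.to_gbq(
    df,
    destination_table="my_dataset.events",
    project_id="my-project",
    if_exists="replace",
)
```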

Custom Dataflow Template - BigQuery to CloudStorage - documentation? general solution advice?

I am consuming a BigQuery table datasource. It is 'unbounded' as it is updated via a batch process. It contains session keyed reporting data from server logs where each row captures a request. I do not have access to the original log data and must consume the BigQuery table.
I would like to develop a custom Java based google Dataflow template using beam api with the goals of :
collating keyed session objects
deriving session level metrics
deriving filterable window level metrics based on session metrics, e.g., percentage of sessions with errors during previous window and percentage of errors per filtered property, e.g., error percentage per device type
writing the result as a formatted/compressed report to cloud storage.
This seems like a fairly standard use case? In my research thus far, I have not found a perfect example and have not been able to determine the best-practice approach for some basic requirements. I would very much appreciate any pointers: keywords to research, documentation, tutorials. Is my current thinking right, or do I need to consider other approaches?
Questions :
beam windowing and BigQuery I/O Connector - I see that I can specify a window type and size via beam api. My BQ table has a timestamp field per row. Am I supposed to somehow pass this via configuration or is it supposed to be automagic? Do I need to do this manually via a SQL query somehow? This is not clear to me.
fixed time windowing vs. session windowing functions - examples are basic and do not address any edge cases. My sessions can last hours. There are potentially hundreds of thousands of session keys per window. Would session windowing support this?
BigQuery vs. BigQueryClientStorage - The difference is not clear to me. I understand that BQCS provides a performance benefit, but do I have to store BQ data in a preliminary step to use this? Or can I simply query my table directly via BQCS and it takes care of that for me?
For number 1, you can simply use a withTimestamps-style transform before applying windowing; this assigns the event timestamp to each element. The Beam documentation has Python examples of this.
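A rough Python sketch of that pattern, assuming a recent Beam SDK, a placeholder table name, and a Unix-seconds timestamp field named event_ts (both names are assumptions):

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    timestamped = (
        p
        | "Read" >> beam.io.ReadFromBigQuery(table="my-project:my_dataset.requests")
        | "AttachTimestamps" >> beam.Map(
            # Each BigQuery row arrives as a dict; wrap it with its event-time
            # timestamp (assumed here to be a Unix-seconds field called "event_ts").
            lambda row: window.TimestampedValue(row, row["event_ts"])
        )
    )
```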
For number 2 the documentation states:
Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. [...] If data arrives after the minimum specified gap duration time, this initiates the start of a new window.
Also, per the Java documentation, you can only specify a minimum gap duration, not a maximum. This means that session windowing can easily support sessions lasting hours. After all, all it does is put a watermark on your data and keep the window alive until the gap duration passes without new data for that key.
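A small, self-contained sketch of session windowing with a 30-minute gap; the toy elements and the 30-minute gap are arbitrary choices for illustration:

```python
import apache_beam as beam
from apache_beam.transforms import window

# Toy elements: (session_key, payload, unix_seconds_timestamp).
events = [
    ("s1", {"path": "/a"}, 1000),
    ("s1", {"path": "/b"}, 1500),
    ("s2", {"path": "/c"}, 1200),
]

with beam.Pipeline() as p:
    sessions = (
        p
        | "Create" >> beam.Create(events)
        | "AttachTs" >> beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        | "SessionWindows" >> beam.WindowInto(window.Sessions(30 * 60))  # 30-minute gap
        | "GroupPerSession" >> beam.GroupByKey()
        | "Print" >> beam.Map(print)
    )
```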
For number 3, the difference between the BigQuery I/O connector and the BigQuery Storage API is that the latter (an experimental feature as of 01/2020) reads the stored data directly, without going through the usual BigQuery query/export path (BigQuery's storage layer is separate from its query engine). This means that with the Storage API, the documentation states:
you can't use it to read data sources such as federated tables and logical views
Also, there are different limits and quotas between the two methods, that you can find in the documentation link above.
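If you want the connector to take the Storage API (direct read) path, newer Beam SDKs expose it as a read method on the same connector; whether this option is available depends on your SDK version, so treat this as a sketch with a placeholder table name:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    rows = (
        p
        | "DirectRead" >> beam.io.ReadFromBigQuery(
            table="my-project:my_dataset.requests",               # placeholder table
            method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,   # BigQuery Storage API path
        )
    )
```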

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks. But I am only interested in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess data coming from sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday's data into a separate table and then remove it? (knowing that the tables are named using timestamps, which makes it possible to query them by day)
As far as I know both options mentioned are currently not possible in the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to solely filter the jsonPayload.
I would therefore suggest creating a feature request regarding your ask on the following public issue tracker link. Note that feature requests do not have an ETA as to when they'll be processed or whether they'll be implemented.

Why data cannot be deleted in Druid?

We are using Druid as a time series database, and we have a use case where some data in it needs to be deleted.
I know we cannot run a direct delete operation, and the technology itself is not designed for that.
What are the various ways in which this can be done?
The way this is typically handled is to reindex a segment with itself, applying a filter.
If you use the ingestSegmentFirehose, you can reindex data directly, and with the addition of a filter you can eliminate rows.
http://druid.io/docs/latest/ingestion/firehose.html#ingestsegmentfirehose
The way Druid stores data and works doesn't allow it to delete specific rows; instead, deletion is done at the segment level. So if you can partition your data such that rows you may need to delete in the future end up in their own segments, you can set your segments accordingly and fire a delete task.
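As a rough illustration (not taken from the answers above), segment-level deletion is typically done by marking segments unused and then submitting a kill task to the Overlord; the sketch below assumes the standard Overlord task endpoint and uses placeholder host, datasource, and interval values:

```python
import json
import requests

OVERLORD = "http://overlord-host:8090"  # placeholder Overlord address

# Kill task: permanently deletes unused segments of a datasource within an interval.
kill_task = {
    "type": "kill",
    "dataSource": "my_datasource",
    "interval": "2019-01-01/2019-02-01",
}

resp = requests.post(
    f"{OVERLORD}/druid/indexer/v1/task",
    data=json.dumps(kill_task),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)
```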
Another way is to use load rules to avoid loading certain segments or datasources based on rules, though the data still exists in deep storage.

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate google data/storage option to use as a location to stream raw, JSON events into.
The events are generated by users in response to very large email broadcasts, so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation for these events will probably only be around 1 KB each.
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts, however it seems to have no concept of auto-incrementing id columns and also suggests that its query model is best-suited for aggregate queries rather than narrow-result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-identifier and possibly a datestamp column makes this feel like a 'tabular' use case, and therefore I'm feeling the NoSQL options might also be inappropriate.
If anybody has a better suggestion I would love to get some input.
We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option?
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
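To make that concrete, a hedged sketch of the timestamp-as-offset idea: stream events carrying their own event timestamp into the table, then have consumers page forward by timestamp rather than by an auto-incrementing id; the table and field names are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.events_dataset.raw_events"  # ingestion-time partitioned table

# Stream raw JSON events, carrying an explicit event timestamp column.
client.insert_rows_json(table_id, [
    {"event_ts": "2020-01-01T12:00:00Z", "payload": '{"type": "open"}'},
])

# A consumer "replays" from a checkpoint by paging forward on event_ts.
checkpoint = "2020-01-01T00:00:00Z"
query = f"""
    SELECT event_ts, payload
    FROM `{table_id}`
    WHERE event_ts > TIMESTAMP('{checkpoint}')
    ORDER BY event_ts
    LIMIT 1000
"""
for row in client.query(query).result():
    print(row.event_ts, row.payload)
```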
As for Cloud Bigtable, as noted by Les in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See: https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as a key here, although you would want to do some work to, e.g., add a hash or other uniquifier to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes would go to the same server).
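Purely as an illustration (not from the comments), a row key that prefixes a small hash-derived salt before the timestamp spreads sequential writes across key ranges while keeping each event's key unique:

```python
import hashlib
import time
import uuid

NUM_SALTS = 20  # spread writes across ~20 key prefixes (tune for your cluster size)

def make_row_key(event_id: str) -> str:
    # Salt derived from a hash of the event id, so writes don't all hit one node.
    salt = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % NUM_SALTS
    # Millisecond timestamp keeps rows roughly time-ordered within each salt bucket.
    ts_ms = int(time.time() * 1000)
    return f"{salt:02d}#{ts_ms}#{event_id}"

print(make_row_key(uuid.uuid4().hex))
```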
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd