Specify number of partitions for output topic when using TopologyTestDriver - testing

I am using TopologyTestDriver when writing tests for my Kafka Streams application. Is there a way to configure the number of partitions for the InputTestTopic and OutputTestTopic?
The record process logic is quite complex in some of our applications, and it becomes even more complex when the input topics contain more than one partition. It would therefore be useful to thoroughly test the record process logic with an input topic containing more than one partition.

Related

De-duplicating BigQuery in an Asynchronous Real Time ETL Pipeline

Our Data Warehouse team is evaluating BigQuery as a Data Warehouse column store solution and had some questions regarding its features and best use. Our existing etl pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to on occasion replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real time streaming insert api with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
We had a couple questions and would appreciate answers to any of them. Any additional advice on using BigQuery in ETL architecture is also appreciated.
Is there a common implementation for de-duplication of real time
streaming beyond the use of the tableId?
If we attempt a delsert (via an delete followed by an insert using
the BigQuery API) will the delete always precede the insert, or do
the operations arrive asynchronously?
Is it possible to implement real time streaming into a staging
environment, followed by a scheduled merge into the destination
table? This is a common solution for other column store etl
technologies but we have seen no documentation suggesting its use in
BigQuery.
We let duplication happen, and write our logic and queries in a such way that every entity is a streamed data. Eg: a user profile is a streamed data, so there are many rows placed in time and when we need to pick the last data, we use the most recent row.
Delsert is not suitable in my opinion as you are limited to 96 DML statements per day per table. So this means you need to temp store in a table batches, for later to issue a single DML statement that deals with a batch of rows, and updates a live table from the temp table.
If you consider delsert, maybe it's easier to consider writing a query to only read most recent row.
Streaming followed by scheduled merge is possible. Actually you can rewrite some data in the same table, eg: removing dups. Or scheduled query batch content from temp table and write to live table. This is somehow the same as let duplicate happening and later deal within a query with it, also called re-materialization if you write to the same table.

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate google data/storage option to use as a location to stream raw, JSON events into.
The events are generated by users in response to very large email broadcasts so throughput could be very low one moment and up to ~25,000 events per-second for short periods of time. The JSON representation for these events will probably only be around 1kb each
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts, however it seems to have no concept of auto-incrementing id columns and also suggests that its query model is best-suited for aggregate queries rather than narrow-result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-identifier and possibly an datestamp column makes this feel like a 'tabular' use-case and therefore I'm feeling the NoSQL options might also be inappropriate
If anybody has a better suggestion I would love to get some input.
We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option?
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
As far as Cloud Bigtable, as noted by Les in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See:
You can consult this https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as a key here although you would want to do some work to e.g. add a hash or other unique-fier in order to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number all your writes wouldb be going to the same server).
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

RabbitMQ : One queue per message type, or post routing?

I use RabbitMQ as an integration distribution system, kind of ETL, pollers are querying tables from source databases, publish results on RabbitMQ, and results are consumed according their source (1 queue per source (app.) to be saved in another form.
I'm asking if it would be better to split queues per query AND source (app..), actually it's done only by source, and "postrouted" using a custom payload header.
The only advantage I see, that could be a defect, is that there are a same number of consumer as there are queries to do. But it could become a problem ...
Thanks.
I would say that one queue per query could get out of hand quickly in terms of managing and monitoring them.
I find it works well to have one queue per destination, and to then use the routing key to specify how things should be handled within your consumer code (i.e. for the type). That way, you get RabbitMQ to do the multiplexing for you, and the consumer code can run separately on the same messages on each destination point.
There are course, always many different ways, but I find that this tends to work well for ETL applications. If you have tons of destinations, perhaps you would want to move towards adding the destination to the routing key as well. If you don't have any ordering requirements (i.e. due to RDBMS Foreign Key Constraints), you could also consider adding multiple consumers to the same queue to improve throughput. (For cases where you do have such ordering requirements, that's where the one queue per destination and the multiplexing that provides proves to be especially useful.)

Trident or Storm topology that writes on Redis

I have a problem with a topology. I try to explain the workflow...
I have a source that emits ~500k tuples every 2 minutes, these tuples must be read by a spout and processed exatly once like a single object (i think a batch in trident).
After that, a bolt/function/what else?...must appends a timestamp and save the tuples into Redis.
I tried to implement a Trident topology with a Function that save all the tuples into Redis using a Jedis object (Redis library for Java) into this Function class, but when i deploy i receive a NotSerializable Exception on this object.
My question is.How can i implement a Function that writes on Redis this batch of tuples? Reading on the web i cannot found any example that writes from a function to Redis or any example using State object in Trident (probably i have to use it...)
My simple topology:
TridentTopology topology = new TridentTopology();
topology.newStream("myStream", new mySpout()).each(new Fields("field1", "field2"), new myFunction("redis_ip", "6379"));
Thanks in advance
(replying about state in general since the specific issue related to Redis seems solved in other comments)
The concepts of DB updates in Storm becomes clearer when we keep in mind that Storm reads from distributed (or "partitioned") data sources (through Storm "spouts"), processes streams of data on many nodes in parallel, optionally perform calculations on those streams of data (called "aggregations") and saves the results to distributed data stores (called "states"). Aggregation is a very broad term that just means "computing stuff": for example computing the minimum value over a stream is seen in Storm as an aggregation of the previously known minimum value with the new values currently processed in some node of the cluster.
With the concepts of aggregations and partition in mind, we can have a look at the two main primitives in Storm that allow to save something in a state: partitionPersist and persistentAggregate, the first one runs at the level of each cluster node without coordination with the other partitions and feels a bit like talking to the DB through a DAO, while the second one involves "repartitioning" the tuples (i.e. re-distributing them across the cluster, typically along some groupby logic), doing some calculation (an "aggregate") before reading/saving something to DB and it feels a bit like talking to a HashMap rather than a DB (Storm calls the DB a "MapState" in that case, or a "Snapshot" if there's only one key in the map).
One more thing to have in mind is that the exactly once semantic of Storm is not achieved by processing each tuple exactly once: this would be too brittle since there are potentially several read/write operations per tuple defined in our topology, we want to avoid 2-phase commits for scalability reasons and at large scale, network partitions become more likely. Rather, Storm will typically continue replaying the tuples until he's sure they have been completely successfully processed at least once. The important relationship of this to state updates is that Storm gives us primitive (OpaqueMap) that allows idempotent state update so that those replays do not corrupt previously stored data. For example, if we are summing up the numbers [1,2,3,4,5], the resulting thing saved in DB will always be 15 even if they are replayed and processed in the "sum" operation several times due to some transient failure. OpaqueMap has a slight impact on the format used to save data in DB. Note that those replay and opaque logic are only present if we tell Storm to act like that, but we usually do.
If you're interested in reading more, I posted 2 blog articles here on the subject.
http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
One last thing: as hinted by the replay stuff above, Storm is a very asynchronous mechanism by nature: we typically have some data producer that post event in a queueing system (e,g. Kafka or 0MQ) and Storm reads from there. As a result, assigning a timestamp from within storm as suggested in the question may or may not have the desired effect: this timestamp will reflect the "latest successful processing time", not the data ingestion time, and of course it will not be identical in case of replayed tuples.
Have you tried trident-state for redis. There is a code on github that does it already:
https://github.com/kstyrc/trident-redis.
Let me know if this answers your question or not.

Need Design & Implementation inputs on Cassandra based use case

I am planning to store high-volume order transaction records from a commerce website to a repository (Have to use cassandra here, that is our DB). Let us call this component commerceOrderRecorderService.
Second part of the problem is - I want to process these orders and push to other downstream systems. This component can be called batchCommerceOrderProcessor.
commerceOrderRecorderService & batchCommerceOrderProcessor both will run on a java platform.
I need suggestion on design of these components. Especially the below:
commerceOrderRecorderService
What is he best way to design the columns, considering performance and scalability? Should I store the entire order (complex entity) as a single JSON object. There is no search requirement on the order attributes. We can at least wait until they are processed by the batch processor. Consider - that a single order can contain many sub-items - at the time of processing each of which can be fulfilled differently. Designing columns for such data structure may be an overkill
What should be the key, given that data volumes would be high. 10 transactions per second let's say during peak. Any libraries or best practices for creating such transactional data in cassandra? Can TTL also be used effectively?
batchCommerceOrderProcessor
How should the rows be retrieved for processing?
How to ensure that a multi-threded implementation of the batch processor ( and potentially would be running on multiple nodes as well ) will have row level isolation. That is no two instance would read and process the same row at the same time. No duplicate processing.
How to purge the data after a certain period of time, while being friendly to cassandra processes like compaction.
Appreciate design inputs, code samples and pointers to libraries. Thanks.
Depending on the overall requirements of your system, it could be feasible to employ the architecture composed of:
Cassandra to store the orders, analytics and what have you.
Message queue - your commerce order recorder service would simple enqueue new order to the transactional and persistent queue and return. Scalability and performance should not be an issue here as you can easily achieve thousands of transactions per second with a single queue server. You may have a look at RabbitMQ as one of available choices.
Stream processing framework - you could read a stream of messages from the queue in a scalable fashion using streaming frameworks such as Twitter Storm. You could implement in Java than 3 simple pipelined processes in Storm:
a) Spout process that dequeues next order from the queue and pass it to
the second process
b) Second process called Bolt that inserts each next order to Cassandra and pass it to the third bolt
c) Third Bolt process that pushes the order to other downstream systems.
Such an architecture offers high-performance, scalability, and near real-time, low latency data processing. It takes into account that Cassandra is very strong in high-speed data writes, but not so strong in reading sequential list of records. We use Storm+Cassandra combination in our InnoQuant MOCA platform and handle 25.000 tx/second and more depending on hardware.
Finally, you should consider if such an architecture is not an overkill for your scenario. Nowadays, you can easily achieve 10 tx/second with nearly any single-box database.
This example may help a little. It loads a lot of transactions using the jmxbulkloader and then batches the results into files of a certain size to be transported else where. It multi-threaded but within the same process.
https://github.com/PatrickCallaghan/datastax-bulkloader-writer-example
Hope it helps. BTW it uses the latest cassandra 2.0.5.