I have a couple of questions about setting up Azure Digital Twins with LoRa and Sigfox devices whose data is encoded:
How do we get the iothubowner connection string to create the callback to the LoRa or Sigfox backend?
How do we deal with mandatory properties, especially HardwareId?
What is the best practice for decoding a message and then processing it, given that we have to cascade the processing: decoding, then normalization, then telemetry analytics (for example, monitoring room conditions)?
Here are the answers:
1. The IoT Hub connection string (iothubowner) will be exposed in the API in a couple of months.
2. For a device, the unique client-side identifier is HardwareId; we recommend using the device's MAC address for it. For SensorId.HardwareId you have several recommended options: Device.HardwareId + SensorName, just SensorName if it is unique per device, or just a GUID. It is important to set SensorId.HardwareId because this value must match the telemetry message header property DigitalTwins-SensorHardwareId in order for the UDF to kick off (a minimal sketch of setting this header follows this list). See https://learn.microsoft.com/en-us/azure/digital-twins/concepts-device-ingress#device-to-cloud-message
3. You'd have to create a matcher that associates the right UDF with the code to decode the byte array for a certain type of sensor. For example, if you have sensors of Type LoRa and then various DataTypes, you'd create a matcher against the Type property to match "LoRa" and then against the various DataTypes. For now, you would have to handle all of that in one UDF. In the future we might support chaining, so you could have a UDF for each step separately, but until then, it all goes in one.
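As a minimal, hedged sketch of point 2: sending a device-to-cloud message whose application property DigitalTwins-SensorHardwareId matches the sensor's SensorId.HardwareId, using the azure-iot-device Python SDK. The connection string, sensor ID, and payload are placeholders, and the exact set of required message properties should be checked against the linked doc.

# Sketch only: the DigitalTwins-SensorHardwareId property must match SensorId.HardwareId
# of the target sensor for the UDF to be triggered.
from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "<device connection string>"        # placeholder
SENSOR_HARDWARE_ID = "AA:BB:CC:DD:EE:FF_Temperature"    # e.g. Device.HardwareId + SensorName

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)

msg = Message('{"SensorValue": 21.5}')                  # payload from the LoRa/Sigfox backend
msg.custom_properties["DigitalTwins-SensorHardwareId"] = SENSOR_HARDWARE_ID

client.send_message(msg)
client.disconnect()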
Related
I'm developing a connector for a bank, and we're using the ISO 8583 protocol. Right now I'm setting the STAN (field 11) to a random number produced by a random generator, but sometimes I get number collisions. The question is: can I safely keep using this generator, or do I need to make the STAN a sequential number?
Thanks in advance.
The System Trace Audit Number (STAN, ISO 8583 field 11) can take different values along the way and is basically maintained per relationship within the transaction path. That is, it can stay the same, or the same transaction can carry many STANs over its path, but it SHOULD be the same between two endpoints, and whose STAN is used is usually controlled in settings.
For Example:
Terminal -> Terminal Driver -> Switch 1->Switch 2->Issuer
Say the STAN is assigned by the terminal driver; it then remains constant, at minimum, within each of the following relationships, though it may change from one relationship to the next:
Terminal Driver -> Switch 1
Switch 1 -> Switch 2
Switch 2 -> Issuer
Note that internally each system may keep its own unique STAN as well, but it needs to keep a unique STAN for each relationship, and the STAN shouldn't change between the request and the response, as it is needed for multi-part transactions (single pre-auth with multiple completions, and multiple pre-auths with a single completion) as well as for reversals and the like in Data Element 90.
Depends on your remote endpoint, but I've seen many requiring sequential numbers, and detecting duplicates.
Usually the STAN is a number that is increased for each request (a minimal sketch of such a generator follows below).
Random STAN generation is not a good fit for network message sequences.
Duplicate STANs can come from different sources, e.g. host clients or terminals.
The STAN by itself cannot be the only field used to detect a unique transaction request; it must be combined with other fields such as the RRN, Terminal ID, and Merchant ID.
See also "In ISO message, what's the use of stan and rrn ?"
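As a hedged illustration of the sequential approach mentioned above, here is a minimal Python sketch of a thread-safe STAN generator that stays within the 6-digit range of field 11. The starting value, persistence across restarts, and per-relationship scoping are assumptions you would adapt to your connector.

import threading

class StanGenerator:
    """Sketch: issues sequential 6-digit STANs (000001-999999) and wraps around.
    Assumes one instance per endpoint relationship; persisting the last value
    across restarts is omitted for brevity."""

    def __init__(self, start=1):
        self._lock = threading.Lock()
        self._value = start - 1

    def next_stan(self):
        with self._lock:
            self._value = self._value % 999999 + 1   # wraps 999999 -> 000001, never 000000
            return f"{self._value:06d}"

# Usage sketch:
gen = StanGenerator()
print(gen.next_stan())   # '000001'
print(gen.next_stan())   # '000002'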
According to the BigQuery docs, you can ensure data consistency by providing an insertId (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency). If it's not provided, BQ will try to ensure consistency based on internal IDs, on a best-effort basis.
Using the BQ API you can do that with the row_ids param (https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.insert_rows_json.html#google.cloud.bigquery.client.Client.insert_rows_json), but I can't find the equivalent in the Apache Beam Python SDK.
Looking into the SDK I have noticed that a unique_row_id property exists, but I really don't know how to pass my param to WriteToBigQuery().
How can I write into BQ (streaming) providing a row Id for deduplication?
Update:
If you use WriteToBigQuery, it will automatically create and attach a unique row ID called insertId for you, which is sent to BigQuery with each row. It's handled for you; you don't need to worry about it. :)
WriteToBigQuery is a PTransform, and its expand method calls BigQueryWriteFn.
BigQueryWriteFn is a DoFn, and its process method calls _flush_batch.
_flush_batch is a method that then calls the BigQueryWrapper.insert_rows method.
BigQueryWrapper.insert_rows creates a list of bigquery.TableDataInsertAllRequest.RowsValueListEntry objects, each containing the insertId and the row data as a JSON object.
The insertId is generated by calling the unique_row_id method, which returns a value consisting of a UUID4 concatenated with _ and an auto-incremented number.
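For illustration only, here is a hedged sketch of an ID in that style (a per-writer UUID4 prefix plus an auto-incremented suffix); the actual Beam implementation may differ in details, this just mirrors the idea described above.

import itertools
import uuid

# Sketch: generate IDs in the spirit of Beam's unique_row_id.
_prefix = str(uuid.uuid4())
_counter = itertools.count(1)

def unique_row_id_sketch():
    return f"{_prefix}_{next(_counter)}"

print(unique_row_id_sketch())   # e.g. '0f8e...-..._1'
print(unique_row_id_sketch())   # e.g. '0f8e...-..._2'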
In the current 2.7.0 code, there is this happy comment; I've also verified it is true :)
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1182
# Prepare rows for insertion. Of special note is the row ID that we add to
# each row in order to help BigQuery avoid inserting a row multiple times.
# BigQuery will do a best-effort if unique IDs are provided. This situation
# can happen during retries on failures.
* Don't use BigQuerySink
At least, not in its current form, as it doesn't support streaming. I guess that might change.
Original (non)answer
Great question, I also looked and couldn't find a certain answer.
Apache Beam doesn't appear to use the google.cloud.bigquery client SDK you've linked to; it has its own internally generated API client, but that one appears to be up to date.
I looked at the source:
The insertall method is there https://github.com/apache/beam/blob/18d2168ee71a1b1b04976717f0f955199bb00961/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py#L476
I also found the insertId mentioned:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_messages.py#L1707
So if you can make an insertAll call, it will use a TableDataInsertAllRequest and pass a RowsValueListEntry:
class TableDataInsertAllRequest(_messages.Message):
  """A TableDataInsertAllRequest object.

  Messages:
    RowsValueListEntry: A RowsValueListEntry object.
  """
The RowsValueListEntry message is where the insertId lives.
Here's the API docs for insert all
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
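For reference, a hedged sketch of the request body shape that tabledata.insertAll expects, with one insertId per row (field names per the API docs above; the values here are made up):

# Sketch of a tabledata.insertAll request body with per-row insertIds.
insert_all_body = {
    "kind": "bigquery#tableDataInsertAllRequest",
    "rows": [
        {"insertId": "order-123", "json": {"order_id": 123, "amount": 9.99}},
        {"insertId": "order-124", "json": {"order_id": 124, "amount": 4.50}},
    ],
}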
I will look some more at this because I don't see the WriteToBigQuery() exposing this.
I suspect that the 'BigQuery will remember this for at least one minute' promise is a pretty loose guarantee for de-duping. The docs suggest using Datastore if you need transactions. Otherwise, you might need to run SQL with window functions to de-dupe at query time, or run separate de-duping jobs on BigQuery.
Perhaps using the batch_size parameter of WriteToBigQuery(), and running a Combine (or, at worst, a GroupByKey) step in Dataflow, is a more stable way to de-dupe prior to writing (a hedged sketch of this follows below).
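A minimal sketch of that idea, assuming a streaming pipeline reading JSON events from Pub/Sub, a made-up event_id field as the de-dupe key, and placeholder project/table names; this is illustrative, not the library's own de-dupe mechanism:

import json

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

def parse_json(message_bytes):
    # Assumed payload format: UTF-8 JSON with an 'event_id' field.
    return json.loads(message_bytes.decode("utf-8"))

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(parse_json)
        | "KeyById" >> beam.Map(lambda row: (row["event_id"], row))
        | "Window" >> beam.WindowInto(FixedWindows(60))           # required before GroupByKey on a stream
        | "Group" >> beam.GroupByKey()
        | "TakeFirst" >> beam.Map(lambda kv: next(iter(kv[1])))   # keep one row per key per window
        | "Write" >> beam.io.WriteToBigQuery("my-project:my_dataset.my_table")  # assumes the table exists
    )

Running this for real would also need the streaming pipeline option enabled and, if the table has to be created, a schema passed to WriteToBigQuery; both are omitted for brevity.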
I am streaming messages from a Kafka topic using the KafkaIO API:
https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
The pipeline flow is as follows:
KafkaStream --> Decode Message using transformer -->Save to BigQuery
I decode the message and save it to BigQuery using BigQueryIO. I would like to know whether I need to use a window or not, for example:
Window.into[Array[Byte]](FixedWindows.of(Duration.standardSeconds(10)))
  .triggering(
    Repeatedly.forever(
      AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))
    )
  )
  .withAllowedLateness(Duration.standardSeconds(0))
  .discardingFiredPanes()
As per the documentation, a window is required when we are doing computations like GroupByKey, etc. Since I am just decoding the byte-array messages and storing them in BigQuery, it may not be required.
Please let me know whether I need to use a window or not.
There is an answer already posted to a similar question, where the data is being streamed from Pub/Sub. The main idea is that it is impossible to collect all of the elements of an unbounded PCollection, since new elements are constantly being added, and therefore one of two strategies must be implemented:
Windowing: you should first set a non-global windowing function.
Triggers: you can set up a trigger for an unbounded PCollection in such a way that it provides periodic updates on an unbounded dataset, even while the data in the subscription is still flowing.
It might also be necessary to enable streaming in the pipeline by setting the appropriate pipeline option, as follows (a Python sketch of the full windowing and trigger setup is shown after it):
pipeline_options.view_as(StandardOptions).streaming = True
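For completeness, a minimal sketch of the equivalent windowing, trigger, and streaming-option setup in the Beam Python SDK, mirroring the Java/Scala snippet above; the Pub/Sub source stands in for the decoded Kafka stream, and the topic and table names are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows

pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True   # required for unbounded sources

with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        # Placeholder source: in the real pipeline this would be the decoded Kafka stream.
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/decoded")
        | "ToRow" >> beam.Map(lambda b: {"payload": b.decode("utf-8")})
        | "Window" >> beam.WindowInto(
            FixedWindows(10),                                   # 10-second fixed windows
            trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | "Write" >> beam.io.WriteToBigQuery("my-project:my_dataset.my_table")  # assumes the table exists
    )

If you are only decoding and writing each element independently (no GroupByKey or Combine), the window and trigger above may not be strictly necessary, but enabling streaming mode is.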
I am trying to find the best-practice (efficient) way of storing a set of List objects against a ReportingDate key.
The list could be serialised as XML/DataContract or protobuf...
And given that some of the data could be big (for that slice of key):
I was wondering if there is any way of getting data from the Redis cache in an IEnumerable/streamed fashion? At the moment we use protobuf-net with a file-based cache, and we retrieve data into memory in a streamed fashion (we also have the option of selecting which props/fields we want in that T object, as protobuf allows us to do that).
Is there any way to force (after some inactivity) certain parts of the data to be offloaded from memory back into a file if they are not being used, but loaded up again when requested?
Thanks
It sounds like you want a sorted set - see https://redis.io/topics/data-types#sorted-sets. You would use the date as the score, perhaps in epoch time (since the score needs to be a number). SE.Redis supports all the operations you would expect for getting ranges of values (either positional ranges - the first 20 records, etc. - or absolute ranges based on the score - all items between two dates expressed in the same unit). Look at the methods starting with "SortedSet...".
The value can be binary, so protobuf-net is fine (you would serialize the value for each date separately). Just pass a byte[] as the value. You need to handle serialization separately to the redis library.
As for swapping data out: no. Redis has date-based expiration, but doesn't have hot and cold storage. It is either there, or it isn't. You could use scheduled tasks to purge or move data based on date ranges, again using any of the Z* (redis) or SortedSet* (SE.Redis) methods.
For the complete list of Z* operations, see: https://redis.io/commands#sorted_set. They should all be available in SE.Redis.
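To make the shape concrete, here is a hedged sketch of that sorted-set layout using Python's redis-py (the answer itself concerns SE.Redis in C#; Python is used here just to keep the examples in one language). The key name, epoch-second scoring, and protobuf-serialized bytes as members are assumptions.

import time
import redis   # redis-py client, standing in for SE.Redis

r = redis.Redis()

KEY = "reports"   # one sorted set holding all slices, scored by ReportingDate

def store_slice(reporting_date_epoch, serialized_blob):
    # Member = serialized payload bytes (e.g. protobuf), score = epoch seconds of the ReportingDate.
    r.zadd(KEY, {serialized_blob: reporting_date_epoch})

def slices_between(start_epoch, end_epoch):
    # Equivalent to ZRANGEBYSCORE / SortedSetRangeByScore: all items between two dates.
    return r.zrangebyscore(KEY, start_epoch, end_epoch)

# Usage sketch:
store_slice(int(time.time()), b"\x08\x01")                  # made-up protobuf bytes
recent = slices_between(int(time.time()) - 86400, "+inf")   # everything from the last day

One caveat: sorted-set members are unique, so two identical payloads would collapse into a single entry; in practice you might prefix each member with an ID before serializing.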
I've currently got a dataset which is something like:
channel1 = user1,user2,user3
channel2 = user4,user5,user6
(note- these are not actual names, the text is not a predictable sequence)
I would like to have the most optimized capability for the following:
1) Add user to a channel
2) Remove user from a channel
3) Get list of all users in several selected channels, maintaining knowledge of which channel they are in (in case it matters- this can also be simply checking whether a channel has any users or not without getting an actual list of them)
4) Detect if a specific user is in a channel (willing to forego this feature if necessary)
I'm a bit hung up on the fact that there seem to be only two ways of getting multiple keys at once:
A) Using regular keys and MGET key1 key2 key3
In this solution, each value would be a JSON string which can then be manipulated and queried client-side to add/remove/determine values. This has a couple of problems: firstly, another client could change the data while it's being processed (i.e. this solution is not atomic), and secondly, it's not easy to detect right away whether a channel contains a specific user, even though it is easy to detect whether a channel has any users at all (this is low priority, as stated above).
B) Using sets and SUNION
I would really like to use sets for this solution somehow (the above approach just seems wrong), but I cannot see how to query multiple sets at once while maintaining info about which set each member came from, or whether any of the sets in the union have 0 members (SUNION only gives me a final set of all the combined members).
Any solutions which can implement the above points 1-4 in optimal time and atomic operations?
EDIT: One idea which might work in my specific case is to store the channel name as part of the username and then use sets. Still, it would be great to get a more generic answer
Short answer: use sets + pipelining + MULTI/EXEC, or sets + Lua.
1) Add user to a channel
SADD command
2) Remove user from a channel
SREM command
3) Get list of all users in several selected channels
There are several ways to do it.
If you don't need strict atomicity, you just have to pipeline several SMEMBERS commands to retrieve all the sets in one round trip. If you are only interested in whether channels have users or not, you can replace SMEMBERS with SCARD.
If you need strict atomicity, you can pipeline a MULTI/EXEC block containing SMEMBERS or SCARD commands. The output of the EXEC command will contain all the results. This is the solution I would recommend.
An alternative (atomic) way is to call a server-side Lua script using the EVAL command. Lua script executions are always atomic. The script could take a number of channels as input parameters and build a multi-layer bulk reply to return the output.
4) Detect if a specific user is in a channel
SISMEMBER command - pipeline the calls if you need to check several users.
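Putting the pieces together, here is a hedged sketch of those operations with Python's redis-py (the commands are exactly the ones named above; channel and user names are placeholders). The first pipeline uses transaction=True, which wraps the queued commands in MULTI/EXEC for the strict-atomicity case.

import redis

r = redis.Redis()

# 1) Add a user to a channel / 2) remove a user from a channel.
r.sadd("channel:channel1", "user1")
r.srem("channel:channel1", "user3")

# 3) Fetch the members of several channels atomically in one round trip (MULTI/EXEC).
channels = ["channel1", "channel2"]
pipe = r.pipeline(transaction=True)
for name in channels:
    pipe.smembers(f"channel:{name}")    # or pipe.scard(...) if you only need counts
members_per_channel = dict(zip(channels, pipe.execute()))

# 4) Check whether specific users are in a channel (pipelined; no transaction needed).
pipe = r.pipeline(transaction=False)
for user in ["user1", "user5"]:
    pipe.sismember("channel:channel1", user)
print(pipe.execute())   # e.g. [True, False]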