Flink Streaming: Unexpected characters in serialized String messages

My stream produces records of type Tuple2<String,String>.
The .toString() output is (usr12345,{"_key":"usr12345","_temperature":46.6}),
where the key is usr12345 and the value is {"_key":"usr12345","_temperature":46.6}.
The .print() on the stream outputs the value correctly:
(usr12345,{"_key":"usr12345","_temperature":46.6})
But when I write the stream to Kafka, the key becomes " usr12345" (with a whitespace character at the beginning) and the value becomes ({"_key":"usr12345","_temperature":46.6}
Notice the whitespace at the beginning of the key and the left parenthesis at the beginning of the value.
Very strange. Why might this happen?
Here is the serialization code:
TypeInformation<String> resultType = TypeInformation.of(String.class);
KeyedSerializationSchema<Tuple2<String, String>> schema =
    new TypeInformationKeyValueSerializationSchema<>(resultType, resultType, env.getConfig());
FlinkKafkaProducer010.FlinkKafkaProducer010Configuration flinkKafkaProducerConfig =
    FlinkKafkaProducer010.writeToKafkaWithTimestamps(
        stream,
        "topic",
        schema,
        kafkaProducerProperties);

The TypeInformationKeyValueSerializationSchema serializes data with Flink's custom serializers, which means that the result has to be interpreted as binary data. Flink's String serializer writes the length of the String followed by the encoded characters.
I assume that you deserialize the Kafka topic with a plain String deserializer. For the key, the serialized length is interpreted as a whitespace character. For the value, the length is interpreted as '('.
Try a different serializer that writes the key and value as plain strings, or use a compatible deserializer on the consuming side.
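The effect can be reproduced without Flink. The sketch below is a hypothetical re-implementation of a length-prefixed string format, not Flink's own code: it assumes the prefix is the string length plus one (zero being reserved for null) written as a variable-length integer, which is consistent with the symptoms in the question. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class LengthPrefixDemo {

    // Assumed format: (length + 1) as a variable-length integer,
    // followed by each character in a variable-length encoding.
    static byte[] lengthPrefixed(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int len = s.length() + 1; // assumption: 0 is reserved for null
        while (len >= 0x80) {
            out.write((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.write(len);
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            while (c >= 0x80) {
                out.write((c & 0x7F) | 0x80);
                c >>>= 7;
            }
            out.write(c);
        }
        return out.toByteArray();
    }

    // What a plain String deserializer on the consumer side would do.
    static String readAsPlainString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "usr12345";                                        // 8 chars, prefix byte 9 (tab)
        String value = "{\"_key\":\"usr12345\",\"_temperature\":46.6}"; // 39 chars, prefix byte 40 '('

        System.out.println("[" + readAsPlainString(lengthPrefixed(key)) + "]");
        System.out.println("[" + readAsPlainString(lengthPrefixed(value)) + "]");

        // Plain UTF-8 bytes carry no prefix, so the round trip is clean.
        System.out.println("[" + readAsPlainString(key.getBytes(StandardCharsets.UTF_8)) + "]");
    }
}
```

Under that assumption the key gains a leading tab (byte 9) and the value gains a leading '(' (byte 40), while the plain UTF-8 bytes read back unchanged.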

Related

protocol-buffers: string or byte sequence of the exact length

Looking at https://developers.google.com/protocol-buffers/docs/proto3#scalar it appears that string and bytes types don't limit the length? Does it mean that we're expected to specify the length of transmitted string in a separate field, e.g. :
message Person {
string name = 1;
int32 name_len = 2;
int32 user_id = 3;
...
}

The wire type used for string and bytes is length-delimited. This means that the encoded message includes the string's length. How this is made available to you depends on the language you are using. For example, the table says that in C++ a string type is used, so you can call name.length() to retrieve the length.
So there is no need to specify the length in a separate field.
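To make "length-delimited" concrete, here is a minimal sketch that encodes a single string field by hand: a tag (field number shifted left by three, combined with wire type 2), then the payload length as a varint, then the UTF-8 bytes. It is only an illustration of the wire layout and does not use the protobuf library; the class and helper names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class LengthDelimitedDemo {

    // Encodes one string field: tag, varint length, UTF-8 payload.
    static byte[] encodeStringField(int fieldNumber, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (fieldNumber << 3) | 2); // wire type 2 = length-delimited
        writeVarint(out, payload.length);         // the length travels with the data
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    // Base-128 varint: 7 bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Field 1 ("name" in the Person message above) set to "Bob".
        System.out.println(hex(encodeStringField(1, "Bob"))); // prints 0a03426f62
    }
}
```

The second byte (03) is the length of "Bob", which is why a separate name_len field would be redundant.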

One thing I wish GPB did is allow the schema to set constraints on things such as list/array length or numerical value ranges. The best you can do is put a comment in the .proto file and hope that programmers pay attention to it!
Other serialisation technologies do support this, like XSD (though often the tools are poor), ASN.1 and JSON schema. It's very useful. If GPB added these (it wouldn't change wire formats), GPB would be pretty well "complete".

J8583 LLLLBIN and LLLLVAR produces the different length padding result

LLLLVAR and LLLLBIN produce different length headers for the same input.
I set the value "6832" on the same IsoMessage object; LLLLVAR returns "00046382", while LLLLBIN returns "000836333832".
A sample of the source code is below:
msg.setValue(60, "6832".toByteArray(Charsets.US_ASCII), IsoType.LLLLBIN, 10)//encodes to 000836333832
msg.setValue(60, "6832", IsoType.LLLLVAR, 10) //encodes to 00046382
I thought both should return 0004. Why are the results different?

When you encode ISO messages as text, the LxBIN fields encode their data in hex, and so the size is double what you'd expect. However, the decoder decodes the hex data and gives you a byte array when parsing.
LxVAR and LxBIN fields only have the same length when the whole message is encoded using binary formatting.
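The doubling can be illustrated with a small sketch that mimics text-mode output. The helpers below are hypothetical and are not part of the j8583 API; they assume a four-digit decimal length header followed by the data, with binary data written as hex. The example uses the value 6382, whose ASCII bytes in hex are 36333832, matching the encodings quoted in the question.

```java
import java.nio.charset.StandardCharsets;

class TextModeLengthDemo {

    // LLLLVAR-like: the header counts the characters of the text value.
    static String encodeVarAsText(String value) {
        return String.format("%04d", value.length()) + value;
    }

    // LLLLBIN-like in text mode: each byte becomes two hex digits,
    // and the header counts those hex digits.
    static String encodeBinAsText(byte[] data) {
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return String.format("%04d", hex.length()) + hex;
    }

    public static void main(String[] args) {
        String value = "6382";
        System.out.println(encodeVarAsText(value));                                       // 00046382
        System.out.println(encodeBinAsText(value.getBytes(StandardCharsets.US_ASCII)));  // 000836333832
    }
}
```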

How to store Bytes/Slice(UInt8) as a string in Crystal?

I'm encoding an Object into Bytes (ie Slice(UInt8)) via MessagePack. How would I store this in a datastore client (eg Crystal-Redis) that only accepts Strings?

If you have no other choice but to store the Slice as a String, you can encode it as text, at the cost of reduced performance.
There's Base64 strict_encode/decode:
encoded = An_Object.to_msgpack # Slice(UInt8)
save_to_datastore "my_stuff", Base64.strict_encode(encoded)
from_storage = get_from_datastore "my_stuff"
if from_storage
My_MsgPack_Mapping.from_msgpack( Base64.decode(from_storage) )
end
Or you can use Slice#hexstring and String#hexbytes:
encoded = An_Object.to_msgpack # Slice(UInt8)
save_to_datastore "my_stuff", encoded.hexstring
from_storage = get_from_datastore "my_stuff"
if from_storage && from_storage.hexbytes?
My_MsgPack_Mapping.from_msgpack( from_storage.hexbytes )
end
(Crystal-Redis users have another option: see this issue.)
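The cost of both encodings is easy to see in a language-neutral way. The following Java sketch (an illustration only, not Crystal code; the class and helper names are made up) round-trips a few arbitrary bytes through Base64 and through a hex string and compares the sizes.

```java
import java.util.Arrays;
import java.util.Base64;

class BinaryAsTextDemo {

    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    static byte[] fromHex(String text) {
        byte[] out = new byte[text.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(text.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // Arbitrary binary payload, including bytes that are not valid text.
        byte[] data = {0x00, (byte) 0xFF, 0x10, (byte) 0x80, 0x7F, 0x01};

        String b64 = toBase64(data);
        String hex = toHex(data);

        System.out.println(data.length + " bytes -> Base64: " + b64.length() + " chars, hex: " + hex.length() + " chars");
        System.out.println(Arrays.equals(fromBase64(b64), data)); // true
        System.out.println(Arrays.equals(fromHex(hex), data));    // true
    }
}
```

Base64 grows the data by roughly a third, while hex doubles it; both survive any store that only accepts text.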

Both Crystal and Redis should be able to handle strings with non-valid UTF-8 bytes, so you could just directly create a String from the slice and store this to Redis and vice versa.
This is of course not entirely safe: you should make sure to avoid invoking any string methods that expect a valid UTF-8 string.
But apart from that, this direct method should be perfectly fine. It is faster and more memory-efficient than using a string encoding.
redis.set key, String.new(slice)
redis.get(key).try &.to_slice # get may return nil when the key is missing

Escape special characters in Apache pig data

I am using Apache Pig to process some data.
My data set has some strings that contain special characters i.e (#,{}[]).
This programming pig book says that you can't escape those characters.
So how can I process my data without deleting the special characters?
I thought about replacing them but would like to avoid that.
Thanks

Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters in when part of a string. Just specify that field as type chararray.
The only issue you will have to watch out for here is if your strings ever contain the character that Pig is using as field delimiter - for example, if you are USING PigStorage(',') and your strings contain commas. But as long as you are not telling Pig to parse your field as a map, #, [, and ] will be handled just fine.
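The delimiter caveat can be shown with a small Java sketch that splits a line the way a delimiter-based loader would. This is an illustration only and not Pig's implementation; the class name, helper, and sample lines are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterDemo {

    // Splits a line on a literal delimiter, keeping empty trailing fields.
    static List<String> splitFields(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Tab-delimited: #, [, ], { and } inside the string field are untouched.
        System.out.println(splitFields("42\tname#joel [a,b] {x}", "\t"));
        // two fields: "42" and "name#joel [a,b] {x}"

        // Comma-delimited: the comma inside the string splits it in two.
        System.out.println(splitFields("42,name#joel [a,b] {x}", ","));
        // three fields: "42", "name#joel [a" and "b] {x}"
    }
}
```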

Easiest way would be,
raw_lines = LOAD 'inputLocation' USING TextLoader() as unparsedString:chararray;
TextLoader just reads each line of input into a String regardless of what's inside that string. You could then use your own parsing logic.

When writing your loader function, instead of returning tuples with e.g. maps as a String (and thus later relying on Utf8StorageConverter to get the conversion to a map right):
Tuple tuple = tupleFactory.newTuple( 1 );
tuple.set(0, new DataByteArray("[age#22, name#joel]"));
you can create and set directly a Java map:
HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);
This is useful especially if you have to do the parsing during loading anyway.

How do I get the length (i.e. number of characters) of an ASCII string in VB.NET?

I'm using this code to return some string from a tcpclient but when the string comes back it has a leading " character in it. I'm trying to remove it but the Len() function is reading the number of bytes instead of the string itself. How can I alter this to give me the length of the string as I would normally use it and not of the array underlying the string itself?
Dim bytes(tcpClient.ReceiveBufferSize) As Byte
networkStream.Read(bytes, 0, CInt(tcpClient.ReceiveBufferSize))
' Output the data received from the host to the console.'
Dim returndata As String = Encoding.ASCII.GetString(bytes)
Dim LL As Int32 = Len(returndata)
Len() reports the number of bytes, not the number of characters in the string.

Your code is currently somewhat broken. The answer is tcpClient.ReceiveBufferSize, regardless of how much data you actually received - because you're ignoring the return value from networkStream.Read. It could be returning just a few bytes, but you're creating a string using the rest of the bytes array anyway. Always check the return value of Stream.Read, because otherwise you don't know how much data has actually been read. You should do something like:
Dim bytesRead = networkStream.Read(bytes, 0, CInt(tcpClient.ReceiveBufferSize))
' Output the data received from the host to the console.'
Dim returndata As String = Encoding.ASCII.GetString(bytes, 0, bytesRead)
Now, ASCII always has a single character per byte (and vice versa) so the length of the string will be exactly the same as the length of the data you received.
Be aware that any non-ASCII data (i.e. any bytes over 127) will be converted to '?' by Encoding.ASCII.GetString. You may also get control characters. Is this definitely ASCII text data to start with? If it's not, I'd recommend hex-encoding it or using some other option to dump the exact data in a non-lossy way.
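The same pitfall exists in other stream APIs. As a rough illustration in Java (not VB.NET, and with made-up class and method names), the sketch below reads five bytes into a sixteen-byte buffer and compares decoding the whole buffer with decoding only the bytes that were actually read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadLengthDemo {

    // Buggy variant: ignores how many bytes were read and decodes the whole buffer.
    static String decodeWholeBuffer(InputStream in, int bufferSize) {
        try {
            byte[] buffer = new byte[bufferSize];
            in.read(buffer); // return value discarded
            return new String(buffer, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fixed variant: decodes only the bytes reported by read().
    static String decodeBytesRead(InputStream in, int bufferSize) {
        try {
            byte[] buffer = new byte[bufferSize];
            int bytesRead = in.read(buffer);
            if (bytesRead <= 0) {
                return ""; // end of stream or nothing read
            }
            return new String(buffer, 0, bytesRead, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] received = "hello".getBytes(StandardCharsets.US_ASCII);

        String wrong = decodeWholeBuffer(new ByteArrayInputStream(received), 16);
        String right = decodeBytesRead(new ByteArrayInputStream(received), 16);

        System.out.println(wrong.length()); // 16: includes the unused zero bytes
        System.out.println(right.length()); // 5
    }
}
```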

You could try trimming the string inside the call to Len():
Dim LL As Int32 = Len(returndata.Trim())

If Len reports the number of bytes and it doesn't match the number of characters, then I can think of two possibilities:
There are more chars being sent than you think (ie, that extra character is actually being sent)
The encoding is not ASCII, so there can be more than one byte per char (and one of them is that 'weird' character, that is the character is being sent and is not 'wrong data'). Try to find out if the data is really ASCII encoded, if not, change the call accordingly.

If I read you correctly, you get a single quotation mark at the beginning, right?
If you get that one consistently, why not just subtract one from the string length? Or use a substring from the second character:
Len(returndata.Substring(1))
And I don't quite understand what you mean by »the length of the string as I would normally use it and not of the array underlying the string itself«. You have a string. Any array which might represent that string internally is entirely implementation-dependent and nothing you should see or rely on. Or am I getting you wrong here? The string is what you are using normally. I mean, if that's not what you do, then why not take the length of the string after processing it into something you would normally use?

Maybe I am missing something here, but what is wrong with String.Length?
Dim LL As Int32 = returndata.Length
