String version of term_to_binary - serialization

I'm trying to write a simple server that talks to clients via tcp. I have it sending messages around just fine, but now I want it to interpret the messages as Erlang data types. For example, pretend it's HTTP-like (it's not) and that I want to send from the client {get, "/foo.html"} and have the server interpret that as a tuple containing an atom and a list, instead of just a big list or binary.
I will probably end up using term_to_binary and binary_to_term, but debugging text-based protocols is so much easier that I was hoping to find a more list-friendly version. Is there one hiding somewhere?

You can parse a string as an expression (similar to file:consult) via:
% InputString = "...",   (the expression text must end with a dot, which erl_parse:parse_exprs requires)
{ok, Scanned, _} = erl_scan:string(InputString),
{ok, Exprs} = erl_parse:parse_exprs(Scanned),
{value, ParsedValue, _} = erl_eval:exprs(Exprs, [])
(See http://www.trapexit.org/String_Eval)
You should be able to use io_lib:format to convert an expression to a string using the ~w or ~p format codes, such as io_lib:format("~w", [{get, "/foo.html"}]).
I don't think this will be very fast, so if performance is an issue you should probably not use strings like this.
Also note that this is potentially unsafe since you're evaluating arbitrary expressions -- if you go this route, you should probably do some checks on the intermediate output. I'd suggest looking at the result of erl_parse:parse_exprs to make sure it contains only the shapes you're interested in (i.e., it's always a tuple of {atom(), list()} with no embedded function calls). You should be able to do this via pattern matching.
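For example, a minimal sketch of both directions, with the shape check done on the parsed form before anything is evaluated (the module and function names are made up for illustration):

-module(term_text).
-export([encode/1, decode/1]).

%% Render a term as text; the trailing dot is what erl_parse expects.
encode(Term) ->
    lists:flatten(io_lib:format("~p.", [Term])).

%% Scan and parse the text, but only evaluate it if it is a literal
%% {atom(), string()} tuple -- anything else (including function calls)
%% falls through to the error clause.
decode(String) ->
    {ok, Tokens, _} = erl_scan:string(String),
    {ok, [Form]} = erl_parse:parse_exprs(Tokens),
    case Form of
        {tuple, _, [{atom, _, _}, {string, _, _}]} ->
            {value, Value, _} = erl_eval:exprs([Form], []),
            {ok, Value};
        _ ->
            {error, unexpected_term}
    end.

With that, decode(encode({get, "/foo.html"})) round-trips to {ok, {get, "/foo.html"}}, while something like "{get, os:cmd(\"ls\")}." is rejected before it is ever evaluated.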

How to replicate sha256 hash example from CyberSource REST API documentation?

I am investigating the CyberSource REST API and want to test the JSON Web Token Authentication method as documented here: https://developer.cybersource.com/api/developer-guides/dita-gettingstarted/authentication/GenerateHeader/jwtTokenAuthentication.html
I am unable to replicate the sha256 hash of the JSON payload described in the JWT Payload/Claim Set section.
{
  "clientReferenceInformation" : {
    "code" : "TC50171_3"
  },
  "orderInformation" : {
    "amountDetails" : {
      "totalAmount" : "102.21",
      "currency" : "USD"
    }
  }
}
I've attempted to use the sha256sum command in binary and text format on a file containing the payload example. I've also attempted running this command on different permutations of this payload, such as without whitespace or newlines.
I expect to get the example hash of
2b4fee10da8c5e1feaad32b014021e079fe4afcf06af223004af944011a7cb65c
but instead get
f710ef58876f83e36b80a83c8ec7da75c8c1640d77d598c470a3dd85ae1458d3 and other dissimilar hashes.
What am I doing wrong?
Since the alleged "example" hash contains 65 hex characters (a valid SHA-256 digest is exactly 64), one can see that it is not a possible output of SHA-256. So there is nothing you can do to make your example match theirs.
There is also a base64 example in that discussion, but it is also not valid base64. By adding an extra padding character '=' to the base64 it can be made valid, and decoding it reveals that it mostly matches the alleged SHA256 hash.
My guess is that the values on that page are just examples of what values look like to the human eye rather than test vectors you are supposed to match exactly.
Probably you are not doing anything wrong. Hash functions have an avalanche effect: changing even a single bit of the input drastically changes the output hash. If the site's original example used a different encoding, had a different order for the JSON elements, or even had extra or missing tabs, spaces, line breaks, or any other "noise" characters, you'll have a hard time finding an input that reproduces the hash shown on the site.
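As a quick illustration (plain Python, nothing CyberSource-specific), hashing two semantically identical encodings of the example payload gives two digests with nothing recognizable in common:

import hashlib, json

payload = {
    "clientReferenceInformation": {"code": "TC50171_3"},
    "orderInformation": {"amountDetails": {"totalAmount": "102.21",
                                           "currency": "USD"}},
}

pretty = json.dumps(payload, indent=2).encode()                 # with newlines and spaces
compact = json.dumps(payload, separators=(",", ":")).encode()   # minified

print(hashlib.sha256(pretty).hexdigest())
print(hashlib.sha256(compact).hexdigest())   # completely different from the line above

So unless you know the exact byte sequence the documentation authors hashed (encoding, key order, whitespace and all), you cannot reproduce their value.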
Usually, cryptographic solutions use canonicalizations to avoid this kind of problem (different hash values for semantically equal messages). However, the JWT specification doesn't specify any type of canonicalization for JSON.
In short, I think you don't have to worry about this. Your JWT implementation will be correct as long as you use a valid (correctly implemented) hash function.
Also, I noticed that the JWT specification doesn't specify a "Digest" field for the JWT payload, so you may not even need to use this field, unless the CyberSource REST API makes it mandatory.

ANTLR4 - replace op boundaries error | How to use TokenStreamRewriter to transform text from two listener events on overlapping tokens in original AST?

Hello ANTLR creators/users,
Some context - I am using PlSql ANTLR4 parser to do some lightweight transpiling of some queries from oracle sql to, let's say, spark sql. I have my listener class setup which extends the base listener.
Example of an issue -
Let's say the input is something like -
SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101') from xyz;
Now, I'd like to replace || with CONCAT and to_char with CAST as STRING, so that the final query looks like -
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
In my listener class, I am overriding two functions from the base listener to do this - concatenation and string_function. In those, I am using a TokenStreamRewriter's replace to make the necessary transformation. Since the TokenStreamRewriter is evaluated lazily, I am running into this issue:
java.lang.IllegalArgumentException: replace op boundaries of
<ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..[#53,276:276=')',
<2214>,3:63]:"CAST (to_number(substr(ATTRIBUTE_VALUE,1,4))-3 as STRING)">
overlap with previous <ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..
[#56,279:284=''0101'',<2209>,3:66]:"CONCAT
(to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3),'0101')">
Clearly, the issue is my two listener functions attempting to replace/transform text on overlapping boundaries.
Is there any workaround for this kind of overlapping-boundaries issue in ANTLR4? I'm sure folks probably run into such stuff all the time.
I'd appreciate any workarounds, even dirty ones at this point of time :)
I did realize that ANTLR4 does not allow us to modify original AST, otherwise this would have been a little bit easier to solve.
Thanks!
A look at how TokenStreamRewriter works leads to the following understanding:
first, a list of all modification operations is built
then, you invoke getText()
at this point, the modification operations are reduced. The idea, for example, is to merge multiple inserts together into one operation. Its role is also to avoid multiple replaces on the same data (but I will expand on this point later).
every token is then read; if there is a modification listed for that token index, TokenStreamRewriter performs the operation, otherwise it just emits the token unchanged.
Let's have a look at how the modification operations are implemented:
for insert, TokenStreamRewriter basically just adds the string to be inserted at the current token index, then does index+1, effectively moving to the next token
for replace, TokenStreamRewriter replaces a range of tokens with the new string, and sets the new index to the end of this range.
So, for TokenStreamRewriter, overlapping replaces are not possible: when you replace, you jump to the end of the replaced range. In particular, if you removed the overlap checks, only the first replace would be applied, because afterwards the token index is already past the other replaces.
Basically, this is because there is no easy way to tell which tokens should be replaced when replaces overlap. That would require symbol recognition and matching.
So, what you are trying to do is the following (for each step, the part between '*' is what is modified):
*SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101')* from xyz;
|
V
SELECT CONCAT(*to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)*,'0101') from xyz;
|
V
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
To achieve your transformation, you could instead do a replace of:
'to_char' -> 'CONCAT(CAST'
'||' -> ' as STRING),'
Then, by using a bit of intelligence while parsing your tokens - for example, checking whether there is a '||' among them to know that you are in a string concatenation - you would know what to replace. A rough sketch of this token-level approach follows.
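For example (the string_function/concatenation context names come from your description of the listener; the remaining details are assumptions about the PlSql grammar, not a drop-in solution):

import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenStreamRewriter;
import org.antlr.v4.runtime.tree.TerminalNode;

public class OracleToSparkListener extends PlSqlParserBaseListener {
    final TokenStreamRewriter rewriter;

    public OracleToSparkListener(CommonTokenStream tokens) {
        rewriter = new TokenStreamRewriter(tokens);
    }

    @Override
    public void exitString_function(PlSqlParser.String_functionContext ctx) {
        // Touch only the 'to_char' token itself.
        Token first = ctx.getStart();
        if (first.getText().equalsIgnoreCase("to_char")) {
            rewriter.replace(first, "CONCAT(CAST");
        }
    }

    @Override
    public void exitConcatenation(PlSqlParser.ConcatenationContext ctx) {
        // Touch only the '||' token, so this edit can never overlap the
        // single-token replace above. If your lexer emits '||' as two '|'
        // tokens, match on those instead.
        for (int i = 0; i < ctx.getChildCount(); i++) {
            if (ctx.getChild(i) instanceof TerminalNode) {
                Token t = ((TerminalNode) ctx.getChild(i)).getSymbol();
                if ("||".equals(t.getText())) {
                    rewriter.replace(t, " as STRING),");
                }
            }
        }
    }

    public String rewrittenText() {
        return rewriter.getText();
    }
}

Because each replace covers exactly one token, the lazily evaluated operations no longer overlap and getText() can apply them all.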
regards
The way I solve this in multiple ANTLR-based projects is this: I translate the ANTLR parse tree into an AST written using Kolasu, an open-source library we developed at Strumenta.
Kolasu has all sorts of utilities to process and mutate ASTs. For all non-trivial projects I end up doing transformations on the AST.
Kolasu

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
    value = reflect.ValueOf(m).Elem()
    for i := 0; i < value.NumField(); i++ {
        if t, ok := quick.Value(value.Field(i).Type(), r); ok {
            value.Field(i).Set(t)
        }
    }
    return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with random generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
    out = in // Make sure you return something
    if in.Kind() == reflect.String {
        instr := in.String()
        t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
        outstr, _, _ := transform.String(t, instr)
        out = reflect.ValueOf(outstr)
    }
    return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generate interface needs a method on the type, not on a pointer to the type. You want your type signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
    var buffer bytes.Buffer
    for i := 0; i < size; i++ {
        buffer.WriteString(string(latin[rand.Intn(len(latin))]))
    }
    s := LatinString(buffer.String())
    return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does this very well. The codepoints in the strings it generates are actually random and not just restricted to what we're accustomed to using in everyday life. This is of course exactly what you want when doing fuzzy testing, in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There are some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to support such input. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters used by testing/quick is just so large (0x10FFFF) and the set of reasonable characters so small that you end up with empty strings most of the time.
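For what it's worth, here is a minimal sketch of the "(Letter || Number) && Latin" intersection from the summary above, written as a plain rune predicate rather than a transform chain (keepLatinAlnum is a made-up name for illustration):

package main

import (
    "fmt"
    "strings"
    "unicode"
)

func keepLatinAlnum(s string) string {
    return strings.Map(func(r rune) rune {
        // Keep a rune only if it is in the Latin script AND is a letter or number.
        if unicode.Is(unicode.Latin, r) && (unicode.IsLetter(r) || unicode.IsNumber(r)) {
            return r
        }
        return -1 // a negative result drops the rune
    }, s)
}

func main() {
    fmt.Println(keepLatinAlnum("James Mc🙁Frown 42 𥗉")) // -> "JamesMcFrown"
}

Note that even this strict intersection has surprises: ASCII digits and spaces belong to the "Common" script rather than "Latin", so they are dropped too, which only makes the surviving strings sparser.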
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often-repeated code for generating random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because as soon as you get a generated string value, you can (and should) expect a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why doesn't the TSearch2 engine return something like this?
'google':2, 'com':1
Or how can I make the engine return the exploded string as I wrote above?
I just need "Google.com" to be findable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See the postgres documentation for details. E.g. something like:
ALTER TEXT SEARCH CONFIGURATION your_parser_config
    DROP MAPPING FOR url, url_path;
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1 for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off, in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).
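For example, a minimal sketch of the pre-processing idea, splitting the dotted token before it reaches the parser (here done inline with replace(); in practice you would do it when building the tsvector column):

SELECT to_tsvector('english', replace('Google.com', '.', ' '));
-- roughly: 'com':2 'googl':1  (the stemmer still runs on the parts)

SELECT to_tsvector('english', replace('Google.com', '.', ' '))
       @@ to_tsquery('english', 'google');
-- returns true: a search for "google" now matches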

URL shortener with no database

I'd like to write a URL shortener that doesn't have to use a database. Instead, to have as few moving parts as possible, the script would just create a unique hash for my URL based on an algorithm (like md5, except an md5 would be too long). I'm not really sure how I'd go about doing this. Any advice?
If it matters, I'd prefer to write this in Ruby.
What you need is a way to compress and decompress a string, where the resulting compressed version is itself a string. This is nearly impossible, because a URL is already very short. Encoding and lossless compression always add some minimal overhead, which results in a string that is larger than the original for most URLs.
For very long URLs, however, it may work.
So, in the end, you will almost always need a lookup-table in storage (database).
Base64 is the most logical solution. On its own, however, Base64 encoding returns longer strings than the original for short strings (which URLs generally are), mostly due to the padding. So we'll also try zlib, to compress the string before encoding.
require "uri"
require "base64"
require "zlib"
shortner_url = URI.parse("https://s.to")
long = "https://stackoverflow.com/questions/4818429/url-shortener-with-no-database"
url = URI.parse(long)
stripped = url.host + url.path
stripped.length #=> 66
# Let's see that Base64 on its own does not shorten the url.
encoded = Base64.encode64(stripped)
encoded.length #=> 90
# So, using zlib. To compress.
compressed = Zlib::Deflate.deflate(stripped)
encoded = Base64.encode64(compressed)
encoded.length #=> 94
# It became worse.
# Now, with a long url (they can be much longer even), in a oneliner; to simplify omit the stripping part:
long = "http://www.thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com/wearejustdoingthistobestupidnowsincethiscangoonforeverandeverandeverbutitstilllookskindaneatinthebrowsereventhoughitsabigwasteoftimeandenergyandhasnorealpointbutwehadtodoitanyways.html"
long.length #=> 263
Base64.encode64(Zlib::Deflate.deflate(long)).length #=> 228
# In order to turn this into a valid short URL, however, we need `urlsafe_encode64()`
shortner_url.path = "/" + Base64.urlsafe_encode64(Zlib::Deflate.deflate(long))
shortner_url.to_s #=> "https://s.to/eJxNjkEWwyAIRG-U7HsbElFpEPIE68vti6t2BcwbZn51v1_7PufcvCKrFDRnMtf8u81HzuA_IWkDEoGG4EtiMN9ObftE6Pgey0FSvK6gIx7GTUl0GsmJSz1Biqpk7fjBDpL-xjGcopKYWfWyiySBRBFJABw9UnB9xaWj1LDCQWUGAQYzBVLECPbyxFLBJDqA7-DxSJ5YIbkGnoM8Ex7bqjf-AiodbYM="
shortner_url.to_s.length #=> 237 WE SAVED 26 characters!
Note on stripping: you can remove 'https://'. A real implementation would need to add a piece to the string to record https vs. http, e.g. '1'+result for https and '0'+result for http. Another "hack" would be to make the url-shortening service use http for http urls and https for https urls.
If you always have the same domain, you can discard the domain part too.
If you have a lot of slashes, or other repeating characters such as a dash, the compression works better.
You could do this with several of the string manipulation tools available to transform a URL into something obscured; however, as you noted in your question, the URLs you get from doing this would be longer than is typical for a URL shortener.
URLs don't compress very well.
Ultimately if you're after a short link, you simply need to generate a suitably legible unique code (try to omit similar letters/numbers such as zero and 'o', in case some poor bugger actually has to type it in) and associate that code with the original URL in some form of store.
Whilst I can understand why you don't want to use a database, in many ways it's the perfect form of storage, especially if you look at one of the dedicated key/value stores such as Cassandra, Redis, MongoDB, etc. (That said, a simple "traditional" SQL database may be an easy first step if you're in unfamiliar territory.)
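For completeness, a minimal sketch of the "generate a legible code and store the mapping" approach, with a plain in-memory Hash standing in for whichever key/value store you pick (ALPHABET, shorten and resolve are made-up names for illustration):

# Alphabet omits easily confused characters (0/o, 1/l/i).
ALPHABET = %w[a b c d e f g h j k m n p q r s t u v w x y z 2 3 4 5 6 7 8 9]
STORE = {}

def shorten(url, length = 6)
  code = Array.new(length) { ALPHABET.sample }.join
  code = Array.new(length) { ALPHABET.sample }.join while STORE.key?(code)
  STORE[code] = url
  code
end

def resolve(code)
  STORE[code] # nil for unknown codes
end

code = shorten("https://stackoverflow.com/questions/4818429/url-shortener-with-no-database")
puts "https://s.to/#{code}"
puts resolve(code)

The while loop retries on the (rare) collision; with 31 characters and a length of 6 you get roughly 900 million distinct codes.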
You won't be able to resolve the original URL from a hash code without looking it up in some kind of database.
About the only thing you can do without a database is compress the URL and then decompress it when you resolve the URL.
Strictly speaking, I guess you could just hash the URL. But of what possible value would that be if you are not able to resolve it back to the original URL?