Official encoding used by Twitter Streaming API? Is it UTF-8? - api

What is the official encoding for Twitter's streaming API? My best guess is UTF-8 based on what I've seen, but I would like to avoid making assumptions.
The only part of the Twitter site I've seen where they even hint at what they use as their official encoding is here:
Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation
https://dev.twitter.com/docs/counting-characters
Does anyone have a more "official" answer? I'm writing a state-machine tokenizer for the streaming API which makes certain assumptions. The last thing I want is to encounter something like UTF-16.
Thanks! :D

One indicator is that the JSON format, which Twitter uses for virtually everything, dictates (or at least defaults to) UTF-8. They should also send an appropriate HTTP header declaring the encoding (though I haven't confirmed this). If you're using XML instead, the XML declaration at the top of the document explicitly states the encoding, which is UTF-8.
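A quick way to see the JSON point in practice (a minimal sketch, nothing Twitter-specific): RFC 8259 requires JSON exchanged between systems to be UTF-8, and Python's `json.loads` will auto-detect the encoding when handed raw bytes.

```python
import json

# JSON on the wire is UTF-8 by default; json.loads auto-detects
# UTF-8/16/32 when given bytes rather than str.
payload = '{"text": "h\u00e9llo"}'.encode("utf-8")
doc = json.loads(payload)
```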

If they say they use UTF-8, that's a pretty good bet. UTF-8 is very common, and UTF-16 in the wild is pretty rare from what I've seen.
There are also some clever libraries you could use if you were so inclined to prove it to yourself by testing whether they support various characters. The best of these is used by Firefox to detect the encoding of webpages as they're loaded: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

At the moment, the Twitter API v2 does not send its data in UTF-8!
I believe it's UTF-16, because when the data is decoded as UTF-8, surrogate pairs remain; and surrogate pairs are a feature of UTF-16 only.
For example, I received this string from the API: 🎁Crypto Heroez epic giveaway🎁
However, it didn't arrive that way, but rather as: \ud83c\udf81Crypto Heroez epic giveaway\ud83c\udf81
\ud83c\udf81 is a surrogate pair that translates into a gift emoji 🎁
In UTF-16BE hex, that wrapped present is encoded as D8 3C DF 81; in UTF-8 the same emoji is encoded as F0 9F 8E 81.
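For what it's worth, a standard JSON decoder recombines those `\uXXXX` surrogate escapes into the real code point; a minimal sketch in Python (the string literal below mimics the raw response text):

```python
import json

# The raw streamed text contains literal \uXXXX escapes; json.loads
# pairs up the surrogates and yields the actual emoji.
raw = '"\\ud83c\\udf81Crypto Heroez epic giveaway\\ud83c\\udf81"'
decoded = json.loads(raw)

# Manually: 0x10000 + (0xD83C - 0xD800) * 0x400 + (0xDF81 - 0xDC00)
# gives 0x1F381, which is U+1F381, the gift emoji.
code_point = 0x10000 + (0xD83C - 0xD800) * 0x400 + (0xDF81 - 0xDC00)
```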
Other developers noticed the same: https://twitterdevfeedback.uservoice.com/forums/930250-twitter-api/suggestions/41152342-utf-8-encoding-of-v2-api-responses
That issue was filed on Aug 15, 2020, but as I write this on September 9, 2021, they haven't communicated anything publicly. (That's why I wanted to post this answer here.)

Related

Best tool to identify encoding of streams (e.g. protobuf, avro, thrift, capnproto, etc.)?

I'm interested in playing with AI/ML tools and was wondering whether this is a good problem to solve with that kind of tool. It would also be pretty useful in my day job if it actually worked.
The idea is: if you have a bunch of binary messages of unknown origin, would some form of pattern recognition be able to determine that a message was encoded with a particular serialization tool like protobuf/Cap'n Proto/Avro/etc.?
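Before reaching for ML, it's worth noting that some of these formats expose fixed magic bytes while bare messages in others carry none at all, which is exactly what makes the problem hard. A minimal baseline sketch (the function and labels are my own, not from any library):

```python
def guess_container(data: bytes) -> str:
    """Guess a serialization container from leading magic bytes (a sketch)."""
    # Avro Object Container Files start with the 4-byte magic b"Obj\x01"
    if data.startswith(b"Obj\x01"):
        return "avro-container"
    # gzip, which often wraps any of these formats
    if data.startswith(b"\x1f\x8b"):
        return "gzip"
    # Bare protobuf / Thrift / Cap'n Proto messages have no magic number,
    # so anything beyond this needs statistical or structural analysis.
    return "unknown"
```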

Pure HTML vs frameworks to define HATEOAS API?

When should one develop a HATEOAS RESTful server API instead of using HTML (resource links, forms, etc.)?
Isn't HTML and a browser good enough as a hypermedia engine?
HTML + HTTP + URI + Browser === The world wide web. So it's pretty good, no joke.
It's not without fault.
HTML's understanding of links is disappointingly limited. No support for idempotent writes. URI Template support for GET only. I'm not super keen on how many different spellings there are for "link".
It's kind of verbose for a hypermedia format; don't get me wrong, built-in text markup is brilliant when you're trying to document what's going on for a human being. But my impression thus far is that the same structure starts to get in the way when, as a human being, you want to quickly review the semantic content that your automated agent is consuming.
I call your attention to this quote from RFC 4287:
The primary use case that Atom addresses is the syndication of Web content such as weblogs and news headlines to Web sites as well as directly to user agents.
So a bunch of really smart guys, specifically trying to address use cases directly related to the web, decided to invest a bunch of effort into standardizing a new hypermedia format rather than using the one that was already ubiquitous in their problem domain.
And over the past 10+ years, that format has been widely adopted.
Without adoption, I'm not sure that HATEOAS has much benefit. You don't need a hypermedia API if you control both sides of the conversation (example: JavaScript on the web, i.e. hypermedia with code-on-demand capability downloading a client that has learned the protocol of a web API via some out-of-band channel).
Evidence would seem to suggest that HTML is not nearly as convenient a format as, for example, any of the JSON based hypermedia formats.
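For concreteness, here is what a minimal HAL-style document (one of those JSON-based hypermedia formats) might look like; the resource and relation names are invented for illustration, and the sketch uses Python's json module:

```python
import json

# A minimal HAL-style resource for a hypothetical order API.
doc = json.loads("""
{
  "_links": {
    "self":    { "href": "/orders/123" },
    "payment": { "href": "/orders/123/payment" }
  },
  "total": 42.50,
  "status": "pending"
}
""")

# The client discovers the next action by following a link relation,
# rather than hardcoding the URL.
next_href = doc["_links"]["payment"]["href"]
```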
In conclusion: no, it's not good enough. It might be an acceptable placeholder for the moment, but the JSON hypermedia toolsets will soon be sufficiently mature that HTML will be seen as a giant step in the wrong direction.

Is it advisable to send/return data in .plist format to/from an API being accessed via iOS?

We're deciding between using JSON vs. Property List (binary) for our API, which will be accessed by iPhone/iPad/iPod Touch.
Are there any speed advantages?
The server guys are going to understand JSON better.
Plists work really well and have much, much better type safety. The real issue you'll run into with JSON is that someone server-side adds a few quotes around a number and suddenly your app is crashing.
But JSON is compact, easy to read (unlike binary plists), and, as noted, really well understood. So just be very careful in the parsing code, and try out JSON.
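The "quotes around a number" failure mode is easy to demonstrate; a minimal sketch of why careful parsing matters:

```python
import json

# The same key arrives with a different runtime type depending on
# whether the server quoted the value; a plist's typed serialization
# would not let this happen silently.
good = json.loads('{"count": 5}')
bad = json.loads('{"count": "5"}')  # a stray pair of quotes server-side

assert isinstance(good["count"], int)
assert isinstance(bad["count"], str)  # same key, different type: crash territory
```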
According to Sam Soffes, JSON.
edit: he talks about XML-based plists, not binary plists. Either way, JSON will typically be easier to generate from web-based APIs.

Posting an image and textual based data to a wcf service

I have a requirement to write a web service that allows me to post an image to a server along with some additional information about that image.
I'm completely new to developing web services (normally client side dev) so I'm a little stumped as to what I need to look into and try.
How do you post binary data and plain text to a service?
What RequestFormat should I use?
It looks like my options are xml or json. Can I use either of these?
Bit of a waffly question but I just need some direction rather than a solution as I can't seem to find much online.
After reading this guide to building RESTful services, I figured I was going about the problem the wrong way. The image and text are actually two separate resources and so should probably be handled separately. I now have a service that uploads an image and returns a URI to that image, and a separate service to post textual data relating to that image along with the URI to that image.
Though I don't have experience with WCF, I can tell you a painless way to handle POSTing/PUTting of binary data in a REST API (especially one with a mix of text + binary) is to encode the binary data as base64 and treat it much like any other text data in your API.
Yes, there is a slight overhead with base64 in terms of size plus an additional encode/decode step; however, base64 output is typically only about 1.37x larger than the original binary.
I find in many cases the overhead is well worth avoiding the pain that can be involved with binary data in APIs, especially when you need to POST/PUT a combination of binary and text data. If you wanted to POST an image and additional meta/text data you could easily do so with a json string ("image" would be your base64 encoded image)...
{
"image":"aGVsbG8...gd29ybGQ=",
"user" : 1234,
"sub_title": "A picture from my trip to Pittsburgh"
}
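A sketch of producing and consuming such a payload in Python (the image bytes and field values here are placeholders):

```python
import base64
import json

# Placeholder image data; in practice this would be read from a file.
image_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."

# Client side: base64-encode the binary and embed it in the JSON body.
payload = {
    "image": base64.b64encode(image_bytes).decode("ascii"),
    "user": 1234,
    "sub_title": "A picture from my trip to Pittsburgh",
}
body = json.dumps(payload)

# Server side: reverse the process to recover the original bytes.
received = json.loads(body)
restored = base64.b64decode(received["image"])
assert restored == image_bytes
```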
This is by no means the best solution for all cases, and as I said, I'm not a WCF expert, but it's certainly something to consider in the general case to make your life easier.
If you are using WebServiceHost in WCF 3.5 then you need to read this. If you have to use WCF to do HTTP-based stuff, then try to get onto .NET 4. I believe they have made a whole lot of things much easier.
If you are stuck with 3.5, welcome to a world of pain. Find everything you can written by Aaron Skonnard on the subject. And as I suggested in the comments of the other question, learn how to use SvcTrace.

SEO: How to best encode characters such as ÅÄÖ

In terms of SEO: what's the best way to encode characters such as ÅÄÖ?
I've used &ouml;, &aring; in titles etc.
But in google webmaster tools they end up as:
"S&#246;k bland inkomna f&#246;rfr&#229;gningar fr&#229;n Stockholm inom Golvv&#229;rd. Offerta.se"
Doesn't Google recognize these?
Google does recognise HTML entity references in search results; I'm not sure where in Webmaster Tools you're looking to get the HTML-source version you quote, and whether that's actually indicative of any kind of problem (I suspect not).
But these days there's little good reason to ever use an HTML entity reference (other than the XML built-ins &lt;, &amp;, &quot;). Stick with UTF-8 encoding and just type the characters ö, Å et al. directly.
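If you want to check for yourself what an entity-encoded title decodes to, Python's stdlib can do it (a quick sketch; `html.unescape` handles both named and numeric character references):

```python
import html

# An entity-encoded title like the one quoted from Webmaster Tools.
title = "S&#246;k bland inkomna f&#246;rfr&#229;gningar"
decoded = html.unescape(title)  # numeric references become real characters
```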
I'd expect if you have a charset declaration in your file (such as <meta http-equiv="Content-type" content="text/html;charset=UTF-8"> in your head section) that the webcrawler would understand the unicode characters for those letters. But really you'd have to test these out and/or ask Google.
I wonder if your inference that Google's webcrawler isn't processing the entities correctly (on the basis of what you're seeing in the webmaster tools) is actually correct; I'm not saying it isn't, but it's an assumption I'd test. I'd be a bit surprised, really, if Google's webcrawler didn't understand the entities (&Aring; for Å, etc.); most browsers do, even in the title.
But using a proper charset is probably your best bet. Make sure the charset you declare is the one your editor and other tools are actually producing!
Google converts everything to Unicode internally, so use UTF-8 everywhere.