I am trying to get the response of my call into a String, but the result looks as if the charset is wrong.
val apiResponse = URL("https://api.stackexchange.com/2.2/questions?order=desc&sort=activity&site=stackoverflow")
.readText(Charset.forName("ISO-8859-1"))
println(apiResponse)
I tried using "UTF-8", but the result is the same: full of badly encoded characters.
Why?
The server returns the content compressed with gzip, so the raw bytes naturally include lots of undisplayable characters.
You can confirm this without using Kotlin, e.g.:
$ wget 'https://api.stackexchange.com/2.2/questions?order=desc&sort=activity&site=stackoverflow'
[output snipped]
$ file 'questions?order=desc&sort=activity&site=stackoverflow'
questions?order=desc&sort=activity&site=stackoverflow: gzip compressed data, from TOPS/20, original size 19820
You can use Kotlin to uncompress it, but this is easier if you read the URL content as bytes, to avoid any character-set conversions:
import java.io.ByteArrayInputStream
import java.net.URL
import java.util.zip.GZIPInputStream

val url = URL("https://api.stackexchange.com/2.2/questions?order=desc&sort=activity&site=stackoverflow")
val content = GZIPInputStream(ByteArrayInputStream(url.readBytes()))
    .bufferedReader()   // decodes as UTF-8 by default
    .use { it.readText() }
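Note that a server is not obliged to gzip every response, so a more defensive version could check the Content-Encoding response header (or the gzip magic bytes 0x1f 0x8b) before decompressing. A minimal sketch, reusing the url value from above:

import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

// Gunzip only when the response actually is gzip-compressed.
val conn = url.openConnection()
val bytes = conn.getInputStream().use { it.readBytes() }
val text = if (conn.contentEncoding == "gzip" ||
               (bytes.size > 2 && bytes[0] == 0x1f.toByte() && bytes[1] == 0x8b.toByte()))
    GZIPInputStream(ByteArrayInputStream(bytes)).bufferedReader().use { it.readText() }
else
    bytes.decodeToString()  // already plain text; decode as UTF-8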
I am trying to unescape Unicode escape sequences like "\u00f6" to their UTF-8 representation.
E.g. a file containing "Aalk\u00f6rben" should become "Aalkörben".
val tmp = text.toByteArray(Charsets.UTF_8)
val escaped = tmp.decodeToString()
// or val escaped = tmp.toString(Charsets.UTF_8)
When I set the string manually to "Aalk\u00f6rben", this works fine. However, when the string is read from the file it is interpreted as "Aalk\\u00f6rben", with the backslash escaped (two backslashes), and the unescaping fails.
Is there any way to convince Kotlin to convert the special characters? I would rather not use external libraries like those from Apache.
I do not know how you read the file, but most probably ...\u00f6... is read as six single characters, with the backslash escaped. You can check this in the debugger.
So my assumption is that in memory you have "Aalk\\u00f6rben". Try this replace:
// "\\u00f6" is the literal six characters \u00f6; "\u00f6" is the actual 'ö'.
val result = text.replace("\\u00f6", "\u00f6")
Edit: this should replace all \uXXXX escape sequences:
import java.util.regex.Pattern

// Matcher.replaceAll(Function) requires Java 9+.
val text = Pattern
    .compile("\\\\u([0-9A-Fa-f]{4})")
    .matcher(file.readText())
    .replaceAll { matchResult -> matchResult.group(1).toInt(16).toChar().toString() }
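If you prefer to stay within the Kotlin standard library, a sketch of the same idea with kotlin.text.Regex, which needs no Java 9 Matcher API and substitutes the lambda's result literally:

// Find each \uXXXX, parse the four hex digits, and emit the corresponding character.
val unescaped = Regex("""\\u([0-9A-Fa-f]{4})""")
    .replace(file.readText()) { match ->
        match.groupValues[1].toInt(16).toChar().toString()
    }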
I want to replace ___SIGNATURE___ with an HTML signature after the first occurrence of "text/html", performing only one replacement. Any remaining ___SIGNATURE___ tags should simply be removed.
I am processing an email message whose header declares a multipart boundary, so there are two body parts, one text/plain and one text/html, and the ___SIGNATURE___ tag exists in both.
So the relevant part of my script looks like this:
awk -v signature="$(cat $disclaimer_file)" '/text\/html/ {html=1;} html==1 && !swap {swap=sub(/___SIGNATURE___/, signature);} 1' in.$$ > temp.mail && mv temp.mail in.$$
sed -i "s/charset=us-ascii/charset=utf-8/1;s/___SIGNATURE___//" in.$$
It works, but is that the optimal solution?
I have used altermime before, but it was not a good solution for my case.
Without access to sample messages, it's hard to predict what exactly will work, and whether we need to properly parse the MIME structures or if we can just blindly treat the message as text.
In the latter case, refactoring to something like
awk 'NR==FNR { signature = signature ORS $0; next }
     { sub(/charset="?[Uu][Ss]-[Aa][Ss][Cc][Ii][Ii]"?/, "charset=\"utf-8\"") }
     /text\/html/ { html = 1 }
     /text\/plain/ { html = 0 }
     /___SIGNATURE___/ {
         if (html && signature) {
             # substr because there is an ORS before the text
             sub(/___SIGNATURE___/, substr(signature, 2))
             signature = ""
         } else
             sub(/___SIGNATURE___/, "")
     } 1' "$disclaimer_file" "in.$$"
would avoid invoking both Awk and sed (and cat, and the pesky temporary file), since Awk alone can reasonably and quite comfortably do all the work.
If you need a proper MIME parser, I would look into writing a simple Python script. The email library in Python 3.6+ is quite easy to use and flexible (but avoid copy/pasting old code which uses raw MIMEMultipart etc; you want to use the (no longer very) new EmailMessage class).
In IE, the URL length limit is about 2048 bytes, and my query string may exceed that.
I use lz-string to compress the string. However, the resulting link in the PDF does not work. Should I maybe use another way to compress the string?
var compressed = LZString.compress(query)
https://jsfiddle.net/p9e4a8dg/11
Try reading this link.
The output of compress is a UTF-16 string that is not safe to put in a URL. If you want pure ASCII output, you can try compressToBase64 instead.
Resource URL
GET https://<MATD_IP>/php/session.php
The following HTTP headers should be specified in the session request:
Accept: application/vnd.ve.v1.0+json
Content-Type: application/json
VE-SDK-API: Base64 encoded "user name:password" string
VE-API-Version (Optional)
I am confused about what it means to specify a Base64 encoded string. I have tried to do it, but I am failing at it. Can anybody help me with the exact header parameters by giving an example?
Thank you
You could use this in your Pre-request Script:
let base64 = Buffer.from("username:password").toString('base64')
pm.request.headers.add({key: "VE-SDK-API", value: base64})
This will convert to Base64 and then create the header with the encoded value.
It most likely means that you need to provide a base64 string for that field. Write down the credentials with a : in between. Ex:
cooluser:str0ngP@ssword
Then you encode this exact string as base64 which would give you:
Y29vbHVzZXI6c3RyMG5nUEBzc3dvcmQ=
You can encode it in a terminal (Linux) with echo "XXX" | base64, or just search for "base64 encode" on the web (not really recommended, for security reasons).
You can then use it for the headers:
Accept: application/vnd.ve.v1.0+json
Content-Type: application/json
VE-SDK-API: Y29vbHVzZXI6c3RyMG5nUEBzc3dvcmQ=
VE-API-Version: 1.x
Omit the trailing newline when echoing by using the -n option; otherwise the newline becomes part of the encoded string:
echo -n "username:password" | base64
I am trying to upload a UTF-8 encoded file. In Python 2.x I was using something like:
lines = filearg.file.readlines()
In Python 3.2 I get an iterator of byte strings. I guess I can do something like:
lines = [line.decode() for line in filearg.file.readlines()]
I wonder whether there isn't a simpler way. For regular files I just write:
with open(path) as f:  ## utf-8 is the default
    lines = f.readlines()
and I get my list of utf-8 strings.
Without more information (what framework are you using?), we don't know if there's a neat way to do this. But generally:
HTTP communications are bytes-based: there's not necessarily an encoding specified, and if there is, it may not be correct. So it makes sense to give you bytes and let you work out what to do with them. If you want a text file-like object, you can use io.TextIOWrapper:
import io
file = io.TextIOWrapper(filearg.file, encoding='utf-8')
A similar approach, used in Python 2.x, is:
import codecs

with codecs.open(path, encoding='utf-8') as f:
    lines = [l for l in f]
You can try it in Python 3.x too.