When I create a PDF form (for instance using Acrobat) that contains text fields in AcroForm format (PDF dictionaries, no XFA), and I submit the data to a server, how can I specify/retrieve the encoding that will be used?
For instance, when I submit the Chinese glyphs '测试' (test), I receive the following headers and content on the server side:
accept: application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
content-type: application/x-www-form-urlencoded
content-length: 23
acrobat-version: 10.1.4
user-agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDC; .NET4.0C; AskTbCLA/5.15.1.22229)
accept-encoding: gzip, deflate
connection: Keep-Alive
Song=%b2%e2%ca%d4&Test=
There's no reference to an encoding, except x-www-form-urlencoded. The two glyphs are represented as four bytes: B2 E2 CA D4. After some investigation, I know that B2E2 is the GBK value for the first glyph, and CAD4 the GBK value for the second glyph, but I can't derive this from the request header.
Is it always GBK? I want to change the data encoding by setting a specific key in a dictionary in the PDF, but there doesn't seem to be any. For instance: I would like to make sure the PDF always sends Unicode characters instead of GBK.
Note that I've already experimented by changing the default font (and encoding) of the text field. I've also searched ISO-32000-1 for encodings in fields, but all I found was a way to define non-Latin characters for check boxes, and some info about the encoding of an FDF file. None of which answered my questions.
I've just found the answer to my main question myself. I didn't find anything in ISO-32000-1 or the ISO-32000-2 draft, but studying the Acrobat JavaScript reference, I found the cCharset parameter that is available for the submitForm() method. That parameter defines:
The encoding for the values submitted. String values are utf-8,
utf-16, Shift-JIS, BigFive, GBK, and UHC. If not passed, the current
Acrobat behavior applies. For XML-based formats, utf-8 is used. For
other formats, Acrobat tries to find the best host encoding for the
values being submitted. XFDF submission ignores this value and always
uses utf-8.
In other words: in my case GBK was used because it fits best to submit Chinese characters. However, one could force UTF-8 by using the submitForm() JavaScript method using the appropriate value.
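On the receiving side, the body shown in the question can only be decoded once the charset is known out of band. A minimal Python sketch, assuming the server already knows that the form was submitted as GBK; decode_form is a made-up helper name used only for illustration:

```python
from urllib.parse import parse_qs, unquote_to_bytes


def decode_form(body: str, charset: str) -> dict:
    """Decode an application/x-www-form-urlencoded body with an explicit charset."""
    # keep_blank_values=True keeps empty fields such as "Test="
    return parse_qs(body, encoding=charset, keep_blank_values=True)


body = "Song=%b2%e2%ca%d4&Test="

# The raw bytes hidden behind the percent-escapes
raw = unquote_to_bytes("%b2%e2%ca%d4")

# Interpreting those bytes requires knowing the charset (assumed GBK here)
fields = decode_form(body, "gbk")
print(raw.hex(), fields)
```

If the form were submitted with cCharset set to utf-8 instead, the same helper would be called with "utf-8" and the percent-escapes in the body would be different.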
Based on this question, I have asked the ISO committee to fix this problem in ISO-32000-2.
As a result, an extra possible entry was added to the table entitled Additional entries specific to a submit-form action in section 12.7.6.2:
CharSet: string
(Optional; inheritable) Possible values include: utf-8, utf-16,
Shift-JIS, BigFive, GBK, or UHC.
Starting with PDF 2.0, this problem will no longer exist.
Update: my suggestion made it into ISO 32000-2 (aka PDF 2.0):
The CharSet key doesn't exist in ISO 32000-1; it was introduced in ISO 32000-2.
I am using MSXML2.XMLHTTP60 to send text messages via VBA using a web server. I cannot understand why the € symbol is not displayed when a text message is received. Other special characters, such as ò, à, è etc., are displayed after a conversion function I wrote (for example, à is encoded as "%E0"). I suppose that the web server is expecting the ISO 8859-1 charset, which does not support the € symbol. How can I solve this problem?
If your request is a POST request then you can specify header for Content-Type with encoding e.g. like this:
objHTTP.Open "POST", ...
objHTTP.setRequestHeader "Content-Type", "text/html; charset=utf-8"
But for GET request the URL with possible query string parameters will be encoded as ASCII. Read e.g. this post.
Using UTF-8 as your character encoding should solve such problems. It may also remove the need for your conversion function. I'm not sure how to set the encoding in your web server, but that's usually well documented.
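To make the difference concrete, here is a small Python sketch (not VBA) of how the same characters percent-encode under ISO 8859-1 and under UTF-8. It only illustrates the byte values; which charset the receiving server actually expects is an assumption that has to be checked against its documentation.

```python
from urllib.parse import quote

# "à" exists in ISO 8859-1, so it becomes a single byte
latin1_a = quote("à", encoding="iso-8859-1")

# "€" has no code point in ISO 8859-1, so strict encoding fails
try:
    quote("€", encoding="iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

# UTF-8 can represent both characters, using several bytes each
utf8_a = quote("à", encoding="utf-8")
utf8_euro = quote("€", encoding="utf-8")

print(latin1_a, euro_fits_latin1, utf8_a, utf8_euro)
```

This matches the "%E0" produced by the conversion function in the question, and shows why no such single-byte escape exists for € in that charset.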
My HTML video element issues multiple separate requests for chunks; it does not look like a single stream.
When I look at it in the debugging tools, there are 3 different calls.
This is the request header,
Accept:*/*
Accept-Encoding:identity;q=1, *;q=0
Accept-Language:ja-JP,en-US;q=0.8
Connection:keep-alive
Cookie:stg_domain_token=oNijQNByftcYnsLGzFZxRyCesLR-GdWKi6a-uKSJJ9060Yk8pwCiUlcHChyf
Host:stg.myhost.com
Range:bytes=32768-
User-Agent:Mozilla/5.0 (Linux; Android 6.0.1; SC-05G Build/MMB29K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/54.0.2840.68 Mobile Safari/537.36
X-DevTools-Emulate-Network-Conditions-Client-Id:62626f5b-82c9-48b9-97f5-a7a983e1c3bc
and here is the response header,
accept-ranges:bytes
Connection:keep-alive
Content-Disposition:filename=49976265106__9BB3FA25-04E4-4AF5-903C-9B12CF622567.MOV
Content-Length:324882
content-range:bytes 32768-357649/357650
Content-Type:video/quicktime
Date:Fri, 04 Nov 2016 06:15:06 GMT
Server:Apache
X-Powered-By:PHP/5.6.17
Does anyone know what I am missing?
The browser won't download an entire video or audio file at once. It downloads the file in chunks and plays them one after another.
To help you understand, I explain the headers here.
Request Header
Accept:*/* : Browser will accept any MIME-Types as response.
Range:bytes=32768- : Browser already has the video part, till byte 32767 but requires file from byte 32768.
Response Header
status : 206 : It means the served content is partial (not complete file)
accept-ranges:bytes : Server accepts byte ranges only (which is universal)
Content-Length:324882 : Total content length from requested byte.
content-range:bytes 32768-357649/357650: it is in this format start byte - last byte / total length (from 0 byte to end)
Content-Type:video/quicktime : Type of content
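The relationship between these headers can be checked mechanically. A rough Python sketch using the header values from the question; parse_content_range is a hypothetical helper, not part of any library:

```python
import re


def parse_content_range(value: str):
    """Parse 'bytes first-last/total' into three integers."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


first, last, total = parse_content_range("bytes 32768-357649/357650")

# Both ends of the range are inclusive, hence the +1
chunk_length = last - first + 1

# The last byte of the file has index total - 1
is_final_chunk = last == total - 1

print(chunk_length, is_final_chunk)
```

The computed chunk length equals the Content-Length of the response, which is why that header is smaller than the total file size.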
OK now I do have the timestamp from a TS provider.
How am I supposed to put it in a MIME message so that it complies with the standards?
As far as I know, no mailer supports timestamping, and this will not be a problem because I will be handling the mime message myself.
However I want to make it the standard way... any examples?
Thanks.
I think @Michael's own answer is nearly there, with the following caveats:
An application/timestamp-reply is intended to transport a TimeStampResp, which may or may not contain a TimeStampToken, and for the current purpose a TimeStampToken is always required to exist. See RFC 3161, "2.4.2. Response Format".
application/timestamp-reply content type is not currently defined as a security multipart protocol. See RFC 1847, "1. Introduction" and RFC 3161, "3.1. Time-Stamp Protocol Using E-mail".
Because of the above, I suggest the following sample structure:
MIME-Version: 1.0
Content-Type: multipart/signed; protocol="application/timestamp-signature"; micalg="sha256"; boundary="{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}"
--{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}
MIME-Version: 1.0
Content-Type: text/plain
Hello
--{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}
Content-Type: application/timestamp-signature; name="tst.bin"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tst.bin"
MIINygYJKoZIhvcNAQcCoIINuzCCDbcCAQMxDzANBglghkgBZQMEAgEFADB5BgsqhkiG9w0BCRAB
BKBqBGgwZgIBAQYLYIZIAYb9bgEHFwQwMTANBglghkgBZQMEAgEFAAQg7fR3pD+6Lw0dlYtTjYke
...
vlwFfWaVsUq6VyE0Sw3mVxQGooR7/GH10QSP7bNQqHNWyk1kX+9FlrRY3BPjsvJ046+ol74/3QkB
WA7ZrAGzhwRBPQKfkCXysHwtDIj7iF1YXcXoeKQ1SWiGjhIHCpCXMJwNiapZQfYsnZQbI6L/xXMA
--{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}--
Where
tst.bin is a TimeStampToken.
application/timestamp-signature is a non-standard security multipart protocol.
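A structure like this can be assembled with Python's standard email package. This is only a sketch: application/timestamp-signature is the non-standard protocol name suggested above, build_timestamped_message is a hypothetical helper, and the token bytes are a placeholder rather than a real TimeStampToken.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_timestamped_message(text: str, token_der: bytes) -> MIMEMultipart:
    """Wrap a text part and a DER-encoded timestamp token in multipart/signed."""
    msg = MIMEMultipart(
        "signed",
        protocol="application/timestamp-signature",  # non-standard, see above
        micalg="sha256",
    )
    msg.attach(MIMEText(text, "plain"))
    # MIMEApplication base64-encodes the binary payload by default
    token_part = MIMEApplication(token_der, "timestamp-signature", name="tst.bin")
    token_part.add_header("Content-Disposition", "attachment", filename="tst.bin")
    msg.attach(token_part)
    return msg


# Placeholder bytes; a real message would carry the token returned by the TSA
message = build_timestamped_message("Hello\n", b"\x30\x03\x02\x01\x00")
print(message.as_string())
```

Note that in multipart/signed (RFC 1847) the protected content is the first body part as serialized, including its MIME headers, so the message imprint sent to the timestamping service would presumably have to be computed over exactly those bytes.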
Edit:
There seem to be a couple of standards that could fit here:
RFC 5544 - "Syntax for Binding Documents with Time-Stamps"
RFC 5955 - "The application/timestamped-data Media Type"
But I did not have the time to check them in detail.
I believe this is the appropriate format:
MIME-Version: 1.0
Content-Type: multipart/signed; protocol="application/timestamp-reply"; micalg="sha256"; boundary="{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}"
--{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}
MIME-Version: 1.0
Content-Type: text/plain
Hello
--{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}
Content-Type: application/timestamp-reply; name="smime.tsr"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="smime.tsr"
MIIIUgYJKoZIhvcNAQcCoIIIQzCCCD8CAQMxDzANBglghkgBZQMEAgEFADCCAQ4G
CyqGSIb3DQEJEAEEoIH+BIH7MIH4AgEBBgorBgEEAbIxAgEBMDEwDQYJYIZIAWUD
BAIBBQAEIO30d6Q/ui8NHZWLU42JHpvHqwcukBtDCZiWtieBErjfAhQJsQprheA+
j/8hfRdCJYqNwURr+BgPMjAxNjA3MjgxMTM4NDdaoIGMpIGJMIGGMQswCQYDVQQG
EwJHQjEbMBkGA1UECBMSR3JlYXRlciBNYW5jaGVzdGVyMRAwDgYDVQQHEwdTYWxm
b3JkMRowGAYDVQQKExFDT01PRE8gQ0EgTGltaXRlZDEsMCoGA1UEAxMjQ09NT0RP
IFNIQS0yNTYgVGltZSBTdGFtcGluZyBTaWduZXKgggSgMIIEnDCCA4SgAwIBAgIQ
TrCHj8wkNTay2Mn3vzlVdzANBgkqhkiG9w0BAQsFADCBlTELMAkGA1UEBhMCVVMx
CzAJBgNVBAgTAlVUMRcwFQYDVQQHEw5TYWx0IExha2UgQ2l0eTEeMBwGA1UEChMV
VGhlIFVTRVJUUlVTVCBOZXR3b3JrMSEwHwYDVQQLExhodHRwOi8vd3d3LnVzZXJ0
cnVzdC5jb20xHTAbBgNVBAMTFFVUTi1VU0VSRmlyc3QtT2JqZWN0MB4XDTE1MTIz
MTAwMDAwMFoXDTE5MDcwOTE4NDAzNlowgYYxCzAJBgNVBAYTAkdCMRswGQYDVQQI
ExJHcmVhdGVyIE1hbmNoZXN0ZXIxEDAOBgNVBAcTB1NhbGZvcmQxGjAYBgNVBAoT
EUNPTU9ETyBDQSBMaW1pdGVkMSwwKgYDVQQDEyNDT01PRE8gU0hBLTI1NiBUaW1l
IFN0YW1waW5nIFNpZ25lcjCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB
AM68dLdwgE9e8z+Yqi7L1BIBIzVpCyK85v0JbCjkExKsu7ot5dXdIu5ztiz40qRx
50kleKslt5AQoJuLdybdQOpBo/2IzXKmiTtQVxx6JSQiAlFANWeKMWkN5TlzSTmb
lQGFUvIrFImaTgSkvECuOabdQALgOnX+PX1VlFvxTiR8yLhYGcrA2r5YE5rmHOfR
wTvwXY9JCCGe0PO+1tRmT1xyNnvDgtOYCJSvq0RPGMcU2haxHjIOEjjAtTx27HVQ
ACAEERntxv/fTv4IgScxT3F0bgMMcCeBVWqaQ5Kkf9v9P8UXHkG7zuinf4yV+f1/
+GGIiQA+/wsB2/3VtaTkkRECAwEAAaOB9DCB8TAfBgNVHSMEGDAWgBTa7WR0FJwU
PKvdmam9WyhNizzJ2DAdBgNVHQ4EFgQUfb+R16dsWkdmRHuQ1I6QckGPF8IwDgYD
VR0PAQH/BAQDAgbAMAwGA1UdEwEB/wQCMAAwFgYDVR0lAQH/BAwwCgYIKwYBBQUH
AwgwQgYDVR0fBDswOTA3oDWgM4YxaHR0cDovL2NybC51c2VydHJ1c3QuY29tL1VU
Ti1VU0VSRmlyc3QtT2JqZWN0LmNybDA1BggrBgEFBQcBAQQpMCcwJQYIKwYBBQUH
MAGGGWh0dHA6Ly9vY3NwLnVzZXJ0cnVzdC5jb20wDQYJKoZIhvcNAQELBQADggEB
AFCw9d9frTPcw1NYWLzCE3V7IB1Uyro/UD+6ivRrCWPAW12L1nUac72L/0fxFdxR
FiMZMuZukk3Rxi5aHohCFMly5dcIUIpq9WRAVq4k42GXFULwLEiug+Y1PItbwo+u
jsw0UjTg+/7K/bEkaNGkESMQBv2ywiQnx9fpShyPPz7P7et1eWyOX/chtlDmJaHN
ZpQSbL/bs66H2GgDciACwn7alPNyBzxX6FUk5wWgHcSBAYJLHz8PnTOb8E/MndaF
gc/L5/1K6ZK49w1ycy3pd/lvjyh6Ph69CIbcjR4RX/dbu4d2xp5MVGHQZ9uThNox
hwOS55/j6c9aVsho4FJJlFwxggJxMIICbQIBATCBqjCBlTELMAkGA1UEBhMCVVMx
CzAJBgNVBAgTAlVUMRcwFQYDVQQHEw5TYWx0IExha2UgQ2l0eTEeMBwGA1UEChMV
VGhlIFVTRVJUUlVTVCBOZXR3b3JrMSEwHwYDVQQLExhodHRwOi8vd3d3LnVzZXJ0
cnVzdC5jb20xHTAbBgNVBAMTFFVUTi1VU0VSRmlyc3QtT2JqZWN0AhBOsIePzCQ1
NrLYyfe/OVV3MA0GCWCGSAFlAwQCAQUAoIGYMBoGCSqGSIb3DQEJAzENBgsqhkiG
9w0BCRABBDAcBgkqhkiG9w0BCQUxDxcNMTYwNzI4MTEzODQ3WjArBgsqhkiG9w0B
CRACDDEcMBowGDAWBBQ2Un1Pompo+etFlvHZmrssDqdt+jAvBgkqhkiG9w0BCQQx
IgQgXBnEfFijVzb4h7n7wGBdvQhBEzRn87M67RIUdRNe6dwwDQYJKoZIhvcNAQEB
BQAEggEACcjtqph0BQ20lchE0HZYg/4oL8KuPh1Vx5LL2cVaPcj2fruoXH58577E
XFQxhZ08HsjZtYdhVokRs2vbjM/i23HVDX+IkwGuESloFXhtoAt9hKNkyhXTtWx5
tt7TEJwi+o8/SU9bFnDqPMVn5Bg+QNnnegiCJzQ4lZnmTW2JiEmL3u7XzZ21FLZ7
KT/JqgOvBY3yvWySODN1yKVdhk5FkVKxBAxBgccPQ6nwmdm0RxqbsLdoSXFuRMi5
7sUgo113xR2VuvdJzl6d4iAYdUvuSRz94xXIQMQ9L307dKZ2yTYUQTy1YcSRxsZb
kTmkisjtzbXCyfC+AYB6dnoeBp3euQ==
--{79EAC9E2-BAF9-11CE-8C82-00AA004BA90B}--
Please edit this if I am wrong.
Is there any e-mail client that supports this, so that it can be verified?
I'm trying to send a HTTP request on Windows 8 using an IXMLHTTPRequest2 object and I want to customise the outgoing Accept-Encoding header to something other than the default value of "gzip, deflate". When I try and use SetRequestHeader method to set the Accept-Encoding header, the method call succeeds but the request is still sent with the default header value instead of the value I provided (Verified by using Wireshark to capture the HTTP request).
Sample code (simplified for brevity):
::CoCreateInstance( CLSID_FreeThreadedXMLHTTP60, NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS( &m_pXHR ));
m_pXHR->Open( L"GET", L"http://192.168.0.100/test", m_pXHRCallback.Get(), NULL, NULL, NULL, NULL );
m_pXHR->SetRequestHeader( L"Accept-Encoding", L"gzip" );
m_pXHR->Send( NULL, 0 );
Wireshark capture of request that gets made:
GET /users/me/id HTTP/1.1
Accept: */*
Host: 192.168.0.100
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
According to the docs for SetRequestHeader, it is an append operation only. You're getting the gzip Accept-Encoding, so I think it's working as intended. I don't see a way of removing the default header, however.
It seems that you can use IXMLHTTPRequest2::SetProperty() with XHR_PROP_NO_DEFAULT_HEADERS to suppress the default headers.
See: http://msdn.microsoft.com/en-us/library/windows/desktop/hh831167(v=vs.85).aspx
It seems that the IXMLHTTPRequest2 API is plainly broken, as it doesn't have a way to remove a header. Or, perhaps, documentation is broken because it doesn't mention that passing an empty string or NULL removes a header.
Also, according to IXMLHTTPRequest2::SetRequestHeader declaration:
virtual HRESULT STDMETHODCALLTYPE SetRequestHeader(
/* [ref][string][in] */ __RPC__in_string const WCHAR *pwszHeader,
/* [unique][string][in] */ __RPC__in_opt_string const WCHAR *pwszValue) = 0;
header's value is marked as optional (__RPC__in_opt_string) and can be NULL.
So, if you really want to set a header's value, the only proper way that works with IXMLHTTPRequest2 is to do this:
m_pXHR->SetRequestHeader(L"SomeMyHeader", L"");
m_pXHR->SetRequestHeader(L"SomeMyHeader", L"value");
This way you can remove, or change some default headers:
m_pXHR->SetRequestHeader(L"Accept-Language", L"");
Since this isn't documented, this may or may not work for you at some point on some particular version of Windows. If you tried to use IXMLHTTPRequest2 heavily you'd come to the same conclusion as me: it's just broken crap. This for example doesn't work:
m_pXHR->SetRequestHeader(L"Accept-Encoding", L"");
Seems that some dude who implemented IXMLHTTPRequest2 put lots of undocumented logic in there:
You can completely remove some headers if you set them to NULL or an empty string (for example, the Accept-Language header, or your own headers can be removed).
You can change, but not remove some headers (for example, User-Agent header).
You cannot change some headers at all (for example, Accept-Encoding header).
When you call ->Send on IXMLHTTPRequest2, internally they unconditionally set Accept-Encoding to whatever they feel like using. That means that you cannot add some alternative encoding like brotli without resorting to hacks and custom headers.
They should just start using libcurl and expose its API instead of shipping IXMLHTTPRequest2 in this shameful state.
I have a problem with my app that reads e-mails from external server using mailman gem (which is also using mail).
ruby 1.9.2p0
mail (2.3.0)
mailman (0.4.0)
actionmailer (= 3.1.3)
database.yml
production:
adapter: mysql2
encoding: utf8
Here is a simple method to receive mail. I build @message_body from the text_part of a multipart e-mail (e.g. one with attachments) or from the whole decoded body.
def self.receive_mail(message)
  # some code here
  @message_body = message.multipart? ? message.text_part.body.to_s : message.body.decoded
  # some code here, to save message in database
end
My problem is that if the message doesn't have attachments but has diacritics, like ą ś ł ń ż ź ó ..., the body is cut off just before the first diacritic.
So if body is:
"test żłóbek test"
I will get only "test " in @message_body.
My question is how to save such a message in an elegant way, so that text part is saved in database with all diacritics.
EDIT:
to make it cleaner, I get e-mails that look like this one (it's just a part of e-mail sent from gmail)
--20cf307ac4372d830104c11c8cc6
Date: Mon, 28 May 2012 20:06:16 +0200
Mime-Version: 1.0
Content-Type: text/plain;
charset=ISO-8859-2
Content-Transfer-Encoding: base64
Content-ID: <4fc3be989b76e_794650c25f6625e3@vk1057.some_domain>
dGVzdCC/s7zm8bbzsSB0ZXN0Cg==
So we have this 'body' : dGVzdCC/s7zm8bbzsSB0ZXN0Cg==
After decoding we get : 'test \xbf\xb3\xbc\xe6\xf1\xb6\xf3\xb1 test\n'
And the problem is that starting from '\xbf' data is not saved in database.
UPDATE
another example, I think this is the problem here:
irb(main):008:0* require 'base64'
=> true
irb(main):009:0> a = "test źćłżąńś"
=> "test źćłżąńś"
irb(main):010:0> b = Base64.encode64(a)
=> "dGVzdCDFusSHxYLFvMSFxYTFmw==\n"
irb(main):011:0> Base64.decode64(b)
=> "test \xC5\xBA\xC4\x87\xC5\x82\xC5\xBC\xC4\x85\xC5\x84\xC5\x9B"
See, after decode64 my diacritics are LOST. What can I do to get them back?
force_encoding('utf-8')
doesn't work because the data isn't UTF-8: your mail headers clearly state that the message body is ISO 8859-2.
Mysql2 assumes everything is utf8 but can't convert the bytes to utf8 (because Ruby doesn't know the original encoding), so your non-ASCII characters are thrown away by MySQL.
For that one string you could try
body.force_encoding('ISO-8859-2').encode('utf-8')
But really you want to be working out which encoding to use from the Content-Type header. I'm surprised the mail gem isn't doing that for you.
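The same point can be shown outside Ruby. In this Python sketch, the base64 body from the question decodes to raw bytes that are not valid UTF-8, so they have to be interpreted as ISO 8859-2 first and only then re-encoded:

```python
import base64

body_b64 = "dGVzdCC/s7zm8bbzsSB0ZXN0Cg=="

# Undoing the transfer encoding yields bytes with no charset attached
raw = base64.b64decode(body_b64)

# Treating the bytes as UTF-8 fails, because 0xBF cannot start a UTF-8 sequence
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Interpreting them with the declared charset recovers the text
text = raw.decode("iso-8859-2")

# Re-encoded as UTF-8, this is what a utf8 database connection expects
utf8_bytes = text.encode("utf-8")

print(valid_utf8, repr(text))
```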
I have the solution. Concatenation of
.force_encoding("ORIGINAL_CHARSET").encode("UTF-8")
methods on E-Mail body object is the solution.
I had to change my receive_mail() definition from previous 'one liner' to:
if message.multipart?
  charset = message.text_part.content_type_parameters[:charset]
  @message_body = message.text_part.body.to_s.force_encoding(charset).encode("UTF-8")
else
  charset = message.content_type_parameters[:charset]
  @message_body = message.body.decoded.force_encoding(charset).encode("UTF-8")
end
With this construct I can detect what was the charset of original e-mail and then force it and encode back to UTF-8. This ensures proper decoding from base64 from original to utf-8.
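For comparison, Python's standard email package performs the same two steps (undo the transfer encoding, then decode with the declared charset) when asked for a part's content. A sketch built from the part shown in the question; raw_part is a trimmed-down, hypothetical message, not the full original e-mail:

```python
from email import message_from_string, policy

raw_part = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=ISO-8859-2\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "dGVzdCC/s7zm8bbzsSB0ZXN0Cg==\n"
)

# policy.default gives the modern EmailMessage API with get_content()
part = message_from_string(raw_part, policy=policy.default)

# The charset is read from the Content-Type header of the part
charset = part.get_content_charset()

# get_content() base64-decodes the body and decodes it with that charset
text = part.get_content()

print(charset, repr(text))
```

The resulting string can then be written to a UTF-8 database without losing the diacritics.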
If anyone has more elegant solution, please share.