How do I parse a .eml file using C/C++? - objective-c

How do I parse .eml files using C/C++ (or preferably Objective-C)?
I looked around but couldn't find any details about the structure of the file.

EML files are plain ASCII (7-bit) files. The file format is specified in RFC 2822. The mail header will be separated from the body by an empty line.
If you will be dealing with emails that contain attachments, or characters whose value is greater than 127, you will need a base64 decoder. maybe this link will help.

Related

Fix Unicode Decode Error Without Specifying Encoding='UTF-8'

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that specify header rows and what not, idk.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust the encoding to that, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.
Fixed. Had nothing to do with unicode shenanigans, my script was writing a parquet file when my Cloud Formation Template was expecting a csv file. Thanks for the help.

Characters not displayed correctly when reading CSV file

I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
Dim test_chart As String = Array.Find(vls1load, Function(x) (x.StartsWith("sample")))
Dim test_chart_div() As String = test_chart.Split(";")
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But none of them works. The .csv file is supposed to be UTF8. The "web options" that you can see when saving the file in excel show encoding UTF8. I also tried the trick of changing the file extension to HTML and opening it with the browser to see that the encoding is also correct.
Can someone advice anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV Comma Separated output format, the Encoding selected in Tools -> Web Option -> Encoding of Excel's Save As... dialog doesn't actually generate the expected result:
the Text file is saved using the Encoding relative to the current Language selected in the Excel Application, not the Unicode (UTF16-LE) or UTF-8 Encoding selected (which is ignored) nor the default Encoding determined by the current System Language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the Name or CodePage of the Encoding used in the machine that generated the file: again, not the Encoding related to System Language, but the Encoding of the Language that the Excel Application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in Latin1 zone.
Based the symbols you're referring to, this is most probably the original encoding used.
In Windows, use the former. ISO-8859-1 is still used, mostly in old Web Pages (or Web Pages created without care for the Encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same Encoding, there are subtle differences.
If you find documentation that states the opposite, the documentation is wrong.

exporting text file with utf-8 encoding in ms access

I am exporting text files from 2 queries in ms access 2010. Queries are from different linked ODBC tables (but tables are different only by data, structure and data types are same). I set up export specification to export text file in utf-8 encoding for both files. Now here come the trouble part. When I export the queries and open them in notepad, one query is in utf-8 and second one is in ANSI. I don't know how is this possible when both queires has the same export specification and it is driving me crazy.
This is my VBA code to export queries:
DoCmd.TransferText acExportDelim, "miniflow", "qry01_CZ_test", "C:\TEST_CZ.txt", no
DoCmd.TransferText acExportDelim, "miniflow", "qry01_SK_test", "C:\TEST_SK.txt", no
I also tried to modify it by adding 65001 as coding argument by the results were same.
Do you have any idea what could be wrong?
Don't rely on the File Open dialog in Notepad to tell you whether a text file is encoded as "ANSI" or UTF-8. That is just Notepad's "guess" based on whether the file begins with the bytes EF BB BF, which is the UTF-8 Byte Order Mark (BOM).
Many (most?) Windows applications will include the UTF-8 BOM at the beginning of a text file that is UTF-8 encoded. Some Unicode purists insist, often quite vigorously, that the BOM is not required for UTF-8 files and should be excluded, but that is the way Windows applications tend to behave.
Unfortunately, Access does not always follow that pattern when it exports files to text. A UTF-8 text file exported from Access may omit the BOM and that can confuse applications like Notepad if they assume that a UTF-8 encoded file will always include the BOM as the first three bytes of the file.
For a more reliable way of determining the encoding of a text file consider using an application like Notepad++ to open the file. It will differentiate between the UTF-8 files with a BOM (which it designates as "UTF-8") and UTF-8 files without a BOM (which it designates as "ANSI as UTF-8")
To illustrate, consider the following Access table
When exported to text (CSV) with UTF-8 encoding,
the File Open dialog in Notepad reports that it is encoded as "ANSI"
but a hex editor shows that it is in fact encoded as UTF-8 (the character é is encoded as C3 A9, not simply E9 as would be the case for true "ANSI" encoding)
and Notepad++ recognizes it as "ANSI as UTF-8"
in other words, a UTF-8 encoded file without a BOM.

PDF format. function of %-started sequence

What is a function of hex sequence "25 E2 E3 CF D3", found at the beginning of some documents? It should be a comment as far as I understand, but it's content is not any meaningful text and the same sequence occurs in many documents.
It identifies the PDF file as containing binary data.
From the freely available PDF Reference (section 7.5.2, p. 40):
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the
beginning of a file to determine whether to treat the file’s contents as text or as binary.

dll files compared to gzip files

Okay, the title isn't very clear.
Given a byte array (read from a database blob) that represents EITHER the sequence of bytes contained in a .dll or the sequence of bytes representing the gzip'd version of that dll, is there a (relatively) simple signature that I can look for to differentiate between the two?
I'm trying to puzzle this out on my own, but I've discovered I can save a lot of time by asking for help. Thanks in advance.
Check if it's first two bytes are the gzip magic number 0x1f8b (see RFC 1952). Or just try to gunzip it, the operation will fail if the DLL is not gzip'd.
A gzip file should be fairly straight forward to determine as it ought to consist of a header, footer and some other distinguishable elements in between.
From Wikipedia:
"gzip" is often also used to refer to
the gzip file format, which is:
a 10-byte header, containing a magic
number, a version number and a time
stamp
optional extra headers, such as
the original file name
a body,
containing a DEFLATE-compressed
payload
an 8-byte footer, containing a
CRC-32 checksum and the length of the
original uncompressed data
You might also try determining if the gzip contains any records/entries as each will also have their own header.
You can find specific information on this file format (specifically the member header which is linked) here.