How to handle carriage return, linefeed within a quoted string - azure-data-lake

Multiple source systems that I want to process with Azure Data Lake contain carriage return/linefeed characters within a column.
This causes Extract in ADLA to fail with the following error:
E_RUNTIME_USER_EXTRACT_UNEXPECTED_ROW_DELIMITER
I am trying to find a working configuration that avoids this issue. The native Extractor documentation on Microsoft.com describes this:
Note that the rowDelimiter character inside a quoted string will not
be escaped and will be used as a row separator which will lead to
incorrect or failing extractions.
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extractor-parameters-u-sql
Unfortunately this fails to mention a good workaround.
I tried switching to another format such as ORC or Parquet. However, for the time being, these do not seem to be fully supported yet. As this limits the functionality of the development environment, I would rather not use these formats for now.
This issue seems very likely to occur, yet I am unable to find a good solution. What is a good, standard way to work around this issue while still keeping the convenience of storing files as CSV/TSV?

I've accomplished this by creating a custom extractor based on a third-party CSV parser: specifically, the CsvParser class from Josh Close's fantastic CsvHelper library. Works like a charm. Don't forget to set AtomicFileProcessing = true.
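For illustration only (the actual extractor is C# code running inside U-SQL, which isn't shown here), the core idea is simply to let a quote-aware CSV parser decide where records end instead of splitting on the row delimiter. A minimal Python sketch of that difference:

import csv, io

# One quoted field contains a CR/LF -- exactly what trips up the built-in extractor.
data = 'id,name,comment\r\n1,Alice,"line one\r\nline two"\r\n2,Bob,"plain"\r\n'

# Naively splitting on the row delimiter produces 4 "rows" and breaks the quoted record.
print(len(data.strip().split('\r\n')))   # 4

# A quote-aware parser (the role CsvHelper plays in the custom extractor) sees 3 records.
rows = list(csv.reader(io.StringIO(data, newline='')))
print(len(rows))                         # 3 (header + 2 data rows)
print(repr(rows[1][2]))                  # 'line one\r\nline two'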


Vim - proper handling of multiple syntax in a single shell script

I write many scripts where complex awk logic is incorporated directly into the script. I prefer the "one-stop-shop" approach rather than keeping those logic segments in external files.
I have tried what I have seen in this reference about specifying the multiple syntaxes that are to be evaluated (and hopefully colorized per syntax colorscheme). I tried the suggested method both as a Vim command and as a setting in the .vimrc file. The use of "setfiletype sh.awk.sed" did not work in either case.
Is there a special way to make this work properly?
Another reference seems to provide what looks like awk-related syntax declarations. However, those are presented out of context, and it is not clear whether the information provided is complete and self-contained or an extract from an unknown, unpublished larger syntax definition, nor whether it should be added to the sh.vim or the awk.vim syntax files.
Can anyone shed light on this?
This first image is a test file created to show my colorscheme use-cases for Bourne shell.
This second image is a sample of awk logic displayed by the same scheme.

Is there a way to get exported constants from an Objective-C framework? [duplicate]

I'm trying to find a constant (something like a secret token) inside an iOS app in order to build an app using an undocumented web API (by the way, I'm not into anything illegal). So far, I have the decrypted app executable on my Mac (jailbreak + SSH + dumping the decrypted executable as a file). I can use the strings command to get a readable list of strings, and I can use the class-dump tool (http://stevenygard.com/projects/class-dump/) to get a list of interface definitions (headers) of the classes.
Although this gives me an idea of the app's inner workings, I still can't find what I'm searching for: the constants. There are literally thousands of string definitions in the strings output. Is there any way to dump the strings so that I get the names of the NSString constants together with their values? I don't need the implementation details of the methods; I know that it's compiled and all I can get is assembly code. But if I can get the names of the string constants (both in the strings dump and the class dump) and also the string values (in the strings dump), I think there may be a way to associate them with each other.
Thanks,
Can.
Unfortunately, no, unless there's some black magic tool out there that I'm unaware of, or unless the executable was built with debug symbols (which is likely not the case). If there are debug symbols, you should be able to run it through a debugger and get variable names.
At compile time, the compiler strips off the name of the constant and replaces all occurrences of the constant in the code with the address of its location in memory (which is usually the same byte offset as inside the executable). Because of this, the original variable name of the constant is lost, leaving only the value. Hence you can't find the constants anywhere.
Something I would do to try to find the secret token is to capture all the data traffic that the app creates, and then look for the same patterns in the binary. If the token is indeed in there, and it isn't obfuscated somehow, then at least that narrows it down for you greatly.
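As a rough illustration of that last step (the file name and token value are made up, and this is plain Python rather than anything iOS-specific), once a candidate token shows up in the captured traffic you can search the decrypted binary for the same bytes:

token = b'candidate-token-from-traffic'      # hypothetical value seen in a captured request
with open('DecryptedApp', 'rb') as f:        # hypothetical path to the dumped executable
    blob = f.read()
offset = blob.find(token)
if offset != -1:
    # Show a little surrounding context so you can see what the token sits next to.
    print(hex(offset), blob[max(0, offset - 32):offset + len(token) + 32])
else:
    print('not found (possibly obfuscated or encoded differently in the binary)')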
Good luck! RE can be very rewarding but sometimes it really sucks.

Workflow / best practices for XLIFF

I am using a command line tool (ng-xi18n) to extract the i18n strings from an Angular 2 app I wrote. The output of this command is a messages.xlf file. Coming from a .po background, and being unfamiliar with .xlf, I assumed that this file is the equivalent of the .pot file (correct me if I am wrong).
I then assumed that if I want to translate my app, I had to cp messages.xlf messages.de.xlf to have a copy (messages.de.xlf) of the template file (messages.xlf) where I can translate each message into German (hence the .de.xlf).
After translating some dummy texts and running the app, I saw that it worked as expected, so I stopped translating and continued developing the app. After some time, I added more i18n strings, and eventually realized that I had to update my template. This is where things became hard to maintain: I updated the template messages.xlf file, and quickly found myself wondering how I could merge the new strings into my already translated messages.de.xlf file without losing my progress.
When I was developing with .po files, this was no problem thanks to good tools like poEdit, but I didn't find anything comparable for .xlf. After trying some tools, I thought that the best choice would be Lokalize, but I didn't find a way to merge the template file into already translated (but outdated) files either.
Up to now, this was rather an essay than a question, so here's a quick summary:
Is the workflow of dealing with .xlf files really comparable to .po as I initially thought (described above), or is it completely different?
How am I supposed to update my already translated files?
What are the best practices for dealing with .xlf files?
What are proof-of-concept tools for working with .xlf?
Sidenotes:
The Lokalize handbook was not helpful at all. I see a lot of functions that sound promising, like:
"File" > "Update file from template". I did not find anything in the handbook to explain this function. If I click on this, nothing happens.
"Sync" > "Open file for sync/merge". This seems to be a function to merge two similar files (by multiple translators) rather than a tool to update the translation file from a template. Even though there is a tooltip in Lokalize's primary sync tab, notifying me about "x unmatched entries", I just couldn't find anything to append those unmatched entries to my .de.xlf file.
[Update] It turns out I had issues similar to the ones in this question. After downgrading my version of Lokalize to the suggested one, many issues (including the ones mentioned in that question) disappeared. However, now the "Update file from template" option is greyed out, and I don't know why.
I also tried OmegaT, which does not work at all on my platform (Ubuntu 16.04).
[Update] Virtaal works great for merging new strings from a template, but the UI in general is very poorly designed...
Googling did not help, as every hit seems to be related to XCode or something.
Thanks for any help in advance, I really appreciate it
I wrote a small npm command line tool called xliffmerge.
In principle it does the same thing that Roland Oldengarm does with the gulp tasks described in his blog article.
It is free and you can have a look at it at https://github.com/martinroob/ngx-i18nsupport#readme
The best workflow automation solution I have seen described so far is from Roland Oldengarm's blog entry "Angular 2: Automated i18n workflow using gulp". To summarize, in a few dozen lines of Gulp code he created the tooling to handle some of the challenges you faced. Specifically it runs ng-xi18n to extract the messages; creates an English translation with sources copied to targets; updates existing translations by adding new trans-units, keeping existing ones, and removing missing ones; and then exposes all xlf files as TypeScript string constants. These last strings can then be imported to supply the bootstrapModule with its translation provider options.
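To make the "update existing translations" step concrete, here is a rough sketch of that merge in Python (this is not Oldengarm's gulp code; the file names follow the question, and it assumes Angular's XLIFF 1.2 output where the trans-unit elements sit directly under <body>):

import xml.etree.ElementTree as ET

NS = {'x': 'urn:oasis:names:tc:xliff:document:1.2'}
ET.register_namespace('', NS['x'])

def merge(template_path, translation_path, out_path):
    template = ET.parse(template_path)
    translation = ET.parse(translation_path)

    def index(tree):
        body = tree.find('.//x:body', NS)
        return body, {u.get('id'): u for u in body.findall('x:trans-unit', NS)}

    _, t_units = index(template)
    d_body, d_units = index(translation)

    # Drop trans-units that no longer exist in the template.
    for uid, unit in list(d_units.items()):
        if uid not in t_units:
            d_body.remove(unit)

    # Append new trans-units from the template; existing translations stay untouched.
    for uid, unit in t_units.items():
        if uid not in d_units:
            d_body.append(unit)

    translation.write(out_path, encoding='utf-8', xml_declaration=True)

merge('messages.xlf', 'messages.de.xlf', 'messages.de.xlf')

Running something like this after each ng-xi18n extraction leaves only the newly appended trans-units to translate.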
Caveat: I have not used this exact solution (and code) myself, but I was able to expose generated xlf as TypeScript strings and use them in an app in a manner similar to what he described. As for maintaining translations, I have leveraged IntelliJ IDEA (WebStorm) file comparison features and Counterparts Lite (for Mac) for that. My own efforts are still in early stages but are working end to end for an application that is in active development.
Official Angular docs are now updated for Internationalization (i18n) at https://angular.io/docs/ts/latest/cookbook/i18n.html including a section specifically for creating a translation source file with the ng-xi18n tool.

Semantic equivalence of reformatted (PL/)SQL code

Background
Hundreds of database objects (views, packages, stored procedures, etc.) in a system have no formatting and no source code comments. We'd like to:
Automatically reformat the code (using the General SQL Parser).
Automatically copy a standard comment header into each object's source file.
Problem
We cannot push such sweeping changes into production without them being tested.
Question
How would you verify that the reformatted source code is functionally identical to the un-formatted code?
Thank you!
Easy:
Run the unformatted code against a fresh new database
Run the formatted code against a fresh new database
Do a full export of both and compare the two files
They should be identical.
The reason they should be identical is that postgres parses the SQL into its standard, canonical form, so even adding unnecessary brackets for example should result in the same internal version of the code.
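As a sketch of the "export and compare" step, staying with this answer's Postgres framing (the database names are hypothetical), in Python:

import difflib, subprocess

# Schema-only dumps of the two databases: one loaded with the unformatted code,
# one loaded with the reformatted code.
dump_a = subprocess.run(['pg_dump', '--schema-only', 'db_unformatted'],
                        capture_output=True, text=True, check=True).stdout
dump_b = subprocess.run(['pg_dump', '--schema-only', 'db_reformatted'],
                        capture_output=True, text=True, check=True).stdout

diff = list(difflib.unified_diff(dump_a.splitlines(), dump_b.splitlines(), lineterm=''))
print('identical' if not diff else '\n'.join(diff))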
I assume that formatting is only required for the objects that need to be modified. If so, I recommend formatting only the object you are working on, so that in the end only that result is put into production. I use Oracle SQL Developer, and it is safe for me to work on a program unit and format it.
To your question: to compare formatted source code with un-formatted source code, you would have to tokenize each of them and compare the results, which in fact defeats your original objective. ;-)

How to convert Unicode strings (\u00e2, etc) into NSString for display?

I am trying to support arbitrary Unicode from a variety of international users. They have already put a bunch of data into SQLite databases on their iPhones, and now I want to capture that data into a database, then send it back to their device. Right now I am using a PHP page that sends the data back to the device from an internet MySQL database. The data is saved in the MySQL database properly, but when it's sent back it comes out as escaped Unicode text, such as
Frank\u00e2\u0080\u0099s iPad
instead of just
Frank's iPad
where the apostrophe should really be a curly apostrophe.
The answer posted to another question indicates that there are no built-in Cocoa methods to convert the "\u00e2\u0080\u0099" portion of the Unicode string from the webserver into an NSString object. Is this correct?
That seems really surprising (and scarily disappointing), since Cocoa definitely allows input of many different Unicode characters, and I need to support any arbitrary language I have never heard of, with all of its possible characters. I save them to and from the local SQLite database just fine now, but once I send the data to a web server, then perhaps pull down different data, I want to ensure the data pulled from the web server is correctly formatted.
[...] there is no built-in Cocoa methods to convert [...]. Is this
correct?
It's not correct.
You might be interested in CFStringTransform and its capabilities. It is a full-blown ICU transformation engine, which can (also) perform the transformation you need.
See Using Objective C/Cocoa to unescape unicode characters, ie \u1234
All NSStrings are Unicode.
The problem with the “Frank\u00e2\u0080\u0099s iPad” data isn't that it's Unicode; it's that it's escaped to ASCII. “Frank’s iPad” is valid Unicode in any UTF, and is what you need.
So, you need to see whether the database is returning the data escaped or the PHP layer is escaping it at some point. If either of those is the case, fix it if you can; the PHP resource should return UTF-8/16/32. Only if that approach fails should you seek to unescape the string on the Cocoa side.
You're correct that there is no built-in way to unescape the string in Cocoa. If you get to that point, see if you can find some open-source code to do it; if not, you'll need to do it yourself, probably using NSScanner.
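For reference, the example string is doubly mangled: the server escaped the individual UTF-8 bytes of the curly apostrophe (U+2019) as \u00e2 \u0080 \u0099. Any unescaper therefore has to do two steps, which a quick Python sketch makes visible (illustration only, not Cocoa code):

raw = 'Frank\\u00e2\\u0080\\u0099s iPad'                    # the literal text received from the server
unescaped = raw.encode('ascii').decode('unicode_escape')   # -> 'Frankâ\x80\x99s iPad'
fixed = unescaped.encode('latin-1').decode('utf-8')        # -> "Frank's iPad" with a curly apostrophe
print(fixed)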
Check that your web service response has a Content-Type header with a charset, and that the XML has its encoding specified. In PHP you need to add the following before printing the XML:
header('Content-type: text/xml; charset=UTF-8');
print '<?xml version="1.0" encoding="UTF-8"?>';
I guess there is just no encoding specified.