vCard parsing with different parameters

I need to write a vCard parser.
The problem is that the vCards I get can have any number of parameters, for example:
TEL;CELL:123
or
TEL;CELL;VOICE:123
or
TEL;HOME;CELL;VOICE:123
Which format I actually get depends on my sources (which can be diverse and many).
Now I need to make a generic reader which can recognise that all these different sets of parameters map to a single field (in this case, a mobile number), even though the way of sending this information varies across sources (Google, MS, Nokia).
Can someone please give a suggestion on how to handle such a situation?

vCard is a bloody mess to parse, especially since almost nothing out there produces RFC 2426-compliant output. For similar reasons I ended up writing a vCard parser / validator which you can use to massage the data into compliance. I use it daily to keep my own vCards (a few hundred people/companies) compliant, and the result has, for example, been that Gmail now imports all of them properly – addresses, phones, images and all.
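To illustrate the normalisation idea the question is after (independent of any particular library), here is a rough Python sketch: split the parameters off each TEL line and map whatever combination turns up to your own canonical field. The mapping table and the "anything containing CELL is a mobile number" fallback are assumptions for illustration, not part of the vCard specs.

```python
# Minimal sketch: map the parameter soup on a TEL line to one canonical field.
# The CANONICAL table below is an illustrative assumption, not a standard mapping.

CANONICAL = {
    frozenset({"CELL"}): "mobile",
    frozenset({"CELL", "VOICE"}): "mobile",
    frozenset({"HOME", "CELL", "VOICE"}): "mobile",
    frozenset({"HOME", "VOICE"}): "home_phone",
    frozenset({"WORK", "VOICE"}): "work_phone",
}

def parse_tel(line):
    """Parse a 'TEL;PARAM;PARAM:value' line into (canonical_field, value)."""
    name_and_params, _, value = line.partition(":")
    parts = name_and_params.upper().split(";")
    if parts[0] != "TEL":
        return None
    # Normalize both bare params ("CELL") and vCard 3.0 style ("TYPE=CELL,VOICE").
    params = set()
    for p in parts[1:]:
        p = p[len("TYPE="):] if p.startswith("TYPE=") else p
        params.update(p.split(","))
    field = CANONICAL.get(frozenset(params))
    if field is None:
        # Fallback heuristic: anything mentioning CELL is treated as a mobile number.
        field = "mobile" if "CELL" in params else "other_phone"
    return field, value

print(parse_tel("TEL;CELL:123"))                  # ('mobile', '123')
print(parse_tel("TEL;CELL;VOICE:123"))            # ('mobile', '123')
print(parse_tel("TEL;TYPE=HOME,CELL,VOICE:123"))  # ('mobile', '123')
```

A real parser also has to cope with line folding, QUOTED-PRINTABLE encoding, charsets and grouped properties, which is exactly where the per-vendor mess tends to come in.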

Related

Headless LibreOffice or OpenOffice as a PDF report generator?

I hope it’s OK to post a completely naive question here for LO or OO experts.
I’m looking for advice on whether scripting LibreOffice or OpenOffice would be suitable for the following:
General Question
I’m looking to generate PDF reports, based on a combination of a “template” and a set of data (currently in JSON format) and inserted images.
This would act as a headless service that gets invoked when necessary from a web server, when a user requests a PDF report (on linux).
We have a need to frequently modify/customise/generate new templates, hence the reluctance to go down the route of using something like Reportlab (plus I don't know Reportlab at all, so I'd face a huge learning curve that way).
Background
This is in contrast to using an approach of using a PDF library like Reportlab directly within the web server, and having to build up the template/report programmatically.
As LibreOffice/OpenOffice is obviously a lot faster for creating good-looking report "templates", this is a question about doing both the template generation and the final template + data -> PDF generation directly within LibreOffice.
Some more specifics
The data values would mostly just be substituted into fields in the template, with no or minimal processing of the values required.
However, there would be situations where some of the data is in “sets” that would be shown in a table type view, and the number of fields (and so number of table rows for instance) would need to vary per report, based on the number of values in that particular JSON data.
Additionally, I’d need to be able to include (“import”) images into the report. Some of the JSON data would be paths to image files, and I’d like to include those. Again, the number of images may vary between reports.
This wouldn't be high frequency at all, so I would not need to run LO/OO as a service, but could simply invoke it when required with a system call. Conceptually something like "LibreOffice --template 'make_fancy.report' <data.json> <output_file.pdf>"
If this approach would be reasonable in either LO or OO, what languages are best to script it in? (Hopefully Python 3.)
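One way to get a feel for whether the headless route is workable: fill an ODT "template" (an .odt file is just a zip archive, so plain placeholder substitution in its content.xml already works for simple fields), then convert the result with soffice --headless --convert-to pdf. A rough Python 3 sketch; template.odt, report.odt, data.json and the {name} placeholder convention are assumptions for illustration, not LibreOffice features.

```python
# Rough sketch: substitute JSON values into an ODT template, then convert the
# result to PDF with a headless LibreOffice call. Assumes data.json holds a
# flat dict of placeholder -> value.
import json, subprocess, zipfile

def fill_template(template_path, output_path, values):
    # .odt files are zip archives; substitute {name} markers in content.xml.
    # (Strictly, ODF wants the "mimetype" entry stored first and uncompressed;
    # LibreOffice is usually tolerant, but a robust version should preserve that.)
    with zipfile.ZipFile(template_path) as zin, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for name in zin.namelist():
            data = zin.read(name)
            if name == "content.xml":
                text = data.decode("utf-8")
                for key, val in values.items():
                    text = text.replace("{%s}" % key, str(val))
                data = text.encode("utf-8")
            zout.writestr(name, data)

with open("data.json") as f:            # hypothetical data file
    values = json.load(f)

fill_template("template.odt", "report.odt", values)   # hypothetical file names

# Headless conversion; report.pdf lands in --outdir.
subprocess.run(["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", ".", "report.odt"], check=True)
```

Plain string substitution obviously does not cover the variable-length tables or the images described above; for those you would need to drive Writer through its UNO API (which is scriptable from Python) or generate the content.xml more intelligently.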

How to create your own package for interaction with Word, PDF, etc.

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at the university) is how you create such a package? Are you always relying on source code being given by the original company (such as Adobe or Microsoft), or is there another clever way of working around it? Should I analyze the individual bytes I see in e.g. PDF files?
It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format; others provide only a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which
is not legally protected,
has no, or only sparse (official) documentation,
and is not already being deconstructed elsewhere,[a]
then yes, you need to:
1. gather as many different files as possible, from as many sources as possible (ideally, you should have at least one program that can both read and create the files);
2. inspect them on the byte level;
3. create a 'reader' which works on all of the test files;
4. if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch, or can convert data in another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, string data from floating point, and UTF-8-encoded strings from MacRoman-encoded strings (and so on).
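For a concrete taste of that kind of detective work, here is a small Python sketch that hex-dumps the head of an unknown file and applies a few crude heuristics (magic numbers, printable-ASCII ratio, UTF-8 validity). The signature table is just a handful of well-known examples, not anything exhaustive.

```python
# Crude first-pass inspection of an unknown file format: a few heuristics
# plus a classic hex dump of the header. The signature list is a small sample.
import sys

MAGIC = {
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP container (also .docx/.odt/.jar)",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF8": "GIF image",
}

def inspect(path, n=64):
    with open(path, "rb") as f:
        head = f.read(n)
    for sig, name in MAGIC.items():
        if head.startswith(sig):
            print("looks like:", name)
            break
    else:
        printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in head)
        print("no known signature; %d%% printable ASCII in first %d bytes"
              % (100 * printable // max(len(head), 1), len(head)))
        try:
            head.decode("utf-8")
            print("first bytes decode cleanly as UTF-8 (may be text)")
        except UnicodeDecodeError:
            print("not valid UTF-8; likely binary or another encoding")
    # Hex dump for eyeballing structure: offset, hex bytes, ASCII view.
    for offset in range(0, len(head), 16):
        chunk = head[offset:offset + 16]
        hexpart = " ".join("%02x" % b for b in chunk)
        asciipart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print("%08x  %-47s  %s" % (offset, hexpart, asciipart))

if __name__ == "__main__":
    inspect(sys.argv[1])
```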
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See Reverse Engineering's Reverse engineering file containing sprites for an example approach; notably, at the bottom of my answer in there I admit defeat and start using the phrases "possibly" and "may" and "probably", which is an indication I did not get any further on that.
[a] Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects – finding out and codifying fairly trivial subcases.
There are also advantages to working independently of existing projects. For example, with the experience from my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

How bad an idea is it to let users upload and store files with national characters in the filename?

Our CMS accepts files with national characters in their names and stores them on the server without a problem. But how bad is such an approach in the long run? For example, is it possible to store files with filenames in Hebrew, Arabic, or any other language with a non-Latin alphabet? Is there an established, standard way to handle these?
A standard way would be to generate unique names yourself and store the original file name somewhere else. Typically, even if your underlying OS and file system allow arbitrary Unicode characters in the file name, you don't want users to decide about file names on your server. Doing so may impose certain risks and lead to problems, e.g. caused by overly long names or file system collisions. Examples of sites that do this are Facebook, Flickr and many others.
For generating the unique file name, GUID values would be a good choice.
Store the original filename in a database of some sort, in case you ever need to use it.
Then rename the file using a unique alphanumeric id, keeping the original file extension.
If you expect many files then you should create directories to group the files. Using the year, month, day, hour and minute is usually enough for most. For example:
.../2010/12/02/10/28/1a2b3c4d5e.mp3
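A minimal sketch of that scheme, assuming Python on the server side; the storage root and the database step are placeholders for whatever your CMS actually uses.

```python
# Sketch of the "store under a generated name, keep the original elsewhere" idea.
import os
import uuid
from datetime import datetime, timezone

UPLOAD_ROOT = "/var/uploads"  # assumed storage root

def store_upload(original_name, data):
    """Save the bytes under a generated name; return (storage_path, original_name)."""
    _, ext = os.path.splitext(original_name)      # keep the original file extension
    now = datetime.now(timezone.utc)
    subdir = now.strftime("%Y/%m/%d/%H/%M")       # e.g. 2010/12/02/10/28
    generated = uuid.uuid4().hex + ext.lower()    # 32 hex chars + extension
    directory = os.path.join(UPLOAD_ROOT, subdir)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, generated)
    with open(path, "wb") as f:
        f.write(data)
    # Store (path, original_name) in your database so the Unicode name survives.
    return path, original_name

# Usage:
# store_upload("דוגמה.mp3", b"...file bytes...")
```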
Yes, I've had experience with massive mp3 collections, which are notorious for being named in the language of the country where the song originates, which can cause trouble in several places.
It's fine as long as you detect the charset it's in from the headers in the request, and use a consistent charset (such as UTF-8) internally.
On a Unix server, it's technically feasible and easy to accept any Unicode character in the filename, and then convert filenames to UTF-8 before saving them. However, there might be bugs in the conversion (in the HTML templating engine or web framework you are using, or in the user's web browser), so it might be possible that some users will complain that files they have uploaded disappeared. The root cause might be buggy filename conversion. If all characters in the filename are non-Latin, and you (as a software developer) don't speak that foreign language, then good luck figuring out what happened to the file.
It is an excellent idea. Being Hungarian, I'm pretty annoyed when I'm not allowed to use characters like áÉŰÖÜúÓÚŰÉÍí :)
There is a lot of software out there that has bugs regarding dealing with such file names, especially on Windows.
Update:
Example: I couldn't use the Android SDK (without creating a new user), because I had an é in my user name. I also ran into a similar problem with the Intel C++ compiler.
Software usually isn't tested properly with such file names. The Windows API still offers "ANSI" encoded versions of functions, and many developers don't seem to understand its potential problems. I also keep on coming across webpages that mess up my name.
I'm not saying don't allow such file names; in fact, in the 21st century I would expect to be able to use such characters everywhere. But be prepared that you may run into problems.

Optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-column format:
Sample Protocol http://sert.homedns.org/img/btp12001.png
I would love to read your answers to the following questions:
How can I split the two columns before feeding them into OCR?
Which commercial or open-source OCR software or framework do you recommend, and why?
Please note that any tool, programming language, framework etc. is fine. Don't hesitate to recommend esoteric products or libraries if you think they are cut out for the job ^__^!!
UPDATE: These documents are already scanned by the parliament o_O: sample (same as the image above). There are lots of them, and I want to deliver on the contract ASAP, so I can't go fetch print copies of the same documents and cut and scan them myself. There are just too many of them.
Best Regards,
Cetin Sert
Cut the pages down the middle before you scan.
It depends what OCR software you are using. A few years ago I did some work with an OCR API; I can't quite remember the name, but I think there are lots of alternatives. Anyway, this API allowed me to define regions on the page to OCR. If you always know roughly where the columns are, you could use an SDK to map out parts of the page.
I use OmniPage 17 for such things. It has a batch mode too, where you put the documents into one folder, from which they are grabbed, and the results are put into another.
It automatically recognizes the layout, including columns, or you can set the default layout to columns.
You can set many options for how the output should look.
But try a demo to see whether it works correctly. At the moment I have problems with ligatures in some of my documents, so words like "fliegen" come out as "fl iegen" and you have to spell-check them.
Take a look at http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (an online, REST API for OCR). It is based on the powerful ABBYY OCR engine. You can get a free account and try it with a few of your images to see if it handles the 2-column format (it should be able to do it). Also, there are a bunch of settings you can play with (see API documentation) - you may have to tweak some of them before it will work with 2 columns. Finally, as a solution of last resort, if the 2-column split is always in the same place, you can first create a program that splits the input image into two images (shouldn't be very difficult to write this using some standard image processing library), and then feed the resulting images to the OCR process.
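If you end up going the "split first, then OCR" route suggested above, the splitting part really is small. Here is a Python sketch using Pillow and pytesseract (both assumptions – the question doesn't commit to a toolchain), assuming the column break sits roughly at the horizontal midpoint.

```python
# Sketch: cut a two-column scan down the middle and OCR each half separately.
# Assumes the gutter is near the horizontal centre; a robust version would
# locate the widest vertical band of white pixels instead.
from PIL import Image          # pip install pillow
import pytesseract             # pip install pytesseract (needs tesseract installed)

def ocr_two_columns(path, lang="deu"):
    page = Image.open(path)
    w, h = page.size
    left = page.crop((0, 0, w // 2, h))
    right = page.crop((w // 2, 0, w, h))
    # Read the left column first, then the right, as on the printed page.
    text = ""
    for half in (left, right):
        text += pytesseract.image_to_string(half, lang=lang) + "\n"
    return text

print(ocr_two_columns("btp12001.png"))
```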

Extract Address Information from a Web Page

I need to take a web page and extract the address information from the page. Some are easier than others. I'm looking for a Firefox plugin, Windows app, or VB.NET code that will help me get this done.
Ideally I would like to have a web page on our admin site (ASP.NET/VB.NET) where you enter a URL and it scrapes the page and returns a DataSet that I can put in a grid.
If you know the format of the page (for instance, if they're all like that ashnha.com page) then it's fairly easy to write VB.NET code that does this:
Create a System.Net.WebRequest and read the response into a string. Then create a System.Text.RegularExpressions.Regex and iterate over the collection of Matches between that and the string you just retrieved. For each match, create a new row in a DataTable.
The tough bit is writing the regex, which is a bit of a black art. See regexlib.com for loads of tools, books, etc. about regexes.
If the HTML format isn't well-defined enough for a regex, then you're probably going to have to rely on some amount of user intervention in order to identify which bits are the addresses...
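For what it's worth, here is the same fetch-then-regex approach sketched in Python rather than VB.NET (the mechanics are identical). The address pattern and the URL are deliberately naive placeholders, since the real pattern depends entirely on the page's format.

```python
# Same fetch-then-regex idea, sketched in Python. The address pattern is a
# deliberately naive US-style "street, city, state, zip" illustration.
import re
import urllib.request

ADDRESS_RE = re.compile(
    r"(?P<street>\d+\s+[\w .]+),\s*"
    r"(?P<city>[A-Za-z .]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)"
)

def scrape_addresses(url):
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # One "row" per match, like the DataTable rows in the VB.NET description.
    return [m.groupdict() for m in ADDRESS_RE.finditer(html)]

for row in scrape_addresses("http://example.com/contact"):   # hypothetical URL
    print(row["street"], row["city"], row["state"], row["zip"])
```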
What type of address information are you referring to?
There are a couple of Firefox plugins, Operator and Tails, that allow you to extract and view microformats from web pages.
Aza Raskin has talked about recognising when selected text is an address in his Firefox Proposal: A Better New Tab Screen. No code yet, but I mention it as there may be code in firefox to do this in the future.
Alternatively, you could look at using the map command in Ubiquity, although you'd have to select the addresses yourself.
For general HTML screen scraping in VB.NET, check out HTML Agility Pack. Much easier than trying to Regex it (unless you happen to be a Regex ninja already!)
The page you mentioned in your answer would be easy to automate, as the addresses are in a consistent format.
But to allow the users to point to any page, that's a much harder job. The data could be in any format at all. You could write something to dump all the text, guess how the pieces are divided, try to recognise bits like country and state names, telephone numbers, etc., and then show your results with an interface that lets the users complete missing sections, move the dividers, and identify the bits you missed or they didn't want.
It's not simple though, and making an interface that provides a big advantage over simply cutting and pasting into validated form fields would be quite an achievement I think - I'd be interested to know how you get on!
EDIT: Just noticed this other question that might cover quite a bit of what you want to do:
Parse usable Street Address, City, State, Zip from a string