How to recover currency information from a broken data set? - text-mining

This is so not my area, so I apologize if this is not in scope for this stack.
I am cleaning up (for personal entertainment and to make a visualization to share with others) survey data (download, 9MB) that went through some manipulation to be anonymized before being released to the public.
One of the questions asked about hourly payment rate and allowed a free-form text answer. Some of those answers contain badly broken characters; the two most common cases were shown in an image (not reproduced here).
I would hate to discard those answers, but I am at a loss as to how to restore them to a meaningful state. My options so far:
Ask for a better data dump - I have poked the relevant people about it, but I am not too hopeful.
Try to determine which characters ended up this way. Dealing with encodings is always troublesome, and these don't look like any broken characters I have ever seen before, so I have no idea where to start or whether there are tools available to help with this. They might not even be valid characters or currency symbols at all.
Try to match the broken characters to valid currency symbols. I strongly suspect one of the two might be the € character and the other might be £, given that the survey was slanted towards English-speaking countries. But would I be able to reliably back up such a guess by the relative frequency of the character compared to other answers? Unfortunately, geo data was not provided, so I can't match answers to countries. (The sketch below illustrates one way to test this guess.)
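One low-effort way to test that guess (a sketch; the cp1252 codec here is my assumption, not something confirmed by the data): round-trip the suspected symbols through the wrong codec and compare the result against the garbage in the CSV.

    # Encode the suspected currency symbols as UTF-8, then decode the
    # bytes with the wrong codec (cp1252), the classic "mojibake" path.
    for symbol in ["€", "£"]:
        mangled = symbol.encode("utf-8").decode("cp1252")
        print(repr(symbol), "->", repr(mangled))
    # '€' -> 'â‚¬'
    # '£' -> 'Â£'

If the broken strings in the data match these, the round trip identifies both the original symbols and the faulty codec in one step.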

It was confirmed that this was caused by an export bug in the survey software, and the characters do correspond to the euro and pound symbols:
"As you suspected, it's a #Polldaddy export to CSV bug." - Pete Davies
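With the bug confirmed, the column can be repaired mechanically. A sketch using the ftfy library (pip install ftfy), which specializes in undoing this kind of mojibake; the example strings are illustrative, since the original image is not reproduced here:

    import ftfy

    # ftfy detects UTF-8 bytes that were decoded with the wrong codec
    # and reverses the damage.
    broken = ["â‚¬25 per hour", "Â£18 per hour"]
    print([ftfy.fix_text(s) for s in broken])
    # ['€25 per hour', '£18 per hour']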

Download OEIS sequences with a known algorithm to produce them

I was reading some interesting questions about the topic "Can we make a program that, given a particular sequence, produces the next terms?", like this one, and I really like the detailed answer to this one. I understand that the answer is "That's impossible without more restrictions", and that given some restrictions (polynomials, rational functions, or boolean maps) we know some good algorithms, as the second answer I linked explains.
Now, a natural question is how much we can solve, trying our best even if we can't always succeed, to answer the original, general question. What I usually do when facing a hard sequence is to check whether it's in the OEIS, and if it seems to be there, to see whether there is any formula or algorithm there to produce it. You can download a small version of the OEIS with the first terms of each sequence, and you can make queries to find formulas or Maple algorithms for a particular sequence. My question is: do you think it's feasible to download a small version of the OEIS that includes, along with the first terms, a small algorithm to produce each sequence?
The natural problem here is that I haven't seen any link to download the entire OEIS database with all the details, which maybe deserves its own question. Even if we had it, you would need to read the formulas/algorithms (which can be written in different languages, from what I've seen) and interpret them correctly. But I thought maybe someone here knows how to solve this. In any case, thanks in advance.
You could, as you note, download the sequences and their A-numbers from the link mentioned here: https://oeis.org/wiki/Welcome#Compressed_Versions
After searching that and finding one sequence (or a small number of sequences) of interest, you could scrape the respective page(s) for formulas. There are specific fields for Maple and Mathematica, which may be helpful, and otherwise, an entry in the PROGRAM field should include identifying information when it is not one of the standard languages with its own field in the database. See: http://oeis.org/wiki/Style_Sheet
Unofficially, but with the interests of the OEIS in mind, I would not recommend trying to download or scrape the OEIS in its entirety. Whether it's one person, or a whole host of people, we would certainly recommend using the compressed version of the database to identify sequences of interest by A-number first, then pulling their entire entry by scraping the site or querying the OEIS using methods that you have already mentioned: Programmatic access to On-Line Encyclopedia of Integer Sequences
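For the per-entry step, a minimal sketch in Python; this assumes the OEIS JSON search endpoint (fmt=json) and its current field names, so check the API documentation before relying on it:

    import requests

    def fetch_entry(a_number):
        # Query a single entry by A-number and return its JSON record.
        resp = requests.get("https://oeis.org/search",
                            params={"q": "id:" + a_number, "fmt": "json"})
        resp.raise_for_status()
        return resp.json()["results"][0]

    entry = fetch_entry("A000045")           # Fibonacci numbers
    print(entry.get("formula", [])[:2])      # first FORMULA lines
    print(entry.get("mathematica", []))      # Mathematica programs, if any

This keeps the load on the OEIS to one request per sequence you have already identified offline.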
If this sounds laborious, perhaps an alternative is the Wolfram Cloud, which achieves this through other means. For example, you can navigate to the cloud (you may have to register just to get access) at: https://www.wolframcloud.com/
Typing in something like FindSequenceFunction[{1, 2, 3, 5, 17, 305, 34865}] will give you a formula, if Wolfram/Mathematica can find one. The documentation for FindSequenceFunction can be found here: https://reference.wolfram.com/language/ref/FindSequenceFunction.html
Wolfram/Mathematica can also invoke the OEIS using packages like the one described here: https://mathematica.stackexchange.com/questions/40/is-it-possible-to-invoke-the-oeis-from-mathematica

Issues with Dates in Apache OpenNLP

For a recent project to help me learn NLP, I am working with a number of documents, each of which contains a date. What I would like to be able to do is read the unstructured data, identify the date or dates within, convert them into a numeric format, and possibly write them to the document's metadata. (Note: since the documents being used contain pseudo information, the actual metadata of the files being read in is false.)
Recently I have been attempting to use OpenNLP in conjunction with Lucene to do this, and it works to a degree.
However, if the date is written as "13 January 1990" or "2010/01/05", OpenNLP only identifies "January 1990" and "2010" respectively, not the entire date. Other date formats may have issues as well; I have yet to try them all. While I recognise that OpenNLP works on a statistical basis rather than a format basis, I can't help feeling I'm making an elementary mistake.
Am I making a mistake? If not, is there an easy way to rectify this?
I understand that I may be able to construct my own trained model based on a training data set. Is the Apache OpenNLP one freely available, so I may extend it? Are there any others that are freely available?
Is there a better way to do this? I've heard of Apache UIMA; the main reason I went for OpenNLP is its mention in Taming Text, published by Manning. I should note that the extraction of dates is the first stage of the project, and other data will be extracted later as well.
Many thanks for any response.
I am not an expert in OpenNLP but I know that the problem you are trying to solve is called Temporal Expression Extraction (because I do research in this field :P). Nowadays, there are some systems which can greatly help you in extracting and unambiguously representing the temporal meaning of such expressions.
Here are some references:
ManTIME, online demo, software
HeidelTime, online demo, software
SUTime, online demo, software
If you want a broader overview of the field, please have a look at the results of the last temporal information extraction challenge (TempEval-3, Task A).
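For quick experiments, a rule-based fallback can complement those statistical systems. This is not an OpenNLP fix, just a sketch using python-dateutil's fuzzy parser; it extracts one date per string and can misfire on date-like noise:

    from dateutil import parser

    # fuzzy=True tells the parser to skip tokens that are not part of
    # a date, so the full date is recovered from both formats.
    for text in ["Signed on 13 January 1990 in London.",
                 "Report filed 2010/01/05 by the clerk."]:
        print(parser.parse(text, fuzzy=True).date().isoformat())
    # 1990-01-13
    # 2010-01-05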
I hope this helps. :)

Soft PDF documents

In fact, we have two types of PDF documents:
Soft documents (converted from Word to PDF or from LaTeX to PDF).
Hard documents (converted from scanned images to PDF).
By the way, I am only interested in soft documents.
In fact, I am trying to conceal information (using a specific steganography method) in an existing PDF document, and I want to insert the embedded message by slightly modifying the positions of the characters. I know that within a line all characters have the same y-coordinate but different x-coordinates. So I can insert some bits by slightly modifying the x-coordinate of each character; but if I inserted bits by modifying the y-coordinates of characters within the same line, that would be easily detectable (because they share the same y-coordinate). That is why I want to insert some bits by modifying the x-coordinates of characters in the same line, and other bits by modifying the y-coordinates of characters on different lines (one character per line; I don't know whether the gap between lines remains constant). In this case, I think my method will be harder to detect.
But before attempting that, I would like answers to the following questions:
1) If we have a PDF generated by converting from Microsoft Word to PDF: does the gap between lines remain the same? And is the gap between paragraphs constant?
2) Likewise, if we have a PDF generated by converting from LaTeX to PDF: does the gap between lines remain the same? And is the gap between paragraphs constant? Please give me your opinion and a brief explanation.
3) When the text is justified, does the space between a given pair of letters remain the same? In other words, for more precision, assume the text in the PDF is "happy new year and merry christmas, world is beautiful!". Is the space between "e" and "a" in "year" the same as in "beautiful"? That is, if we have multiple words containing "ea", is the space between "e" and "a" always the same in all of them? (Assume the font does not change throughout the text.)
You might have to explain more about what you want to do; that would make it easier to give good advice. Essentially, it's important to understand the fundamental difference between applications such as Word (I'm hesitant to comment about LaTeX - I don't know enough about it) and PDF.
Word lives by words, sentences and paragraphs. Structured content is important, and how that is laid out on the page is - almost - an afterthought. In fact, while recent versions of Word are much better at this, older versions could produce a completely different layout (including pagination) simply because a different printer was selected. Trust me, I got bitten by that very badly at one point (stupid me).
PDF lives by page representation, and structure is - literally - an afterthought. When a PDF file draws a paragraph, it draws individual characters or groups of characters. Sometimes in reading order, but possibly in a completely different order (depending on many factors). There is no concept of line height attributed to a character or paragraph style; the application generating the PDF simply moves the text pointer down a certain number of points and starts drawing the next characters.
So... to perhaps answer your question partially.
If you have Word documents generated by the same version of Word on the same operating system using the same font (not a font with the same name - the same font), you can generally assume that the basic text layout rules will be the same. So if you reproduce exactly the same text in both documents, you'll get exactly the same results.
However...
There are too many influencing parameters in Word to be absolutely certain. For example, line height can be influenced by the actual words on a line. Having a bold word or a word in another font on a line (symbols can count!) can influence the amount of spacing between those particular lines. So while there might be overall the same distance between lines, individual lines may differ.
Also, for example, word spacing is something that can quite easily be influenced by character styles and by text justification, as can inter-character spacing.
As for your question 3): apart from the fact that character spacing may change what you see, it's fair to assume that, all things being equal, the combination "ea", for example, will always have the same width. There are two types of fonts:
1) Those that define only character widths, which means that each combination of "ea" would logically always have the same width.
2) Those that define character widths plus specific kerning for specific character pairs. But because such kerning applies to specific character pairs, the distance between "e" and "a" would still always be the same. (The sketch below shows how to check this empirically.)
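A sketch of that empirical check, using the pdfminer.six Python library ("sample.pdf" stands in for your own test file): it prints every glyph with its x/y origin, so you can compare the spacing of each "ea" pair across words.

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    def glyphs(obj):
        # Recursively yield every character object in the layout tree.
        if isinstance(obj, LTChar):
            yield obj
        elif hasattr(obj, "__iter__"):
            for child in obj:
                yield from glyphs(child)

    for page in extract_pages("sample.pdf"):
        for ch in glyphs(page):
            print("%r x0=%.2f y0=%.2f" % (ch.get_text(), ch.x0, ch.y0))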
I hope this makes sense, like I said, perhaps you need to share more about what you are trying to accomplish so that a better answer can be given...
@David's answer and @Jongware's comments on it have already answered your explicit questions 1), 2), and 3). In essence, if you have an identical software setup (and at least in the case of MS Word this may include system resources not normally considered), a source document (Word or LaTeX) is likely to produce identical output as far as glyph positions are concerned. But small patches, perhaps delivered as security updates from the manufacturer, may give rise to differences in this respect, most often minute but sometimes making lines or even pages break at different positions.
Thus, concerning your objective
to conceal information (by using a specific steganography method...) in an existing PDF document, [...] to insert the embedded message by slightly modifying the position of the characters.
Unless you want multiple identical software setups as part of your security concept, I would propose that you do not try to hide the information as a difference between your manipulated PDF and the unmanipulated PDF, but instead hide it in less significant digits (e.g. encoding bits by making those digits odd or even, either before or after transformation with a given precision) in your manipulated documents, making comparison with an "original" unnecessary.
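A toy illustration of that parity idea (the numbers and the two-decimal precision are hypothetical; a real implementation would rewrite the coordinate operands inside the PDF content streams):

    def embed_bit(coord, bit, precision=2):
        # Force the last retained decimal digit to the desired parity.
        scaled = round(coord * 10**precision)
        if scaled % 2 != bit:
            scaled += 1
        return scaled / 10**precision

    def extract_bit(coord, precision=2):
        return round(coord * 10**precision) % 2

    x = embed_bit(123.456, 1)     # -> 123.47, odd last digit encodes a 1
    print(x, extract_bit(x))      # 123.47 1

Extraction then needs only the stego document itself, not a pristine original.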
For more definite propositions, please provide more information, e.g.:
who the information shall be concealed from: how knowledgeable and resourceful are they?
how the information extraction shall be possible: by visual comparison? By some small program runnable on any computer? By a very well-defined software setup?
what post-processing steps shall remain possible without destroying the hidden information: shall e.g. signing by certain software packages be possible? Such post-processors sometimes introduce minor changes, for example by parsing numbers into float variables and writing them back from those floats later.

Input multiple answers in SPSS

I've faced some problems with how to input data in SPSS when it comes to multiple answers. Let's say the question is like this:
What is the main mode of access to these online courses? (you may choose more than one answer if applicable)
Wired campus network
Wireless campus network
Mobile broadband
Wired broadband/ADSL
Mobile packet data
And the student selects more than one answer. So how can I input all of this in SPSS? This is different from a scaling question, where each parameter has a scale. It is only one question, but with multiple answers... I really don't know how to find the solution. I've asked many people, referred to books, and searched the internet, but all of that was not enough and I haven't found an answer until now.
These are sometimes referred to as multiple response sets. You would typically have a separate variable (i.e. column) for each potential answer, and then use some type of integer representation for whether a person checked that response. Most frequently people use 0 for when they did not check the response and 1 for when they did. Afterwards you can define multiple response sets through a GUI dialog, and this is useful when generating tables. (The sketch below shows the resulting layout.)
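A sketch of that layout, built here with Python/pandas before the data goes into SPSS (the answer texts and the ";" delimiter are made up for the example):

    import pandas as pd

    # One raw multi-select answer per student, delimiter-separated.
    raw = pd.DataFrame({"access": [
        "Wired campus;Mobile broadband",
        "Wireless campus",
        "Mobile broadband;Mobile packet data",
    ]})
    # One 0/1 indicator column per possible answer.
    print(raw["access"].str.get_dummies(sep=";"))
    #    Mobile broadband  Mobile packet data  Wired campus  Wireless campus
    # 0                 1                   0             1                0
    # 1                 0                   0             0                1
    # 2                 1                   1             0                0

Each indicator column becomes one SPSS variable, and the set of them is what you declare as a multiple response set.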
Googling for "multiple response sets SPSS" seems to bring up a lot of useful resources. I also know John Hall has posted tutorials for multiple response sets in SPSS that may be useful.

What are the things we should consider while writing a Spell Checker?

I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those "equivalent words"? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look at existing implementations like Aspell or netspell, for two main reasons:
Not much point in reinventing the wheel. Spell checking is much trickier than it first appears, and it makes sense to build on work that has already been done.
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it is probably better to precompute all possible nearby mistaken words and use a table/hash lookup (see the sketch just after this list).
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
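A sketch of the precomputation mentioned in the first bullet (Norvig's classic edits1 idea; the three-word dictionary is made up):

    import string

    def edits1(word):
        # All strings one delete/transpose/replace/insert away from word.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b
                    for c in string.ascii_lowercase]
        inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
        return set(deletes + transposes + replaces + inserts)

    dictionary = ["cat", "dog", "bird"]
    lookup = {mistake: word for word in dictionary for mistake in edits1(word)}
    print(lookup.get("cta"), lookup.get("brid"))   # cat bird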
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit distance is the theory you need to write a spell checker; a minimal implementation is sketched below. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
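A minimal sketch of the standard dynamic-programming edit (Levenshtein) distance:

    def levenshtein(a, b):
        # Number of single-character deletes, inserts, and substitutions
        # needed to turn a into b, computed one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # delete ca
                                curr[j - 1] + 1,            # insert cb
                                prev[j - 1] + (ca != cb)))  # substitute
            prev = curr
        return prev[-1]

    print(levenshtein("recieve", "receive"))   # 2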
I just finished implementing a spell checker and used a combination of the following to get a list of "suggested" words:
Phonetic hashing of the "misspelled" word to look up real dictionary words with an identical phonetic hash (for Java, check out Apache Commons Codec for a suitable library). The phonetic hashes of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive, so you need to reduce the list first with something like a phonetic hash, assuming a higher-volume load - in my case, a server-based spell check).
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language.
Essentially, I weighted each potential word primarily based on edit distance and commonality. E.g., if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top suggested result).
There may be better ways, but this worked pretty well.
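A sketch of that weighting (the probabilities are made up, and levenshtein refers to the function sketched earlier on this page):

    # Known misspellings always win; otherwise rank by the weight above.
    KNOWN = {"recieve": "receive"}
    WORD_PROB = {"receive": 0.012, "relieve": 0.008}   # made-up percentages

    def suggest(word, candidates):
        if word in KNOWN:
            return [KNOWN[word]]
        scored = sorted((levenshtein(word, c) * 100 / WORD_PROB[c], c)
                        for c in candidates)
        return [c for _, c in scored]                  # lower weight first

    print(suggest("recieve", ["receive", "relieve"]))  # ['receive']
    print(suggest("releve", ["receive", "relieve"]))   # ['relieve', 'receive']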
You may also wish to ignore ALL CAPS words, initials, etc., so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?