List of reserved words in Rails 3

Is there a good, authoritative list of reserved words for Rails 3?
Candidates:
http://oldwiki.rubyonrails.org/rails/pages/ReservedWords, but it seems a bit out of date and Rails 2-era.
http://cheat.errtheblog.com/s/rails_reserved_words (but there seems to be no authority behind this - it could just grow and grow...)
http://latheesh.com/2010/02/02/rails-reserved-words/
Background: I'm maintaining a long-serving Rails app and it has plenty of usages of reserved words (judging by http://oldwiki.rubyonrails.org/rails/pages/ReservedWords, which seems to apply to Rails 2). However, none of these are actually interfering with current activity (the app works... and the specs & features I'm slowly adding all pass). But as time passes I'd like to remove those usages of reserved words - though I don't want to bother if some reserved words are no longer really reserved. So while a longer list might be good for NEW Rails apps, I need stronger justification for budget spend than "it has been listed on a webpage at some point"...
Maybe the nature of Rails is that you can't find an authoritative list, only "things that didn't work for me at some point"...

This seems like a pretty comprehensive list. You are correct, though: the very nature of programming means that this list will be ever-changing. I will do some research... maybe there is a site that lists reserved words based on the version of Rails you are using. If not... maybe someone should get working on it :D

Here's a good collaborative list of reserved words in Rails: https://reservedwords.herokuapp.com/

Related

Lucene query-time boosting culture code

I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. As you can see from the Luke tool's Explain output, that isn't doing anything to boost the documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have an academic question of how the fields are different.
Even though the standard analyzer ignores hyphens, I don't believe it will treat the two parts of your culture code as separate terms. Therefore, under normal circumstances, a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with en.
However, in your case you want to match the end of the term. Unfortunately, Lucene syntax doesn't support wildcards at the start of terms (according to this reference). Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the Whitespace analyzer fits my needs best. I've just tried your scenario using the Whitespace analyzer and found that vpn (_culture:en-jp)^4 will give you what you need.
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!

Rails's way of 'convention over configuration'; are there pitfalls?

After many years of C/C++, PHP, some Ruby and other languages on the one hand, and different projects with different frameworks on the other, I now want to learn Rails.
After working through the (Getting Started) Guides, I think Rails is powerful and fairly easy to learn, and I feel ready to start on a non-bookshop app.
But a friend warned me about Rails's 'convention over configuration' and the way it 'does things'. I can't see a 'problem' with that, but are there pitfalls?
And: are there things Rails does very differently than other frameworks?
You're going to either get zero replies or a bunch of opinions. I would recommend googling for "rails is opinionated". Hopefully that will turn up more examples of what you might run into.
Is it a problem? No, not really. Can it be a problem? Yes, absolutely.
Integrating with legacy databases can be a PITA sometimes. Or if you have some insane desire to name all your primary keys something other than "id", that can be a problem.
Not so much a problem really, but you're fighting a lot of convention.
Really, other than legacy databases, I can't think of anything off the top of my head that bothers me about its conventions.
What your friend says is only somewhat true. Rails's naming conventions are powerful and keep your brain free for other things.
But: if you think you have experience ... and you are learning Rails -> you are back in school.
Rails's 'naming conventions' are not only conventions. In a way, they are Rails. So if you break the conventions, you are off road and soon in the middle of nowhere. I think this part of Rails could be better pointed out in the Guides.
Let me give an example: (you are tired of "books" and start with a little app around 'Pubs')
You scaffold your Pup (intended typo)
You then put in some logic, do some work, and then you realize your (oops) typo. Now dark clouds arise. Since you are experienced, you start correcting the typo ... PupsController -> PubsController, the filename of PubsController (you are already off road) ...
You will end up at the database table 'pups' ... (middle of nowhere)
This happened because you think you are experienced. A beginner would have built a new scaffold (without the typo) or asked here on SO how to correct it correctly.
Another example is naming things 'more nicely'. After years and many projects, you are probably someone who never uses "unspecified" names like 'user', 'role', 'guest', 'owner' for classes and so on. So you start to name them (nicely?): PubUser, PubOwner, ... Nobody told you "DON'T".
You put everything in a namespace (there are many people here saying "don't") with the nice name 'PubApp'.
Although your files are well organized, you will end up with table names like pub_app_pub_owners and so on, not to mention the names of the join tables between them.
And later on you will type something like
link_to 'add', new_pub_app_pub_pub_guest_url
link_to 'add', new_pub_app_pub_owner_pub_url
This is probably not what you intended when you set out to make things 'clean'. And if you take a look at the beginner's version...
link_to 'add', new_pub_guest_url
I do not want to prefer one way or the other.
I want to point out that - since you are not experienced with Rails - you don't know where the things you are doing (off road) will lead you, with only a hard way back.
That is something of a pitfall.
But next time you will know about it and make a compromise: 'Pubowner' and 'Guest' (and 'pa' as the namespace, if you really want one),
so
link_to 'add', new_pa_pubowner_guest_url
is not so bad. But it's hard to reverse these things, so think before you start ...
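To make the table-name effect concrete, here is roughly what the generators produce when you scaffold inside a namespace (a sketch assuming Rails 3 era conventions; PubApp/PubOwner are the invented names from above):
# app/models/pub_app.rb - generated alongside namespaced models;
# this prefix is what drives the long table names:
module PubApp
  def self.table_name_prefix
    'pub_app_'
  end
end

# app/models/pub_app/pub_owner.rb
module PubApp
  class PubOwner < ActiveRecord::Base
    # convention: table 'pub_app_pub_owners'
  end
end

# config/routes.rb - the namespace produces the long URL helpers:
namespace :pub_app do
  resources :pub_owners
end
# -> new_pub_app_pub_owner_url, edit_pub_app_pub_owner_url, ...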
When writing an application using any framework, it may be necessary to write a lot of configuration code. However, if we follow the standard Rails conventions, we can avoid excess configuration - in some cases, any configuration at all. Explicit configuration is then needed only where you can't follow the standard convention.
Following are the conventions provided by Rails:
Naming conventions: Active Record uses some naming conventions to find out how the mapping between models and database tables should be created. Rails will pluralize your class names to find the respective database table. So, for a class Book, you should have a database table called books.
Example:
Database Table - Plural with underscores separating words (e.g., book_clubs).
Model Class - Singular with the first letter of each word capitalized (e.g., BookClub).
Schema Conventions: Active Record uses naming conventions for the columns in database tables, depending on the purpose of these columns.
Foreign keys - These fields should be named following the pattern singularized_table_name_id (e.g., item_id, order_id). These are the fields that Active Record will look for when you create associations between your models.
Primary keys - By default, Active Record will use an integer column named id as the table's primary key. When using migrations to create your tables, this column will be automatically created.
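Put together in code, the conventions above look like this (a minimal sketch; BookClub/Book are the examples from the text):
class BookClub < ActiveRecord::Base  # maps to table 'book_clubs', primary key 'id'
  has_many :books                    # expects the foreign key 'books.book_club_id'
end

class Book < ActiveRecord::Base      # maps to table 'books'
  belongs_to :book_club              # looks for its own 'book_club_id' column
end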
There is a point I had never hit before that cost me a nerve or two: Rails caches a lot - even in the development environment (for God's sake it does!).
Let me construct a scenario (no, not a pure construct; it happened to me in a more complex variant):
After enough hours of work, and having done all that was on the plan for the day, you close with a little cleanup. You check everything is still working - smile - and off.
sleep
Back at the computer you start everything up and get a 'constant xy is not ...'. But... but why, overnight?
The answer is easy (if someone tells you at least once): Rails's caching does not (or rather can't) check whether a class / file / method was just removed, as opposed to altered (sometimes ...).
So the (one too many) deleted file removed the class it contained from the world, but not from Rails's cache. Powering off did the rest.
I have had more subtle situations, which I solved with a Rails server restart after looking out of the window to check whether I was still on earth ...
What I am trying to point out is that there is no magic behind it, but be warned if you think you are smart enough to touch the framework code (why and however you want to do that). The big cache gets you back to where you started, at beginner level.
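For reference, the flag behind this behaviour in development is shown below (Rails 3; the exact reload semantics vary between versions, so take this as illustration only):
# config/environments/development.rb
# false = reload changed application classes on each request.
# A file you delete, however, can leave its class alive in the running
# process; the 'uninitialized constant' error only surfaces on the next
# cold start - exactly the overnight effect described above.
config.cache_classes = false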

Fuzzy search in SQL

I am trying to map information of Linux packages (name + version) to their corresponding CPE strings (see http://nvd.nist.gov/cpe.cfm) in order to be able to automatically find possible vulnerabilities of a system.
There is an XML document provided by NIST which contains all relevant CPEs. I thought about parsing this information into an SQL database so I can quickly search by name and version number. That would be some 70,000 rows.
The problem now is, of course, that there are variations in the spellings of the CPEs and the package names. For example, the CPE for Tomcat 6.0.36 would be cpe:/a:apache:tomcat:6.0.36, so you have the name tomcat and the version 6.0.36. Now, the package manager could give you something like tomcat6 for the name and 6.0.36-3 for the version. It's likely that both programs are the same or at least have the same vulnerabilities. So I need to be able to automatically identify the above-mentioned CPE as the correct one for my tomcat package.
The first thing to do would be some kind of normalization, maybe converting everything to lowercase. But as you can see from the example, that's not enough. I need some kind of fuzzy search. From what I've found so far, there are solutions for identifying matches in the case of misspellings. That is not exactly what I need, though. The package names are not misspelled, but may contain additional characters (or miss some).
The fuzzy search must also be relatively fast, since I need to execute it for multiple hosts, each of which could have a few hundred packages installed, and as I said, the database would have around 70,000 rows. I can introduce a primary lookup which tries to find an exact match first, but since I suspect many packages will not have any corresponding CPE string, that will not reduce the volume too dramatically.
Another constraint is that the solution should be working on a non-proprietary database, since I don't have the financial means for anything else.
So, is there anything that matches these requirements? Or can you think of any solution to my problem except some kind of fuzzy searching?
Thanks in advance!
A general comment, first. The CPE nomenclature seems to have evolved organically, often depending on the vendors' (inconsistent) nomenclature. For example, Sun Java has major.minor.point_version. Adobe uses major.minor.point.subpoint. Microsoft operating systems use Service Packs_Language Packs. Some other vendors would use point releases with mostly numbers but occasional letters sprinkled in (e.g., .8, .9, .9R2, .10).
When I worked on the stated problem, I started from their XML files and manipulated them in Excel, splitting on the periods. Then I would sort either numerically (if the segments were all numeric) or as text strings. (Note that letters sprinkled into mostly-numeric segments cause havoc, and that .10 comes lexically before .8.)
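To illustrate the lexical-versus-numeric point, a quick Ruby sketch (the version strings are invented):
versions = ['6.0.8', '6.0.9', '6.0.10', '6.0.36']
versions.sort
# => ["6.0.10", "6.0.36", "6.0.8", "6.0.9"]   (lexical: .10 sorts before .8)
versions.sort_by { |v| v.split('.').map(&:to_i) }
# => ["6.0.8", "6.0.9", "6.0.10", "6.0.36"]   (numeric, segment by segment)
# A stray letter in a segment ('9R2'.to_i tolerates it but silently drops
# the 'R2') is exactly the havoc mentioned above.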
This inconsistency is why third-party software vendors have sprouted like mushrooms after a spring rain. Companies would rather pay the software vendors than untangle this Gordian knot.
If you want a truly fuzzy search, please take a look at this question about using Soundex. Expect to get a lot of false positives.
If your goal is accurately mapping the CPE strings, you should probably think about implementing a lookup table that translates from CPE to a library name.
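If you do roll your own, one workable shape is aggressive normalization plus a cheap candidate filter, with anything fuzzier applied only to the survivors. A rough Ruby sketch (the helper names and matching rules here are mine, not a standard):
def normalize(name)
  name.downcase.gsub(/\d+$/, '')   # 'tomcat6' -> 'tomcat'
end

def version_base(version)
  version.split('-').first         # '6.0.36-3' -> '6.0.36' (drop packaging suffix)
end

def candidate_cpes(cpes, pkg_name, pkg_version)
  n, v = normalize(pkg_name), version_base(pkg_version)
  cpes.select do |cpe|
    _, _, _vendor, product, cpe_version = cpe.split(':')
    (product.include?(n) || n.include?(product)) && cpe_version == v
  end
end

candidate_cpes(['cpe:/a:apache:tomcat:6.0.36'], 'tomcat6', '6.0.36-3')
# => ["cpe:/a:apache:tomcat:6.0.36"]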

Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" do not meet the minimum length, and searching for those terms would yield no results.
It is possible to modify the variable which controls the minimum length a word needs to be indexed (ft_min_word_len), but this is a system-wide change that requires rebuilding the indexes in all databases. On the off chance others find this app useful, I'd rather keep these sorts of variables as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term splits it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.
Finally, since I'm working from a direct import of the monthly XML-based dumps, all values are converted to HTML entities, and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc)
pad short terms so they will be indexed, I assume with a prefix/suffix (see the sketch after this list).
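As a rough Ruby sketch of what that search-only copy could look like (the padding marker and helper name are invented; underscore happens to work because MySQL treats it as a word character):
require 'cgi'

MIN_LEN = 4   # MySQL's default ft_min_word_len

def prepare_for_search(html)
  text = CGI.unescapeHTML(html)      # '&lt;b&gt;PHP&lt;/b&gt;' -> '<b>PHP</b>'
  text = text.gsub(/<[^>]+>/, ' ')   # strip markup from the search copy
  text.split.map { |w|
    w.length < MIN_LEN ? w + '_' * (MIN_LEN - w.length) : w
  }.join(' ')
end

prepare_for_search('&lt;b&gt;PHP&lt;/b&gt; and SQL questions')
# => "PHP_ and_ SQL_ questions"
# (the same padding must be applied to short search terms at query time)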
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site for understanding MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave the site recommendations in the question comments, or email me directly via the site on my user profile.
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it and install everything I'd need to make this an (offline) web app. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that it will likely end up being used by a single user.
From what has already been said, I understand that MySQL FullText is not for you ;) But why stick to MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.

What are the things we should consider while writing a Spell Checker?

I want to write a very simple spell checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look at already-existing implementations like Aspell or netspell, for two main reasons:
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears, and it makes sense to build on work that has already been done.
If your interest is in finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway.
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get a feel for it by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
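To get a feel for Norvig's approach, here is his candidate generation (all words one edit away) transcribed loosely into Ruby as a sketch:
def edits1(word)
  letters = ('a'..'z').to_a
  results = []
  (0..word.length).each do |i|
    left, right = word[0...i], word[i..-1]
    results << left + right[1..-1] unless right.empty?                        # deletion
    results << left + right[1] + right[0] + right[2..-1] if right.length > 1  # transposition
    letters.each do |c|
      results << left + c + right[1..-1] unless right.empty?                  # replacement
      results << left + c + right                                             # insertion
    end
  end
  results.uniq
end

edits1('pub').size   # => a couple of hundred candidates to intersect with the dictionary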
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words
Phonetic hashing of the "misspelled" word to look up identically hashed real words in a pre-hashed dictionary (for Java, check out Apache Commons Codec for a suitable library). The phonetic hashes of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language
Essentially I weighted each potential word primarily based on edit distance and commonality. E.g., if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top of the suggested results).
There may be better ways, but this worked pretty well.
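As a condensed sketch of that ranking (the word list, probabilities, and helper names are invented; edit_distance is plain Levenshtein):
def edit_distance(a, b)
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

COMMON_MISSPELLINGS = { 'recieve' => 'receive' }   # known corrections always win

def suggestions(input, words_with_probabilities)
  return [COMMON_MISSPELLINGS[input]] if COMMON_MISSPELLINGS.key?(input)
  words_with_probabilities
    .map { |word, prob| [word, edit_distance(input, word) * 100.0 / prob] }
    .sort_by { |_, weight| weight }   # lower weight = better suggestion
    .map(&:first)
end

suggestions('receeve', 'receive' => 2.0, 'recede' => 0.5)
# => ["receive", "recede"]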
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?