Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.
It is possible to modify the variable that controls the minimum length a word must be to get indexed (ft_min_word_len), but this is a system-wide change requiring the indexes in all databases to be rebuilt. On the off chance others find this app useful, I'd rather keep this sort of variable as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term separates it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.
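To illustrate, here's a minimal sketch of the failing queries (the table and column names are hypothetical, not from the dump):

-- Hypothetical posts table with a FULLTEXT index (MyISAM, MySQL 5.x era)
CREATE TABLE posts (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT NOT NULL,
    FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- With the default ft_min_word_len = 4, both of these return no rows
-- even when the terms are present in the data: "PHP" is too short,
-- and "VB.NET" is tokenized into two short words ("VB" and "NET").
SELECT id FROM posts WHERE MATCH(body) AGAINST ('PHP');
SELECT id FROM posts WHERE MATCH(body) AGAINST ('VB.NET');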
Finally, since I'm doing a direct import of the monthly XML-based dumps, all values are converted to HTML entities, and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc)
pad short terms so they will be indexed, I assume with a prefix or suffix.
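A minimal sketch of what that advice might look like in MySQL (the table name and padding scheme are my assumptions, not the blog post's):

-- Search-only copy of the data: markup stripped, short terms padded
-- past ft_min_word_len (e.g. "php" stored as "php___").
CREATE TABLE posts_search (
    post_id INT UNSIGNED NOT NULL PRIMARY KEY,
    search_body TEXT NOT NULL,
    FULLTEXT KEY ft_search_body (search_body)
) ENGINE=MyISAM;

-- The application pads the user's query terms the same way, so a
-- search for "php" becomes a search for "php___" on both sides.
SELECT post_id FROM posts_search WHERE MATCH(search_body) AGAINST ('php___');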
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site to understand MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave the site recommendations in the question comments, or email me directly at the site on my user profile.
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it, and install everything I'd need to make this an (offline) webapp. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that this is likely going to end up being used by only a single user.

From what has already been said, I understand that MySQL FULLTEXT is not for you ;) But why stick to MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.
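For example, Sphinx speaks a MySQL-compatible query dialect (SphinxQL), so once an index is built the queries stay familiar; this is just a sketch, and the index name is hypothetical:

-- Connect with any MySQL client to Sphinx's SphinxQL listener
-- (port 9306 by default) and query the full-text index directly.
-- Sphinx's min_word_len can be set to 1 and its tokenizer is
-- configurable, so "PHP" and "VB.NET" can both be indexed cleanly.
SELECT id FROM posts_index WHERE MATCH('vb.net') LIMIT 20;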

Related

Developing a search and tag heavy website

I'm in the planning phase of developing a very tag-heavy website. Everything will essentially be associated with tags, and the entire site will be based on searching these tags.
Now, I've been thinking a lot about going the NoSQL route here, since from what I read and understand, it makes the most sense for something like this.
Would it be best to go with this database system? Would it make sense to go with a relational database system? Should I think about incorporating something like Solr?
What would the ideal setup be?
UPDATE:
Ideally they would be user generated, but we all know how that would turn out with giving users that much power. So, let’s change up the requirements and say that users WILL NOT have the power to create tags.
Searching on tags based on text matches is something that would probably be useful and needed. If the tag is “garage sale”, the search for “sale” should also pick this up, at a lower relevance for sure.
I can’t imagine the usage being so much that scaling would be an issue.
Thanks
I would spend a bit of time thinking about these tags. For example, are these tags going to be user generated or will you provide a few tags and let users select which ones they want?
Will you need to search on tags based on text matches? For example if a tag is "garage sale" do you want to search for "sale" to also pick this up? Maybe at a lower relevance?
Also, what kind of usage are you looking at? One good thing about Solr is that it's super easy to scale and synchronize data: it is easy to deploy multiple nodes, shard collections, and replicate data to other nodes, something that traditional databases struggle with.
Another thing to keep in mind is that Solr is usually not the official "repository of record"; most of the time the data gets fed to it from a DB somewhere, but all reading activity is done against Solr.
See this answer for a SQL solution. Offhand I can't think of any advantage to using most NoSQL databases (e.g. key-value, columnar, or document), as the SQL solution will be more compact and ought to give good performance; a graph database may be appropriate if you're doing a lot of navigational-type queries on your tags, but it doesn't sound like that's the case.
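For reference, the kind of SQL solution usually meant here is the classic many-to-many tag schema; a minimal sketch (names are illustrative, not from the linked answer):

CREATE TABLE tags (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) NOT NULL UNIQUE
);

CREATE TABLE item_tags (
    item_id INT UNSIGNED NOT NULL,
    tag_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (item_id, tag_id)
);

-- Text-match lookup: finds the "garage sale" tag when searching for
-- "sale". The leading wildcard defeats the index, but a small,
-- curated tag table keeps the scan cheap.
SELECT DISTINCT it.item_id
FROM tags t
JOIN item_tags it ON it.tag_id = t.id
WHERE t.name LIKE '%sale%';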
Use of Solr (or ElasticSearch or whatever) is orthogonal to your primary database; it may be appropriate to incorporate a search tool if users are typing inexact tags for search, but I recommend integrating a stemming library or something along those lines before turning to a full-blown search tool.

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat-file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON-formatted files and some arrays in a simple project like this -- beyond my own preconceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard, if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best-case scenario for your CMS app? It's successful and people want to use it more? If you're using flat files, it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions), and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for smaller and smaller gains in feature set and performance.
Then again, if the CMS is designed right, switching from flat files to an RDBMS should be as simple as using a different data access file.
Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.
What would be cool with a simple site that was fed via JSON and jQuery is that the site wouldn't need to reload on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (ex. http://localhost/#about)
The problem being if they are editing the raw JSON file they can mess it up pretty quickly. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more involved than the site (though isn't that always the case with dynamic sites?)
What are the predicted data sizes for the CMS?
A large reason for the use of an RDBMS is quick, specific access to large amounts of data. The data format might not be large, but if there is a lot of data, then an RDBMS might be better in the long run.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up to the answers so far in an answer (sorry if this is bad form).
There may be instances where this is on a shared host.
Though the JSON files can technically be edited, this won't be the case. The admin interface will be robust enough to do all of the creating/editing of pages.
The size for each install will be relatively small: 1-2 admins, 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
Security will be a big issue -- any other suggestions on this specifically?
Well, isn't the problem that they are distrustful of any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, if you just present them some very simple CMS (like CMS Made Simple, which I've heard really is simple and quick to learn), and they see that everything is easy, then maybe they just won't care what's behind it, whether it's a database or whatever!
They might be receptive to arguments like easier maintenance, lower maintenance costs, and a much better handover to another webmaster than with proprietary solutions (they are not dependent on you), etc.

MySQL with ColdFusion - Solutions to create Search Capabilities?

I'm using MySQL & ColdFusion. Currently for searching TEXT fields I'm using LIKE in the database. Luckily my database is empty but soon the table will fill up and I fear I the LIKE SQL query will kill my app.
I'm looking for a solution that works with both MySQL & ColdFusion that will allow me to scalably offer search capabilities with my MySQL & ColdFusion app.
Thanks
Consider using ColdFusion's built-in Verity search engine, or the Solr search engine in ColdFusion 9, which is based on Apache Lucene. Good luck!
Update: ColdFusion 9.0.1 has addressed several quirks in the Solr (Apache Lucene) search engine. Use it!
You are right to worry about the LIKE operator's performance having scalability problems. But keep two things in mind.
First: column LIKE 'pattern%' works well if your column is indexed. It's column LIKE '%pattern%' that can cause real performance problems.
Second, MySQL has a good full-text search system built into it. See http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
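To make the difference concrete (the table and columns are hypothetical; the last query assumes a FULLTEXT index on the searched columns):

-- Index-friendly: the fixed prefix allows an index range scan.
SELECT id FROM articles WHERE title LIKE 'coldfusion%';

-- Not index-friendly: the leading wildcard forces a full scan.
SELECT id FROM articles WHERE title LIKE '%coldfusion%';

-- The built-in full-text alternative (MyISAM tables in MySQL 5.1),
-- requiring FULLTEXT KEY (title, body) on the table:
SELECT id FROM articles WHERE MATCH(title, body) AGAINST ('coldfusion');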
What makes you think that it will be a problem? Have you done any load testing? What is the worst-case maximum size of the table? Have you filled it to that level and tried it? Finally, do you actually need it to be TEXT? MySQL has some very large VARCHARs; would that do instead?
My point being, it sounds like you already have the simplest solution that might possibly work. Maybe you should prove that it does not work before over-engineering something else?
Lastly, to actually answer your question, you could cache the database into a Verity search index and then search that (CF 9 offers another index engine as well). But you're going to lose it being a live search.
I don't know if it is an option for your app, but what I usually do is reserve LIKE '%pattern%' for advanced searches defined by the user, where a performance hit can be expected. When possible I default the search option selected by the user to 'Starts With'. I've searched '%pattern%' in a MySQL 5 DB with 1.25 million records on a low-traffic site. The database doesn't seem to be the bottleneck, even on a field that isn't indexed. The customer wants all the records shown on the screen; showing 10,000+ records seems to be the problem (lol). The DB may be less of a problem than you think, depending on traffic.

Wiki Database, is there one?

I was searching the net for something like a wiki database, just like wikipedia but instead stores structured content, editable by users. What I was looking for was an online database accessible by everyone where people can design the schema and data with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is a great potential for something like this. A possible example will be a website with a GUI for querying a MySQL DB where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool, but I haven't found the time yet. If anyone is interested in spending some time coding on this, please let me know at wikidatabase.org
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, and one which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in and out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value pairs, inheritance, templates, etc.), but you can also interface with the API, write DekiScript, grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
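For illustration, the MySQL account behind such a setup might look like this (the user name, password, and database prefix are hypothetical):

-- Shared, publicly known account that can create and work inside
-- its own prefixed databases, but touch nothing else on the server.
CREATE USER 'wikidb'@'%' IDENTIFIED BY 'public-password';
GRANT CREATE, SELECT, INSERT, UPDATE, DELETE, ALTER
    ON `wikidb\_%`.* TO 'wikidb'@'%';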
It might suffer from more accuracy problems than wikipedia though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to have been implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in CouchDB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.

Catching SQL Injection and other Malicious Web Requests

I am looking for a tool that can detect malicious requests (such as obvious SQL injection gets or posts) and will immediately ban the IP address of the requester/add to a blacklist. I am aware that in an ideal world our code should be able to handle such requests and treat them accordingly, but there is a lot of value in such a tool even when the site is safe from these kinds of attacks, as it can lead to saving bandwidth, preventing bloat of analytics, etc.
Ideally, I'm looking for a cross-platform (LAMP/.NET) solution that sits at a higher level than the technology stack; perhaps at the web-server or hardware level. I'm not sure if this exists, though.
Either way, I'd like to hear the community's feedback so that I can see what my options might be with regard to implementation and approach.
You're almost looking at it the wrong way; no third-party tool that is unaware of your application's methods/naming/data/domain is going to be able to protect you perfectly.
Something like SQL injection prevention has to be in the code, and is best written by the people who wrote the SQL, because they are the ones who know what should and shouldn't be in those fields (unless your project has very good docs).
You're right, this has all been done before. You don't quite have to reinvent the wheel, but you do have to carve a new one because of differences in everyone's axle diameters.
This is not a drop-in and run problem, you really do have to be familiar with what exactly SQL injection is before you can prevent it. It is a sneaky problem, so it takes equally sneaky protections.
These two links taught me far more than the basics on the subject, and helped me better phrase my future lookups on specific questions that weren't answered.
SQL injection
SQL Injection Attacks by Example
And while this one isn't quite a 100% finder, it will "show you the light" on existing problems in your code; but as with web standards, don't stop coding once you pass this test.
Exploit-Me
The problem with a generic tool is that it is very difficult to come up with a set of rules that will only match against a genuine attack.
SQL keywords are all English words, and don't forget that the string
DROP TABLE users;
is perfectly valid in a form field that, for example, contains an answer to a programming question.
The only sensible option is to sanitise the input before ever passing it to your database but pass it on nonetheless. Otherwise lots of perfectly normal, non-malicious users are going to get banned from your site.
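Alongside sanitising, parameterised queries close the injection hole entirely, because user input travels as data rather than as SQL text. A sketch using MySQL's server-side prepared-statement syntax (the users table is hypothetical); in practice you'd use your client library's equivalent:

-- "DROP TABLE users;" arrives as a plain string to match against,
-- never as executable SQL.
PREPARE find_user FROM 'SELECT id FROM users WHERE name = ?';
SET @name = 'DROP TABLE users;';
EXECUTE find_user USING @name;
DEALLOCATE PREPARE find_user;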
One method that might work for some cases would be to take the SQL string that would run if you naively used the form data, and pass it to some code that counts the number of statements that would actually be executed. If it is greater than the number expected, then there is a decent chance that an injection was attempted, especially for fields that are unlikely to include control characters, such as username.
Something like a normal text box would be a bit harder since this method would be a lot more likely to return false positives, but this would be a start, at least.
One little thing to keep in mind: in some countries (e.g. most of Europe), people do not have static IP addresses, so blacklisting should not be forever.
Oracle has got an online tutorial about SQL Injection. Even though you want a ready-made solution, this might give you some hints on how to use it better to defend yourself.
Now that I think about it, a Bayesian filter similar to the ones used to block spam might work decently too. If you got together a set of normal text for each field and a set of sql injections, you might be able to train it to flag injection attacks.
One of my sites was recently hacked through SQL injection. It added a link to a virus for every text field in the DB! The fix was to add some code looking for SQL keywords. Fortunately, I've developed in ColdFusion, so the code sits in my Application.cfm file, which is run at the beginning of every web page and looks at all the URL variables. Wikipedia has some good links to help too.
Interesting how this is being implemented years later by Google, removing the URL altogether in order to prevent XSS attacks and other malicious activities.