XML Schema testing a value and applying extra restrictions - testing

I have the following case:
All boats have a boat type like shark , yatch and so on. I need to register which type of boat name and also how many feet the boat is, but this is where the problem arises. If the user types in a shark I need to validate that its between 15-30 feet, if he type in a yatch it needs to be between 30-60 for instance.
Any help on this?
<boat>
<type>shark</type>
<foot>18</foot> //validates
</boat>
<boat>
<type>shark</type>
<foot>14</foot> //fails
</boat>
<boat>
<type>AnyOtherBoat</type>
<foot>14</foot>//validates since its another type of boat than shark and yatch
</boat>
Help appriciated! Thx

Schematron ("a language for making assertions about patterns found in XML documents") might be able to do what you need. It allows specifying additional rules which cannot be expressed within a regular XML schema definition (XSD, RelaxNG).
Here are some articles to get you started:
Schematron on Wikipedia
Schematron: XML Structure Validation Language Using Patterns in Trees
Improving XML Document Validation with Schematron

To answer your question: No, you can't do that in XML Schema.
Firstly, you can't use values to select which constraints to apply (but you could for elements like <shark>)
Secondly, you can't do arithmetic tests (but you can use regex to specify the permissible strings... so you might be able to hack it.)

Related

Custom, user-definable "wildcard" constants in SQL database search -- possible?

My client is making database searches using a django webapp that I've written. The query sends a regex search to the database and outputs the results.
Because the regex searches can be pretty long and unintuitive, the client has asked for certain custom "wildcards" to be created for the regex searches. For example.
Ω := [^aeiou] (all non-vowels)
etc.
This could be achieved with a simple permanent string substitution in the query, something like
query = query.replace("Ω", "[^aeiou]")
for all the elements in the substitution list. This seems like it should be safe, but I'm not really sure.
He has also asked that it be possible for the user to define custom wildcards for their searches on the fly. So that there would be some other input box where a user could define
∫ := some other regex
And to store them you might create a model
class RegexWildcard(models.Model):
symbol = ...
replacement = ...
I'm personally a bit wary of this, because it does not seem to add a whole lot of functionality, but does seem to add a lot of complexity and potential problems to the code. Clients can now write their queries to a db. Can they overwrite each other's symbols?
That I haven't seen this done anywhere before also makes me kind of wary of the idea.
Is this possible? Desirable? A great idea? A terrible idea? Resources and any guidance appreciated.
Well, you're getting paid by the hour....
I don't see how involving the Greek alphabet is to anyone's advantage. If the queries are stored anywhere, everyone approaching the system would have to learn the new syntax to understand them. Plus, there's the problem of how to type the special symbols.
If the client creates complex regular expressions they'd like to be able to reuse, that's understandable. Your application could maintain a list of such expressions that the user could add to and choose from. Notionally, the user would "click on" an expression, and it would be inserted into the query.
The saved expressions could have user-defined names, to make them easier to remember and refer to. And you could define a syntax that referenced them, something otherwise invalid in SQL, such as ::name. Before submitting the query to the DBMS, you substitute the regex for the name.
You still have the problem of choosing good names, and training.
To prevent malformed SQL, I imagine you'll want to ensure the regex is valid. You wouldn't want your system to store a ; drop table CUSTOMERS; as a "regular expression"! You'll either have to validate the expression or, if you can, treat the regex as data in a parameterized query.
The real question to me, though, is why you're in the vicinity of standardized regex queries. That need suggests a database design issue: it suggests the column being queried is composed of composite data, and should be represented as multiple columns that can be queried directly, without using regular expressions.

Which should be used to GROUP BY: text fields or numeric fields?

So, I am currently enrolled in a Database Concepts course (yes, I am a noob) and the professor posed the question: "In a GROUP BY clause, for example GROUP BY xxxx, should the xxxx on which we’re grouping be a text field or a numeric field?"
I am very confused by this question, because it is possible to GROUP BY either text or numeric fields. I have tried both and received no errors. I realize that it typically makes more sense to GROUP BY text fields because, for instance, you would want to know how many products have a price greater than 50.00. The reverse of this would likely be unnecessary. Is there a "best practice" rule that I am missing here?
Thanks in advance.
The correct answer is . . . "xxxx should be whatever is needed to resolve the problem at hand."
You should review the notes or talk to the professor to understand the context of that statement. As you clearly understand, a SQL query can aggregate by numeric, character, or date/time columns -- as well as various other data types. From a functional perspective, all are supported.
There might be a bit of a performance advantage to using numbers. This is an artifact that sorting on fixed-width fields is often faster (marginally) than sorting on variables length fields. I sincerely doubt that your professor is referring to this, however.

SQL queries to their natural language description

Are there any open source tools that can generate a natural language description of a given SQL query? If not, some general pointers would be appreciated.
I don't know much about NLP, so I am not sure how difficult this is, although I saw from some previous discussion that the vice versa conversion is still an active area of research. It might help to say that the SQL tables I will be handling are not arbitrary in any sense, yet mine, which means that I know exact semantics of each table and its columns.
I can devise two approaches:
SQL was intended to be "legible" to non-technical people. A naïve and simpler way would be to perform a series of replacements right on the SQL query: "SELECT" -> "display"; "X=Y" -> "when the field X equals to value Y"... in this approach, using functions may be problematic.
Use a SQL parser and use a series of templates to realize the parsed structure in a textual form: "(SELECT (SUM(X)) (FROM (Y)))" -> "(display (the summation of (X)) (in the table (Y))"...
ANTLR has a grammar of SQL you can use: https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4 and there are a couple SQL parsers:
http://www.sqlparser.com/sql-parser-java.php
https://github.com/facebook/presto/tree/master/presto-parser/src/main
http://db.apache.org/derby/
Parsing is a core process for executing a SQL query, check this for more information: https://decipherinfosys.wordpress.com/2007/04/19/parsing-of-sql-statements/
There is a new project (I am part of) called JustQuery.Me which intends to do just that with NLP and google's SyntaxNet. You can go to the https://github.com/justquery-me/justqueryme page for more info. Also, sign up for the mailing list at justqueryme-development#googlegroups.com and we will notify you when we have a proof of concept ready.

SELECT FROM (lv_tablename) error: the output table is too small

I have an ABAP class method, say, select_something. select_something has an exporting parameter, say, et_result. et_result is of type standard table because the type of et_result cannot be determined until runtime.
The method sometimes gives a short dump saying With ABAP/4 Open SQL array select, the output table is too small at "select * into table et_result from (lv_tablename) where..."
Error analysis:
......in this particular case, the database table is 3806 bytes wide, but the internal table is only 70 bytes wide.
I tried "any table" too and the error is the same.
You could return a data reference. Your query will no longer fail, and you can assign the data to a correctly typed field symbol afterwards.
" Definition
class-methods select_all
importing
!tabname type string
returning
value(results) type ref to data.
...
...
" Implementation
method select_all.
data dref type ref to data.
create data dref type standard table of (tabname).
field-symbols <tab> type any table.
assign dref->* to <tab>.
select * from (tabname) into table <tab>.
get reference of <tab> into results.
endmethod.
Also, I agree with #vwegert that dynamic queries (and programming for that matter) should be avoided when possible.
What you're trying to do looks horribly wrong on many levels. NEVER use SELECT FROM (whatever) unless someone points a gun at your head AND the door is locked tight. You'll loose every kind of static error checking the system might be able to provide you with. For example, the compiler will no longer be able to tell you "Hey, that table you're reading from is 3806 bytes wide." It simply can't tell, even if you use constants. You'll find that out the hard way, producing short dumps, especially when switching between unicode and NUC systems, quite likely some in production systems. No fun.
(Actually there are a few - very very VERY few - good uses for dynamic table names in the SELECT statement. I need them about once every two to three years, and I code quite a lot weird stuff. Just avoid them wherever you can, even at the cost of writing more code. It's just not worth the trouble fixing broken stuff later.)
Then, changing the generic formal parameter type does not do anything to the type of the actual parameter. If you pass a STANRDARD TABLE OF mandt WITH DEFAULT KEY to your method, that table will have lines of 3 characters. It will be a STANDARD TABLE, and as such, it will also be an ANY TABLE, and that's about it. You can twist the generic types anywhere you like, there's no way to enforce correctness using generic types the way you use them. It's up to the caller to make sure that all the right types are used. That's a bad way to fly.
First off, I agree with vwegert's response, try to avoid dynamic sql selections if you can
That said, check the short dump. If the error is an exception class, you can wrap the SELECT statement in a try/catch block and at least stop it from dumping.
You can also try "INTO CORRESPONDING FIELDS OF TABLE et_result". If ET_RESULT is dynamic, you might have to cast it into the proper structure using RTTS. This might give you some ideas...
Couldn't agree more to vwegert, but if there is absolutely no other way (and there usually is) of performing your task than using dynamic select statements and dynamically typed parameters, do some checks on the type of the table and the parameter at runtime.
Use CL_ABAP_TYPEDESCR and its subclasses to do so.
This way, you can handle errors at runtime without your program dumping,
But as vwegert said, this dynamic stuff is pure evil and will most certainly break at some point during runtime. Adding the necessary error handling will most likely be a lot more work and a lot harder than redesigning your code to none dynamic SQL and typed parameters

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.
Are you using SQl Server 2005 Express with Advanced Services as to me it sounds you would benefit from the Full Text indexing and more specifically Contains and Containstable which you can use with specific instructions here is a link for the uses of Containstable:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use the SQL Server Full Text Search and do an inflectional search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use a AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.