SQL: List-Field contains sublist

Quick preface: I use the SQL implementation persistent (Haskell) and esqueleto.
Anyway, I want to have a SQL table with a column of type [String], i.e. a list of strings. Now I want to make a query which gives me all the records where a given list is a sublist of the one in the record.
For instance the table with
ID Category
0 ["math", "algebra"]
1 ["personal", "life"]
2 ["algebra", "university", "personal"]
with a query of ["personal", "algebra"] would return only the record with ID=2, since ["personal", "algebra"] is a sublist of ["algebra", "university", "personal"].
Is a query like this possible with variable-length of my sought-after sublist and "basic" SQL operators?
If someone knows their way around persistent/esqueleto that would of course be awesome.
Thanks.

Expanding on the comment of Gordon Linoff and the previous answer:
SQL databases are sometimes limited in their power. Since the order of your Strings in [String] does not seem to matter, you are trying to put something like a set into a relational database, and for your query you suggest something like an is-a-subset-of operator.
If there were a database engine that provided those structures, there would be nothing wrong with using it (I don't know of any). However, approximating your set logic (or any logic that is not natively supported by the database) has disadvantages:
You have to explicitly deal with edge cases (cf. xnyhps' answer)
Instead of hiding the complexity of storing data, you need to explicitly deal with it in your code
You need to study the database engine rather than writing your Haskell code
The interface between database and Haskell code becomes blurry
A more powerful approach is to reformulate your storage task as something that fits naturally into the relational database concept, i.e. try to express it in terms of relations.
Entities and relations are simple, thus you avoid edge cases. You don't need to worry about how exactly the db backend stores your data. You don't have to bother with the database much at all. And your interface is reduced to rather straightforward queries (making use of joins). Everything that cannot be (comparatively) easily realized with a query belongs (probably) in the Haskell code.
Of course, the details differ based on the specific circumstances.
In your specific case, you could use something like this:
Table: Category
ID Description
0 math
1 algebra
2 personal
3 life
4 university
Table: CategoryGroup
ID CategoryID
0 0
0 1
1 2
1 3
2 1
2 4
2 2
... where the foreign-key relation allows you to have groups of categories. Here, you are using the relational database where it excels. In order to query for a CategoryGroup you would join the two tables, yielding a result of type
[(Entity CategoryGroup, Entity Category)]
which I would transform in Haskell to something like
[(Entity CategoryGroup, [Entity Category])]
where the Category entities are collected for each CategoryGroup (this requires deriving (Eq, Ord) in your CategoryGroup model).
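For illustration, a minimal sketch of that join in plain SQL, assuming snake_case table and column names (adapt to whatever persistent actually generates for your models):

-- one row per (group, category) pair; collect per group in Haskell
SELECT cg.id, c.description
FROM category_group AS cg
JOIN category AS c ON c.id = cg.category_id
ORDER BY cg.id;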
The set logic described above, for a given list cs :: [Entity Category], would then look like:
import qualified Data.Set as Set
import Data.Set (isSubsetOf)

-- the categories we are searching for
let s = Set.fromList ["personal", "algebra"]
-- the categories attached to one CategoryGroup
let s0 = Set.fromList $ map (categoryDescription . entityVal) cs
if s `isSubsetOf` s0 -- ... then this group matches the query
Getting used to the restrictions of relational databases can be annoying in the beginning. I guess, for something of central importance (persisting data), a robust concept is often better than a mighty one, and it pays off to always know exactly what your database is doing.

By using [String], persistent converts the entire list to a quoted string, making it very hard to work with from SQL.
You can do something like:
-- one LIKE clause per sought category (mapM_, as where_'s results are discarded)
mapM_ (\cat ->
        where_ (x ^. Category `like` (%) ++. val (show cat) ++. (%)))
      ["personal", "algebra"]
But this is very fragile (it may break when the categories contain ", etc.).
Better approaches are:
You could do the filtering in Haskell if the database is small enough.
It would be much easier to model your data as:
Objects:
ID ...
0 ...
1 ...
2 ...
ObjectCategories:
ObjectID Category
0 math
0 algebra
1 personal
1 life
2 algebra
2 university
2 personal
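With this model, the sublist query becomes a classic relational-division query. A sketch, using the table and column names above:

SELECT ObjectID
FROM ObjectCategories
WHERE Category IN ('personal', 'algebra')
GROUP BY ObjectID
HAVING COUNT(DISTINCT Category) = 2;  -- must match every sought category

The HAVING count must equal the length of the sought-after list, so the query works for a sublist of any length.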

Related

SQL UDF - Struct Diff

We have a table with 2 top-level columns of type 'struct' - one is a 'before' image, the other an 'after' image. The struct schemas are non-trivial - nested, with arrays to a variable depth. They are sent to us from replication, so the schemas are always the same (the schemas can of course be updated at some point, but always together).
The objective is, for the two input structs, to return 2 struct 'diffs' of the before and after containing only the fields that have changed - essentially the delta of the changes produced by the replication source. We know something has changed, but not 'what', since we get the full before and after image. This raw data lands in BQ and is then processed from there, but we need to determine the more granular changes for higher-order BQ processing.
The table schema is very wide (1000s of leaf fields), and the data is fairly sparsely populated (so a lot of nulls will be present on both sides of the snapshot) - so it would need to be as performant as possible when executing over 10s of millions of rows.
All things are nullable for maximum flexibility.
So change could look like:
null -> value
value -> null
valueA -> valueB
Arrays:
recursive use of the above for arrays of structs; ordering could be relaxed if that makes it easier?
It might not be possible.
I've not attempted this yet as it seems really difficult, so I am looking to the community boffins for some support. I feel the arrays could be the difficult part. There is probably an easy way, perhaps in Python (which I don't know), or even doing some JSON conversion and comparison using JSON tools? It feels like it would be a super cool feature built into BQ as well, so if I can get this to work, I will add a feature request for it.
I'd like to have a SQL UDF for reuse (we have SQL skills, not Python, although if it's easier in Python then that's ok), and now with the new feature of persistent SQL UDFs, this seems the right time to ask and test the feature out!
def struct_diff(before Struct, after Struct) -> (beforeChange, afterChange)
That type of signature, but open to suggestions?
It appears to be really difficult to get a reusable piece of code. Since there is currently no support for recursive SQL UDFs, you cannot use a recursive approach for the nested structs.
However, you might be able to write some specific SQL UDFs depending on your array and struct structures. You can use an approach like this one to compare the structs:
-- wrap two leaf values side by side as (prev, cur)
CREATE TEMP FUNCTION final_compare(s1 ANY TYPE, s2 ANY TYPE) AS (
  STRUCT(s1 AS prev, s2 AS cur)
);

-- descend one level of nesting: compare the structA field of each input
CREATE TEMP FUNCTION compare(s1 ANY TYPE, s2 ANY TYPE) AS (
  STRUCT(final_compare(s1.structA, s2.structA))
);
You can use UNNEST to work with arrays, and the final SQL UDF would really depend on your data.
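Purely as an illustration, a sketch of the array case that pairs elements by position; compare_arrays and the positional alignment are assumptions, not a general solution (extra trailing elements in the second array are not reported):

-- hypothetical: pair elements by offset; a missing counterpart shows as NULL
CREATE TEMP FUNCTION compare_arrays(a1 ANY TYPE, a2 ANY TYPE) AS (
  ARRAY(
    SELECT final_compare(e1, a2[SAFE_OFFSET(o1)])
    FROM UNNEST(a1) AS e1 WITH OFFSET AS o1
    ORDER BY o1
  )
);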
As @rtenha suggested, Python could make this problem a lot easier to handle.
Finally, I did some tests using a JavaScript UDF, and it was basically the same result, if not worse, than with a SQL UDF.
The console allows a recursive definition of the function, but it will fail during execution. Also, JavaScript doesn't allow the ANY TYPE data type in the signature, so you would have to spell out the whole STRUCT definition, or use a workaround like applying TO_JSON_STRING to your struct in order to pass it as a string.

How to implement a matrix concept in SQL Server

I am stuck on the concept of recreating in SQL Server a matrix that was created in Excel, and I couldn't find a good answer online. The room numbers form the first row, and the first column lists functional requirements. So, for example, when a camera is needed in one of the rooms, I place an X mark at the desired row and column coordinate to indicate that the room contains one. I attached a sample of the Excel sheet to explain better (Excel Matrix.png).
Rather than having multiple columns for every possible functional requirement, use proper relational methods for a many-to-many relationship:
Rooms
------
Id
RoomName
Functions
---------
Id
FunctionName
RoomFunctions
-------------
RoomId
FunctionId
Then you can relate one room to a variable number of functions, and can add functions easily without changing your data structure.
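For example, finding all rooms that need a camera becomes a simple pair of joins. A sketch using the tables above ('Camera' is an assumed function name):

SELECT r.RoomName
FROM Rooms r
JOIN RoomFunctions rf ON rf.RoomId = r.Id
JOIN Functions f ON f.Id = rf.FunctionId
WHERE f.FunctionName = 'Camera';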
Without having data to work with, it's hard to give you an example.
With that said, the PIVOT method may help you out. You can just have a dummy column with a 1 or 0 based on whether or not it has an 'X' in your data. Then in the pivot you would just take the MAX of that for the various values.
It may require massaging your data into a better format, but should be doable.
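A rough sketch of such a pivot, assuming the relational tables above and hard-coding two illustrative room names:

SELECT FunctionName, [Room101], [Room102]
FROM (
  -- 1 marks an 'X' in the original matrix
  SELECT f.FunctionName, r.RoomName, 1 AS HasFunction
  FROM RoomFunctions rf
  JOIN Rooms r ON r.Id = rf.RoomId
  JOIN Functions f ON f.Id = rf.FunctionId
) AS src
PIVOT (MAX(HasFunction) FOR RoomName IN ([Room101], [Room102])) AS p;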

Query Optimization Suggestion

Suppose you have a query (PL/SQL) such as:
Select a.*
From Table a
Where a.foo in (#1)
or a.bar in (#1);
Where #1 is a list containing about 10,000 (string) parameters. Yes, the list is repeated in both restrictions. And for any given row, a.foo <> a.bar.
This list comes from a webservice, and it changes according to a set of parameters. Suppose it is not possible to store them. The strings in this list are numeric strings, with 9 characters, such as '001234567'.
Is there a better way to structure this query?
Given the information in your question (and comments), the simple answer is no, there is not a better way to structure your query. However, there is almost certainly a better way to design this process so that you aren't forced to use the procedure to which this horrible pseudo-code snippet alludes.
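For instance, one common redesign - a sketch only, with assumed names, and viable only if session-scoped staging is permissible even though permanent storage is not - is to load the webservice values into a global temporary table once per call and join against it instead of passing 10,000-element IN-lists:

CREATE GLOBAL TEMPORARY TABLE param_list (val VARCHAR2(9))
  ON COMMIT PRESERVE ROWS;

-- bulk-insert the webservice values into param_list, then:
SELECT a.*
FROM Table_a a
WHERE EXISTS (SELECT 1
              FROM param_list p
              WHERE p.val IN (a.foo, a.bar));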

Building Query from Multi-Selection Criteria

I am wondering how others would handle a scenario like such:
Say I have multiple choices for a user to choose from.
Like, Color, Size, Make, Model, etc.
What is the best solution or practice for handling the building of your query for this scenario?
So, if they select 6 of the 8 possible colors, 4 of the 7 possible makes, and 8 of the 12 possible brands?
You could do dynamic OR statements or dynamic IN statements, but I am trying to figure out if there is a better solution for handling this "WHERE"-criteria type of logic?
EDIT:
I am getting some really good feedback (thanks everyone)...one other thing to note is that some of the selections could even be like (40 of the selections out of the possible 46) so kind of large. Thanks again!
Thanks,
S
What I would suggest doing is creating a function that takes in a delimited list of makeIds, colorIds, etc. (each probably an int, or whatever your key is) and splits them into a table for you.
Your SP will take in a list of makes, colors, etc as you've said above.
YourSP '1,4,7,11', '1,6,7', '6'....
Inside your SP you'll call your splitting function, which will return a table:
SELECT *
FROM Cars C
JOIN YourFunction(@models) YF ON YF.Id = C.ModelId
JOIN YourFunction(@colors) YF2 ON YF2.Id = C.ColorId
Then, if they select nothing they get nothing. If they select everything, they'll get everything.
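A minimal sketch of such a splitting function, assuming SQL Server 2016+ where the built-in STRING_SPLIT does the heavy lifting (on older versions you would hand-roll the split):

CREATE FUNCTION dbo.YourFunction (@list VARCHAR(MAX))
RETURNS TABLE
AS
RETURN
  -- one row per delimited value, cast back to the key type
  SELECT CAST(value AS INT) AS Id
  FROM STRING_SPLIT(@list, ',');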
What is the best solution or practice for handling the build of your query for this scenario?
Dynamic SQL.
A single parameter represents two states - NULL/non-existent, or having a value. Each additional parameter doubles the number of total possibilities: 1 parameter yields 2, 2 yield 4, 3 yield 8, and so on (2^n combinations for n parameters). A single, non-dynamic query can contain all the possibilities but will perform horribly between the use of:
ORs
overall non-sargability
and inability to reuse the query plan
...when compared to a dynamic SQL query that constructs the query out of only the absolutely necessary parts.
The query plan is cached in SQL Server 2005+ if you use the sp_executesql command; it is not if you only use EXEC.
I highly recommend reading The Curse and Blessing of Dynamic SQL.
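To make that concrete, a hedged sketch (column and parameter names are illustrative) that appends only the clauses actually needed and executes via sp_executesql so the plan can be reused:

DECLARE @sql NVARCHAR(MAX) = N'SELECT * FROM Cars WHERE 1 = 1';

IF @colorList IS NOT NULL
  SET @sql += N' AND ColorId IN (SELECT CAST(value AS INT)
                                 FROM STRING_SPLIT(@colorList, '',''))';

-- parameterized execution lets SQL Server cache and reuse the plan
EXEC sp_executesql @sql, N'@colorList VARCHAR(MAX)', @colorList = @colorList;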
For something this complex, you may want a session table that you update when the user selects their criteria. Then you can join the session table to your items table.
This solution may not scale well to thousands of users, so be careful.
If you want to create dynamic SQL it won't matter if you use the OR approach or the IN approach. SQL Server will process the statements the same way (maybe with little variation in some situations.)
You may also consider using temp tables for this scenario. You can insert the selections for each criterion into temp tables (e.g., #tmpColor, #tmpSize, #tmpMake, etc.). Then you can create a non-dynamic SELECT statement. Something like the following may work:
SELECT <column list>
FROM MyTable
WHERE MyTable.ColorID IN (SELECT ColorID FROM #tmpColor)
   OR MyTable.SizeID IN (SELECT SizeID FROM #tmpSize)
   OR MyTable.MakeID IN (SELECT MakeID FROM #tmpMake)
The dynamic OR/IN and the temp table solutions work fine if each condition is independent of the other conditions. In other words, if you need to select rows where ((Color is Red and Size is Medium) or (Color is Green and Size is Large)) you'll need to try other solutions.
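In that dependent case, one option - a sketch, where #tmpColorSize is an assumed table of allowed pairs - is to stage the valid combinations and join on both columns:

SELECT <column list>
FROM MyTable t
JOIN #tmpColorSize cs
  ON cs.ColorID = t.ColorID
 AND cs.SizeID = t.SizeID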

Loading an entity-ID CSV column as hydrated entities in NHibernate

I have a number of database tables that look like this:
EntityId int : 1
Countries1: "1,2,3,4,5"
Countries2: "7,9,10,22"
I would like to have NHibernate load the Country entities identified as 1,2,3,4,5,7,9 etc. whenever my EntityId is loaded.
The reason for this is that we want to avoid a proliferation of joins, as there are scores of these collections.
Does fetching the countries have to happen as the entity is loaded - or is it acceptable to run a query to fetch them? (You could amend your DAO to run the query after fetching the entity.) The reason I ask is that simply running a query, compared to having custom code invoked when entities are loaded, requires less "plumbing" and framework innards.
After fetching your entity, when you have the Country1, Country2 lists, you can run a query like:
select c from Country c where c.id in (:Country1)
passing :Country1 as a named parameter. You could also retrieve all rows for both sets of countries:
select e from Entity e where e.id in (:Country1, :Country2)
I'm hoping the Country1 & Country2 strings can be used as they are, but I have a feeling this won't work. If so, you should convert the strings to a collection of integers and pass the collection as the query parameter.
EDIT: The "plumbing" to make this more transparent comes in the form of the IInterceptor interface. This allows you to plug in to how entities are loaded, saved, updated, flushed etc. Your entity will look something like this
class MyEntity
{
    IList<Country> Country1;
    IList<Country> Country2;
    // with public getters/setters

    String Country1IDs;
    String Country2IDs;
    // protected getter and setter for NHibernate
}
Although the domain object has two representations of the list - the actual entities and the list of IDs - this is the same intrusion that you have when declaring a regular ID field in an entity. The collections (Country1 and Country2) are not persisted in the mapping file.
With this in place, you provide an IInterceptor implementation that hooks loading and saving. On loading, you fetch the value of the CountryXIDs property and use it to load the list of countries (as I described above). On saving, you turn the IList of countries into an ID list, and save that value.
I couldn't find the documentation for IInterceptor, but there are many projects on the net using it. The interface is described in this article.
No, you cannot - at least not with the default functionality.
Considering that there is no string-SPLIT function in standard SQL, it would be hard for any ORM to detect the discrete integer values delimited by commas in a varchar column. If you somehow (with a custom SQL function) overcame that obstacle, your best shot would be to use some kind of component/custom user type that would still make a smorgasbord of joins on the Country table to fetch, in the end, a collection of country entities...
...but I'm not sure it can be done, and it would also mean writing the persistence mechanism from scratch as well.
As a side note, I must say that I don't understand the design decision; you denormalized the db and, well, since when are joins bad?
Also, the other answer given will solve your problem without redesigning your database and without writing a lot of experimental plumbing code. However, it will not answer your question about hydration of the Country entities.
UPDATE:
On second thought, you can cheat, at least for the select part.
You could make a VIEW that would split the values and display them as separate rows:
Entity-Country1 View:
EntityId Country
1 1
1 2
1 3
etc
Entity-Country2 View:
EntityId Country
1 7
1 9
1 10
etc
Then you can map the views.
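For completeness, a hypothetical sketch of such a view on SQL Server 2016+ (MyEntityTable and the column names are assumptions; older versions would need a hand-rolled split function):

CREATE VIEW EntityCountry1 AS
SELECT t.EntityId,
       CAST(s.value AS INT) AS Country
FROM MyEntityTable t
CROSS APPLY STRING_SPLIT(t.Countries1, ',') s;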