What is most efficient way to find ‘inverse' of getting all records that match particular criteria - sql

I am trying to find the most efficient way to find ‘inverse' of getting all records that match particular criteria
I.e. find all predefined criteria from a set that a particular record matches
I have a table of 'target' criteria that has many records - each built using a querybuilder javascript component - so each target record has its criteria stored as a json string in a field.
I also have a standard 'person' table
It is straight forward to query how many people fit a particular target.
What I am trying to do is get all targets that match a particular person
Is there a more efficient way than just running each target's criteria against a person?
Open to suggestions beyond just sql - e.g. caching , hashing or building up some kind of lookup table/file
Edit:
Hopefully tables below clarify this issue. If I parsed and ran the 'Good Eyesight' target criteria I would expect to return both Bob and Sue
But I want to know that Bob matches the 'Young People' and 'Good Eyesight' target. I will have thousands of users and probably up to 50 active targets.
Table 1: Person
ID Name Age Fav_Vegetable
---------------------------------
1 Bob 20 Carrot
2 Sue 40 Carrot
Table 2: Target
ID Name Criteria_JSON
---------------------------------
1 Young People {"rule": "young_age", "selectedOperator": "<","selectedOperand": "Age","value": "30"}
2 Old People {"rule": "old_age", "selectedOperator": ">","selectedOperand": "Age","value": "30"}
3 Good Eyesight {"rule": "vegetable","selectedOperator": "equals","selectedOperand": "Fav_Vegetable","value": "Carrot"}

The answer I have come up with is to run all targets against all people and maintain an index type table of the results.
i.e. have a table TargetIndex with columns targetId, personId
Then when I need to know the targets for a particular person I can just check against the TargetIndex table rather than rerunning queries.
Obviously these results would need to be refreshed as the target or people records change - - probably whenever a target is added/edited and refreshed periodically (hourly/nightly?) to pick up changes in people
Thanks for people's thoughts

Related

Is it possible to use SQL to show the average of some values in one column and then in subsequent columns display the individual values?

I have a bunch of data and I want the output to display an average of all the data points but also the individual data points in subsequent columns. Ideally it would look something like this:
Compound | Subject | Avg datapoint | Datapoint Experiment 1 | Datapoint Exp 2 | ...
..........XYZ......|.....ABC....|............40...............|...............20..............................|...............60...............|......
..........TUV......|.....ABC....|............30...............|...............20..............................|...............40...............|......
..........TUV......|.....DEF....|............20...............|...............10..............................|...............30...............|......
One problem I'm running in to is that I get repetitive lines of information. Another is that I have some rows pulling in info that doesn't apply, such that some of the individual datapoints in, say, row 2 would have info from subject DEF when I only want it to have info from subject ABC.
I hope this makes sense! I'm currently using inner join with a ton of where qualifiers. I'm close but not quite there. Any help is appreciate and let me know if I can provide additional info to help you help me.
The SQL language has a very strict rule requiring you to know the exact number of columns for your result set in advance, before looking at any data in your tables.
Therefore, if this average is based off a known fixed number of columns, or if the number of potential columns is reasonably small, where you can manually setup placeholders, then this will be possible. The key search terms to learn how to do this is "conditional aggregation", where you may also need to join the table to itself for each field.
Otherwise, you will need to pivot and aggregate your data in your client code or reporting tool.

Merge two CSV and collate data

I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way where I can see which item groupings most commonly appear together on the same ID. (e.g. GRAPE,GUM,SEASHELL appear together 340 times, ORANGE and STICK 89 times, etc...)
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: Not homework, or work, but a volunteer project, thus my unfamiliarity with this kind of data manipulation.
Thanks in advance.
An Alteryx solution:
Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture; Alteryx will create "Input" tools for you.
Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; unselect the "Right_ID" as output since it's merely a duplicate of "ID"
Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three of the outputs and add as a "group by"... then add the ID column with a "count"
Drag a browse tool on and connect the summary's output to the browse tool's input.
run the workflow
After all that, click on the browse tool and you should see what is seen in my screenshot: (which is showing just the first ten rows of output):
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated). It should be one to many.
Then I would add a Calculated Column to the Book1 table to Concatenate the related ITEM values, eg.
Items =
CALCULATE (
CONCATENATEX (
DISTINCT ( 'Book2'[ITEM] ),
'Book2'[ITEM],
", ",
'Book2'[ITEM], ASC
)
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
Little different solution using Alteryx.
With this dataset, there are very few repeating 3 or 4 item groups. You can do the two item affinity analysis and get a probability of 3 or 4 item groups, or you can count the 3 and 4 item groups individually. I believe what you want is the latter as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row. I then counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5,6,7...
Once you have the counts of occurrences, then I would join back with the subjects and perform this analysis on each group and compare to the overall results.
I'm supposed to disclose that I work for Alteryx.
first of all if you are using windows
just navigate to the directory which contains the CSV and write the following command:
copy pattern newfileName.csv
#example
copy *.csv merged.csv
now you created one csv file, the file is too large now you can't process it once, depending on your programming language you can use appropriate way, for python you can use generators to process line by line, or pandas you can read chunk by chunk it will be easy.
I hope this help you.

Merge multiple rows in fixed width file source into one row

I'm working with the craziest file format I've seen. It is fixed width, and contains multiple record types (in the sense that each row may have different columns and widths). There's a file header, trailer, and then a static number of rows that when put together make up one record. The problem I'm having is that there is nothing in the rows that tell you they belong to the same record other than their sort order and a row number attribute.
Example:
001 David Wellingsworth Mr.
002 312-555-5555 3060 W Maple St. Chicago
001 Jimothy Bogendath Dr.
002 563-555-5432 123 Main St. Davenport
My question is therefore: is it possible, without using a Script Component, to process a file like this? I understand the basic concept of how to handle disparate record types in a fixed width file (making use of conditional splits and substrings), but I can't get past how to join up all this data after the splits if the rows don't have identifiers.
If it helps, my question is basically this previous question but in reverse.
Possible but with some work. I've worked with data like these and this was our approach on how we solved them.
You will need to build a table that will give them their own unique RecordID
Create another table for your Files to log in your filename and unique fileID
Link your fileID to the RecordID so you know which file each record came from
Build all your sub tables linking to each unique RecordID
Building your tables this way will give you:
Unique recordID for each row (though there maybe duplicate in the file, in your tables they are unique).
Knowing which file each record comes from.

Using LIKE in SQL Server to identify strings

I am writing a program that performs operations on a database of Football matches and data. One of the issues that I have is that my source data does not have consistent naming of each Team. So Leyton Orient could appear as L Orient. Most of the time this team is listed as L Orient. So I need to find the closest match to a team name when it does not appear in the database team name list exactly as it appears in the data that I am importing. Currently in my database I have a table 'Team' with a data sample as follows:
TeamID TeamName TeamLocation
1 Arsenal England
2 Aston Villa England
3 L Orient England
If the name 'Leyton Orient' appears in the data being imported I need to match this to L Orient and get the TeamID 3. My question is, can I use the LIKE function to achieve this in a case where the team name is longer than the name in the database?
I have figured out that if I had 'Leyton Orient' in the table and was importing 'L Orient' I could locate the correct entry with:
SELECT TeamName FROM Team WHERE TeamName LIKE '%l%orient%';
But can I do it the other way around? Also, I could have an example like Manchester United and I want to import Man Utd. I could find this by putting a % sign between every character like this:
SELECT TeamName FROM Team WHERE TeamName LIKE '%M%a%n%U%t%d%';
But is there a better way?
Finally, and this might be better put in another question, I would like not to have to search for the correct team when the way a team is named is repeated, i.e. I would like to store alternative spellings/aliases for teams in order to find the correct team entry quickly. Can anybody advise on how I might approach this? Thanks
The solution you are looking for is the FULL TEXT SEARCH, it'll require your DBA to create a full text index, however, once there you can perform much more powerful searches than just character pattern matching.
As the others have suggested, you could also just have an Alias table, which contains all possible forms of a team name and reference that. depending on how your search is working, that may well be the path of least resistance.
Finally, and this might be better put in another question, I would like not to have to search for the correct team when the way a team is named is repeated, i.e. I would like to store alternative spellings/aliases for teams in order to find the correct team entry quickly. Can anybody advise on how I might approach this? Thank
I would personally have a team table and a teamalias table. Use relationships to marry them up.
I believe the best way to prevent this, is to have a list of teams names displayed in a dropdown list. This will also let you drop validation for the team name. The users can now only choose one set team name and will also make it much easier for you working in your database. then you can look for the exact team name as it appears in your database. i.e.:
SELECT TeamName FROM Team WHERE TeamName = [dropdownlist_name];

Search products with parent and child categories

I'm building a shopping cart website and using SQL tables
CATEGORY
Id int,
Parent_Id,
Description varchar(100)
Data:
1 0 Electronics
2 0 Furniture
3 1 TVs
4 3 LCD
5 4 40 inches
6 4 42 inches
PRODUCTS
Id int,
Category_Id int
Description...
Data:
1 5 New Samsung 40in LCD TV
2 6 Sony 42in LCD TV
As you can see I only have one column for the last Child Category
Now what I need to do is search by Main Category at homepage, for example if the user clicks to Electronics, show both TVs as they have a Parent-Parent-Parent Id at Electronics, keeping in mind that Products table do have only one column for Category.
Shall I update the Products Table and include 6 columns for category childs in order to solve this? Or how can I build an effective SQL Stored Procedure for this?
Thank you
Jerry
in Oracle, you would use CONNECT BY
If you're using SQL 2008 then you might want to look at the HIERARCHYID data type. Otherwise, you might want to consider redesigning the Category table. How you have it modeled now, you have to use recursion to get from children notes to parents or from parents down through children.
Instead of using the linked list model (which is what you have) you could use the nested set model for hierarchies. Do a search on Joe Celko and Nested Set Model and you should be able to find some good descriptions of it. He also wrote an entire book on modeling trees and hierarchies in SQL. The nested set model requires a bit of set up to maintain the data, but it's much easier to work with when selecting out data. Since your categories will probably remain relatively stable it seems like a good solution.
EDIT: To actually answer your question... you could write a stored procedure that sits in a WHILE loop, selecting children and collecting any products found in a table variable. Check ##ROWCOUNT in each loop and if it's 0 then you've gotten to the end. Then you just select out from your table variable. It's a recursive (and slow) method, which is why this type of a model doesn't work very well in many cases in SQL.
Under almost no circumstances should you just add 6 (or 7 or 8) category IDs to your products table. Bad. Bad. Bad. It will be a maintenance nightmare among other things (what happens when your categories go 7 levels deep... then 8... then 9.
Use recursive CTEs to do this ! works like a dream ! http://msdn.microsoft.com/en-us/library/ms186243.aspx