Select unique combination of values (attributes) based on user_id - sql

I have a table that has user a user_id and a new record for each return reason for that user. As show here:
| user_id | return_reason |
|--------- |-------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
| 4 | changed mind |
What I would like to do is generate a foreign key for each combination of values that are applicable in a new table and apply that key to the user_id in a new table. Effectively creating a many to many relationship. The result would look like so:
Dimension Table ->
| reason_id | return_reason |
|----------- |--------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
Fact Table ->
| user_id | reason_id |
|--------- |----------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
My thought process is to iterate through the table with a cursor, but this seems like a standard problem and therefore has a more efficient way of doing this. Is there a specific name for this type of problem? I also thought about pivoting and unpivoting. But that didn't seem too clean either. Any help or reference to articles in how to process this is appreciated.

The problem concerns data normalization and relational integrity. Your concept doesn't really make sense - Dimension table shows two different reasons with same ID and Fact table loses a record. Conventional schema for this many-to-many relationship would be three tables like:
Users table (info about users and UserID is unique)
Reasons table (info about reasons and ReasonID is unique)
UserReasons junction table (associates users with reasons - your
existing table). Assuming user could associate with same reason
multiple times, probably also need ReturnDate and OrderID_FK fields
in UserReasons.
So, need to replace reason description in first table (UserReasons) with a ReasonID. Add a number long integer field ReasonID_FK in that table to hold ReasonID key.
To build Reasons table based on current data, use DISTINCT:
SELECT DISTINCT return_reason INTO Reasons FROM UserReasons
In new table, rename return_reason field to ReasonDescription and add an autonumber field ReasonID.
Now run UPDATE action to populate ReasonID_FK field in UserReasons.
UPDATE UserReasons INNER JOIN UserReasons.return_reason ON Reasons.ReasonDescription SET UserReasons.ReasonID_FK = Reasons.ReasonID
When all looks good, delete return_reason field.

Related

Organizing & normalising RSS Feed categories data

I am having trouble normalising data from a RSS Feed into a database.
Each post would have id and categories.
The problem I am having is that categories is a list which is not predefined in size. By 1NF I should split a list up such that each column only has atomic data:
+----+----------+
| id | name |
+----+----------+
| 1 | flying |
| 2 | swimming |
| 3 | throwing |
| 4 | sleeping |
| 5 | etc |
+----+----------+
However, blog posts can have more than one category tagged. This means that the posts table can have a list of ids of the categories tagged.
Alternatively, the categories table can have two ids:
+----+--------+----------+
| id | postId | name |
+----+--------+----------+
| 1 | 1 | flying |
| 2 | 1 | swimming |
| 3 | 1 | throwing |
| 4 | 2 | flying |
| 5 | 2 | swimming |
| 6 | 2 | etc |
+----+--------+----------+
And the posts table id will reference the postId column. However, there is repeated data, which is not good.
Lastly, another method I had thought of was to put all the categories in one table:
+----+--------+----------+----------+----------+-----+
| id | flying | swimming | throwing | sleeping | etc |
+----+--------+----------+----------+----------+-----+
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 1 |
+----+--------+----------+----------+----------+-----+
1s representing present and 0s representing absent, the id in the posts table references id. This method would not have any repeated data. However, categories from blogs can be created at will, making it hard to maintain such a table as I would need to update it every time there is a new category.
How do I put my database in 3NF, eliminating repetition while keeping it maintainable?
TL;DR "Repeated data" is a bugbear. Learn about design and normalization. Start with rows/tables that make clear straightforward relevant statements about an arbitrary situation. So far all you need is:
-- [id] identifies a post with ...
Post(id, ...)
-- post [id] is tagged [name]
Post_Category(id, name)
there is repeated data, which is not good
What exactly do you think "repeated data" is? And why exactly do you think it's "not good"?
There is nothing intrinsically bad about having the same value appear multiple times as a column of a row or part of a value for a column of a row. What matters is whether rows in tables say overlapping things about a situation in certain ways.
Normalization replaces a table by projections of it that join back to it. That means that it replaces tables whose rows say (ie have predicate) "some stuff AND other stuff" about column values by tables whose rows say "some stuff" and "other stuff" separately. Having "AND"s in such a row/table meaning isn't always bad. When there's only one AND, normalization says to decompose to a particular pair of tables exactly when no shared column set always holds a unique set of values in either of the two tables.
put all the categories in one table
Although there is nothing about such a design that would cause normalization to decompose it, your last table is a "bad" design. (Sometimes this kind of design with repeated similar columns is said to violate some notion of "1NF" or "normalization", but that is a misconception.) Eg its rows say "(post [id] is tagged 'flying' and [flying] = 1 OR post [id] is not tagged 'flying' AND [flying] = 0) AND (post [id] is tagged 'swimming' and [swimming] = 1 OR post [id] is not tagged 'swimming' AND [swimming] = 0) AND ..." when instead we could just have a table Post_Category with rows saying "post [id] is tagged [name]". Eg we cannot write queries that ask about all categories without mentioning all categories explicitly. Eg if we add a new category then we must add a new column to the table and then if we want our past queries re all categories to mean the same thing then they we must add the new column to still be referring to all categories.
PS It's not clear why you introduced ids. There are reasons we do so, but you should do it for a reason. (Normalization does not introduce ids.) Eg introducing post ids if posts are not uniquely identifiable by other information we want to record.

Oracle Recursive Select to Find Current ID Associated with a Customer

I have a table that contains the history of Customer IDs that have been merged in our CRM system. The data in the historical reporting Oracle schema exists as it was when the interaction records were created. I need a way to find the Current ID associated with a customer from potentially an old ID. To make this a bit more interesting, I do not have permissions to create PL/SQL for this, I can only create Select statements against this data.
Sample Data in customer ID_MERGE_HIST table
| OLD_ID | NEW_ID |
+----------+----------+
| 44678368 | 47306920 |
| 47306920 | 48352231 |
| 48352231 | 48780326 |
| 48780326 | 50044190 |
Sample Interaction table
| INTERACTION_ID | CUST_ID |
+----------------+----------+
| 1 | 44678368 |
| 2 | 48352231 |
| 3 | 80044190 |
I would like a query with a recursive sub-query to provide a result set that looks like this:
| INTERACTION_ID | CUST_ID | CUR_CUST_ID |
+----------------+----------+-------------+
| 1 | 44678368 | 50044190 |
| 2 | 48352231 | 50044190 |
| 3 | 80044190 | 80044190 |
Note: Cust_ID 80044190 has never been merged, so does not appear in the ID_MERGE_HIST table.
Any help would be greatly appreciated.
You can look at CONNECT BY construction.
Also, you might want to play with recursive WITH (one of the descriptions: http://gennick.com/database/understanding-the-with-clause). CONNECT BY is better, but ORACLE specific.
If this is frequent request, you may want to store first/last cust_id for all related records.
First cust_id - will be static, but will require 2 hops to get to the current one
Last cust_id - will give you result immediately, but require an update for the whole tree with every new record

Rolling id based on foreign key in a hierarchical schema

As an example, consider this hierarchical schema.
Assume all id fields are auto incrementing primary keys and that foreign keys are named by [parent_table_name]_id convention.
The problem
As soon as there are multiple companies in the database, then companies will share all primary key sequences between them.
For example, if there are two company rows, the customer_group table could look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
-------------------
But it should look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
This behavior should also be exhibited for customer and any other table in the tree that directly or indirectly references company.
Note that I will most likely make a second id column (named something like relative_id) for this purpose, keeping the unique id column intact, as this is really mostly for display purposes and how users will reference these data entities.
Now if this was just one level of hierarchy, it would be a relatively simple solution.
I could make a table (table_name, company_id, current_id) and a trigger procedure that fires before insert on any of the tables, incrementing the current id by 1 and setting the row's relative_id to that value.
It's trivial when the company_id is right there in the insert query.
But how about the tables that don't reference company directly?
Like the lowest level of the hierarchy in this example, workorder, which only references customer.
Is there a clean, reusable solution to climb the ladder all the way from 'customer_id' to ultimately retrieve the parenting company_id?
Going recursively up the hierarchy with SELECTs on each INSERT doesn't sound too appealing to me, performance wise.
I also do not like the idea of just adding a foreign key to company for each of these tables, the schema would get increasingly uglier with each additional table.
But these are the two solutions I can see, but I may not be looking in the right places.
The company shouldn't care what the primary key is if you're using generated keys. They're supposed to be meaningless; compared for equality and nothing else. I grumbled about this earlier, so I'm really glad to see you write:
Note that I will most likely make a second id column (named something
like relative_id) for this purpose, keeping the unique id column
intact, as this is really mostly for display purposes and how users
will reference these data entities.
You're doing it right.
Most of the time it doesn't matter what the ID is, so you can just give them whatever comes out of a sequence and not care about holes/gaps. If you're concerned about inter-company leakage (unlikely) you can obfuscate the IDs by using the sequence as an input to a pseudo-random generator. See the function Daniel Verité wrote in response to my question about this a few years ago, pseudo_encrypt.
There are often specific purposes for which you need perfectly sequential gapless IDs, like invoice numbers. For those you need to use a counter table and - yes - look up the company ID. Such ID generation is slow and has terrible concurrency anyway, so an additional SELECT with a JOIN or two on indexed keys won't hurt much. Don't go recursively up the schema with SELECTs though, just use a series of JOINs. For example, for an insert into workorder your key generation trigger on workorder would be something like the (untested):
CREATE OR REPLACE FUNCTION workorder_id_tgfn() RETURNS trigger AS $$
BEGIN
IF tg_op = 'INSERT' THEN
-- Get a new ID, locking the row so no other transaction can add a
-- workorder until this one commits or rolls back.
UPDATE workorder_ids
SET next_workorder_id = next_workorder_id + 1
WHERE company_id = (SELECT company_id
FROM customer
INNER JOIN customer_group ON (customer.customer_group_id = customer_group.id)
INNER JOIN company ON (customer_group.company_id = company.id)
WHERE customer.id = NEW.customer_id)
RETURNING next_workorder_id
INTO NEW.id;
END IF;
END;
$$ LANGUAGE 'plpgsql';
For the UPDATE ... RETURNING ... INTO syntax see Executing a Query with a Single-Row Result.
There can be gaps in normal sequences even if there's no multi-company problem. Observe:
CREATE TABLE demo (id serial primary key, blah text);
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
BEGIN;
INSERT INTO demo(blah) values ('bb');
ROLLBACK;
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
SELECT * FROM demo;
Result:
regress=# SELECT * FROM demo;
id | blah
----+------
1 | aa
3 | aa
"But it should look like this"
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
I think it should not and I think you want a many to many relationship. The customer_group table:
| id | name |
-------------
| 1 | n1 |
| 2 | n2 |
| 3 | n3 |
-------------
And then the customer_group_company table:
| group_id | company_id |
-------------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------------

Database design, for optimized access

I jump straight into the problem, currently I have a table as such
id | model | CategoryId | etc...
Now my new requirement is to have support for multiple categories. So I have two possible solutions in mind but I would like to know problems that both this designs might create. I also know that at most I can have 6 categories, also I can't create a linker table to link product to category.
On first design I would simply create column CategoryN
id | model | CategoryId1 | CategoryId2 | CategoryId3 | CategoryId4 | etc...
But this would make queries hideous,
id | model | CategoryId | etc...
My second approach is simply to add product for N categories
id | model | CategoryId | etc...
1 | ABC | 1 | etc...
2 | ABC | 2 | etc...
3 | ABC | 3 | etc...
I think queries would be cleaner but not necessarily simpler.
Another aspect is that I am looking at the performance of the queries and it looks like the first approach would be better.
I hope this is clear enough.
Thanks for any suggestions.
The third option is a many-to-many table to link a model to a category:
MODEL_CATEGORIES
model_id (primary key, foreign key to MODEL table)
category_id (primary key, foreign key to CATEGORY table)
Your example data would resemble:
model_id category_id
----------------------
1 1
1 2
1 3
This means there's no need for a category_id column in the MODEL table.
I think you're after a many-to-many relationship here.
Basically, you have a model_to_categories table that matches model ids against category ids.
Can you make the CategoryID values bit-field "flags"?
If so, you could keep your performance high by using the single CategoryID field that you have now, keep your queries simple by not adding a bunch of new columns, and would have up to 32 categories for each product over time (assuming that CategoryID is an int).
So, your CategoryID values would be:
id | model | CategoryId | etc...
1 | ABC | 1 | etc...
2 | ABC | 2 | etc...
3 | ABC | 4 | etc...
3 | ABC | 8 | etc...
You would store the total of all of the CategoryIDs in the CategoryID column (for backward compatibility) and then would have to test the value of the field to find out if a specific category were "set". Incidentally, you can do that directly in your query as well.
It's not "best practice", which would really require a brand-new table and a lot of joining, but if you are looking for a way to shoe-horn something in there that will work, bit-field flags will do the trick.

Retrieve comma delimited data from a field

I've created a form in PHP that collects basic information. I have a list box that allows multiple items selected (i.e. Housing, rent, food, water). If multiple items are selected they are stored in a field called Needs separated by a comma.
I have created a report ordered by the persons needs. The people who only have one need are sorted correctly, but the people who have multiple are sorted exactly as the string passed to the database (i.e. housing, rent, food, water) --> which is not what I want.
Is there a way to separate the multiple values in this field using SQL to count each need instance/occurrence as 1 so that there are no comma delimitations shown in the results?
Your database is not in the first normal form. A non-normalized database will be very problematic to use and to query, as you are actually experiencing.
In general, you should be using at least the following structure. It can still be normalized further, but I hope this gets you going in the right direction:
CREATE TABLE users (
user_id int,
name varchar(100)
);
CREATE TABLE users_needs (
need varchar(100),
user_id int
);
Then you should store the data as follows:
-- TABLE: users
+---------+-------+
| user_id | name |
+---------+-------+
| 1 | joe |
| 2 | peter |
| 3 | steve |
| 4 | clint |
+---------+-------+
-- TABLE: users_needs
+---------+----------+
| need | user_id |
+---------+----------+
| housing | 1 |
| water | 1 |
| food | 1 |
| housing | 2 |
| rent | 2 |
| water | 2 |
| housing | 3 |
+---------+----------+
Note how the users_needs table is defining the relationship between one user and one or many needs (or none at all, as for user number 4.)
To normalise your database further, you should also use another table called needs, and as follows:
-- TABLE: needs
+---------+---------+
| need_id | name |
+---------+---------+
| 1 | housing |
| 2 | water |
| 3 | food |
| 4 | rent |
+---------+---------+
Then the users_needs table should just refer to a candidate key of the needs table instead of repeating the text.
-- TABLE: users_needs (instead of the previous one)
+---------+----------+
| need_id | user_id |
+---------+----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 1 | 2 |
| 4 | 2 |
| 2 | 2 |
| 1 | 3 |
+---------+----------+
You may also be interested in checking out the following Wikipedia article for further reading about repeating values inside columns:
Wikipedia: First normal form - Repeating groups within columns
UPDATE:
To fully answer your question, if you follow the above guidelines, sorting, counting and aggregating the data should then become straight-forward.
To sort the result-set by needs, you would be able to do the following:
SELECT users.name, needs.name
FROM users
INNER JOIN needs ON (needs.user_id = users.user_id)
ORDER BY needs.name;
You would also be able to count how many needs each user has selected, for example:
SELECT users.name, COUNT(needs.need) as number_of_needs
FROM users
LEFT JOIN needs ON (needs.user_id = users.user_id)
GROUP BY users.user_id, users.name
ORDER BY number_of_needs;
I'm a little confused by the goal. Is this a UI problem or are you just having trouble determining who has multiple needs?
The number of needs is the difference:
Len([Needs]) - Len(Replace([Needs],',','')) + 1
Can you provide more information about the Sort you're trying to accomplish?
UPDATE:
I think these Oracle-based posts may have what you're looking for: post and post. The only difference is that you would probably be better off using the method I list above to find the number of comma-delimited pieces rather than doing the translate(...) that the author suggests. Hope this helps - it's Oracle-based, but I don't see .