What is the most efficient way to store a variable number of columns in SQL Server? - sql

What is the most efficient way to store a variable number of columns in MS-SQL?
I have a requirement to store a large number (several million) of records in Microsoft SQL Server (via C#). Most columns are standard, but certain groups of users will need to add their own custom columns, and record data in them.
The data in each custom column field will not be large, but the number of records with a certain set of custom columns will be in the millions.
I do not know ahead of time what these columns might be (in terms of name or datatype), but I'll need to pull reports based on these columns as efficiently as possible.
What is the most efficient way of storing the new varying columns and data?
Entity-Attribute-Value model?
Cons: efficiency if there's a large number of custom columns (= a large number of rows)?
An extra "CustomColumns" table?
Storing columnName, data, and datatype for each custom column, each time an entry has one.
Cons: a table with a very large number of records; perhaps not the most efficient storage.
Serialise the extra columns for each record into a single field
Cons: lookup efficiency suffers, and stored procedures become complicated when running reports based on a custom field (see the sketch after this list).
Any other?
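To make the reporting drawback of option (3) concrete: if the extra columns were serialised into a JSON field, every report predicate would have to parse the blob per row. This is only a hedged sketch, not part of the original question; the records table and custom_data column are hypothetical, and JSON_VALUE requires SQL Server 2016 or later.

-- Every predicate on a custom field parses the serialised blob, so this
-- scans unless a computed column over the JSON path is indexed.
SELECT r.id,
       JSON_VALUE(r.custom_data, '$.att1') AS att1
FROM records AS r
WHERE JSON_VALUE(r.custom_data, '$.att2') = '42';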
Edit: I think I may be confusing options (1) and (2). What I actually meant is: is the following the best approach?
Entity (User Groups)
id | name         | description
-- | ------------ | ------------
1  | user group 1 | user group 1
2  | user group 2 | user group 2
Attribute
id | name | type    | entityids (best way to do this for 2 user groups using the same attribute?)
-- | ---- | ------- | ---------
1  | att1 | string  | 1,2
2  | att2 | int     | 2
3  | att3 | string  | 1
4  | att4 | numeric | 2
5  | att5 | string  | 1
Value
id | entityId | attributeId | value
-- | -------- | ----------- | ------
1  | 1        | 1           | a
2  | 1        | 2           | 1
3  | 1        | 3           | b
4  | 1        | 3           | c
5  | 1        | 3           | d
6  | 1        | 3           | 75
7  | 1        | 5           | Inches
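As a minimal T-SQL sketch of the layout above (all names illustrative, assuming SQL Server per the question): note that the comma-separated entityids column is usually better modelled as a junction table, since a list packed into one column can't be indexed or joined efficiently.

CREATE TABLE Attribute (
    id   INT IDENTITY PRIMARY KEY,
    name NVARCHAR(100) NOT NULL,
    type VARCHAR(20) NOT NULL            -- 'string', 'int', 'numeric', ...
);

-- Junction table replacing the comma-separated entityids column.
CREATE TABLE EntityAttribute (
    entityId    INT NOT NULL REFERENCES Entity(id),
    attributeId INT NOT NULL REFERENCES Attribute(id),
    PRIMARY KEY (entityId, attributeId)
);

CREATE TABLE AttributeValue (
    id          INT IDENTITY PRIMARY KEY,
    entityId    INT NOT NULL REFERENCES Entity(id),
    attributeId INT NOT NULL REFERENCES Attribute(id),
    value       NVARCHAR(400) NOT NULL   -- interpreted per Attribute.type
);

-- Reports filter on one custom column at a time, so index that access path.
CREATE INDEX IX_AttributeValue_Attr
    ON AttributeValue (attributeId, entityId) INCLUDE (value);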

Related

Returning a singular row/value from a joined table based on the closest date

I have a Production Table and a Standing Data table. The relationship of Production to Standing Data is actually many-to-many, which is different from how this relationship is usually represented (many-to-one).
The standing data table holds a list of tasks and the score each task is worth. Tasks can appear multiple times with different "ValidFrom" dates, for changing the score at different points in time. What I am trying to do is query the Production Table so that each TaskID is looked up in the standing data, using the date it was logged to determine which score to return.
Here's an example of how I want the data to look:
Production Table:
+----------+------------+-------+-----------+--------+-------+
| RecordID | Date       | EmpID | Reference | TaskID | Score |
+----------+------------+-------+-----------+--------+-------+
| 1        | 27/02/2020 | 1     | 123       | 1      | 1.5   |
| 2        | 27/02/2020 | 1     | 123       | 1      | 1.5   |
| 3        | 30/02/2020 | 1     | 123       | 1      | 2     |
| 4        | 31/02/2020 | 1     | 123       | 1      | 2     |
+----------+------------+-------+-----------+--------+-------+
Standing Data
+----------+--------+----------------+-------+
| RecordID | TaskID | DateActiveFrom | Score |
+----------+--------+----------------+-------+
| 1        | 1      | 01/02/2020     | 1.5   |
| 2        | 1      | 28/02/2020     | 2     |
+----------+--------+----------------+-------+
I have tried the below code but unfortunately due to multiple records meeting the criteria, the production data duplicates with two different scores per record:
SELECT p.[RecordID],
       p.[Date],
       p.[EmpID],
       p.[Reference],
       p.[TaskID],
       s.[Score]
FROM ProductionTable AS p
LEFT JOIN StandingDataTable AS s
       ON s.[TaskID] = p.[TaskID]
      AND s.[DateActiveFrom] <= p.[Date];
What is the correct way to return the correct and singular/scalar Score value for this record based on the date?
You can use APPLY:
SELECT p.[RecordID], p.[Date], p.[EmpID], p.[Reference], p.[TaskID], s.[Score]
FROM ProductionTable AS p
OUTER APPLY (
    SELECT TOP (1) s.[Score]
    FROM StandingDataTable AS s
    WHERE s.[TaskID] = p.[TaskID]
      AND s.[DateActiveFrom] <= p.[Date]
    ORDER BY s.DateActiveFrom DESC
) AS s;
If you want the score on a per-record basis instead, change the WHERE clause inside the APPLY.
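For reference, the same lookup can be written with a ROW_NUMBER() window instead of APPLY; this is just an equivalent sketch, not from the original answer:

SELECT RecordID, [Date], EmpID, Reference, TaskID, Score
FROM (
    SELECT p.[RecordID], p.[Date], p.[EmpID], p.[Reference], p.[TaskID], s.[Score],
           -- Number the candidate standing-data rows per production record,
           -- newest effective date first; row 1 is the score in force.
           ROW_NUMBER() OVER (PARTITION BY p.[RecordID]
                              ORDER BY s.[DateActiveFrom] DESC) AS rn
    FROM ProductionTable AS p
    LEFT JOIN StandingDataTable AS s
           ON s.[TaskID] = p.[TaskID]
          AND s.[DateActiveFrom] <= p.[Date]
) AS x
WHERE rn = 1;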

Efficiently reconcile changing identifiers in SQL?

I am working with data where the user identifier changes. The user identifiers are GUIDs, so they shouldn't repeat across different users. When an identifier changes, I am provided with the old user identifier and the current user identifier on the same row in a table. I need to reconcile these values and have them both assigned to the same database-generated integer ID, which is the value I use to refer to the user elsewhere in the database.
Not long ago, the user identifiers would not change. I had the following setup:
users table
id | identifier
---------------
 1 | ABC
 2 | DEF
etc ...
activity table
id | timestamp | identifier | other_data
---------------------------------------------
...
29 | 1         | ABC        | more data
30 | 2         | ABC        | even more data
31 | 3         | ABC        | etc
32 | 4         | DEF        | etc
33 | 5         | DEF        | etc
34 | 6         | ABC        | more data
...
My goal remains to aggregate activity from the activity table into an activity_daily table. In the prior setup, that was relatively simple because I could expect that the identifier was consistent per user.
My output aggregate activity_daily table had the structure:
id | user_id | date      | other_stuff
--------------------------------------
 1 | 1       | 9/10/2017 | etc
 2 | 1       | 9/11/2017 | etc
 3 | 2       | 9/08/2017 | etc
 4 | 2       | 9/09/2017 | etc
 5 | 1       | 9/12/2017 | etc
...
Now, however, the activity table has changed. For the first activity record where an identifier changes, I get a value in a column called identifier_old. The activity table now looks like the following:
activity table
id | timestamp | identifier | identifier_old | other_data
-------------------------------------------------------------------
...
29 | 110       | ABC        |                | more data
30 | 111       | GHI        | ABC            | other data
31 | 112       | GHI        |                | etc
32 | 114       | DEF        |                | etc
33 | 115       | DEF        |                | etc
34 | 116       | JKL        | DEF            | etc
35 | 117       | GHI        |                | etc
36 | 118       | JKL        |                | etc
37 | 119       | JKL        |                | etc
38 | 120       | GHI        |                | etc
...
Now I need to create the same aggregate activity_daily table, with the added complexity of mapping identifier and identifier_old to the same integer id in the users table.
Each day, somewhere around 10 million records are loaded into the activity table that have to be reconciled and aggregated. There are millions of unique identifiers, so I'm trying to keep the reconciling of the identifiers and the aggregation steps as efficient as possible.
I've had two thoughts about how to approach this, but neither seems particularly efficient when considering the aggregation and joins on the activity table.
1) Create an identifiers table with columns id, identifier, and user_id; the users table no longer stores the identifier. Then do the following (sketched after this list): a) check whether identifier_old is in the identifiers table; if not, add it and create an entry in the users table to generate an id, then add that id to the proper record in the identifiers table. b) Look in the activity table at records that have a value in both identifier and identifier_old; add the identifier from those records to the identifiers table, then update those records with the appropriate user_id values from the identifier_old entries already in the identifiers table. c) Do my aggregation, etc. based on the identifier column in the activity table.
2) Similar, but don't maintain a separate identifiers table. Instead, add a third column to the users table called user_static_id (or something). All identifier values go into the users table but those that refer to the same person share the same user_static_id and the aggregate table has a foreign key for user_static_id instead of for the id column in the users table.
Neither of these seems like a great approach, and both could significantly slow down the reconciliation and aggregation process.
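To make option (1) concrete, here is a hedged sketch of the mapping table and step (b), assuming SQL Server and the table/column names from the question (the identifiers table itself is hypothetical):

-- Hypothetical mapping table: every identifier ever seen, old or new,
-- resolves to the same stable users.id for that person.
CREATE TABLE identifiers (
    id         INT IDENTITY PRIMARY KEY,
    identifier UNIQUEIDENTIFIER NOT NULL UNIQUE,
    user_id    INT NOT NULL REFERENCES users(id)
);

-- Step (b): for rows carrying both values, register the new identifier
-- under the user_id already mapped to the old identifier.
INSERT INTO identifiers (identifier, user_id)
SELECT DISTINCT a.identifier, i.user_id
FROM activity AS a
JOIN identifiers AS i
  ON i.identifier = a.identifier_old
WHERE a.identifier_old IS NOT NULL
  AND NOT EXISTS (SELECT 1
                  FROM identifiers AS x
                  WHERE x.identifier = a.identifier);

The aggregation can then join activity to identifiers on identifier to resolve user_id.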
Note: I cannot say, for certain, that the changed identifier values won't revert back to their previous values. For each user, they may continue to change periodically, they may revert, or they may remain static forever. The timestamp column in the activity table allows me to sort the records so that I don't end up encountering records with a new identifier before I encounter records that have both identifier and identifier_old.
It's also worth noting that the activity table is flushed after the aggregation has occurred.
Given this scenario, what is the most efficient way to handle this problem?

Efficient Classification of records by common letters in Impala

I have a table in Impala (TBL1) that contains different names with different numbers of common first letters. The table contains about 3M records. I would like to add a new attribute to the table, where each group of common first letters gets a class. It works the same way as DENSE_RANK, but with a dynamic number of first letters. The number of shared first letters should not be less than p = 3 (p is a parameter).
Here is an example for the table and the required results:
| ID | Attr1   | New_Attr1 | Some more attributes...
+----+---------+-----------+------------------------
| 1  | ZXA-12  | 1         |
| 2  | YL3300  | 2         |
| 3  | ZXA-123 | 1         |
| 4  | YL3400  | 2         |
| 5  | YL3-aaa | 2         |
| 6  | TSA 789 | 3         |
...
Does this do what you want?
select t.*,
       dense_rank() over (order by strleft(attr1, 3)) as newcol
from TBL1 t;
The "3" is your parameter.
As a note: In your example, you seem to have assigned the new value in reverse alphabetic order. Hence, you would want desc for the order by.
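In other words, a sketch with the ordering reversed (TBL1 taken from the question):

select t.*,
       -- desc makes ZXA rank before YL3 before TSA, as in the example
       dense_rank() over (order by strleft(attr1, 3) desc) as newcol
from TBL1 t;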

1 to Many Query: Help Filtering Results

Problem: a SQL query that looks at the values in the "many" relationship and, based on them, filters out values from the "1" relationship.
Tables Example: (this shows two different tables).
+---------------+----------------------------+-------+
| Unique Number | <-- Table 1 -- Table 2 --> | Roles |
+---------------+----------------------------+-------+
| 1             |                            | A     |
| 2             |                            | B     |
| 3             |                            | C     |
| 4             |                            | D     |
| 5             |                            |       |
| 6             |                            |       |
| 7             |                            |       |
| 8             |                            |       |
| 9             |                            |       |
| 10            |                            |       |
+---------------+----------------------------+-------+
When I run my query, I get multiple rows per unique number, showing all of the roles associated with each number, like so:
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1             | C     |
| 1             | D     |
| 2             | A     |
| 2             | B     |
| 3             | A     |
| 3             | B     |
| 4             | C     |
| 4             | A     |
| 5             | B     |
| 5             | C     |
| 5             | D     |
| 6             | D     |
| 6             | A     |
+---------------+-------+
I would like to be able to run my query and be able to say, "When the role of A is present, don't even show me the unique numbers that have the role of A".
Maybe if SQL could look at the roles and say, WHEN role A comes up, grab unique number and remove it from column 1.
Based on what I would "like" to happen (I put that in quotations as this might not even be possible) the following is what I would expect my query to return.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1             | C     |
| 1             | D     |
| 5             | B     |
| 5             | C     |
| 5             | D     |
+---------------+-------+
UPDATE:
Query Example: I am querying 8 tables, but I condensed it to 4 for simplicity.
SELECT
    c.UniqueNumber,
    cp.pType,
    p.pRole,
    a.aRole
FROM c
JOIN cp ON cp.uniqueVal = c.uniqueVal
JOIN p ON p.uniqueVal = cp.uniqueVal
LEFT OUTER JOIN a ON a.uniqueVal = p.uniqueVal
WHERE
    -- I do some basic filtering to get to the relevant clients' data, but nothing more than that.
ORDER BY
    c.UniqueNumber
Table sizes: these tables can have anywhere from 50,000 rows to 500,000+
Pretending the table name is t and the column names are alpha and numb:
SELECT t.numb, t.alpha
FROM t
LEFT JOIN t AS s ON t.numb = s.numb
                AND s.alpha = 'A'
WHERE s.numb IS NULL;
You can also do a subselect:
SELECT numb, alpha
FROM t
WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A');
Or one of the following if the subselect is materializing more than once (pick the one that is faster, i.e., the one with the smaller subtable size):
SELECT t.numb, t.alpha
FROM t
JOIN (SELECT numb FROM t
      GROUP BY numb
      HAVING SUM(CASE WHEN alpha = 'A' THEN 1 ELSE 0 END) = 0) AS s
  ON s.numb = t.numb;
SELECT t.numb, t.alpha
FROM t
LEFT JOIN (SELECT numb FROM t
           GROUP BY numb
           HAVING SUM(CASE WHEN alpha = 'A' THEN 1 ELSE 0 END) > 0) AS s
  ON s.numb = t.numb
WHERE s.numb IS NULL;
But the first one is probably faster and better[1]. Any of these methods can be folded into a larger query with multiple additional tables being joined in.
[1] Straight joins tend to be easier to read and faster to execute than queries involving subselects, and the exceptions are rare for self-referential joins, since they require a large mismatch in the sizes of the tables. You might hit those exceptions, though, if the number of rows that reference the 'A' alpha value is exceptionally small and it is indexed properly.
There are many ways to do it, and the trade-offs depend on factors such as the size of the tables involved and what indexes are available. On general principles, my first instinct is to avoid a correlated subquery such as another, now-deleted answer proposed, but if the relationship table is small then it probably doesn't matter.
This version instead uses an uncorrelated subquery in the where clause, in conjunction with the not in operator:
select num, role
from one_to_many
where num not in (select otm2.num from one_to_many otm2 where otm2.role = 'A')
That form might be particularly effective if there are many rows in one_to_many, but only a small proportion have role A. Of course you can add an order by clause if the order in which result rows are returned is important.
There are also alternatives involving joining inline views or CTEs, and some of those might have advantages under particular circumstances.
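One more member of the same family, not given in the answers above, is a correlated NOT EXISTS (same placeholder names t, numb, and alpha). It behaves like the anti-join but, unlike NOT IN, is not tripped up by NULLs in the subquery:

SELECT t.numb, t.alpha
FROM t
WHERE NOT EXISTS (SELECT 1
                  FROM t AS s
                  -- Exclude every numb that has at least one 'A' role.
                  WHERE s.numb = t.numb
                    AND s.alpha = 'A');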

Relative incremental ID by reference field

I have a table to store reservations for certain events; relevant part of it is:
class Reservation(models.Model):
    # django creates an auto-increment field "id" by default
    event = models.ForeignKey(Event)
    # Some other reservation-specific fields..
    first_name = models.CharField(max_length=255)
Now, I wish to retrieve the sequential ID of a given reservation relative to reservations for the same event.
Disclaimer: Of course, we assume reservations are never deleted, or their relative position might change.
Example:
+----+-------+------------+--------+
| ID | Event | First name | Rel.ID |
+----+-------+------------+--------+
| 1  | 1     | AAA        | 1      |
| 2  | 1     | BBB        | 2      |
| 3  | 2     | CCC        | 1      |
| 4  | 2     | DDD        | 2      |
| 5  | 1     | EEE        | 3      |
| 6  | 3     | FFF        | 1      |
| 7  | 1     | GGG        | 4      |
| 8  | 1     | HHH        | 5      |
+----+-------+------------+--------+
The last column is the "Relative ID", that is, a sequential number, with no gaps, for all reservations of the same event.
Now, what's the best way to accomplish this, without having to manually calculate relative id for each import (I don't like that)? I'm using postgresql as underlying database, but I'd prefer to stick with django abstraction layer in order to keep this portable (i.e. no database-specific solutions, such as triggers etc.).
Filtering using Reservation.objects.filter(event_id = some_event_id) should suffice. This will give you a QuerySet that should have the same ordering each time. Or am I missing something in your question?
I hate always being the one who answers their own question, but I solved it using this:
from django.db.models import Q

class Reservation(models.Model):
    # ...
    def relative_id(self):
        return self.id - Reservation.objects.filter(id__lt=self.id).filter(~Q(event=self.event)).count()
Assuming reservation records are never deleted, we can safely say the "relative id" is the incremental id minus the count of earlier reservations that do not belong to the same event.
I've tried to think of drawbacks, but I haven't found any.
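One practical consideration: relative_id issues a COUNT query per reservation, so rendering a long list triggers many queries. If a raw SQL fallback ever becomes acceptable despite the portability goal, a standard window function computes every relative ID in one pass; the table name myapp_reservation below is only a guess at Django's default <app>_<model> naming:

SELECT id,
       event_id,
       first_name,
       -- Number reservations within each event in insertion (id) order.
       ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY id) AS rel_id
FROM myapp_reservation;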