I have a relationships table that looks something like this:
| client_id | service_id |
|-----------|------------|
| 1         | 1          |
| 1         | 2          |
| 1         | 4          |
| 1         | 7          |
| 2         | 1          |
| 2         | 5          |
I have a list of new permissions I need to add. What I'm doing right now, for example when I have to add new permissions for the client with id 1, is:
DELETE FROM myTable WHERE client_id = 1
INSERT INTO ....
Is there a more efficient way I can remove only the ones I won't insert later, and add only the new ones?
Yes, you can do this, but in my humble opinion it's not really an SQL-dependent subject; it actually depends on your language/platform choice. If you use a powerful platform like .NET or Java, there are many database classes (adapters, datasets, etc.) that can take care of things for you, like finding the changed parts and updating/inserting/deleting only the necessary rows.
I prefer using Hibernate/NHibernate-like libraries. In that case you don't even need to write SQL queries most of the time; just do things at the OOP level and synchronize with the database.
If you put the new permissions into another table, you could do something like:
DELETE FROM myTable WHERE client_id in (SELECT client_id FROM tmpTable);
INSERT INTO myTable (client_id, service_id) SELECT client_id, service_id FROM tmpTable;
You are still taking 2 passes, but you are doing them all at once instead of one at a time.
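If you'd rather not drop rows that are only going to be re-inserted, you can diff against the staging table in both directions. A rough sketch, reusing the myTable/tmpTable names from the answer above; exact syntax varies a little by RDBMS:
-- Remove only the permissions that are not in the new list.
DELETE FROM myTable
WHERE client_id = 1
  AND NOT EXISTS (SELECT 1
                  FROM tmpTable t
                  WHERE t.client_id = myTable.client_id
                    AND t.service_id = myTable.service_id);

-- Add only the permissions that are not already there.
INSERT INTO myTable (client_id, service_id)
SELECT t.client_id, t.service_id
FROM tmpTable t
WHERE t.client_id = 1
  AND NOT EXISTS (SELECT 1
                  FROM myTable m
                  WHERE m.client_id = t.client_id
                    AND m.service_id = t.service_id);
On SQL Server you could also fold both steps into a single MERGE with WHEN NOT MATCHED BY TARGET THEN INSERT and WHEN NOT MATCHED BY SOURCE ... THEN DELETE.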
I've got two types of input files I'm loading into an ADLA job. In one, I've got a bunch of data (left) and in another, I've got a list of values that are important to me (right).
As an example here, let's say I'm using the following in my "left" rowset:
| ID | URL |
|----|-------------------------|
| 1 | https://www.google.com/ |
| 2 | https://www.yahoo.com/ |
| 3 | https://www.hotmail.com/|
I'll have something like the following in my right rowset:
| ID | Name | Regex | Exceptions | Other Lookup Val |
|----|-------|-------------|------------|------------------|
| 1 | ThisA | /[a-z]{3,}/ | abc | 091238 |
| 2 | ThatA | /[a-z]{3,}/ | xyz | lksdf9 |
| 3 | OtherA| /[a-z]{3,}/ | def | 098143 |
As each are loaded via an EXTRACT statement, both are in separate rowsets. Ideally, I'd like to be able to load all the values for both rowsets and loop through the right one to run a series of calculations against the left one to find a match per various business rules. Notably, there's no value to simply join on, nor is it a simple Regex evaluation, but rather something a bit more involved. Thus, the output might just look something like the "left" rowset:
| ID | URL |
|----|-------------------------|
| 1 | https://www.google.com/ |
| 3 | https://www.hotmail.com/|
Now, a COMBINER is the only UDO I see that accepts two rowsets, but the U-SQL syntax requires that I do some sort of join statement here. There's no common identifier between each of the rowsets though, so there's nothing to join on, which suddenly makes this seem less ideal. Of the attribute options defined at https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide#use-user-defined-combiners, I'd like to specify this as a Full because I'd need each of the left values available to evaluate against each of the right ones, but again, no shared identifier to do this on.
I then tried to use a REDUCER that accepted an IRowset in the IReducer constructor as a parameter, then tried to just pass the rowset in from the U-SQL, but it didn't like that syntax.
Is there any way to perform this custom combining in a manner that doesn't require a JOIN ON clause?
It sounds like you may be able to use an IProcessor. This would allow you to analyze each row in the RIGHT set and add a column (with a value based on your business rules) that you can subsequently use to join to the LEFT set.
[Adding a bit more detail]: You could also do this twice, once for the left and once for the right, to create an artificial join column (a constant, a row number, or some such).
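To sketch that artificial-key idea in U-SQL (untested; @leftRaw, @rightRaw, OtherLookupVal and MyNamespace.RuleCombiner are placeholder names, not anything from the question): give both rowsets the same constant column, key the combiner on it, and the combiner then sees every left row alongside every right row.
// Placeholder names throughout; the combiner class is hypothetical.
@left =
    SELECT ID, URL, 1 AS JoinKey
    FROM @leftRaw;

@right =
    SELECT ID, Name, Regex, Exceptions, OtherLookupVal, 1 AS JoinKey
    FROM @rightRaw;

@matched =
    COMBINE @left AS L WITH @right AS R
    ON L.JoinKey == R.JoinKey
    PRODUCE ID int, URL string
    USING new MyNamespace.RuleCombiner();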
I am facing an issue where a data supplier is generating a dump of his multi-tenant databases into a single table. Recreating the original tables is not impossible; the problem is that I am receiving millions of rows every day, and recreating everything every day is out of the question.
Until now, I was using SSIS to do so, with a lookup-intensive approach. In the past year, my virtual machine went from 2 GB of RAM to 128 GB, and it is still growing.
Let me explain the disgrace:
Imagine a database where users have posts, and posts have comments. In my real scenario, I am talking about 7 distinct tables. Analyzing a few rows, I have the following:
+-----+------+------+--------+------+-----------+------+----------------+
| Id* | T_Id | U_Id | U_Name | P_Id | P_Content | C_Id | C_Content |
+-----+------+------+--------+------+-----------+------+----------------+
| 1 | 1 | 1 | john | 1 | hello | 1 | hello answer 1 |
| 2 | 1 | 2 | maria | 2 | cake | 2 | cake answer 1 |
| 3 | 2 | 1 | pablo | 1 | hello | 1 | hello answer 3 |
| 4 | 2 | 1 | pablo | 2 | hello | 2 | hello answer 2 |
| 5 | 1 | 1 | john | 3 | nosql | 3 | nosql answer 1 |
+-----+------+------+--------+------+-----------+------+----------------+
Id is the key of my own (dump) table.
T_Id is the "tenant" Id, which identifies which of the multiple source databases the row came from.
I have imagined the following possible solution:
I make a query that selects non-existent Ids for each table, such as:
SELECT DISTINCT n.t_id,
n.c_id,
n.c_content
FROM mytable n
WHERE n.id > 4
AND NOT EXISTS (SELECT 1
FROM mytable o
WHERE o.id <= 4
AND n.t_id = o.t_id
AND n.c_id = o.c_id)
This way, I am able to select only the new occurrences whenever a new Id of a table is found. Although it works, it may perform badly when working with 100s of millions of rows.
Could anyone share a suggestion? I am quite lost.
Thanks in advance.
EDIT: my question is vague, so let me clarify.
My final intent is to rebuild the tables from the dump, incrementally, avoiding lookups outside the database. Every now and then I am going to run a script that selects new tenants, users, posts and comments and adds them to their corresponding tables, something like the sketch below.
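To make that concrete, for the comments table I picture something like this (Comments is a hypothetical target table; the columns come from the sample above):
INSERT INTO Comments (T_Id, C_Id, C_Content)
SELECT DISTINCT n.T_Id, n.C_Id, n.C_Content
FROM mytable n
WHERE NOT EXISTS (SELECT 1
                  FROM Comments c
                  WHERE c.T_Id = n.T_Id
                    AND c.C_Id = n.C_Id);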
My previous solution worked as follows:
Cache the whole database
For each new row, search for the columns inside the cache
If it doesn't exist, then insert it
I know it sounds dumb, but it made sense to me as a new developer working with ETLs.
First, if you have a full flat DB dump, I'd suggest you work on the file before even importing it into your DB (low-level file processing is pretty cheap and nearly instantaneous).
Following Removing lines in one file that are present in another file using python, you can remove all the lines already parsed since your last run.
with open('new.csv', 'r') as source:
    lines_src = source.readlines()

with open('old.csv', 'r') as f:
    lines_f = f.readlines()   # for very large files, set(f.readlines()) makes the lookup below O(1)

destination = open('diff_add.csv', 'w')
for data in lines_src:
    if data not in lines_f:   # keep only lines that were not in the previous dump
        destination.write(data)
destination.close()
This takes less than five seconds on a 900 MB => 1.2 GB dump. With this you'll only work with lines that actually change something in one of your new tables.
Now you can import this diff file into a working table.
As you'll have to search for the needle in each line, some indexes on the ids may be a good idea (go for composite indexes that use your Tenant_id first).
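For instance, assuming SQL Server (since SSIS is in the picture); the index names and exact column choices are only a guess:
-- One composite index per "needle", tenant first as suggested above.
CREATE INDEX IX_mytable_tenant_user    ON mytable (T_Id, U_Id);
CREATE INDEX IX_mytable_tenant_post    ON mytable (T_Id, P_Id);
CREATE INDEX IX_mytable_tenant_comment ON mytable (T_Id, C_Id);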
For the last part, I don't know exactly what your data looks like: do you also have updates to handle?
The set operators EXCEPT and INTERSECT can also help you with this kind of problem.
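For example, a minimal sketch using EXCEPT (working_table and the Comments target are assumed names; note that EXCEPT compares every selected column, so rows where only the content changed would also show up as "new"):
INSERT INTO Comments (T_Id, C_Id, C_Content)
SELECT T_Id, C_Id, C_Content FROM working_table
EXCEPT
SELECT T_Id, C_Id, C_Content FROM Comments;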
I have multiple databases on a server, each with a large table where most rows are identical across all databases. I'd like to move this table to a shared database and then have an override table in each application database which has the differences between the shared table and the original table.
The aim is to make updating and distributing the data easier as well as keeping database sizes down.
Problem constraints
The table is a hierarchical data store with date based validity.
table DATA (
ID int primary key,
CODE nvarchar,
PARENT_ID int foreign key references DATA(ID),
END_DATE datetime,
...
)
Each unique CODE in DATA may have a number of rows, but at most a single row where END_DATE is null or greater than the current time (a single valid row per CODE). New references are only made to valid rows.
Updating the shared database should not require anything to be run in application databases. This means any override tables are final once they have been generated.
Existing references to DATA.ID must point to the same CODE, but other columns do not need to be the same. This means any current rows can be invalidated if necessary and multiple occurrences of the same CODE may be combined.
PARENT_ID references must have same parent CODE before and after the split. The actual PARENT_ID value may change if necessary.
The shared table is updated regularly from an external source and these updates need to be reflected in each database's DATA. CODEs that do not appear in the external source can be thought of as invalid, new references to these will not be added.
Existing functionality will continue to use DATA, so the new view (or alternative) must be transparent. It may, however, contain more rows than the original provided earlier constraints are met.
New functionality will use the shared table directly.
Select performance is a concern, insert/update/delete is not.
The solution needs to support SQL Server 2008 R2.
Possible solution
-- in a single shared DB
DATA_SHARED (table)
-- in each app DB
DATA_SHARED (synonym to DATA_SHARED in shared DB)
DATA_OVERRIDE (table)
DATA (view of DATA_SHARED and DATA_OVERRIDE)
Take an existing DATA table to become DATA_SHARED.
Exclude IDs with more than one possible CODE so only rows common across all databases remain. These missing rows will be added back once the data is updated the first time.
Unfortunately every DATA_OVERRIDE will need all rows that differ in any database's DATA, not only the rows that differ between DATA_SHARED and that database's previous DATA. There are several IDs that differ in only a single database, and this causes the override tables in all the other databases to inflate. Ideas?
This solution causes DATA_SHARED to have a discontinuous ID space. It's a mild annoyance rather than a major issue, but worth noting.
edit: I should be able to keep all of the rows in DATA_SHARED, just invalidate them, then I only need to store differing rows in DATA_OVERRIDE.
I can't think of any situations where PARENT_ID references become invalid, thoughts?
Before:
DB1.DATA
ID | CODE | PARENT_ID | END_DATE
1 | A | NULL | NULL
2 | A1 | 1 | 2020
3 | A2 | 1 | 2010
DB2.DATA
ID | CODE | PARENT_ID | END_DATE
1 | A | NULL | NULL
2 | X | NULL | NULL
3 | A2 | 1 | 2010
4 | X1 | 2 | NULL
5 | A1 | 1 | 2020
After initial processing (DATA_SHARED created from DB1.DATA):
SHARED.DATA_SHARED
ID | CODE | PARENT_ID | END_DATE
1 | A | NULL | NULL
3 | A2 | 1 | 2010
-- END_DATE is omitted from DATA_OVERRIDE as every row is implicitly invalid
DB1.DATA_OVERRIDE
ID | CODE | PARENT_ID
2 | A1 | 1
DB2.DATA_OVERRIDE
ID | CODE | PARENT_ID
2 | X |
4 | X1 | 2
5 | A1 | 1
After update from external data where A1 exists in source but X and X1 don't:
SHARED.DATA_SHARED
ID | CODE | PARENT_ID | END_DATE
1 | A | NULL | NULL
3 | A2 | 1 | 2010
6 | A1 | 1 | 2020
edit: The DATA view would be something like:
select D.ID, ...
from DATA_SHARED D
left join DATA_OVERRIDE O on D.ID = O.ID
where O.ID is null
union all
select ID, ...
from DATA_OVERRIDE
order by ID
Given the small number of rows in DATA_OVERRIDE, performance is good enough.
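For reference, the plumbing in each application database would be roughly the following (a sketch only; SharedDb and dbo are assumed names, and the column list is abbreviated):
-- In each application DB: point at the shared table and hide the override logic behind a view.
CREATE SYNONYM DATA_SHARED FOR SharedDb.dbo.DATA_SHARED;
GO
CREATE VIEW DATA AS
    SELECT S.ID, S.CODE, S.PARENT_ID, S.END_DATE
    FROM DATA_SHARED S
    LEFT JOIN DATA_OVERRIDE O ON S.ID = O.ID
    WHERE O.ID IS NULL
    UNION ALL
    -- Override rows are implicitly invalid, so surface an END_DATE in the past
    -- (the constant here is only an example of one possible convention).
    SELECT ID, CODE, PARENT_ID, CAST('1900-01-01' AS datetime) AS END_DATE
    FROM DATA_OVERRIDE;
GO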
Alternatives
I also considered an approach where instead of DATA_SHARED sharing IDs with the original DATA, there would be mapping tables to link DATA.IDs to DATA_SHARED.IDs. This would mean DATA_SHARED would have a much cleaner ID-space and there could be less data duplication, but the DATA view would require some fairly heavy joins. The additional complexity is also a significant negative.
Conclusion
Thank you for your time if you made it all the way to the end, this question ended up quite long as I was thinking it through as I wrote it. Any suggestions or comments would be appreciated.
I'm facing a database that keeps the ORDERING in columns of the table.
It's like:
| Id | Name  | Description | Category | OrderByName | OrderByDescription | OrderByCategory |
|----|-------|-------------|----------|-------------|--------------------|-----------------|
| 1  | Aaaa  | bbbb        | cccc     | 1           | 2                  | 3               |
| 2  | BBbbb | Aaaaa       | bbbb     | 2           | 1                  | 2               |
| 3  | cccc  | cccc        | aaaaa    | 3           | 3                  | 1               |
So, when the user wants to order by name, the SQL goes with an ORDER BY OrderByName.
I think this doesn't make any sense, since that's what indexes are for, and I tried to find an explanation for it but haven't found one. Is this faster than using indexes? Is there any scenario where this is really useful?
It can make sense for several reasons, but mainly when you don't want to follow the "natural order" you would get from an ORDER BY on the column itself.
Here is a scenario where this can be useful:
SQL Fiddle
MS SQL Server 2008 Schema Setup:
CREATE TABLE Table1
([Id] int, [Name] varchar(15), [OrderByName] int)
;
INSERT INTO Table1
([Id], [Name], [OrderByName])
VALUES
(1, 'Del Torro', 2 ),
(2, 'Delson', 1),
(3, 'Delugi', 3)
;
Query 1:
SELECT *
FROM Table1
ORDER BY Name
Results:
| ID | NAME | ORDERBYNAME |
|----|-----------|-------------|
| 1 | Del Torro | 2 |
| 2 | Delson | 1 |
| 3 | Delugi | 3 |
Query 2:
SELECT *
FROM Table1
ORDER BY OrderByName
Results:
| ID | NAME | ORDERBYNAME |
|----|-----------|-------------|
| 2 | Delson | 1 |
| 1 | Del Torro | 2 |
| 3 | Delugi | 3 |
I think it makes little sense for two reasons:
Who is going to maintain this set of values in the table? You need to update them every time any row is added, updated, or deleted. You can do this with triggers, or horribly buggy and unreliable constraints using user-defined functions. But why? The information that seems to be in those columns is already there. It's redundant because you can get that order by ordering by the actual column.
You still have to use massive conditionals or dynamic SQL to tell the application how to order the results, since you can't say ORDER BY #column_name.
Now, I'm basing my assumptions on the fact that the ordering columns still reflect the alphabetical order in the relevant columns. It could be useful if there is some customization possible, e.g. if you wanted all Smiths listed first, and then all Morts, and then everyone else. But I don't see any evidence of this in the question or the data.
This could be useful if the ordering was customizable - that is, if users did not want to see the list in alphabetical order, but rather in some custom order.
An index on the int columns would be smaller than an index on the column that holds the actual text, but I don't see that there is any real benefit to this in most cases.
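To illustrate the second point above: without those columns, per-request ordering typically ends up as dynamic SQL or a conditional ORDER BY along these lines (the table name and the @SortColumn parameter are assumptions):
SELECT Id, Name, Description, Category
FROM myTable
ORDER BY CASE WHEN @SortColumn = 'Name'        THEN Name        END,
         CASE WHEN @SortColumn = 'Description' THEN Description END,
         CASE WHEN @SortColumn = 'Category'    THEN Category    END;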
Is there a way to have a custom order by query in sqlite?
For example, I have essentially an enum
| _id | Name  | Key          |
|-----|-------|--------------|
| 1   | One   | Named        |
| 2   | Two   | Contributing |
| 3   | Three | Named        |
| 4   | Four  | Key          |
| 5   | Five  | Key          |
| 6   | Six   | Contributing |
| 7   | Seven | Named        |
And the 'Key' column values have an ordering. Say Key > Named > Contributing.
Is there a way to make
SELECT * FROM table ORDER BY Key
return something to the effect of
| _id | Name  | Key          |
|-----|-------|--------------|
| 4   | Four  | Key          |
| 5   | Five  | Key          |
| 1   | One   | Named        |
| 3   | Three | Named        |
| 7   | Seven | Named        |
| 2   | Two   | Contributing |
| 6   | Six   | Contributing |
Something like this?
SELECT _id, Name, Key
FROM my_table t
ORDER BY CASE WHEN Key = 'Key'          THEN 0
              WHEN Key = 'Named'        THEN 1
              WHEN Key = 'Contributing' THEN 2
         END, _id;
If you have a lot of CASE branches (or a complicated set of conditions), Adam's solution may result in an extremely large query.
SQLite does allow you to write your own functions (in C or C++). You could write a function that returns values the same way Adam's CASE does, but because you're writing native code, you could work with a much larger set of conditions (or a separate table, etc.).
Once the function is written, you can refer to it in your SELECT as if it were a built-in function:
SELECT * FROM my_table ORDER BY MyOrder(Key)
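If native code is more than you need, the "separate table" idea also works in plain SQL: keep the ranking in a small lookup table and join to it. A minimal sketch (the key_order table and its column names are made up for this example):
CREATE TABLE key_order (KeyName TEXT PRIMARY KEY, SortRank INTEGER);
INSERT INTO key_order (KeyName, SortRank)
VALUES ('Key', 0), ('Named', 1), ('Contributing', 2);

SELECT t._id, t.Name, t.Key
FROM my_table t
JOIN key_order o ON o.KeyName = t.Key
ORDER BY o.SortRank, t._id;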
Did you try this (not tested on my side, but relying on a technique I have used before)?
ORDER BY KEY = "Key" DESC,
KEY = "Named" DESC,
KEY = "Contributing" DESC