We have a reference data DB that is like an ODS/MDM, but it's read-only. The data is updated from the authoritative systems on various schedules. Every table maintains historic data: an update closes out the existing row and inserts a new one, and a delete just closes out the existing row.
All tables are of the following form:
table <name>
surrogate key,
business key(s),
attribute(s),
effective_start_date,
effective_end_date
I want to expose 2 sets of views to users/systems for querying.
First view set is views that return only the current records from the respective table. That's easy.
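For example, assuming the current row is the one whose effective_end_date is a high sentinel date such as 9999-12-31 (a sketch only; the tables might equally mark it with NULL):
-- current rows of A; the sentinel end date is an assumption
CREATE VIEW a_current AS
SELECT *
FROM a
WHERE effective_end_date = DATE '9999-12-31';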
Second view set should provide a way to query (by joining) multiple tables (all with history) and get the effective history of the result set.
For example, if a user issues something like the following query:
select
A.business_key,
B.business_key,
effective_start_date(),
effective_end_date()
from
A inner join B on (A.b_fk_col = B.business_key)
then I need to transform this statement into:
select
A.business_key,
B.business_key,
max( A.effective_start_date, B.effective_start_date ) effective_start_date,
min( A.effective_end_date, B.effective_end_date ) effective_end_date
from
A inner join B on (A.b_fk_col = B.business_key)
where
(A.effective_start_date between B.effective_start_date and B.effective_end_date
or
A.effective_end_date between B.effective_start_date and B.effective_end_date)
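For reference, a view wrapping one such A/B pair could look roughly like the sketch below: it assumes Oracle (so GREATEST/LEAST stand in for the two-argument max/min above) and uses the full range-overlap test, which also has to cover the case where one range entirely contains the other.
-- sketch only: effective history of A joined to B
CREATE VIEW a_b_history AS
SELECT
  a.business_key AS a_business_key,
  b.business_key AS b_business_key,
  GREATEST(a.effective_start_date, b.effective_start_date) AS effective_start_date,
  LEAST(a.effective_end_date, b.effective_end_date)        AS effective_end_date
FROM a
JOIN b ON a.b_fk_col = b.business_key
WHERE a.effective_start_date <= b.effective_end_date
  AND b.effective_start_date <= a.effective_end_date;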
Really, what I need to be able to do is to add a step to the query plan right after the join(s):
e.g.
Instead of the original:
SELECT STATEMENT
MERGE JOIN CARTESIAN
BUFFER SORT
TABLE ACCESS BY INDEX ID A
INDEX FULL SCAN A_B_FK_IDX
BUFFER SORT
INDEX FULL SCAN B_PK_IDX
I could get something like:
SELECT STATEMENT
**** ADDED ****
EFFECTIVE RANGES // create/modify where & select clauses
TABLE ACCESS BY INDEX ID A // get the eff dates from A
TABLE ACCESS BY INDEX ID B // get the eff dates from B
****************
MERGE JOIN CARTESIAN
BUFFER SORT
TABLE ACCESS BY INDEX ID A
INDEX FULL SCAN A_B_FK_IDX
BUFFER SORT
INDEX FULL SCAN B_PK_IDX
Any thoughts on how I could do this? Thanks.
Related
I have a table towns which is the main table. This table contains a lot of rows and became so 'dirty' (someone inserted 5 million rows) that I would like to get rid of the unused towns.
There are 3 referencing tables that use town_id as a reference to towns.
I know there are many towns that are not used in these tables, and only if a town_id is not found in any of these 3 tables do I consider it inactive and want to remove that town (because it's not used).
As you can see, towns is used in these 2 different tables:
employees
offices
and for the vendors table there is a vendor_id column in towns, since one vendor can have multiple towns.
So if vendor_id in towns is null and the town_id is not found in either of these 2 tables, it is safe to remove it :)
I created a query which might work, but it is taking too much time to execute. It looks something like this:
select count(*)
from towns
where vendor_id is null
and id not in (select town_id from banks)
and id not in (select town_id from employees)
So basically I said: if vendor_id is null, it means this town is definitely not related to vendors, and if at the same time the same town is not in banks or employees, then it is safe to remove it. But the query took too long and never executed successfully, since towns has 5 million rows, which is the reason it is so dirty.
In fact I'm not able to execute the given query, since the server terminated abnormally.
Here is full error message:
ERROR: server closed the connection unexpectedly This probably means
the server terminated abnormally before or while processing the
request.
Any kind of help would be awesome
Thanks!
You can join the tables using LEFT JOIN and identify, in the WHERE clause, the town_id values for which there is no row in banks or employees:
WITH list AS
( SELECT t.town_id
FROM towns AS t
LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
WHERE t.vendor_id IS NULL
AND b.town_id IS NULL
AND e.town_id IS NULL
LIMIT 1000
)
DELETE FROM tbl.towns AS t
USING list AS l
WHERE t.town_id = l.town_id ;
Before launching the DELETE, you can check the indexes on your tables.
Adding an index as follows can be useful:
CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;
Last but not least, you can add a LIMIT clause in the CTE to limit the number of rows you delete on each execution of the DELETE and avoid the unexpected termination. As a consequence, you will have to rerun the DELETE several times until there are no more rows to delete.
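If you want to automate those repeated runs, a rough PL/pgSQL sketch along the same lines (schema and table names taken from the statement above) could look like the following. Note that a DO block runs in a single transaction, so if the point of batching is to keep transactions short, keep rerunning the batched DELETE from the client instead:
DO $$
DECLARE
  n_deleted integer;
BEGIN
  LOOP
    DELETE FROM tbl.towns AS t
    WHERE t.town_id IN (
      SELECT t2.town_id
      FROM tbl.towns AS t2
      LEFT JOIN tbl.banks AS b ON b.town_id = t2.town_id
      LEFT JOIN tbl.employees AS e ON e.town_id = t2.town_id
      WHERE t2.vendor_id IS NULL
        AND b.town_id IS NULL
        AND e.town_id IS NULL
      LIMIT 1000          -- batch size
    );
    GET DIAGNOSTICS n_deleted = ROW_COUNT;  -- rows removed by this batch
    EXIT WHEN n_deleted = 0;                -- stop when nothing is left
  END LOOP;
END $$;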
You can try a JOIN on the big tables; it should be faster than the two IN subqueries.
You could also try UNION ALL and live with the duplicates, as it is faster than UNION.
Finally, you can use a combined index on id and vendor_id to speed up the query.
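For example (the index name is illustrative):
CREATE INDEX idx_towns_id_vendor ON towns (id, vendor_id);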
CREATE TABLE towns (id int, vendor_id int);
CREATE TABLE banks (town_id int);
CREATE TABLE employees (town_id int);

select count(*)
from towns t1
left join (select town_id from banks
           union
           select town_id from employees) t2 on t1.id = t2.town_id
where t1.vendor_id is null
  and t2.town_id is null;

 count
-------
     0
fiddle
The trick is to first make a list of all the town_id's you want to keep and then start removing those that are not there.
By looking in 2 tables you're making life harder for the server so let's just create 1 single list first.
-- build empty temp-table
CREATE TEMPORARY TABLE TEMP_must_keep
AS
SELECT town_id
FROM tbl.towns
WHERE 1 = 2;
-- get id's from first table
INSERT INTO TEMP_must_keep (town_id)
SELECT DISTINCT town_id
FROM tbl.banks;
-- add index to speed up the EXCEPT below
CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);
-- add new ones from second table
INSERT INTO TEMP_must_keep (town_id)
SELECT town_id
FROM tbl.employees
EXCEPT -- auto-distincts
SELECT town_id
FROM TEMP_must_keep;
-- rebuild index simply to ensure little fragmentation
REINDEX TABLE TEMP_must_keep;
-- optional, but might help: create a temporary index on the towns table to speed up the delete
CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor_id IS NULL;
-- Now do actual delete
-- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
DELETE
FROM tbl.towns as del
WHERE vendor_id is null
AND NOT EXISTS ( SELECT *
FROM TEMP_must_keep mk
WHERE mk.town_id = del.town_id);
-- cleanup
DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
DROP TABLE TEMP_must_keep;
The idx_towns_town_id_where_vendor_null index is optional and I'm not sure it will actually lower the total time, but IMHO it will help with the DELETE operation, if only because the index should give the Query Optimizer a better view of what volumes to expect.
The issue
I'm trying to create a view to get the latest rows from a partitioned table, filtered on the date partition _LOCALDATETIME and zero or more cluster fields. I can create a view which uses a partition and I can create a view which handles some filters, but I can't work out the syntax to achieve both.
An example query requirement
SELECT fieldA, fieldB, fieldC FROM theView
WHERE date between '2021-01-01' and '2021-12-31' AND
_CLUSTERFIELD1 = 'foo'
GROUP BY _CLUSTERFIELD2
ORDER BY _CLUSTERFIELD3
Table schema
_LOCALDATETIME
_id
_CLUSTERFIELD1
_CLUSTERFIELD2
_CLUSTERFIELD3
_CLUSTERFIELD4
...other fields
Based on what I understand from your case, I have come up with this approach.
I created a partitioned table based on _LOCALDATETIME with clustered fields (a DDL sketch is below), and then a view that returns the data from a defined date range and, per partition, the row with the latest _id. That gives a view holding the latest items of a partitioned table within a fixed date range.
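For reference, a DDL sketch of such a table; the DATE type for _LOCALDATETIME and the STRING types for the cluster fields are assumptions on my side:
CREATE TABLE `<my-project-id>.<dataset>.<table>` (
  _LOCALDATETIME DATE,
  _id INT64,
  _CLUSTERFIELD1 STRING,
  _CLUSTERFIELD2 STRING,
  _CLUSTERFIELD3 STRING,
  _CLUSTERFIELD4 STRING
)
PARTITION BY _LOCALDATETIME
CLUSTER BY _CLUSTERFIELD1, _CLUSTERFIELD2, _CLUSTERFIELD3, _CLUSTERFIELD4;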
view
CREATE VIEW `<my-project-id>.<dataset>.<view>` AS
WITH range_id AS (
  SELECT MAX(_id) AS last_id_partition, _localdatetime AS partition_
  FROM `<my-project-id>.<dataset>.<table>`
  WHERE _localdatetime BETWEEN "2020-01-01" AND "2022-01-01"
  GROUP BY _localdatetime
)
SELECT s.*
FROM `<my-project-id>.<dataset>.<table>` s
INNER JOIN range_id r
  ON s._id = r.last_id_partition AND s._localdatetime = r.partition_
WHERE s._localdatetime BETWEEN "2020-01-01" AND "2022-01-01"
GROUP BY _id, _localdatetime, _name, _location
The view will return the latest ids of a partitioned, clustered table, with the clustered fields, for the date range baked into the view (years 2020 and 2021 here).
query
select * from `<my-project-id>.<dataset>.<view>`
WHERE _localdatetime between '2021-12-21' and '2021-12-22'
and <clusteredfield> = 'Venezuela'
It will return the records available for that filter, as the data is already scoped by the view.
What you can't do is have a view without the partition field, since the partition field must be present to query a partitioned table (when a partition filter is required). You can also wrap the queries inside a function to further customize your outputs, as sketched below.
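To sketch that last point, the date range could be passed in as parameters instead of being hard-coded in the view, using BigQuery's table function DDL (the function name and parameter names below are hypothetical):
CREATE OR REPLACE TABLE FUNCTION `<my-project-id>.<dataset>.latest_rows`(start_date DATE, end_date DATE)
AS (
  SELECT s.*
  FROM `<my-project-id>.<dataset>.<table>` s
  INNER JOIN (
    SELECT MAX(_id) AS last_id_partition, _localdatetime AS partition_
    FROM `<my-project-id>.<dataset>.<table>`
    WHERE _localdatetime BETWEEN start_date AND end_date
    GROUP BY _localdatetime
  ) r
    ON s._id = r.last_id_partition AND s._localdatetime = r.partition_
  WHERE s._localdatetime BETWEEN start_date AND end_date
);
-- usage sketch:
-- SELECT * FROM `<my-project-id>.<dataset>.latest_rows`(DATE '2021-12-21', DATE '2021-12-22')
-- WHERE <clusteredfield> = 'Venezuela'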
Amateur SQL writer here, having a problem with building out a table based on values from an existing one.
The MASTER table logs a record with an ID every time a service is used. The ID remains the same per user, but repeats to track relevant information during that usage. The table holds about 2m records and 20k DISTINCT IDs.
Example -
USER_ID   | Used_Amount
USER_1998 | 9GB
USER_1999 | 4GB
USER_1999 | 1GB
USER_1999 | 0.5GB
I would like the new table to have a column that SUMs the usage, grouped by DISTINCT ID.
Goal -
ID        | TOTAL USAGE
USER_1998 | 9GB
USER_1999 | 5.5GB
Code below is my attempt...
UPDATE ml_draft
SET true_usage = (
SELECT SUM(true_usage)
FROM table2 t2
INNER JOIN ml_draft ON
ml_draft.subscription_id = t2.subscription_id);
Let me know if there are any additional details to add. Errors vary
You want a correlated subquery. So, there is no need to use JOIN in the subquery:
UPDATE ml_draft d
SET true_usage = (SELECT SUM(t2.true_usage)
FROM table2 t2
WHERE d.subscription_id = t2.subscription_id
);
For performance, you want an index on table2(subscription_id, true_usage).
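For example (the index name is illustrative):
CREATE INDEX idx_table2_subscription_usage ON table2 (subscription_id, true_usage);
Note that if a subscription_id has no rows in table2, the subquery returns NULL for that row; wrap it in COALESCE(..., 0) if you want 0 instead.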
I have 2 tables. Table 1 has data from the bank account. Table 2 aggregates data from multiple other tables; to keep things simple, we will just have 2 tables. I need to append the data from table 1 into table 2.
I have a field in table2, "SrceFk". The concept is that when a record from Table1 appends, it will fill the table2.SrceFk with the table1 primary key and the table name. So record 302 will look like "BANK/302" after it appends. This way, when I run the append query, I can avoid duplicates.
The query is not working. I deleted the record from table2, but when I run the query, it just says "0 records appended", even though the foreign key is not present.
I am new to SQL, Access, and programming in general. I understand basic concepts. I have googled this issue and looked on stackOverflow, but no luck.
This is my full statement:
INSERT INTO Main ( SrceFK, InvoDate, Descrip, AMT, Ac1, Ac2 )
SELECT Bank.ID &"/"& "BANK", Bank.TransDate, Bank.Descrip, Bank.TtlAmt, Bank.Ac1, Bank.Ac2
FROM Bank
WHERE NOT EXISTS
(
SELECT * FROM Main
WHERE Main.SrceFK = Bank.ID &"/"& "BANK"
);
I expect the query to add records that aren't present in the table, as needed.
I need to do specific ordering with use of order by field.
select * from table order by field(id,3,4,1,2.......upto 10000 ids)
As the required ordering cannot be computed within SQL itself, how much will ordering by such a long FIELD() list affect performance, and is it feasible to do?
Updates from the comments:
Ordering depends on user and category IDs and can be anything the user wants.
The ordering specification changes (about) daily.
So, we need a custom ordering that depends on the user and category and this ordering needs to change daily.
The easiest way would be to put your ordering in a separate table (called ordering_table in this example):
id | position
----+----------
1 | 11
2 | 42
3 | 23
etc.
The above would mean "put an id of 1 at position 11, 2 at position 42, 3 at position 23, ...". Then you can join that ordering table in:
SELECT t.id, t.col1, t.col2
FROM some_table t
JOIN ordering_table o ON (t.id = o.id)
ORDER BY o.position
Where ordering_table is the table (as above) that defines your strange ordering. This approach simply represents your ordering function as a table (any function with a finite domain is, essentially, just a table after all).
This "ordering table" approach should work fine as long as the ordering table is complete.
If you only need this strange ordering in one place then you could merge the position column into your main table and add NOT NULL and UNIQUE constraints on that column to make sure you cover everything and have a consistent ordering.
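A minimal sketch of that variant, using PostgreSQL-style ALTER TABLE (populate position for every existing row before adding the constraints; MySQL would use MODIFY instead of SET NOT NULL):
ALTER TABLE some_table ADD COLUMN position INT;
-- ... fill position for every row here ...
ALTER TABLE some_table ALTER COLUMN position SET NOT NULL;
ALTER TABLE some_table ADD CONSTRAINT some_table_position_uq UNIQUE (position);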
Further commenting indicates that you want different orderings for different users and categories and that the ordering will change on a daily basis. You could make separate tables for each condition (which would lead to a combinatorial explosion) or, as Mikael Eriksson and ypercube suggest, add a couple more columns to the ordering table to hold the user and category:
CREATE TABLE ordering_table (
thing_id INT NOT NULL,
position INT NOT NULL,
user_id INT NOT NULL,
category_id INT NOT NULL
);
The thing_id, user_id, and category_id would be foreign keys to their respective tables, and you'd probably want to index all the columns in ordering_table, but spending a couple of minutes looking at the query plans to see whether the indexes actually get used would be worthwhile. You could also make all four columns the primary key to avoid duplicates. Then, the lookup query would be something like this:
SELECT t.id, t.col1, t.col2
FROM some_table t
LEFT JOIN ordering_table o
ON (t.id = o.thing_id AND o.user_id = $user AND o.category_id = $cat)
ORDER BY COALESCE(o.position, 99999)
Where $user and $cat are the user and category IDs (respectively). Note the change to a LEFT JOIN and the addition of COALESCE to allow for missing rows in ordering_table, these changes will push anything that doesn't have a specified position in the order to the bottom of the list rather than removing them from the results completely.
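With that schema, the daily re-ordering for a given user and category can be done by replacing just that slice of ordering_table, for example (a sketch; the positions are made up):
BEGIN;
DELETE FROM ordering_table
WHERE user_id = $user AND category_id = $cat;
INSERT INTO ordering_table (thing_id, position, user_id, category_id)
VALUES (3, 1, $user, $cat),
       (4, 2, $user, $cat),
       (1, 3, $user, $cat);
COMMIT;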