I have been delegated a task on a dataset that has been pre-extracted from another data source, and at the moment I only have Access available to query it (plus Excel for basic analysis, since the data is currently under the row limit). Essentially, I have three relevant fields:
FK_ID = arbitrary number associated with a transaction
CD = code associated with status of transaction (assume only BEGIN and END are the values)
TIMESTAMP = timestamp of transaction
Now a simplified example of this data set:
FK_ID CD TIMESTAMP
000012 END 2012-01-02-14.27.59.133612
000012 BEGIN 2012-01-02-14.27.57.176631
000015 END 2011-12-12-14.27.59.133612
000015 BEGIN 2011-12-11-14.27.59.133612
000019 END 2011-11-10-14.27.59.133612
000019 BEGIN 2011-11-09-14.27.59.133612
000019 END 2011-11-08-14.27.59.133612
000019 BEGIN 2011-11-07-14.27.59.133612
As you can see, it's not very complicated. The problem is that I need to calculate the timestamp difference between the BEGIN and END codes for each unique FK_ID and then create a column to tally that difference, while also accounting for the fact that some FK_IDs have multiple BEGIN/END timestamp pairs associated with them.
Now, I have been authorized to ignore cases where more than one pair exists (by ignore, I mean only count the initial pair), but that is not preferable.
I need these differences so I can compute an overall average time and determine whether that time is roughly within our goals.
What's the best query for getting this timestamp difference for each FK_ID pair, or what other automated means would you suggest?
I do understand SQL and am proficient enough in C#, but the time frame and other factors are wreaking havoc on my ability to break down this problem logically.
Assuming the table name is TABLE1, in Access I would do something like:
SELECT Table1.FK_ID, DateDiff("s",[TABLE1].[TIMESTAMP],[END_QUERY].[TIMESTAMP]) AS DifferenceInSeconds
FROM Table1
INNER JOIN
(SELECT Table1.FK_ID, Table1.CD, Table1.TIMESTAMP
FROM Table1
WHERE (((Table1.CD)="END"))
ORDER BY Table1.FK_ID, Table1.CD) AS END_QUERY
ON Table1.FK_ID = END_QUERY.FK_ID
WHERE (((Table1.CD)="BEGIN"))
ORDER BY Table1.FK_ID, Table1.CD;
Basically, get the BEGIN and END rows in two queries and take the difference between them (in seconds -- you didn't mention which unit you wanted). The one issue you'll encounter is when a transaction has multiple entries. You could do a GROUP BY to get the very first BEGIN and the very last END, but there may be some discrepancies.
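If you only need the first BEGIN and last END per FK_ID, a minimal sketch of that GROUP BY idea might look like this (assuming TIMESTAMP is stored as an actual Date/Time value rather than text):
SELECT b.FK_ID,
       DateDiff("s", Min(b.[TIMESTAMP]), Max(e.[TIMESTAMP])) AS DifferenceInSeconds
FROM Table1 AS b INNER JOIN Table1 AS e ON b.FK_ID = e.FK_ID
WHERE b.CD = "BEGIN" AND e.CD = "END"
GROUP BY b.FK_ID;
Averaging DifferenceInSeconds over that result then gives the overall figure you're after.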
I hope this helps you a little.
Hi, I have two tables with a million rows in each. I am on Oracle 11g R1.
I am sure many of us must have gone through this situation.
What is the most efficient and fastest way to update rows in one table from another where the values are DIFFERENT?
E.g. Table 1 has 4 NUMBER columns with high precision, e.g. 0.2212454215454212.
Table 2 has 6 columns.
I need to update Table 2's four columns based on the common column in both tables, and only where the values differ.
I have something like this
DECLARE
   TYPE test1_t IS TABLE OF test.score%TYPE INDEX BY PLS_INTEGER;
   TYPE test2_t IS TABLE OF test.id%TYPE INDEX BY PLS_INTEGER;
   TYPE test3_t IS TABLE OF test.urank%TYPE INDEX BY PLS_INTEGER;

   vscore test1_t;
   vid    test2_t;
   vurank test3_t;
BEGIN
   SELECT id, score, urank
     BULK COLLECT INTO vid, vscore, vurank
     FROM test;

   FORALL i IN 1 .. vid.COUNT
      MERGE INTO final T
      USING (SELECT vid (i)    AS o_id,
                    vurank (i) AS o_urank,
                    vscore (i) AS o_score
               FROM DUAL) S
         ON (S.o_id = T.id)
       WHEN MATCHED THEN
          UPDATE SET T.crank = S.o_urank
                 WHERE T.crank <> S.o_urank;
END;
Since the numbers have high precision, is that slowing it down?
I tried the Bulk Collect and Merge combination, but it still takes ~30 minutes in the worst-case scenario where I have to update 1 million rows.
Is there something I could do with ROWID?
Help will be appreciated.
If you want to update all the rows, then just use update:
update table_1
set (col1,
col2) = (
select col1,
col2
from table2
where table2.col_a = table1.col_a and
table2.col_b = table1.col_b)
Bulk collect or any PL/SQL technique will always be slower than a pure SQL technique.
The numeric precision is probably not significant, and rowid is not relevant as there is no common value between the two tables.
When dealing with millions of rows, parallel DML is a game changer. Of course you need to have Enterprise Edition to use parallel, but it's really the only thing which will make much difference.
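As a rough sketch of what that could look like here, reusing the table and column names from the question (the degree of parallelism is just an example):
ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ PARALLEL(t 8) */ INTO final t
USING (SELECT id, urank FROM test) s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.crank = s.urank
  WHERE t.crank <> s.urank;

COMMIT;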
I recommend you read an article on OraFAQ by rleishman comparing 8 Bulk Update Methods. His key finding is that "the cost of disk reads so far outweighs the context switches that that they are barely noticable (sic)". In other words, unless your data is already cached in memory there really isn't a significant difference between SQL and PL/SQL approaches.
The article does have some neat suggestions on employing parallel. The surprising outcome is that a parallel pipelined function offers the best performance.
Setting aside the overall logic (a plain UPDATE plus a plain INSERT might solve the problem, and MERGE cost, indexes and possible full scans on the MERGE all play a part) and focusing only on the syntax that has been used:
You should use the LIMIT clause with the BULK COLLECT syntax.
Using a bulk collect with no limit will cause all records to be loaded into memory, and with no partially committed merges you will create a large redo log that must be applied at the end of the process.
Both will result in low performance.
DECLARE
   v_fetchSize NUMBER := 1000; -- based on hardware and design; could be scaled

   CURSOR a_cur IS
      SELECT id, score, urank FROM test;

   TYPE myarray IS TABLE OF a_cur%ROWTYPE;
   cur_array myarray;
BEGIN
   OPEN a_cur;
   LOOP
      FETCH a_cur BULK COLLECT INTO cur_array LIMIT v_fetchSize;

      FORALL i IN 1 .. cur_array.COUNT
         -- do the operation, e.g. the MERGE from the question
         -- (on older Oracle versions referencing record fields in FORALL is
         --  restricted; fetch into separate scalar collections in that case)
         MERGE INTO final T
         USING (SELECT cur_array(i).id AS o_id, cur_array(i).urank AS o_urank FROM DUAL) S
            ON (S.o_id = T.id)
          WHEN MATCHED THEN
             UPDATE SET T.crank = S.o_urank
                    WHERE T.crank <> S.o_urank;

      COMMIT;
      EXIT WHEN a_cur%NOTFOUND;
   END LOOP;
   CLOSE a_cur;
END;
Just to be sure: test.id and final.id must be indexed.
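If they are not already indexed (final.id is often covered by a primary key), something like this would do; the index names are just examples:
CREATE INDEX test_id_ix ON test (id);
CREATE INDEX final_id_ix ON final (id);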
With the first select ... from test you fetch too many records from Table 1, and after that you need to compare all of them with the records in Table 2. Try to select only what you need to update. So, there are at least 2 variants:
a) select only changed records:
SELECT source_table.id, source_table.score, source_table.urank
BULK COLLECT INTO vid,vscore,vurank
FROM
test source_table,
final destination_table
where
source_table.id = destination_table.id
and
source_table.crank <> destination_table.crank
;
b) Add a new datetime field to the source table and fill it in a trigger with the current time. When synchronizing, pick only the records changed during the last day. This field needs to be indexed.
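A sketch of that column and trigger, with made-up names (last_modified, test_touch_trg):
ALTER TABLE test ADD (last_modified DATE);
CREATE INDEX test_last_modified_ix ON test (last_modified);

CREATE OR REPLACE TRIGGER test_touch_trg
  BEFORE INSERT OR UPDATE ON test
  FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;
END;
/
The synchronization then only picks rows changed during the last day, e.g. WHERE last_modified >= SYSDATE - 1.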
After such a change, in the update phase you don't need to compare the other fields, only match the IDs:
FORALL i IN 1 .. vid.COUNT
MERGE INTO FINAL T
USING (
SELECT vid (i) AS o_id,
vurank (i) AS o_urank,
vscore (i) AS o_score FROM DUAL
) S
ON (S.o_id = T.id)
WHEN MATCHED
THEN UPDATE SET T.crank = S.o_urank;
If you worry about the size of the undo/redo segments, then variant b) is more useful, because you can take the records from source Table 1 divided into time slices and commit the changes after updating every slice, e.g. from 00:00 to 01:00, from 01:00 to 02:00, and so on.
In this variant the update can be done with a plain SQL statement, without selecting the data into collections, while keeping the redo/undo logs at an acceptable size.
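A sketch of one such slice as a plain SQL statement (the slice bounds and the last_modified column are only illustrative):
MERGE INTO final t
USING (SELECT id, urank
         FROM test
        WHERE last_modified >= TO_DATE('2012-01-01 00:00', 'YYYY-MM-DD HH24:MI')
          AND last_modified <  TO_DATE('2012-01-01 01:00', 'YYYY-MM-DD HH24:MI')) s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.crank = s.urank;

COMMIT;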
I have spent a good portion of today and yesterday attempting to decide whether to utilize a loop or cursor in SQL or to figure out how to use set based logic to solve the problem. I am not new to set logic, but this problem seems to be particularly complex.
The Problem
The idea is that if I have a list of all transactions (10's, 100's of millions) and the date they occurred, I can start combining some of that data into a daily totals table so that it is more rapidly viewable by reporting and analytic systems. The pseudocode for this is as such:
foreach( row in transactions_table )
if( row in totals_table already exists )
update totals_table, add my totals to the totals row
else
insert into totals_table with my row as the base values
delete ( or archive ) row
As you can tell, the body of the loop is relatively trivial to implement, as is the cursor/looping iteration. However, the execution time is quite slow and unwieldy, and my question is: is there a non-iterative way to perform this task, or is this one of the rare exceptions where I just have to "suck it up" and use a cursor?
There have been a few discussions on the topic, some of which seem to be similar, but not usable due to the if/else statement and the operations on another table, for instance:
How to merge rows of SQL data on column-based logic? This question doesn't seem to be applicable because it simply returns a view of all sums, and doesn't actually make logical decisions about additions or updates to another table
SQL Looping seems to have a few ideas about selection with a couple of CASE statements, which seems possible, but there are two operations that I need done depending on the status of another table, so this solution does not seem to fit.
SQL Call Stored Procedure for each Row without using a cursor This solution seems to be the closest to what I need to do, in that it can handle arbitrary numbers of operations on each row, but there doesn't seem to be a consensus among that group.
Any advice how to tackle this frustrating problem?
Notes
I am using SQL Server 2008
The schema setup is as follows:
Totals: (id int pk, totals_date date, store_id int fk, machine_id int fk, total_in, total_out)
Transactions: (transaction_id int pk, transaction_date datetime, store_id int fk, machine_id int fk, transaction_type (IN or OUT), transaction_amount decimal)
The totals should be computed by store, by machine, and by date, and should total all of the IN transactions into total_in and the OUT transactions into total_out. The goal is to get a pseudo data cube going.
You would do this in two set-based statements:
BEGIN TRANSACTION;
DECLARE @keys TABLE(some_key INT);
UPDATE tot
SET totals += tx.amount
OUTPUT inserted.some_key -- key values updated
INTO @keys
FROM dbo.totals_table AS tot WITH (UPDLOCK, HOLDLOCK)
INNER JOIN
(
SELECT t.some_key, amount = SUM(amount)
FROM dbo.transactions_table AS t WITH (HOLDLOCK)
INNER JOIN dbo.totals_table AS tot
ON t.some_key = tot.some_key
GROUP BY t.some_key
) AS tx
ON tot.some_key = tx.some_key;
INSERT dbo.totals_table(some_key, amount)
OUTPUT inserted.some_key INTO @keys
SELECT some_key, SUM(amount)
FROM dbo.transactions_table AS tx
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.totals_table
WHERE some_key = tx.some_key
)
GROUP BY some_key;
DELETE dbo.transactions_table
WHERE some_key IN (SELECT some_key FROM @keys);
COMMIT TRANSACTION;
(Error handling, applicable isolation level, rollback conditions etc. omitted for brevity.)
You do the update first so you don't insert new rows and then update them, performing the work twice and possibly double counting. You could OUTPUT in both cases into a temp table or table variable, perhaps, to then archive/delete those rows from the transactions table.
I'd caution you to not get too excited about MERGE until they've resolved some of these bugs and you have read enough about it to be sure you're not lulled into any false confidence about how much "better" it is for concurrency and atomicity without additional hints. The race conditions you can work around; the bugs you can't.
Another alternative, from Nikola's comment
CREATE VIEW dbo.TotalsView
WITH SCHEMABINDING
AS
SELECT some_key_column(s), SUM(amount) AS total_amount, COUNT_BIG(*) AS row_count
FROM dbo.Transaction_Table
GROUP BY some_key_column(s);
GO
CREATE UNIQUE CLUSTERED INDEX some_key ON dbo.TotalsView(some_key_column(s));
GO
Now if you want to write queries that grab the totals, you can reference the view directly or - depending on query and edition - the view may automatically be matched even if you reference the base table.
Note: if you are not on Enterprise Edition, you may have to use the NOEXPAND hint to take advantage of the pre-aggregated values materialized by the view.
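For example, a report query against the view might look like this; the NOEXPAND hint is the part that matters outside Enterprise Edition:
SELECT some_key_column(s), total_amount
FROM dbo.TotalsView WITH (NOEXPAND);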
I do not think you need the loop.
You can just:
Update all the totals rows that match your filters/groups, then archive/delete the transaction rows you just rolled up.
Insert totals rows for the filters/groups that do not exist yet, then archive/delete those transaction rows as well.
SQL is meant to work on sets of data, not on rows one by one.
I have a Postgres DB running 7.4 (Yeah we're in the midst of upgrading)
I have four separate queries to get the Daily, Monthly, Yearly and Lifetime record counts
SELECT COUNT(field)
FROM database
WHERE date_field
BETWEEN DATE_TRUNC('DAY', LOCALTIMESTAMP)
AND DATE_TRUNC('DAY', LOCALTIMESTAMP) + INTERVAL '1 DAY'
For Month just replace the word DAY with MONTH in the query and so on for each time duration.
Looking for ideas on how to get all the desired results with one query and any optimizations one would recommend.
Thanks in advance!
NOTE: date_field is timestamp without time zone
UPDATE:
Sorry, I do filter out records with additional query constraints; I just wanted to give the gist of the date_field comparisons. Apologies for any confusion.
One idea is to use prepared statements and a simple statistics table (record_count_t) for this:
-- DROP TABLE IF EXISTS record_count_t;
-- DEALLOCATE record_count;
-- DROP FUNCTION updateRecordCounts();
CREATE TABLE record_count_t (type char, count bigint);
INSERT INTO record_count_t (type) VALUES ('d'), ('m'), ('y'), ('l');
PREPARE record_count (text) AS
UPDATE record_count_t SET count =
(SELECT COUNT(field)
FROM database
WHERE
CASE WHEN $1 <> 'l' THEN
DATE_TRUNC($1, date_field) = DATE_TRUNC($1, LOCALTIMESTAMP)
ELSE TRUE END)
WHERE type = $1;
CREATE FUNCTION updateRecordCounts() RETURNS void AS
$$
EXECUTE record_count('d');
EXECUTE record_count('m');
EXECUTE record_count('y');
EXECUTE record_count('l');
$$
LANGUAGE SQL;
SELECT updateRecordCounts();
SELECT type,count FROM record_count_t;
Call the updateRecordCounts() function any time you need to refresh the statistics.
I'd guess that it is not possible to optimize this any further than it already is.
If you're collecting daily/monthly/yearly stats, as I'm assuming you are, one option (after upgrading, of course) is a WITH statement and the relevant joins, e.g.:
with daily_stats as (
(what you posted)
),
monthly_stats as (
(what you posted monthly)
),
etc.
select daily_stats.stats,
       monthly_stats.stats,
       etc.
from lifetime_stats
left join yearly_stats on ...
left join monthly_stats on ...
left join daily_stats on ...
However, that will actually perform less well than running each query separately in a production environment, because you'll introduce left joins in the DB which could be done just as well in the middleware (i.e. show daily, then monthly, then yearly and finally lifetime stats). (If not better, since you'll be avoiding full table scans.)
By keeping things as is, you'll save the precious DB resources for dealing with reads and writes of actual data. The tradeoff (less network traffic between your database and your app) is almost certainly not worth it.
Yikes! Don't do this!!! Not because you can't do what you're asking, but because you probably shouldn't be doing what you're asking in this manner. I'm guessing the reason you've got date_field in your example is because you've got a date_field attached to a user or some other meta-data.
Think about it: you are asking PostgreSQL to scan 100% of the records relevant to a given user. Unless this is a one-time operation, you almost assuredly do not want to do this. If this is a one-time operation and you are planning on caching this value as a meta-data, then who cares about the optimizations? Space is cheap and will save you heaps of execution time down the road.
You should add 4x per-user (or whatever it is) meta-data fields that help sum up the data. You have two options, I'll let you figure out how to use this so that you keep historical counts, but here's the easy version:
CREATE TABLE user_counts_only_keep_current (
user_id , -- Your user_id
lifetime INT DEFAULT 0,
yearly INT DEFAULT 0,
monthly INT DEFAULT 0,
daily INT DEFAULT 0,
last_update_utc TIMESTAMP WITH TIME ZONE,
FOREIGN KEY(user_id) REFERENCES "user"(id)
);
CREATE UNIQUE INDEX this_tbl_user_id_udx ON user_counts_only_keep_current(user_id);
Set up some stored procedures that zero out individual columns if last_update_utc doesn't match the current day according to NOW(). You can get creative from here, but incrementing records like this is going to be the way to go.
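As a rough sketch of that increment step as a single statement (a hypothetical user id of 42 stands in for the parameter you'd pass to the stored procedure):
UPDATE user_counts_only_keep_current
   SET lifetime = lifetime + 1,
       yearly   = CASE WHEN date_trunc('year',  last_update_utc) = date_trunc('year',  now()) THEN yearly  + 1 ELSE 1 END,
       monthly  = CASE WHEN date_trunc('month', last_update_utc) = date_trunc('month', now()) THEN monthly + 1 ELSE 1 END,
       daily    = CASE WHEN date_trunc('day',   last_update_utc) = date_trunc('day',   now()) THEN daily   + 1 ELSE 1 END,
       last_update_utc = now()
 WHERE user_id = 42;
All of the CASE expressions see the old last_update_utc, so the reset-or-increment decision is made before the timestamp is bumped.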
Handling of time series data in any relational database requires special handling and maintenance. Look in to PostgreSQL's table inheritance if you want good temporal data management.... but really, don't do whatever it is you're about to do to your application because it's almost certainly going to result in bad things(tm).
I have a MySQL table called items that contains thousands of records. Each record has a user_id field and a created (datetime) field.
I'm trying to put together a query to SELECT 25 rows, passing a string of user ids as a condition, sorted by created DESC.
In some cases, there might be just a few user ids, while in other instances, there may be hundreds.
If the result set is greater than 25, I want to pare it down by eliminating duplicate user_id records. For instance, if there were two records for user_id = 3, only the most recent (according to created datetime) would be included.
In my attempts at a solution, I am having trouble because while, for example, it's easy to get a result set of 100 (allowing duplicate user_id records), or a result set of 16 (using GROUP BY for unique user_id records), it's hard to get 25.
One logical approach, which may not be the correct MySQL approach, is to get the most recent record for each user_id, and then, if the result set is less than 25, begin adding a second record for each user_id until the 25-record limit is met (maybe a third, fourth, etc. record for each user_id would be needed).
Can this be accomplished with a MySQL query, or will I need to take a large result set and trim it down to 25 with code?
I don't think what you're trying to accomplish is possible as a SQL query. Your desire is to return 25 rows no matter what the natural data groupings are, whereas SQL is usually picky about returning results based on data groupings.
If you want a purely MySQL-based solution, you may be able to accomplish this with a stored procedure. (Supported in MySQL 5.0.x and later.) However, it might just make more sense to run the query to return all 100+ rows and then trim it programmatically within the application.
This will get you the most recent for each user --
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
ON i1.user_id = i2.user_id AND i1.created < i2.created
WHERE i2.id IS NULL
This will get you the most recent two records for each user (count how many newer rows each row has and keep those with fewer than two) --
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
ON i1.user_id = i2.user_id AND i1.created < i2.created
GROUP BY i1.id, i1.user_id, i1.created
HAVING COUNT(i2.id) < 2
Try working from there.
You could nicely put this into a stored procedure.
My opinion is to use application logic, as this is very much application layer logic you are trying to implement at the DB level, i.e. filtering down the results to make the search more useful to the end user.
You could implement a stored procedure (personally I would never do such a thing) or just get the application to decide which 25 results.
One approach would be to get the most recent item from each user, followed by the most recent items from all users, and limit that. You could construct pathological examples where this probably isn't what you want, but it should be pretty good in general.
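A sketch of that idea as a single MySQL query (the user id list and the pri column are placeholders for illustration):
SELECT user_id, created
FROM (
    SELECT i.user_id, i.created, 0 AS pri        -- one most-recent row per user
      FROM items i
      JOIN (SELECT user_id, MAX(created) AS created
              FROM items
             WHERE user_id IN (3, 7, 9)
             GROUP BY user_id) latest
        ON latest.user_id = i.user_id AND latest.created = i.created
    UNION ALL
    SELECT user_id, created, 1 AS pri            -- everything else from those users
      FROM items
     WHERE user_id IN (3, 7, 9)
) u
GROUP BY user_id, created
ORDER BY MIN(pri), created DESC
LIMIT 25;
The GROUP BY collapses the duplicates between the two branches, and ordering by MIN(pri) puts each user's most recent row ahead of the filler rows.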
Unfortunately, there is no easy way :( I had to do something similar when I built a report for my company that would pull up customer disables that were logged in a database. The only problem was that the disconnect was run and logged every 30 minutes, so the rows would not be distinct since the timestamp was different in every disconnect. I solved this problem with subqueries. I don't have the exact code anymore, but I believe this is how I implemented it:
SELECT CORP, HOUSE, CUST,
(
SELECT TOP 1 hsd
FROM #TempTable t2
WHERE t1.corp = t2.corp
AND t1.house = t2.house
AND t1.cust = t2.cust
) DisableDate
FROM #TempTable t1
GROUP BY corp, house, cust -- selecting distinct
So, my answer is to eliminate the non-distinct column from the query by using subqueries. There might be an easier way to do it, though. I'm curious to see what others post.
Sorry, I keep editing this; I keep trying to find ways to make it easier to show what I did.
We have a database that we are using to store test results for an embedded device. There's a table with columns for different types of failures (details not relevant), along with a primary key 'keynum' and a 'NUM_FAILURES' column that lists the number of failures. We store passes and failures, so a pass has a '0' in 'NUM_FAILURES'.
In order to keep the database from growing without bounds, we want to keep the last 1000 results, plus any of the last 50 failures that fall outside of the 1000. So, worst case, the table could have 1050 entries in it. I'm trying to find the most efficient SQL insert trigger to remove extra entries. I'll give what I have so far as an answer, but I'm looking to see if anyone can come up with something better, since SQL isn't something I do very often.
We are using SQLITE3 on a non-Windows platform, if it's relevant.
EDIT: To clarify, the part that I am having problems with is the DELETE, and specifically the part related to the last 50 failures.
The reason you want to remove these entries is to keep the database from growing too big, not to keep it in some special state. For that I would really not use triggers; instead, set up a job to run at some interval to clean up the table.
So far, I have ended up using a View combined with a Trigger, but I'm not sure it's going to work for other reasons.
CREATE VIEW tablename_view AS SELECT keynum FROM tablename WHERE NUM_FAILURES!='0'
ORDER BY keynum DESC LIMIT 50;
CREATE TRIGGER tablename_trig
AFTER INSERT ON tablename WHEN (((SELECT COUNT(*) FROM tablename) >= 1000) or
((SELECT COUNT(NUM_FAILURES) FROM tablename WHERE NUM_FAILURES!='0') >= 50))
BEGIN
DELETE FROM tablename WHERE ((((SELECT MAX(keynum) FROM tablename) - keynum) >= 1000)
AND
((NUM_FAILURES=='0') OR ((SELECT MIN(keynum) FROM tablename_view) > keynum)));
END;
I think you may be using the wrong data structure. Instead, I'd create two tables and pre-populate one with 1000 rows (successes) and the other with 50 (failures). Put a primary ID on each. Then when you record a result, instead of inserting a new row, find the ID+1 value for the last timestamped record entered (looping back to 0 if it exceeds max(id) in the table) and update it with your new values.
This has the advantage of pre-allocating your storage, not requiring a trigger, and internally consistent logic. You can also adjust the size of the log very simply by just pre-populating more records rather than to have to change program logic.
There's several variations you can use on this, but the idea of using a closed loop structure rather than an open list would appear to match the problem domain more closely.
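A rough SQLite sketch of that closed-loop idea (the table and column names here are invented for illustration):
-- results pre-populated with ids 0..999; overwrite the slot after the newest entry
UPDATE results
   SET ts = strftime('%Y-%m-%d %H:%M:%f', 'now'),
       num_failures = 0              -- plus whatever else the test run produces
 WHERE id = (
       SELECT (id + 1) % 1000        -- wrap back to slot 0 after the last one
         FROM results
        ORDER BY ts DESC
        LIMIT 1
 );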
How about this:
DELETE
FROM table
WHERE ( id < ( SELECT max(id) - 1000 FROM table )
AND num_failures = 0
)
OR id < ( SELECT max(id) - 1050 FROM table )
If performance is a concern, it might be better to delete on a periodic basis, rather than on each insert.