I need to run a query at a specific time and save the result in another table; that part I've already done. Later I need to run the same query again, but exclude the data that was captured in the first run. It is a process that has to be done three times a day. Any ideas? Thanks.
The question is a little confusing, but I think you want to back up, several times a day, the records that were inserted into the source table after the last backup.
Let's assume that your structure is the following:
**DataTable**
Id (IDENTITY)
Value
**BackupTable**
Id
SourceId -- identifies the source. Can be used when merging data from multiple sources
RefId -- identifier in source's scope (in our case DataTable.Id)
Value -- value copied
In our case, the job should look for those values that are not yet inserted:
DECLARE @SourceId INT = 1 -- this is a constant for DataTable as source
DECLARE @maxIdForSource INT = (SELECT ISNULL(MAX(RefId), 0) FROM BackupTable WHERE SourceId = @SourceId) -- ISNULL covers the first run, when BackupTable is still empty
INSERT INTO BackupTable
(SourceId, RefId, Value)
SELECT @SourceId, Id, Value
FROM DataTable
WHERE Id > @maxIdForSource
If you are not merging data from multiple sources, you do not need to define SourceId at all.
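If you do go that single-source route, a minimal sketch of the same job without SourceId (again handling the empty-table case on the first run) could look like this:
DECLARE @maxId INT = (SELECT ISNULL(MAX(RefId), 0) FROM BackupTable)
INSERT INTO BackupTable (RefId, Value)
SELECT Id, Value
FROM DataTable
WHERE Id > @maxId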
I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now, simple insertion is easy in Hive using LOAD DATA INPATH into a staging table and then ignoring the first two fields from the staging table.
However, how would I go about the update statements? So that my final row in hive looks like below:
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
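The staging table and its load are not shown in this answer; as a rough sketch (the name of the second file field is an assumption), they might look something like this, with the plain 'I' rows inserted straight into the users table:
create table test.orc_acid_example_staging
(
type string -- 'I' or 'U'
,writetimestamp string -- assumed name for the file-write timestamp field
,id int
,name string
,col1 string
,updatetimestamp timestamp
)
row format delimited fields terminated by ','
stored as textfile;

-- simple inserts: take the 'I' rows and drop the first two fields
insert into test.orc_acid_example_users
select id, name, col1, updatetimestamp
from test.orc_acid_example_staging
where type = 'I';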
After your insert statements, your Bob record would say "stuff" in col1 (1, Bob, stuff, 123456).
As far as the updates go - you could tackle these with an update or merge statement. I think the key here is the null values. It's important to keep the original name, or col1, or whatever, if the staging table row from the file has a null value. Here's a merge example which coalesces the staging table's fields: if there is a value in the staging table, take that, otherwise fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show "updatedstuff"
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferable for the source to send full user records any time there's an update, instead of just the changed fields.
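As a hedged sketch of that pre-processing step (reusing the staging table from the merge above, and assuming updatetimestamp reflects the order of the updates), you could collapse the 'U' rows to one row per id before merging:
create table test.orc_acid_example_staging_latest as
select id, name, col1, updatetimestamp
from (
    select id,
           last_value(name, true) over (partition by id order by updatetimestamp) as name,
           last_value(col1, true) over (partition by id order by updatetimestamp) as col1,
           updatetimestamp,
           row_number() over (partition by id order by updatetimestamp desc) as rn
    from test.orc_acid_example_staging
    where type = 'U'
) u
where rn = 1;
The merge would then read from test.orc_acid_example_staging_latest instead of the raw staging table.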
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
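For example, a sketch that builds on the query above (same assumed column names) and keeps only the newest row per id:
select id, name, col1, update_timestamp
from (
    select h.id,
           coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
           coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
           h.update_timestamp,
           row_number() over (partition by h.id order by h.timestamp desc) as rn
    from history h
) t
where rn = 1;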
I have a SQL Server table with just 3 columns, one of which is of type varbinary. The data in this column is actually a Json document which among other properties contains information about when the data was last modified. Unfortunately the SQL table itself does not contain information about when its rows were modified.
Now when doing sorting and filtering of the data I of course don't want to fetch all rows in order to find e.g. the latest 100 entries.
So my question is: does SQL Server somehow remember when a row was added/modified? I have tried adding a timestamp column, and it is applied to all existing rows, but it seems to be applied randomly, because the sorting doesn't work. I don't need a datetime or anything, I just want to be able to sort the records based on when they were last modified.
Thanks
For those looking to insert a timestamp column of type DateTime into an existing DB table, you can do this like so:
ALTER TABLE TestTable
ADD DateInserted DATETIME NOT NULL DEFAULT (GETDATE());
The existing records will automatically get a value equal to the date/time of the moment when the column is added.
New records will get an up-to-date value upon insertion.
SQL Server will not track historically when a row was inserted or modified so you need to rely on the JSON data to figure that out yourself. You are going to need a new column to make this efficient to query. Once you have your new column you have some options:
Loop through all your records populating the new column with the relevant value from the JSON data.
If your version of SQL Server is recent enough, you can query the JSON data directly. Populate this column using a query like this:
UPDATE MyTable
SET MyNewColumn = JSON_VALUE(JsonDataColumn, '$.Customer.DateCreated')
The downside of this method is that you need to maintain this column yourself whenever the JSON data changes.
Make SQL Server compute the value from the JSON automatically, for example:
ALTER TABLE MyTable
ADD MyNewColumn AS JSON_VALUE(JsonDataColumn, '$.Customer.DateCreated')
And, create an index to make it efficient:
CREATE INDEX IX_MyTable_MyNewColumn
ON MyTable(MyNewColumn)
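Either way, once the value is in its own indexed column, the "latest 100 entries" from the question becomes a cheap query, for example:
SELECT TOP (100) *
FROM MyTable
ORDER BY MyNewColumn DESC; -- note: JSON_VALUE returns nvarchar, so this relies on the stored dates sorting correctly as text (e.g. ISO 8601)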
Use a new column, CreatedDate, and store the datetime every time you make an insert.
You could use GetDate() to fill the column on insert.
An UpdatedDate column can be used for updates.
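If you want UpdatedDate maintained automatically, one option (a sketch; the table and key column names are assumptions) is an AFTER UPDATE trigger:
CREATE TRIGGER trg_MyTable_SetUpdatedDate
ON dbo.MyTable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- stamp every updated row with the current date/time
    UPDATE t
    SET UpdatedDate = GETDATE()
    FROM dbo.MyTable t
    INNER JOIN inserted i ON i.Id = t.Id; -- assumes an Id key column
END;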
> in order to find e.g. the latest 100 entries.
Timestamp is indeed what you need.
It's an ever-increasing value and it's updated automatically, so you are always able to find the last modified/inserted rows.
Here is an example:
create table dbo.test1 (id int);
insert into dbo.test1 values(1), (2), (3);
alter table dbo.test1 add ts timestamp;
update dbo.test1
set id = 10
where id = 2
select top 1 *
from dbo.test1
order by ts desc;
--id ts
--10 0x000000001FCFABD2
insert into dbo.test1 (id)
values (100);
select top 1 *
from dbo.test1
order by ts desc;
--id ts
--100 0x000000001FCFABD3
As you see, you always get the last modified/inserted row.
For your purpose just use
select top 100 *
...
order by ts desc;
Thanks. Apparently I didn't look hard enough before I posted this question. The question has been asked a couple of times before and the answer is: Nope! There is no easy solution to this.
SQL Server does not keep track of when a record was created or modified, which is essentially what I was looking for. So I will go for the next best solution, which is probably to create a datetime column, retrieve the modified date from the Json document and then update the record. Or rather, the 1.4 million records :-(
In a few logging tables that are frequently written to, I'd like to be able to store a relative order so that I can union between these tables, and get the order that things actually occurred in.
DateTime2's resolution is lacking. Several rows will get the exact same date, so there is no way to tell which happened first.
Because sorting should work across several tables, sorting by Id is out.
Then I started looking at timestamp. This works for updated dates, but it does not work for created dates, because you can only have one timestamp column per table, and it automatically updates.
This is for Microsoft SQL Server 2008.
Any suggestions?
You can simulate it with another column typed as binary(8) (same as rowversion) and defaulting to @@DBTS:
create table TX (
ID int not null,
Updated rowversion not null,
Created binary(8) not null constraint DF_TX_Created DEFAULT (@@DBTS)
)
go
insert into TX (ID)
values (1),(2)
go
update TX set ID = 3 where ID = 1
go
insert into TX (ID)
values (4)
go
select * from TX
Result:
ID Updated Created
----------- ------------------ ------------------
3 0x00000000000007D3 0x00000000000007D0
2 0x00000000000007D2 0x00000000000007D0
4 0x00000000000007D4 0x00000000000007D3
Notes:
The Created values will always be equal to the last rowversion value assigned, so they will "lag", in some sense, compared to Updated values.
Also, multiple inserts from a single statement will receive the same Created values, whereas Updated values will always be distinct.
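With such a column in each logging table, the union the question describes can then be ordered by the Created (or Updated) value; a sketch with assumed table names:
select 'LogA' as Source, ID, Created, Updated from dbo.LogA
union all
select 'LogB' as Source, ID, Created, Updated from dbo.LogB
order by Created; -- rows inserted by the same statement share a Created value, so ties are possible (see the notes above)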
I have run into this problem that I'm trying to solve: Every day I import new records into a table that have an ID number.
Most of them are new (have never been seen in the system before) but some are coming in again. What I need to do is append an alpha character to the end of the ID number if the number is found in the archive, but only if the data in the row is different from the data in the archive, and this needs to be done sequentially, i.e., if 12345 is seen a second time with different data, I change it to 12345A, and if 12345 is seen again, and is again different, I need to change it to 12345B, etc.
Originally I tried using a while loop where it would put all the 'seen again' records in a temp table, and then assign A the first time, then delete those, assign B to what's left, delete those, etc., till the temp table was empty, but that hasn't worked out.
Alternately, I've been thinking of trying subqueries as in:
update table
set IDNO = (select max(idno) from archive) + 1
Any suggestions?
How about this as an idea? Mind you, this is basically pseudocode so adjust as you see fit.
With "src" as the table that all the data will ultimately be inserted into, and "TMP" as your temporary table.. and this is presuming that the ID column in TMP is a double.
do
update tmp set id = id + 0.01 where id in (select id from src);
until no_rows_changed;
alter table TMP change id into id varchar(255);
update TMP set id = concat(int(id), chr((id - int(id)) * 100 + 64));
insert into SRC select * from tmp;
What happens when you get to 12345Z?
Anyway, change the table structure slightly, here's the recipe:
Drop any indices on ID.
Split ID (apparently varchar) into ID_Num (long int) and ID_Alpha (varchar, not null). Make the default value for ID_Alpha an empty string ('').
So, 12345B (varchar) becomes 12345 (long int) and 'B' (varchar), etc.
Create a unique, ideally clustered, index on columns ID_Num and ID_Alpha.
Make this the primary key. Or, if you must, use an auto-incrementing integer as a pseudo primary key.
Now, when adding new data, finding duplicate ID numbers is trivial and the last ID_Alpha can be obtained with a simple max() operation.
Resolving duplicate ID's should now be an easier task, using either a while loop or a cursor (if you must).
But, it should also be possible to avoid the "Row by agonizing row" (RBAR), and use a set-based approach. A few days of reading Jeff Moden articles, should give you ideas in that regard.
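A rough sketch of that layout and the max() lookup (the table name and column sizes are assumptions):
CREATE TABLE Archive (
    ID_Num   BIGINT     NOT NULL,
    ID_Alpha VARCHAR(5) NOT NULL DEFAULT (''),
    -- ... remaining data columns ...
    CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (ID_Num, ID_Alpha)
);

-- last suffix already used for an incoming ID number
SELECT MAX(ID_Alpha)
FROM Archive
WHERE ID_Num = 12345;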
Here is my final solution:
update a
set IDnum=b.IDnum
from tempimporttable a inner join
(select * from archivetable
where IDnum in
(select max(IDnum) from archivetable
where IDnum in
(select IDnum from tempimporttable)
group by left(IDnum,7)
)
) b
on b.IDnum like a.IDnum + '%'
WHERE
*row from tempimport table = row from archive table*
to set incoming rows to the same IDnum as old rows, and then
update a
set patient_account_number = case
when len((select max(IDnum) from archive where left(IDnum,7) = left(a.IDnum,7)))= 7 then a.IDnum + 'A'
else left(a.IDnum,7) + char(ascii(right((select max(IDnum) from archive where left(IDnum,7) = left(a.IDnum,7)),1))+1)
end
from tempimporttable a
where not exists ( *select rows from archive table* )
I don't know if anyone wants to delve too far into this, but I appreciate constructive criticism...
Is it possible in SQL (SQL Server) to retrieve the next ID (integer) from an identity column in a table before, and without actually, inserting a row? This is not necessarily the highest ID plus 1 if the most recent row was deleted.
I ask this because we occasionally have to update a live DB with new rows. The ID of the row is used in our code (e.g. Switch (ID){ Case ID: }) and must be the same. If our development DB and live DB get out of sync, it would be nice to predict a row ID in advance before deployment.
I could of course use SET IDENTITY_INSERT ON/OFF or run a transaction (does this roll back the ID?) etc., but wondered if there was a function that returned the next ID (without incrementing it).
try IDENT_CURRENT:
Select IDENT_CURRENT('yourtablename')
This works even if you haven't inserted any rows in the current session:
Returns the last identity value generated for a specified table or view. The last identity value generated can be for any session and any scope.
Edit:
After spending a number of hours comparing entire page dumps, I realised there is an easier way and I should have stayed on the DMVs.
The value survives a backup / restore, which is a clear indication that it is stored - I dumped all the pages in the DB and couldn't find the location / alteration for when a record was added. Comparing 200k-line dumps of pages isn't fun.
Using the dedicated admin console, I took a dump of every single internal table exposed, inserted a row, and then took a further dump of the system tables. Both dumps were identical, which indicates that whilst the value survived (and therefore must be stored), it is not exposed even at that level.
So after going around in a circle I realised the DMV did have the answer.
create table foo (MyID int identity not null, MyField char(10))
insert into foo values ('test')
go 10
-- Inserted 10 rows
select Convert(varchar(8),increment_value) as IncrementValue,
Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'
-- insert another row
insert into foo values ('test')
-- check the values again
select Convert(varchar(8),increment_value) as IncrementValue,
Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'
-- delete the rows
delete from foo
-- check the DMV again
select Convert(varchar(8),increment_value) as IncrementValue,
Convert(varchar(8),last_value) as LastValue
from sys.identity_columns where name ='myid'
-- value is currently 11 and increment is 1, so the next insert gets 12
insert into foo values ('test')
select * from foo
Result:
MyID MyField
----------- ----------
12 test
(1 row(s) affected)
Even though the rows were removed, the last value was not reset, so the last value + increment should be the right answer.
Also going to write up the episode on my blog.
Oh, and the short cut to it all:
select ident_current('foo') + ident_incr('foo')
So it actually turns out to be easy - but this all assumes no one else has used your ID whilst you got it back. Fine for investigation, but I wouldn't want to use it in code.
This is a little bit strange but it will work:
If you want to know the next value, start by getting the greatest value plus one:
SELECT max(id) FROM yourtable
To make this work, you'll need to reset the identity on insert:
DECLARE @value INTEGER
SELECT @value = max(id) + 1 FROM yourtable
DBCC CHECKIDENT (yourtable, reseed, @value)
INSERT INTO yourtable ...
Not exactly an elegant solution but I haven't had my coffee yet ;-)
(This also assumes that there is nothing done to the table by your process or any other process between the first and second blocks of code).
You can pretty easily determine the last value used:
SELECT
last_value
FROM
sys.identity_columns
WHERE
object_id = OBJECT_ID('yourtablename')
Usually, the next ID will be last_value + 1 - but there's no guarantee for that.
Marc
Rather than using an IDENTITY column, you could use a UNIQUEIDENTIFIER (Guid) column as the unique row identifier and insert known values.
The other option (which I use) is SET IDENTITY_INSERT ON, where the row IDs are managed in a single source-controlled 'document'.
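For the IDENTITY_INSERT route, a minimal sketch (the table and columns are assumptions) looks like this:
SET IDENTITY_INSERT dbo.MyTable ON;

-- an explicit column list is required while IDENTITY_INSERT is ON
INSERT INTO dbo.MyTable (Id, Name)
VALUES (42, 'Known row');

SET IDENTITY_INSERT dbo.MyTable OFF;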