Do I need a second table for this database logic? - sql

Firstly, this question could be fairly DB-agnostic, but I'm using SQL Server 2008 in case it has a specialised solution for this problem. Please keep reading even if you're not an MS SQL Server person .. please :)
OK, I read in a log file that contains data for a game, e.g. when a player connects, disconnects, does stuff, etc. Nothing too hard. Works great.
Now, two of the log file entry types are
NewConnection
LostConnection
What I'm trying to keep track of are the players currently connected to the game.
So what I originally thought of was to create a second table that contains the most recent new connection per player. When a player disconnects/loses connection, I then remove that player's entry from this second table.
Eg.
Table 1: LogEntries
LogEntryId INT PK NOT NULL
EntryTypeId TINYINT NOT NULL
PlayerId INT NOT NULL
....
Table 2: ConnectedPlayers
LogEntryId INT FK (back to LogEntries table) NOT NULL
Then I thought I could use a trigger to insert this cached data into the ConnectedPlayers table. Don't forget: as a trigger, it needs to handle multiple records, updates and deletes.
But I'm not sure if this is the best way. Like, could I have an Indexed View?
I would love to know people's thoughts on this.
Oh, one more thing: for simplicity, let's just assume that when a player drops connection/lags out/their modem dies/etc., the application is smart enough to know this and DOES record it as a LostConnection entry. There will be no phantom users reported as connected when they have really been disconnected accidentally.
UPDATE:
I was thinking that maybe I could use a view instead (and I can index this view if I want to, also :) ). By partitioning my results, I could get the most recent event type per player, where the event is a NewConnection or a LostConnection. Then I only grab the most recent NewConnection rows, which means those players are connected. No second table/triggers/extra insert .NET code/whatever needed ...
eg..
SELECT LogEntryId, EntryTypeId, PlayerId
FROM
    (SELECT LogEntryId, EntryTypeId, PlayerId,
            RANK() OVER (PARTITION BY PlayerId ORDER BY LogEntryId DESC) AS MostRecentRank
     FROM LogEntries
     WHERE EntryTypeId = 2 -- NewConnection
        OR EntryTypeId = 4 -- LostConnection
    ) SubQuery
WHERE MostRecentRank = 1
  AND EntryTypeId = 2 -- only players whose most recent event is a NewConnection
How does that sound/look?
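To sketch the idea, here is the ranking query run against SQLite via Python instead of SQL Server (the sample rows and the entry-type codes 2/4 are invented to match the question; SQLite 3.25+ is required for window functions):

```python
import sqlite3

# Sketch of the "most recent event per player" view. Sample data and the
# entry-type codes (2 = NewConnection, 4 = LostConnection) follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LogEntries (
        LogEntryId  INTEGER PRIMARY KEY,
        EntryTypeId INTEGER NOT NULL,
        PlayerId    INTEGER NOT NULL
    );
    INSERT INTO LogEntries (EntryTypeId, PlayerId) VALUES
        (2, 1),  -- player 1 connects
        (2, 2),  -- player 2 connects
        (4, 1),  -- player 1 disconnects
        (2, 3);  -- player 3 connects
""")

connected = conn.execute("""
    SELECT PlayerId
    FROM (SELECT PlayerId, EntryTypeId,
                 RANK() OVER (PARTITION BY PlayerId
                              ORDER BY LogEntryId DESC) AS MostRecentRank
          FROM LogEntries
          WHERE EntryTypeId IN (2, 4)) SubQuery
    WHERE MostRecentRank = 1 AND EntryTypeId = 2
    ORDER BY PlayerId
""").fetchall()
print([row[0] for row in connected])  # players 2 and 3 are still connected
```

Player 1's latest event is a LostConnection, so only players 2 and 3 come back as connected.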

You don't need a second table, but you do need a date column, which I assume is part of your log data. I would normalize the data and avoid the temptation to optimize prematurely. Make sure you index the key columns, mainly the LogEntryDate and PlayerId columns in the case of your query.
Then, use a standard aggregate query to determine the newest log entry for each user, and then filter out the ones that are not connected. You could further optimize this by only selecting from log entries from the last 24 hours (or last week or whatever makes sense for your app).
select l.*
from (
    select PlayerId, max(LogEntryDate) as MaxLogEntryDate
    from LogEntries
    where EntryTypeId in (2, 4)
      and LogEntryDate > GetDate() - 7 -- only look at the last week; connections older than that have timed out
    group by PlayerId
) lm
inner join LogEntries l
    on lm.PlayerId = l.PlayerId
   and lm.MaxLogEntryDate = l.LogEntryDate
where l.EntryTypeId = 2 -- new connections only
If you find that you are still not getting the speed you want out of the query, then look at strategies for optimizing. You seem reluctant to cache in the application layer, so your proposal of indexed views would work. You could use the query above as a basis for this to create a Player view that includes a boolean IsConnected column.
Note: if you do not receive a date with each log entry but the LogEntryId is generated by the game, that should work as a substitute for the date. If you are generating the LogEntryId on insert though, I would caution against relying on that as it would only take one out of order import to throw off all of your data.

Depending on the size of the original LogEntries table, this almost seems like overkill.
Triggers would have to run on every change to the original table, whereas with the correct indexing a simple query could give you these results whenever you need the data.
I would thus advise against the option of a secondary table.

I'd make an is_connected flag, a login_time column, and that's about it.
You can use a simple MySQL query to check even every 10/60 seconds and cache the data in a file:
WHERE is_connected = 1 ORDER BY login_time LIMIT 10/20/100 ...
A second table seems way too much and pretty useless. Extra caching (if needed, in a huge database) could be done in files.

I personally would use a view that gets the most recent NewConnection or LostConnection (whichever is more recent) for each player, which implies you need some sort of date-time stamp in the log, or an ID that is ever increasing. Then filter that further, throwing out all the LostConnection entries. This will leave you with all players having a NewConnection without a more recent LostConnection, hence they are connected.
The problem you might have with this approach is the log table might be huge. I would probably try performance testing with an index on the timestamp column or whatever column you use to determine what is the "most recent" entry.

As I understand it, what you really need is:
Player (ID)
Game (ID)
Connection (PlayerID, GameID, LastActivity DATETIME)
Then, with @interestingTime set to some point shortly before the current time:
select PlayerID, GameID
from Connection
where LastActivity > @interestingTime
gives you all currently connected players, and
select PlayerID, GameID
from Connection
where LastActivity <= @interestingTime
gives you lost connections.
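A minimal sketch of this LastActivity approach, using SQLite via Python with invented sample rows (ISO-8601 timestamp strings stand in for a DATETIME column):

```python
import sqlite3
from datetime import datetime, timedelta

# Sketch of the Connection(PlayerID, GameID, LastActivity) approach with
# invented sample rows; ISO-8601 strings stand in for a DATETIME column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Connection (PlayerID INT, GameID INT, LastActivity TEXT)")

now = datetime(2024, 1, 1, 12, 0, 0)
conn.executemany("INSERT INTO Connection VALUES (?, ?, ?)", [
    (1, 10, (now - timedelta(minutes=1)).isoformat()),  # recently active
    (2, 10, (now - timedelta(hours=2)).isoformat()),    # stale: lost connection
])

interesting_time = (now - timedelta(minutes=5)).isoformat()
active = conn.execute(
    "SELECT PlayerID FROM Connection WHERE LastActivity > ?",
    (interesting_time,),
).fetchall()
print([r[0] for r in active])  # only player 1 is currently connected
```

ISO-8601 strings compare correctly as text, which is why the plain `>` comparison works here.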

Related

Oracle join of two tables where joined to table returns highest of a result set restricted by the joining (parent?) table?

Sorry for the long title, but it's a difficult dilemma to distill down to a short title.. :-)
I have two tables, an auto maintenance log table and an auto trip log table, something like the following:
auto_maint_log:
auto_id
maint_datetime
maint_description
auto_trip_log:
auto_id
trip_datetime
ending_odometer
I need to select all maintenance events for each auto, and for each event, look up the most recent ending_odometer value at the time of the maintenance from the trip log table. The only way I have successfully accomplished this task is using a function (i.e. get_odometer(auto_id, maint_datetime) as a column in my query) in which the odometer reading from the trip log is evaluated to pull the most recent trip prior to the passed maint_datetime, thus returning the most recent ending_odometer.
While the function call does work, in theory, in practice it does not. My end goal is to create a view of the maintenance data that includes the (then) current odometer value, and I have millions of maintenance rows across hundreds of vehicles. The performance of the function call makes its use as a solution impractical, nay impossible.. :-)
I've tried all the variations of MAX, rownum, subselects, and analytics (over/partition, etc.) that I've been able to scour from googling this, and have not been able to code a working query.
Any suggestions, or brilliant solutions are welcome!
Thanks,
You can do this with a correlated subquery:
select aml.*,
       (select max(atl.ending_odometer)
        from auto_trip_log atl
        where atl.auto_id = aml.auto_id and
              atl.trip_datetime <= aml.maint_datetime
       ) as ending_odometer
from auto_maint_log aml;
This query uses max() because (presumably) the odometer readings are steadily increasing.
EDIT:
select aml.*,
       (select max(atl.ending_odometer) keep (dense_rank first order by atl.trip_datetime desc)
        from auto_trip_log atl
        where atl.auto_id = aml.auto_id and
              atl.trip_datetime <= aml.maint_datetime
       ) as ending_odometer
from auto_maint_log aml;
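The correlated-subquery version can be sanity-checked outside Oracle; here is a small SQLite-via-Python sketch with invented sample data (date strings stand in for the datetime columns):

```python
import sqlite3

# SQLite sketch of the correlated subquery, with invented sample data;
# date strings stand in for Oracle's datetime columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE auto_maint_log (auto_id INT, maint_datetime TEXT, maint_description TEXT);
    CREATE TABLE auto_trip_log  (auto_id INT, trip_datetime TEXT, ending_odometer INT);
    INSERT INTO auto_trip_log VALUES
        (1, '2024-01-01', 10000),
        (1, '2024-02-01', 10500),
        (1, '2024-03-01', 11000);
    INSERT INTO auto_maint_log VALUES (1, '2024-02-15', 'oil change');
""")

rows = conn.execute("""
    SELECT aml.auto_id, aml.maint_description,
           (SELECT MAX(atl.ending_odometer)
            FROM auto_trip_log atl
            WHERE atl.auto_id = aml.auto_id
              AND atl.trip_datetime <= aml.maint_datetime) AS ending_odometer
    FROM auto_maint_log aml
""").fetchall()
print(rows)  # the odometer as of the trip just before the maintenance
```

The maintenance on 2024-02-15 picks up the 10500 reading from the 2024-02-01 trip, not the later 11000.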

Store results of SQL Server query for pagination

In my database I have a table with a rather large data set that users can perform searches on. So for the following table structure for the Person table that contains about 250,000 records:
firstName | lastName | age
----------|----------|----
John      | Doe      | 25
John      | Sams     | 15
the users would be able to perform a query that can return about 500 or so results. What I would like to do is allow the user to see his search results 50 at a time using pagination. I've figured out the client-side pagination stuff, but I need somewhere to store the query results so that the pagination uses the results from his unique query and not from a SELECT * statement.
Can anyone provide some guidance on the best way to achieve this? Thanks.
Side note: I've been trying to use temp tables to do this with SELECT INTO statements, but I think that might cause some problems if, say, User A performs a search and his results are stored in the temp table, then User B performs a search shortly after and User A's search results are overwritten.
In SQL Server the ROW_NUMBER() function is great for pagination, and may be helpful depending on what parameters change between searches, for example if searches were just for different firstName values you could use:
;WITH search AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName
    FROM YourTable
)
SELECT *
FROM search
WHERE RN_firstName BETWEEN 51 AND 100
  AND firstName = 'John'
You could add additional ROW_NUMBER() lines, altering the PARTITION BY clause based on which fields are being searched.
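The ROW_NUMBER() pagination pattern is portable; here is a small sketch in SQLite via Python (the table name and the 120 sample rows are invented; SQLite 3.25+ is needed for window functions):

```python
import sqlite3

# SQLite sketch of ROW_NUMBER() pagination (SQLite 3.25+); the table and
# its 120 sample rows are invented to match the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (firstName TEXT, lastName TEXT, age INT)")
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)",
                 [("John", f"Last{i:03d}", 20 + i % 40) for i in range(120)])

page = conn.execute("""
    WITH search AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY firstName
                                     ORDER BY lastName) AS RN_firstName
        FROM Person
    )
    SELECT lastName FROM search
    WHERE RN_firstName BETWEEN 51 AND 100
      AND firstName = 'John'
""").fetchall()
print(len(page), page[0][0])  # 50 rows per page, starting at the 51st match
```

Each page is just a different BETWEEN range over the same numbered result, so no per-user storage is required.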
Historically, for us, the best way to manage this is to create a complete new table, with a unique name. Then, when you're done, you can schedule the table for deletion.
The table, if practical, simply contains an index id (a simple sequence: 1, 2, 3, 4, 5) and the primary key of the table(s) that are part of the query; not the entire result set.
Your pagination logic then does something like:
SELECT p.*
FROM temp_1234 t, primary_table p
WHERE t.pkey = p.primary_key
  AND t.serial_id BETWEEN 51 AND 100
The serial id is your paging index.
So, you end up with something like (note, I'm not a SQL Server guy, so pardon):
CREATE TABLE temp_1234 (
    serial_id serial,
    pkey number
);
INSERT INTO temp_1234
SELECT 0, primary_key FROM primary_table WHERE <criteria> ORDER BY <sort>;
CREATE INDEX i_temp_1234 ON temp_1234(serial_id); -- I think SQL already does this for you
If you can delay the index, it's faster than creating it first, but it's a marginal improvement most likely.
Also, create a tracking table where you insert the table name and the date. You can use this with a reaper process later (late at night) to DROP the day's tables (those more than, say, X hours old).
Full-table operations are much cheaper than inserting and deleting rows in a single shared table:
INSERT INTO page_table SELECT 'temp_1234', <sequence>, primary_key...
DELETE FROM page_table WHERE page_id = 'temp_1234';
That's just awful.
First of all, make sure you really need to do this. You're adding significant complexity, so go and measure whether the queries and pagination really hurt, or whether you just "feel like you should". The pagination can be handled with ROW_NUMBER() quite easily.
Assuming you go ahead, once you've got your query, clearly you need to build a cache so first you need to identify what the key is. It will be the SQL statement or operation identifier (name of stored procedure perhaps) and the criteria used. If you don't want to share between users then the user name or some kind of session ID too.
Now when you do a query, you first look up in this table with all the key data, then either:
a) You can't find it, so you run the query and add it to the cache, storing the criteria/keys and either the data or the PK of the data, depending on whether you want a snapshot or real time. Bear in mind that "real time" isn't really real time, because other users could be changing data under you.
b) You find it, so you retrieve the results (or join the PK to the underlying tables) and return them.
Of course now you need a background process to go and clean up the cache when it's been hanging around too long.
Like I said - you should really make sure you need to do this before you embark on it. In the example you give I don't think it's worth it.

Using a database table as a queue

I want to use a database table as a queue. I want to insert in it and take elements from it in the inserted order (FIFO). My main consideration is performance because I have thousands of these transactions each second. So I want to use a SQL query that gives me the first element without searching the whole table. I do not remove a row when I read it.
Does SELECT TOP 1 ..... help here?
Should I use any special indexes?
I'd use an IDENTITY field as the primary key to provide the uniquely incrementing ID for each queued item, and stick a clustered index on it. This would represent the order in which the items were queued.
To keep the items in the queue table while you process them, you'd need a "status" field to indicate the current status of a particular item (e.g. 0=waiting, 1=being processed, 2=processed). This is needed to prevent an item being processed twice.
When processing items in the queue, you'd need to find the next item in the table NOT currently being processed. This would need to be in such a way so as to prevent multiple processes picking up the same item to process at the same time as demonstrated below. Note the table hints UPDLOCK and READPAST which you should be aware of when implementing queues.
e.g. within a sproc, something like this:
DECLARE @NextID INT

BEGIN TRANSACTION

-- Find the next queued item that is waiting to be processed
SELECT TOP 1 @NextID = ID
FROM MyQueueTable WITH (UPDLOCK, READPAST)
WHERE Status = 0
ORDER BY ID ASC

-- If we've found one, mark it as being processed
IF @NextID IS NOT NULL
    UPDATE MyQueueTable SET Status = 1 WHERE ID = @NextID

COMMIT TRANSACTION

-- If we've got an item from the queue, return it to whatever is going to process it
IF @NextID IS NOT NULL
    SELECT * FROM MyQueueTable WHERE ID = @NextID
If processing an item fails, do you want to be able to try it again later? If so, you'll need to either reset the status back to 0 or something. That will require more thought.
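The claim-then-fetch pattern can be sketched in SQLite via Python. Note that UPDLOCK and READPAST are SQL Server hints with no SQLite equivalent, so this single-process sketch shows only the status-flag logic, not the concurrency handling:

```python
import sqlite3

# Single-process SQLite sketch of the claim-then-fetch queue. UPDLOCK and
# READPAST are SQL Server hints with no SQLite equivalent, so only the
# status-flag logic is shown here (0=waiting, 1=being processed).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyQueueTable (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        Status INTEGER NOT NULL DEFAULT 0,
        Payload TEXT
    )
""")
conn.executemany("INSERT INTO MyQueueTable (Payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])

def dequeue(conn):
    with conn:  # claim the next waiting item inside a transaction
        row = conn.execute(
            "SELECT ID FROM MyQueueTable WHERE Status = 0 ORDER BY ID LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is empty
        conn.execute("UPDATE MyQueueTable SET Status = 1 WHERE ID = ?", row)
        return row[0]

print(dequeue(conn), dequeue(conn))  # items come back in FIFO order: 1 2
```

Each call claims the lowest waiting ID and flips its status, so repeated calls walk the queue in insert order.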
Alternatively, don't use a database table as a queue, but something like MSMQ - just thought I'd throw that in the mix!
If you do not remove your processed rows, then you are going to need some sort of flag that indicates that a row has already been processed.
Put an index on that flag, and on the column you are going to order by.
Partition your table over that flag, so the dequeued transactions are not clogging up your queries.
If you really did receive 1,000 messages every second, that would result in 86,400,000 rows a day. You might want to think of some way to clean up old rows.
Everything depends on your database engine/implementation.
For me, simple queues on tables with the following columns:
id / task / priority / date_added
usually work.
I used priority and task to group tasks, and in case of a duplicated task I chose the one with the higher priority.
And don't worry - for modern databases "thousands" is nothing special.
This will not be any trouble at all as long as you use something to keep track of the datetime of the insert. See here for the mysql options. The question is whether you only ever need the absolute most recently submitted item or whether you need to iterate. If you need to iterate, then what you need to do is grab a chunk with an ORDER BY statement, loop through, and remember the last datetime so that you can use that when you grab your next chunk.
Perhaps adding LIMIT 1 to your SELECT statement would help, forcing the return after a single match...
Since you don't delete the records from the table, you need to have a composite index on (processed, id), where processed is the column that indicates if the current record had been processed.
The best thing would be creating a partitioned table for your records and make the PROCESSED field the partitioning key. This way, you can keep three or more local indexes.
However, if you always process the records in id order, and have only two states, updating the record would mean just taking the record from the first leaf of the index and appending it to the last leaf
The currently processed record would always have the least id of all unprocessed records and the greatest id of all processed records.
Create a clustered index over a date (or autoincrement) column. This will keep the rows in the table roughly in index order and allow fast index-based access when you ORDER BY the indexed column. Using TOP X (or LIMIT X, depending on your RDMBS) will then only retrieve the first x items from the index.
Performance warning: you should always review the execution plans of your queries (on real data) to verify that the optimizer doesn't do unexpected things. Also try to benchmark your queries (again on real data) to be able to make informed decisions.
I had the same general question of "how do I turn a table into a queue" and couldn't find the answer I wanted anywhere.
Here is what I came up with for Node/SQLite/better-sqlite3.
Basically just modify the inner WHERE and ORDER BY clauses for your use case.
const crypto = require("crypto");

module.exports.pickBatchInstructions = (db, batchSize) => {
  const runId = crypto.randomBytes(8).toString("hex"); // unique batch identifier
  const q_pickBatch = `
    UPDATE instructions
    SET
      status = '${status.INSTRUCTION_INPROGRESS}',
      run_id = '${runId}',
      mdate = datetime(datetime(), 'localtime')
    WHERE id IN (
      SELECT id
      FROM instructions
      WHERE status IS NOT '${status.INSTRUCTION_COMPLETE}'
        AND run_id IS NULL
      ORDER BY length(targetpath), id
      LIMIT ${batchSize}
    );
  `;
  db.prepare(q_pickBatch).run(); // claim the batch: change the status and set the run id

  // Fetch all rows claimed under this batch's run id
  const rows = db
    .prepare("SELECT * FROM instructions WHERE run_id = ?")
    .all(runId);
  return rows;
};
A very easy solution for this, in order to avoid transactions, locks, etc., is to use the change tracking mechanism (not change data capture). It utilizes versioning for each added/updated/removed row, so you can track what changes happened after a specific version.
So, you persist the last version and query the new changes.
If a query fails, you can always go back and query data from the last version.
Also, if you don't want to get all the changes with one query, you can get the top n ordered by version, store the greatest version you received, and use it to query again.
See, for example, Using Change Tracking in SQL Server 2008.

Semi-Distinct MySQL Query

I have a MySQL table called items that contains thousands of records. Each record has a user_id field and a created (datetime) field.
I'm trying to put together a query to SELECT 25 rows, passing a string of user IDs as a condition and sorted by created DESC.
In some cases, there might be just a few user ids, while in other instances, there may be hundreds.
If the result set is greater than 25, I want to pare it down by eliminating duplicate user_id records. For instance, if there were two records for user_id = 3, only the most recent (according to created datetime) would be included.
In my attempts at a solution, I am having trouble because while, for example, it's easy to get a result set of 100 (allowing duplicate user_id records), or a result set of 16 (using GROUP BY for unique user_id records), it's hard to get 25.
One logical approach, which may not be the correct MySQL approach, is to get the most recent record for each user_id, and then, if the result set is less than 25, begin adding a second record for each user_id until the 25-record limit is met (maybe a third, fourth, etc. record for each user_id would be needed).
Can this be accomplished with a MySQL query, or will I need to take a large result set and trim it down to 25 with code?
I don't think what you're trying to accomplish is possible as a SQL query. Your desire is to return 25 rows, no matter what the normal data groupings are whereas SQL is usually picky about returning based on data groupings.
If you want a purely MySQL-based solution, you may be able to accomplish this with a stored procedure. (Supported in MySQL 5.0.x and later.) However, it might just make more sense to run the query to return all 100+ rows and then trim it programmatically within the application.
This will get you the most recent record for each user:
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
    ON i1.user_id = i2.user_id
   AND i1.created < i2.created
WHERE i2.id IS NULL
This will get you the most recent two records for each user (by counting how many newer records exist for the same user):
SELECT i1.user_id, i1.created
FROM items AS i1
WHERE (SELECT COUNT(*)
       FROM items AS i2
       WHERE i2.user_id = i1.user_id
         AND i2.created > i1.created) < 2
Try working from there.
You could nicely put this into a stored procedure.
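The "no newer row for the same user" self-join can be checked in SQLite via Python (the items table, id column, and sample rows are invented to match the question):

```python
import sqlite3

# SQLite check of the "no newer row for the same user" self-join; the
# items table, id column, and sample rows are invented to match the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INT, created TEXT);
    INSERT INTO items (user_id, created) VALUES
        (3, '2024-01-01'), (3, '2024-01-05'),
        (7, '2024-01-03');
""")

latest = conn.execute("""
    SELECT i1.user_id, i1.created
    FROM items AS i1
    LEFT JOIN items AS i2
        ON i1.user_id = i2.user_id AND i1.created < i2.created
    WHERE i2.id IS NULL
    ORDER BY i1.user_id
""").fetchall()
print(latest)  # one row per user: their most recent item
```

User 3's older row joins to a newer one and is filtered out; only each user's latest row survives.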
My opinion is to use application logic, as this is very much application layer logic you are trying to implement at the DB level, i.e. filtering down the results to make the search more useful to the end user.
You could implement a stored procedure (personally I would never do such a thing) or just get the application to decide which 25 results.
One approach would be to get the most recent item from each user, followed by the most recent items from all users, and limit that. You could construct pathological examples where this probably isn't what you want, but it should be pretty good in general.
Unfortunately, there is no easy way :( I had to do something similar when I built a report for my company that would pull up customer disables logged in a database. The only problem was that the disconnect is run and logged every 30 minutes, so the rows were not distinct, since the timestamp was different in every disconnect. I solved this problem with subqueries. I don't have the exact code anymore, but I believe this is how I implemented it:
SELECT CORP, HOUSE, CUST,
    (
        SELECT TOP 1 hsd
        FROM #TempTable t2
        WHERE t1.corp = t2.corp
          AND t1.house = t2.house
          AND t1.cust = t2.cust
        ORDER BY t2.hsd DESC -- pick the most recent entry deterministically
    ) DisableDate
FROM #TempTable t1
GROUP BY corp, house, cust -- selecting distinct
So, my answer is to eliminate the non-distinct column from the query by using subqueries. There might be an easier way to do it, though. I'm curious to see what others post.
Sorry, I keep editing this; I keep trying to find ways to make it easier to show what I did.

SQL trigger for deleting old results

We have a database that we are using to store test results for an embedded device. There's a table with columns for different types of failures (details not relevant), along with a primary key 'keynum' and a 'NUM_FAILURES' column that lists the number of failures. We store passes and failures, so a pass has a '0' in 'NUM_FAILURES'.
In order to keep the database from growing without bounds, we want to keep the last 1000 results, plus any of the last 50 failures that fall outside of the 1000. So, worst case, the table could have 1050 entries in it. I'm trying to find the most efficient SQL insert trigger to remove extra entries. I'll give what I have so far as an answer, but I'm looking to see if anyone can come up with something better, since SQL isn't something I do very often.
We are using SQLITE3 on a non-Windows platform, if it's relevant.
EDIT: To clarify, the part that I am having problems with is the DELETE, and specifically the part related to the last 50 failures.
The reason you want to remove these entries is to keep the database from growing too big, not to keep it in some special state. For that I would really not use triggers; instead, set up a job to run at some interval cleaning up the table.
So far, I have ended up using a View combined with a Trigger, but I'm not sure it's going to work for other reasons.
CREATE VIEW tablename_view AS
SELECT keynum FROM tablename WHERE NUM_FAILURES != '0'
ORDER BY keynum DESC LIMIT 50;

CREATE TRIGGER tablename_trig
AFTER INSERT ON tablename
WHEN (((SELECT COUNT(*) FROM tablename) >= 1000) OR
      ((SELECT COUNT(NUM_FAILURES) FROM tablename WHERE NUM_FAILURES != '0') >= 50))
BEGIN
    DELETE FROM tablename
    WHERE ((((SELECT MAX(keynum) FROM tablename) - keynum) >= 1000)
           AND
           ((NUM_FAILURES = '0') OR ((SELECT MIN(keynum) FROM tablename_view) > keynum)));
END;
I think you may be using the wrong data structure. Instead I'd create two tables and pre-populate one with 1000 rows (successes) and the other with 50 (failures). Put a primary ID on each. Then, when you record a result, instead of inserting a new row, find the ID+1 value for the last timestamped record entered (looping back to 0 if > max(id) in the table) and update it with your new values.
This has the advantage of pre-allocating your storage, not requiring a trigger, and internally consistent logic. You can also adjust the size of the log very simply by just pre-populating more records rather than to have to change program logic.
There's several variations you can use on this, but the idea of using a closed loop structure rather than an open list would appear to match the problem domain more closely.
How about this:
DELETE
FROM table
WHERE ( id <= ( SELECT max(id) - 1000 FROM table )
        AND num_failures = 0
      )
   OR id <= ( SELECT max(id) - 1050 FROM table )
If performance is a concern, it might be better to delete on a periodic basis, rather than on each insert.
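Scaled down so the effect is visible (keep the last 10 rows plus failures from a wider last-15 window, against 30 invented rows), this style of retention DELETE can be tried in SQLite via Python:

```python
import sqlite3

# Scaled-down retention DELETE: keep the last 10 rows, plus failures from
# the last 15. Table and rows are invented; every 6th row is a failure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, num_failures INT)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [(i, 1 if i % 6 == 0 else 0) for i in range(1, 31)])

conn.execute("""
    DELETE FROM results
    WHERE ( id <= (SELECT MAX(id) - 10 FROM results)
            AND num_failures = 0 )
       OR id <= (SELECT MAX(id) - 15 FROM results)
""")

kept = [r[0] for r in conn.execute("SELECT id FROM results ORDER BY id")]
print(kept)  # rows 21-30 survive, plus failure row 18 from the wider window
```

Old successes go as soon as they fall out of the last-10 window, while failures get the wider last-15 grace window before they are removed too.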