Indexing the "last added" row of a set of strings - sql

I have a database with 1Mil+ rows in it.
This database consists (for the sake of this question) of 2 columns; user_id, and username.
These values are not controlled by my application; I am not always certain that these are the current correct values. All I know is that the user_id is guaranteed to be unique. I get periodic updates which allow me to update my database to ensure I have an "eventually consistent" version of the user_id/username mapping.
I would like to be able to retrieve the latest addition of a certain username; "older" results should be ignored.
I believe there are two possible approaches here:
- indexing: there should be an index of username:row (hashmap?) where username is always the last added username; so gets updated on each row addition, or update.
- Setting username as unique, and doing an on conflict update to set the old row to the empty string, and the new row to the username
From what I've understood about indexing, it sounds like its the faster option (and wont require me checking the unicity of 1Mil rows in my database). I also hear hashmaps are a pain because they require rebuilding, so feel free to give other ideas.
My current implementation does a full search over the entire database, which is beginning to get quite slow at 1Mil+ rows. It currently gets the "last" value of this added string; which I am not even sure is a valid assumption at this point.
Given a sample database:
user_id, username
3 , bob
2 , alice
4 , joe
1 , bob
I would expect a search of `username = bob` to return (1, bob).
I cannot rely on ID ordering to solve this, since there is no linearity to which ID is assigned to which username.

You can do this using:
select distinct on (id) s.*
from sample s
where s.username = 'bob'
order by s.id desc;
For performance, you want an index on sample(username, id).
Alternatively, if you are doing periodic bulk updates, then you can construct a version of the table with unique rows per username:
create table most_recent_sample as
select max(id) as id, username
from sample
group by username;
create index idx_most_recent_sample_username on most_recent_sample(username);
This might take a short amount of time, but you are doing the update anyway.

Related

How does SQL SELECT statement work?

I created a table with id as primary key, firstname, lastname, email as fields. I issued the following query:
"SELECT email,COUNT(email) FROM mytable;"
The result had the first value for email i.e from the first record and the total number of values from the column 'email'. Why did it not display all the different values for email from different records?
https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html says:
Without GROUP BY, there is a single group and it is nondeterministic which name value to choose for the group.
Like with any grouping query, the COUNT(*) returns the count of rows in the group, and reduces the result to one row per group.
But since you don't have any GROUP BY clause, the whole table is treated as one big group, and the result only returns one row.
The email column returns one of the email values in the group. Technically this value is chosen arbitrarily from one of the rows in the group.
In practice, MySQL's implementation chooses the value from the first row read from the group, the order of which depends on which index was used to scan the table. Though this behavior of picking the first row is not documented and they make no guarantee to make this behavior the same from version to version.
It's better to avoid depending on queries that return an arbitrary, implementation-dependent value. It makes your code harder to maintain, because there's a risk it will change if you upgrade MySQL and they change their undocumented behavior.
To protect you from making these sorts of arbitrary queries, newer versions of MySQL enforce an SQL mode called ONLY_FULL_GROUP_BY. That option makes it an error to run a query like that one you show. For what it's worth, the query is already an error in most other brands of SQL database.

How to force ID column to remain sequential even if a recored has been deleted, in SQL server?

I don't know what is the best wording of the question, but I have a table that has 2 columns: ID and NAME.
when I delete a record from the table the related ID field deleted with it and then the sequence spoils.
take this example:
if I deleted row number 2, the sequence of ID column will be: 1,3,4
How to make it: 1,2,3
ID's are meant to be unique for a reason. Consider this scenario:
**Customers**
id value
1 John
2 Jackie
**Accounts**
id customer_id balance
1 1 $500
2 2 $1000
In the case of a relational database, say you were to delete "John" from the database. Now Jackie would take on the customer_id of 1. When Jackie goes in to check here balance, she will now show $500 short.
Granted, you could go through and update all of her other records, but A) this would be a massive pain in the ass. B) It would be very easy to make mistakes, especially in a large database.
Ids (primary keys in this case) are meant to be the rock that holds your relational database together, and you should always be able to rely on that value regardless of the table.
As JohnFx pointed out, should you want a value that shows the order of the user, consider using a built in function when querying.
In SQL Server identity columns are not guaranteed to be sequential. You can use the ROW_NUMBER function to generate a sequential list of ids when you query the data from the database:
SELECT
ROW_NUMBER() OVER (ORDER BY Id) AS SequentialId,
Id As UniqueId,
Name
FROM dbo.Details
If you want sequential numbers don't store them in the database. That is just a maintenance nightmare, and I really can't think of a very good reason you'd even want to bother.
Just generate them dynamically using tSQL's RowNumber function when you query the data.
The whole point of an Identity column is creating a reliable identifier that you can count on pointing to that row in the DB. If you shift them around you undermine the main reason you WANT an ID.
In a real world example, how would you feel if the IRS wanted to change your SSN every week so they could keep the Social Security Numbers sequential after people died off?

How to construct an sqlite table that assign and returns IDs to any name?

I would like to have an sqlite table that maps names into unique IDs. I can create this table in the following way:
CREATE TABLE name_to_id (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)
With a select statement I can get the row containing a needed name and get from this row the corresponding ID.
The problem appears if I try to get ID for a name that is not yet in the table. The expected behavior in this case is that the new name will be added and its newly generated ID will be returned. I have two possible solutions/implementations of that.
The first solution is trivial:
We check if name is in the table.
If not we insert a row with the name.
We select the row with the name and read the needed ID from that row.
I do not like this solution because it can happen that the first process checks if the name in the table, it sees that the name is not there, meanwhile another process adds the name to the table and then the first process tries to add the same name.
The second solution seems to be better:
For any name we use insert if not exist.
We select from the table the row containing the name and get its ID.
Is the second solution optimal or there are better solutions?
The normal way to avoid duplicate entries in a table is to create an unique constraint. The database will then check for you if the record is already there and fail if so. That should be the best in terms of reliability and performance.
Next, the SQLite FAQ suggests to use the function last_insert_rowid() to fetch the ID instead of running a second query. This is actually the first question of the FAQ at all ;)
In pseudocode, the first solution looks like this:
cursor = db.execute("SELECT id FROM name_to_id WHERE name = ?", name)
if cursor.has_some_row:
id = cursor["id"]
else:
db.execute("INSERT INTO name_to_id(name) VALUES(?)", name)
id = db.last_insert_rowid
and the second like this:
db.execute("INSERT OR IGNORE INTO name_to_id(name) VALUES(?)", name)
cursor = db.execute("SELECT id FROM name_to_id WHERE name = ?", name)
id = cursor["id"]
The first solution requires a transaction around both commands, but this would be a good idea for the second solution, too, to avoid the overhead of multiple implicit transactions.
The second solution requires a unique constaint on name, but this would be a good idea for the first solution, too, for correctness and to speed up the name lookups.
Both solution use two SQL statements, and have similar speed.
(The second searches the row two times, but that data is cached.)
So there isn't anything obvious that makes one better that the other.

query - select data by first inserted [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
select bottom rows in natural order
People imagine that i have this table :
persons
columns of the table are NAME and ID
and i insert this
insert into persons values ('name','id');
insert into persons values ('John','1');
insert into persons values ('Jack','3');
insert into persons values ('Alice','2');
How can i select this information order by the insertion? My query would like :
NAME ID
name id
John 1
Jack 3
Alice 2
Without indexs (autoincrements), it's possible?
I'm pretty sure its not. From my knowldege sql data order is not sequetional with respect to insertion. The only idea I have is along with each insertion have a timestamp and sort by that time stamp
This is not possible without adding a column or table containing a timestamp. You could add a timestamp column or create another table containing IDs and a timestamp and insert in to that at the same time.
You cannot have any assumptions about how the DBMS will store data and retrieve them without specifying order by clause. I.e. PostgreSQL uses MVCC and if you update any row, physically a new copy of a row will be created at the end of a table datafile. Using a plain select causes pg to use sequence scan scenario - it means that the last updated row will be returned as the last one.
I have to agree with the other answers, Without a specific field/column todo this... well its a unreliable way... While i have not actually ever had a table without an index before i think..
you will need something to index it by, You can go with many other approaches and methods... For example, you use some form of concat/join of strings and then split/separate the query results later.
--EDIT--
For what reason do you wish not to use these methods? time/autoinc
Without storing some sort of order information during insert, the database does not automatically keep track of every record ever inserted and their order (this is probably a good thing ;) ). Autoincrement cannot be avoided... even with timestamp, they can hold same value.

SQL Table Design - Identity Columns

SQL Server 2008 Database Question.
I have 2 tables, for arguments sake called Customers and Users where a single Customer can have 1 to n Users. The Customers table generates a CustomerId which is a seeded identity with a +1 increment on it. What I'm after in the Users table is a compound key comprising the CustomerId and a sequence number such that in all cases, the first user has a sequence of 1 and subsequent users are added at x+1.
So the table looks like this...
CustomerId (PK, FK)
UserId (PK)
Name
...and if for example, Customer 485 had three customers the data would look like...
CustomerId | UserId | Name
----------
485 | 1 | John
485 | 2 | Mark
485 | 3 | Luke
I appreciate that I can manually add the 1,2,3,...,n entry for UserId however I would like to get this to happen automatically on row insert in SQL, so that in the example shown I could effectively insert rows with the CustomerId and the Name with SQL Server protecting the Identity etc. Is there a way to do this through the database design itself - when I set UserId as an identity it runs 1 to infinity across all customers which isn't what I am looking for - have I got a setting wrong somewhere, or is this not an option?
Hope that makes sense - thanks for your help
I can think of no automatic way to do this without implementing a custom Stored Procedure that inserted the rows and checked to increment the Id appropriately, althouh others with more knowledge may have a better idea.
However, this smells to me of naturalising a surrogate key - which is not always a good idea.
More info here:
http://www.agiledata.org/essays/keys.html
That's not really an option with a regular identity column, but you could set up an insert trigger to auto populate the user id though.
The naive way to do this would be to have the trigger select the max user id from the users table for the customer id on the inserted record, then add one to that. However, you'll run into concurrency problems there if more than one person is creating a user record at the same time.
A better solution would be to have a NextUserID column on the customers table. In your trigger you would:
Start a transaction.
Increment the NextUserID for the customer (locking the row).
Select the updated next user id.
use that for the new User record.
commit the transaction.
This should ensure that simultaneous additions of users don't result in the same user id being used more than once.
All that said, I would recommend that you just don't do it. It's more trouble than it's worth and just smells like a bad idea to begin with.
So you want a generated user_id field that increments within the confines of a customer_id.
I can't think of one database where that concept exists.
You could implement it with a trigger. But my question is: WHY?
Surrogate keys are supposed to not have any kind of meaning. Why would you try to make a key that, simultaneously, is the surrogate and implies order?
My suggestions:
Create a date_created field, defaulting to getDate(). That will allow you to know the order (time based) in which each user_id was created.
Create an ordinal field - which can be updated by a trigger, to support that order.
Hope that helps.