How to get the data of the newly accessed record by a query on PostgreSQL using it's internal variables and functions? - sql

Let's say I have the following 'items' table in my PostgreSQL database:
id
item
value
1
a
10
2
b
20
3
c
30
For some reason I can't control I need to run the following query:
select max(value) from items;
which will return 30 as the result.
At this point, I know that I can find the record that contains that value using simple select statements, etc. That's not the actual problem.
My real questions are:
Does PostgreSQL know (behind the scenes) what's is the ID of that
record, although the query shows only the max value of the column
'value'?
If yes, can I have access to that information and,
therefore, get the ID and other data from the found record?
I'm not allowed to create indexes and sequences, or change way the max value is retrieved. That's a given. I need to work from that point onward and find a solution (which I have, actually, from regular query work).
I'm just guessing that the database knows in which record that information (30) is and that I could have access to it.
I've been searching for an answer for a couple of hours but wasn't able to find anything.
What am I missing? Any ideas?
Note: postgres (PostgreSQL) 12.5 (Ubuntu 12.5-0ubuntu0.20.10.1)

You can simply extract the whole record that contains max(value) w/o bothering about Postgres internals like this:
select id, item, "value"
from items
order by "value" desc
limit 1;
I do not think that using undocumented "behind the scenes" ways is a good idea at all. The planner is smart enough to do exactly what you need w/o extra work.

Related

Why does the number of returned samples where name='keyword' does not match the number of observed samples with 'keyword' in table?

I have a Postgres table whose header is [id(uuid), name(str), arg_name(str), measurements(list), run_id(uuid), parent_id(uuid)] with a total of 237K entries.
When I want to filter for specific measurements I can use 'name', but for the majority of entries in the table 'name' == 'arg_name' and thus map to the same sample.
In my peculiar case I am interested in retrieving samples whose 'name'='TimeM12nS' and whose 'arg_name'='Time'. These two attributes point to the same samples when visually inspecting the table through PgAdmin. That is to say all entries which have arg_name='Time' also have the name='TimeM12nS' and vice-versa.
Its obvious there's a problem because of the quantity of returned samples is not the same. I first noticed the problem using django orm, but the problem is also present when I query the DB using PgAdmin.
SELECT *
FROM TableA
WHERE name='TimeM12nS'
returns 301 entries (name='TimeM12nS' and arg_name='Time' in all cases)
BUT the query:
SELECT *
FROM TableA
WHERE arg_name='Time'
returns 3945 (name='TimeM12nS' and arg_name='Time' in all cases)
I am completely stumped, anyone think they can shed some light into what's happening here?
EDIT:
I should add that the query by 'arg_name' returns the 301 entries that are returned when querying by 'name'
First let me say thank you to everyone who pitched in ideas to solve this conundrum and especially to JGH for the solution (found in the comments of the original post).
Indeed the problem was a indexing issue. After re-indexing the queries return the same number of entries '3945' as expected.
In Postgress re-indexing a table can be achieved through pgAdmin by navigating to Databases > 'database_name' > Schemas > Tables then right-clicking on the table_name selecting Maintenance and pressing the REINDEX button.
or more simply by running the following command
REINDEX TABLE table_name
Postgress Re-Indexing Docs
Without access to the database, it's not possibly to give a definitive answer. All I can provide is the next query that I would use in this case.
SELECT COUNT(*), LENGTH(name), name, arg_name
FROM TableA
WHERE arg_name='Time'
GROUP BY name, arg_name;
This should show you any differences in the name column that you aren't able to see. The length of that string could also be informative.

Why is there no `select last` or `select bottom` in SQL Server like there is `select top`?

I know this might probably sound like a stupid question, but please bear with me.
In SQL-server we have
SELECT TOP N ...
now in that we can get the first n rows in ascending order (by default), cool. If we want records to be sorted on any other column, we just specify that in the order by clause, something like this...
SELECT TOP N ... ORDER BY [ColumnName]
Even more cool. But what if I want the last row? I just write something like this...
SELECT TOP N ... ORDER BY [ColumnName] DESC
But there is a slight concern with that. I said concern and not issue because it isn't actually an issue. By this way, I could get the last row based on that column, but what if I want the last row that was inserted. I know about SCOPE_IDENTITY, IDENT_CURRENT and ##IDENTITY, but consider a heap (a table without a clustered index) without any identity column, and multiple accesses from many places (please don't go into this too much as to how and when these multiple operation are happening, this doesn't concern the main thing). So in this case there doesn't seems to be an easy way to find which row was actually inserted last. Some might answer this as
If you do a select * from [table] the last row shown in the sql result window will be the last one inserted.
To anything thinking about this, this is not actually the case, at least not always and one that you can always rely on (msdn, please read the Advanced Scanning section).
So the question boils down to this, as in the title itself. Why doesn't SQL Server have a
SELECT LAST
or say
SELECT BOTTOM
or something like that, where we don't have to specify the Order By and then it would give the last record inserted in the table at the time of executing the query (again I am not going into details about how would this result in case of uncommitted reads or phantom reads).
But if still, someone argues that we can't talk about this without talking about these read levels, then, for them, we could make it behave as the same way as TOP work but just the opposite. But if your argument is then we don't need it as we can always do
SELECT TOP N ... ORDER BY [ColumnName] DESC
then I really don't know what to say here. I know we can do that, but are there any relation based reason, or some semantics based reason, or some other reason due to which we don't have or can't have this SELECT LAST/BOTTOM. I am not looking for way to does Order By, I am looking for reason as to why do don't have it or can't have it.
Extra
I don't know much about how NOSQL works, but I've worked (just a little bit) with mongodb and elastic search, and there too doesn't seems to be anything like this. Is the reason they don't have it is because no one ever had it before, or is it for some reason not plausible?
UPDATE
I don't need to know that I need to specify order by descending or not. Please read the question and understand my concern before answering or commenting. I know how will I get the last row. That's not even the question, the main question boils down to why no select last/bottom like it's counterpart.
UPDATE 2
After the answers from Vladimir and Pieter, I just wanted to update that I know the the order is not guaranteed if I do a SELECT TOP without ORDER BY. I know from what I wrote earlier in the question might make an impression that I don't know that's the case, but if you just look a further down, I have given a link to msdn and have mentioned that the SELECT TOP without ORDER BY doesn't guarantees any ordering. So please don't add this to your answer that my statement in wrong, as I have already clarified that myself after a couple of lines (where I provided the link to msdn).
You can think of it like this.
SELECT TOP N without ORDER BY returns some N rows, neither first, nor last, just some. Which rows it returns is not defined. You can run the same statement 10 times and get 10 different sets of rows each time.
So, if the server had a syntax SELECT LAST N, then result of this statement without ORDER BY would again be undefined, which is exactly what you get with existing SELECT TOP N without ORDER BY.
You have stressed in your question that you know and understand what I've written below, but I'll still keep it to make it clear for everyone reading this later.
Your first phrase in the question
In SQL-server we have SELECT TOP N ... now in that we can get the
first n rows in ascending order (by default), cool.
is not correct. With SELECT TOP N without ORDER BY you get N "random" rows. Well, not really random, the server doesn't jump randomly from row to row on purpose. It chooses some deterministic way to scan through the table, but there could be many different ways to scan the table and server is free to change the chosen path when it wants. This is what is meant by "undefined".
The server doesn't track the order in which rows were inserted into the table, so again your assumption that results of SELECT TOP N without ORDER BY are determined by the order in which rows were inserted in the table is not correct.
So, the answer to your final question
why no select last/bottom like it's counterpart.
is:
without ORDER BY results of SELECT LAST N would be exactly the same as results of SELECT TOP N - undefined.
with ORDER BY result of SELECT LAST N ... ORDER BY X ASC is exactly the same as result of SELECT TOP N ... ORDER BY X DESC.
So, there is no point to have two key words that do the same thing.
There is a good point in the Pieter's answer: the word TOP is somewhat misleading. It really means LIMIT result set to some number of rows.
By the way, since SQL Server 2012 they added support for ANSI standard OFFSET:
OFFSET { integer_constant | offset_row_count_expression } { ROW | ROWS }
[
FETCH { FIRST | NEXT } {integer_constant | fetch_row_count_expression } { ROW | ROWS } ONLY
]
Here adding another key word was justified that it is ANSI standard AND it adds important functionality - pagination, which didn't exist before.
I would like to thank #Razort4x here for providing a very good link to MSDN in his question. The "Advanced Scanning" section there has an excellent example of mechanism called "merry-go-round scanning", which demonstrates why the order of the results returned from a SELECT statement cannot be guaranteed without an ORDER BY clause.
This concept is often misunderstood and I've seen many question here on SO that would greatly benefit if they had a quote from that link.
The answer to your question
Why doesn't SQL Server have a SELECT LAST or say SELECT BOTTOM or
something like that, where we don't have to specify the ORDER BY and
then it would give the last record inserted in the table at the time
of executing the query (again I am not going into details about how
would this result in case of uncommitted reads or phantom reads).
is:
The devil is in the details that you want to omit. To know which record was the "last inserted in the table at the time of executing the query" (and to know this in a somewhat consistent/non-random manner) the server would need to keep track of this information somehow. Even if it is possible in all scenarios of multiple simultaneously running transactions, it is most likely costly from the performance point of view. Not every SELECT would request this information (in fact very few or none at all), but the overhead of tracking this information would always be there.
So, you can think of it like this: by default the server doesn't do anything specific to know/keep track of the order in which the rows were inserted, because it affects performance, but if you need to know that you can use, for example, IDENTITY column. Microsoft could have designed the server engine in such a way that it required an IDENTITY column in every table, but they made it optional, which is good in my opinion. I know better than the server which of my tables need IDENTITY column and which do not.
Summary
I'd like to summarise that you can look at SELECT LAST without ORDER BY in two different ways.
1) When you expect SELECT LAST to behave in line with existing SELECT TOP. In this case result is undefined for both LAST and TOP, i.e. result is effectively the same. In this case it boils down to (not) having another keyword. Language developers (T-SQL language in this case) are always reluctant to add keywords, unless there are good reasons for it. In this case it is clearly avoidable.
2) When you expect SELECT LAST to behave as SELECT LAST INSERTED ROW. Which should, by the way, extend the same expectations to SELECT TOP to behave as SELECT FIRST INSERTED ROW or add new keywords LAST_INSERTED, FIRST_INSERTED to keep existing keyword TOP intact. In this case it boils down to the performance and added overhead of such behaviour. At the moment the server allows you to avoid this performance penalty if you don't need this information. If you do need it IDENTITY is a pretty good solution if you use it carefully.
There is no select last because there is no need for it. Consider a "select top 1 * from table" . Top 1 would get you the first row that is returned. And then the process stops.
But there is no guarantees about ordering if you don't specify an order by. So it may as well be any row in the dataset you get back.
Now do a "select last 1 * from table". Now the database will have to process all the rows in order to get you the last one.
And because ordering is non-deterministic, it may as well be the same result as from the select "top 1".
See now where the problem comes? Without an order by top and last are actually the same, only "last" will take more time. And with an order by, there's really only a need for top.
SELECT TOP N ...
now in that we can get the first n rows in ascending order (by
default), cool. If we want records to be sorted on any other column,
we just specify that in the order by clause, something like this...
What you say here is totally wrong and absolutely NOT how it works. There is no guarantee on what order you get. Ascending order on what ?
create table mytest(id int, id2 int)
insert into mytest(id,id2)values(1,5),(2,4),(3,3),(4,2),(5,1)
select top 1 * from mytest
select * from mytest
create clustered index myindex on mytest(id2)
select top 1 * from mytest
select * from mytest
insert into mytest(id,id2)values(6,0)
select top 1 * from mytest
Try this code line by line and see what you get with the last "select top 1".....you get in this case the last inserted record.
update
I think you understand that "select top 1 * from table" basically means: "Select a random row from the table".
So what would last mean? "Select the last random row from the table?" Wouldn't the last random row from a table be conceptually the same as saying any 1 random row from the table? And if that's true, top and last are the same, so there is no need for last.
Update 2
In hindsight I was happier with the syntax mysql uses : LIMIT.
Top doesn't say anything about ordering it is only there to specify the number of rows to be returned.
Limits the rows returned in a query result set to a specified number of rows or percentage of rows in SQL Server 2014.
The reasons why SELECT LAST_INSERTED does not make sense.
It cannot be easily applied to non-heap tables.
Heap data can be freely moved by DBMS so those "natural" order is subject to change. To keep it the system needs some additional mechanism which seems to be a useless waste.
If really desired it can be simulated by adding some 'auto-increment' column.
SQL Server ordering is arbitrary unless otherwise stated. It's set based, therefore you must define what your set is. Correct SCOPE_IDENTITY() is the correct way to capture the last inserted record, or the OUTPUT clause. Why would you do inserts on a heap that you need to reference chronologically anyway?? That's super bad database design.

Query to Find Adjacent Date Records

There exists in my database a page_history table; the idea is that whenever a record in the page table is changed, that record's old values are stored in the history table.
My job now is to find occasions in which a record was changed, and retrieve the pre- and post-conditions of that change. Specifically, I want to know when a page changed groups, and what groups were involved in the change. The query I have below can find these instances, but with the use of the min function, I can only get back the values that match between the two records:
select page_id,
original_group,
min(created2) change_date
from (select h.page_id,
h.group_id original_group,
i.group_id new_group,
h.created_dttm created1,
i.created_dttm created2
from page_history h,
page_history i
where h.page_id = i.page_id
and h.created_dttm < i.created_dttm
and h.group_id != i.group_id)
group by page_id, original_group, created1
order by page_id
When I try to get, say, any details of the second record, like new_group, I'm hit with a ORA-00979: not a GROUP BY expression error. I don't want to group by new_group, though, because that's going to destroy the logic (I think it would find records displaying times a page changed from a group to another group, regardless of any changes to other groups in between).
My question, then, is how can I modify this query, or go about writing a new one, that achieves a similar end, but with the added availability of columns that do not match between the two records? In essence, how can I find that min record without sacrificing all the other columns I'm not trying to compare? I don't exactly need a complete answer, any suggestions that point me in the right direction would be appreciated.
I use PL/SQL Developer, and it looks like version 11.2.0.2.0 of Oracle.
EDIT: I have found a solution. It's not pretty, and I'd still like to see some alternatives, but if helping me out would threaten to explode your brain, I would advise relocating to an easier question.
Without seeing your table structure it's hard to re-write the query but when you have a min function used like that it invariably seems better to put it into a separate sub select to get what you want and then compare the result of that.

Counting occurence of each distinct element in a table

I am writing a log viewer app in ASP.NET / C#. There is a report window, where it will be possible to check some information about the whole database. One kind of information there I want to display on the screen is the number of times each generator (an entity in my domain, not Firebirds sequence) appears in the table. How do I do that using COUNT ?
Do I have to :
Gather the key for each different generator
Run one query for each generator key using count
Display it somehow
Is there any way that I can do it without having to do two queries to the database? The database size can be HUGE, and having to query it "X" times where "X" is the number of generators would just suck.
I am using a Firebird database, is there any way to fetch this information from any metadata schema or there is no such thing available?
Basically, what I want is to count each occurrence of each generator in the table. Result would be something like : GENERATOR A:10 times,GENERATOR B:7 Times,Generator C:0 Times and so on.
If I understand your question correctly, it is a simple matter of using the GROUP BY clause, e.g.:
select
key,
count(*)
from generators
group by key;
Something like the query below should be sufficient (depending on your exact structure and requirements)
SELECT KEY, COUNT(*)
FROM YOUR_TABLE
GROUP BY KEY
I solved my problem using this simple Query:
SELECT GENERATOR_,count(*)
FROM EVENTSGENERAL GROUP BY GENERATOR_;
Thanks for those who helped me.
It took me 8 hours to come back and post the answer,because of the StackOverflow limitation to answer my own questions based in my reputation.

Explanation of particular sql injection

Browsing through the more dubious parts of the web, I happened to come across this particular SQL injection:
http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4,5,6--
My knowledge of SQL - which I thought was half decent - seems very limiting as I read this.
Since I develop extensively for the web, I was curious to see what this code actually does and more importantly how it works.
It replaces an improperly written parametrized query like this:
$sql = '
SELECT *
FROM products
WHERE id = ' . $_GET['id'];
with this query:
SELECT *
FROM products
WHERE id = 1
UNION ALL
select 0,1,concat_ws(user(),0x3A,database(),0x3A,version()),3,4,5,6
, which gives you information about the database name, version and username connected.
The injection result relies on some assumptions about the underlying query syntax.
What is being assumed here is that there is a query somewhere in the code which will take the "id" parameter and substitute it directly into the query, without bothering to sanitize it.
It's assuming a naive query syntax of something like:
select * from records where id = {id param}
What this does is result in a substituted query (in your above example) of:
select * from records where id = 1 union select 0, 1 , concat_ws(user(),0x3a,database(),0x3a,version()), 3, 4, 5, 6 --
Now, what this does that is useful is that it manages to grab not only the record that the program was interested in, but also it UNIONs it with a bogus dataset that tells the attacker (these values appear separated by colons in the third column):
the username with which we are
connected to the database
the name of the database
the version of the db software
You could get the same information by simply running:
select concat_ws(user(),0x3a,database(),0x3a,version())
Directly at a sql prompt, and you'll get something like:
joe:production_db:mysql v. whatever
Additionally, since UNION does an implicit sort, and the first column in the bogus data set starts with a 0, chances are pretty good that your bogus result will be at the top of the list. This is important because the program is probably only using the first result, or there is an additional little bit of SQL in the basic expression I gave you above that limits the result set to one record.
The reason that there is the above noise (e.g. the select 0,1,...etc) is that in order for this to work, the statement you are calling the UNION with must have the same number of columns as the first result set. As a consequence, the above injection attack only works if the corresponding record table has 7 columns. Otherwise you'll get a syntax error and this attack won't really give you what you want. The double dashes (--) are just to make sure anything that might happen afterwords in the substitution is ignored, and I get the results I want. The 0x3a garbage is just saying "separate my values by colons".
Now, what makes this query useful as an attack vector is that it is easily re-written by hand if the table has more or less than 7 columns.
For example if the above query didn't work, and the table in question has 5 columns, after some experimentation I would hit upon the following query url to use as an injection vector:
http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4--
The number of columns the attacker is guessing is probably based on an educated look at the page. For example if you're looking at a page listing all the Doodads in a store, and it looks like:
Name | Type | Manufacturer
Doodad Foo Shiny Shiny Co.
Doodad Bar Flat Simple Doodads, Inc.
It's a pretty good guess that the table you're looking at has 4 columns (remember there's most likely a primary key hiding somewhere if we're searching by an 'id' parameter).
Sorry for the wall of text, but hopefully that answers your question.
this code adds an additional union query to the select statement that is being executed on page.php. The injector has determined that the original query has 6 fields, thus the selection of the numeric values (column counts must match with a union). the concat_ws just makes one field with the values for the database user , the database, and the version, separated by colons.
It seems to retrieve the user used to connect to the database, the database adress and port, the version of it. And it will be put by the error message.