I'm converting a SAS script to Python for a PostgreSQL environment. In a few places I've found a data statement in SAS, which looks something like this (in multiple scripts):
data dups;
set picc;
by btn wtn resp_ji;
if not (first.resp_ji and last.resp_ji);
run;
Obviously datasets aren't the same in Python or SQL environments, and I'm having trouble determining what this specific statement is doing. To be clear, a number of the scripts being converted create a dataset in this manner with this same name, so my expectation is that most of these would be overwritten over and over.
I'm also unclear as to what the Postgres equivalent of the condition in the data dups statement would be.
Is there an obvious PostgreSQL statement that would work in its place? Something like this?
CREATE TABLE dups AS
SELECT btn, wtn, resp_ji
FROM picc
WHERE /* some condition that matches the condition in the data statement */;
Does the
by btn wtn resp_ji;
statement specify which columns are copied over, or is it the equivalent of an ORDER BY clause in PostgreSQL?
Thanks.
The statement is using what's called 'by group processing'. Before the step can run, it requires that the data is sorted by btn wtn resp_ji.
The first.resp_ji piece is checking to see if it's the first time it's seen the current value of resp_ji within the current btn/wtn combination. Likewise the last.resp_ji piece is checking if it's the final time that it will see the current value of resp_ji within the current btn/wtn combination.
Combining it all together, the statement:
if not (first.resp_ji and last.resp_ji);
is saying: if the current value of resp_ji occurs multiple times for the current combination of btn/wtn, then keep the record; otherwise discard it. The behaviour of the if statement when used like that implicitly keeps/discards the record.
To do the equivalent in SQL, you could do something like:
1. Find all records to discard.
2. Discard those records from the original dataset.
So...
create table rows_to_discard as
select btn, wtn, resp_ji, count(*) as freq
from mytable
group by btn, wtn, resp_ji
having count(*) = 1;

create table want as
select a.*
from mytable a
left join rows_to_discard b
       on b.btn = a.btn
      and b.wtn = a.wtn
      and b.resp_ji = a.resp_ji
where b.btn is null;
EDIT: I should mention that there is no simple SQL equivalent. It may be possible by numbering rows in subqueries and then building logic on top of that, but it'd be ugh-ly. It may also depend on the specific flavour of SQL being used.
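For what it's worth, in PostgreSQL a window function can express the same keep-only-the-duplicates logic in a single statement. A sketch, using the question's picc table and keeping just the three key columns (the helper column freq would otherwise come along for the ride):
CREATE TABLE dups AS
SELECT btn, wtn, resp_ji
FROM (
    SELECT btn, wtn, resp_ji,
           count(*) OVER (PARTITION BY btn, wtn, resp_ji) AS freq
    FROM picc
) t
WHERE freq > 1;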
As someone who learned SAS before PostgreSQL, I found the following much more similar to SAS first./last. logic:
--first.
select distinct on (resp_ji) * from <table> order by resp_ji;
--last.
select distinct on (resp_ji) * from <table> order by resp_ji desc;
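Extending that to the question's by-groups, a sketch against the picc table; since SAS's first./last. depend on physical row order, some tie-breaking column is needed to define which row counts as "last" (ctid is used below purely as a stand-in):
-- first.resp_ji within each btn/wtn combination
SELECT DISTINCT ON (btn, wtn, resp_ji) *
FROM picc
ORDER BY btn, wtn, resp_ji;
-- last.resp_ji: same idea with a descending tie-breaker
SELECT DISTINCT ON (btn, wtn, resp_ji) *
FROM picc
ORDER BY btn, wtn, resp_ji, ctid DESC;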
A way to detect duplicates (when no extra differentiating field is available) is to use the ctid as a tie-breaker:
CREATE TABLE dups AS
SELECT *
FROM picc p
WHERE EXISTS (
   SELECT *
   FROM picc x
   WHERE x.btn = p.btn
     AND x.wtn = p.wtn
     AND x.resp_ji = p.resp_ji
     AND x.ctid <> p.ctid
);
Related
I have data in several views that I would like to run a check against for computed data.
The first step involves a query that returns several rows with a VehicleID column that should be used in the "for each" aspect of the next query; this example has been simplified.
The next step gets the entries from the view [dbo].[viewDataVehicle] that match the VehicleID and returns a row with the VehicleID, Timestamp and Speed.
From here I need to calculate the average of these Speed values and then select all rows where Speed > AverageSpeed + SpeedVariable (a variable that should be set in the query).
The result should output the entry rows if the condition is met, with an additional OverAverage column (let's say it's a boolean TRUE or FALSE, which in this example would all be TRUE).
This is repeated for each of the other VehicleIDs and the final result is a table containing all the rows that matched the conditions.
I can group by and format later on so this aspect is not important.
How would I write a query to do this?
Generally, in SQL SELECT statements, wherever you have a FROM TableName you can substitute another SELECT statement for TableName. So, start with selecting your vehicle ids:
select vehicleId
from <table>
where <whatever>
Now match to your view:
select vdv.vehicleId, Timestamp, speed
from dbo.viewDataVehicle vdv
inner join
(
select vehicleId
from <table>
where <whatever>
) v on v.vehicleId = vdv.vehicleId
Now use this as input to your next step, and so on. For example, the averaging and filtering step might look like the sketch below.
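A sketch of that step (viewDataVehicle and its columns come from the question; the literal 5 stands in for SpeedVariable, and OverAverage is emitted as a constant because every row that survives the filter satisfies the condition):
SELECT vehicleId, Timestamp, Speed, 'TRUE' AS OverAverage
FROM (
    SELECT vdv.vehicleId, vdv.Timestamp, vdv.Speed,
           AVG(vdv.Speed) OVER (PARTITION BY vdv.vehicleId) AS AvgSpeed
    FROM dbo.viewDataVehicle vdv
    INNER JOIN (
        SELECT vehicleId
        FROM <table>
        WHERE <whatever>
    ) v ON v.vehicleId = vdv.vehicleId
) t
WHERE Speed > AvgSpeed + 5  -- 5 = SpeedVariable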
As Lamu says in a comment, with SQL never think of individual rows: always think of sets. RAT (or Row at a Time) is not the way to go.
I'm running a pretty straightforward query using the database/sql and lib/pq (postgres) packages and I want to toss the results of some of the fields into a slice, but I need to know how big to make the slice.
The only solution I can find is to do another query that is just SELECT COUNT(*) FROM tableName;.
Is there a way to both get the result of the query AND the count of returned rows in one query?
Conceptually, the problem is that the database cursor may not be enumerated to the end, so the database does not really know how many records you will get before you actually read all of them. The only way to count (in the general case) is to go through all the records in the result set.
But practically, you can force it to do so by using a subquery like
select *, (select count(*) from table) from table
and just ignore the second column for records other than the first. But it is very rude and I do not recommend doing so.
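In PostgreSQL specifically, a window function achieves the same effect in one scan and reads less awkwardly; a sketch, with table and column names assumed:
-- every row of the result carries the total number of rows in the result set
SELECT mycol, count(*) OVER () AS total_rows
FROM mytable
WHERE foo = 'bar';
On the Go side you can then scan total_rows from the first row, allocate the slice with that length, and keep scanning.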
Not sure if this is what you are asking for, but you can call the @@ROWCOUNT function (SQL Server) to return the count of the previous select statement that has been executed.
SELECT mytable.mycol FROM mytable WHERE mytable.foo = 'bar'
SELECT @@ROWCOUNT
If you want the row count included in your result set, you can use the OVER clause (MSDN):
SELECT mytable.mycol, count(*) OVER(PARTITION BY mytable.foo) AS 'Count' FROM mytable WHERE mytable.foo = 'bar'
You could also just separate the two SQL statements with a semicolon (;). This would return a result set for both statements executed.
You would use count(*):
SELECT count(distinct last)
FROM XYZTable
WHERE date(FROM_UNIXTIME(time)) >= '2013-10-28'
  AND id = 90;
I have two separate databases. I am trying to update a column in one database to the values of a column from the other database:
UPDATE customer
SET customer_id=
(SELECT t1 FROM dblink('port=5432, dbname=SERVER1 user=postgres password=309245',
'SELECT store_key FROM store') AS (t1 integer));
This is the error I am receiving:
ERROR: more than one row returned by a subquery used as an expression
Any ideas?
Technically, to remove the error, add LIMIT 1 to the subquery to return at most 1 row. The statement would still be nonsense.
... 'SELECT store_key FROM store LIMIT 1' ...
Practically, you want to match rows somehow instead of picking an arbitrary row from the remote table store to update every row of your local table customer.
I assume a text column match_name in both tables (UNIQUE in store) for the sake of this example:
... 'SELECT store_key FROM store
WHERE match_name = ' || quote_literal(customer.match_name) ...
But that's an extremely expensive way of doing things.
Ideally, you completely rewrite the statement.
UPDATE customer c
SET customer_id = s.store_key
FROM dblink('port=5432, dbname=SERVER1 user=postgres password=309245'
, 'SELECT match_name, store_key FROM store')
AS s(match_name text, store_key integer)
WHERE c.match_name = s.match_name
AND c.customer_id IS DISTINCT FROM s.store_key;
This remedies a number of problems in your original statement.
Obviously, the basic error is fixed.
It's typically better to join in additional relations in the FROM clause of an UPDATE statement than to run correlated subqueries for every individual row.
When using dblink, the above becomes a thousand times more important. You do not want to call dblink() for every single row; that's extremely expensive. Call it once to retrieve all the rows you need.
With correlated subqueries, if no row is found in the subquery, the column gets updated to NULL, which is almost always not what you want. In my updated query, the row only gets updated if a matching row is found. Else, the row is not touched.
Normally, you wouldn't want to update rows, when nothing actually changes. That's expensively doing nothing (but still produces dead rows). The last expression in the WHERE clause prevents such empty updates:
AND c.customer_id IS DISTINCT FROM s.store_key
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The fundamental problem can often be simply solved by changing an = to IN, in cases where you've got a one-to-many relationship. For example, if you wanted to update or delete a bunch of accounts for a given customer:
WITH accounts_to_delete AS
(
SELECT account_id
FROM accounts a
INNER JOIN customers c
ON a.customer_id = c.id
WHERE c.customer_name='Some Customer'
)
-- this fails if "Some Customer" has multiple accounts, but works if there's 1:
DELETE FROM accounts
WHERE accounts.guid =
(
SELECT account_id
FROM accounts_to_delete
);
-- this succeeds with any number of accounts:
DELETE FROM accounts
WHERE accounts.guid IN
(
SELECT account_id
FROM accounts_to_delete
);
This means your nested SELECT returns more than one row.
You need to add a proper WHERE clause to it.
This error means that the SELECT store_key FROM store query has returned two or more rows in the SERVER1 database. If you would like to update all customers, use a join instead of a scalar = operator. You need a condition to "connect" customers to store items in order to do that.
If you wish to update all customer_ids to the same store_key, you need to supply a WHERE clause to the remotely executed SELECT so that the query returns a single row.
Use LIMIT 1 so it will return only 1 row.
Example
customer_id = (select id from enumeration where enumeration.name = 'Ready To Invoice' limit 1)
The subquery produces a number of rows, which needs proper handling. The issue can be resolved by making the subquery return a single row, for example by:
1. limiting the query to return one single row (LIMIT 1), or
2. using an aggregate such as select max(column), which always returns a single row (see the sketch below).
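A sketch of the second option, reusing the names from the earlier example:
-- max() always yields exactly one row (NULL if nothing matches)
UPDATE customer
SET customer_id = (SELECT max(id)
                   FROM enumeration
                   WHERE name = 'Ready To Invoice');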
I have a batch job that I run on a table which I'm sure I could write as a prepared statement. Currently it's all in Java and no doubt less efficient than it could be. For a table like so:
CREATE TABLE thing (
  `tag` varchar(255),      -- MySQL requires a length for varchar
  `document` varchar(255),
  `weight` float
);
I want to create a new table that contains the top N entries for every tag. Currently I do this:
create new table with same schema
select distinct tag
for each tag:
    select * limit N, insert into the new table
This requires executing a query to get the distinct tags, then selecting the top N items for that tag and inserting them... all very inefficient.
Is there a stored procedure (or even a simple query) that I could use to do this? If dialect is important, I'm using MySQL.
(And yes, I do have my indexes sorted!)
Cheers
Joe
I haven't done this in a while (spoiled by CTEs in SQL Server), and I'm assuming that "top N" is determined by weight; try
SELECT tag, document, weight
FROM thing
WHERE (SELECT COUNT(*)
       FROM thing AS t
       WHERE t.tag = thing.tag
         AND t.weight < thing.weight) < N;
I think that will do it.
EDIT: corrected error in code; need < N, not <= N.
If you were using SQL Server, I would suggest using the ROW_NUMBER function, partitioned by tag, and selecting where row_number <= N. (So in other words, order and number the rows for each tag according to their position in the tag group, then pick the top N rows from each group.) I found an article about simulating the ROW_NUMBER function in MySQL here:
http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/
See if this helps you out!
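For reference, on engines with window functions (MySQL 8.0+, PostgreSQL; SQL Server uses SELECT ... INTO rather than CREATE TABLE AS) the simulation becomes unnecessary. A sketch, with top_things as an assumed target table and N = 10:
CREATE TABLE top_things AS
SELECT tag, document, weight
FROM (
    SELECT tag, document, weight,
           ROW_NUMBER() OVER (PARTITION BY tag ORDER BY weight DESC) AS rn
    FROM thing
) ranked
WHERE rn <= 10;  -- N = 10; flip DESC to ASC if "top" means lowest weight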
I'm doing a probability calculation. I have a query to calculate the total number of times an event occurs. From these events, I want to get the number of times a sub-event occurs. The query to get the total events is 25 lines long and I don't want to just copy + paste it twice.
I want to do two things to this query: calculate the number of rows in it, and calculate the number of rows in the result of a query on this query. Right now, the only way I can think of doing that is this (replace #total# with the complicated query to get all rows, and #conditions# with the less-complicated conditions that rows, from #total#, must have to match the sub-event):
SELECT (SELECT COUNT(*) FROM (#total#) AS t1 WHERE #conditions#) AS suboccurs,
COUNT(*) AS totaloccurs FROM (#total#) as t2
As you can see, #total# appears twice. Is there any way around this? Is there a better way to do what I'm trying to do?
To re-emphasize: #conditions# does depend on what #total# returns (it does stuff like t1.foo = bar).
Some final notes: #total# by itself takes ~250ms. This more complicated query takes ~300ms, so postgres is likely doing some optimization, itself. Still, the query looks terribly ugly with #total# literally pasted in twice.
If your SQL dialect supports subquery factoring, then rewriting the query using a WITH clause is an option. It allows subqueries to be used more than once. In Oracle, WITH will create them as either an inline view or a temporary table.
Here is a contrived example.
WITH
x AS
(
    SELECT this
    FROM there
    WHERE something is true
),
y AS
(
    SELECT this_other_thing
    FROM somewhereelse
    WHERE something else is true
),
z AS
(
    SELECT count(*) AS k
    FROM x
)
SELECT z.k, y.*, x.*
FROM x, y, z
WHERE x.abc = y.abc
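Applied to the question's shape, it might look like this sketch (#total# and #conditions# stay as the question's placeholders; count(*) FILTER is PostgreSQL 9.4+, with count(CASE WHEN ... THEN 1 END) as the portable fallback):
WITH total AS (
    -- the complicated 25-line #total# query, written only once
    SELECT * FROM events  -- stand-in
)
SELECT count(*)                          AS totaloccurs,
       count(*) FILTER (WHERE sub_event) AS suboccurs  -- sub_event stands in for #conditions#
FROM total;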
SELECT COUNT(*) AS totaloccurs,
       COUNT(CASE WHEN #conditions# THEN 1 END) AS suboccurs
FROM (#total#) AS t1
Put the reused sub-query into a temp table, then select what you need from the temp table.
#EvilTeach:
I've not seen the "with" (probably not implemented in Sybase :-(). I like it: does what you need in one chunk then goes away, with even less cruft than temp tables. Cool.