I would appreciate some guidance on the following query. We have a list of experiments and their current progress state (for simplicity, I've reduced the statuses to 4 types, but we have 10 different statuses in our data). I eventually need to return a list of the current status of all non-finished experiments.
Given a table exp_status,
Experiment | ID | Status
----------------------------
A | 1 | Starting
A | 2 | Working On It
B | 3 | Starting
B | 4 | Working On It
B | 5 | Finished Type I
C | 6 | Starting
D | 7 | Starting
D | 8 | Working On It
D | 9 | Finished Type II
E | 10 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
H | 14 | Starting
H | 15 | Working On It
H | 16 | Finished Type II
Desired Result Set:
Experiment | ID | Status
----------------------------
A | 2 | Working On It
C | 6 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
The most recent ID number will correspond to the most recent status.
Now, the current code I have executes in 150 seconds.
SELECT *
FROM
(SELECT Experiment, ID, Status,
row_number () over (partition by Experiment
order by ID desc) as rn
FROM exp_status)
WHERE rn = 1
AND status NOT LIKE ('Finished%')
The thing is, this code wastes a lot of effort. The result set is 45 thousand rows pulled from a table of 3.9 million, because most experiments are in a finished status. The code orders every partition and only filters out the finished experiments at the end. About 95% of the experiments in the table are in the finished phase. I could not figure out how to make the query first pick out all the experiments and statuses where there isn't a 'Finished' row for that experiment. I tried the following, but it had very slow performance.
SELECT *
FROM exp_status
WHERE experiment NOT IN
(
SELECT experiment
FROM exp_status
WHERE status LIKE ('Finished%')
)
Any help would be appreciated!
Given your requirement, I think your current query with row_number() is one of the most efficient possible. This query takes time not because it has to sort the data, but because there is so much data to read in the first place (the extra CPU time is negligible compared to the fetch time). Furthermore, the first query does a FULL SCAN, which really is the best way to read lots of data.
You need to find a way to read far fewer rows if you want to improve performance. The second query doesn't go in the right direction:
the inner query will likely be a full scan, since the 'Finished' rows will be spread across the whole table and likely represent a big percentage of all rows.
the outer query will also likely be a full scan plus a nice ANTI-HASH JOIN, which should be quicker than 45k * (number of status changes per experiment) non-unique index scans.
So the second query seems to need at least twice the number of reads (plus a join).
If you want to really improve performance, I think you will need a change of design.
You could for instance build a table of active experiments and join to this table. You would maintain this table either as a materialized view or with a modification to the code that inserts experiment statuses. You could go further and store the last status in this table. Maintaining this "last status" will likely be an extra burden but this could be justified by the improved performance.
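A minimal sketch of the second option (maintaining the table from the code that inserts statuses). The names active_experiment, last_id and last_status are made up, and the literals in the MERGE stand in for the values of the newly inserted row:
-- Hypothetical helper table: one row per experiment with its latest status.
CREATE TABLE active_experiment (
  experiment   VARCHAR2(10) PRIMARY KEY,
  last_id      NUMBER,
  last_status  VARCHAR2(30)
);
-- Run by the same code (or a trigger) whenever a status row is inserted.
MERGE INTO active_experiment a
USING (SELECT 'B' AS experiment, 5 AS id, 'Finished Type I' AS status FROM dual) s
ON (a.experiment = s.experiment)
WHEN MATCHED THEN
  UPDATE SET a.last_id = s.id, a.last_status = s.status
WHEN NOT MATCHED THEN
  INSERT (experiment, last_id, last_status) VALUES (s.experiment, s.id, s.status);
-- The report then reads only this small table instead of the 3.9M-row history.
SELECT experiment, last_id, last_status
FROM active_experiment
WHERE last_status NOT LIKE 'Finished%';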
Consider partitioning your table by status
www.orafaq.com/wiki/Partitioning_FAQ
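If you go this route, a list partition on status lets the finished rows live in their own partition; a minimal sketch, assuming the simplified 4-status model (your real 10 statuses would need to be mapped to partitions):
CREATE TABLE exp_status_part (
  experiment  VARCHAR2(10),
  id          NUMBER,
  status      VARCHAR2(30)
)
PARTITION BY LIST (status) (
  PARTITION p_active   VALUES ('Starting', 'Working On It'),
  PARTITION p_finished VALUES ('Finished Type I', 'Finished Type II'),
  PARTITION p_other    VALUES (DEFAULT)
);
Note that partition pruning is only guaranteed when the predicate names the active statuses explicitly (e.g. status IN ('Starting', 'Working On It')); a NOT LIKE 'Finished%' predicate is unlikely to prune.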
You could also create materialized views to avoid having to recalculate your aggregations if these types of queries are frequent.
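For instance, a sketch of an on-demand materialized view that precomputes the latest status per experiment (the view name is made up; refresh it on whatever schedule the reporting can tolerate):
CREATE MATERIALIZED VIEW exp_last_status
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT experiment,
       MAX(id) AS last_id,
       MAX(status) KEEP (DENSE_RANK LAST ORDER BY id) AS last_status
FROM exp_status
GROUP BY experiment;
Reports would then filter the much smaller exp_last_status on last_status NOT LIKE 'Finished%'.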
Could you provide the execution plans of your queries? Without those it is difficult to know the exact reason it is taking so long.
You can improve your first query slightly by using this variant:
select experiment
, max(id) id
, max(status) keep (dense_rank last order by id) status
from exp_status
group by experiment
having max(status) keep (dense_rank last order by id) not like 'Finished%'
If you compare the plans, you'll notice one step fewer.
Regards,
Rob.
I have a huge database of eCommerce transactions on Redshift, running into about 900 million rows, with the headers being somewhat similar to this.
id | date_stamp | location | item | amount
001 | 2009-12-28 | A1 | Apples | 2
002 | 2009-12-28 | A2 | Juice | 2
003 | 2009-12-28 | A1 | Apples | 1
004 | 2009-12-28 | A4 | Apples | 2
005 | 2009-12-29 | A1 | Juice | 6
006 | 2009-12-29 | A4 | Apples | 2
007 | 2009-12-29 | A1 | Water | 7
008 | 2009-12-28 | B7 | Juice | 14
Is it possible to find trends within items? For example, if I wanted to see how "Apples" performed in terms of sales, between 2009-12-28 and 2011-12-28, at location A4, how would I go about it? Ideally I would like to generate a table with positive/negative trending, somewhat similar to the post here -
Aggregate function to detect trend in PostgreSQL
I have performed similar analysis on small data sets in R, and even visualizing it using ggplot isn't a big challenge, but the sheer size of the database is causing me some troubles, and extremely long querying times as well.
For example,
select *
from fruitstore.sales
where item = 'Apple' and location = 'A1'
order by date_stamp
limit 1000000;
takes about 2500 seconds to execute, and times out often.
I appreciate any help on this.
900M rows is quite a bit for stock Postgres to handle. One of the MPP variants (like Citus) would be able to handle it better.
Another option is to change how you're storing the data. A far more efficient structure would be to have 1 row for each month/item/location, and store an int array of amounts. That would cut things down to ~300M rows, which is much more manageable. I suspect most of your analysis tools will want to see the data as an array anyway.
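A sketch of that restructuring in plain Postgres (array_agg as used here is a Postgres feature, so this fits the Postgres/Citus route rather than Redshift itself; the sales_monthly name is made up):
-- One row per month/item/location, with the per-day totals rolled up into an int array.
CREATE TABLE sales_monthly AS
SELECT date_trunc('month', day)::date AS month,
       item,
       location,
       array_agg(daily_total ORDER BY day) AS daily_amounts
FROM (
    SELECT date_stamp::date AS day,
           item,
           location,
           sum(amount) AS daily_total
    FROM fruitstore.sales
    GROUP BY 1, 2, 3
) d
GROUP BY 1, 2, 3;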
Take a look at window functions. They're great for this type of use case. They were a bit tough for me to get my head around but can save you some serious contortions with SQL.
This will show you how many apples were sold per day for the period you're interested in:
select date_trunc('day', date_stamp) as day, count(*) as sold
from fruitstore.sales
where item = 'Apple' and location = 'A4'
and date_stamp::date >= '2009-12-28'::date and date_stamp::date <= '2011-12-28'::date
group by 1 order by 1 asc
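Building on that, here is a window-function sketch of a simple day-over-day trend (lag() is available in both Postgres and Redshift; swap count(*) for sum(amount) if you want units sold rather than transaction counts):
SELECT day,
       sold,
       sold - lag(sold) OVER (ORDER BY day) AS change_vs_previous_day
FROM (
    SELECT date_trunc('day', date_stamp) AS day, count(*) AS sold
    FROM fruitstore.sales
    WHERE item = 'Apple' AND location = 'A4'
      AND date_stamp::date BETWEEN '2009-12-28'::date AND '2011-12-28'::date
    GROUP BY 1
) daily
ORDER BY day;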
Regarding performance, avoid using select * in Redshift. It's a columnar store where data for different columns is spread across nodes. Being explicit about the columns and only referencing the ones you use will save Redshift from moving a lot of unneeded data over the network.
Make sure you're picking good distkey and sortkeys for your tables. In a time series table the timestamp should definitely be one of the sortkeys. Enabling compression on your tables can help too.
Schedule regular VACUUM and ANALYZE runs on your tables.
Also if there's any way to restrict the range of data you're looking at by filtering possible records out in the where clause, it can help a lot. For example, if you know you only care about the trend for the last few days it can make a huge difference to limit on time like:
where date_stamp >= sysdate::date - '5 day'::interval
Here's a good article with performance tips.
To filter results in your SQL query, you can use a WHERE clause:
SELECT *
FROM myTable
WHERE
item='Apple' AND
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
Using Aggregate functions, you can summarize fruit sales between two dates at a location, for instance:
SELECT item as "fruit", sum(amount) as "total"
FROM myTable
WHERE
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
GROUP BY item
Your question asking how apples "fared" isn't terribly descriptive, but a WHERE clause and aggregate functions (don't forget your GROUP BY) are probably where you need to aim.
I am looking to calculate cumulative sum across columns in Google Big Query.
Assume there are five columns (NAME,A,B,C,D) with two rows of integers, for example:
NAME | A | B | C | D
----------------------
Bob | 1 | 2 | 3 | 4
Carl | 5 | 6 | 7 | 8
I am looking for a windowing function or UDF to calculate the cumulative sum across rows to generate this output:
NAME | A | B | C | D
-------------------------
Bob | 1 | 3 | 6 | 10
Carl | 5 | 11 | 18 | 26
Any thoughts or suggestions greatly appreciated!
I think there are a number of reasonable workarounds for your requirement, mostly in the area of designing your table better. It all really depends on how you input your data and, most importantly, how you then consume it.
Still, if we stay with the requirements as presented - the below is not exactly the output you expect in your question, but it might be useful as an example:
SELECT name, GROUP_CONCAT(STRING(cum)) AS all FROM (
SELECT name,
SUM(INTEGER(num))
OVER(PARTITION BY name
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum
FROM (
SELECT name, SPLIT(all) AS num FROM (
SELECT name,
CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d)) AS all
FROM yourtable
)
)
)
GROUP BY name
Output is:
name all
Bob 1,3,6,10
Carl 5,11,18,26
Depending on how you then consume this data, it can still work for you.
Note that you now avoid writing something like col1 + col2 + .. + col89 + col90, but you still need to mention each column explicitly, just once.
In case you have the "luxury" of implementing your requirements outside of the GBQ UI, in some client, you can use the BigQuery API to programmatically acquire the table schema, build your logic/query on the fly, and then execute it.
Take a look at below APIs to start with:
To get table schema - https://cloud.google.com/bigquery/docs/reference/v2/tables/get
To issue query job - https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert
There's no need for a UDF:
SELECT name, a, a+b, a+b+c, a+b+c+d
FROM tab
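If you want the result to keep the original column headers, just alias each running total (a trivial variant of the query above):
SELECT name,
       a,
       a + b AS b,
       a + b + c AS c,
       a + b + c + d AS d
FROM tab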
The website stores information about all the specifications of a large number of items and lets the user search through the data by adding filtering conditions at the front-end. At the back-end all the conditions are translated into clauses and combined with the AND operator.
My aim is to give the user an idea of how many goods are thrown away or left after each filter. Exact numbers aren't very important for the initial sieving (some fuzziness or approximation is fine, because the whole set is quite large), but at later stages, when there are ten or so items left, the user should get the proper count.
There's the obvious straightforward way of making as many SELECT COUNT queries as there are filters, but I feel there might be some technique to achieve this in a more elegant way, without abusing the DB much.
There are many ways to achieve this with varying levels of difficulty and performance.
The first and most obvious way to me is to simply do a count on the filters which performs fairly well and is not that difficult to implement. An alternative but similar approach would be to group by the values and do a count.
Here's a fiddle as an example of both methods: http://sqlfiddle.com/#!15/0cdcb/26
select
count(product.id) total,
sum((v0.value = 'spam')::int) v0_is_spam,
sum((v0.value != 'spam')::int) v0_not_spam,
sum((v1.value = 'spam')::int) v1_is_spam,
sum((v1.value != 'spam')::int) v1_not_spam
from product
left join specification_value v0 on v0.product_id = product.id and v0.specification_id = 1
left join specification_value v1 on v1.product_id = product.id and v1.specification_id = 2;
select specification.id, value, count(*)
from specification
left join specification_value on specification.id = specification_value.specification_id
group by specification.id, value;
A slightly more difficult way to do something like that is using window functions, a lot more flexible but not as easy to grasp. Docs are here: http://www.postgresql.org/docs/9.3/static/tutorial-window.html
Example query and results:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
depname | empno | salary | avg
-----------+-------+--------+-----------------------
develop | 11 | 5200 | 5020.0000000000000000
develop | 7 | 4200 | 5020.0000000000000000
develop | 9 | 4500 | 5020.0000000000000000
develop | 8 | 6000 | 5020.0000000000000000
develop | 10 | 5200 | 5020.0000000000000000
personnel | 5 | 3500 | 3700.0000000000000000
personnel | 2 | 3900 | 3700.0000000000000000
sales | 3 | 4800 | 4866.6666666666666667
sales | 1 | 5000 | 4866.6666666666666667
sales | 4 | 4800 | 4866.6666666666666667
(10 rows)
And lastly, by far the fastest, but also the most inaccurate and the most difficult to implement: using the database statistics to guess the number of rows. I would not opt for this unless you have millions of rows within a filter set and no way to reduce it further. Also, don't do this unless the performance is really so bad that it's needed.
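For completeness, a rough sketch of what the statistics-based approach could look like in Postgres, reusing the product/specification_value names from the fiddle above (treat the numbers as estimates only):
-- Planner estimate for a filtered query: read "Plan Rows" from the JSON instead of counting.
EXPLAIN (FORMAT JSON)
SELECT 1
FROM product
JOIN specification_value v0
  ON v0.product_id = product.id AND v0.specification_id = 1
WHERE v0.value = 'spam';
-- Whole-table estimate straight from the catalog statistics.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'product';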
The following is a problem which is not well-suited to an RDBMS, I think, but that is what I've got to deal with.
I am trying to write a tool to search through logs stored in a database.
Some rows might be:
Time | ID | Object | Description
2012-01-01 13:37 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | Bad
2012-01-01 14:08 | 4 | 1 | Good
2012-01-01 14:27 | 5 | 1 | Bad
2012-01-01 14:30 | 6 | 2 | Good
Object is a foreign key. In practice, Time will increase with ID but that is not an actual constraint. In reality there are more fields. It's a Postgres database - I'd like to be able to support SQLite as well but am aware this may well be impossible.
Now, I want to be able to run a query for, say, all Bad events that happened to Object 2:
SELECT * FROM table WHERE Object = 2 AND Description = 'Bad';
But it would often be useful to see some lines of context around the results - just as the -C option to grep is very useful when searching through text logs.
For the above query, if we wanted one line of context either side, we would want rows 2 and 6 in addition to row 3.
If the original query returned multiple rows, more context would need to be retrieved.
Notice that the context is not retrieved from the events associated with Object 1; we eliminate only the restriction on the Description.
Also, the order involved, and hence what determines what is adjacent to what, is that induced by the Time field.
This specifies what I want to achieve, but the database concerned is fairly big, at least in comparison to the power of the machine it's running on.
The most often cited solution for getting adjacent rows requires you to run one extra query per result of what I'll call the base query; this is no good because that might be thousands of queries.
My current least bad solution is to run a query to retrieve the IDs of all possible rows that could be context - in the above example, that would be a search for all rows relating to Object 2. Then I get the IDs matching the base query, expand (using the list of all possible IDs) to a list of IDs of rows matching the base query or in context, and finally retrieve the data for those IDs.
This works, but is inelegant and slow.
It is especially slow when using the tool from a remote computer, as that initial list of IDs can be very large, and retrieving it and then transmitting it over the internet can take an inordinate amount of time.
Another solution I have tried is using a subquery or view that computes the "buffer sequence" of the rows.
Here's what the table looks like with this field added:
Time | ID | Sequence | Object | Description
2012-01-01 13:37 | 1 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 1 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | 2 | Bad
2012-01-01 14:08 | 4 | 2 | 1 | Good
2012-01-01 14:27 | 5 | 3 | 1 | Bad
2012-01-01 14:30 | 6 | 3 | 2 | Good
Running the base query on this table then allows you to generate the list of IDs you want by adding or subtracting from the Sequence value.
This eliminates the problem of transferring loads of rows over the wire, but now the database has to run this complicated subquery, and it's unacceptably slow, especially on the first run - given the use-case, queries are sporadic and caching is not very effective.
If I were in charge of the schema I'd probably just store this field there in the database, but I'm not, so any suggestions for improvements are welcome. Thanks!
You should use the ROW_NUMBER windowing function
http://www.postgresql.org/docs/current/static/functions-window.html
Adjacency is an abstract construct and relies on an explicit sort (or a PARTITION BY ... ORDER BY) ... do you mean the row with the preceding timestamp?
Decide what sort of "adjacent" you want, then compute ROW_NUMBER over that criterion.
Once you have that, you would just JOIN each row to the rows having ROW_NUMBER +/- 1, as sketched below.
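A sketch of that idea for the example query (one line of context on either side; row numbering is restricted to Object 2 and ordered by Time, and t is a placeholder table name). Window functions are available in Postgres and, if SQLite support matters, in SQLite from 3.25 onwards:
WITH numbered AS (
    SELECT t.*,
           row_number() OVER (ORDER BY time, id) AS rn
    FROM t
    WHERE object = 2
)
SELECT DISTINCT ctx.*
FROM numbered hit
JOIN numbered ctx
  ON ctx.rn BETWEEN hit.rn - 1 AND hit.rn + 1
WHERE hit.description = 'Bad'
ORDER BY ctx.time;
Widening the BETWEEN range gives more context, and a PARTITION BY can be added to the window if you want to number several objects at once.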
You can try this with SQLite:
SELECT DISTINCT t2.*
FROM (SELECT * FROM t WHERE object=2 AND description='Bad') t1
JOIN
(SELECT * FROM t WHERE object=2) t2
ON t1.id = t2.id OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time<t1.time ORDER BY t.time DESC LIMIT 1) OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time>t1.time ORDER BY t.time ASC LIMIT 1)
ORDER BY t2.time
;
Change the LIMIT values for more context.
SQL to find duplicate entries (within a group)
I have a small problem and I'm not sure what would be the best way to fix it, as I only have limited access to the database (Oracle) itself.
In our table "EVENT" we have about 160k entries. Each event has a GROUPID, and a normal entry has exactly 5 rows with the same GROUPID. Due to a bug we currently get some duplicate entries (duplicates, so 10 rows instead of 5, just with a different EVENTID; this may change, so the condition is simply <> 5). We need to find all the entries of these groups.
Due to limited access to the database we can not use a temporary table, nor can we add an index to the GROUPID column to make it faster.
We can get the GROUPIDs with this query, but we would need a second query to get the needed data
select A."GROUPID"
from "EVENT" A
group by A."GROUPID"
having count(A."GROUPID") <> 5
One solution would be a subselect:
select *
from "EVENT" A
where A."GROUPID" IN (
select B."GROUPID"
from "EVENT" B
group by B."GROUPID"
having count(B."GROUPID") <> 5
)
Without an index on GROUPID and 160k entries, this takes much too long.
I tried thinking about a join that could handle this, but haven't found a good solution so far.
Can anybody find a good solution for this, maybe?
Small edit:
We don't have 100% duplicates here, as each entry still has a unique ID, and the GROUPID is not unique either (that's why we need to use "group by") - or maybe I'm just missing an easy solution for it :)
A small example of the data (I don't want to delete it, just find it):
EVENTID | GROUPID | TYPEID
123456  | 123     | 12
123457  | 123     | 145
123458  | 123     | 2612
123459  | 123     | 41
123460  | 123     | 238
234567  | 123     | 12
234568  | 123     | 145
234569  | 123     | 2612
234570  | 123     | 41
234571  | 123     | 238
It has some more columns, like timestamp etc, but as you can see already, everything is identical, besides the EVENTID.
We will run it more often for testing, to find the bug and check if it happens again.
A classic problem for analytic queries to solve:
select eventid,
groupid,
typeid
from (
Select eventid,
groupid,
typeid,
count(*) over (partition by groupid) count_by_group_id
from EVENT
)
where count_by_group_id <> 5
You can get the answer with a join instead of a subquery
select
a.*
from
event a
inner join
(select groupid
from event
group by groupid
having count(*) <> 5) b
on a.groupid = b.groupid
This is a fairly common way of obtaining all the information out of the rows in a group.
Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.
If the DBA decides against the index, make sure the appropriate people understand that it's the indexing strategy, and not the way you wrote the query, that is slowing things down.
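For reference, the index being discussed is nothing more exotic than this (index name made up):
CREATE INDEX event_groupid_idx ON event (groupid);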
How long does that SQL actually take? You are only going to run it once I presume, having fixed the bug that caused the corruption in the first place? I just set up a test case like this:
SQL> create table my_objects as
2 select object_name, ceil(rownum/5) groupid, rpad('x',500,'x') filler
3 from all_objects;
Table created.
SQL> select count(*) from my_objects;
COUNT(*)
----------
83782
SQL> select * from my_objects where groupid in (
2 select groupid from my_objects
3 group by groupid
4 having count(*) <> 5
5 );
OBJECT_NAME GROUPID FILLER
------------------------------ ---------- --------------------------------
XYZ 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
YYYY 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed: 00:00:01.67
Less than 2 seconds. OK, my table has half as many rows as yours, but 160K isn't huge. I added the filler column to make the table take up some disk space. The AUTOTRACE execution plan was:
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 389 | 112K| 14029 (2)|
|* 1 | HASH JOIN | | 389 | 112K| 14029 (2)|
| 2 | VIEW | VW_NSO_1 | 94424 | 1198K| 6570 (2)|
|* 3 | FILTER | | | | |
| 4 | HASH GROUP BY | | 1 | 1198K| 6570 (2)|
| 5 | TABLE ACCESS FULL| MY_OBJECTS | 94424 | 1198K| 6504 (1)|
| 6 | TABLE ACCESS FULL | MY_OBJECTS | 94424 | 25M| 6506 (1)|
-------------------------------------------------------------------------
If your DBAs won't add an index to make this faster, ask them what they suggest you do (that's what they're paid for, after all). Presumably you have a business case why you need this information in which case your immediate management should be on your side.
Perhaps you could ask your DBAs to duplicate the data into a database where you could add an index.
From a SQL perspective I think you've already answered your own question. The approach you've described (ie using the sub-select) is fine, and I'd be surprised if any other way of writing the query differed vastly in performance.
160K records doesn't seem like a lot to me. I could understand if you were unhappy with the performance of that query if it was going into a piece of application code, but from the sounds of it you're just using it as part of some data cleansing exercise (and so I'd expect you to be a little more tolerant in performance terms).
Even without any supporting index, it's still just two full table scans on 160K rows, which, frankly, I'd expect to perform in some sort of vaguely reasonable time.
Talk to your db administrators. They've helped create the problem, so let them be part of the solution.
/EDIT/ In the meantime, run the query you have. Find out how long it takes, rather than guessing. Even better would be to run it, with set autotrace on, and post the results here, then we might be able to help you refine it somewhat.
Does this do what you want, and does it offer better performance? (I just thought I'd throw it in as a suggestion.)
select *
from group g
where (select count(*) from event e where g.groupid = e.groupid) <> 5
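If there is no separate table of groups in your schema, the same correlated-count idea can be pointed at EVENT itself (a sketch; the DISTINCT inline view stands in for the missing group table and returns the offending GROUPIDs):
SELECT g.groupid
FROM (SELECT DISTINCT groupid FROM event) g
WHERE (SELECT COUNT(*) FROM event e WHERE e.groupid = g.groupid) <> 5;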
How about an analytic query:
SELECT * FROM (
SELECT eventid, groupid, typeid, COUNT(groupid) OVER (PARTITION BY groupid) group_count
FROM event
)
WHERE group_count <> 5