SELECT only a certain set of rows at a time - sql

I need to select data from one table and insert it into another table. Currently the SQL looks something like this:
INSERT INTO A (x, y, z)
SELECT x, y, z
FROM B b
WHERE ...
However, the SELECT is huge, resulting in over 2 millions rows and we think it is taking up too much memory. Informix, the db in this case, runs out of virtual memory when the query is run.
How would I go about selecting and inserting a set of rows (say 2000)? Given that I don't think there are any row ids etc.

You can do SELECT FIRST n * from Table. Where n is the amount of rows you want, say 2000. Also, in the WHERE clause do an embedded select that checks the table you are inserting in to for rows already existing. So that the next time the statement is ran, it will not include already inserted data.

I assume that you have some script that this is executed from? You can just loop and limit as long as you order the values returned from the nested select. Here is some pseudo code.
total = SELECT COUNT(x) FROM B WHERE ...
while (total > 0)
INSERT INTO A (x, y, z) SELECT x, y, z FROM B b WHERE ... ORDER BY x LIMIT 2000
total = total - 2000
end

I'm almost certain that IDS only lets you use the FIRST clause where the data is returned to the client1, and that is something you want to avoid if at all possible.
You say you get an out of memory error (rather than, say, a long transaction aborted error)? Have you looked at the configuration of your server to ensure it has a reasonable amount of memory?
It depends in part on how big your data set is, and what the constraints are - why you are doing the load across tables. But I would normally aim to determine a way of partitioning the data into loadable subsets and run those sequentially in a loop. For example, if the sequence numbers are between 1 and 10,000,000, I might run the loop ten times, with condition on the sequence number for AND seqnum >= 0 AND seqnum < 1000000' and thenAND seqnum >= 1000000 AND seqnum < 2000000', etc. Preferably in a language with the ability to substitute the range via variables.
This is a bit nuisancy, and you want to err on the conservative side in terms of range size (more smaller partitions rather than fewer bigger ones - to reduce the risk of running out of memory).
1 Over-simplifying slightly. A stored procedure would have to count as 'the client', for example, and the communication cost in a stored procedure is (a lot) less than the cost of going to the genuine client.

Related

Subquery is taking much longer then normal

I encounter a problem with optimization.
When I use a query like this:
Select * (around 100 columns)
from x
where RepoDate = '2020-05-18'
It's taking around 0.2 seconds.
Unless I'm using query like this:
Select * (around 100 columns)
from x
where RepoDate = (select max(RepoDate) from y)
It takes around 1 hour.
Table y has only dates (2020-05-17, 2020-05-18, ... )
Can you tell me why there is so much difference in time to execute?
Technically, you are comparing a simple first query with a query having another subquery that might be processing a heavy table "y" already.
So these two queries are not the same to start with. We need to have explain plans to have an estimation of the cost of the subquery first (i.e. how many rows, index usage etc.) then we move to the parent query.

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time-window. The time window can be size of "one day" to "many years". There are measurement values round about every minute in DB.
So the number of entries for a time-window can be quite different. Say from few hundrets to several thousands or millions.
Those values are ment to be visualiuzed in a graphical chart-diagram on a webpage.
If the chart is - lets say - 800px wide, it does not make sense to get thousands of rows from database if time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to this. If PostgreSQL f.i. provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unixtimestamp but a date and "trunc" it, avergage the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structue seems not really optimized for this: Query for last year took ~75sec (slow test machine) but finally got only a one value per day. This can be combined with avg(value), but this further increases query time... (~82sec). I will see if it's possible to further optimize this. But I now have an idea how "downsampling" data works, especially with aggregation in combination with "group by".
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
row_number() over (partition by tile order by timestamp) as seqnum
from (select md.*, ntile(800) over (order by timestamp) as tile
from measuredata md
where . . . -- your filtering conditions here
) md
) md
where seqnum = 1;

Why does select result fields double data scanned in BigQuery

I have a table with 2 integer fields x,y and few millions of rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it will first find the relevant rows based on where clause and then bring the extra field without scanning the entire second field.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundred of possible fields which I need to fetch for a very small number of rows (50) which answer the query.
Does this means I will need to processed all fields data?
* I'm aware how columnar database works, but wasn't aware for the huge price when you want to brings lots of fields based on a very specific where clause.
The following link provide very clear answer:
best-practices-performance-input
BigQuery does not have a concept of index or something like that. When you query a field column, BigQuery will scan through all the values of that column and then make the operations you want (for a deeper deep understanding they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ, you just load your data there and it just works. It does force you to be aware on how much data you retrieve from each query. Queries of the type select * from table should be used only if you really need all columns.

postgres: get random entries from table - too slow

In my postgres database, I have the following relationships (simplified for the sake of this question):
Objects (currently has about 250,000 records)
-------
n_id
n_store_object_id (references store.n_id, 1-to-1 relationship, some objects don't have store records)
n_media_id (references media.n_id, 1-to-1 relationship, some objects don't have media records)
Store (currently has about 100,000 records)
-----
n_id
t_name,
t_description,
n_status,
t_tag
Media
-----
n_id
t_media_path
So far, so good. When I need to query the data, I run this (note the limit 2 at the end, as part of the requirement):
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
limit
2
This works fine and gives me two entries back, as expected. The execution time on this is about 20 ms - just fine.
Now I need to get 2 random entries every time the query runs. I thought I'd add order by random(), like so:
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
order by
random()
limit
2
While this gives the right results, the execution time is now about 2,500 ms (over 2 seconds). This is clearly not acceptable, as it's one of a number of queries to be run to get data for a page in a web app.
So, the question is: how can I get random entries, as above, but still keep the execution time within some reasonable amount of time (i.e. under 100 ms is acceptable for my purpose)?
Of course it needs to sort the whole thing according to random criteria before getting first rows. Maybe you can work around by using random() in offset instead?
Here's some previous work done on the topic which may prove helpful:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
I'm thinking you'll be better off selecting random objects first, then performing the join to those objects after they're selected. I.e., query once to select random objects, then query again to join just those objects that were selected.
It seems like your problem is this: You have a table with 250,000 rows and need two random rows. Thus, you have to generate 250,000 random numbers and then sort the rows by their numbers. Two seconds to do this seems pretty fast to me.
The only real way to speed up the selection is not have to come up with 250,000 random numbers, but instead lookup rows through an index.
I think you'd have to change the table schema to optimize for this case. How about something like:
1) Create a new column with a sequence starting at 1.
2) Every row will then have a number.
3) Create an index on: number % 1000
4) Query for rows where number % 1000 is equal to a random number
between 0 and 999 (this should hit the index and load a random
portion of your database)
5) You can probably then add on RANDOM() to your ORDER BY clause and
it will then just sort that chunk of your database and be 1,000x
faster.
6) Then select the first two of those rows.
If this still isn't random enough (since rows will always be paired having the same "hash"), you could probably do a union of two random rows, or have an OR clause in the query and generate two random keys.
Hopefully something along these lines could be very fast and decently random.

Biased random in SQL?

I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a boost factor.
So I essentially have the fields ID and BOOST.The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as much on ID 2 and 7 times as much on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10. I choose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.
So 0.5 hit goes to the second element
And 0.5 hit goes to the (second and first) element which is chosen from randomly so so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66
I need some modification or new a method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery. "Please choose a random ad with given probability"... however i need it for another purpose but just to give you a last picture what it should do.
edit:
Thanks to kens answer i thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET #randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors which added up surpasses the random value
then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: because the count of entries having one boost is always different. For example if there is only 1-boosted entry i get it in 1 of 10 calls, but if there are 1 million with 7, each of them is hardly ever returned...
so this doesnt work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor ... but i am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE #sample TABLE (id int, boost int)
INSERT #sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
#sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest straightforward solution with two queries, using cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select table rows, calculating cumulative sum of boosts, and find the first row, which has cumulative boost greater than {random number}:
SET #cumulative_boost=0;
SELECT
id,
#cumulative_boost:=(#cumulative_boost + boost) AS cumulative_boost,
FROM
table
WHERE
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: Every person had a calculated number of tickets in the final draw. If you had more tickets then you would have an higher chance to win "the lottery".
Since I didn't trust any of the found results rand() * multiplier or the one with -log(rand()) on the web I wanted to implement my own straightforward solution.
What I did and in your case would look a little bit like this:
(SELECT id, boost FROM foo) AS values
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= values.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum number of boost is less than the maximum returned in the number query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100 then joining on numbers.counter <= values.boost would mean that if a row has a boost of 2 it would end up duplicated in the final result. If a row has a boost of 100 it would end up in the final set 100 times. Or in another words. If sum of boosts is 4212 which it was in my case you would have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly make sure to use a large table, or make sure that the id's don't skip any numbers. Better yet and probably a bit faster you might even create a temporary table which would simply have all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost