So I have tables with the following structure:
TimeStamp,
var_1,
var_2,
var_3,
var_4,
var_5,...
Each table contains about 600 columns named var_##. The user parses some data stored by a machine, and I have to update every null value in the table to the last valid (non-null) value that precedes it. At the moment I use the following query:
update tableName
set var_## =
(select b.var_## from tableName as b
where b.timeStamp <= tableName.timeStamp and b.var_## is not null
order by timeStamp desc limit 1)
where tableName.var_## is null;
The problem right now is the time it takes to run this query for all the columns. Is there any way to optimize this query?
UPDATE: this is the query plan output when executing the query for one column:
update wme_test2
set var_6 =
(select b.var_6 from wme_test2 as b
where b.timeStamp <= wme_test2.timeStamp and b.var_6 is not null
order by timeStamp desc limit 1)
where wme_test2.var_6 is null;
Having 600 indexes on the data columns would be silly. (But not necessarily more silly than having 600 columns.)
All queries can be sped up with an index on the timeStamp column.
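For example (the index name is arbitrary; wme_test2 is the table from the update above):
create index idx_wme_test2_timestamp on wme_test2 (timeStamp);
With that index in place, the correlated subquery's order by timeStamp desc limit 1 can usually be answered by walking the index instead of sorting the whole table for every row being updated.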
Related
I have a table with 17,000,000 rows. I need to delete 500,000 of them matching certain conditions. At the moment I have a script with 500,000 rows that looks like this:
delete from table where name = 'John' and date = '2010-08-04';
delete from table where name = 'John' and date = '2010-08-05';
delete from table where name = 'Adam' and date = '2010-08-06';
One row takes about 2.5 seconds to execute. That's too long. How can I improve the speed?
If there is no index on the name and date fields, try creating the index below and run your code again:
CREATE INDEX idx_table_name_date ON table (name, date)
If possible, you can also minimize the number of delete statements by merging them.
Instead of
delete from table where name = 'John' and date = '2010-08-04';
delete from table where name = 'John' and date = '2010-08-05';
It can be:
delete from table where name = 'John' and date in('2010-08-04','2010-08-05');
I would suggest that you load the rows to delete into a table and use:
delete from table t
using todelete td
where t.name = td.name and t.date = td.date;
Even without indexes, this should be faster than zillions of separate delete statements. But you want an index on table(name, date) for performance.
If the data already comes from a table or query, then you can just use that directly.
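If it does not, a minimal sketch of loading such a table first (todelete and its column types are just illustrative):
create temporary table todelete (name varchar(100), date date);
insert into todelete (name, date) values
    ('John', '2010-08-04'),
    ('John', '2010-08-05'),
    ('Adam', '2010-08-06');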
You can also incorporate this into one query by listing the values explicitly in the using clause:
delete from table t
using (values ('John', date '2010-08-04'),
              ('John', date '2010-08-05'),
              ('Adam', date '2010-08-06')
      ) td(name, date)
where t.name = td.name and t.date = td.date;
I have a PostgreSQL table user_book_details with 451,007 records. The table is populated daily with around 1K new records.
I have the following query, which takes a long time (13 hours) to complete every time it runs:
update user_book_details as A1
set min_date = (select min(A2.acc_date) as min_date
                from user_book_details A2
                where A2.user_id = A1.user_id
                  and A2.book_id = A1.book_id)
where A1.min_date is null;
How can I rewrite the query to improve the performance?
FYI, there is no index on the user_id and book_id columns.
Your query is okay:
update user_book_details ubd
set min_date = (select min(ubd2.acc_date)
from user_book_details ubd2
where ubd2.user_id = ubd.user_id and
ubd2.book_id = ubd.book_id
)
where ubd.min_date is null;
For performance you want an index on user_book_details(user_id, book_id). I also think it would be faster written like this:
update user_book_details ubd
set min_date = min_acc_date
from (select ubd2.user_id, ubd2.book_id, min(ubd2.acc_date) as min_acc_date
from user_book_details ubd2
group by ubd2.user_id, ubd2.book_id
) ubd2
where ubd2.user_id = ubd.user_id and
ubd2.book_id = ubd.book_id and
ubd.min_date is null;
The first method uses the index to look up the values for each row (something that can be a little tricky when the update reads from the same table it is modifying). The second method aggregates the data once and then joins in the values.
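The suggested index would be created like this (the index name is just an example):
create index idx_ubd_user_book on user_book_details (user_id, book_id);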
I should note that this value is easily calculated on the fly:
select ubd.*,
min(acc_date) over (partition by user_id, book_id) as min_acc_date
from user_book_details ubd;
This might be preferable to trying to keep it up-to-date in the table.
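For instance, the calculation could live in a view (the view name here is only illustrative), so nothing ever needs to be kept up to date:
create view user_book_details_v as
select ubd.*,
       min(acc_date) over (partition by user_id, book_id) as min_acc_date
from user_book_details ubd;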
I am trying to merge 2 partitioned tables in BigQuery:
'source_t' is the source table. It is partitioned by ingestion time with "Require partition filter" enabled; the pseudo-column _PARTITIONTIME is a TIMESTAMP.
'target_t' is the target table. It is partitioned by the field 'date' with "Require partition filter" enabled; the field date is a DATE.
I want to get the data from the last partition of the source table and merge it into the target table. To filter the scan on the target table I need to use the field 'date' from the source table's data. I wrote a query, but the editor shows the following error:
Cannot query over table 'MyDataSet.target_t' without a filter over column(s) 'date'
Here is my query:
declare latest default (select date(max(_PARTITIONTIME)) as latest from MyDataSet.source_t where _PARTITIONTIME >= timestamp(date_sub(current_date(),interval 1 day)));
declare first_date default (select min(date) as first_date from MyDataSet.source_t where date(_PARTITIONTIME) = latest);
merge `MyDataSet.target_t` as a
using (select * from `MyDataSet.source_t` where _PARTITIONTIME = latest) as b
on
a.date >= first_date and
a.date = b.date and
a.account_id = b.account_id and
a.campaign_id = b.campaign_id and
a.adset_id = b.adset_id and
a.ad_id = b.ad_id
when matched then update set
a.account_name = b.account_name,
a.campaign_name = b.campaign_name,
a.adset_name = b.adset_name,
a.ad_name = b.ad_name,
a.impressions = b.impressions,
a.clicks = b.clicks,
a.spend = b.spend,
a.date = b.date
when not matched then insert row;
If I put a literal date in place of the 'latest' variable ("where _PARTITIONTIME = '2020-10-01') as b"), there won't be any error. But I want to filter the source table properly.
And I don't get how it affects the following 'on' clause and why everything breaks >.<
Could you please help? What is the proper syntax for such a query? And is there any other way to run such a merge without variables?
declare latest timestamp;
Your variable latest is a TIMESTAMP. Make it a DATE type and your query should work.
------ Update --------
The error is complaining that MyDataSet.target_t doesn't have a good filter on the date column. Could you try putting a.date = latest after the on clause? (If that is not the right filter, come up with another constant filter.)
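Putting the two suggestions together, the script might look roughly like this (a sketch only; it assumes the target rows for that day carry the same date value as the source partition, so a.date = latest is a sensible filter):
declare latest date default (
  select date(max(_PARTITIONTIME))
  from MyDataSet.source_t
  where _PARTITIONTIME >= timestamp(date_sub(current_date(), interval 1 day)));

merge `MyDataSet.target_t` as a
using (select * from `MyDataSet.source_t`
       where _PARTITIONTIME = timestamp(latest)) as b
on
  a.date = latest and  -- filter on the target's partitioning column
  a.date = b.date and
  a.account_id = b.account_id and
  a.campaign_id = b.campaign_id and
  a.adset_id = b.adset_id and
  a.ad_id = b.ad_id
when matched then update set
  a.account_name = b.account_name,
  a.campaign_name = b.campaign_name,
  a.adset_name = b.adset_name,
  a.ad_name = b.ad_name,
  a.impressions = b.impressions,
  a.clicks = b.clicks,
  a.spend = b.spend
when not matched then insert row;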
I have a table A with a column D_DATE whose values are in the form YYYYMMDD (I am not bothered about the date format). I also have another table B with a column named V_TILL. Now, I want to update the V_TILL column of table B with the value of the D_DATE column in table A, which happens to contain duplicates as well. This means the inner query can return multiple records for the query I use to update the table.
I currently have this query written but it throws the error:
ORA-01427: single-row subquery returns more than one row
UPDATE TAB_A t1
SET (V_TILL) = (SELECT TO_DATE(t2.D_DATE,'YYYYMMDD')
FROM B t2
WHERE t1.BR_CODE = t2.BR_CODE
AND t1.BK_CODE = t2.BK_CODE||t2.BR_CODE)
WHERE EXISTS (
SELECT 1
FROM TAB_B t2
WHERE t1.BR_CODE = t2.BR_CODE
AND t1.BK_CODE = t2.BK_CODE||t2.BR_CODE)
PS: BK_CODE in TAB_A is the concatenation of BK_CODE and BR_CODE from the other table.
Kindly help me as I am stuck in this quagmire! Any help would be appreciated.
If the subquery returns many values, which one do you want to use?
If any will do, you can use ROWNUM <= 1.
If you know that there is only one value, use DISTINCT:
SET (V_TILL) = (SELECT TO_DATE(t2.D_DATE,'YYYYMMDD')
FROM B t2
WHERE t1.BR_CODE = t2.BR_CODE
AND t1.BK_CODE = t2.BK_CODE||t2.BR_CODE AND ROWNUM <=1)
or
SET (V_TILL) = (SELECT DISTINCT TO_DATE(t2.D_DATE,'YYYYMMDD')
FROM B t2
WHERE t1.BR_CODE = t2.BR_CODE
AND t1.BK_CODE = t2.BK_CODE||t2.BR_CODE)
The above are workarounds. To do it right, you have to analyze why you are getting more than one value. Maybe more sophisticated logic is needed to select the right value; one possibility is sketched below.
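For example, if the most recent D_DATE happens to be the right value, an aggregate makes the subquery deterministic (this is only an illustration of such logic, not necessarily the rule your data needs):
SET (V_TILL) = (SELECT MAX(TO_DATE(t2.D_DATE,'YYYYMMDD'))
                FROM B t2
                WHERE t1.BR_CODE = t2.BR_CODE
                  AND t1.BK_CODE = t2.BK_CODE||t2.BR_CODE)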
I got it working with this command:
MERGE INTO TAB_A A
USING TAB_B B
ON (A.BK_CODE = B.BK_CODE || B.BR_CODE
AND A.BR_CODE = B.BR_CODE AND B.BR_DISP_TYPE <> '0'
AND ((B.BK_CODE, B.BR_SUFFIX) IN (SELECT BK_CODE,
MIN(BR_SUFFIX)
FROM TAB_B
GROUP BY BK_CODE)))
As mentioned earlier by many, I was missing an extra condition; once I added it, the merge worked. Otherwise, the techniques mentioned above work very well.
Thanks to all!
I currently have a query that randomly selects a job from a table of jobs:
select jobs.job_id
from jobs
where (jobs.type is null)
and (jobs.project_id = 5)
and (jobs.status = 'Available')
offset floor(random() * (select count(*) from jobs
where (jobs.type is null) and (jobs.project_id = 5)
and (jobs.status = 'Available')))
limit 1
This has the desired functionality, but is too slow. I am using Postgres 9.2 so I can't use TABLESAMPLE, unfortunately.
On the plus side, I do not need it to be truly random, so I'm thinking I can optimize it by making it slightly less random.
Any ideas?
Could I suggest an index on jobs(project_id, status, type)? That might speed your query, if it is not already defined on the table.
Instead of using OFFSET and LIMIT, why don't you use
ORDER BY random() LIMIT 1
If that is also too slow, you could replace your subquery
select count(*) from jobs
where (jobs.type is null) and (jobs.project_id = 5)
and (jobs.status = 'Available')
with something like
SELECT reltuples * <factor> FROM pg_class WHERE oid = <tableoid>
where <tableoid> is the OID of the jobs table and <factor> is a number that is slightly bigger than the selectivity of the WHERE condition of your subquery.
That will save one sequential scan, with the downside that you occasionally get no result and have to repeat the query.
Is that good enough?
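For example, the original query might end up looking something like this (the 0.10 selectivity factor is a made-up number; substitute whatever fraction of the table actually matches the WHERE clause):
select jobs.job_id
from jobs
where jobs.type is null
  and jobs.project_id = 5
  and jobs.status = 'Available'
offset floor(random() * (select reltuples * 0.10
                         from pg_class
                         where oid = 'jobs'::regclass))
limit 1;
Keep in mind that reltuples is only an estimate maintained by VACUUM/ANALYZE, which is why you can occasionally land past the last matching row and get no result.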
A dirty trick: store the random value inside the table and build a (partial) index on it. (You may want to re-randomise this field from time to time to avoid records never being picked ;-)
-- assumed table definition
CREATE table jobs
( job_id SERIAL NOT NULL PRIMARY KEY
, type VARCHAR
, project_id INTEGER NOT NULL
, status VARCHAR NOT NULL
-- pre-computed magic random number
-- , magic DOUBLE PRECISION NOT NULL DEFAULT random()
);
-- some data to play with
INSERT INTO jobs(type,project_id,status)
SELECT 'aaa' , gs %10 , 'Omg!'
FROM generate_series(1,10000) gs;
UPDATE jobs SET type = NULL WHERE random() < 0.2;
UPDATE jobs SET status = 'Available' WHERE random() < 0.2;
-- add a column containing random numbers
ALTER TABLE jobs
ADD column magic DOUBLE PRECISION NOT NULL DEFAULT random()
;
CREATE INDEX ON jobs(magic)
-- index is only applied for the conditions you will be searching
WHERE status = 'Available' AND project_id = 5 AND type IS NULL
;
-- make sure statistics are present
VACUUM ANALYZE jobs;
-- EXPLAIN
SELECT j.job_id
FROM jobs j
WHERE j.type is null
AND j.project_id = 5
AND j.status = 'Available'
ORDER BY j.magic
LIMIT 1
;
Something similar can be accomplished by using a serial with a rather high increment value (some prime number around 3G) instead of random+float.