I have to delete a lot of rows from our log DB.
Currently, the table holds about 6.3 billion rows that I want to delete.
For now (until I have a better solution), I'm deleting in one-million-row increments via Python, which takes around 600 seconds per million on average.
I've tried to copy the data we need (around 3% of everything in this table) into another table, but that failed with an error; it seems the volume is too large to insert via a SELECT into another table.
(My plan was to copy the data we need into another table, then drop the old one and rename the new table.)
I found this one too:
Delete specific rows from a table with billions of rows
Is that the best approach for me? (My procedure-writing skills are nearly non-existent, so I'm a little scared to try it; that's why I'm using Python for now.)
Thank you for any advice
Edit:
My Python Code:
import time
from helperfunc import dbExecAny
def delete_lg_log():
    # Delete up to one million matching rows per call
    dbExecAny("delete from inspire.lg_log WHERE lg_wp_type != 'Matching.SAP.Zahlstatus' and rownum <= 1000000")

i = 0
while i <= 6200:
    i += 1
    print(f"Process loop {i}")
    start = time.time()
    delete_lg_log()
    end = time.time()
    print(end - start)
delete function:
def dbExecAny(stmt, db_name=""):
    try:
        if db_name.lower() == "cobra":
            a_cur = cobra_conn.cursor()
        else:
            a_cur = conn.cursor()

        # Refuse statements that are destructive or lack a WHERE clause
        if "drop" in stmt.lower() or "truncate" in stmt.lower() or "where" not in stmt.lower():
            print("Invalid statement: " + stmt)
            return []

        return a_cur.execute(stmt)
    except Exception as e:
        print("Could not execute statement " + stmt + ", error: " + str(e))
        return []
    finally:
        conn.commit()
The quickest way to save the 3% of data that should remain is a CTAS (CREATE TABLE ... AS SELECT) from your source table with the NOLOGGING option. After that, truncate your source table, drop it, and rename the newly created table.
Don't forget about dependencies such as indexes, triggers, constraints and privileges.
NOLOGGING only works if the database is not in force logging mode, which is used when it needs to keep a standby database in sync.
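A rough sketch of what that could look like, assuming the rows to keep are the ones your DELETE excludes (i.e. lg_wp_type = 'Matching.SAP.Zahlstatus'); the lg_log_keep name is made up here:
-- Copy the ~3% you want to keep; NOLOGGING skips most redo generation
CREATE TABLE inspire.lg_log_keep NOLOGGING AS
SELECT *
FROM   inspire.lg_log
WHERE  lg_wp_type = 'Matching.SAP.Zahlstatus';

-- Empty and remove the old table, then put the new one in its place
TRUNCATE TABLE inspire.lg_log;
DROP TABLE inspire.lg_log;
ALTER TABLE inspire.lg_log_keep RENAME TO lg_log;

-- Recreate indexes, constraints, triggers and grants on the renamed table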
I am looking for an R solution to this problem. My list of parameters is over 18,000 long, so I attempted to split this up with a for loop, running the query in each iteration with 2,000 parameters (except the last iteration, which may have fewer than 2,000). However, it seems to be "storing" parameters somewhere during each iteration, so after the first iteration it tells me I hit the limit. If I break it up into chunks of 1,000, it breaks down after the second iteration. My code looks like:
Start_List <- (some list of values)
query_list <- list()

for (i in 1:ceiling(length(Start_List)/2000)) {
  # Partition Start_List into chunks of length 2000
  List <- Start_List[2000*(i-1)+1:min(2000*i, length(Start_List))]
  # Create qmarks for List
  qmarks_List <- paste(rep("?", length(List)), collapse = ",")
  # Query
  query <- paste("
    SELECT columns
    FROM table
    WHERE column IN (", qmarks_List, ")
  ")
  loop_df <- dbGetQuery(db, query, params = c(as.list(List)))
  # Store the query result in a list
  query_list[[i]] <- loop_df
}
How can I clear the parameters so it starts back at 0 parameters each iteration?
Update (8/24/2022): still looking for a solution to this problem.
While I cannot reproduce this bug, I can present an alternative: upload your values to a temporary table, query against it (either set-membership or a left join), then delete it. The reason to use bound parameters in the first place is to prevent SQL injection (malicious or accidental corruption), and I believe this suggestion preserves that intent.
DBI::dbWriteTable(con, "#sometable", data.frame(val = Start_List), create = TRUE)
## choose one from:
DBI::dbGetQuery(con, "select Columns from table where column in (select val from #sometable)")
DBI::dbGetQuery(con, "select t2.Columns from #sometable t1 left join table t2 on t1.val=t2.column")
## cleanup, though the temp table will auto-delete when you disconnect
DBI::dbExecute(con, "drop table #sometable")
While this should fix the problem you're having, it should also simplify and speed up your process: instead of iterating over groups of your 18K-long list, it does a single query and a single data pull to retrieve all of the records. If you still need them grouped afterwards, that can be done easily in R (perhaps more easily than in SQL, but I'm confident a SQL guru could demonstrate a safe/fast/efficient SQL method for this as well).
If you aren't aware of temp tables in SQL Server: prepending with # makes it a per-connection temporary table, meaning that no other connection (even the same user on a different connection) will see this table; prepending with ## makes it a "global" temporary table, visible from any connection. Both types are automatically dropped when the creating connection closes.
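For reference, this is roughly the T-SQL equivalent of those DBI calls (a sketch only; [table], [column] and [Columns] are the placeholders from your query, the val type is assumed, and the inserted values are illustrative):
-- #sometable is private to this connection; ##sometable would be visible from any connection
CREATE TABLE #sometable (val VARCHAR(100));
INSERT INTO #sometable (val) VALUES ('a'), ('b');   -- dbWriteTable bulk-inserts Start_List here

SELECT [Columns]
FROM [table]
WHERE [column] IN (SELECT val FROM #sometable);

DROP TABLE #sometable;   -- optional; dropped automatically when the connection closes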
I have a dataset in BigQuery with roughly 1,000 tables, one for each variable. Each table contains two columns: observation_number and variable_name. Please note that the variable_name column is named after the actual variable it holds. Each table contains at least 20,000 rows. What is the best way to merge these tables on the observation number?
I have developed Python code that will run in a Cloud Function and that generates the SQL query to merge the tables. It does that by connecting to the dataset and looping through the tables to get all of the table_ids. However, the query ends up being too large and the performance is not great.
Here is a sample of the Python code that generates the query (mind that it's still running locally, not yet in a Cloud Function).
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set project_id and dataset_id.
project_id = 'project-id-gcp'
dataset_name = 'sample_dataset'
dataset_id = project_id + '.' + dataset_name

dataset = client.get_dataset(dataset_id)

# View tables in dataset
tables = list(client.list_tables(dataset))  # API request(s)
table_names = []
if tables:
    for table in tables:
        table_names.append(table.table_id)
else:
    print("\tThis dataset does not contain any tables.")

query_start = "select " + table_names[0] + ".observation" + "," + table_names[0] + "." + table_names[0]
query_select = ""
query_from_select = "(select observation," + table_names[0] + " from `" + dataset_name + "." + table_names[0] + "`) " + table_names[0]

for table_name in table_names:
    if table_name != table_names[0]:
        query_select = query_select + "," + table_name + "." + table_name
        query_from_select = query_from_select + " FULL OUTER JOIN (select observation," + table_name + " from " + "`" + dataset_name + "." + table_name + "`) " + table_name + " on " + table_names[0] + ".observation=" + table_name + ".observation"

query_from_select = " from (" + query_from_select + ")"
query_where = " where " + table_names[0] + ".observation IS NOT NULL"
query_order_by = " order by observation"
query_full = query_start + query_select + query_from_select + query_where + query_order_by

with open("query.sql", "w") as f:
    f.write(query_full)
And this is a sample of the generated query for two tables:
select
VARIABLE1.observation,
VARIABLE1.VARIABLE1,
VARIABLE2.VARIABLE2
from
(
(
select
observation,
VARIABLE1
from
`sample_dataset.VARIABLE1`
) VARIABLE1 FULL
OUTER JOIN (
select
observation,
VARIABLE2
from
`sample_dataset.VARIABLE2`
) VARIABLE2 on VARIABLE1.observation = VARIABLE2.observation
)
where
VARIABLE1.observation IS NOT NULL
order by
observation
As the number of tables grows, this query gets larger and larger. Any suggestions on how to improve the performance of this operation? Any other way to approach this problem?
I don't know if there is a great technical answer to this question. It seems like you are trying to do a huge number of joins in a single query, and BQ's strength is not realized with many joins.
While I outline a potential solution below, have you considered if/why you really need a table with 1000+ potential columns? Not saying you haven't, but there might be alternate ways to solve your problem without creating such a complex table.
One possible solution is to subset your joins/tables into more manageable chunks.
If you have 1000 tables for example, run your script against smaller subsets of your tables (2/5/10/etc) and write those results to intermediate tables. Then join your intermediate tables. This might take a few layers of intermediate tables depending on the size of your sub-tables. Basically, you want to minimize (or make reasonable) the number of joins in each query. Delete the intermediate tables after you are finished to help with unnecessary storage costs.
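As a rough sketch of the layering, with hypothetical names (only VARIABLE1 and VARIABLE2 appear in the question's sample; VARIABLE3, VARIABLE4, intermediate_01/02 and final_merged are made up here):
-- Layer 1: merge small groups of source tables into intermediate tables
CREATE TABLE sample_dataset.intermediate_01 AS
SELECT COALESCE(v1.observation, v2.observation) AS observation,
       v1.VARIABLE1,
       v2.VARIABLE2
FROM `sample_dataset.VARIABLE1` v1
FULL OUTER JOIN `sample_dataset.VARIABLE2` v2
  ON v1.observation = v2.observation;

-- ... intermediate_02 built the same way from VARIABLE3 and VARIABLE4 ...

-- Layer 2: join the intermediates; repeat with more layers until one wide table remains
CREATE TABLE sample_dataset.final_merged AS
SELECT COALESCE(i1.observation, i2.observation) AS observation,
       i1.VARIABLE1, i1.VARIABLE2,
       i2.VARIABLE3, i2.VARIABLE4
FROM sample_dataset.intermediate_01 i1
FULL OUTER JOIN sample_dataset.intermediate_02 i2
  ON i1.observation = i2.observation;
COALESCE on the join keys keeps observations missing from the first table; the question's generated query filters those out with IS NOT NULL, so adjust to taste, and drop the intermediate tables when you are done.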
Assume that we have two tables, named Tb1 and Tb2, and we are going to move data from one to the other. Tb1 is the main source of data and Tb2 is the destination. This operation has 3 parts.
In the first part we validate all rows in Tb1 and check whether they are correct. For example, a national security code must have exactly 10 digits, and a real customer must have a valid birth date; based on validation rules like these, 28 different validation methods and error codes have been defined. During validation, every invalid row's description and status are updated to a new state.
Part 2 fixes the rows' problems, and part 3 moves them to Tb2.
For instance, this row says that it has 4 different errors:
-- Tb1.desc=6,8,14,16
-- Tb1.sts=0
A correct row of data:
-- Tb1.desc=Null
-- Tb1.sts=1
I have been working on the first part recently and have come up with a solution which works, but it is too slow. Unfortunately, it takes exactly 31 minutes to validate 100,000 rows. In a real situation we are going to validate more than 2 million records, so it is practically useless despite being functionally correct.
Let's take a look at my package:
procedure Val_primary IS
begin
  Open X_CUSTOMER;
  Loop
    fetch X_CUSTOMER bulk collect into CUSTOMER_RECORD;
    EXIT WHEN X_CUSTOMER%notfound;
    For i in CUSTOMER_RECORD.first..CUSTOMER_RECORD.last loop
      Val_CTYP(CUSTOMER_RECORD(i).XCUSTYP);
      Val_BRNCH(CUSTOMER_RECORD(i).XBRNCH);
      --Rest of the validations ...
      UptDate_Val(CUSTOMER_RECORD(i).Xrownum);
    end loop;
    CUSTOMER_RECORD.delete;
  End loop;
  Close X_CUSTOMER;
end Val_primary;
Inside a validation procedure:
procedure Val_CTYP(customer_type IN number) IS
Begin
  IF (customer_type < 1 or customer_type > 3) then
    RW_FINAL_STATUS := 0;
    FINAL_ERR_DSC := Concat(FINAL_ERR_DSC, ERR_INVALID_CTYP);
  End If;
End Val_CTYP;
Inside the update procedure:
procedure UptDate_Val(rownumb IN number) IS
begin
  update tb1
     set tb1.xstst = RW_FINAL_STATUS,
         tb1.xdesc = FINAL_ERR_DSC
   where xc1customer.xrownum = rownumb;
  RW_FINAL_STATUS := 1;
  FINAL_ERR_DSC := null;
end UptDate_Val;
Is there any way to reduce the execution time?
It must complete in less than 20 minutes for more than 2 million records.
Maybe each validation check could be a case expression within an inline view, and you could concatenate them in the enclosing query, giving you a single SQL statement that could drive an update. Something along the lines of:
select xxx, yyy, zzz -- whatever columns you need from xc1customer
, errors -- concatenation of all error codes that apply
, case when errors is not null then 0 else 1 end as status
from ( select xxx, yyy, zzz
, trim(ltrim(val_ctyp||' ') || ltrim(val_abc||' ') || ltrim(val_xyz||' ') || etc...) as errors
from ( select c.xxx, c.yyy, c.zzz
, case when customer_type < 1 or customer_type > 3 then err_invalid_ctyp end as val_ctyp
, case ... end as val_abc
, case ... end as val_xyz
from xc1customer c
)
);
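If the first pass must still flag the rows in place (like the question's UptDate_Val does), a query of that shape could drive a single MERGE. This is only a sketch built from the question's identifiers (xc1customer, xrownum, xcustyp, xstst, xdesc) and shows just the first check; err_invalid_ctyp is a package constant in the question, so in plain SQL you would inline its value or wrap it in a function:
merge into xc1customer t
using (
  select c.xrownum
       , case when c.xcustyp < 1 or c.xcustyp > 3 then err_invalid_ctyp end as val_ctyp
         -- , one case expression per remaining check ...
  from   xc1customer c
) v
on (t.xrownum = v.xrownum)
when matched then update set
  t.xdesc = trim(ltrim(v.val_ctyp || ' '))   -- concatenate all the val_* columns here
, t.xstst = case when v.val_ctyp is not null then 0 else 1 end;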
Sticking with the procedural approach, the slow part seems to be the single-row update. There is no advantage to bulk-collecting over 2 million rows into session memory only to apply 2 million individual updates. The quick fix would be to add a limit clause to the bulk collect (and move the exit to the bottom of the loop, where it belongs), have your validation procedures set a value in the array instead of updating the table, and batch the updates into one forall per loop iteration.
You can be a bit freer with passing records and arrays in and out of procedures rather than having everything a global variable, as passing by reference means there is no performance overhead.
There are two potential lines of attack.
Specific implementation. Collections are read into session memory. This is usually quite small compared to global memory allocation. Reading 100,000 longish rows into session memory is a bad idea and can cause performance issues, so breaking up the process into smaller chunks (say 1,000 rows) will most likely improve throughput.
General implementation. What is the point of the tripartite process? Updating Table1 with some error flags is an expensive activity. A more efficient approach would be to apply the fixes to the data in the collection and apply that to Table2. You can write a log record if you need to track what changes are made.
Applying these suggestions, you'd end up with a single procedure which looks a bit like this:
procedure one_and_only is
begin
  open x_customer;
  << tab_loop >>
  loop
    fetch x_customer bulk collect into customer_record
      limit 1000;
    exit when customer_record.count() = 0;

    << rec_loop >>
    for i in customer_record.first..customer_record.last loop
      val_and_fix_ctyp(customer_record(i).xcustyp);
      val_and_fix_brnch(customer_record(i).xbrnch);
      --rest of the validations ...
    end loop rec_loop;

    -- apply the cleaned data to target table
    forall j in 1..customer_record.count()
      insert into table_2
      values customer_record(j);
  end loop tab_loop;

  close x_customer;
end one_and_only;
Note that this approach requires the customer_record collection to match the projection of the target table. Also, don't use %notfound to test for the end of the cursor unless you can guarantee the total number of rows read is an exact multiple of the LIMIT value.
I have a table in MS Access which has stock prices arranged like
Ticker1, 9:30:00, $49.01
Ticker1, 9:30:01, $49.08
Ticker2, 9:30:00, $102.02
Ticker2, 9:30:01, $102.15
and so on.
I need to do a calculation that compares the price in one row with the immediately preceding price (and if the price movement is greater than X% in 1 second, I need to report the instance separately).
If I were doing this in Excel, it would be a fairly simple formula. But I have a few million rows of data, so that's not an option.
Any suggestions on how I could do it in MS Access?
I am open to any kind of solutions (with or without SQL or VBA).
Update:
I ended up traversing my records with ADODB.Recordset in nested loops; code below. I thought it was a good idea, and the logic worked for a small table (20k rows). But when I ran it on a larger table (3M rows), Access ballooned to its 2 GB limit without finishing the task (because of temporary tables; the original table itself was only ~300 MB). Posting it here in case it helps someone with smaller data sets.
Do While Not rstTickers.EOF
    myTicker = rstTickers!ticker
    rstDates.MoveFirst
    Do While Not rstDates.EOF
        myDate = rstDates!Date_Only
        'Get all prices for a given ticker for a given date
        strSql = "select * from Prices where ticker = """ & myTicker & """ and Date_Only = #" & myDate & "#"
        rst.Open strSql, cn, adOpenKeyset, adLockOptimistic 'I needed to do this to open in editable mode
        rst.MoveFirst
        sPrice1 = rst!Open_Price
        rst!Row_Num = i
        rst.MoveNext
        Do While Not rst.EOF
            i = i + 1
            rst!Row_Num = i
            rst!Previous_Price = sPrice1
            sPrice2 = rst!Open_Price
            rst!Price_Move = Round(Abs((sPrice2 / sPrice1) - 1), 6)
            sPrice1 = sPrice2
            rst.MoveNext
        Loop
        i = i + 1
        rst.Close
        rstDates.MoveNext
    Loop
    rstTickers.MoveNext
Loop
If the data is always one second apart without any milliseconds, then you can join the table to itself on the Ticker ID and the time offsetting by one second.
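A sketch of that self-join in Access SQL, assuming a Prices table with Ticker, Time_Date and Open_Price columns (names taken from the code above, so treat them as assumptions):
SELECT cur.Ticker,
       cur.Time_Date,
       cur.Open_Price,
       prev.Open_Price AS Previous_Price,
       Round(Abs((cur.Open_Price / prev.Open_Price) - 1), 6) AS Price_Move
FROM Prices AS cur
INNER JOIN Prices AS prev
  ON (cur.Ticker = prev.Ticker)
 AND (cur.Time_Date = DateAdd("s", 1, prev.Time_Date));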
Otherwise, if there is no sequence counter of some sort to join on, then you will need to create one. You can do this by doing a "ranking" query. There are multiple approaches to this. You can try each and see which one works the fastest in your situation.
One approach is to use a subquery that returns the number of rows before the current row. Another approach is to join the table to itself on all the rows before it and do a group by and count. Both approaches produce the same results, but depending on the nature of your data, how it's structured, and what indexes you have, one approach will be faster than the other.
Once you have a "rank column", you do the procedure described in the first paragraph, but instead of joining on an offset of time, you join on an offset of rank.
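And a sketch of the ranking approach using the correlated-subquery variant (again assuming the Prices / Ticker / Time_Date / Open_Price names from the code above):
SELECT p.Ticker, p.Time_Date, p.Open_Price,
       (SELECT COUNT(*)
        FROM Prices AS q
        WHERE q.Ticker = p.Ticker
          AND q.Time_Date < p.Time_Date) AS PriceRank
FROM Prices AS p;
If that is saved as a query named RankedPrices (a made-up name), the previous-row lookup becomes a join on an offset of rank instead of an offset of time:
SELECT cur.Ticker, cur.Time_Date, cur.Open_Price, prev.Open_Price AS Previous_Price
FROM RankedPrices AS cur
INNER JOIN RankedPrices AS prev
  ON (cur.Ticker = prev.Ticker)
 AND (cur.PriceRank = prev.PriceRank + 1);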
I ended up moving my data to SQL Server (which had its own issues). I added a row number column (Row_Num) like this:
ALTER TABLE Prices ADD Row_Num INT NOT NULL IDENTITY (1,1)
It worked for me (I think) because my underlying data was already in the order I needed. I've read enough comments saying you shouldn't rely on this, because you don't know what order the server stores the data in.
Anyway, after that it was a self-join. It took me a while to figure out the syntax (I am new to SQL). Adding the SQL here for reference (works on SQL Server but not Access).
Update A Set Previous_Price = B.Open_Price
FROM Prices A INNER JOIN Prices B
ON A.Date_Only = B.Date_Only
WHERE ((A.Ticker=B.Ticker) AND (A.Row_Num=B.Row_Num+1));
BTW, I had to first add the column Date_Only like this (works in Access but not SQL Server):
UPDATE Prices SET Prices.Date_Only = Format([Time_Date],"mm/dd/yyyy");
I think the solution for row numbers described by @Rabbit should work better (broadly speaking); I just haven't had the time to try it out. It took me a whole day to get this far.
I have 2 tables in SQL: Event and Swimstyle.
The Event table has a column SwimstyleId, which refers to Swimstyle.id.
The Swimstyle table has 3 columns: distance, relaycount and strokeid.
Normally there would be somewhere between 30 and 50 rows in Swimstyle, holding all possible combinations (e.g. 50 (distance), 1 (relaycount), FREE (strokeid)).
However, due to a programming mistake the lookup for existing values didn't work, and the importer of new results created a new Swimstyle entry for each event added...
My Swimstyle table now consists of almost 200k rows, which of course is not the best idea performance-wise ;)
To fix this I want to go through all Events, get the attached swimstyle values, look up the first existing row in Swimstyle that has the same distance, relaycount and strokeid values, and update Event.SwimstyleId with that row's id.
When this is all done I can delete all orphaned Swimstyle rows, leaving a table with only 30-50 rows.
I have been trying to write a query that does this, but I'm not getting anywhere. Can anyone point me in the right direction?
These 2 statements should fix the problem, if I've read it right. N.B. I haven't been able to try this out anywhere, and I've made a few assumptions about the table structure.
UPDATE event e
set swimstyle_id = (SELECT MIN(s_min.id)
FROM swimstyle s_min,swimstyle s_cur
WHERE s_min.distance = s_cur.distance
AND s_min.relaycount = s_cur.relaycount
AND s_min.strokeid = s_cur.strokeid
AND s_cur.id = e.swimstyle_id);
DELETE FROM swimstyle s
WHERE NOT EXISTS (SELECT 1
FROM event e
WHERE e.swimstyle_id = s.id);