How to merge one thousand tables within a BigQuery dataset? - sql

I have a dataset in BigQuery with roughly 1000 tables, one per variable. Each table contains two columns: observation_number and variable_name, where the variable_name column is actually named after the variable it holds. Each table contains at least 20000 rows. What is the best way to merge these tables on the observation number?
I have developed Python code that will run in a Cloud Function and generates the SQL query to merge the tables. It does this by connecting to the dataset and looping through the tables to get all of the table_ids. However, the query ends up being very large and the performance is not great.
Here is a sample of the Python code that generates the query (note that it still runs locally, not yet in a Cloud Function).
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set project_id and dataset_id.
project_id = 'project-id-gcp'
dataset_name = 'sample_dataset'
dataset_id = project_id+'.'+dataset_name
dataset = client.get_dataset(dataset_id)
# View tables in dataset
tables = list(client.list_tables(dataset)) # API request(s)
table_names = []
if tables:
    for table in tables:
        table_names.append(table.table_id)
else:
    print("\tThis dataset does not contain any tables.")
query_start = "select "+table_names[0]+".observation"+","+table_names[0]+"."+table_names[0]
query_select = ""
query_from_select = "(select observation,"+table_names[0]+" from `"+dataset_name+"."+table_names[0]+"`) "+table_names[0]
for table_name in table_names:
    if table_name != table_names[0]:
        query_select = query_select + "," + table_name+"."+table_name
        query_from_select = query_from_select + " FULL OUTER JOIN (select observation," + table_name + " from " + "`"+dataset_name+"."+table_name+"`) "+table_name+" on "+table_names[0]+".observation="+table_name+".observation"
query_from_select = " from ("+query_from_select + ")"
query_where = " where " + table_names[0] + ".observation IS NOT NULL"
query_order_by = " order by observation"
query_full = query_start+query_select+query_from_select+query_where+query_order_by
with open("query.sql","w") as f:
    f.write(query_full)
And this is a sample of the generated query for two tables:
select
VARIABLE1.observation,
VARIABLE1.VARIABLE1,
VARIABLE2.VARIABLE2
from
(
(
select
observation,
VARIABLE1
from
`sample_dataset.VARIABLE1`
) VARIABLE1
FULL OUTER JOIN (
select
observation,
VARIABLE2
from
`sample_dataset.VARIABLE2`
) VARIABLE2 on VARIABLE1.observation = VARIABLE2.observation
)
where
VARIABLE1.observation IS NOT NULL
order by
observation
As the number of tables grows, this query gets larger and larger. Any suggestions on how to improve the performance of this operation? Any other way to approach this problem?

I don't know if there is a great technical answer to this question. It seems like you are trying to do a huge number of joins in a single query, and that is not where BigQuery's strength lies.
While I outline a potential solution below, have you considered if/why you really need a table with 1000+ potential columns? Not saying you haven't, but there might be alternate ways to solve your problem without creating such a complex table.
One possible solution is to subset your joins/tables into more manageable chunks.
If you have 1000 tables, for example, run your script against smaller subsets of your tables (2/5/10/etc.) and write those results to intermediate tables, as sketched below. Then join your intermediate tables. This might take a few layers of intermediate tables depending on the size of your sub-tables. Basically, you want to minimize (or keep reasonable) the number of joins in each query. Delete the intermediate tables after you are finished to avoid unnecessary storage costs.
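To make that concrete, each chunk could be a single CREATE TABLE ... AS SELECT over a handful of the source tables. This is only a sketch: the table names VARIABLE1 through VARIABLE3 and the staging dataset sample_dataset_stage are made up, and USING (observation) gives a single coalesced join key across the FULL OUTER JOINs.
CREATE TABLE `sample_dataset_stage.chunk_001` AS
SELECT
  observation,
  v1.VARIABLE1,
  v2.VARIABLE2,
  v3.VARIABLE3
FROM `sample_dataset.VARIABLE1` v1
FULL OUTER JOIN `sample_dataset.VARIABLE2` v2 USING (observation)
FULL OUTER JOIN `sample_dataset.VARIABLE3` v3 USING (observation);
Repeat for chunk_002, chunk_003, and so on, then join the chunk tables together in a second layer the same way, and drop the chunk tables once the final table is built.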

Related

Deleting billions of rows in an Oracle DB

I have to delete a lot of rows from our log db.
Currently, the table holds about 6.3 billion rows that I want to delete.
For now (until I have a better solution), I'm doing it in one-million-row increments via Python, which takes around 600 seconds on average per million.
I've tried to copy the data we need (around 3% of all data in this table) into another table, but that failed with an error; it seems there is too much data to insert via a SELECT into another table.
(My plan was to get the data we need, insert it into another table, then drop the old one and rename the new table.)
I found this one too:
Delete specific rows from a table with billions of rows
Is that the best approach for me? (My procedure-writing skills are nearly nonexistent, so I'm a little scared to try this; that's why I'm doing it with Python right now.)
Thank you for any advice
Edit:
My Python Code:
import time
from helperfunc import dbExecAny

def delete_lg_log():
    dbExecAny("delete from inspire.lg_log WHERE lg_wp_type != 'Matching.SAP.Zahlstatus' and rownum <= 1000000")

i = 0
while i <= 6200:
    i += 1
    print(f"Process loop {i}")
    start = time.time()
    delete_lg_log()
    end = time.time()
    print(end - start)  # elapsed seconds for this batch
delete function:
def dbExecAny(stmt, db_name=""):
    try:
        if db_name.lower() == "cobra":
            a_cur = cobra_conn.cursor()
        else:
            a_cur = conn.cursor()
        # Reject DROP/TRUNCATE statements and any statement without a WHERE clause.
        if "drop" in stmt.lower() or "truncate" in stmt.lower() or "where" not in stmt.lower():
            print("Invalid statement: " + stmt)
            return []
        return a_cur.execute(stmt)
    except Exception as e:
        print("Could not execute statement " + stmt + ", error: " + str(e))
        return []
    finally:
        conn.commit()
The quickest way to save the 3% of data that should remain is to do a CTAS (Create Table As Select) from your source table with the NOLOGGING option. After that, truncate your source table, drop it, and rename the newly created table.
Don't forget about dependencies such as indexes, triggers, constraints, and privileges.
NOLOGGING only helps if the database is not in force-logging mode; force logging is typically enabled when a standby database has to be kept in sync.
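A minimal sketch of that approach, reusing the table and the keep-filter from your delete statement (the new table name lg_log_keep is made up; re-create indexes, constraints, grants, and so on before dropping the original):
CREATE TABLE inspire.lg_log_keep NOLOGGING AS
SELECT *
FROM inspire.lg_log
WHERE lg_wp_type = 'Matching.SAP.Zahlstatus';
-- after the dependent objects have been re-created on the new table:
TRUNCATE TABLE inspire.lg_log;
DROP TABLE inspire.lg_log;
ALTER TABLE inspire.lg_log_keep RENAME TO lg_log;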

How many predicates are allowed in a WHERE clause?

We are dynamically building a SQL statement in which the WHERE clause will consist of multiple predicates joined together using OR
SELECT cols
FROM t
WHERE (t.id = id1 AND t.PARTITIONDATE = 'yyyy-mm-dd')
OR (t.id = id2 AND t.PARTITIONDATE = 'yyyy-mm-dd')
OR (t.id = id3 AND t.PARTITIONDATE = 'yyyy-mm-dd')
OR (t.id = id4 AND t.PARTITIONDATE = 'yyyy-mm-dd')
etc…
What is the maximum number of conditions that BigQuery allows in such a SQL statement?
I’ve looked at the documentation (https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#where_clause) but the answer is not there.
These are the two limits for you to consider:
Maximum unresolved Standard SQL query length - 1 MB - An unresolved Standard SQL query can be up to 1 MB long. If your query is longer, you receive the following error: The query is too large. To stay within this limit, consider replacing large arrays or lists with query parameters.
and
Maximum resolved legacy and Standard SQL query length - 12 MB - The limit on resolved query length includes the length of all views and wildcard tables referenced by the query.
You can easily experiment with how many predicates you can use. For example, I just ran a very quick and simplified experiment and was able to use 50K predicates joined together with OR, using the super-simplified and totally dummy script below:
execute immediate (
  select 'select 1 from (select 1) where ' || string_agg('1 = ' || cast(num as string), ' or ')
  from unnest(generate_array(1, 50000)) num
)
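As the 1 MB limit note suggests, another way to avoid an enormous predicate list altogether is to pass the values as query parameters and filter with IN UNNEST. A rough sketch, assuming for simplicity that all ids share one partition date and that the parameters are named ids and pdate:
SELECT cols
FROM t
WHERE t.id IN UNNEST(@ids)
AND t.PARTITIONDATE = @pdate
The parameter values are then supplied by the client (bq CLI or the client libraries) rather than spliced into the query text, so a long list of values does not inflate the query itself.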

Calculation based on values in 2 different rows

I have a table in MS Access which has stock prices arranged like
Ticker1, 9:30:00, $49.01
Ticker1, 9:30:01, $49.08
Ticker2, 9:30:00, $102.02
Ticker2, 9:30:01, $102.15
and so on.
I need to do a calculation where I compare the price in one row with the immediately preceding price (and if the price movement is greater than X% in one second, I need to report the instance separately).
If I were doing this in Excel, it would be a fairly simple formula, but I have a few million rows of data, so that's not an option.
Any suggestions on how I could do it in MS Access?
I am open to any kind of solutions (with or without SQL or VBA).
Update:
I ended up trying to traverse my records using an ADODB.Recordset in nested loops; the code is below. I thought it was a good idea, and the logic worked for a small table (20k rows). But when I ran it on a larger table (3m rows), Access ballooned to the 2GB limit without finishing the task (because of temporary tables; the size of the original table was more like ~300MB). Posting it here in case it helps someone with smaller data sets.
Do While Not rstTickers.EOF
    myTicker = rstTickers!ticker
    rstDates.MoveFirst
    Do While Not rstDates.EOF
        myDate = rstDates!Date_Only
        strSql = "select * from Prices where ticker = """ & myTicker & """ and Date_Only = #" & myDate & "#" 'get all prices for a given ticker for a given date
        rst.Open strSql, cn, adOpenKeyset, adLockOptimistic 'I needed to do this to open in editable mode
        rst.MoveFirst
        sPrice1 = rst!Open_Price
        rst!Row_Num = i
        rst.MoveNext
        Do While Not rst.EOF
            i = i + 1
            rst!Row_Num = i
            rst!Previous_Price = sPrice1
            sPrice2 = rst!Open_Price
            rst!Price_Move = Round(Abs((sPrice2 / sPrice1) - 1), 6)
            sPrice1 = sPrice2
            rst.MoveNext
        Loop
        i = i + 1
        rst.Close
        rstDates.MoveNext
    Loop
    rstTickers.MoveNext
Loop
If the data is always one second apart without any milliseconds, then you can join the table to itself on the Ticker ID and on the time, offset by one second, as in the sketch below.
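Something along these lines, as a rough sketch in Access SQL using the table and column names that appear in your code (Prices, Ticker, Time_Date, Open_Price):
SELECT cur.Ticker, cur.Time_Date, cur.Open_Price, prev.Open_Price AS Previous_Price
FROM Prices AS cur, Prices AS prev
WHERE cur.Ticker = prev.Ticker
AND cur.Time_Date = DateAdd("s", 1, prev.Time_Date);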
Otherwise, if there is no sequence counter of some sort to join on, then you will need to create one. You can do this by doing a "ranking" query. There are multiple approaches to this. You can try each and see which one works the fastest in your situation.
One approach is to use a subquery that returns the number of rows that come before the current row. Another approach is to join the table to itself on all the rows before it and do a GROUP BY and COUNT. Both approaches produce the same results, but depending on the nature of your data, how it's structured, and what indexes you have, one approach will be faster than the other.
Once you have a "rank column", you do the procedure described in the first paragraph, but instead of joining on an offset of time, you join on an offset of rank.
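For illustration, here is a sketch of the subquery-based ranking followed by the rank-offset join, again against the Prices table from the question; RankedPrices is a made-up name for the ranking query saved as its own Access query:
SELECT p1.Ticker, p1.Time_Date, p1.Open_Price,
  (SELECT COUNT(*)
   FROM Prices AS p2
   WHERE p2.Ticker = p1.Ticker
   AND p2.Time_Date < p1.Time_Date) AS Rank_Num
FROM Prices AS p1;
Save that as RankedPrices, then join it to itself on an offset of one rank:
SELECT cur.Ticker, cur.Time_Date, cur.Open_Price, prev.Open_Price AS Previous_Price
FROM RankedPrices AS cur, RankedPrices AS prev
WHERE cur.Ticker = prev.Ticker
AND cur.Rank_Num = prev.Rank_Num + 1;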
I ended up moving my data to SQL Server (which had its own issues). I added a row number column (Row_Num) like this:
ALTER TABLE Prices ADD Row_Num INT NOT NULL IDENTITY (1,1)
It worked for me (I think) because my underlying data was already in the order I needed it to be in. I've read enough comments saying you shouldn't rely on this, because you don't know what order the server stores the data in.
Anyway, after that it was a join of the table to itself. It took me a while to figure out the syntax (I am new to SQL). Adding the SQL here for reference (it works on SQL Server but not Access).
Update A Set Previous_Price = B.Open_Price
FROM Prices A INNER JOIN Prices B
ON A.Date_Only = B.Date_Only
WHERE ((A.Ticker=B.Ticker) AND (A.Row_Num=B.Row_Num+1));
BTW, I first had to add the column Date_Only like this (works in Access but not SQL Server):
UPDATE Prices SET Prices.Date_Only = Format([Time_Date],"mm/dd/yyyy");
I think the row-numbering solution described by @Rabbit should work better (broadly speaking); I just haven't had the time to try it out. It took me a whole day to get this far.

Access - Match Column Range / Complex Join

How do we match columns based on a condition of closeness to a value?
This requires a complex query with range comparisons and multiple join conditions.
I am getting a "Query size exceeded 2GB" error.
Tables:
InvDetails1 / InvDetails2 / INVDL / ExpectedResult
Field relations:
InvDetails1.F1 = InvDetails2.F3
InvDetails2.F5 = INVDL.F1
INVDL.DLID = ExpectedResult.DLID
ExpectedResult.Total - 1 < InvDetails1.F6 < ExpectedResult.Total + 1
left(InvDetails1.F21,10) = '2013-03-07'
Return results where the number of records from ExpectedResult is only 1.
Grouping by InvDetails1.F1 with count(ExpectedResult.DLID) works.
From this result, the final result should be:
InvDetails1.F1 , InvDetails1.F16 , ExpectedResult.DLID , ExpectedResult.NMR
ExpectedResult has millions of rows; the InvDetails tables have a few hundred thousand.
If I were in that situation and found that my query was "hitting a wall" at 2GB, one thing I would try would be to create a separate saved Select query that isolates the InvDetails1 records for just the specific date in question. I would then use that query instead of the full InvDetails1 table when joining to the other tables.
My reasoning is that the query optimizer may not be able to use your left(InvDetails1.F21,10) = '2013-03-07' condition to exclude InvDetails1 records early in the execution plan, possibly causing the query to grow much larger than it really needs to (internally, while it is being processed). Forcing the date selection to the beginning of the process by putting it in a separate (prerequisite) query may keep the size of the "main" query down to a more feasible size.
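Concretely, the prerequisite query could be as simple as the following sketch (InvDetails1_ForDate is just a made-up name for the saved query):
SELECT *
FROM InvDetails1
WHERE Left(InvDetails1.F21, 10) = '2013-03-07';
Save it as InvDetails1_ForDate and then join InvDetails1_ForDate, rather than InvDetails1 itself, to InvDetails2, INVDL, and ExpectedResult in the main query.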
Also, if I found myself in the situation where my queries were getting that big I would also keep a watchful eye on the size of my .accdb (or .mdb) file to ensure that it does not get too close to 2GB. I've never had it happen myself, but I've heard that database files that hit the 2GB barrier can result in some nasty errors and be rather "challenging" to recover.

Rails SQL query optimization

I have a featured section in my website that contains featured posts of three types: normal, big and small. Currently I am fetching the three types in three separate queries like so:
@featured_big_first = Post.visible.where(pinged: 1).where('overlay_type =?', :big).limit(1)
@featured_big_first = Post.visible.where(pinged: 1).where('overlay_type =?', :small).limit(1)
@featured_big_first = Post.visible.where(pinged: 1).where('overlay_type =?', :normal).limit(5)
Basically I am looking for a query that will combine those three in to one and fetch 1 big, 1 small, 5 normal posts.
I'd be surprised if you don't want an ordering. As you have it, it will find one random small post, one random big post, and 5 random normal posts.
Yes, you can use a UNION, but you will have to execute raw SQL. Look in the log for the SQL generated by each of your three queries, then execute a SQL string consisting of those three queries with UNION in between. It might work, or it might have problems with the LIMIT.
It is also possible in a single SQL query by joining the table to itself, grouping by one of the aliases, adding a WHERE clause that compares the other alias to the grouped alias, and adding a HAVING clause that requires the count of the compared alias to be under the limit.
So, if you had a simple query of the posts table (without the visible and pinged conditions) and wanted the records with the latest created_at date, then the normal query would be:
SELECT posts1.*
FROM posts posts1, posts posts2
WHERE posts2.created_at >= posts1.created_at
AND posts1.overlay_type = 'normal'
AND posts2.overlay_type = 'normal'
GROUP BY posts1.id
HAVING count(posts2.id) <= 5
Take this SQL, and add your conditions for visible and pinged, remembering to use the condition for both posts1 and posts2.
Then write the big and small versions and UNION it all together.
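If you go that route, the overall shape would look roughly like the sketch below, repeating the pattern above once per overlay_type (the visible and pinged conditions from your scopes would still need to be added to every branch):
SELECT posts1.*
FROM posts posts1, posts posts2
WHERE posts2.created_at >= posts1.created_at
AND posts1.overlay_type = 'big'
AND posts2.overlay_type = 'big'
GROUP BY posts1.id
HAVING count(posts2.id) <= 1
UNION
SELECT posts1.*
FROM posts posts1, posts posts2
WHERE posts2.created_at >= posts1.created_at
AND posts1.overlay_type = 'small'
AND posts2.overlay_type = 'small'
GROUP BY posts1.id
HAVING count(posts2.id) <= 1
UNION
SELECT posts1.*
FROM posts posts1, posts posts2
WHERE posts2.created_at >= posts1.created_at
AND posts1.overlay_type = 'normal'
AND posts2.overlay_type = 'normal'
GROUP BY posts1.id
HAVING count(posts2.id) <= 5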
I'd stick with the three database calls.
I don't think this is possible in a single ActiveRecord call, but you can use a scope, which is a more Rails-like way to write the code.
Also, it may just be a typo, but you are reassigning @featured_big_first, so it will contain the data of the last query only.
in post.rb
scope :overlay_type_record, lambda { |type| joins(:visible).where('visible.pinged = 1 AND visible.overlay_type = ?', type) }
and in controller
@featured_big_first = Post.overlay_type_record(:big).limit(1)
@featured_small_first = Post.overlay_type_record(:small).limit(1)
@featured_normal_first = Post.overlay_type_record(:normal).limit(5)