We are dynamically building a SQL statement in which the WHERE clause will consist of multiple predicates joined together using OR:
SELECT cols
FROM t
WHERE (t.id = id1 AND t.PARTITIONDATE = 'yyyy-mm-dd')
OR (t.id = id2 AND t.PARTITIONDATE = 'yyyy-mm-dd')
OR (t.id = id3 AND t.PARTITIONDATE = 'yyyy-mm-dd')
OR (t.id = id4 AND t.PARTITIONDATE = 'yyyy-mm-dd')
etc…
What is the maximum number of conditions that BigQuery allows in such a SQL statement?
I’ve looked at the documentation (https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#where_clause) but the answer is not there.
These are two factors for you to consider:
Maximum unresolved Standard SQL query length - 1 MB - An unresolved Standard SQL query can be up to 1 MB long. If your query is longer, you receive the following error: The query is too large. To stay within this limit, consider replacing large arrays or lists with query parameters.
and
Maximum resolved legacy and Standard SQL query length - 12 MB - The limit on resolved query length includes the length of all views and wildcard tables referenced by the query.
You can easily experiment with how many predicates you can use. For example, I just did a very quick and simplified experiment and was able to use 50K predicates joined together with OR, using the super simplified and totally dummy script below:
execute immediate (
select 'select 1 from (select 1) where ' || string_agg('1 = ' || num, ' or ')
from unnest(generate_array(1,50000)) num
)
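As the 1 MB limit documentation quoted above suggests, one way to keep the statement small is to pass the (id, date) pairs as a query parameter instead of inlining them. A minimal sketch of that idea, assuming a hypothetical named parameter @pairs of type ARRAY<STRUCT<id INT64, d DATE>> supplied by the client, and assuming PARTITIONDATE is a DATE column:

-- Sketch only: @pairs is a hypothetical query parameter of type
-- ARRAY<STRUCT<id INT64, d DATE>>; adjust names and types to your schema.
SELECT cols
FROM t, UNNEST(@pairs) AS p
WHERE t.id = p.id
  AND t.PARTITIONDATE = p.d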
Related
This might be a novice question – I'm still learning. I'm on PostgreSQL 9.6 with the following query:
SELECT locales, count(locales) FROM (
SELECT lower((regexp_matches(locale, '([a-z]{2,3}(-[a-z]{2,3})?)', 'i'))[1])
AS locales FROM users)
AS _ GROUP BY locales
My query returns the following dynamic rows:
locales                          count
--------------------------------------
en                               10
fr                               7
de                               3
n additional locales (~300)...   n-count
I'm trying to rotate it so that locale values end up as columns with a single row, like this:
en   fr   de   n additional locales (~300)...
10   7    3    n-count
I'm having to do this to play nice with a time-series db/app.
I've tried using crosstab(), but all the examples show better defined tables with 3 or more columns.
I've looked at examples using join, but I can't figure out how to do it dynamically.
Base query
In Postgres 10 or later you could use the simpler and faster regexp_match() instead of regexp_matches(). (Since you only take the first match per row anyway.) But don't bother and use the even simpler substring() instead:
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}' -- eliminate NULL, allow index support
GROUP BY 1
ORDER BY 2 DESC, 1
Simpler and faster than your original base query.
About those ordinal numbers in GROUP BY and ORDER BY:
Select first row in each GROUP BY group?
Subtle difference: regexp_matches() returns no row for no match, while substring() returns null. I added a WHERE clause to eliminate non-matches a priori, and to allow index support if applicable, but I don't expect indexes to help here.
Note the prefixed (?i), that's a so-called "embedded option" to use case-insensitive matching.
Added a deterministic ORDER BY clause. You'd need that for a simple crosstab().
Aside: you might need _ in the pattern instead of - for locales like "en_US".
Pivot
Try as you might, SQL does not allow dynamic result columns in a single query. You need two round trips to the server. See:
How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
You can use a dynamically generated crosstab() query. Basics:
PostgreSQL Crosstab Query
Dynamic query:
PostgreSQL convert columns to rows? Transpose?
But since you generate a single row of plain integer values, I suggest a simple approach:
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM (
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}'
GROUP BY 1
ORDER BY 2 DESC, 1
) t
Generates a query of the form:
SELECT 10 AS en, 7 AS fr, 3 AS de, 3 AS "de-at"
Execute it to produce your desired result.
In psql you can append \gexec to the generating query to feed the generated SQL string back to the server immediately. See:
My function returned a string. How to execute it?
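For reference, a sketch of what that looks like in psql, using the same generating query as above, terminated with \gexec instead of a semicolon:

SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM (
   SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
        , count(*)::int AS ct
   FROM   users
   WHERE  locale ~* '[a-z]{2,3}'
   GROUP  BY 1
   ORDER  BY 2 DESC, 1
   ) t
\gexec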
I have a dataset within BigQuery with roughly 1000 tables, one for each variable. Each table contains two columns: observation_number and variable_name. Please note that the variable_name column is actually named after the variable it holds. Each table contains at least 20000 rows. What is the best way to merge these tables on the observation number?
I have developed a Python code that is going to run on a Cloud Function and it generates the SQL query to merge the tables. It does that by connecting to the dataset and looping through the tables to get all of the table_ids. However, the query ends up being too large and the performance is not that great.
Here is a sample of the Python code that generates the query (mind that it's still running locally, not yet in a Cloud Function).
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set project_id and dataset_id.
project_id = 'project-id-gcp'
dataset_name = 'sample_dataset'
dataset_id = project_id + '.' + dataset_name

dataset = client.get_dataset(dataset_id)

# View tables in dataset
tables = list(client.list_tables(dataset))  # API request(s)
table_names = []
if tables:
    for table in tables:
        table_names.append(table.table_id)
else:
    print("\tThis dataset does not contain any tables.")

query_start = "select " + table_names[0] + ".observation," + table_names[0] + "." + table_names[0]
query_select = ""
query_from_select = "(select observation," + table_names[0] + " from `" + dataset_name + "." + table_names[0] + "`) " + table_names[0]
for table_name in table_names:
    if table_name != table_names[0]:
        query_select = query_select + "," + table_name + "." + table_name
        query_from_select = query_from_select + " FULL OUTER JOIN (select observation," + table_name + " from `" + dataset_name + "." + table_name + "`) " + table_name + " on " + table_names[0] + ".observation=" + table_name + ".observation"
query_from_select = " from (" + query_from_select + ")"
query_where = " where " + table_names[0] + ".observation IS NOT NULL"
query_order_by = " order by observation"
query_full = query_start + query_select + query_from_select + query_where + query_order_by

with open("query.sql", "w") as f:
    f.write(query_full)
And this is a sample of the generated query for two tables:
select
VARIABLE1.observation,
VARIABLE1.VARIABLE1,
VARIABLE2.VARIABLE2
from
(
(
select
observation,
VARIABLE1
from
`sample_dataset.VARIABLE1`
) VARIABLE1 FULL
OUTER JOIN (
select
observation,
VARIABLE2
from
`sample_dataset.VARIABLE2`
) VARIABLE2 on VARIABLE1.observation = VARIABLE2.observation
)
where
VARIABLE1.observation IS NOT NULL
order by
observation
As the number of tables grows, this query gets larger and larger. Any suggestions on how to improve the performance of this operation? Any other way to approach this problem?
I don't know if there is a great technical answer to this question. It seems like you are trying to do a huge # of joins in a single query, and BQ's strength is not realized with many joins.
While I outline a potential solution below, have you considered if/why you really need a table with 1000+ potential columns? Not saying you haven't, but there might be alternate ways to solve your problem without creating such a complex table.
One possible solution is to subset your joins/tables into more manageable chunks.
If you have 1000 tables for example, run your script against smaller subsets of your tables (2/5/10/etc) and write those results to intermediate tables. Then join your intermediate tables. This might take a few layers of intermediate tables depending on the size of your sub-tables. Basically, you want to minimize (or make reasonable) the number of joins in each query. Delete the intermediate tables after you are finished to help with unnecessary storage costs.
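A rough sketch of that idea in SQL, using hypothetical chunk table names (in practice your Python script would generate one such statement per chunk):

-- Hypothetical names throughout: merge one small chunk of variable tables
-- into an intermediate table...
CREATE OR REPLACE TABLE `sample_dataset.merged_chunk_01` AS
SELECT
  COALESCE(v1.observation, v2.observation) AS observation,
  v1.VARIABLE1,
  v2.VARIABLE2
FROM `sample_dataset.VARIABLE1` v1
FULL OUTER JOIN `sample_dataset.VARIABLE2` v2
  ON v1.observation = v2.observation;

-- ...then join the much smaller set of intermediate tables in a later layer...
CREATE OR REPLACE TABLE `sample_dataset.merged_all` AS
SELECT *
FROM `sample_dataset.merged_chunk_01`
FULL OUTER JOIN `sample_dataset.merged_chunk_02` USING (observation);

-- ...and drop the intermediate tables once you are finished.
DROP TABLE `sample_dataset.merged_chunk_01`;
DROP TABLE `sample_dataset.merged_chunk_02`;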
I have a table with a metric column that can hold different possible metrics, such as:
Metric Value
--------------
A 100
B 200
C 300
I want to derive another table from this base table that may have rows like:
Metric Value
--------------
A 100
B 200
C 300
C/A 3
B/A 2
Basically keeping original rows as is + adding some new rows based on existing value's combinations.
One way I could think of doing this is:
1. Pivot the data
2. Put it in some temp table or CTE
3. Select all existing metric columns + New calculated columns I need
4. unpivot the output of the last step
Is there a better way to achieve this with SQL? Or perhaps any other possible way?
Also, Redshift doesn't support the PIVOT function; is there a workaround for that other than using CASE statements?
You could join the table with itself and apply the operation on the pairs of metrics you like. And UNION ALL the table as it is to include the original metrics.
One possibility for your example would be (assuming Postgres):
SELECT metric,
value
FROM metrics
UNION ALL
SELECT concat(m1.metric, '/', m2.metric),
m1.value / m2.value
FROM metrics m1
CROSS JOIN metrics m2
WHERE (m1.metric, m2.metric) IN (('C', 'A'), ('B', 'A'));
SQL Fiddle
Of course this could be extended to ternary, ... operations by adding another join and several different operations by adding other queries and UNIONing them.
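For instance, a hypothetical ternary metric such as (B + C) / A could be added as one more UNION ALL branch; a sketch against the same metrics table:

SELECT '(B+C)/A' AS metric,
       (b.value + c.value) / a.value AS value
FROM metrics a, metrics b, metrics c
WHERE a.metric = 'A'
  AND b.metric = 'B'
  AND c.metric = 'C';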
select
case when x1.metric = x2.metric
then x1.metric
else x1.metric || ' / ' || x2.metric end,
case when x1.metric = x2.metric
then x1.value
else x1.value / x2.value end
from mytable x1
join mytable x2
on x1.metric = x2.metric or x2.metric = 'A'
This is one way to do it, and it's using purely standard SQL. Note, however, that different RDBMS software has different levels of standards conformance and may not support some of the features used here. Specifically, the string concatenation operator || isn't implemented in all databases. Some databases use the function concat or + instead.
The following query does not work in Postgres 9.4.5.
SELECT * FROM (
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M, METRICATTRIBUTES AS A
WHERE M.NAME=A.NAME AND A.ISSTRING='FALSE'
) AS S1
WHERE CAST(S1.V AS NUMERIC)<0
I get an error like:
invalid input syntax for type numeric: "astringvalue"
Read on to see why I made the query this overly complicated.
METRICS is a table of metric, value pairs. The values are stored as strings, and some of the values of the VALUE field are, in fact, strings. The METRICATTRIBUTES table identifies those metric names which may have string values. I populated the METRICATTRIBUTES table from an analysis of the METRICS table.
To check, I ran...
SELECT * FROM (
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M, METRICATTRIBUTES AS A
WHERE M.NAME=A.NAME AND A.ISSTRING='FALSE'
) AS S1
WHERE S1.V LIKE 'a%'
This returns no values (like I would expect). The error seems to come from the execution plan, which looks something like this (sorry, I had to fat-finger this):
1 -> HASH JOIN
2      HASH COND: ((M.NAME)::TEXT = (A.NAME)::TEXT)
3      -> SEQ SCAN ON METRICS M
4           FILTER: ((VALUE)::NUMERIC < 0::NUMERIC)
5      -> HASH
6           -> SEQ SCAN ON METRICATTRIBUTES A
7                FILTER: (NOT ISSTRING)
I am not an expert on this (only 1 week of Postgres experience) but it looks like Postgres is trying to apply the cast (line 4) before it applies the join condition (line 2). By doing this, it will try to apply the cast to invalid string values which is precisely what I am trying to avoid!
Writing this with an explicit join did not make any difference. Writing it as a single select statement was my first attempt, never expecting this type of problem. That also did not work.
Any ideas?
As you can see from your plan, the table METRICS is being scanned in full (Seq Scan) and filtered with your condition CAST(S1.V AS NUMERIC) < 0; the join does not limit the scope at all.
Obviously, you have some rows that contain non-numeric data in METRICS.VALUE.
Check your table for such rows like this:
SELECT * FROM METRICS
WHERE NOT VALUE ~ '^([0-9.,e-])*$'
Note, that it is difficult to catch all possible combinations with regular expression, therefore check out this related question: isnumeric() with PostgreSQL
VALUE is not a good name for the column, as this word is a reserved one.
Edit: If you're absolutely sure that the joined tables will produce the wanted VALUEs, then you can use CTEs, which act as an optimization fence in PostgreSQL:
WITH S1 AS (
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M
JOIN METRICATTRIBUTES AS A USING (NAME)
WHERE A.ISSTRING='FALSE'
)
SELECT *
FROM S1
WHERE CAST(S1.V AS NUMERIC)<0;
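Another option is to make the cast itself conditional, so the order of evaluation no longer matters. A sketch, assuming the same tables and a deliberately simple numeric pattern (CASE is not an absolute guarantee against early evaluation in every situation, but it covers this kind of per-row check):

-- Only attempt the cast for values that look numeric; everything else is false.
SELECT M.NAME, M.VALUE AS V
FROM METRICS AS M
JOIN METRICATTRIBUTES AS A USING (NAME)
WHERE A.ISSTRING = 'FALSE'
  AND CASE WHEN M.VALUE ~ '^-?[0-9]+(\.[0-9]+)?$'
           THEN M.VALUE::numeric < 0
           ELSE false
      END;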
I'm working with a database, where one of the fields I extract is something like:
1-117 3-134 3-133
Each of these number sets represents a different set of data in another table. Taking 1-117 as an example, 1 = equipment ID, and 117 = equipment settings.
I have another table from which I need to extract data based on the previous field. It has two columns that split equipment ID and settings. Essentially, I need a way to go from the queried column 1-117 and run a query to extract data from another table where 1 and 117 are two separate corresponding columns.
So, is there anyway to split this number to run this query?
Also, how would I split those three numbers (1-117 3-134 3-133) into three different query sets?
The tricky part here is that this column can have any number of sets here (such as 1-117 3-133 or 1-117 3-134 3-133 2-131).
I'm creating these queries in a stored procedure as part of a larger document to display the extracted data.
Thanks for any help.
Since you didn't provide the DB vendor, here are two posts that answer this question for SQL Server and Oracle respectively...
T-SQL: Opposite to string concatenation - how to split string into multiple records
Splitting comma separated string in a PL/SQL stored proc
And if you're using some other DBMS, go search for "splitting text ". I can almost guarantee you're not the first one to ask, and there's answers for every DBMS flavor out there.
Since you said the format is constant, though, you could also do something simpler using a SUBSTRING function.
EDIT in response to OP comment...
Since you're using SQL Server, and you said that these values are always in a consistent format, you can do something as simple as using SUBSTRING to get each part of the value and assign them to T-SQL variables, where you can then use them to do whatever you want, like using them in the predicate of a query.
Assuming that what you said is true about the format always being #-### (exactly 1 digit, a dash, and 3 digits) this is fairly easy.
WITH EquipmentSettings AS (
SELECT
S.*,
-- each "#-###" set occupies 6 characters including the separating space
Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 5, 1)) EquipmentID,
Convert(int, Substring(S.AwfulMultivalue, V.Value * 6 - 3, 3)) Settings
FROM
SourceTable S
INNER JOIN master.dbo.spt_values V
ON V.Value BETWEEN 1 AND (Len(S.AwfulMultivalue) + 1) / 6
WHERE
V.type = 'P'
)
SELECT
E.Whatever,
D.Whatever
FROM
EquipmentSettings E
INNER JOIN DestinationTable D
ON E.EquipmentID = D.EquipmentID
AND E.Settings = D.Settings
In SQL Server 2005+ this query will support 1365 values in the string.
If the length of the digits can vary, then it's a little harder. Let me know.
In case there are no more than 4 sets, you can use PARSENAME to retrieve the result.
Declare @Num varchar(20)
Set @Num = '1-117 3-134 3-133'
Select parsename(replace(@Num, ' ', '.'), 3)
Result :- 1-117
Now again use parsename on the same resultset
Select parsename(replace(parsename(replace(@Num, ' ', '.'), 3), '-', '.'), 1)
Result :- 117
If there are more than 4 values, then use split functions, as sketched below.
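For example, on SQL Server 2016 or later (database compatibility level 130+), a sketch using the built-in STRING_SPLIT together with the same PARSENAME trick; @Num and the column aliases are just illustrative names:

-- Split on spaces, then split each "#-###" token on the dash.
Declare @Num varchar(100) = '1-117 3-134 3-133'

Select
    Cast(parsename(replace(s.value, '-', '.'), 2) as int) as EquipmentID,
    Cast(parsename(replace(s.value, '-', '.'), 1) as int) as Settings
from string_split(@Num, ' ') s
where s.value <> ''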