Simplified example of the problem:
select p.id,
p.name,
-- other columns from joined tables
decode(get_complicated_number(p.id), null, null, 'The number is: ' || get_complicated_number(p.id))
from some_table p
-- join other tables and WHERE clause
It includes a get_complicated_number call, which queries multiple tables. So far I haven't been able to rewrite it as a JOIN that would be as fast and as easy to maintain as the separate function.
Currently the function is called twice whenever its return value is not NULL.
In reality I have an XML generation package that gets the data with a select:
select distinct xmlAgg
(
xmlelement
(
"TestElement",
xmlelement("Id", p.id),
xmlelement("Name", p.name),
-- other elements from joined tables
decode(get_complicated_number(p.id), null, null, xmlelement("ComplicatedNum", get_complicated_number(p.id)))
)
)
from some_table p
-- join other tables and WHERE clause
Is there a way to make it only one call and still avoid creating an empty element on NULL?
You can use the WITH syntax (a common table expression), moving the function call into the CTE so it is written only once:
with complicated_number as (
  select p.id, p.name, get_complicated_number(p.id) as num
  from some_table p
)
select distinct xmlAgg
(
  xmlelement("TestElement",
    -- other elements as in the question
    decode(cn.num, null, null, xmlelement("ComplicatedNum", cn.num)))
)
from complicated_number cn
A common table expression (CTE) is a named temporary result set that exists within the scope of a single statement and can be referred to later within that statement, possibly multiple times.
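A minimal sketch of the "possibly multiple times" part, reusing the names from this question:
with cn as (
  select get_complicated_number(p.id) as num
  from some_table p
)
select count(*) from cn
union all
select count(*) from cn where num is not null;
Both branches read the same named result set, so its definition is written only once.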
user7294900's answer is good, but if it's hard to combine with your existing joins, here's an alternate version with an inline view instead of a CTE.
select distinct xmlAgg
(
xmlelement
(
"TestElement",
xmlelement("Id", p2.id),
xmlelement("Name", p2.name),
-- other elements from joined tables
decode(p2.num, null, null, xmlelement("ComplicatedNum", p2.num))
)
)
from (
select p.id, p.name, get_complicated_number(p.id) as num
from some_table p
) p2
-- join other tables to p2, or put them inside the inline view.
If you want help with adding your existing joins to these example queries, you might need to edit your question and add your other tables and WHERE clauses.
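For example, joining a hypothetical other_table to the inline view might look like this (the table name and join column are placeholders):
select p2.id, p2.name, p2.num
from (
  select p.id, p.name, get_complicated_number(p.id) as num
  from some_table p
) p2
join other_table o
  on o.id = p2.id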
Related
I can implement one nested SELECT statement in the FROM clause like this and it works:
SELECT [table1].[ID] as [table_1_ID]
FROM
(SELECT [ID] FROM [the_table_1] WHERE [address] like 'street5' ) as [table1]
Ideally, I wish to add multiple similar nested SELECT statements inside the FROM clause without using a join, to end up with something like this (the following code obviously doesn't work):
SELECT [nested_selects].[table1].[ID] as [table_1_ID],
[nested_selects].[table2].[ID] as [table_2_ID]
FROM (
(SELECT [ID] FROM [the_table_1] WHERE [address] like 'street5' ) as [table1],
(SELECT [ID] FROM [the_table_2] WHERE [address] like 'street5' ) as [table2]
) as [nested_selects]
(From each source table I need only a single value; the WHERE clause takes care of that.)
I know how to do it with JOIN, but for some reason I wish to do it without JOIN.
Is such a thing possible in SQL Server?
If you can guarantee that only a single value will be returned from each subquery, you can nest the SELECTs inside the SELECT list. You don't need a FROM clause at all (I have replaced your LIKE with =, since the pattern contains no wildcard):
select
table_1_id = (SELECT [ID] FROM [the_table_1] WHERE [address] = 'street5' ),
table_2_id = (SELECT [ID] FROM [the_table_2] WHERE [address] = 'street5' );
You don't technically have to enforce this guarantee. But if a subquery in the SELECT happens to return more than one row, SQL Server will throw a "Subquery returned more than 1 value" error. So either put a unique constraint on address, or add something to each sub-select that guarantees only one row is returned.
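For example, an aggregate collapses each subquery to exactly one row even when address is not unique (a sketch; MAX is an arbitrary choice here):
select
    table_1_id = (SELECT MAX([ID]) FROM [the_table_1] WHERE [address] = 'street5'),
    table_2_id = (SELECT MAX([ID]) FROM [the_table_2] WHERE [address] = 'street5');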
I have 3 tables in an Oracle database. FooDBflights, FooDB1 and FooDB2.
I use the "with" operator to create simpler version of these tables named, FLIGHTS, MESSAGES1 and MESSAGES2.
I want to make a select statement that returns one table. FLIGHTS joined with the union of MESSAGES1 and MESSAGES2
Here is my SQL statement.
WITH FLIGHTS AS (
SELECT DISTINCT id,ARCADDR,CALLSIGN,trunc(FIRSTTIMEENTRY)
AS DOE FROM FooDBflights
WHERE FIRSTTIMEENTRY IS NOT null),
MESSAGES1 AS (
SELECT DISTINCT flightID,AIRCRAFTADDRESS,trunc(SYS_DATETIME)
FROM FooDB1
WHERE AIRCRAFTADDRESS!=' ' ),
MESSAGES2 AS (
SELECT DISTINCT flightID,AIRCRAFTADDRESS,trunc(SYS_DATETIME)
FROM FooDB2
WHERE AIRCRAFTADDRESS!=' ' )
SELECT a.*,b.*,substr(b.AIRCRAFTADDRESS, 3)
FROM FLIGHTS a
LEFT JOIN MESSAGES1 b
ON a.callsign=trim(b.flightid)
AND trim(a.arcaddr)=substr(UPPER (b.AIRCRAFTADDRESS), 3)
This query returns FLIGHTS and MESSAGES1 joined perfectly, but I can't figure out how to make the union between MESSAGES1 and MESSAGES2. How can I do this?
It would be easiest to perform the union in the WITH clause:
Instead of
MESSAGES1 AS ...
MESSAGES2 AS ...
Do:
MESSAGES AS (
SELECT flightID,AIRCRAFTADDRESS,trunc(SYS_DATETIME) AS DT
FROM FooDB1
WHERE AIRCRAFTADDRESS!=' '
UNION
SELECT flightID,AIRCRAFTADDRESS,trunc(SYS_DATETIME)
FROM FooDB2
WHERE AIRCRAFTADDRESS!=' ' )
... and reference that one in your main SELECT query.
Note that DISTINCT is no longer needed, since UNION already removes duplicates.
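Putting it together, the main query from the question would then read (a sketch that simply reuses the question's join conditions against the merged CTE):
SELECT a.*, b.*, substr(b.AIRCRAFTADDRESS, 3)
FROM FLIGHTS a
LEFT JOIN MESSAGES b
  ON a.callsign = trim(b.flightid)
 AND trim(a.arcaddr) = substr(UPPER(b.AIRCRAFTADDRESS), 3)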
Remark
Using trim() in your join conditions can make the query slow when there is a lot of data, because the database engine cannot use indexes on those columns. It would be better to have foreign keys, enforced as database constraints; then trim() would not be necessary.
Likewise, substr() and upper() can have a negative effect on performance. Consider splitting the AIRCRAFTADDRESS column into two in the table itself, and make sure the text is already uppercase on insertion.
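If restructuring the table is not an option, Oracle's function-based indexes can sometimes help, since an index on the exact expression used in the join lets the optimizer use it (a sketch; the index name is invented):
CREATE INDEX ix_foodb1_addr
  ON FooDB1 (SUBSTR(UPPER(AIRCRAFTADDRESS), 3));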
I have a table of several million strings that I want to match against a table of about twenty thousand strings like this:
#standardSQL
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')
Unfortunately this is taking an awfully long time.
Considering that the fragment table is only 20k records, can I load it into a JavaScript array using a UDF and match it that way? I'm trying to figure out how to do this right now, but perhaps there's already some magic that could make this faster. I tried a CROSS JOIN and got a "resources exceeded" error fairly quickly. I've also tried using EXISTS, but I can't reference record.name inside that subquery's WHERE clause without getting an error.
Example using Public Data
This seems to reflect about the same amount of data ...
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT LOWER(name) AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')
Below is for BigQuery Standard SQL
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT DISTINCT LOWER(name) AS name
FROM `bigquery-public-data.usa_names.usa_1910_current`
), temp_record AS (
SELECT record, TO_JSON_STRING(record) id, name, item
FROM record, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
), temp_fragment AS (
SELECT name, item FROM fragment, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
)
SELECT AS VALUE ANY_VALUE(record) FROM (
SELECT ANY_VALUE(record) record, id, r.name name, f.name fragment_name
FROM temp_record r
JOIN temp_fragment f
USING(item)
GROUP BY id, name, fragment_name
)
WHERE name LIKE CONCAT('%', fragment_name, '%')
GROUP BY id
The above completed in 375 seconds, while the original query was still running at 2,740 seconds, so I did not even wait for it to finish. The speedup comes from joining records to fragments only where they share at least one word (the USING(item) join), which prunes the candidate pairs before the final LIKE check.
Mikhail's answer appears to be faster, but let's have one that doesn't need to SPLIT nor otherwise separate the text into words.
First, compute a regular expression with all the words to be searched:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT FORMAT('(%s)',STRING_AGG(name,'|'))
FROM fragment
Now you can take that resulting string, and use it in a REGEX ignoring case:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), largestring AS (
SELECT '(?i)(mary|margaret|helen|more_names|more_names|more_names|josniel|khaiden|sergi)'
)
SELECT record.* FROM `record`
WHERE REGEXP_CONTAINS(record.name, (SELECT * FROM largestring))
(~510 seconds)
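The two steps can also be combined into one statement by building the pattern in a CTE (a sketch assembled from the same pieces):
#standardSQL
WITH fragment AS (
  SELECT DISTINCT name
  FROM `bigquery-public-data.usa_names.usa_1910_current`
), largestring AS (
  SELECT FORMAT('(?i)(%s)', STRING_AGG(name, '|')) AS pattern
  FROM fragment
), record AS (
  SELECT text AS name
  FROM `bigquery-public-data.hacker_news.comments`
)
SELECT record.*
FROM record
WHERE REGEXP_CONTAINS(record.name, (SELECT pattern FROM largestring))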
As alluded to in my question, I worked on a version using a JavaScript UDF, which solves this albeit more slowly than the answer I accepted. For completeness I'm posting it here, because perhaps someone (like myself in the future) may find it useful.
CREATE TEMPORARY FUNCTION CONTAINS_ANY(str STRING, fragments ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
for (var i in fragments) {
if (str.indexOf(fragments[i]) >= 0) {
return fragments[i];
}
}
return null;
""";
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
WHERE text IS NOT NULL
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name IS NOT NULL
GROUP BY name
), fragment_array AS (
SELECT ARRAY_AGG(name) AS names, COUNT(*) AS count
FROM fragment
GROUP BY LENGTH(name)
), records_with_fragments AS (
SELECT record.name,
CONTAINS_ANY(record.name, fragment_array.names)
AS fragment_name
FROM record INNER JOIN fragment_array
ON CONTAINS_ANY(name, fragment_array.names) IS NOT NULL
)
SELECT * EXCEPT(rownum) FROM (
SELECT record.name,
records_with_fragments.fragment_name,
ROW_NUMBER() OVER (PARTITION BY record.name) AS rownum
FROM record
INNER JOIN records_with_fragments
ON records_with_fragments.name = record.name
AND records_with_fragments.fragment_name IS NOT NULL
) WHERE rownum = 1
The idea is that the list of fragments is small enough to be processed in an array, similar to Felipe's answer using regular expressions. The first thing I do is create a fragment_array table, grouped by fragment length ... a cheap way of preventing an over-sized array, which I found can cause UDF timeouts.
Next I create a table called records_with_fragments that joins those arrays to the original records, finding only those which contain a matching fragment using the JavaScript UDF CONTAINS_ANY(). This will result in a table containing some duplicates since one record may match multiple fragments.
The final SELECT then pulls in the original record table, joins to records_with_fragments to determine which fragment matched, and also uses the ROW_NUMBER() function to prevent duplicates, e.g. only showing the first row of each record as uniquely identified by its name.
Now, the reason I do the join in the final query is because in my actual data there are more fields I want besides just the string being matched. Earlier on in my actual data I create a table of DISTINCT strings which then later need to be re-joined.
Voila! Not the most elegant but it gets the job done.
I have a table that looks like this:
stuff
id integer
content text
score double
children[] (an array of id's from this same table)
I'd like to run a query that selects all the children for a given id, and then right away gets the full row for all these children, sorted by score.
Any suggestions on the best way to do this? I've looked into WITH RECURSIVE but I'm not sure that's workable. Tried posting at postgresql SE with no luck.
The following query will find all rows corresponding to the children of the object with id 14:
SELECT *
FROM unnest((SELECT children FROM stuff WHERE id=14)) t(id)
JOIN stuff USING (id)
ORDER BY score;
This works by first finding the children of 14 as an array, then converting it into a table using the unnest function, and finally joining with stuff to find all rows with the given ids.
The ANY construct in the join condition would be simplest:
SELECT c.*
FROM stuff p
JOIN stuff c ON id = ANY (p.children)
WHERE p.id = 14
ORDER BY c.score;
It doesn't matter for the query whether the array of children IDs is in the same table or a different one; you just need table aliases to keep the references unambiguous.
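For example, with the array stored in a hypothetical separate table parent_table, only the FROM clause changes:
SELECT c.*
FROM parent_table p
JOIN stuff c ON c.id = ANY (p.children)
WHERE p.id = 14
ORDER BY c.score;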
Related:
Check if value exists in Postgres array
Similar solution:
With Postgres you can use a recursive common table expression:
with recursive rel_tree as (
select rel_id, rel_name, rel_parent, 1 as level, array[rel_id] as path_info
from relations
where rel_parent is null
union all
select c.rel_id, rpad(' ', p.level * 2) || c.rel_name, c.rel_parent, p.level + 1, p.path_info||c.rel_id
from relations c
join rel_tree p on c.rel_parent = p.rel_id
)
select rel_id, rel_name
from rel_tree
order by path_info;
Ref: Postgresql query for getting n-level parent-child relation stored in a single table
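Adapted to the stuff table from this question, where children is an array, the same idea would look roughly like this (a sketch combining WITH RECURSIVE with the ANY join shown above; it assumes the data contains no cycles):
WITH RECURSIVE tree AS (
  SELECT s.*, 1 AS level
  FROM stuff s
  WHERE s.id = 14
  UNION ALL
  SELECT c.*, t.level + 1
  FROM tree t
  JOIN stuff c ON c.id = ANY (t.children)
)
SELECT *
FROM tree
WHERE level > 1  -- exclude the starting row itself
ORDER BY score;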
In the code_list CTE in this query I have a row constructor that will eventually take any number of arguments. The column icd in the patient_codes CTE is a five digit identifier that is more descriptive than the three digit codes in the row constructor. The table icd_patient has 100 million rows, so for performance's sake I would like to filter the rows on this table before I do any further work. I have
;with code_list(code_list)
as
(
select x.code_list
from (values ('70700'),('25002')) as x(code_list)
),patient_codes
as
(
select distinct icd,pat_id,id
from icd_patient
where icd in (select icd from code_list)
)
select distinct pat_id from patient_codes
The problem, however, is that in the icd_patient table all of the icd values are five digits and more descriptive. If I look at the execution plan of this query it's pretty streamlined. If I do
;with code_list(code_list)
as
(
select x.code_list
from (values ('70700'),('25002')) as x(code_list)
),patient_codes
as
(
select substring(icd,1,3) as icd,pat_id
from icd_patient2
where substring(icd,1,3) in (select * from code_list)
)
select * from patient_codes
this of course has a large performance impact because of the substring expression in the WHERE clause. Does something akin to a LIKE IN exist so I can take advantage of my indexes?
Index on icd_patient
CREATE NONCLUSTERED INDEX [ix_icd_patient] ON [dbo].[icd_patient2]
(
[pat_id] ASC
)
INCLUDE ( [id],
This much simpler query should be better than (or, at worst, the same as) your existing query.
select pat_id
FROM dbo.icd_patient
where icd LIKE '707%'
OR icd LIKE '250%'
GROUP BY pat_id;
Note that sargability only matters if there is actually an index on this column.
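For reference, an index that would make these prefix searches sargable might look like the sketch below (the name is invented; note that the question's existing index leads on pat_id, so it would not help here):
CREATE NONCLUSTERED INDEX ix_icd_patient_icd
  ON dbo.icd_patient (icd)
  INCLUDE (pat_id);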
An alternative (since OR can sometimes give the optimizer fits):
SELECT pat_id FROM
(
SELECT pat_id
FROM dbo.icd_patient
WHERE icd LIKE '707%'
UNION ALL
SELECT pat_id
FROM dbo.icd_patient
WHERE icd LIKE '250%'
) AS x
GROUP BY pat_id;
To make this extensible beyond a handful of OR conditions, I would use a table-valued parameter (TVP).
CREATE TYPE dbo.StringPatterns AS TABLE(s VARCHAR(3) PRIMARY KEY);
Then your stored procedure could say:
CREATE PROCEDURE dbo.whatever
#sp dbo.StringPatterns READONLY
AS
BEGIN
SET NOCOUNT ON;
SELECT p.pat_id
FROM dbo.icd_patient AS p
INNER JOIN #sp AS sp
ON p.icd LIKE sp.s + '%'
GROUP BY p.pat_id;
END
Then you can pass in your set of three-character substrings from a DataTable or other collection in C#. From T-SQL just as an example:
DECLARE #p dbo.StringPatterns;
INSERT #p VALUES('707'),('250');
EXEC dbo.whatever #sp = #p;
Something like a "LIKE IN" does not exist. The following, however, is sargable:
select *
from icd_patient
where icd like '70700%' or
icd like '25002%'
That's because LIKE with a constant initial substring is a special case for SQL Server. It does not work when the strings on the right-hand side are variables.
One solution is to create an indexed view on the icd_patient table with an index on the first five characters of the icd code.
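A concrete variant of that idea uses a persisted computed column instead of an indexed view (a sketch; column and index names are invented):
ALTER TABLE dbo.icd_patient
  ADD icd5 AS LEFT(icd, 5) PERSISTED;
CREATE INDEX ix_icd_patient_icd5
  ON dbo.icd_patient (icd5);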
Using "IN" makes that part of a command non-sargable on both sides. End of discussion.
Saying he fixes it using substring, completely changes what it would return while it remains non sarged.
Any "fix" should exactly match results. The actual fix is to join the cte so the five characters match or put three characters in the cte and match that in a join or put 4 characters in the cte where the fourth is "%" and join matching by using LIKE
Using a "like" that starts with "%" increases the complexity of the search, but it would still use the index to find the value because parsing the index should use less reading by only getting the full table row when a search is successful.