Google Bigquery use of substr, never returns back results - google-bigquery

I have a table which has two sets of data, one set of data has information like
Type | Name | Id
PackagedDrug |Pseudoephedrine HCl Oral Tablet 120 MG| 110
PackagedDrug |Pseudoephedrine HCl Oral Tablet 60 MG|111
DrugName| Pseudoephedrine HCl| 112
What I want to do is join PackagedDrug with DrugName concepts, so get all Ids for Type PackagedDrug whose Name is matching with Name for Type DrugName. If I hardcode the Name for DrugName in the following query, it runs instantenously, but if I take out the hardcoding then it just keeps on running. Could you please suggest me suitable ways to speed up the big query?
SELECT a.MSC_ID MSC_id, a.MSC_CONcept_type, a.concept_id, a.concept_name , b.concept_name
from
(select MSC_id, MSC_CONcept_type, concept_id, concept_name
FROM [ClientAlerts.MSC_Concepts]
where MSC_CONcept_type in ('MediSpan.Concepts.PackagedDrug') ) a
CROSS JOIN
(select MSC_CONcept_type, concept_id, concept_name , length(concept_name) len
FROM [ClientAlerts.MSC_Concepts]
where MSC_CONcept_type in ('MediSpan.Concepts.NamebasedClassification.DrugName')
-- and concept_name in ('Pseudoephedrine HCl')
) b
where substr(a.concept_name,1,b.len)+' ' = b.concept_name
Thanks,
Savita

This has nothing to do with BigQuery itself. When you hardcode, your values are "filtered" way faster, because it doesn't have to check every row, since it looks for the hardcoded value.
If you don't use the hardcoded value, it will look at WAY more rows, compare ALL the rows from your first query with your second. Honestly, if you describe your use case properly here, I don't think of any way to do this faster.
But one question does come to mind. Why do you have a "type". It seems like it should be two different tables instead.

Related

Redshift SQL result set 100s of rows wide efficiency (long to wide)

Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.cpr.record_id = cpri002.cpr.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.cpr.record_id = cpri003.cpr.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.cpr.record_id = cpri004.cpr.record_id
AND cpri003.record_item_name = 'mediation_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Reshift UNLOAD command to write to disk pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW however, it takes 30 to 40 minutes (not surprisingly) to compile the materialized view also.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer id's to join on and indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to python and process using pandas dataframe which I've done before successfully but it would be nice if I could have an efficient query, just use Redshift UNLOAD and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
EXPLAIN PLAN for the original solution is 2700 lines long, for the new solution using conditional aggregation is 40 lines long.
Thanks guys.
Without some more information it is impossible to know what is going on for sure but what you are doing is likely not ideal. An explanation plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also is record_id unique in fact_patient_record? If not then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied and the decode() syntax could save you space if you don't like case. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values for record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.

SQL concat value as column

I'm trying to convert one string to a valid column.
As you can see I need to get something like 'mo._olddb_uid_'+nu._olddb_name_db that should use the column mo._olddb_uid_001
How can I achieve this:
SELECT *
FROM [PI_CONSOLIDATION].[dbo].[new_unite] nu
LEFT JOIN [PI_CONSOLIDATION].[dbo].[Motif_Orientation] mo ON CONCAT('mo._olddb_uid_',nu._olddb_name_db) = CONCAT(nu._olddb_name_db,'_',nu.id_motif_orientation)
WHERE nu.nom_res = 'TEST' and nu.prenom_res = 'Foobar'
Thanks
EDITED:
I have 18 application each with his DB. Those applications are almost similar with different data but sometimes the data can be found on also on the other sources.
So [new_unite] has the id of patient, id_group and the source database
id_resident id_groupe_res _olddb_name_db
728 31 src1
629 21 src6
731 25 src9
934 12 src18
...
The other table has some parameters that have identical params but with different IDs depending on the DB of the source.
So the [Motif_Orientation] looks like :
So mainly this query is just only to test if the data is stored correctly on the final application where there is just one DB and all the data merged
id_motif_orientation label
1407 Famille
1410 Structures d'hébergement
1422 Etablissement d'Education Spéciale
What you could try to do is make a function with the output being the value you want to left join. Your output should be a single value doesn't matter the type. Then, call it as:
LEFT JOIN
(func_name(param1,param2,...) AS CONCAT(...) FROM DUAL)
ON 1=1
The result should be your value left joined to the current table under whatever column name you need. The only problem is that you will need to do this one by one for each row, so it is only really useful for smaller tables. Wish you gave more info so I could give a better answer but this worked for me when I was having a similar issue with renaming a column.

Get once time a duplicate record (SQL)

SELECT DISTINCT A.LeaseID,
C.SerialNumber,
B.LeasedProjectNumber As 'ProjectNumber',
A.LeaseComment As 'LeaseContractComments'
FROM aLease A
LEFT OUTER JOIN aLeasedAsset B
ON a.LeaseID = B.LeaseID
LEFT OUTER JOIN aAsset C
ON B.LeasedProjectNumber = C.ProjectNumber AND B.PartID = C.aPartid
WHERE A.LeaseComment IS NOT NULL
I got this result from a query statement. But I don't want to get repeated the last column(Comments) for the 3 records in the second column.
I want for the values on the second column write once the repeated comment. Like a Group By
Alright, I'll take a stab at this. It's pretty unclear what exactly you're hoping for, but reading your comments, it sounds like you're looking to build a hierarchy of sorts in your table.
Something like this:
"Lease Terminated Jan 29, 2013 due to the event of..."
216 24914 87
216 724992 87
216 724993 87
"Other potential column"
217 2132 86
...
...
Unfortuantely, I don't believe that that's possible. SQL Server is pretty strict about returning a table, which is two-dimensional by definition. There's no good way to describe a hierarchy such as this in SQL. There is the hierarchyid type, but that's not really relevant here.
Given that, you only really have two options:
My preference 99% of the time, just accept the duplicates. Handle them in your procedural code later on, which probably does have support for these trees. Unless you're dealing with performance-critical situations, or if you're pulling back a lot of data (or really long comments), that should be totally fine.
If you're hoping to print this result directly to the user, or if network performance is a big issue, aggregate your columns into a single record for each comment. It's well-known that you can't have multiple values in the same column, for the very same reason as the above-listed result isn't possible. But what you could do, data and your own preferences permitting, is write an aggregate function to concatenate the grouped results into a single, comma-delimited column.
You'd likely then have to parse those commas out, though, so unless network traffic is your biggest concern, I'd really just do it procedural-side.
SELECT STUFF((SELECT DISTINCT ', ' + SerialNumber
FROM [vLeasedAsset]
WHERE A.LeaseID = LeaseID AND A.ProjectNumber = ProjectNumber
FOR XML PATH (''))
, 1, 1, '') AS SerialNumber, [ProjectNumber],
MAX(ContractComment) 'LeaseContractComment'
FROM [vLeasedAsset] A
WHERE ContractComment != ''
GROUP BY [ProjectNumber], LeaseID
Output:
SerialNumber
24914, 724993
23401, 720356
ProjectNumber
87
91

SQL query to find records with specific prefix

I'm writing SQL queries and getting tripped up by wanting to solve everything with loops instead of set operations. For example, here's two tables (lists, really - one column each); idPrefix is a subset of idFull. I want to select every full ID that has a prefix I'm interested in; that is, every row in idFull which has a corresponding entry in idPrefix.
idPrefix.ID idFull.ID
---------- ----------
12 8
15 12
300 12-1-1
12-1-2
15
15-1
300
Desired result would be everything in idFull except the value 8. Super-easy with a for each loop, but I'm just not conceptualizing it as a set operation. I've tried a few variations on the below; everything seems to return all of one table. I'm not sure if my issue is with how I'm doing joins, or how I'm using LIKE.
SELECT f.ID
FROM idPrefix AS p
JOIN idFull AS f
ON f.ID LIKE (p.ID + '%')
Details:
Values are varchars, prefixes can be any length but do not contain the delimiter '-'.
This question seems similar, but more complex; this one only uses one table.
Answer doesn't need to be fast/optimized/whatever.
Using SQL Server 2008, but am more interested in conceptual understanding than a flavor-specific query.
Aaaaand I'm coming back to both real coding & SO after ~3 years, so sorry if I'm rusty on any etiquette.
Thanks!
You can join the full table to the prefix table with a LIKE
SELECT idFull.ID
FROM idFull full
INNER JOIN idPrefix pre ON full.ID LIKE pre.ID + '%'

Optimizing a strange MySQL Query

Hoping someone can help with this. I have a query that pulls data from a PHP application and turns it into a view for use in a Ruby on Rails application. The PHP app's table is an E-A-V style table, with the following business rules:
Given fields: First Name, Last Name, Email Address, Phone Number and Mobile Phone Carrier:
Each property has two custom fields defined: one being required, one being not required. Clients can use either one, and different clients use different ones based on their own rules (e.g. Client A may not care about First and Last Name, but client B might)
The RoR app must treat each "pair" of properties as only a single property.
Now, here is the query. The problem is it runs beautifully with around 11,000 records. However, the real database has over 40,000 and the query is extremely slow, taking roughly 125 seconds to run which is totally unacceptable from a business perspective. It's absolutely required that we pull this data, and we need to interface with the existing system.
The UserID part is to fake out a Rails-esque foreign key which relates to a Rails table. I'm a SQL Server guy, not a MySQL guy, so maybe someone can point out how to improve this query? They (the business) demand that it be sped up but I'm not sure how to since the various group_concat and ifnull calls are required due to the fact that I need every field for every client and then have to combine the data.
select `ls`.`subscriberid` AS `id`,left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) AS `user_id`,
ifnull(min((case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end)),_utf8'') AS `first_name`,
ifnull(min((case when (`s`.`fieldid` in (3,36)) then `s`.`data` else NULL end)),_utf8'') AS `last_name`,
ifnull(`ls`.`emailaddress`,_utf8'') AS `email_address`,
ifnull(group_concat((case when (`s`.`fieldid` = 81) then `s`.`data` when (`s`.`fieldid` = 154) then `s`.`data` else NULL end) separator ''),_utf8'') AS `mobile_phone`,
ifnull(group_concat((case when (`s`.`fieldid` = 100) then `s`.`data` else NULL end) separator ','),_utf8'') AS `sms_only`,
ifnull(group_concat((case when (`s`.`fieldid` = 34) then `s`.`data` else NULL end) separator ','),_utf8'') AS `mobile_carrier`
from ((`list_subscribers` `ls`
join `lists` `l` on((`ls`.`listid` = `l`.`listid`)))
left join `subscribers_data` `s` on((`ls`.`subscriberid` = `s`.`subscriberid`)))
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
group by `ls`.`subscriberid`,`l`.`name`,`ls`.`emailaddress`
EDIT
I removed the regexp and that sped the query up to about 20 seconds, instead of nearly 120 seconds. If I could remove the group by then it would be faster, but I cannot as removing this causes it to duplicate rows with blank data for each field, instead of aggregating them. For instance:
With group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John Doe jdoe#example.com 5551234567 0 Sprint
Without group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John jdoe#xample.com
1 1 Doe jdoe#example.com
1 1 jdoe#example.com
1 1 jdoe#example.com 5551234567
And so on. What we need is the first result.
EDIT #2
The query still seems to take a long time, but earlier today it was running in only about 20 seconds on the production database. Without changing a thing, the same query is now once again taking over 60 seconds. This is still unacceptable.. any other ideas on how to improve this?
That is, without a doubt, the second most hideous SQL query I have ever laid my eyes on :-)
My advice is to trade storage requirements for speed. This is a common trick used when you find your queries have a lot of per-row functions (ifnull, case and so forth). These per-row functions never scale very well as the table becomes larger.
Create new fields in the table which will hold the values you want to extract and then calculate those values on insert/update (with a trigger) rather than select. This doesn't technically break 3NF since the triggers guarantee data consistency between columns.
The vast majority of database tables are read far more often than they're written so this will amortise the cost of the calculation across many selects. In addition, just about every reported problem with databases is one of speed, not storage.
An example of what I mean. You can replace:
case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end
with:
`s`.`data_2_35`
in your query if your insert/update trigger simply sets the data_2_35 column to data or NULL depending on the value of fieldid. Then you index data_2_35 and, voila, instant speed improvement at the cost of a little storage.
This trick can be done to the five case clauses, the left/regexp bit and the "naked" ifnull function as well (the ifnull functions containing min and group_concat may be harder to do).
The problem is most likely the WHERE condition:
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
This looks like complex string comparison, so no index can be used, which results in a full table scan, possibly for every row in the result set. I am not a MySQL expert, but if you can simplify this into more simple column comparisons it will probably run much faster.
The first thing that jumps out at me as the source of all the trouble:
The PHP app's table is an E-A-V style table...
Trying to convert data in EAV format into conventional relational format on the fly using SQL is bound to be awkward and inefficient. So don't try to smash it into a conventional column-per-attribute format. The following query returns multiple rows per subscriber, one row per EAV attribute:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154)
WHERE SUBSTRING_INDEX(l.name, _utf8'_', 1) REGEXP _utf8'[[:digit:]]+'
This eliminates the GROUP BY which is not optimized well in MySQL -- it usually incurs a temporary table which kills performance.
id user_id email_address fieldid data
1 1 jdoe#example.com 2 John
1 1 jdoe#example.com 3 Doe
1 1 jdoe#example.com 81 5551234567
But you'll have to sort out the EAV attributes in application code. That is, you can't seamlessly use ActiveRecord in this case. Sorry about that, but that's one of the disadvantages of using a non-relational design like EAV.
The next thing that I notice is the killer string manipulation (even after I've simplified it with SUBSTRING_INDEX()). When you're picking substrings out of a column, this says you me that you've overloaded one column with two distinct pieces of information. One is the name and the other is some kind of list-type attribute that you would use to filter the query. Store one piece of information in each column.
You should add a column for this attribute, and index it. Then the WHERE clause can utilize the index:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154)
WHERE l.list_name_contains_digits = 1;
Also, you should always analyze an SQL query with EXPLAIN if it's important for them to have good performance. There's an analogous feature in MS SQL Server, so you should be accustomed to the concept, but the MySQL terminology may be different.
You'll have to read the documentation to learn how to interpret the EXPLAIN report in MySQL, there's too much info to describe here.
Re your additional info: Yes, I understand you can't do away with the EAV table structure. Can you create an additional table? Then you can load the EAV data into it:
CREATE TABLE subscriber_mirror (
subscriberid INT PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
first_name2 VARCHAR(100),
last_name2 VARCHAR(100),
mobile_phone VARCHAR(100),
sms_only VARCHAR(100),
mobile_carrier VARCHAR(100)
);
INSERT INTO subscriber_mirror (subscriberid)
SELECT DISTINCT subscriberid FROM list_subscribers;
UPDATE subscriber_data s JOIN subscriber_mirror m USING (subscriberid)
SET m.first_name = IF(s.fieldid = 2, s.data, m.first_name),
m.last_name = IF(s.fieldid = 3, s.data, m.last_name),
m.first_name2 = IF(s.fieldid = 35, s.data, m.first_name2),
m.last_name2 = IF(s.fieldid = 36, s.data, m.last_name2),
m.mobile_phone = IF(s.fieldid = 81, s.data, m.mobile_phone),
m.sms_only = IF(s.fieldid = 100, s.data, m.sms_only),
m.mobile_carrer = IF(s.fieldid = 34, s.data, m.mobile_carrier);
This will take a while, but you only need to do it when you get a new data update from the vendor. Subsequently you can query subscriber_mirror in a much more conventional SQL query:
SELECT ls.subscriberid AS id, l.name+0 AS user_id,
COALESCE(s.first_name, s.first_name2) AS first_name,
COALESCE(s.last_name, s.last_name2) AS last_name,
COALESCE(ls.email_address, '') AS email_address),
COALESCE(s.mobile_phone, '') AS mobile_phone,
COALESCE(s.sms_only, '') AS sms_only,
COALESCE(s.mobile_carrier, '') AS mobile_carrier
FROM lists l JOIN list_subscribers USING (listid)
JOIN subscriber_mirror s USING (subscriberid)
WHERE l.name+0 > 0
As for the userid that's embedded in the l.name column, if the digits are the leading characters in the column value, MySQL allows you to convert to an integer value much more easily:
An expression like '123_bill'+0 yields an integer value of 123. An expression like 'bill_123'+0 has no digits at the beginning, so it yields an integer value of 0.