Max match same numbers from each row - sql

To generate 1mln rows of report with the below mentioned script is taking almost 2 days so, really appreciate if somebody could help me with different script which the report can be generated within 10-15mins please.
The requirement of the report is as following;
Table “cover” contains 5mln rows & 6 columns of data and likewise table “data” contains 500,000 rows and 6 columns.
So, each numbers of the rows in table cover has to go through table date and provide the maximum matches.
For instance, as mentioned on the below tables, there could be 3 matches in row #1, 2 matches in row #2 and 5 matches in row #3 so the script has to select the max selection which is 5 in row #3.
Sample table
UPDATE public.cover_sheet AS fc
SET maxmatch = (SELECT MAX(tmp.mtch)
FROM (
SELECT (SELECT CASE WHEN fc.a=drwo.a THEN 1 ELSE 0 END) +
(SELECT CASE WHEN fc.b=drwo.b THEN 1 ELSE 0 END) +
(SELECT CASE WHEN fc.c=drwo.c THEN 1 ELSE 0 END) +
(SELECT CASE WHEN fc.d=drwo.d THEN 1 ELSE 0 END) +
(SELECT CASE WHEN fc.e=drwo.e THEN 1 ELSE 0 END) +
(SELECT CASE WHEN fc.f=drwo.f THEN 1 ELSE 0 END) AS mtch
FROM public.data AS drwo
) AS tmp)
WHERE fc.code>0;
SELECT *
FROM public.cover_sheet AS fc
WHERE fc.maxmatch>0;

As #a_horse_with_no_name mentioned in the comment to the question, your question is not clear...
Seems, you want to get the number of records which 6 fields from both tables are equal.
I'd suggest to:
reduce the number of select statements, then the speed of query execution will increase,
split your query into few smaller ones (good practice), to check your logic,
use join to get equal data, see: Visual Representation of SQL Joins
use subquery or cte to get result on which you'll be able to update table.
I think you want to get result as follow:
SELECT COUNT(*) mtch
FROM public.cover_sheet AS fc INNER JOIN public.data AS drwo ON
fc.a=drwo.a AND fc.b=drwo.b AND fc.c=drwo.c AND fc.d=drwo.d AND fc.e=drwo.e AND fc.f=drwo.f
If i'm not wrong and above query is correct, the time of execution of above query will reduce to about 1-2 minutes.
Finally, update query may look like:
WITH qry AS
(
-- proper select statement here
)
UPDATE public.cover_sheet AS fc
SET maxmatch = qry.<fieldname>
FROM qry
WHERE fc.code>0 AND fc.<key> = qry.<key>;
Note:
I do not see your data and i know nothing about its structure, relationships, etc. So, you have to change above query to your needs.

Related

Solution for SQL conditional query

I've been tasked with coming up with a solution for a problem that was found this morning. I have a query that I need to do some math with. I have three pertinent columns.
SELECT lQ.[QUANTITY], lQ.[FORM_FACTOR_ID], oQ.[INDIVIDUAL_PACKAGING]
FROM [dbo].[AOF_ORDER_LINE_QUEUE] as lQ
LEFT JOIN [dbo].[AOF_ORDER_QUEUE] AS oQ
ON lQ.[SALES_ORDER_NUMBER] = oQ.[SALES_ORDER_NUMBER]
I can see myself doing this in a loop easily in languages I know best. It doesn't seem that looping is a good thing to do in SQL based on some preliminary research so I am reaching out for suggestions.
I need to output a total value which is a conditional sum of lQ.[QUANTITY]. The condition is if oQ.[FORM_FACTOR_ID] is equal to 1 then the output for that particular row is equal to the value of lQ.[QUANTITY]. If oQ.[FORM_FACTOR_ID] is equal to 2 then if oQ.[INDIVIDUAL_PACKAGING] is true, then the output of that particular row in the query is equal to lQ.[QUANTITY]. If the value is false, then the output of that particular row in the query is divided by 2. The final output needs to be a single integer.
QUANTITY FORM_FACTOR_ID INDIVIDUAL_PACKAGING
4 2 1
5 1 1
I would need a query that outputs the value 7 for the above table.
QUANTITY FORM_FACTOR_ID INDIVIDUAL_PACKAGING
4 2 0
5 2 0
That same query needs to output 5 for the above table.
What would be the best way to go about doing this?
If I understand the question correctly, you just want conditional aggregation -- a CASE as an argument to SUM().
If I follow the logic, it would look like:
SELECT SUM(CASE WHEN oq.FORM_FACTOR_ID = 1 THEN lQ.QUANTITY
WHEN oQ.FORM_FACTOR_ID = 2 AND oQ.INDIVIDUAL_PACKAGING = 1 THEN lQ.QUANTITY
WHEN oQ.FORM_FACTOR_ID = 2 AND oQ.INDIVIDUAL_PACKAGING = 0 THEN lQ.QUANTITY / 2
END)
FROM [dbo].[AOF_ORDER_LINE_QUEUE] lQ LEFT JOIN
[dbo].[AOF_ORDER_QUEUE] oQ
ON lQ.[SALES_ORDER_NUMBER] = oQ.[SALES_ORDER_NUMBER];

Oracle SQL statement to update column values based on specific condition

I have a table which is having 3 columns-PID,LOCID,ISMGR. Now in existing scenario, for some person, based on the location ID, he is set as ISMGR=true.
But as per the new requirement, we have to make all the ISMGR=true for any person who is having at least one ISMGR=true(means if he is mangager for any one location, he should be manager for all the locations).
Table Data before running the script:
PID|LOCID|ISMGR
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
Table Data after running the script:
PID|LOCID|ISMGR
1 1 1
1 2 1
1 3 1
2 1 1
2 2 1
Any help will be highly appreciated..
Thanks in advance.
I would be inclined to write this using exists:
update t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from t t2 where t2.pid = t.pid and t2.ismgr = 1);
exists should be more efficient than doing a subquery with an aggregation.
This will work best with indexes on t(pid, ismgr) and t(ismgr).
This is not an answer but a test of the two solutions offered so far - I will call them the "EXISTS" and the "AGGREGATE" solutions or approaches.
Details of the tests are below, but here are two overall conclusions:
Both approaches have comparable execution times; on average the AGGREGATE approach worked a little faster than the EXISTS approach, but by a very small margin (smaller than the differences between running times from one trial to the next). Without indexes on any columns, the run times were: (first number is for the EXISTS approach and the second for AGGREGATE). Trial 1: 8.19s 8.08s Trial 2: 8.98s 8.22s Trial 3: 9.46s 9.55s Note - Estimated optimizer costs should be used only to compare different execution plans for the same statement, not for different solutions using different approaches. Even so, someone will inevitably ask; so - for the EXISTS approach the lowest cost the Optimizer found was 4766; for AGGREGATE, 2665. Again, though, this is completely meaningless.
If a lot of rows need to be updated, indexes will hurt performance much more than they help it. Indeed, when rows are updated, the indexes must be updated as well. If only a small number of rows must be updated, then the indexes will help, because most of the time is spent finding the rows that must be updated and only little time is spent in the updates themselves. In my example almost 25% of rows had to be updated... so the AGGREGATE solution took 51.2 seconds and the EXISTS solution took 59.3 seconds! RECOMMENDATION: If you expect that a large number of rows may need to be updated, and you already have indexes on the table, you may be better off DROPPING them and re-creating them after the updates! Or, perhaps there are other solutions to this problem; I am not an expert (keep that in mind!)
To test properly, after I created the test table and committed, I ran each solution by itself, then I rolled back and, logged in as SYS (in a different session), I ran alter system flush buffer_cache to make sure performance is not randomly helped by cache hits or hurt by misses. In all cases everything is done from disk storage.
I created a table with id's from 1 to 1.2 million and a random integer between 1 and 3, with probabilities 40%, 40% and 20% respectively (see the use of dbms_random below). Then from this prep data I created the test table: each pid was included one, two or three times based on this random integer; and a random 0 or 1 was added as ismgr (with 50-50 probability) in each row. I also added a random integer between 1 and 4 as locid just to simulate the actual data; I didn't worry about duplicate locid since that column plays no role in the problem.
Of the 1.2 million pids, approximately 480,000 (40%) appear just once in the test table, another ~480,000 appear twice and ~240,000 three times. Total rows should be about 2,160,000. That's the cardinality of the base table (in reality it ended up being 2,160,546). Then: none of the ~480,000 rows with unique pid need to be changed; half of the 480,000 pids with a count of 2 will have the same ismgr (so no change) and the other half will be split, so we will need to change 240,000 rows from these; and a simple combinatorial argument shows that 3/8, or 270,000 rows, of the 720,000 rows for pids that appear three times in the table must be changed. So we should expect that 510,000 rows should be changed. In fact the update statements resulted in 510,132 rows updated (same for both solutions). These sanity checks show that the test was probably set up correctly. Below I show also a small sample from the base table, also as a sanity check.
CREATE TABLE statement:
create table tbl as
with prep ( pid, dup ) as (
select level,
round( dbms_random.value(0.5, 3) ) as dup
from dual
connect by level <= 1200000
)
select pid,
round( dbms_random.value(0.5, 4.5) ) as locid,
round( dbms_random.value(0, 1) ) as ismgr
from prep
connect by level <= dup
and prior pid = pid
and prior sys_guid() is not null
;
commit;
Sanity checks:
select count(*) from tbl;
COUNT(*)
----------
2160546
select * from tbl where pid between 324720 and 324730;
PID LOCID ISMGR
---------- ---------- ----------
324720 4 1
324721 1 0
324721 4 1
324722 3 0
324723 1 0
324723 3 0
324723 3 1
324724 3 1
324724 2 0
324725 4 1
324725 2 0
324726 2 0
324726 1 0
324727 3 0
324728 4 1
324729 1 0
324730 3 1
324730 3 1
324730 2 0
19 rows selected
UPDATE statements:
update tbl t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
rollback;
update tbl
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from tbl
group by pid
having max(ismgr) = 1);
rollback;
-- statements to create indexes, used in separate testing:
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);
Why PL/SQL? All you need is a plain SQL statement. For example:
update your_table t -- enter your actual table name here
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from your_table
group by pid
having max(ismgr) = 1)
;
The existing solutions are perfectly fine, but I prefer to use merge any time I'm updating rows from a correlated sub-query. I find it to be more readable and the performance is typically commensurate with the exists method.
MERGE INTO t
USING (SELECT DISTINCT pid
FROM t
WHERE ismgr = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;
As #mathguy pointed out, in this case using group by and having is more efficient than distinct. To use that with merge is just a matter of changing the sub-query:
MERGE INTO t
USING (SELECT pid
FROM t
GROUP BY pid
HAVING MAX(ismgr) = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;

How to prioritize in OR condition in SQL Server

I have a simple query where I want to search for a value in a column, and if no match is found then I want to discard that where condition and select any random value from that column.
I have tried to achieve this using OR condition, using case, but to no avail
select top 10 *
from tblRFCWorkload
where (f1 ='mbb' or f1 = f1)
In the above query I want the result table to have all rows with 'mbb', if 'mbb' is not found then give any value instead of mbb. But instead it returns only two rows with mbb even though there are 10 matching rows
select top 10 *
from tblRFCWorkload
where 1 = case when f1 = 'mbb' then 1 else 1 end
This query also returns the same result as the 1st query
If I change it to
select top 10 *
from tblRFCWorkload
where 1 = case when f1 = 'mbb' then 1 else 0 end
then it only rows having the mbb value.
Need similar thing with between clause... search for one range if no results found then search for second range... how do I do that???
I think you want something like this:
select top 10 *
from tblRFCWorkload
order by (case when f1 = 'mbb' then 1 else 2 end);
This prioritizes the rows that you want, fetching the top 10 which will be 'mbb', if they are available.
You want only the results where f1 = 'mbb' if there are any, otherwise nothing? In that case you need to be testing whether there are any matching rows. Something like this
SELECT TOP 10 *
FROM tblRFCWorkload
WHERE f1 = 'mbb' --the match you need
OR NOT EXISTS --or there are no matches
(SELECT * FROM tblRFCWorkload WHERE f1 = 'mbb')

select sum(a), sum(b where c=1) from db; sql conditions in select statement

i guess i just lack the keywords to search, but this is burning on my mind:
how can i add a condition to the sum-function in the select-statement like
select sum(a), sum(b where c=1) from db;?
this means, i want to see the sum of column a and the sum of column b, but only of the records in column b of which column c has the value 1.
the output of heidi just says "bad syntac near WHERE". may there be any other way?
thanks in advance and best regards from Berlin, joachim
The exact syntax may differ depending on the database engine, however it will be along the lines of
SELECT
sum(a),
sum(CASE WHEN c = 1 THEN b ELSE 0 END)
FROM
db
select sum(case when c=1 then b else 0 end)
This technique is useful when you need a lot of aggregates on the same set of data - you can query the entire table without applying a where filter, and have a bunch of these which give you aggregated data for a specific filter.
It's also useful when you need a lot of counts based on filters - you can do sums of 1 or 0:
select sum(case when {somecondition} then 1 else 0 end)

SQL (SQLite) count for null-fields over all columns

I've got a table called datapoints with about 150 columns and 2600 rows. I know, 150 columns is too much, but I got this db after importing a csv and it is not possible to shrink the number of columns.
I have to get some statistical stuff out of the data. E.g. one question would be:
Give me the total number of fields (of all columns), which are null. Does somebody have any idea how I can do this efficiently?
For one column it isn't a problem:
SELECT count(*) FROM datapoints tb1 where 'tb1'.'column1' is null;
But how can I solve this for all columns together, without doing it by hand for every column?
Best,
Michael
Building on Lamak's idea, how about this idea:
SELECT (N * COUNT(*)) - (
COUNT(COLUMN_1)
+ COUNT(COLUMN_2)
+ ...
+ COUNT(COLUMN_N)
)
FROM DATAPOINTS;
where N is the number of columns. The trick will be in making the summation series of COUNT(column), but that shouldn't be too terrible with a good text editor and/or spreadsheet.
i don't think there is an easy way to do it. i'd get started on the 150 queries. you only have to replace one word (column name) each time.
Well, COUNT (and most aggregations funcions) ignore NULL values. In your case, since you are using COUNT(*), it counts every row in the table, but you can do that on any column. Something like this:
SELECT TotalRows-Column1NotNullCount, etc
FROM (
SELECT COUNT(1) TotalRows,
COUNT(column1) Column1NotNullCount,
COUNT(column2) Column2NotNullCount,
COUNT(column3) Column3NotNullCount ....
FROM datapoints) A
To get started it's often helpful to use a visual query tool to generate a field list and then use cut/paste/search/replace or manipulation in a spreadsheet program to transform it into what is needed. To do it all in one step you can use something like:
SELECT SUM(CASE COLUMN1 WHEN NULL THEN 1 ELSE 0 END) +
SUM(CASE COLUMN2 WHEN NULL THEN 1 ELSE 0 END) +
SUM(CASE COLUMN3 WHEN NULL THEN 1 ELSE 0 END) +
...
FROM DATAPOINTS;
With a visual query builder you can quickly generate:
SELECT COLUMN1, COLUMN2, COLUMN3 ... FROM DATAPOINTS;
You can then replace the comma with all the text that needs to appear between two field names followed by fixing up the first and last fields. So in the example search for "," and replace with " WHEN NULL 1 ELSE 0 END) + SUM(CASE " and then fix up the first and last fields.