I am using UTL_MATCH to compare values across two different tables and measure how similar they are. I am filtering for pairs that score at least 90 out of 100 so I can manually check whether those values actually are the same or not.
As the resulting data set is too big, I am working on some new queries to take out the values that are surely the same and do not need any manual checking, like those more than 90% similar and fewer than 9 characters long.
So far I have just been using a WHERE clause, but now I want to add a CASE expression to state that values containing the word "University" that do not reach 95% similarity should not appear.
The code I am using seems to run, but it's taking a lot of time. Do you know if it can be improved (for speed)? Thank you!
The code I am using looks like this:
with consolidate_table as (....)
select
    column1, column2,
    UTL_MATCH.jaro_winkler_similarity(column1, column2) as jws
from consolidate_table
where UTL_MATCH.jaro_winkler_similarity(column1, column2) >= 90
  and UTL_MATCH.jaro_winkler_similarity(column1, column2) < 100
  and length(column1) < 9
  and column1 = (case
                     when column1 like '%University%'
                          and UTL_MATCH.jaro_winkler_similarity(column1, column2) > 94
                     then column1 else null
                 end);
Unsure without sample data and expected results.
But at first glance it can be simplified.
Using a BETWEEN with a CASE for the lower bound, the function is only called once in the WHERE clause.
And filtering out the rows where the two columns are already equal (instead of testing for a score below 100) avoids calling the function on identical values and could help speed it up.
...
WHERE column1 <> column2
AND UTL_MATCH.jaro_winkler_similarity(column1, column2) BETWEEN (CASE WHEN column1 LIKE '%University%' THEN 95 ELSE 90 END) AND 99
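Putting it together, the full query might look like the sketch below (keeping the LENGTH filter from the original; the CTE body stays elided as in the question):
with consolidate_table as (....)
select
    column1, column2,
    UTL_MATCH.jaro_winkler_similarity(column1, column2) as jws
from consolidate_table
where column1 <> column2
  and length(column1) < 9
  and UTL_MATCH.jaro_winkler_similarity(column1, column2)
      between (case when column1 like '%University%' then 95 else 90 end) and 99;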
I have two solutions for finding the sum of the positive integers and the sum of the negative integers. Please tell me which one is more correct and better optimized.
Or is there another query that is more optimized and correct?
Q:
Consider table A with a single column col1 and the values below.
col1
20
-20
40
-40
-30
30
I need the output below:
POSITIVE_SUM NEGATIVE_SUM
90 -90
I have two solutions.
/q1/
select POSITIVE_SUM, NEGATIVE_SUM
from (select distinct sum(a2.col1) as "POSITIVE_SUM"
      from A a1 join A a2 on a2.col1 > 0
      group by a1.col1) t1,
     (select distinct sum(a2.col1) as "NEGATIVE_SUM"
      from A a1 join A a2 on a2.col1 < 0
      group by a1.col1) t2;
/q2/
select sum(case when a1.col1 >= 0 then a1.col1 else 0 end) as positive_sum,
       sum(case when a1.col1 < 0 then a1.col1 else 0 end) as negative_sum
from A a1;
POSITIVE_SUM NEGATIVE_SUM
90 -90
I wonder how you even came up with your 1st solution:
- self-joining the table (twice),
- producing 6 (identical) rows in each subquery, from which DISTINCT finally gets 1 row,
- then cross joining the 2 results.
I prepared a demo so you can see the steps that lead to the result of your 1st solution.
I don't know if this can be optimized in any way,
but is there any case where it can beat a single scan of the table with conditional aggregation, like your 2nd solution?
I don't think so.
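For reference, a minimal sketch of such a demo, reconstructed here from the sample data in the question (not the original linked demo):
create table A (col1 int);
insert into A (col1) values (20);
insert into A (col1) values (-20);
insert into A (col1) values (40);
insert into A (col1) values (-40);
insert into A (col1) values (-30);
insert into A (col1) values (30);
-- each self-join subquery pairs every a1 row with the 3 positive a2 rows,
-- so GROUP BY a1.col1 yields 6 identical sums of 90 before DISTINCT collapses them
select sum(a2.col1) as positive_sum
from A a1 join A a2 on a2.col1 > 0
group by a1.col1;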
The second query is not only better performing, but it is also the more reliably correct one: strip the DISTINCT from the first query and you'll see that it returns multiple rows.
I think for the first query, you are looking for something like:
select p.positive_sum, n.negative_sum
from (select sum(col1) as positive_sum from A where col1 > 0) p cross join
     (select sum(col1) as negative_sum from A where col1 < 0) n
And that you are asking whether the CASE expression is faster than the WHERE.
What you are missing is that this version needs to scan the table twice. Reading data is generally more expensive than any functions applied to the data elements.
Sometimes the two-scan version might have very similar performance to the single scan. I can think of three cases: first, when there is a clustered index on col1; second, when col1 is used as a partitioning key; and third, on very small amounts of data (say, data that fits on a single data page).
I'm trying to evaluate multiple columns at once to save myself a few keystrokes (granted, at this point the time and effort of the search has long since negated any "benefit" I would ever receive), rather than writing multiple separate comparisons.
Basically, I have:
WHERE column1 = column2
AND column2 = column3
I want:
WHERE column1 = column2 = column3
I found this other article, which was tangentially related:
Oracle SQL Syntax - Check multiple columns for IS NOT NULL
Use:
x=all(y,z)
instead of
x=y and y=z
The above saves 1 keystroke (1/11 = 9% - not much).
If column names are longer, then it gives bigger savings:
This is 35 characters long:
column1=column2 AND column2=column3
while this one only 28
column1=ALL(column2,column3)
But for this one (95 characters):
column1=column2 AND column2=column3 AND column3=column4
AND column4=column5 AND column5=column6
you will get 43/95 = almost 50% savings
column1=all(column2,column3,column4,column5,column6)
The ALL operator is part of ANSI SQL; it is supported by most databases (MySQL, PostgreSQL, SQL Server, etc.).
http://www.w3resource.com/sql/special-operators/sql_all.php
A simple test case that shows how it works:
create table t( x int, y int, z int );
insert all
into t values( 1,1,1)
into t values(1,2,2)
into t values(1,1,2)
into t values(1,2,1)
select 1 from dual;
select *
from t
where x = all(y,z);
X Y Z
---------- ---------- ----------
1 1 1
One possible trick is to use the LEAST and GREATEST functions: if the largest and the smallest of a list of values are equal, all the values must be equal:
LEAST(col1, col2, col3) = GREATEST(col1, col2, col3)
I'm not sure it saves any keystrokes on a three column list, but if you have many columns, it could save some characters. Note that this solution implicitly assumes that none of the values are null, but so does your original solution, so it should be OK.
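As a quick check, reusing the table t from the previous answer, only the all-equal row survives:
select *
from t
where least(x, y, z) = greatest(x, y, z);
X Y Z
---------- ---------- ----------
1 1 1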
Is there any way to make this query run faster (meaning take fewer reads/IO on SQL Server)? The logic essentially is:
- I count the distinct values in a column;
- if there is more than 1 distinct value, the column is considered as existing;
- a list is built with the name of each column and a 1 or 0 for whether it is existing.
I would like to do something with EXISTS (which in T-SQL terminates the scan of the table/index as soon as SQL Server finds a match for the EXISTS predicate). I am not sure if that is possible in this query (a sketch of the idea follows the query below).
Note: I am not looking for answers like "is there an index on the table"...well beyond that :)
with SomeCTE as
(
select
count(distinct(ColumnA)) as ColumnA,
count(distinct(ColumnB)) as ColumnB,
count(distinct(ColumnC)) as ColumnC
from VERYLARGETABLE
)
select 'NameOfColumnA', case when ColumnA > 1 then 1 else 0 end from SomeCTE
UNION ALL
select 'NameOfColumnB', case when ColumnB > 1 then 1 else 0 end from SomeCTE
UNION ALL
select 'NameOfColumnC', case when ColumnC > 1 then 1 else 0 end from SomeCTE
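For illustration, one way the early-termination idea from the question might be expressed per column is with DISTINCT TOP (2), which lets SQL Server stop once a second distinct value turns up; a hedged sketch, not guaranteed to beat the MIN/MAX answer below:
SELECT 'NameOfColumnA',
       CASE WHEN (SELECT COUNT(*)
                  FROM (SELECT DISTINCT TOP (2) ColumnA
                        FROM VERYLARGETABLE) d) > 1
            THEN 1 ELSE 0
       END;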
Just to copy what I posted below in the comments:
So after testing this solution, it does make the queries run faster. To give two examples: one query went from 50 seconds to 3 seconds; another went from 9+ minutes (I stopped running it) down to 1 minute 3 seconds. I am also missing indexes (according to the DTA it should run 14% faster with them), and I am running this in SQL Azure DB (where you are throttled drastically in terms of I/O, CPU, and tempdb memory). Very nice solution all around. One downside is that MIN/MAX does not work on bit columns, but those can be converted (see the sketch after the answer below).
If the datatypes of the columns allow aggregate functions, and if there is an index on every column, this will be fast:
SELECT 'NameOfColumnA' AS ColumnName,
CASE WHEN MIN(ColumnA) < MAX(ColumnA)
THEN 1 ELSE 0
END AS result
FROM VERYLARGETABLE
UNION ALL
SELECT 'NameOfColumnB',
CASE WHEN MIN(ColumnB) < MAX(ColumnB)
THEN 1 ELSE 0
END
FROM VERYLARGETABLE
UNION ALL
SELECT 'NameOfColumnC',
CASE WHEN MIN(ColumnC) < MAX(ColumnC)
THEN 1 ELSE 0
END
FROM VERYLARGETABLE ;
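For the bit-column caveat mentioned in the comment above, a cast makes MIN/MAX applicable; a hedged sketch with a hypothetical column name:
SELECT 'NameOfBitColumn',
       CASE WHEN MIN(CAST(BitColumnA AS tinyint)) < MAX(CAST(BitColumnA AS tinyint))
            THEN 1 ELSE 0
       END
FROM VERYLARGETABLE;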
I've got a table called datapoints with about 150 columns and 2600 rows. I know, 150 columns is too many, but I got this db after importing a CSV and it is not possible to shrink the number of columns.
I have to get some statistical information out of the data. E.g., one question would be:
Give me the total number of fields (across all columns) which are null. Does somebody have any idea how I can do this efficiently?
For one column it isn't a problem:
SELECT count(*) FROM datapoints tb1 WHERE tb1.column1 IS NULL;
But how can I solve this for all columns together, without doing it by hand for every column?
Best,
Michael
Building on Lamak's idea, how about this:
SELECT (N * COUNT(*)) - (
COUNT(COLUMN_1)
+ COUNT(COLUMN_2)
+ ...
+ COUNT(COLUMN_N)
)
FROM DATAPOINTS;
where N is the number of columns. The trick will be in producing the series of COUNT(column) terms, but that shouldn't be too terrible with a good text editor and/or spreadsheet.
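If typing the 150-term series by hand is too tedious, the catalog can generate it for you; a hedged sketch assuming SQLite (plausible given the CSV import), whose pragma_table_info() lists the column names:
SELECT 'COUNT(' || name || ') +'
FROM pragma_table_info('datapoints');
Paste the output into the query and trim the trailing +.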
I don't think there is an easy way to do it. I'd get started on the 150 queries; you only have to replace one word (the column name) each time.
Well, COUNT (and most aggregation functions) ignores NULL values. In your case, since you are using COUNT(*), it counts every row in the table, but you can do the same on any column. Something like this:
SELECT TotalRows - Column1NotNullCount, etc
FROM (SELECT COUNT(1) TotalRows,
             COUNT(column1) Column1NotNullCount,
             COUNT(column2) Column2NotNullCount,
             COUNT(column3) Column3NotNullCount ....
      FROM datapoints) A
To get started it's often helpful to use a visual query tool to generate a field list and then use cut/paste/search/replace or manipulation in a spreadsheet program to transform it into what is needed. To do it all in one step you can use something like:
SELECT SUM(CASE WHEN COLUMN1 IS NULL THEN 1 ELSE 0 END) +
       SUM(CASE WHEN COLUMN2 IS NULL THEN 1 ELSE 0 END) +
       SUM(CASE WHEN COLUMN3 IS NULL THEN 1 ELSE 0 END) +
       ...
FROM DATAPOINTS;
With a visual query builder you can quickly generate:
SELECT COLUMN1, COLUMN2, COLUMN3 ... FROM DATAPOINTS;
You can then replace each comma with all the text that needs to appear between two field names, followed by fixing up the first and last fields. So in the example, search for "," and replace with " IS NULL THEN 1 ELSE 0 END) + SUM(CASE WHEN " and then fix up the first and last fields.
I have an Oracle table with 32 columns. Two of these columns are identity columns; the rest hold values. I would like to get the average of all the value columns for each row, which is complicated by the value columns that are null. Below is the pseudocode for what I am trying to achieve:
SELECT
    ((nvl(val0, 0) + nvl(val1, 0) + ... + nvl(valn, 0))
     / nonZero_Column_Count_In_This_Row)
such that: nonZero_Column_Count_In_This_Row = ifNullThenZeroElse1(val0) + ifNullThenZeroElse1(val1) + ... + ifNullThenZeroElse1(valn)
The difficulty here is of course in getting 1 for any non-null column. It seems I need a function similar to NVL, but with an else clause: something that will return 0 if the value is null, and 1 if it is not, rather than the value itself.
How should I go about getting the value for the denominator?
PS: I feel I must explain some of the motivation behind this design. Ideally this table would have been organized as the identity columns plus one value per row, with some identifier for the row itself. That would have been more normalized, and the solution to this problem would have been pretty simple. The reason it was not done like this is throughput and space. This is a huge DB into which we insert 10 million values per minute. Making each of these values its own row would mean 10M rows per minute, which is definitely not attainable. Packing 30 of them into a single row reduces the number of rows inserted to something we can handle with a single DB, and makes the overhead (the identity data) much smaller.
(CASE WHEN col IS NULL THEN 0 ELSE 1 END)
You could use NVL2(val0, 1, 0) + NVL2(val1, 1, 0) + ... since you are using Oracle.
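Spelled out against the pseudocode above, a minimal sketch (the table name mytable is assumed, as in the answer below; NULLIF keeps an all-NULL row from dividing by zero):
SELECT (NVL(val0, 0) + NVL(val1, 0) + NVL(val2, 0))
       / NULLIF(NVL2(val0, 1, 0) + NVL2(val1, 1, 0) + NVL2(val2, 1, 0), 0)
       AS row_average
FROM mytable;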
Another option is to use the AVG function, which ignores NULLs:
SELECT AVG(v) FROM (
WITH q AS (SELECT val0, val1, val2, val3 FROM mytable)
SELECT val0 AS v FROM q
UNION ALL SELECT val1 FROM q
UNION ALL SELECT val2 FROM q
UNION ALL SELECT val3 FROM q
);
If you're using Oracle 11g, you can use the UNPIVOT syntax to make it even simpler.
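A hedged sketch of that UNPIVOT variant (UNPIVOT excludes the NULLs by itself; column and table names as in the query above):
SELECT AVG(v)
FROM mytable
UNPIVOT (v FOR which_col IN (val0, val1, val2, val3));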
I see this is a pretty old question, but I don't see a sufficient answer. I had a similar problem, and below is how I solved it. It's pretty clear a CASE expression is needed. This solution is a workaround for those cases where
SELECT COUNT(column) WHERE column {IS | IS NOT} NULL
does not work for whatever reason, or where you need to run several
SELECT COUNT ( * )
FROM A_TABLE
WHERE COL1 IS NOT NULL;
SELECT COUNT ( * )
FROM A_TABLE
WHERE COL2 IS NOT NULL;
queries but want the result as a single data set when you run the script. See below; I use this for analysis and it's been working great for me so far.
SELECT SUM(CASE NVL(valn, 'X')
           WHEN 'X'
           THEN 0
           ELSE 1
           END) AS COLUMN_NAME
FROM YOUR_TABLE;
Cheers!
Doug
Generically, you can do something like this:
SELECT (
    (COALESCE(val0, 0) + COALESCE(val1, 0) + ... + COALESCE(valn, 0))
    /
    (SIGN(ABS(COALESCE(val0, 0))) + SIGN(ABS(COALESCE(val1, 0))) + ... )
) AS MyAverage
The top line will return the sum of the values (treating NULLs as zero), whereas the bottom line will return the number of non-null values; note that it actually counts non-zero values, so a genuine zero would be left out of the denominator.
FYI, it's SQL Server syntax, but COALESCE is just like ISNULL for the most part. SIGN just returns -1 for a negative number, 0 for zero, and 1 for a positive number. ABS is absolute value.