I want to perform data profiling on the columns of a table. In this particular case, I want to know what percentage of the data is date/integer/numeric/bit. The query that I am using:
SELECT
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentDate,
CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentNumeric,
CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentInteger,
CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentBit
FROM tbl
This query runs really slowly, even if I select only the top 1 row. In fact I am not able to get any result at all, or at least I cannot wait that long.
The column that I am checking is of type decimal, if that matters.
The table has 37,431,866 records. This is why I select only the top 1000, for example, but it still returns no result after more than 40 minutes.
If you want this to run faster, limiting the rows on the aggregation query itself will not help. An aggregation query with no GROUP BY returns only one row, so TOP there limits the output, not the rows that are read and aggregated.
Instead use a subquery:
SELECT . . .
FROM (SELECT TOP (1000) t.*
FROM tbl t
) t
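The difference between limiting the aggregate query and limiting its input can be checked on a toy table. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in engine (so LIMIT replaces TOP), and the 5000-row table is made up for illustration.

```python
import sqlite3

# In-memory stand-in for the real table; the 5000 rows are invented test data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (column1 TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?)", [(str(i),) for i in range(5000)])

# Limit on the aggregate query itself: every row is still aggregated,
# and the limit only applies to the single output row.
outer_limit = conn.execute("SELECT COUNT(*) FROM tbl LIMIT 1000").fetchone()[0]

# Limit inside a derived table: only 1000 rows reach the aggregate.
inner_limit = conn.execute(
    "SELECT COUNT(*) FROM (SELECT * FROM tbl LIMIT 1000) AS t"
).fetchone()[0]

print(outer_limit, inner_limit)  # 5000 1000
```

Only the second form reduces the amount of data the CASE expressions have to process.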
Note that this is not a random sample. And if you attempt ORDER BY newid() you will kill performance. One alternative to get an approximate n% sample is to use logic like this:
SELECT . . .
FROM (SELECT TOP (1000) t.*
FROM tbl t
WHERE RAND(CHECKSUM(NEWID())) < 0.001
) t
The 0.001 would be about a 0.1% sample.
Your query can be simplified. The part:
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))
can also be written as:
CAST(SUM(CASE WHEN TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END) AS NUMERIC(25,2))
The second one is quicker than the first, with the same result (AFAIK): a NULL never satisfies BETWEEN, so the separate IS NOT NULL test adds nothing.
The same simplification can probably be applied to the other parts of the query.
Related
I am not able to understand the output of a query run on Hive:
select count(*),
count(col1) as count,
count( distinct col1) as distinct,
SUM (case when (cast(col1 as BIGINT) is null or cast(col1 as BIGINT) is not null )then 1 else 0 end) as total_count,
SUM (case when cast(col1 as BIGINT) is null then 1 else 0 end) as non_int_count,
SUM (case when cast(col1 as BIGINT) is not null then 1 else 0 end) as int_count,
SUM (case when cast(col1 as BIGINT) is null then 0 else 1 end) as int_count2
FROM TABLE
where conditions
The result of this query in Hive is:
count(*)  count     distinct  total_count  non_int_count  int_count  int_count2
23030525  23030525  1631400   23030525     2              258898     258898
Shouldn't total_count = count(*) = (non_int_count + int_count)?
I have a bit of SQL code that looks similar to this:
select sum(case when latitude = '0' then 1 else 0 end) as count_zero,
sum(case when latitude is NULL then 1 else 0 end) as count_null,
sum((case when latitude = '0' then 1 else 0 end) +
(case when latitude is NULL then 1 else 0 end)
) as total_zero,
count(latitude) as count_not_nulls,
count(*) as total
from sites_database
Is there a "cleaner" way to write this same query? I have tried using the "sum" expression with the column aliases, something like:
Sum(count_zero + count_null) as total_null
But this doesn't seem to work for some reason.
You could use COUNT instead of SUM:
SELECT
COUNT(CASE WHEN latitude = '0' THEN 1 END) As count_zero,
COUNT(CASE WHEN latitude IS NULL THEN 1 END) AS count_null,
COUNT(CASE WHEN COALESCE(latitude, '0') = '0' THEN 1 END) AS total_zero,
COUNT(latitude) As count_not_nulls,
COUNT(*) as total
FROM sites_database;
Using COUNT here saves a bit of coding, because we don't have to provide an explicit ELSE condition (the default ELSE is NULL, which just isn't counted at all). Also note that for the total_zero conditional sum, I used COALESCE to merge the two counts into just one.
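To see that the COUNT form and the SUM form agree, here is a small sketch run through Python's sqlite3 module. The five sample latitude values are invented for illustration; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites_database (latitude TEXT)")
# Invented sample data: two zeros, one NULL, two real values.
conn.executemany(
    "INSERT INTO sites_database VALUES (?)",
    [("0",), ("0",), (None,), ("51.5",), ("48.9",)],
)

row = conn.execute(
    """
    SELECT
        SUM(CASE WHEN latitude = '0' THEN 1 ELSE 0 END)           AS sum_zero,
        COUNT(CASE WHEN latitude = '0' THEN 1 END)                AS count_zero,
        COUNT(CASE WHEN latitude IS NULL THEN 1 END)              AS count_null,
        COUNT(CASE WHEN COALESCE(latitude, '0') = '0' THEN 1 END) AS total_zero,
        COUNT(latitude)                                           AS count_not_nulls,
        COUNT(*)                                                  AS total
    FROM sites_database
    """
).fetchone()

print(row)  # (2, 2, 1, 3, 4, 5)
```

The first two columns match, and total_zero equals count_zero + count_null without referencing either alias.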
I have int values v >= 0 in a nullable column, and I would like to count the number of occurrences of NULL, 0, 1 and 2+ in that column. How can I do it efficiently?
One method is group by:
select (case when col is null then 'NULL'
when col in (0, 1) then cast(col as varchar(255))
else '2+'
end) as grp, count(*)
from t
group by (case when col is null then 'NULL'
when col in (0, 1) then cast(col as varchar(255))
else '2+'
end)
order by min(col);
The exact syntax for the cast() might depend on the database. This also assumes all values are non-negative.
You can put the counts in different columns as well:
select sum(case when col is null then 1 else 0 end) as cnt_null,
sum(case when col = 0 then 1 else 0 end) as cnt_0,
sum(case when col = 1 then 1 else 0 end) as cnt_1,
sum(case when col >= 2 then 1 else 0 end) as cnt_2pl
from t;
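Both shapes (one row per bucket, or one column per bucket) can be tried on a toy table. This is a sketch using Python's sqlite3 module with invented values, and it counts NULL as its own bucket because the question asks for it. GROUP BY on the alias grp works in SQLite but not in every engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
# Invented sample data: two NULLs, one 0, two 1s, three values >= 2.
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(None,), (None,), (0,), (1,), (1,), (2,), (5,), (7,)],
)

# One row per bucket.
by_row = dict(conn.execute(
    """
    SELECT CASE WHEN col IS NULL THEN 'NULL'
                WHEN col IN (0, 1) THEN CAST(col AS TEXT)
                ELSE '2+'
           END AS grp,
           COUNT(*)
    FROM t
    GROUP BY grp
    """
).fetchall())

# One column per bucket.
by_col = conn.execute(
    """
    SELECT SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS cnt_null,
           SUM(CASE WHEN col = 0 THEN 1 ELSE 0 END)     AS cnt_0,
           SUM(CASE WHEN col = 1 THEN 1 ELSE 0 END)     AS cnt_1,
           SUM(CASE WHEN col >= 2 THEN 1 ELSE 0 END)    AS cnt_2pl
    FROM t
    """
).fetchone()

print(by_row)
print(by_col)  # (2, 1, 2, 3)
```

Either way the table is read once; which shape is more convenient depends on how the result is consumed.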
The SQL query below creates a table with n columns, including the ones named on the next line.
...., curr_amount, tax_amount, ....
I am having a very tough time updating the query below to add a new column called total, positioned right after the tax_amount column, containing the sum of curr_amount and tax_amount.
I have been working on this for more than a day but couldn't figure it out.
P.S. Still a noob here. Thanks a lot for your time.
SELECT Isnull(t.total_month, 'Total') total_month,
t.tax_amount,
t.curr_amount,
t.usage_qty,
t.kh_qty,
t.bill_cnt
FROM (SELECT dbo.Sigmadf(bm.posted_date, 'YYYY-MM') total_month,
Sum(CASE
WHEN rr.usage_qty IS NULL THEN 0
ELSE Cast (rr.usage_qty AS NUMERIC(18, 2))
END) usage_qty,
Sum(CASE
WHEN bm.curr_amount IS NULL THEN 0
ELSE bm.curr_amount
END) curr_amount,
Sum(CASE
WHEN bm.adj_amount IS NULL THEN 0
ELSE bm.adj_amount
END) adj_amount,
Sum(CASE
WHEN bm.bal_fwd_amount IS NULL THEN 0
ELSE bm.bal_fwd_amount
END) bal_forward,
Sum(CASE
WHEN bm.tax_amount IS NULL THEN 0
ELSE bm.tax_amount
END) tax_amount,
Sum(CASE
WHEN bm.due_amount IS NULL THEN 0
ELSE bm.due_amount
END) due_amount,
Sum(CASE
WHEN bm.last_total_paid_amount IS NULL THEN 0
ELSE bm.last_total_paid_amount * -1
END) paid_amount,
Sum(CASE
WHEN bm.bill_print = 'Y' THEN 1
ELSE 0
END) pdf_cnt,
Sum(CASE
WHEN Isnull(bm.bill_handling_code, '0') = '0' THEN 1
ELSE 0
END) reg_cnt,
Sum(CASE
WHEN Isnull(bm.bill_handling_code, '0') = '1' THEN 1
ELSE 0
END) ftime_cnt,
Sum(CASE
WHEN Isnull(bm.bill_handling_code, '0') = '9999' THEN 1
ELSE 0
END) ltime_cnt,
Count(*) bill_cnt,
Sum(CASE
WHEN bill_status = '01' THEN 1
ELSE 0
END) canc_cnt,
Sum(CASE
WHEN bill_status = '01' THEN
CASE
WHEN rr.usage_qty IS NULL THEN 0
ELSE Cast (rr.usage_qty AS NUMERIC(18, 2))
END
ELSE 0
END) canc_usg,
Sum(CASE
WHEN vis.kh_qty IS NULL THEN 0
ELSE Cast(vis.kh_qty AS NUMERIC(18, 2))
END) kh_qty
FROM bill_master bm WITH (nolock)
INNER JOIN (SELECT bill_no,
Sum(CASE
WHEN vpb.recurr_charge_type IN ( 'T4',
'SLF' )
THEN
CASE
WHEN vpb.print_qty = 'Y'
AND vpb.usage_qty IS NOT NULL
THEN
Cast (vpb.usage_qty AS
NUMERIC(18, 2))
ELSE 0
END
ELSE 0
END) usage_qty
FROM v_print_bills_all vpb
GROUP BY bill_no) rr
ON rr.bill_no = bm.bill_no
LEFT OUTER JOIN vis_bill_master_cr vis WITH (nolock)
ON bm.bill_no = vis.bill_no
WHERE 1 = 1
AND dbo.Trunc(bm.posted_date) >= '20150101'
AND dbo.Trunc(bm.posted_date) <= '20151124'
AND bm.posted_date IS NOT NULL
AND bm.cust_id NOT IN (SELECT cc.code_type cust_id
FROM code_table cc WITH (nolock)
WHERE cc.code_tabname = 'RptExclCust'
AND cc.code_value = 'cust_id')
GROUP BY dbo.Sigmadf(bm.posted_date, 'YYYY-MM') WITH rollup)t
I must say that the explanation is not very clear.
From my understanding, you want the total of two columns.
So wrap your whole query in parentheses, call it subQuery, and compute the sum of the two columns in the outer SELECT:
SELECT subQuery.total_month as bill_date,
subQuery.curr_amount as amount,
subQuery.tax_amount tax,
subQuery.curr_amount + subQuery.tax_amount as [total],
...
FROM
(..your entire query here..) as subQuery
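As a minimal sketch of that wrapping pattern, here is a cut-down version run through Python's sqlite3 module. The bill_master table below has only three columns, posted_month is a hypothetical stand-in for the dbo.Sigmadf(...) expression, and the four rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical stand-in for the real bill_master table.
conn.execute(
    "CREATE TABLE bill_master (posted_month TEXT, curr_amount REAL, tax_amount REAL)"
)
conn.executemany(
    "INSERT INTO bill_master VALUES (?, ?, ?)",
    [
        ("2015-01", 100.0, 10.0),
        ("2015-01", 50.0, 5.0),
        ("2015-02", 200.0, 20.0),
        ("2015-02", None, 2.5),  # NULL amount is treated as 0
    ],
)

rows = conn.execute(
    """
    SELECT subQuery.total_month,
           subQuery.curr_amount,
           subQuery.tax_amount,
           subQuery.curr_amount + subQuery.tax_amount AS total
    FROM (SELECT posted_month                  AS total_month,
                 SUM(COALESCE(curr_amount, 0)) AS curr_amount,
                 SUM(COALESCE(tax_amount, 0))  AS tax_amount
          FROM bill_master
          GROUP BY posted_month) AS subQuery
    ORDER BY subQuery.total_month
    """
).fetchall()

print(rows)
```

The column order of the result is simply the order of the outer SELECT list, so total lands right after tax_amount.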
I have a table like this
TABLEMAIN
Q1  Name  Group  Zone  Month  Type
1   'N1'  'G1'   'Z1'  12     'T1'
4   'N1'  'G3'   'Z2'  12     'T6'
6   'N1'  'G1'   'Z5'  12     'T2'
3   'N2'  'G4'   'Z5'  12     'T4'
...
And I have something like this to get certain results
Query1:
select
(SUM(CASE WHEN Q1>=2 and Q1<=4 THEN 1 ELSE 0 END)) TOTAL,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) T1TYPE,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) T2TYPE,
Type,
Zone,
Month
from
TABLEMAIN
GROUP BY Type, Zone, Month;
Query2:
select
(SUM(CASE WHEN Q1>=2 and Q1<=4 THEN 1 ELSE 0 END)) TOTAL,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) T1TYPE,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) T2TYPE,
Type,
Group,
Month
from
TABLEMAIN
GROUP BY Type, Group, Month;
As you can see, I group this table many times in many ways (here once by Zone and once by Group), but this part is the same in every query:
select
(SUM(CASE WHEN Q1>=2 and Q1<=4 THEN 1 ELSE 0 END)) TOTAL,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) T1TYPE,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) T2TYPE,
Is there a better way to do this? I'm not sure if I can use a materialized view for this.
Perhaps. You can do it all in one query, if you like, by using grouping sets:
select SUM(CASE WHEN Q1>=2 and Q1<=4 THEN 1 ELSE 0 END) as TOTAL,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) as T1TYPE,
(CASE WHEN Type = 'T1' THEN SUM(CASE WHEN Q1=4 THEN 1 ELSE 0 END) END) as T2TYPE,
Type, Zone, Group, Month
from TABLEMAIN
GROUP BY GROUPING SETS((Type, Zone, Month), (Type, Group, Month));
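This puts all the results in a single result set. Rows from the (Type, Zone, Month) grouping have NULL in the Group column, and rows from the (Type, Group, Month) grouping have NULL in Zone, which is how you tell the two apart.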
I second @GolezTrol's comment and would like to explain further.
SUBQUERY FACTORING is what you need. The WITH clause, or subquery factoring clause, is part of the SQL-99 standard and was added into the Oracle SQL syntax in Oracle 9.2. The WITH clause may be processed as an inline view or resolved as a temporary table. The advantage of the latter is that repeated references to the subquery may be more efficient as the data is easily retrieved from the temporary table, rather than being requeried by each reference.
WITH data AS(
<your subquery>
)
SELECT * FROM data
bla bla bla...
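To make the idea concrete, here is a rough sketch of a WITH clause that factors out the shared CASE logic once and then groups it two ways. It runs through Python's sqlite3 module rather than Oracle, uses the four sample rows from the question, and renames the Group column to grp (an assumption made here to avoid quoting a reserved word). The in_range, dim and bucket names are also made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablemain "
    "(q1 INTEGER, name TEXT, grp TEXT, zone TEXT, month INTEGER, type TEXT)"
)
# The four sample rows shown in the question.
conn.executemany(
    "INSERT INTO tablemain VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "N1", "G1", "Z1", 12, "T1"),
        (4, "N1", "G3", "Z2", 12, "T6"),
        (6, "N1", "G1", "Z5", 12, "T2"),
        (3, "N2", "G4", "Z5", 12, "T4"),
    ],
)

rows = conn.execute(
    """
    WITH data AS (
        -- The shared part is written once here.
        SELECT zone, grp,
               CASE WHEN q1 >= 2 AND q1 <= 4 THEN 1 ELSE 0 END AS in_range
        FROM tablemain
    )
    SELECT 'zone' AS dim, zone AS bucket, SUM(in_range) AS total
    FROM data GROUP BY zone
    UNION ALL
    SELECT 'group', grp, SUM(in_range)
    FROM data GROUP BY grp
    ORDER BY dim, bucket
    """
).fetchall()

for r in rows:
    print(r)
```

Each additional grouping then only needs another short SELECT against data, instead of a copy of the CASE expressions.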