How do I aggregate numbers from a string column in SQL?

I am dealing with a poorly designed database column which has values like this:

ID  cid  Score
--  ---  -----------
1   1    3 out of 3
2   1    1 out of 5
3   2    3 out of 6
4   3    7 out of 10

I want the aggregate sum and percentage of the Score column, grouped on cid, like this:

cid  sum          percentage
---  -----------  ----------
1    4 out of 8   50
2    3 out of 6   50
3    7 out of 10  70
How do I do this?

You can try this way:
select
t.cid
, cast(sum(s.a) as varchar(5)) +
' out of ' +
cast(sum(s.b) as varchar(5)) as sum
, ((cast(sum(s.a) as decimal))/sum(s.b))*100 as percentage
from MyTable t
inner join
(select
id
, cast(substring(score,0,2) as Int) a
, cast(substring(score,charindex('out of', score)+7,len(score)) as int) b
from MyTable
) s on s.id = t.id
group by t.cid
[SQLFiddle Demo]
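A small caveat: substring(score,0,2) only captures a single leading digit, so it works for the sample data but would break on a value like '12 out of 20'. If multi-digit scores are possible, the derived table s could instead be built around the ' out of ' marker; a sketch, assuming every Score follows the '<n> out of <m>' pattern:
select
id
, cast(left(score, charindex(' out of ', score) - 1) as int) a
, cast(substring(score, charindex(' out of ', score) + 8, len(score)) as int) b
from MyTable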

Redesign the table, but on-the-fly as a CTE. Here's a solution that's not as short as you could make it, but that takes advantage of the handy SQL Server function PARSENAME. You may need to tweak the percentage calculation if you want to truncate rather than round, or if you want it to be a decimal value, not an int.
In this or most any solution, you have to count on the column values for Score to be in the very specific format you show. If you have the slightest doubt, you should run some other checks so you don't miss or misinterpret anything.
with
P(ID, cid, Score2Parse) as (
select
ID,
cid,
replace(Score,space(1),'.')
from scores
),
S(ID,cid,pts,tot) as (
select
ID,
cid,
cast(parsename(Score2Parse,4) as int),
cast(parsename(Score2Parse,1) as int)
from P
)
select
cid, cast(round(100e0*sum(pts)/sum(tot),0) as int) as percentage
from S
group by cid;
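To make the "run some other checks" advice concrete, one possible sanity check (my own addition, not part of the original answer) is to list any Score values that don't match the expected shape before aggregating:
-- flag rows whose Score does not look like '<number> out of <number>'
select *
from scores
where Score not like '%[0-9] out of [0-9]%';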

Related

Snowflake: Repeating rows based on column value

How to repeat rows based on column value in Snowflake using SQL?
I tried a few methods, such as dual and connect by, but they are not working.
I have two columns: Id and Quantity.
For each ID, there are different values of Quantity.
So if you have a count, you can use a generator:
with ten_rows as (
select row_number() over (order by null) as rn
from table(generator(ROWCOUNT=>10))
), data(id, count) as (
select * from values
(1,2),
(2,4)
)
SELECT
d.*
,r.rn
from data as d
join ten_rows as r
on d.count >= r.rn
order by 1,3;
ID  COUNT  RN
--  -----  --
1   2      1
1   2      2
2   4      1
2   4      2
2   4      3
2   4      4
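The data CTE is just inline sample data; against a real table the same generator join looks roughly like this (table and column names are assumptions based on the question's Id/Quantity description, and the generator's ROWCOUNT must be at least the largest Quantity):
with ten_rows as (
    select row_number() over (order by null) as rn
    from table(generator(ROWCOUNT=>10))
)
select t.id, t.quantity, r.rn
from my_table as t
join ten_rows as r
  on t.quantity >= r.rn
order by 1, 3;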
Ok, let's start by generating some data. We will create 10 rows, each with a QTY randomly chosen as 1 or 2.
Next we want to duplicate the rows with a QTY of 2 and leave the QTY = 1 rows as they are.
Obviously you can change all the parameters to suit your needs; this solution works super fast and, in my opinion, way better than table generation.
Simply stack SPLIT_TO_TABLE() and REPEAT() with a LATERAL join and voila.
WITH TEN_ROWS AS (
    SELECT ROW_NUMBER() OVER (ORDER BY NULL) SOME_ID,
           UNIFORM(1, 2, RANDOM()) QTY
    FROM TABLE(GENERATOR(ROWCOUNT => 10))
)
SELECT
    TEN_ROWS.*
FROM
    TEN_ROWS,
    LATERAL SPLIT_TO_TABLE(REPEAT(',', QTY - 1), ',') ALTERNATIVE_APPROACH;
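The generator and the random QTY are only there to fake some input. Applied to a real table, the same LATERAL trick might look like this (table and column names are assumptions; the repeated string itself is irrelevant, only the number of split pieces matters):
SELECT T.ID, T.QUANTITY
FROM MY_TABLE T,
     LATERAL SPLIT_TO_TABLE(REPEAT(',', T.QUANTITY - 1), ',') S;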

SQL Server - Find similarities in column and write them into new column

I have a big table with data like this:
ID Title
-- ------------------------
1 01_SOMESTRING_038
2 01_SOMESTRING K5038
3 01_SOMESTRING-648
4 K-OTHERSTRING_T_73474
5 K-OTHERSTRING_T_ffk
6 ABC
7 DEF
And the task is now to find similarities in that column, and write that found similarity to a new column.
So the desired output would be like this:
ID Title Similarity
-- ------------------------ -----------------
1 01_SOMESTRING_038 01_SOMESTRING
2 01_SOMESTRING K5038 01_SOMESTRING
3 01_SOMESTRING-648 01_SOMESTRING
4 K-OTHERSTRING_T_73474 K-OTHERSTRING_T_
5 K-OTHERSTRING_T_ffk K-OTHERSTRING_T_
6 ABC NULL
7 DEF NULL
How can I achieve that in MS SQL Server 17?
Any help is much appreciated. Thanks!
EDIT: The strings are not only broken by delimiters such as "-", "_".
And for handling competing similarities I would set a minimum length for the similarity, for instance 10.
Try the following, using a recursive CTE to split out the letters, then we can group them up to find the greatest match:
WITH TITLE_EXPAND AS (
SELECT
1 MatchLen
,CAST(SUBSTRING(Title,1,1) as NVARCHAR(255)) MatchString
,Title
,ID
FROM
[SourceDataTable]
UNION ALL
SELECT
MatchLen + 1
,CAST(SUBSTRING(Title,1,MatchLen+1) AS NVARCHAR(255))
,Title
,ID
FROM
TITLE_EXPAND
WHERE
MatchLen < LEN(Title)
)
SELECT DISTINCT
SDT.ID
,SDT.title
,FIRST_VALUE(MatchString) OVER (PARTITION BY SDT.ID ORDER BY SC.MatchLen DESC, SC.MatchCount DESC) Similarity
FROM
[SourceDataTable] SDT
LEFT JOIN
(SELECT
*
,COUNT(*) OVER (PARTITION BY MatchString, MatchLen) MatchCount
FROM
TITLE_EXPAND) SC
ON
SDT.ID = SC.ID
AND
SC.MatchCount > 1
ORDER BY SDT.ID
Where SourceDataTable is your source table. The Similarity value will be the longest matched similar value.
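The question's edit asks for a minimum similarity length (for instance 10). One way to honour that, assuming the same aliases, is to extend the join condition in the query above:
ON
    SDT.ID = SC.ID
AND
    SC.MatchCount > 1
AND
    SC.MatchLen >= 10   -- assumed minimum length taken from the question's edit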

more efficiently pivot rows

I am trying to join multiple tables together. One of the tables I am trying to join has hundreds of rows per ID of data. I am trying to pivot about 100 rows for each ID into columns. The value I am trying to use isn't always in the same column. Below is an example (my real table has hundreds of rows per ID). AccNum, for example, may be in the NumV column for ID 1 but in the CharV column for ID 2.
ID  QType   CharV     NumV
--  ------  --------  ----
1   AccNum            10
1   EmpNam  John Inc  0
1   UW      Josh      0
2   AccNum  11
2   EmpNam  CBS       0
2   UW      Dan       0
The original code I used was a select statement with hundreds of lines like the one below:
Max(Case When PM.[QType] = 'AccNum' Then NumV End) as AccNum
This code with hundreds of lines completed in just under 10 min. The problem, however, is that it only pulls in values from the column I specify, so I will always lose the data that is in a different column. (In the example above I would get AccNum 10, but not AccNum 11, because it's in the CharV column.)
I updated the code to use a pivot:
;with CTE
As
(
Select [PMID], [QType],
Value=concat(Nullif([CharV],''),Nullif([NumV],0))
From [DBase].[dbo].[PM]
)
Select C.[ID] AS M_ID
,Max(c.[AccNum]) As AcctNum
,Max(c.[EmpNam]) As EmpName
and so on...
I then select all of my hundreds of rows and then pivot the data:
from CTE
pivot (max(Value) for [QType] in ([AccNum],[EmpNam],(more rows)))As c
The problem with this code, however, is that it takes almost 2 hours to run.
Is there a different, more efficient solution to what I am trying to accomplish? I need to have the speed of the first code, but the result of the second.
Perhaps you can reduce the Concat/NullIf processing by using a UNION ALL:
Select ID,QType,Value=CharV From #YourTable where CharV>''
Union All
Select ID,QType,Value=cast(NumV as varchar(25)) From #YourTable where NumV>0
For the conditional aggregation approach
No need to worry about which field, just reference VALUE
Select [ID]
,[Accnum] = Max(Case When [QType] = 'AccNum' Then Value End)
,[EmpNam] = Max(Case When [QType] = 'EmpNam' Then Value End)
,[UW] = Max(Case When [QType] = 'UW' Then Value End)
From (
Select ID,QType,Value=CharV From #YourTable where CharV>''
Union All
Select ID,QType,Value=cast(NumV as varchar(25)) From #YourTable where NumV>0
) A
Group By ID
For the PIVOT approach
Select [ID],[AccNum],[EmpNam],[UW]
From (
Select ID,QType,Value=CharV From #YourTable where CharV>''
Union All
Select ID,QType,Value=cast(NumV as varchar(25)) From #YourTable where NumV>0
) A
Pivot (max([Value]) For [QType] in ([AccNum],[EmpNam],[UW])) p
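A quick way to sanity-check either version is to load the question's sample rows into #YourTable first; a sketch (the temp-table name comes from the answer above, the rest mirrors the question's sample data):
create table #YourTable (ID int, QType varchar(20), CharV varchar(50), NumV int);

insert into #YourTable (ID, QType, CharV, NumV) values
 (1, 'AccNum', ''        , 10)
,(1, 'EmpNam', 'John Inc', 0)
,(1, 'UW'    , 'Josh'    , 0)
,(2, 'AccNum', '11'      , 0)
,(2, 'EmpNam', 'CBS'     , 0)
,(2, 'UW'    , 'Dan'     , 0);
Both queries should then return one row per ID: AccNum 10, EmpNam John Inc, UW Josh for ID 1, and AccNum 11, EmpNam CBS, UW Dan for ID 2, regardless of which column originally held the value.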

SQL: how to compare values from two tables and report per-row results

I have two Tables.
table A
id name Size
===================
1 Apple 7
2 Orange 15
3 Banana 22
4 Kiwi 2
5 Melon 28
6 Peach 9
And Table B
id size
==============
1 14
2 5
3 31
4 9
5 1
6 16
7 7
8 25
My desired result will be the following (adding one column to Table A: the number of rows in Table B that have a size smaller than the Size in Table A):
id name Size Num.smaller.in.B
==============================
1 Apple 7 2
2 Orange 15 5
3 Banana 22 6
4 Kiwi 2 1
5 Melon 28 7
6 Peach 9 3
Both Table A and B are pretty huge. Is there a clever way of doing this? Thanks
Use this query, it's helpful:
SELECT A.id,
       A.name,
       A.Size,
       (Select count(*) From TableB B Where B.size < A.Size) As [Num.smaller.in.B]
FROM TableA A
The standard way to get your result involves a non-equi-join, which will show up as a product join in Explain: first duplicating the 20,000 rows, followed by 7,000,000 * 20,000 comparisons and a huge intermediate spool before the count.
There's a solution based on OLAP-functions which is usually quite efficient:
SELECT dt.*,
-- Do a cumulative count of the rows of table #2
-- sorted by size, i.e. count number of rows with a size #2 less size #1
Sum(CASE WHEN NAME = '' THEN 1 ELSE 0 end)
Over (ORDER BY SIZE, NAME DESC ROWS Unbounded Preceding)
FROM
( -- mix the rows of both tables, an empty name indicates rows from table #2
SELECT id, name, size
FROM a
UNION ALL
SELECT id, '', size
FROM b
) AS dt
-- only return the rows of table #1
QUALIFY name <> ''
If there are multiple rows with the same size in table #2, you had better count before the UNION to reduce the size:
SELECT dt.*,
-- Do a cumulative sum of the counts of table #2
-- sorted by size, i.e. count number of rows with a size #2 less size #1
Sum(CASE WHEN NAME ='' THEN id ELSE 0 end)
Over (ORDER BY SIZE, NAME DESC ROWS Unbounded Preceding)
FROM
( -- mix the rows of both tables, an empty name indicates rows from table #2
SELECT id, name, size
FROM a
UNION ALL
SELECT Count(*), '', SIZE
FROM b
GROUP BY SIZE
) AS dt
-- only return the rows of table #1
QUALIFY NAME <> ''
There is no clever way of doing that, you just need to join the tables like this:
select a.*, b.size
from TableA a join TableB b on a.id = b.id
To improve performance you'll need to have indexes on the id columns.
maybe
select
id,
name,
a.Size,
sum(cnt) as sum_cnt
from
a inner join
(select size, count(*) as cnt from b group by size) s on
s.size < a.size
group by id,name,a.size
if you're working with large tables. Indexing table b's size field could help. I'm also assuming the values in table B converge, i.e. that there are many duplicates you don't care about, other than wanting to count them.
sqlfiddle
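For the index suggestion, something along these lines (the index name is an assumption):
create index ix_b_size on b (size);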
@Ritesh's solution is perfectly correct; another similar solution uses CROSS APPLY, as shown below.
use tempdb
create table dbo.A (id int identity, name varchar(30), size int );
create table dbo.B (id int identity, size int);
go
insert into dbo.A (name, size)
values ('Apple', 7)
,('Orange', 15)
,('Banana', 22)
,('Kiwi', 2 )
,('Melon', 28)
,('Peach', 6 )
insert into dbo.B (size)
values (14), (5),(31),(9),(1),(16), (7),(25)
go
-- using cross apply
select a.*, t.cnt
from dbo.A
cross apply (select cnt=count(*) from dbo.B where B.size < A.size) T(cnt)
try this query:
SELECT
A.id, A.name, A.size, Count(B.size)
from A, B
where A.size > B.size
group by A.id, A.name, A.size
order by A.id;

SQL group table by "leading rows" without pl/sql

I have this table (short example) with two columns
1 a
2 a
3 a3
4 a
5 a
6 a6
7 a
8 a8
9 a
and I would like to group/partition them into groups separated by those leading "a", ideally to add another column like this, so I can address those groups easily.
1 a 0
2 a 0
3 a3 3
4 a 3
5 a 3
6 a6 6
7 a 6
8 a8 8
9 a 8
The problem is that the setup of the table is dynamic, so I can't use lag or lead functions statically. Any ideas how to do this without PL/SQL in Postgres 9.5?
Assuming the leading part is a single character, the expression right(data, -1) works to extract the group name. Adapt to your actual prefix.
The solution uses two window functions, which can't be nested. So we need a subquery or a CTE.
SELECT id, data
, COALESCE(first_value(grp) OVER (PARTITION BY grp_nr ORDER BY id), '0') AS grp
FROM (
SELECT *, NULLIF(right(data, -1), '') AS grp
, count(NULLIF(right(data, -1), '')) OVER (ORDER BY id) AS grp_nr
FROM tbl
) sub;
Produces your desired result exactly.
NULLIF(right(data, -1), '') to get the effective group name or NULL if none.
count() only counts non-null values, so we get a higher count for every new group in the subquery.
In the outer query, we take the first grp value per grp_nr as group name and default to '0' with COALESCE for the first group without name (which has a NULL as group name so far).
We could use min() or max() as outer window function as well, since there is only one non-null value per partition anyway. first_value() is probably cheapest since the rows are sorted already.
Note the group name grp is data type text. You may want to cast to integer, if those are clean (and reliably) integer numbers.
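If the group names are reliably integers, the cast mentioned above can go straight into the outer query; the same query with the cast added, as a sketch:
SELECT id, data
     , COALESCE(first_value(grp) OVER (PARTITION BY grp_nr ORDER BY id), '0')::int AS grp
FROM  (
   SELECT *, NULLIF(right(data, -1), '') AS grp
        , count(NULLIF(right(data, -1), '')) OVER (ORDER BY id) AS grp_nr
   FROM   tbl
   ) sub;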
This can be achieved by setting rows containing a to one value and all the other rows to a different value, then using a cumulative sum to get the desired number for the rows. The group number moves to the next number when a new value in the val column is encountered; all the following rows with a keep the same group number as the one before, and so on.
I assume that you need a distinct number for each group and that the exact number doesn't matter.
select id, val, sum(ex) over(order by id) cm_sum
from (select t.*
,case when val = 'a' then 0 else 1 end ex
from t) x
The result for the query above with the data in question, would be
id val cm_sum
--------------
1 a 0
2 a 0
3 a3 1
4 a 1
5 a 1
6 a6 2
7 a 2
8 a8 3
9 a 3
With the given data, you can use a cumulative max:
select . . .,
coalesce(max(substr(col2, 2)) over (order by col1), 0)
If you don't strictly want the maximum, then it gets a bit more difficult. The ANSI solution is to use the IGNORE NULLs option on LAG(). However, Postgres does not (yet) support that. An alternative is:
select . . ., coalesce(substr(reft.col2, 2), 0)
from (select . . .,
max(case when col2 like 'a_%' then col1 end) over (order by col1) as ref_col1
from t
) tt join
t reft
on tt.ref_col1 = reft.col1
You can also try this:
with mytable as (select split_part(t,' ',1)::integer id,split_part(t,' ',2) myvalue
from (select unnest(string_to_array($$1 a;2 a;3 a3;4 a;5 a;6 a6;7 a;8 a8;9 a$$,
';'))t) a)
select id,myvalue,myresult from mytable join (
select COALESCE(NULLIF(substr(myvalue,2),''),'0') myresult,idmin id_down
,COALESCE(lead(idmin) over (order by myvalue),999999999999) id_up
from (
select myvalue,min(id) idmin from mytable group by 1
) a) b
on id between id_down and id_up-1