Frequency table of continuous variable in SQL?

I have a SQL table with a continuous variable x:
x
1 622.108
2 622.189
3 622.048
4 622.758
5 622.191
6 622.677
7 622.598
8 622.020
9 621.228
10 622.690
...
and I am trying to get a simple frequency table, e.g. with 3 buckets, like this:
bucket n
[621.228-621.738[ 1
[621.738-622.248[ 5
[622.248-622.758] 4
It seems easy, but I cannot manage to do it in SQL (I am running it on a Cloudera Impala engine).
I have looked into dense_rank() and ntile() without success.
Any ideas?

You can use window functions to divide the range into three equal parts and then use arithmetic:
select min_x + bucket_width * (row_number() over (order by min(x)) - 1) as bucket_lo,
       min_x + bucket_width * row_number() over (order by min(x)) as bucket_hi,
       count(*) as n
from (select t.*,
             min(x) over () as min_x,
             max(x) over () as max_x,
             -- small epsilon so the maximum value still falls into the last bucket
             (0.000001 + max(x) over () - min(x) over ()) / 3 as bucket_width
      from t
     ) t
group by floor((x - min_x) / bucket_width), min_x, bucket_width
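For what it is worth, ntile() (which the question mentions) will not produce these ranges: it builds buckets with equal row counts rather than equal widths. A minimal sketch to show the difference, assuming the same table t with column x:
-- ntile(3) splits the ordered rows into three groups of (roughly) equal size;
-- the bucket boundaries follow the data rather than a fixed width
select bucket,
       min(x) as bucket_lo,
       max(x) as bucket_hi,
       count(*) as n
from (select x, ntile(3) over (order by x) as bucket
      from t) b
group by bucket
order by bucket;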

There are at least two problems with your question:
You have not provided any code to show us what you have tried. It really is good sometimes to just work out the problem yourself. Nevertheless, I found the problem interesting and decided to play.
Your range blocks overlap. If, for example, you were to have the value 621.738 in your list, which bucket would contain it? [621.228-621.738] or [621.738-622.248]?
There are also at least three problems with my answer, so I don't expect you to accept this. However, maybe it will get you started. Hopefully, this disclaimer will keep me from getting downvoted. :-)
The answer is in T-SQL. Sorry, it's what I have to work with.
The answer is not generic. It always creates three and only three buckets.
It only works if the data type limits the result to 3 decimal places.
Remember, this is only one possible solution, and in my mind a very weak one at that.
With those disclaimers, here's what I wrote:
SELECT
    '[' + STR( RANGES.RANGESTART, 7, 3 )
        + ' - '
        + STR( RANGES.RANGEEND, 7, 3 ) + ']' AS 'BUCKET'
    ,COUNT(*) AS 'N'
FROM
    ( SELECT
          VALS.MINVAL + (CAST( CNT.INC AS DECIMAL(7,3) ) * VALS.RANGEWIDTH) AS 'RANGESTART'
          ,CASE WHEN CNT.INC < 2
                THEN VALS.MINVAL + (CAST( CNT.INC + 1 AS DECIMAL(7,3) ) * VALS.RANGEWIDTH) - 0.001
                ELSE VALS.MINVAL + (CAST( CNT.INC + 1 AS DECIMAL(7,3) ) * VALS.RANGEWIDTH)
           END AS 'RANGEEND'
      FROM
          ( SELECT
                MIN(CURVAL) AS 'MINVAL'
                ,MAX(CURVAL) AS 'MAXVAL'
                ,(MAX(CURVAL) - MIN(CURVAL)) / 3 AS 'RANGEWIDTH'
            FROM
                MYVALUE ) VALS
          CROSS JOIN (VALUES (0), (1), (2) ) CNT(INC)
    ) RANGES
    INNER JOIN MYVALUE V
        ON V.CURVAL BETWEEN RANGES.RANGESTART AND RANGES.RANGEEND
GROUP BY
    RANGES.RANGESTART
    ,RANGES.RANGEEND
ORDER BY 1
;
In the above, your values would be in the CURVAL column of the MYVALUE table.
Good luck. I hope this helps you on your way.

Related

Is there a way I can Query Missing numbers in a table?

I work for a logistics company and we have to have a 7-digit Pro Number on each piece of freight, assigned in a pre-determined order. We know there are gaps in the numbers, but is there any way I can query the system and find out which ones are missing?
So show me all the numbers from 1000000 to 2000000 that do not exist in the column trace_number.
As you can see below, the sequence goes 1024397, 1024398, then 1051152, so I know there is a substantial gap of about 26k pro numbers, but is there any way to just query the gaps?
Select t.trace_number,
integer(trace_number) as number,
ISNUMERIC(trace_number) as check
from trace as t
left join tlorder as tl on t.detail_number = tl.detail_line_id
where left(t.trace_number,1) in ('0','1','2','3','4','5','6','7','8','9')
and date(pick_up_by) >= current_date - 1 years
and length(t.trace_number) = 7
and t.trace_type = '2'
and site_id in ('SITE5','SITE9','SITE10')
and ISNUMERIC(trace_number) = 'True'
order by 2
fetch first 10000 rows only
I'm not sure what your query has to do with the question, but you can identify gaps using lag()/lead(). The idea is:
select (trace_number + 1) as start_gap,
       (next_tn - 1) as end_gap
from (select t.*,
             lead(trace_number) over (order by trace_number) as next_tn
      from trace t
     ) t
where next_tn <> trace_number + 1;
This does not find them within a range. It just finds all gaps.
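If you also want to restrict the output to the 1000000-2000000 window, one possible variation is to filter before applying lead(); this is only a sketch (it assumes trace_number compares as a number, and it will not report a gap that begins before the first number inside the window):
select (trace_number + 1) as start_gap,
       (next_tn - 1) as end_gap
from (select trace_number,
             lead(trace_number) over (order by trace_number) as next_tn
      from trace
      where trace_number between 1000000 and 2000000
     ) t
where next_tn <> trace_number + 1;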
Try something like this (adapt the WHERE condition, or move it into the ON clause):
with Range (nb) as (
    values 1000000
    union all
    select nb + 1 from Range
    where nb < 2000000
)
select *
from Range f1 left outer join trace f2
    on f2.trace_number = f1.nb
    and f2.trace_number between 1000000 and 2000000
where f2.trace_number is null

How can I alter this to run in postgresql?

What needs changing for this to run in postgreSQL?
I was given this piece of SQL:
UPDATE ACC
SET ACC.ACC_EC = SITESmin.ACC_EC,
ACC.ACC_NC = SITESmin.ACC_NC
FROM ACC
INNER JOIN LATERAL ( SELECT TOP 1
*
FROM SITES
ORDER BY ( acc_ec - site_etg ) * ( acc_ec - site_etg ) + (acc_ncb - site_ntg ) * ( acc_ncb - site_ntg )
) SITESmin;
It seems to be using SET but I do not know why, so if it's not needed drop it.
I am trying to get PostgreSQL to work out distances. For every record in file one I have to compare to 3300 records in file two and select the nearest. Received wisdom suggests an array solution for the 3300, but I do not know how to do that. Perhaps it is a "sub query" in SQL.
If I am permitted to upload samples I will do so, though I have the feeling this is not allowed?
Here are the field names:
public.acc.Location_Easting_OSGR
public.acc.Location_Northing_OSGR
"public"."Sites"."SITE_ETG"
"public"."Sites"."SITE_NTG"
Try this:
WITH SITESmin as (
    SELECT ACC_EC, ACC_NC
    FROM SITES
    ORDER BY ( acc_ec - site_etg ) * ( acc_ec - site_etg ) + ( acc_ncb - site_ntg ) * ( acc_ncb - site_ntg )
    LIMIT 1
)
UPDATE ACC
SET ACC_EC = SITESmin.ACC_EC,
    ACC_NC = SITESmin.ACC_NC
FROM SITESmin;
If it does not work, please provide the schema and some data to make it easier to reproduce
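If the goal is really the nearest site for every row in ACC (as the question describes), rather than a single global nearest row, a per-row correlated subquery is one possible approach. This is only a sketch: the column names acc_ec, acc_nc, site_etg and site_ntg are taken from the snippets above and assumed to exist, the site coordinates are copied into the ACC row to mirror the SET above, and the multi-column SET syntax needs PostgreSQL 9.5 or later:
UPDATE acc
SET (acc_ec, acc_nc) = (
    -- for this acc row, pick the sites row with the smallest squared distance
    SELECT s.site_etg, s.site_ntg
    FROM sites s
    ORDER BY (acc.acc_ec - s.site_etg) * (acc.acc_ec - s.site_etg)
           + (acc.acc_nc - s.site_ntg) * (acc.acc_nc - s.site_ntg)
    LIMIT 1
);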

SQL Get count with most common part of string

I am able to get the count of the most common values of a column, e.g.
SELECT COUNT(*) AS Count, ProjectID
FROM Projects
GROUP BY ProjectID
ORDER BY Count DESC
Now I have a table like this:
ProjectID ProjectUrl
1 http://www.CompanyA.com/Projects/123
2 http://www.CompanyB.com/Projects/124
3 http://www.CompanyA.com/Projects/125
4 http://www.CompanyB.com/Projects/126
5 http://www.CompanyA.com/Projects/127
Now the expected result, without providing any parameter:
ProjectUrl = http://www.CompanyA.com Count = 3
ProjectUrl = http://www.CompanyB.com Count = 2
Edit
Sorry, I forgot to mention the types of URLs I have in the table. The URLs are quite random, but there are URLs that are common. As we are creating project categories, a project category URL can be:
https://spanish.CompanyAa2342.com/portal/projectA/projectTeamA/ProjectPersonA/Task/124
but for some projects there is no project team and so on, so it's a bit random.
I will need to query something more generic.
What the URLs will have in common:
http://ramdomLanguage.CompanyName.com/portal/RandomName.....
Please try:
select
    Col,
    COUNT(Col) Cnt
from (
    select
        SUBSTRING(ProjectUrl, 0, PATINDEX('%.com/%', ProjectUrl) + 4) Col
    from tbl
) x
group by Col
SQL Fiddle Demo
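Given the edit about more random URLs, a hedged variation on the same idea is to cut at the first / after the :// so the result does not depend on .com; tbl and ProjectUrl are the same assumed names as above:
select
    -- keep everything up to and including the first '/' after '://',
    -- i.e. the scheme and host; URLs with no path after the host would need an extra CASE
    LEFT(ProjectUrl, CHARINDEX('/', ProjectUrl, CHARINDEX('//', ProjectUrl) + 2)) AS Col,
    COUNT(*) AS Cnt
from tbl
group by LEFT(ProjectUrl, CHARINDEX('/', ProjectUrl, CHARINDEX('//', ProjectUrl) + 2))
order by Cnt desc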
Not sure of performance when dealing with a huge dataset, but this is a solution. I've tried to get a row for each portion of the URLs, delimited by /. Then do a quick aggregate at the end to bring up the counts of each individual part. Fiddle is here: http://www.sqlfiddle.com/#!3/742c4/12 (I've added one row for demo's sake - thanks, TechDo.)
WITH cteFSPositions
AS
(
SELECT ProjectID,
ProjectURL,
1 AS CharPos,
MAX(LEN(ProjectURL)) AS MaxLen,
CHARINDEX('/', ProjectURL) AS FSPos
FROM Projects
GROUP BY ProjectID,
ProjectURL
UNION ALL
SELECT ProjectID,
ProjectURL,
CharPos + 1,
MaxLen,
CHARINDEX('/', ProjectURL, CharPos + 1) AS FSPos
FROM cteFSPositions
WHERE CharPos <= MaxLen
),
cteProjectURLParts
AS
(
SELECT DISTINCT ProjectID,
LEFT(ProjectURL, FSPos) AS ProjectURLPart,
FSPos
FROM cteFSPositions
WHERE FSPos > 0
UNION ALL
SELECT ProjectID,
ProjectURL,
LEN(ProjectURL)
FROM Projects
),
cteFilteredProjectURLParts
AS
(
SELECT ProjectID,
ProjectURLPart
FROM cteProjectURLParts
WHERE ProjectURLPart NOT IN ('http:', 'http:/', 'http://', 'https:', 'https:/', 'https://')
)
SELECT ProjectURLPart,
COUNT(*) AS Instances
FROM cteFilteredProjectURLParts
GROUP BY ProjectURLPart
ORDER BY Instances DESC,
ProjectURLPart;
This produces (with the additional row I added in):
ProjectURLPart Instances
http://www.CompanyA.com/ 4
http://www.CompanyA.com/Projects/ 3
http://www.CompanyB.com/ 2
http://www.CompanyB.com/Projects/ 2
http://www.CompanyA.com/BlahblahBlah/ 1
http://www.CompanyA.com/BlahblahBlah/More1/ 1
http://www.CompanyA.com/BlahblahBlah/More1/More2 1
http://www.CompanyA.com/Projects/123 1
http://www.CompanyA.com/Projects/125 1
http://www.CompanyA.com/Projects/127 1
http://www.CompanyB.com/Projects/124 1
http://www.CompanyB.com/Projects/126 1
EDIT: Oops, the original post had the code of a work-in-progress fiddle. I have supplied the finalized code and updated the fiddle link.
EDIT 2: I realized I was cutting off the end part of the URLs due to the way I was splitting them up. For completeness' sake, I've added them back into the final dataset. Updated fiddle as well.

T-SQL Sum Values of Like Rows

I currently use this select statement in SSRS to report Recent Demand and Days of Inventory to end users.
select Issue.MATERIAL_NUMBER,
SUM(Issue.SHIPPED_QTY)AS DEMAND_QTY,
Main.QUANTITY_TOTAL_STOCK / SUM(Issue.SHIPPED_QTY) * 122 AS [DOI]
From AGS_DATAMART.dbo.GOODS_ISSUE AS Issue
join AGS_DATAMART.dbo.OPR_MATERIAL_DIM AS MAT on MAT.MATERIAL_NUMBER = Issue.MATERIAL_NUMBER
join AGS_DATAMART.dbo.SCE_ECC_MAIN_FINAL_INV_FACT AS MAIN on MAT.MATERIAL_SID = MAIN.MATERIAL_SID
join AGS_DATAMART.dbo.SCE_PLANT_DIM AS PLANT on PLANT.PLANT_SID = MAIN.PLANT_SID
Where Issue.SHIP_TO_CUSTOMER_ID = #CUSTID
and Issue.ACTUAL_PGI_DATE > GETDATE() - 122
and PLANT.PLANT_CODE = #CUSTPLANT
and MAIN.STORAGE_LOCATION = '0001'
Group by Issue.MATERIAL_NUMBER,Main.QUANTITY_TOTAL_STOCK
Pretty Simple.
But it has come to my attention that there are similar Material Numbers whose values need to be combined.
Material | Qty
0242-55161W 1
0242-55161 3
The two Material Numbers above should be combined and reported as 0242-55161 Qty 4.
How do I combine rows like this? This is just 1 of many queries that will need to be adjusted. Is it possible?
EDIT - The similar material will always be the base number plus the "W", if that matters.
Please note I am brand new to SQL and SSRS, and this is my first time posting here.
Let me know if I need to include any other details.
Thanks in advance.
Answer:
Using just replace, it kept returning 2 unique lines even when using SUM.
I was able to get the desired result using the following. Can you see anything wrong with this method?
with Issue_Con AS
(
select replace(Issue.MATERIAL_NUMBER,'W','') As [MATERIAL_NUMBER],
Issue.SHIPPED_QTY AS [SHIPPED_QTY]
From AGS_DATAMART.dbo.GOODS_ISSUE AS Issue
Where Issue.SHIP_TO_CUSTOMER_ID = #CUSTSHIP
and Issue.SALES_ORDER_TYPE_CODE = 'ZTPC'
and Issue.ACTUAL_PGI_DATE > GETDATE() - 122
)
select Issue_Con.MATERIAL_NUMBER,
SUM(Issue_Con.SHIPPED_QTY)AS [DEMAND_QTY],
Main_Con.QUANTITY_TOTAL_STOCK / SUM(Issue_Con.SHIPPED_QTY) * 122 AS [DOI]
From Issue_Con
join Main_Con on Main_Con.MATERIAL_Number = Issue_Con.MATERIAL_Number
Group By Issue_Con.MATERIAL_NUMBER, Main_Con.QUANTITY_TOTAL_STOCK;
You need to replace Issue.MATERIAL_NUMBER in the select and group by with something else. What that something else is depends on your data.
If it's always the first 10 characters with anything afterwards ignored, then you can use substring(Issue.MATERIAL_NUMBER, 1, 10)
If the extraneous character is always W and there are no Ws in the proper number, then you can use replace(Issue.MATERIAL_NUMBER, 'W', '')
If it's anything from the first alphabetic character onwards, then you can use case when patindex('%[A-Za-z]%', Issue.MATERIAL_NUMBER) = 0 then Issue.MATERIAL_NUMBER else substring(Issue.MATERIAL_NUMBER, 1, patindex('%[A-Za-z]%', Issue.MATERIAL_NUMBER) - 1) end
You could group your data by this expression instead of MATERIAL_NUMBER:
CASE SUBSTRING(MATERIAL_NUMBER, LEN(MATERIAL_NUMBER), 1)
WHEN 'W' THEN LEFT(MATERIAL_NUMBER, LEN(MATERIAL_NUMBER) - 1)
ELSE MATERIAL_NUMBER
END
That is, check if the last character is W. If it is, return all but the last character, otherwise return the entire value.
To avoid repeating the same expression twice (once in GROUP BY and once in SELECT) you could use a subselect, for example like this:
select Issue.MATERIAL_NUMBER_GROUP,
SUM(Issue.SHIPPED_QTY)AS DEMAND_QTY,
Main.QUANTITY_TOTAL_STOCK / SUM(Issue.SHIPPED_QTY) * 122 AS [DOI]
From (
SELECT
*,
CASE SUBSTRING(MATERIAL_NUMBER, LEN(MATERIAL_NUMBER), 1)
WHEN 'W' THEN LEFT(MATERIAL_NUMBER, LEN(MATERIAL_NUMBER) - 1)
ELSE MATERIAL_NUMBER
END AS MATERIAL_NUMBER_GROUP
FROM AGS_DATAMART.dbo.GOODS_ISSUE
) AS Issue
join AGS_DATAMART.dbo.OPR_MATERIAL_DIM AS MAT on MAT.MATERIAL_NUMBER = Issue.MATERIAL_NUMBER
join AGS_DATAMART.dbo.SCE_ECC_MAIN_FINAL_INV_FACT AS MAIN on MAT.MATERIAL_SID = MAIN.MATERIAL_SID
join AGS_DATAMART.dbo.SCE_PLANT_DIM AS PLANT on PLANT.PLANT_SID = MAIN.PLANT_SID
Where Issue.SHIP_TO_CUSTOMER_ID = #CUSTID
and Issue.ACTUAL_PGI_DATE > GETDATE() - 122
and PLANT.PLANT_CODE = #CUSTPLANT
and MAIN.STORAGE_LOCATION = '0001'
Group by Issue.MATERIAL_NUMBER_GROUP,Main.QUANTITY_TOTAL_STOCK

Selecting elements that don't exist

I am working on an application that has to assign numeric codes to elements. These codes are not consecutive, and my idea is not to insert them into the database until the related element exists, but I would like to find, in SQL, the codes that have not been assigned, and I don't know how to do it.
Any ideas?
Thanks!!!
Edit 1
The table can be as simple as this:
code | element
-----------------
3 | three
7 | seven
2 | two
And I would like something like this: 1, 4, 5, 6. Without any other table.
Edit 2
Thanks for the feedback, your answers have been very helpful.
This will return NULL if a code is not assigned:
SELECT assigned_codes.code
FROM codes
LEFT JOIN
assigned_codes
ON assigned_codes.code = codes.code
WHERE codes.code = #code
This will return all non-assigned codes:
SELECT codes.code
FROM codes
LEFT JOIN
assigned_codes
ON assigned_codes.code = codes.code
WHERE assigned_codes.code IS NULL
There is no pure SQL way to do exactly the thing you want.
In Oracle, you can do the following:
SELECT lvl
FROM (
SELECT level AS lvl
FROM dual
CONNECT BY
level <=
(
SELECT MAX(code)
FROM elements
)
)
LEFT OUTER JOIN
elements
ON code = lvl
WHERE code IS NULL
In PostgreSQL, you can do the following:
SELECT lvl
FROM generate_series(
1,
(
SELECT MAX(code)
FROM elements
)) lvl
LEFT OUTER JOIN
elements
ON code = lvl
WHERE code IS NULL
Contrary to the assertion that this cannot be done using pure SQL, here is a counter example showing how it can be done. (Note that I didn't say it was easy - it is, however, possible.) Assume the table's name is value_list with columns code and value as shown in the edits (why does everyone forget to include the table name in the question?):
SELECT b.bottom, t.top
FROM (SELECT l1.code - 1 AS top
FROM value_list l1
WHERE NOT EXISTS (SELECT * FROM value_list l2
WHERE l2.code = l1.code - 1)) AS t,
(SELECT l1.code + 1 AS bottom
FROM value_list l1
WHERE NOT EXISTS (SELECT * FROM value_list l2
WHERE l2.code = l1.code + 1)) AS b
WHERE b.bottom <= t.top
AND NOT EXISTS (SELECT * FROM value_list l2
WHERE l2.code >= b.bottom AND l2.code <= t.top);
The two parallel queries in the from clause generate values that are respectively at the top and bottom of a gap in the range of values in the table. The cross-product of these two lists is then restricted so that the bottom is not greater than the top, and such that there is no value in the original list in between the bottom and top.
On the sample data, this produces the range 4-6. When I added an extra row (9, 'nine'), it also generated the range 8-8. Clearly, you also have two other possible ranges for a suitable definition of 'infinity':
-infinity .. MIN(code)-1
MAX(code)+1 .. +infinity
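For the first of these, a hedged sketch (assuming the codes are meant to start at 1, as the expected output 1, 4, 5, 6 suggests) could be:
-- gap below the current minimum, if any, reported as a single range
SELECT 1 AS bottom, MIN(code) - 1 AS top
FROM value_list
HAVING MIN(code) > 1;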
Note that:
If you are using this routinely, there will generally not be many gaps in your lists.
Gaps can only appear when you delete rows from the table (or you ignore the ranges returned by this query or its relatives when inserting data).
It is usually a bad idea to reuse identifiers, so in fact this effort is probably misguided.
However, if you want to do it, here is one way to do so.
This is the same idea that Quassnoi published; I have just linked the ideas together in T-SQL.
DECLARE @series TABLE (n int)

DECLARE
    @max_n int,
    @i int
SET @i = 1

-- max value in the elements table
SELECT @max_n = MAX(code) FROM elements

-- fill @series with numbers from 1 to @max_n
WHILE @i <= @max_n
BEGIN
    INSERT INTO @series (n) VALUES (@i)
    SET @i = @i + 1
END

-- unassigned codes -- those without a matching row in the elements table
SELECT
    n
FROM
    @series AS series
LEFT JOIN
    elements
ON
    elements.code = series.n
WHERE
    elements.code IS NULL
EDIT:
This is, of course, not an ideal solution. If you have a lot of elements, or you check for non-existing codes often, this could cause performance issues.