My SQL table looks like this:

unit_number  name                    party_number
1700         Facilities              000018727
1800         Human Resource          000018728
1801         PRO                     000092293
1802         Human Resource          000092294
1803         Recruitment             000092295
1804         Learning & Development  000092296
1805         Administration          000100783
1900         Information Technology  000018729
1901         Information Technology  000092297
F&B          F&B                     000045759
PRODUCT.     Product                 000103719
I want to build a hierarchy from the above table where 1700, 1800, 1900, ... (the multiples of 100) are at level 1, while the subsequent 1801 and 1802 are at level 2 and level 3 respectively.
What should the query for that be? I want to add a level column to the table.
You could use the DENSE_RANK function.
To make the non-numeric values appear at the bottom of the hierarchy, use the following:
select *,
dense_rank() over
(order by
case
when try_cast(unit_number as int) is null
then 1 else 0
end,
isnull(try_cast(unit_number as int),99) % 100
) as h_level
from table_name
To make the non-numeric values appear at the top of the hierarchy, use the following:
select *,
dense_rank() over
(order by
case
when try_cast(unit_number as int) is null
then 0 else 1
end,
isnull(try_cast(unit_number as int),1) % 100
) as h_level
from table_name
And if you want the non-numeric values to be in level 1 along with (1700, 1800, ...), try this:
select *,
dense_rank() over
(order by isnull(try_cast(unit_number as int),0) % 100) as h_level
from table_name
You can try as follows:
UPDATE my_table
SET my_new_col = 1
WHERE unit_number LIKE '%00'
Then repeat with both values increased.
It can be done in one UPDATE with a more complicated command using some string manipulation and type casting, but this sounds like a one-time operation, so I don't see a lot of value in trying to do that.
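For illustration, the second pass might look like this (a sketch reusing the hypothetical table and column names above):
UPDATE my_table
SET my_new_col = 2
WHERE unit_number LIKE '%01'
and so on: my_new_col = 3 with LIKE '%02', 4 with LIKE '%03', etc.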
Sounds like you're looking for the modulo operator:
/* Start Demo Data */
DECLARE @Units TABLE (Unit_Number NVARCHAR(10), Name NVARCHAR(50), Party_Number NVARCHAR(10), HierarchyLevel NVARCHAR(10))
INSERT INTO @Units (Unit_Number, Name, Party_Number) VALUES
('1700', 'Facilities ', '000018727'),
('1800', 'Human Resource ', '000018728'),
('1801', 'PRO ', '000092293'),
('1802', 'Human Resource ', '000092294'),
('1803', 'Recruitment ', '000092295'),
('1804', 'Learning & Development', '000092296'),
('1805', 'Administration ', '000100783'),
('1900', 'Information Technology', '000018729'),
('1901', 'Information Technology', '000092297'),
('F&B', 'F&B' , '000045759'),
('PRODUCT.','Product' , '000103719')
/* End Demo Data */
UPDATE @Units
SET HierarchyLevel = COALESCE((TRY_CAST(Unit_Number AS INT) % 100), 0) + 1

SELECT *
FROM @Units
Unit_Number Name Party_Number HierarchyLevel
------------------------------------------------------------------
1700 Facilities 000018727 1
1800 Human Resource 000018728 1
1801 PRO 000092293 2
1802 Human Resource 000092294 3
1803 Recruitment 000092295 4
1804 Learning & Development 000092296 5
1805 Administration 000100783 6
1900 Information Technology 000018729 1
1901 Information Technology 000092297 2
F&B F&B 000045759 1
PRODUCT. Product 000103719 1
Modulo returns the remainder after dividing a value by the value specified. In this case we want to know what the remainder is after dividing Unit_Number by 100. In order to start with a 1-indexed value as you requested we then add 1 to that.
100 % 100 returns 0; + 1 gives 1
101 % 100 returns 1; + 1 gives 2
We then just use this to update the HierarchyLevel column.
When you're providing demo data for questions like this, it's very helpful to do so in an easily reusable format so folks can quickly work on your problem.
I am trying to write a query in SSMS 2016 that will isolate the value(s) for a group that are unlike the other values within a column. I can explain better with an example:
Each piece of equipment in our fleet has an hour meter reading that gets recorded from a handheld device. Sometimes people in the field enter in a typo meter reading which skews our hourly readings.
So a unit's meter history may look like this:
10/1/2019: 2000
10/2/2019: 2208
10/4/2019: 2208
10/7/2019: 2212
10/8/2019: 2
10/8/2019: 2225
...etc.
It's obvious that the "2" is a bad record because an hour meter can never decrease.

Edit: Sometimes the opposite extreme may occur, where they enter a reading like "22155", and then I would need the query to adapt to find values that are too high and isolate those as well.

This data is stored in a meter history table where there is a single row for each meter reading. I am tasked with creating some type of procedure that will automatically isolate the bad data and delete those rows from the table. How can I write a query that understands the context of the meter history and knows that the 2 is bad?
Any tips welcome, thanks in advance.
You can use a filter to get rid of the "decreases":
select t.*
from (select t.*, lag(col2) over (order by col1) as prev_col2
from t
) t
where prev_col2 < col2;
I would not advise "automatically deleting" such records.
Automatically deleting data is risky, so I'm not certain I'd recommend unleashing that without some serious thought, but here's my idea based on your sample data showing that it's usually a pretty consistent number.
DECLARE @Median numeric(22,0);

;WITH CTE AS
(
    select t.*, row_number() over (order by t.value) as rn
    from t
)
SELECT @Median = cte.value
FROM cte
WHERE cte.rn = (SELECT (MAX(rn) + MIN(rn)) / 2 FROM cte); -- integer division floors when the sum is odd

select *
from dataReadings
where reading_value < (0.8 * @Median) OR reading_value > (1.2 * @Median);
The goal of this is to give you a +/- 20% range of the median value, which shouldn't be as skewed by mistakes as an average would be. Again, this assumes that your values should fall into an acceptable range.
If this is meant to be an always-increasing reading and you shouldn't ever encounter lower values, Gordon's answer is perfect.
I would look at the variation of each reading from the mean reading value. (I picked up the lag() check from @Gordon Linoff's reply too.) For example:
create table #test (the_date date, reading int)
insert #test (the_date, reading) values ('10/1/2019', 2000)
, ('10/2/2019', 2208)
, ('10/4/2019', 2208)
, ('10/7/2019', 2212)
, ('10/8/2019', 2)
, ('10/8/2019', 2225)
, ('10/8/2019', 2224)
, ('10/9/2019', 22155)
declare @avg int, @stdev float

select @avg = avg(reading)
, @stdev = stdev(reading) * 0.5
from #test
select t.*
, case when reading < @avg - @stdev then 'SUSPICIOUS - too low'
when reading > @avg + @stdev then 'SUSPICIOUS - too high'
when reading < prev_reading then 'SUSPICIOUS - decrease'
end Comment
from (select t.*, lag(reading) over (order by the_date) as prev_reading
from #test t
) t
Which results in:
the_date reading prev_reading Comment
2019-10-01 2000 NULL NULL
2019-10-02 2208 2000 NULL
2019-10-04 2208 2208 NULL
2019-10-07 2212 2208 NULL
2019-10-08 2 2212 SUSPICIOUS - too low
2019-10-08 2225 2 NULL
2019-10-08 2224 2225 SUSPICIOUS - decrease
2019-10-09 22155 2224 SUSPICIOUS - too high
I have a query that collects many different columns, and I want to include a column that sums the price of every component in an order. Right now, I already have a column that simply shows the price of every component of an order, but I am not sure how to create this new column.
I would think that the code would go something like this, but I am not really clear on what an aggregate function is or why I get an error regarding the aggregate function when I try to run this code.
SELECT ID, Location, Price, (SUM(PriceDescription) FROM table GROUP BY ID WHERE PriceDescription LIKE 'Cost.%' AS Summary)
FROM table
When I say each component, I mean that every ID I have has many different items that make up the general price. I only want to find out how much money I spend on the supplies that I need for my pressure washers, which is why I used WHERE PriceDescription LIKE 'Cost.%'.
To further explain, I have receipts for every customer I've worked with, and in these receipts I write down my cost for the soap that I use and the tools for the pressure washer that I rent. I label all of these with 'Cost.' so it looks like (Cost.Water), (Cost.Soap), (Cost.Gas), (Cost.Tools), and I would like a column that, for Order 1, sums all the Cost. prices for that order, and for Order 2 sums all the Cost. prices for that order. I should also mention that each Order does not have the same number of Costs (sometimes when I use my power washer I might not have to buy gas, and occasionally soap).
I hope this makes sense, if not please let me know how I can explain further.
ID  Location  Price  PriceDescription
1   Park      10     Cost.Water
1   Park      8      Cost.Gas
1   Park      11     Cost.Soap
2   Tom       20     Cost.Water
2   Tom       6      Cost.Soap
3   Matt      15     Cost.Tools
3   Matt      15     Cost.Gas
3   Matt      21     Cost.Tools
4   College   32     Cost.Gas
4   College   22     Cost.Water
4   College   11     Cost.Tools
I would like for my query to create a column like such
ID  Location  Price  Summary
1   Park      10     29
1   Park      8
1   Park      11
2   Tom       20     26
2   Tom       6
3   Matt      15     51
3   Matt      15
3   Matt      21
4   College   32     65
4   College   22
4   College   11
But if the 'Summary' was printed on every line instead of just at the top one, that would be okay too.
You just need SUM(Price) OVER (PARTITION BY Location), which will give the total sum, as below:
SELECT ID, Location, Price, SUM(Price) over(Partition by Location) AS Summed_Price
FROM yourtable
WHERE PriceDescription LIKE 'Cost.%'
First, if your Price column really contains values that match 'Cost.%', then you cannot apply SUM() to it. SUM() expects a number (e.g. INT, FLOAT, REAL or DECIMAL). If it is text, then you need to explicitly convert it to a number by adding a CAST or CONVERT clause inside the SUM() call.
Second, your query syntax is wrong: you need GROUP BY, and the SELECT fields are not specified correctly. And you want to SUM() the Price field, not the PriceDescription field (which you can't even sum, as I explained).
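For illustration, a conversion inside the aggregate might look like the following sketch (the DECIMAL precision here is an assumption):
SELECT SUM(CAST(Price AS DECIMAL(10, 2))) AS Summed_Price
FROM table
WHERE PriceDescription LIKE 'Cost.%'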
Assuming that Price is numeric (see my first remark), then this is how it can be done:
SELECT ID
, Location
, Price
, (SELECT SUM(Price)
FROM table
WHERE ID = T1.ID AND Location = T1.Location
) AS Summed_Price
FROM table AS T1
To get the exact result as posted in the question:
SELECT
    T.ID,
    T.Location,
    T.Price,
    CASE WHEN R = 1 THEN RN ELSE NULL END Summary
FROM (
    SELECT
        ID,
        Location,
        Price,
        SUM(Price) OVER (PARTITION BY Location) RN,
        ROW_NUMBER() OVER (PARTITION BY Location ORDER BY ID) R
    FROM Table
) T
ORDER BY T.ID
Let's say I have a music video play stats table mydataset.stats for a given day (3B rows, 1M users, 6K artists).
Simplified schema is:
UserGUID String, ArtistGUID String
I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with each artist's play count for the respective user.
An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale for the numbers in my example.
Can this approach be scaled for my example?
I tried the approach below for up to 6000 features and it worked as expected. I believe it will work for up to 10K features, which is the hard limit on the number of columns in a table.
STEP 1 - Aggregate plays by user / artist
SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays
FROM [mydataset.stats] GROUP BY 1, 2
STEP 2 – Normalize uid and aid so they are consecutive numbers 1, 2, 3, ….
We need this for at least two reasons: a) to make the dynamically created SQL later on as compact as possible, and b) to have more usable/friendly column names.
Combined with first step – it will be:
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
SELECT userGUID, artistGUID, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2
) AS s
JOIN (
SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID
Let's write the output to the table mydataset.aggs.
STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time.
In my particular example, by experimenting, I found that the basic approach works well for between 2000 and 3000 features. To be on the safe side, I decided to use 2000 features at a time.
The script below dynamically generates a query that is then run to create the partitioned tables:
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The above query produces yet another query, like below:
SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid
This should be run and the result written to mydataset.pivot_1_2000.
Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
STEP 4 – Merge all the partitioned pivot tables into the final pivot table, with all features represented as columns in one table.
Same as in the above steps: first we generate the query, then run it.
Initially we "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then the result with mydataset.pivot_4001_6000:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
The output string from above should be run and the result written to mydataset.pivot_1_4000.
Then we repeat STEP 4 as below:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
The result is written to mydataset.pivot_1_6000.
The resulting table has the following schema:
uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
NOTE:
a. I tried this approach only up to 6000 features and it worked as expected
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables' sizes are relatively small (30-40MB), as are the billed bytes. For "before 2016" projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea.
I inherited a table that has a column containing hand-entered award numbers. It has been used for many years by many people. The award numbers in general look like this:
R01AR012345-01
R01AR012345-02
R01AR012345-03
Award numbers get assigned each year. Because so many different people have had their hands in this in the past, there isn't a lot of consistency in how these are entered. For instance, an award sequence may appear like this:
R01AR012345-01
1 RO1AR012345-02
12345-03
12345-05A1
1234506
The rule I've been given is to return any record in which 5 consecutive digits from that column match those of another record.
I know how to match a given string, but am at a loss when the 5 consecutive digits are unknown.
Here's a sample table to make what I'm looking for more clear:
+----------------------+
| table: AWARD |
+-----+----------------+
| ID | AWARD_NO |
+-----+----------------+
| 12 | R01AR015123-01 |
+-----+----------------+
| 13 | R01AR015124-01 |
+-----+----------------+
| 14 | 15123-02A1 |
+-----+----------------+
| 15 | 1 Ro1XY1512303 |
+-----+----------------+
| 16 | R01XX099232-01 |
+-----+----------------+
In the above table, the following IDs would be returned: 12,13,14,15
The five consecutive integers that match are:
12,13: 01512
12,14: 15123
12,15: 15123
In our specific case, ID 13 is a false positive... but we're willing to deal with those on a case-by-case basis.
Here's the desired return set for the above table:
+-----+-----+----------------+----------------+
| ID1 | ID2 | AWARD_NO_1 | AWARD_NO_2 |
+-----+-----+----------------+----------------+
| 12 | 13 | R01AR015123-01 | R01AR015124-01 |
+-----+-----+----------------+----------------+
| 12 | 14 | R01AR015123-01 | 15123-02A1 |
+-----+-----+----------------+----------------+
| 12 | 15 | R01AR015123-01 | 1 Ro1XY1512303 |
+-----+-----+----------------+----------------+
Now... I'm OK with false positives (like 12 matching 13) and duplicates (because if 12 matches 14, then 14 also matches 12). We're looking through something like 18,000 rows. Optimization isn't really necessary in this situation, because it only needs to run one time.
This should handle removing duplicates and most false-positives:
DECLARE @SPONSOR TABLE (ID INT NOT NULL PRIMARY KEY, AWARD_NO VARCHAR(50))
INSERT INTO @SPONSOR VALUES (12, 'R01AR015123-01')
INSERT INTO @SPONSOR VALUES (13, 'R01AR015124-01')
INSERT INTO @SPONSOR VALUES (14, '15123-02A1')
INSERT INTO @SPONSOR VALUES (15, '1 Ro1XY1512303')
INSERT INTO @SPONSOR VALUES (16, 'R01XX099232-01')
;WITH nums AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS [Num]
FROM sys.objects
),
cte AS
(
SELECT sp.ID,
sp.AWARD_NO,
SUBSTRING(sp.AWARD_NO, nums.Num, 5) AS [TestCode],
SUBSTRING(sp.AWARD_NO, nums.Num + 5, 1) AS [FalsePositiveTest]
FROM @SPONSOR sp
CROSS JOIN nums
WHERE nums.Num < LEN(sp.AWARD_NO)
AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[1-9][0-9][0-9][0-9][0-9]%'
-- AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
)
SELECT sp1.ID AS [ID1],
sp2.ID AS [ID2],
sp1.AWARD_NO AS [AWARD_NO1],
sp2.AWARD_NO AS [AWARD_NO2],
sp1.TestCode
FROM cte sp1
CROSS JOIN @SPONSOR sp2
WHERE sp2.AWARD_NO LIKE '%' + sp1.TestCode + '%'
AND sp1.ID < sp2.ID
--AND 1 = CASE
-- WHEN (
-- sp1.FalsePositiveTest LIKE '[0-9]'
-- AND sp2.AWARD_NO NOT LIKE
-- '%' + sp1.TestCode + sp1.FalsePositiveTest + '%'
-- ) THEN 0
-- ELSE 1
-- END
Output:
ID1 ID2 AWARD_NO1 AWARD_NO2 TestCode
12 14 R01AR015123-01 15123-02A1 15123
12 15 R01AR015123-01 1 Ro1XY1512303 15123
14 15 15123-02A1 1 Ro1XY1512303 15123
If IDs 14 and 15 should not match, we might be able to correct for that as well.
EDIT:
Based on the comment from @Serpiton, I commented out the creation and usage of the [FalsePositiveTest] field, since changing the initial character range in the LIKE clause on the SUBSTRING to [1-9] accomplishes the same goal slightly more efficiently. However, this change assumes that no valid Award # will start with a 0, and I am not sure that is a valid assumption. Hence, I left the original code in place but commented out.
You want to use the LIKE operator in your WHERE clause with a pattern that looks for the 5 digits.
There are probably better ways of representing this, but the below example looks for 5 digits from 0-9 next to each other anywhere in your column value. This could perform quite slowly, however...
Select *
from blah
Where column LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
Create a SQL Server function to extract the 5 digits and then use the function in your query.
Perhaps something like:
select GetAwardNumber(AwardNumberField) as AwardNumber
from Awards
group by GetAwardNumber(AwardNumberField)
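GetAwardNumber isn't defined here; a minimal sketch of such a function, assuming it should grab the first run of 5 digits, might look like:
CREATE FUNCTION dbo.GetAwardNumber (@award_no VARCHAR(50))
RETURNS VARCHAR(5)
AS
BEGIN
    -- Position of the first run of 5 digits, or 0 when there is none
    DECLARE @pos INT = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @award_no);
    RETURN CASE WHEN @pos > 0 THEN SUBSTRING(@award_no, @pos, 5) END;
END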
I will not post full code, but here's an idea of how to do it.
First of all, you need to make a table-valued function that will return all number sequences of 5 or more characters from a string (there are examples on SO).
So for each entry your function will return a list of numbers.
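For concreteness, a minimal sketch of what such a function could look like (udtf_custom is the hypothetical name used in the query below; this version returns every 5-character all-digit substring of the input):
CREATE FUNCTION dbo.udtf_custom (@myString VARCHAR(100))
RETURNS TABLE
AS
RETURN
    -- Numbers 1..100 drive a sliding 5-character window over the string
    SELECT SUBSTRING(@myString, n.Num, 5) AS pattern
    FROM (SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS Num
          FROM sys.objects) AS n
    WHERE n.Num <= LEN(@myString) - 4
      AND SUBSTRING(@myString, n.Num, 5) NOT LIKE '%[^0-9]%';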
After that, the query simplifies to something like:
;with res as (
select
id, -- hopefully there is an id on your table
pattern -- pattern is from the list of patterns the udtf returns
from myTable
cross apply udtf_custom(myString) -- myString is the string you need to split
)
select
pattern
from res
group by pattern
having count(distinct id)>1
I have to note that this is for example purposes; there would be some coding and testing involved, but this is the general idea.
Good luck, hope it helps.
Here's what I ended up with:
SELECT a1.ID as AWARD_ID_1, a2.ID as AWARD_ID_2, a1.AWARD_NO as Sponsor_Award_1, a2.AWARD_NO as Sponsor_Award_2
FROM AWARD a1
LEFT OUTER JOIN AWARD a2
ON SUBSTRING(a1.AWARD_NO, PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', a1.AWARD_NO + '1'), 5)
 = SUBSTRING(a2.AWARD_NO, PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', a2.AWARD_NO + '1'), 5)
WHERE
a1.AWARD_NO <> '' AND a2.AWARD_NO <> ''
AND a1.ID <> a2.ID
AND a1.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%' AND a2.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
There's a possibility that the first substring of five characters might not match (when they should generate a match), but it's close enough for us. :-)
I am adding batches of records to a table using a single insert statement. I want each new batch to be allocated incrementing numbers, but starting from 1 each time.
So, if I have
Batch Name IncementingValue
1 Joe 1
1 Pete 2
1 Andy 3
2 Sue 1
2 Mike 2
2 Steve 3
and I then add two records (using a single insert statement):
3 Dave
3 Paul
How can I run an update statement against this table so that Dave will be set to 1 and Paul to 2? I don't want to use a cursor.
The ranking function ROW_NUMBER should do what you need. You didn't mention any specific rules about how the sequence number should be allocated, so I've done it here using the name:
INSERT targetTable(Batch,Name,IncementingValue)
SELECT BatchId,
Name,
ROW_NUMBER() OVER (ORDER BY Name)
FROM sourceTable
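If the rows have already been inserted and you specifically need the UPDATE the question asks for, a sketch using an updatable CTE could work (table and column names taken from the sample above):
;WITH numbered AS
(
    SELECT IncementingValue,
           ROW_NUMBER() OVER (PARTITION BY Batch ORDER BY Name) AS rn
    FROM targetTable
)
UPDATE numbered
SET IncementingValue = rn;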
I needed to accomplish something similar with dates and formatted numbers.
Hopefully, someone will find this example useful.
update TEST_TABLE
set ref = reference
from (
    select *,
        (CONVERT(VARCHAR(10), GETDATE(), 12) + RIGHT('0000' + CAST(ROW_NUMBER() OVER (ORDER BY id) AS VARCHAR(4)), 4)) as reference
    from TEST_TABLE
    where test_table.id > 4
      and test_table.id < 8
) TEST_TABLE