For simplicity, suppose I have a SQL table with one column, containing numeric data. Sample data is
11
13
21
22
23
3
31
32
33
41
42
131
132
133
141
142
143
If the table contains ALL the numbers of the form x1, x2, x3, but NOT x (where x is all of the digits of a number but the last digit. So for 123456, x would be 12345), then I want to replace these three rows with a new row, x.
The desired output of the above data would be:
11
13
2
3
31
32
33
41
42
131
132
133
14
How would I accomplish this with SQL? I should mention that I do not want to permanently alter the data - just for the query results.
Thanks.
I assume
the presence of to functions: lastDigit and head producing the last digit and the rest of the input value respectively
the data is unique
only the digits 1,2,3 are used for constructing the table values
the table is named t and has a single column x
you don't want this to work recursively
create a view n like this: select head(x) h, lastDigit(x) last from t You can use inline views instead
create a view c like this
select h
from n
group by h
having count(*) = 3
Then this should give the desired result:
select distinct x
from (
select x from t where head(x) not in (select h from c)
union
select h from c
)
You need two SQL commands, one to remove the existing rows and the second to add the new row. They would look like this (where :X is a parameter containing your base number, 14 in your example):
DELETE FROM YourTable WHERE YourColumn BETWEEN (:X*10) AND ((:X*10) + 9)
INSERT INTO YourTable (YourColumn) VALUES (:X)
Note: I assume you want all the numbers from x0 to x9 to be removed, so that's what I wrote above. If you really only want x1, x2, x3 removed, then you would use this DELETE statement instead:
DELETE FROM YourTable WHERE YourColumn BETWEEN ((:X*10) + 1) AND ((:X*10) + 3)
Related
The following SQL query is supposed to return the max consecutive numbers in a set.
WITH RECURSIVE Mystery(X,Y) AS (SELECT A AS X, A AS Y FROM R)
UNION (SELECT m1.X, m2.Y
FROM Mystery m1, Mystery m2
WHERE m2.X = m1.Y + 1)
SELECT MAX(Y-X) + 1 FROM Mystery;
This query on the set {7, 9, 10, 14, 15, 16, 18} returns 3, because {14 15 16} is the longest chain of consecutive numbers and there are three numbers in that chain. But when I try to work through this manually I don't see how it arrives at that result.
For example, given the number set above I could create two columns:
m1.x
m2.y
7
7
9
9
10
10
14
14
15
15
16
16
18
18
If we are working on rows and columns, not the actual data, as I understand it WHERE m2.X = m1.Y + 1 takes the value from the next row in Y and puts it in the current row of X, like so
m1.X
m2.Y
9
7
10
9
14
10
15
14
16
15
18
16
18
Null?
The main part on which I am uncertain is where in the SQL recursion actually happens. According to Denis Lukichev recursion is the R part - or in this case the RECURSIVE Mystery(X,Y) - and stops when the table is empty. But if the above is true, how would the table ever empty?
Since I don't know how to proceed with the above, let me try a different direction. If WHERE m2.X = m1.Y + 1 is actually a comparison, the result should be:
m1.X
m2.Y
14
14
15
15
16
16
But at this point, it seems that it should continue recursively on this until only two rows are left (nothing else to compare). If it stops here to get the correct count of 3 rows (2 + 1), what is actually stopping the recursion?
I understand that for the above example the MAX(Y-X) + 1 effectively returns the actual number of recursion steps and adds 1.
But if I have 7 consecutive numbers and the recursion flows down to 2 rows, should this not end up with an incorrect 3 as the result? I understand recursion in C++ and other languages, but this is confusing to me.
Full disclosure, yes it appears this is a common university question, but I am retired, discovered this while researching recursion for my use, and need to understand how it works to use similar recursion in my projects.
Based on this db<>fiddle shared previously, you may find it instructive to alter the CTE to include an iteration number as follows, and then to show the content of the CTE rather than the output of final SELECT. Here's an amended CTE and its content after the recursion is complete:
Amended CTE
WITH RECURSIVE Mystery(X,Y) AS ((SELECT A AS X, A AS Y, 1 as Z FROM R)
UNION (SELECT m1.X, m2.A, Z+1
FROM Mystery m1
JOIN R m2 ON m2.A = m1.Y + 1))
CTE Content
x
y
z
7
7
1
9
9
1
10
10
1
14
14
1
15
15
1
16
16
1
18
18
1
9
10
2
14
15
2
15
16
2
14
16
3
The Z field holds the iteration count. Where Z = 1 we've simply got the rows from the table R. The, values X and Y are both from the field A. In terms of what we are attempting to achieve these represent sequences consecutive numbers, which start at X and continue to (at least) Y.
Where Z = 2, the second iteration, we find all the rows first iteration where there is a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That becomes the new highest number, and we add one to the number of iterations. As only three numbers in our original data set have successors within the set, there are only three rows output in the second iteration.
Where Z = 3, the third iteration, we find all the rows of the second iteration (note we are not considering all the rows of the first iteration again), where there is, again, a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That, again, becomes the new highest number, and we add one to the number of iterations.
The process will attempt a fourth iteration, but as there are no rows in R where the value is one more than the Y values from our third iteration, no extra data gets added to the CTE and recursion ends.
Going back to the original db<>fiddle, the process then searches our CTE content to output MAX(Y-X) + 1, which is the maximum difference between the first and last values in any consecutive sequence, plus one. This finds it's value from the record produced in the third iteration, using ((16-14) + 1) which has a value of 3.
For this specific piece of code, the output is always equivalent to the value in the Z field as every addition of a row through the recursion adds one to Z and adds one to Y.
I have got a data set that contains 3 columns and has 15565 observations. one of the columns has got several words in the same row.
What I am looking to do is to extract a particular word from each row and append it to a new column (i will have 4 cols in total)
The problem is that the word that i am looking for are not the same and they are not always on the same position.
Here is an extract of my DS:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract all the words that start with 3Cand are 10 characters long and then append the to a new col so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and substring methods but it does not give me the results i am looking for as it basically finds the rows that have the 3C number but does not extract it, it just copies the whole cell and pastes is on the Ref column.
SQL Server doesn't have good string functions, but this should suffice if you only want to extract one value per row:
select t.*,
left(stuff(col,
1,
patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col),
''
), 10)
from t ;
For example, the dataset a is
id x
1 15
2 25
3 35
4 45
I want to add a column y to dataset a, y being the average of x excluding the current id.
so y_1 = (x_2+x_3+x_4)/3 = (25+35+45)/3.
Easiest way to do it without SQL is to add the mean and the n to each row (use PROC MEANS, then merge on the values), and then use math to remove the current value. IE, if x_mean=(15+25+35+45)/4 = 30, and x=15, then
x_mean_others = ((30*4)-15)/(4-1) = 105/3 = 35
Alternateively, in SQL, you can calculate it on the fly with the same idea.
proc sql;
create table want as
select x, (mean(x)*n(x) - x)/(n(x)-1) as y
from have H
;
quit;
This takes advantage of SAS's automatic remerging, in something like SQL Server you'd need a WITH clause to make this work I imagine.
Every letter has a value
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
TableA
String Length Value Subwords
exampledomain 13 132 #example-domain#example-do-main#
creditcard 10 85 #credit-card#credit-car-d#
TableB
Words Length Value
example 7 76
do 2 19
main 4 37
domain 6 56
credit 6 59
card 4 26
car 3 22
d 1 4
Explanation
TableA has string based over milion rows, and it will be new added 100k rows/daily to tableA.
And also "string" column has no whitespaces
TableB has words based over milion rows,there is every letter and words in 1-2 languages
What i want to do
i want to split strings in TableA to its subwords, as you see in example; "creditcard" i search in TableB all words and try to find which words when comes together matches the string
What i did,and couldnt solve my question
i took the string and JOIN the TableB with INNER JOINS i made 2-3 times INNER JOINS because there can be 3word 4word strings too, and that WORKED!! but it takes too much time even doing it for 100-200 strings. Guess i want to do it for 100k/everyday???
Now what i try to do
i gave values to everyletter as you see above,
Took the strings one by one and from their including letters i count the value of strings..
And the same for the words too in TableB..
Now i have every string in TableA and everyword in TableB with their VALUES..
_
1- i will take the string,length and value of it (Exmple; creditcard - 10 - 85)
2- and make a search in TableB to find the possible words when they come together, with their SUM(length), and SUM(value) matches the strings length and value, and write theese possibilities to a new column.
At last even their sum of length and sum of values matches each other there can be some posibilities that doesnt match the whole string i will elliminate theese ones (Example; "doma-in" can be "moda-in" too and their lengths and values are same but not same words)
I dont know but,i guess with that value method i can solve the time proplem??? , or if there is another ways to do that, i will be gratefull taking your advices.
Thanks
You could try to find the solutions recursively by looking always at the next letter. For example for the word DOMAIN
D - no
DO - is a word!
M - no
MA - no
MAI - no
MAIN - is a word!
No more letters --> DO + MAIN
DOM - is a word!
A - no
AI - no
AIN - no
Finished without result
DOMA - no
DOMAI - no
DOMAIN - is a word!
No more letters --> DOMAIN
I tried to find solutions for this and it is somehow easy to solve when records are below a certain number. But...
I have an original list with 81,590 records.
Id Loc Sales LatLong
1 a 100 ...
2 b 110 ...
3 c 105 ...
4 d 125 ...
5 e 123 ...
6 f 35 ...
.
.
.
81,590 ... ... ...
I need to compare all items in the list against each other.
Id L1 L2 Dist
1 a a 0 --> Not needed. Self comparison.
2 a b 26
3 a c 150 --> Not needed. Distance >100.
4 a d 58
5 b a 26 --> Not needed. Repeated record.
6 b b 0 --> Not needed. Self comparison.
7 b c 15
8 b d 151 --> Not needed. Distance >100.
9 c a 150 --> Not needed. Repeated record.
10 c b 15 --> Not needed. Repeated record.
11 c c 0 --> Not needed. Self comparison.
12 c d 75
13 d a 58 --> Not needed. Repeated record.
14 d b 151 --> Not needed. Repeated record.
15 d c 75 --> Not needed. Repeated record.
16 d d 0 --> Not needed. Self comparison.
But as shown next to the records above, the end result needs to be a list that:
1) Compares records against each other ONLY when they are located at a certain distance, say <100 miles.
2) Does not contain duplicates in the sense that comparing Loc1 to Loc2 is the same as comparing Loc2 to Loc1.
3) And the obvious one, no need to compare Loc1 to itself.
The end result would be:
Id L1 L2 Dist
2 a b 26
4 a d 58
7 b c 15
12 c d 75
Approach:
In theory, the total number of records after comparing all items against themselves is 81,590 ^ 2 = 6,656,928,100 records.
Subtracting repeated iterations (LocA-LocB = LocB-LocA) would mean 6,656,928,100 / 2 = 3,328,464,050.
Further cleaning by getting rid of self-repeating iterations (LocA-LocA), should be 3,328,464,050 - 81,590 = 3,328,382,460.
Then I could get rid of all records with distance > 100 miles.
This is highly inefficient, I'd be building a table with 6Bn records, then deleting half, etc. etc. etc.
Is there an approach to arrive to the end product in a much more efficient (less steps, less select/delete/update) way?
What is the select statement needed to insert the final data-set into destination?
It sounds to me like there is a join of the table with itself and a filtering by iterations of the key but here is where I am stuck.
What algorithm are you using to calculate distance between two points? Simple “the world is flat” cartesian math, or that trigonometry-laden “the word is an oblate spherloid” one? This can turn into serious CPU requirements.
It’s probably best to generate a table of “locations that are within distance X of this location” once and store it permanently; barring major events like earthquakes, it’s just not going to change.
Query-wise, the base join is trivial:
SELECT
t1.Loc L1
,t2.Loc L2
from MyTable t1
inner join MyTable t2
on t2.Loc > t1.Loc
If have the distance formula in, say, a function named “distanceFunction”, it might look something like:
WITH cteCalc as (
select
t1.Loc L1
,t2.Loc L2
,dbo.distanceFunction(t1.LatLong, t2.LatLong) Dist
from MyTable t1
inner join MyTable t2
on t2.Loc > t1.Loc
where dbo.distanceFunction(t1.LatLong, t2.LatLong) < #MaxDistance)
INSERT TargetTable (L1, L2, Dist)
SELECT
L1
,L2
,Dist
where Dist <= #MaxDistance
This, of course, may break your system, if only because the transaction log will grow too big while you’re writing a few billion rows to the target table. I'd say build a loop, processing each location in turn, with the final query like:
WITH cteCalc as (
select
t1.Loc L1
,t2.Loc L2
,dbo.distanceFunction(t1.LatLong, t2.LatLong) Dist
from MyTable t1
inner join MyTable t2
on t2.Loc > t1.Loc
where dbo.distanceFunction(t1.LatLong, t2.LatLong) < #MaxDistance
and t1.Loc = #ThisIterationLoc)
INSERT TargetTable (L1, L2, Dist)
SELECT
L1
,L2
,Dist
where Dist <= #MaxDistance
First pass returns 81,589 less whichever are too far away, second pass as 81,588 to process, and so forth.
Here is an outline of how I would solve this problem:
Put indexes on latitude and longitude
Do the math for lat and long of your distance for the range (box) of your distance. Then you know that your distance (as box not a circle) is contained in this delta. You also know that it is not outside of this delta. This constrains the problem considerably.
For example, if the change in lat and long is 10 for your distance then a location at (100,100) your box would be defined by (95,95) and (105,105) values for lat and long.
Write a query that looks at each element (from lowest id) and searches for other elements (with greater id to avoid duplicates) within the the delta of lat and log and save this to a temporary table.
Iterate over that table and do a full calculation to see if it is within the circle (not the box) of your distance.