Add Unique Identifier column in SQL View

I'm using SQL in Databricks and I have a view of a table which looks like this:
COL_A | COL_B
-------------
AA    | AB
AA    | AC
AA    | AD
AA    | AE
I would like to create a column which increments by 1 starting from 100, such that I get data looking like this:
COL_A | COL_B | COL_C
---------------------
AA    | AB    | 100
AA    | AC    | 101
AA    | AD    | 102
AA    | AE    | 103
I've tried using IDENTITY(100,1) and AUTO_INCREMENT, but I don't think they work with SQL views and/or Databricks. If anyone has any ideas I'd greatly appreciate it, thank you!

I think you need to learn about SEQUENCE, something like this:
CREATE SEQUENCE id_seq
INCREMENT BY 1
START WITH 100
MINVALUE 10
MAXVALUE 1000
CYCLE
CACHE 2;
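Once the sequence exists you draw values from it when inserting. Note that, as far as I know, Databricks SQL has no CREATE SEQUENCE; a sketch for dialects that do support it (table and column names are placeholders):
-- SQL Server style: pull the next value as you insert
INSERT INTO my_table (col_a, col_b, col_c)
VALUES ('AA', 'AB', NEXT VALUE FOR id_seq);
-- PostgreSQL equivalent: nextval('id_seq')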

Ideally you can just do this to the table in question:
ALTER TABLE your_table ADD Col_C INT IDENTITY(1, 1) NOT NULL  -- or UNIQUEIDENTIFIER, if IDENTITY isn't supported
If for some reason you can't do that and you can only uniquely identify something in the view query, I would recommend looking at window functions. These are supported in a variety of SQL dialects including SQL Server, SQLite, and MySQL, and hopefully Databricks too, though I'm unfamiliar with that technology.
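For example, ROW_NUMBER is a window function that Databricks SQL does support; a sketch (untested there; the view/table names are placeholders, and the ORDER BY column should be whatever defines your row order):
CREATE OR REPLACE VIEW my_view AS
SELECT COL_A,
       COL_B,
       -- ROW_NUMBER starts at 1, so add 99 to make the column start at 100
       ROW_NUMBER() OVER (ORDER BY COL_B) + 99 AS COL_C
FROM my_table;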

Related

How to get only Integer values from a column

I'm using SQL in Databricks and I wish to create a view of a table. The data looks similar to this:
A | B | C
------------------------
AA | AB | 1
AA | AC | 1.5
AA | AD | 2
AA | AE | 3
And basically, what I want to do is read in the table such that only the rows where C has an integer value are read in, so that I get:
A | B | C
------------------------
AA | AB | 1
AA | AD | 2
AA | AE | 3
I've tried using code similar to this:
WHERE df.C NOT LIKE '.%[0-9$]%'
But this doesn't work, and similarly I tried this too:
Where IsNumeric(df.C) = 0x1
But IsNumeric doesn't seem to work in Databricks. If anyone has any ideas I'd greatly appreciate it, thank you!
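One approach that does work in Databricks SQL, assuming C is stored as a numeric type (my_table is a placeholder for the real name): a value is a whole number exactly when it equals its FLOOR.
SELECT A, B, C
FROM my_table
WHERE C = FLOOR(C);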

SQL - combining two rows of data into one with a common identifier

I am working on a project where I have to solve the following problem.
Goal:
If there are two rows that share the same identifier but different additional data, how can I combine all of that data into one row with individual columns?
Example:
Database:
| ID   | Rating | Rating Provider |
-----------------------------------
| 5055 | A+     | Moodys          |
| 5055 | Bb+    | SNP             |
Desired End Result:
| ID   | Moodys | SNP |
-----------------------
| 5055 | A+     | Bb+ |
I believe you simply need a PIVOT -
SELECT *
FROM YOUR_TABLE
PIVOT (MAX(Rating)
       FOR Rating_Provider IN ('Moodys' AS Moodys, 'SNP' AS SNP));
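If PIVOT isn't available in your SQL dialect, conditional aggregation is a portable alternative that produces the same result:
SELECT ID,
       MAX(CASE WHEN Rating_Provider = 'Moodys' THEN Rating END) AS Moodys,
       MAX(CASE WHEN Rating_Provider = 'SNP' THEN Rating END) AS SNP
FROM YOUR_TABLE
GROUP BY ID;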
Quantnesto, I believe that what you are looking for is a JOIN. You have the information in different rows, right?
You SELECT all the fields that you want, joining the table to itself on the shared ID:
SELECT a.ID,
       a.Rating AS Moodys,
       b.Rating AS SNP
FROM Database a
JOIN Database b ON a.ID = b.ID
WHERE a.Rating_Provider = 'Moodys'
  AND b.Rating_Provider = 'SNP';
And that's it.
There are different kinds of JOINs; for further information let me know, I can explain each type.

SQL Server: how to find values in different tables that have different suffixes

I'm struggling to find a value that might be in different tables, but using UNION is a pain as there are a lot of tables.
A separate table contains the suffixes of the TestTable_ tables:
| ID | Name|
| -------- | -----------|
| 1 | TestTable1 |
| 2 | TestTable2 |
| 3 | TestTable3 |
| 4 | TestTable4 |
TestTable1 content:
| id | Name    | q1              | a1             |
| -- | ------- | --------------- | -------------- |
| 1  | goose   | withFeather?    | featherID      |
| 2  | rooster | withoutFeather? | shinyfeatherID |
| 3  | rooster | age             | 20             |
TestTable2 content:
| id | Name             | q1              | a1             |
| -- | ---------------- | --------------- | -------------- |
| 1  | brazilian_goose  | withFeather?    | featherID      |
| 2  | annoying_rooster | withoutFeather? | shinyfeatherID |
| 3  | annoying_rooster | no_legs?        | dead           |
TestTable3 content:
| id | Name    | q1              | a1             |
| -- | ------- | --------------- | -------------- |
| 1  | goose   | withFeather?    | featherID      |
| 2  | rooster | withoutFeather? | shinyfeatherID |
| 3  | rooster | age             | 15             |
Common columns: q1 and a1
Is there a way to parse through all of them to look up a specific value without using UNION, given that some of them might have different columns?
Something like: check if "q1='age'" exists in all those tables (from 1 to 50)
Select q1,*
from (something)
where q1 exists in (TestTable_*)... or something like that.
If not possible, not a problem.
You could use dynamic SQL, but something I do in situations like this, where I have a list of tables that I want to quickly perform the same action on, is to paste the list of tables into a spreadsheet, type a query into a cell with a placeholder like #table, and then use the SUBSTITUTE function to replace the placeholder with each table name.
Alternatively, I just paste the list into SSMS and use SHIFT+ALT+ArrowKey to select the column and start typing.
So here is my list of tables
Then I use that key combo. As you can see my cursor has now selected all those rows.
Now I can start typing and all rows selected will get the input.
Then I just go to the other side of the table names and repeat the action
It's not a perfect solution, but it's a quick and dirty way of getting something repetitive done quickly.
If you want to find all the tables with that column name you can use the information schema:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME = 'q1';
Given the type of solution you are after I can offer a method that I've had to use on legacy systems.
You can query sys.columns for the name of the column(s) you need to find in N tables and join using object_id to sys.tables where type='U'. This will give you a list of table names.
From this list you can then build a working query for each table and, depending on your requirements (is this ad-hoc?), either just manually execute it yourself or build a procedure that will do it for you using sp_executesql.
E.g.
SELECT t.name AS table_name, c.name AS column_name
INTO #workingtable
FROM sys.columns c
JOIN sys.tables t ON t.object_id = c.object_id
WHERE c.name IN ('q1')  -- the column(s) you need to find
Pseudocode:
begin loop while rows exist in #workingtable
    select top 1 row from #workingtable
    set @sql = your query specific to that table and column(s)
    exec(@sql) / sp_executesql / try/catch as necessary
    delete that row from #workingtable
end loop
Hopefully that gives you at least some ideas for how you might implement your requirements.
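A fuller sketch of that loop (untested; uses the q1 = 'age' example from the question and the #workingtable built above):
DECLARE @table sysname, @sql nvarchar(max);

WHILE EXISTS (SELECT 1 FROM #workingtable)
BEGIN
    SELECT TOP 1 @table = table_name FROM #workingtable;
    -- Build and run the per-table query dynamically
    SET @sql = N'SELECT ''' + @table + N''' AS source_table, * FROM '
             + QUOTENAME(@table) + N' WHERE q1 = ''age'';';
    EXEC sp_executesql @sql;
    DELETE FROM #workingtable WHERE table_name = @table;
END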

Generate rows from input array

Let's assume I have a table with many records called comments, and each record includes only a text body:
CREATE TABLE comments(id INT NOT NULL, body TEXT NOT NULL, PRIMARY KEY(id));
INSERT INTO comments VALUES (generate_series(1,100), md5(random()::text));
Now, I have an input array with N substrings of arbitrary length. For example:
abc
xyzw
123456
not_found
For each input value, I want to return all rows that match a certain condition.
For example, given that the table includes the following records:
| id | body |
| -- | ----------- |
| 11 | abcd1234567 |
| 22 | unkown12 |
| 33 | abxyzw |
| 44 | 12345abc |
| 55 | found |
I need a query that returns the following result:
| substring | comments.id | comments.body |
| --------- | ----------- | ------------- |
| abc | 11 | abcd1234567 |
| abc | 44 | 12345abc |
| xyzw | 33 | abxyzw |
| 123456 | 11 | abcd1234567 |
So far, I have this SQL query:
SELECT substrings, comments.id, comments.body
FROM unnest(ARRAY[
'abc',
'xyzw',
'123456',
'not_found'
]) AS substrings
JOIN comments ON comments.id IN (
SELECT id
FROM comments as inner_comments
WHERE inner_comments.body LIKE ('%' || substrings || '%')
);
But the database client gets stuck for more than 10 minutes. Am I missing something about joins?
Please note that this is a simplified example of my problem. My current check on the comment is not a LIKE statement, but a complex switch-case statement of different functions (fuzzy matching).
The detour through IN is unnecessary and, unless the optimizer can rewrite it (which it likely cannot), adds overhead. Try whether it gets better without it.
SELECT un.substring,
comments.id,
comments.body
FROM unnest(ARRAY['abc',
'xyzw',
'123456',
'not_found']) un (substring)
INNER JOIN comments
ON comments.body LIKE ('%' || un.substring || '%');
But indexes still cannot be used here because of the wildcard at the beginning. You might want to look at Full Text Search and see what options it gives you to improve the situation.
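For instance, in PostgreSQL the pg_trgm extension can index unanchored LIKE patterns with a trigram GIN index; a sketch (the index name is mine):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Lets the planner use the index for body LIKE '%...%'
CREATE INDEX comments_body_trgm_idx ON comments USING gin (body gin_trgm_ops);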
Basically you are performing a FULLTEXT search on a column that most likely doesn't have a FULLTEXT index.
A first step would be to get a FULLTEXT index on your body column and then perform the search using CONTAINS. But, quite honestly, since you want to perform fuzzy matching, you cannot rely on SQL Server to perform the search; it would just not work properly. You will need an indexing service such as ElasticSearch, CloudSearch, Azure Search, etc.
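For illustration, a CONTAINS prefix search in SQL Server might look like this (a sketch; note that full-text matching works on word prefixes, not arbitrary substrings, so it is not a drop-in replacement for LIKE '%...%'):
SELECT id, body
FROM comments
WHERE CONTAINS(body, '"abc*"');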

Rolling id based on foreign key in a hierarchical schema

As an example, consider this hierarchical schema: company → customer_group → customer → workorder.
Assume all id fields are auto-incrementing primary keys and that foreign keys are named by the [parent_table_name]_id convention.
The problem
As soon as there are multiple companies in the database, then companies will share all primary key sequences between them.
For example, if there are two company rows, the customer_group table could look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
-------------------
But it should look like this
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
This behavior should also be exhibited for customer and any other table in the tree that directly or indirectly references company.
Note that I will most likely make a second id column (named something like relative_id) for this purpose, keeping the unique id column intact, as this is really mostly for display purposes and how users will reference these data entities.
Now if this was just one level of hierarchy, it would be a relatively simple solution.
I could make a table (table_name, company_id, current_id) and a trigger procedure that fires before insert on any of the tables, incrementing the current id by 1 and setting the row's relative_id to that value.
It's trivial when the company_id is right there in the insert query.
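Concretely, I imagine a counter table like this (names illustrative):
CREATE TABLE relative_id_counters (
    table_name text    NOT NULL,
    company_id integer NOT NULL,
    current_id integer NOT NULL DEFAULT 0,
    PRIMARY KEY (table_name, company_id)
);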
But how about the tables that don't reference company directly?
Like the lowest level of the hierarchy in this example, workorder, which only references customer.
Is there a clean, reusable solution to climb the ladder all the way from 'customer_id' to ultimately retrieve the parenting company_id?
Going recursively up the hierarchy with SELECTs on each INSERT doesn't sound too appealing to me, performance wise.
I also do not like the idea of just adding a foreign key to company for each of these tables, the schema would get increasingly uglier with each additional table.
These are the two solutions I can see, but I may not be looking in the right places.
The company shouldn't care what the primary key is if you're using generated keys. They're supposed to be meaningless; compared for equality and nothing else. I grumbled about this earlier, so I'm really glad to see you write:
Note that I will most likely make a second id column (named something
like relative_id) for this purpose, keeping the unique id column
intact, as this is really mostly for display purposes and how users
will reference these data entities.
You're doing it right.
Most of the time it doesn't matter what the ID is, so you can just give them whatever comes out of a sequence and not care about holes/gaps. If you're concerned about inter-company leakage (unlikely) you can obfuscate the IDs by using the sequence as an input to a pseudo-random generator. See the function Daniel Verité wrote in response to my question about this a few years ago, pseudo_encrypt.
There are often specific purposes for which you need perfectly sequential gapless IDs, like invoice numbers. For those you need to use a counter table and - yes - look up the company ID. Such ID generation is slow and has terrible concurrency anyway, so an additional SELECT with a JOIN or two on indexed keys won't hurt much. Don't go recursively up the schema with SELECTs though, just use a series of JOINs. For example, for an insert into workorder, your key generation trigger on workorder would be something like the following (untested):
CREATE OR REPLACE FUNCTION workorder_id_tgfn() RETURNS trigger AS $$
BEGIN
    IF tg_op = 'INSERT' THEN
        -- Get a new ID, locking the row so no other transaction can add a
        -- workorder until this one commits or rolls back.
        UPDATE workorder_ids
        SET next_workorder_id = next_workorder_id + 1
        WHERE company_id = (SELECT customer_group.company_id
                            FROM customer
                            INNER JOIN customer_group ON (customer.customer_group_id = customer_group.id)
                            INNER JOIN company ON (customer_group.company_id = company.id)
                            WHERE customer.id = NEW.customer_id)
        RETURNING next_workorder_id
        INTO NEW.id;
    END IF;
    -- A BEFORE trigger must return the (possibly modified) row
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
For the UPDATE ... RETURNING ... INTO syntax see Executing a Query with a Single-Row Result.
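For completeness, attaching the function would look something like this (untested; the trigger name is mine):
CREATE TRIGGER workorder_id_tg
BEFORE INSERT ON workorder
FOR EACH ROW EXECUTE PROCEDURE workorder_id_tgfn();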
There can be gaps in normal sequences even if there's no multi-company problem. Observe:
CREATE TABLE demo (id serial primary key, blah text);
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
BEGIN;
INSERT INTO demo(blah) values ('bb');
ROLLBACK;
BEGIN;
INSERT INTO demo(blah) values ('aa');
COMMIT;
SELECT * FROM demo;
Result:
regress=# SELECT * FROM demo;
 id | blah
----+------
  1 | aa
  3 | aa
"But it should look like this"
| id | company_id |
-------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------
I think it should not; I think you want a many-to-many relationship. The customer_group table:
| id | name |
-------------
| 1 | n1 |
| 2 | n2 |
| 3 | n3 |
-------------
And then the customer_group_company table:
| group_id | company_id |
-------------------------
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
-------------------------
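A DDL sketch of that link table (untested; assuming integer keys as elsewhere in the thread):
CREATE TABLE customer_group_company (
    group_id   integer NOT NULL REFERENCES customer_group(id),
    company_id integer NOT NULL REFERENCES company(id),
    PRIMARY KEY (group_id, company_id)
);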