SQL: Taking one column from two tables and putting them into one predefined table

Just a little bug off my shoulder; for what I'm using this code for, it is not the end of the world if this one doesn't get answered. To preface, a few things: I know this is entirely improper, I know this should never be used -- let alone done -- in a production environment, and I know that the root of this operation is totally unconventional, but I'm asking anyway:
If I have two tables with a set of values that I am looking to grab and put into one other, combined and predefined table, side by side, how might I do that?
Right now, I have two statements doing
INSERT INTO table (leftCol) SELECT NAME FROM smolT1 ORDER BY num DESC LIMIT 3
INSERT INTO table (rightCol) SELECT NAME FROM smolT2 ORDER BY num DESC LIMIT 3
but, as one would imagine, that query ends up with something like...
leftCol | rightCol
Jack |
James |
John |
| Jill
| Justina
| Jesebelle
and of course, it would be much preferred if the left and right columns lined up, though for the sake of gathering just those six records, I suppose it is not too big a concern.
To add on: yes, these two tables do have a NAME in common, but with how I am querying them, they are totally irrelevant to one another and should not be associated with one another, just displayed side by side.
I am simply curious whether one query could get these two unrelated queries to work together and print neatly into a form, or if I just have to live with the data looking like this.
Cheers!

The most recent versions of SQLite support window functions. This allows you to do:
select min(name1) as name1, min(name2) as name2
from (select name as name1, null as name2,
             row_number() over (order by name) as seqnum
      from smolt1
      where name is not null
      union all
      select null, name,
             row_number() over (order by name) as seqnum
      from smolt2
      where name is not null
     ) lr
group by seqnum;
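To feed that back into the question's original goal of filling the predefined two-column table, the same pairing idea can drive the INSERT directly. A minimal sketch, assuming SQLite 3.25+ and a hypothetical target table named combined with columns leftCol and rightCol (names taken from the question's INSERT statements), keeping the original top-three-by-num intent:
-- Sketch only: "combined", leftCol and rightCol are illustrative names;
-- smolT1 and smolT2 are assumed to have (name, num) columns.
INSERT INTO combined (leftCol, rightCol)
SELECT MIN(name1), MIN(name2)
FROM (
    -- number the top 3 rows (by num) from each table 1..3, then pair by position
    SELECT name AS name1, NULL AS name2,
           ROW_NUMBER() OVER (ORDER BY num DESC) AS seqnum
    FROM (SELECT name, num FROM smolT1 ORDER BY num DESC LIMIT 3)
    UNION ALL
    SELECT NULL, name,
           ROW_NUMBER() OVER (ORDER BY num DESC) AS seqnum
    FROM (SELECT name, num FROM smolT2 ORDER BY num DESC LIMIT 3)
) lr
GROUP BY seqnum;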

Related

How to join 2 tables without common fields?

There are 2 tables:
Table 1: first_names
id | first_name
1 | Joey
7 | Ross
17| Chandler
Table 2: last_names
id | last_name
2 | Tribbiani
7 | Geller
25| Bing
Desired result:
id | full_name
1 | Joey Tribbiani
2 | Ross Geller
3 | Chandler Bing
Task:
Write the solution using only the simplest SQL syntax. Using stored procedures, declaring variables, or the ROW_NUMBER() and RANK() functions is forbidden.
I have a solution using the ROW_NUMBER() function, but no ideas about solving this task using only the simplest SQL syntax.
P.S. I'm only a trainee and it's my first question on Stack Overflow.
A simple join will suffice here:
select * from first_names fn
join last_names ln on fn.id = ln.id - 1
But your question is very unclear, because the join here is based on knowledge of the Friends series rather than on any concrete logic...
You must create an id to join the tables.
This can be the order number in the table, based on the ids:
select f.counter id,
       concat(f.first_name, ' ', l.last_name) full_name
from (
    select t.*, (select count(*) from first_names where id < t.id) + 1 counter
    from first_names t
) f
inner join (
    select t.*, (select count(*) from last_names where id < t.id) + 1 counter
    from last_names t
) l on l.counter = f.counter
Results:
id | full_name
---+---------------
 1 | Joey Tribbiani
 2 | Ross Geller
 3 | Chandler Bing
Honestly, this is a stupid solution; it's vastly less efficient than ROW_NUMBER, and I wouldn't be surprised if LEAD is "not allowed" given that ROW_NUMBER isn't. The fact that you were told to "use the simplest SQL" means that the SQL you want to use is a subquery/CTE and ROW_NUMBER; that is as simple as this can really go. Anything else adds a layer of unneeded complexity and will likely just make the query suffer from performance degradation. This one, for example, means you need to scan both tables twice, whereas with ROW_NUMBER it would be once.
CREATE TABLE FirstNames (id int, FirstName varchar(10));
CREATE TABLE LastNames (id int, LastName varchar(10));
INSERT INTO FirstNames
VALUES (1,'Joey'),
       (7,'Ross'),
       (17,'Chandler');
INSERT INTO LastNames
VALUES (2,'Tribbiani'),
       (7,'Geller'),
       (25,'Bing');
GO
WITH CTE AS(
    SELECT FN.id,
           FN.FirstName,
           LN.LastName
    FROM FirstNames FN
         LEFT JOIN LastNames LN ON FN.id = LN.id
    UNION ALL
    SELECT LN.id,
           FN.FirstName,
           LN.LastName
    FROM LastNames LN
         LEFT JOIN FirstNames FN ON LN.id = FN.id
    WHERE FN.id IS NULL),
FullNames AS(
    SELECT C.id,
           C.FirstName,
           ISNULL(C.LastName, LEAD(C.LastName) OVER (ORDER BY id)) AS LastName
    FROM CTE C)
SELECT *
FROM FullNames FN
WHERE FN.FirstName IS NOT NULL
ORDER BY FN.id;
GO
DROP TABLE FirstNames;
DROP TABLE LastNames;
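For comparison, the ROW_NUMBER() approach mentioned above (disallowed by the task, but the simpler route) might look like this sketch, which pairs the Nth first name with the Nth last name by id order:
-- Sketch only: the ROW_NUMBER() version the task forbids, for comparison.
WITH FN AS (
    SELECT FirstName, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM FirstNames
),
LN AS (
    SELECT LastName, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM LastNames
)
SELECT FN.rn AS id,
       FN.FirstName + ' ' + LN.LastName AS full_name
FROM FN
JOIN LN ON LN.rn = FN.rn
ORDER BY FN.rn;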
To answer the "Task" given:
"Task: Write the solution using only the simplest SQL syntax. Using store procedures, declaring variables, ROW_NUMBER(), RANK() functions are forbidden."
My answer would be the below?
"Why is this a requirement? SQL Server has supported ROW_NUMBER for 14 years, since SQL Server 2005. If you can't use ROW_NUMBER this infers you're using SQL Server 2000. This is actually a big security problem for the company, as 2000 has been out of support for close to a decade. Legislation like GDPR require a company to keep the technology they use secure, and it is very unlikely that this is therefore being met.
If this is the case, the solution if not the find a way around using ROW_NUMBER but to get the company back up to do date. The latest version of SQL Server that you can upgrade to from SQL Server 2000 is 2008; which also runs out of support on July 16 of this year. We'll need to get an instance up and running and get the existing features into this new server ASAP and get QA testing done as soon as possible. This needs to be the highest priority thing. After that we need to repeat the cycle to another version of SQL Server. The latest is 2017, which does support migration from 2008.
Once we've done that, we can then actually make use of ROW_NUMBER in the query; providing the simplest solution and also bringing the company back into a secure environment."
Sometimes requirements need to be challenged. From experience management can make some "stupid" requirements, because they don't understand the technology. When you're in an IT role, sometimes you will need to question those requirements and explain why the requirement isn't actually a good idea. Then, instead, you can aid Management to find the correct solution for the problem. At the end of the day, what they might be trying to fix could be an XY problem; and part of your troubleshooting will be to find out what X really is.

Node / Postgres SQL Select distinct entries then put all other entries with the same reference into one column

This question was probably asked somewhere, but I can't seem to phrase it correctly in the search to find an accurate answer.
I'm doing a query on a Postgres DB, it has quite a few joins, the results are something like this:
WON | name  | item
----+-------+-----
  1 | Joe   | A
  1 | Joe   | B
  2 | Smith | A
So one row for each entry, I need to somehow get the result back as such:
WON | name  | item
----+-------+------
  1 | Joe   | A, B
  2 | Smith | A
This can be done in the query or with NodeJS. There are hundreds to thousands of results for the query, so getting a distinct row (WON 1), then searching the DB for all entries that match it, then repeating for the rest isn't feasible; it may be better done in Node / JavaScript, but I'm somewhat new to that. What would be a (somewhat) efficient way to do this?
If there IS a way to do this in the query itself then that would be my preference though.
Thanks
A SQL approach:
SELECT won, name
,STRING_AGG(item, ',' ORDER BY item) AS items
FROM myTable
GROUP BY won, name
ORDER BY won, name
You can use GROUP BY and string_agg to concatenate rows, something like this:
Create table:
CREATE TABLE test
(
won int,
name character varying(255),
item character varying(255)
);
insert into test (won, name, item) values (1,'Joe', 'A'),(1, 'Joe', 'B'),(2, 'Smith', 'A')
And do this in the query:
select won, name, string_agg(item, ',') from test group by won, name order by won
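Run against the sample rows inserted above, that query should return something like the following (string_agg without an ORDER BY leaves the order of items within a group unspecified):
won | name  | string_agg
----+-------+-----------
  1 | Joe   | A,B
  2 | Smith | A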

Best practice for setup and querying versioned records in T-SQL

I'm trying to optimize my SQL queries and I always come back to this one issue and I was hoping to get some insight into how I could best optimize this.
For brevity, let's say I have a simple employee table:
tbl_employees
Id HiredDateTime
------------------
1 ...
2 ...
That has versioned information in another table for each employee:
tbl_employees_versioned
Id Version Name HourlyWage
-------------------------------
1 1 Bob 10
1 2 Bob 20
1 3 Bob 30
2 1 Dan 10
2 2 Dan 20
And this is how the latest version records are retrieved in a View:
Select tbl_employees.Id, tbl_employees_versioned.Name, tbl_employees_versioned.HourlyWage, tbl_employees_versioned.Version
From tbl_employees
Inner Join tbl_employees_versioned
    ON tbl_employees.Id = tbl_employees_versioned.Id
CROSS APPLY
    (SELECT Id, Max(Version) AS Version
     FROM tbl_employees_versioned AS all_versions
     WHERE all_versions.Id = tbl_employees_versioned.Id
     GROUP BY Id) AS employees_LatestVersion
WHERE tbl_employees_versioned.Version = employees_LatestVersion.Version
To get a response like this:
Id Version Name HourlyWage
-------------------------------
1 3 Bob 30
2 2 Dan 20
When pulling a query that returns over 500 employee records, each of which has a few versions, this query starts choking up and takes a few seconds to run.
There are a couple strikes right off the bat, but I'm not sure how to overcome them.
Obviously the Cross Apply adds some performance loss. Is there a best practice when dealing with versioned information like this? Is there a better way to get just a record with the highest version?
The versioned table doesn't have a clustered index because neither Id nor Version is unique on its own. Concatenated together they would be, but it doesn't work like that. Instead there is a non-clustered index for Id and another one for Version. Is there a better way to index this table to get any performance gain? Would an indexed view really help here?
I think the best way to structure the data is using start dates and end dates. So, the data structure for your original table would look like:
create table tbl_EmployeesHistory (
EmployeeHistoryId int,
EffDate date not null,
EndDate date,
-- Fields that describe the employee during this time
)
Then, you can see the current version using a view:
create view vw_Employees as
select *
from tbl_EmployeesHistory
where EndDate is NULL
In some cases, where future end dates are allowed, the where clause would be:
where coalesce(EndDate, getdate()) >= getdate()
Alternatively, in this case, you can default EndDate to some future date far, far away such as '01-01-9999'. You would add this as the default in the create table statement, make the column not null, and then you can always use the statement:
where getdate() between EffDate and EndDate
As Martin points out in his comment, the coalesce() might impede the use of an index (it does in SQL Server), whereas this does not have that problem.
This is called a slowly changing dimension. Ralph Kimball discusses this concept in some length in his books on data warehousing.
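A sketch of what an "update" looks like under this scheme (T-SQL; it assumes the tbl_EmployeesHistory table above, with EmployeeHistoryId identifying the employee and an illustrative HourlyWage column standing in for the descriptive fields):
-- Sketch only: an update closes the current row and inserts a new one,
-- so no history is lost. HourlyWage and the sample values are illustrative.
DECLARE @EmployeeHistoryId int = 1;
DECLARE @NewHourlyWage decimal(10,2) = 30;
BEGIN TRANSACTION;
UPDATE tbl_EmployeesHistory
SET EndDate = GETDATE()
WHERE EmployeeHistoryId = @EmployeeHistoryId
  AND EndDate IS NULL;
INSERT INTO tbl_EmployeesHistory (EmployeeHistoryId, EffDate, EndDate, HourlyWage)
VALUES (@EmployeeHistoryId, GETDATE(), NULL, @NewHourlyWage);
COMMIT TRANSACTION;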
Here's one way you can get a view of the most recent version for each employee:
Select Id, Name, HourlyWage, Version
FROM (
Select E.Id, V.Name, V.HourlyWage, V.Version,
row_number() OVER (PARTITION BY V.ID ORDER BY V.Version DESC) as nRow
From tbl_employees E
Inner Join tbl_employees_versioned V ON E.Id = V.Id
) A
WHERE A.nRow = 1
I suspect that this will perform better than your previous solution. One index across Id and Version in tbl_employees_versioned would most likely also help.
Also, note that you only need to join on tbl_employees if you're selecting fields that are not in tbl_employees_versioned.
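The composite index mentioned above could be created with something like this (a sketch; the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_employees_versioned_Id_Version
    ON tbl_employees_versioned (Id, Version DESC);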

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name | dob |
+----+----------+------------+
| 1 | Alice | 01/12/1989 |
| 2 | Bob | 04/06/1990 |
| 3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table were the size indicated this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan in every query.
Create a second table, say student_archive and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever you have to use union or an extra left outer join.
For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "less DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4 with some modifications. There's no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or before deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with an extra join / union. Also, you can define a view, so all queries will be straightforward from the end user's perspective.
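A sketch of that row-level trigger in Oracle syntax (assuming a student_archive table with the same columns as students, plus an illustrative archived_at column recording when the old row was copied):
-- Sketch only: copy the old row to the archive before it is changed or removed.
CREATE OR REPLACE TRIGGER trg_students_archive
BEFORE UPDATE OR DELETE ON students
FOR EACH ROW
BEGIN
    INSERT INTO student_archive (id, name, dob, archived_at)
    VALUES (:old.id, :old.name, :old.dob, SYSDATE);
END;
/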

Counting Distinct Values in large dataset (40M rows): SELECT count(*) as count, name FROM names GROUP BY name ORDER BY name;

CREATE TABLE `names` ( `name` varchar(20) );
Assume the names table contains all 40 million first names of everyone living in California (for example).
SELECT count(*) as count, name FROM names GROUP BY name ORDER BY name;
How can I optimize this query?
Expected Result:
count | name
9999 | joe
9995 | mike
9990 | kate
.... | ....
2 | kal-el
You have to create an index on the name column of your table. The query is as good as it can be.
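If that index doesn't exist yet, it could be added with something like this (MySQL syntax; the index name is illustrative):
-- An index on name lets the GROUP BY / ORDER BY walk the index in order
-- instead of sorting all 40 million rows.
ALTER TABLE `names` ADD INDEX idx_names_name (`name`);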
Well, what makes you think it's not already optimised? This looks like the sort of query a good database engine should be able to handle relatively easily - particularly if you've got an appropriate index on your table.
Do you actually have a bottleneck here, or are you worrying about something that might happen in the future? If it's the latter, I suggest you try it with your RDBMS (by generating dummy data), and see what happens.