SQL: Combine tables and order all rows / Avoid storing lots of null data

Typically, JOINs in SQL require you to join two tables ON a specific column, and then rows get merged. That isn't what I'm looking for.
Is it possible to combine tables in SQL in a way that lets you ORDER BY columns with the same name, such that x rows are returned, where x is the sum of the rows in table 1 and table 2?
To hopefully clarify what I mean, here's an example query:
SELECT * FROM (combined Real and Placeholder items)
ORDER BY StartDate, OnDayIndex
and here's what results might look like:
ID     OnDayIndex  StartDate   ItemType     Name         TemplatePointer
12308  2           1996-09-18  Real         Actual Name  Null
10309  11          1996-09-19  Placeholder  Null         123
30310  5           1996-09-20  Real         Actual Name  Null
30410  6           1996-09-20  Placeholder  Null         456
My use case is a calendar application with recurring events. To save space, it doesn't make sense to store every recurrence of an event. If it weren't for the particulars of my use case, I'd just store a template with a recurrence rule and generate the recurring events when viewed, except for one-off events. The problem is that the calendar app I'm working on lets you move items around within the day they're on, and it saves the order you put the items in. I'm already using the ranked-model gem (link here: https://github.com/mixonic/ranked-model) to cut down on the number of writes needed to update the "onDayIndex". The template approach on its own turns into a bit of a nightmare when "onDayIndex" is factored in... (I could say more...)
So I'd like to store slimmed-down 'Placeholder' items that hold an item's position and a pointer to its template, perhaps in a separate table if possible.
If this isn't possible, an alternative approach I've considered for conserving space is moving most columns from the Items table to an ItemData table, and storing an ItemDataID on Items.
But I'd really like to know whether it's possible (I'm pretty junior in SQL), as well as any other vital information I may be missing.
I'm using Rails with a Postgres database.

Are you talking about using UNION / UNION ALL to stack result sets on top of each other, but where the sources have different columns?
If so, you need to fill in the 'missing' columns (you can only UNION two sets if their signatures match, or can be coerced to match).
For example...
SELECT col1, col2, NULL as col3
FROM tbl1
UNION ALL
SELECT col1, NULL AS col2, col3
FROM tbl2
Note: UNION expends additional resources to remove duplicates from the results. Use UNION ALL if such effort is wasted.
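In the calendar example, that approach might look like the sketch below (real_items and placeholder_items are hypothetical table names; the column names are taken from the sample output in the question):
SELECT ID, OnDayIndex, StartDate, 'Real' AS ItemType, Name, NULL AS TemplatePointer
FROM real_items
UNION ALL
SELECT ID, OnDayIndex, StartDate, 'Placeholder', NULL, TemplatePointer
FROM placeholder_items
ORDER BY StartDate, OnDayIndex;
The ORDER BY applies to the combined result, so the query returns one interleaved, ordered set whose row count is the sum of the two tables' row counts, exactly the x rows asked about.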

Related

MS Access - trying to find duplicates across 4 tables based on Column1 and Column2

MS Access - trying to find duplicates across 4 tables based on the info in Column1 and Column2. I would also like the resulting query to show me Column3, Column4 and Column5 for easy review. I've tried following a YouTube vid on a union query and was successful, but that's as far as I can go. I tried to follow along with some of the answers but I can't make it work. Just note that I have 0 programming language knowledge. Tyvm in advance!
Column1 = Unique reference
Column2 = Loss date (DOL)
Duplicates happen when a row has the same unique ref and the same DOL. This can be within one table or across tables, e.g. one entry in Table2019 and another in Table2022, or two entries in Table2019 with four more spread across the other tables.
SELECT [t2019].ID, [t2019].[ClaimNo], [t2019].DOL, [t2019].[Amount], [t2019].[Cause], [t2019].[Ref], [t2019].[Regn], [t2019].Remarks
FROM [t2019]
UNION
SELECT [t2020].ID, [t2020].[ClaimNo], [t2020].DOL, [t2020].[Amount], [t2020].[Cause], [t2020].[Ref], [t2020].[Regn], [t2020].Remarks
FROM [t2020]
UNION
SELECT [t2021].ID, [t2021].[ClaimNo], [t2021].DOL, [t2021].[Amount], [t2021].[Cause], [t2021].[Ref], [t2021].[Regn], [t2021].Remarks
FROM [t2021]
UNION
SELECT [t2022].ID, [t2022].[ClaimNo], [t2022].DOL, [t2022].[Amount], [t2022].[Cause], [t2022].[Ref], [t2022].[Regn], [t2022].Remarks
FROM [t2022];
Access has a wizard to help write the relatively difficult SQL for finding duplicate records. So first gather up all the records that need to be searched for duplicates, then use the wizard.
To gather the records, open the query designer, switch to the SQL pane, and adapt the SQL below. Unfortunately, there is no graphical interface to help: get typing, and don't forget the semicolon. UNION is used to combine SELECT statements, so we're combining everything from all the tables. The ALL is important because by itself UNION skips rows where every column is an exact match to a previous row. We are looking for duplicates, so we add ALL to include those otherwise-skipped rows.
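Adapted to the four year tables from the question, the gathering query would look something like this (save it as Query1 so the wizard step below matches):
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2019]
UNION ALL
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2020]
UNION ALL
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2021]
UNION ALL
SELECT ID, ClaimNo, DOL, Amount, Cause, Ref, Regn, Remarks FROM [t2022];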
When you have all the rows, go to the Query Wizard under the Create tab and run the Find Duplicates wizard.
Here is the resulting SQL for my example data:
SELECT Query1.[ID], Query1.[DOL], Query1.[ClaimNo], Query1.[Amount], Query1.[Cause], Query1.[Ref], Query1.[Regn], Query1.[Remarks]
FROM Query1
WHERE (((Query1.[ID]) In (SELECT [ID] FROM [Query1] As Tmp GROUP BY [ID],[DOL] HAVING Count(*)>1 And [DOL] = [Query1].[DOL])))
ORDER BY Query1.[ID], Query1.[DOL]
Note:
In Access, ID is a primary key and an AutoNumber by default, which looks suspicious here. If the default settings are intact and you are entering data in Access, then every table starts with ID 1 and you have duplicate IDs in every table. Instead, I would normally combine all these year tables into one table with a year column, which also avoids the union query. I would only use year tables if I had millions of records and couldn't afford the space for a year column.
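For what it's worth, a sketch of that single-table alternative in Access DDL (names are hypothetical):
CREATE TABLE Claims (
    ID AUTOINCREMENT PRIMARY KEY,
    ClaimYear INTEGER,      -- replaces the split into t2019..t2022
    ClaimNo TEXT(50),
    DOL DATETIME,
    Amount CURRENCY,
    Cause TEXT(100),
    Ref TEXT(50),
    Regn TEXT(50),
    Remarks TEXT(255)
);
With everything in one table, the Find Duplicates wizard can be pointed at it directly, and the union query disappears.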

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily with:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select and the IDs don't always follow a pattern (which means I can't use something like BETWEEN x AND y)? Does this SELECT statement still work, or are there better ways of doing it?
The actual application is this: filters are specified by users and compared against attributes of the records, and from those filters we create the subset of the data that interests a particular user. There are about 30 million records, each with roughly ~3000 attributes (stored across roughly 30 tables, but every table has ID as a primary key), so every time someone queries their desired subset of records we have to join many tables, apply the filters, and work out what the subset looks like. To avoid joining many tables every time, I thought it might be better to join the tables once, figure out the IDs of the selected subset, and then, each time a new query is made, just select the relevant columns of the rows matching the filtered IDs.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions. And that could start to get into limits on the length of the query -- either specified by the database, the tool you are using, or intermediate libraries.
If you have that many IDs, I would strongly recommend that you load them into a table in the database with the ID as the primary key, then use JOIN or EXISTS to identify the matching rows in your table.
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.
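A minimal Postgres-flavored sketch of the staging approach (wanted_ids is a hypothetical name; use \copy from psql, or bulk INSERTs from the application, if the file lives client-side):
-- stage the IDs once; the primary key doubles as the index
CREATE TEMP TABLE wanted_ids (id bigint PRIMARY KEY);
COPY wanted_ids (id) FROM '/tmp/ids.txt';
-- then join instead of a multi-million-value IN list
SELECT m.col1, m.col2, m.col3
FROM mytable m
JOIN wanted_ids w ON w.id = m.id;
The join lets the planner use the indexes on both sides instead of parsing and probing an enormous literal list.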

How to get data based on two columns from same table in SQL

I wanted to retrieve some data from a table based on two columns; see the table structure below.
Update
I want the output data based on two conditions:
1. The code value is 'Web' or 'Offline'.
2. The Memo column has the same data as the Pre_memo column.
The output should be as shown below.
So far I have got the output by using the same table twice, but I want to get the result using the table only once, to avoid performance issues, as this table holds a huge amount of data.
select distinct OrderTable.Memo,
       max(OrderTable.Memo_Date) as Date1,
       max(ot.Pre_Memo_Date) as Date2
from OrderTable,
     OrderTable ot
where OrderTable.code in ('Web')
  and ot.code in ('Offline')
  and OrderTable.Memo = ot.Pre_Memo
group by OrderTable.Memo
Can anyone help with this? That is, with using OrderTable only once in the query and filtering on the Memo and Pre_memo columns, since they hold the same data?
You can use union all and then do conditional aggregation:
select Memo,
       max(case when code = 'Offline' then Date end) as Memo_date,
       max(case when code = 'Web' then Date end) as Pre_Memo_date
from (select Date, 'Web' as code, Pre_memo as Memo
      from OrderTable o
      where code = 'Web'
      union all
      select Date, 'Offline', Memo
      from OrderTable o
      where code = 'Offline'
     ) t
group by Memo;
"I wanted to retrieve some data from a table based on two columns see the below table structure"
Providing a sample is useful to illustrate the problem (and it is desirable to do so on SO), but it is not sufficient, and thus not a replacement for actually defining the problem, which you have failed to do.
Absent such a definition of the problem, we can only guess what you're trying to achieve. E.g.:
from the subset of tuples that have 'Offline' for the 'code' value, take the MAX() 'Date' value per appearing value of 'Memo';
match that (using some matching condition) to the subset of tuples that have 'Web' for the 'code' value, and retain the 'Date' value from those as 'Memo_date' in the result set;
the matching condition being that the 'Memo' value of [a tuple in] the former is equal to the 'Pre_memo' value of [the matching tuple in] the latter.
If all that is correct, then that explains why it is impossible to do this in SQL without at least two references to the table. You cannot avoid doing some kind of matching, and matching by definition takes two distinct things to match (even if the two distinct things are distinct subsets of one and the same thing). In fact, it is almost certainly a fundamental design mistake to have those two distinct things in one single table, probably under the totally misguided belief that "having everything in one table makes things easier".
"So far i got the output by using same table two times but i wanted to get the output result by using the table only 1 time to avoid performance related issues as this table is having huge data"
From the way you have presented the question, I suspect that you were hoping for some means to exploit the fact that those 'Offline' tuples are "the next" after a 'Web' tuple, and that you could write the SQL in such a way that the engine could then derive a sort of "single pass" algorithm (which you probably assume would go faster).
It does not work like that. SQL tables have no inherent ordering and as a consequence there simply ain't no such thing as "the next" in a table.

PostgreSQL query - finding presence of id in any field of a record

I have two tables which look like the following
tools:
id | part name
---------------
0 | hammer
1 | sickle
2 | axe
people:
personID | ownedTool1 | ownedTool2 | ownedTool3 ..... ownedTool20
------------------------------------------------------------------
0 | 2 | 1 | 3 ... ... 0
I'm trying to find out how many people own a particular tool. A person cannot own multiple copies of the same tool.
The only way I can think of doing this is something like
SELECT COUNT(*)
FROM tools JOIN people ON tools.id = people.ownedTool1 OR tools.id = people.ownedTool2 ... and so on
WHERE tools.id = 0
to get the number of people who own hammers. I believe this will work, however, this involves having 20 OR statements in the query. Surely there is a more appropriate way to form such a query and I'm interested to learn how to do this.
You shouldn't have 20 columns each possibly containing an ID in the first place. You should properly establish a normalized schema. If a tool can belong to only one user - but a user can have multiple tools, you should establish a One to Many relationship. Each tool will have a user id in its row that maps back to the user it belongs to. If a tool can belong to one or more users you will need to establish a Many to Many relationship. This will require an intermediate table that contains rows of user_id to tool_id mappings. Having a schema set up appropriately like that will make the query you're looking to perform trivial.
In your particular case it seems like a user can have many tools and a tool can be "shared" by many users. For your many-to-many relation all you would have to do is count the number of rows in that intermediate table having your desired tool_id.
Something like this:
SELECT COUNT(ID) FROM UserTools WHERE ToolID = #desired_tool_id
Googling the terms above ('one to many' relationship, 'many to many' relationship) should get you pointed in the correct direction. If you're stuck with that schema, then the way you pointed out is the only way to do it.
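To make that concrete, a sketch of the normalized layout (names are hypothetical):
CREATE TABLE tools (
    id        integer PRIMARY KEY,
    part_name text NOT NULL
);
CREATE TABLE people (
    person_id integer PRIMARY KEY
);
-- the intermediate table: one row per (person, tool) pairing
CREATE TABLE user_tools (
    person_id integer REFERENCES people,
    tool_id   integer REFERENCES tools,
    PRIMARY KEY (person_id, tool_id)  -- a person cannot own the same tool twice
);
-- how many people own tool 0 (the hammer)?
SELECT count(*) FROM user_tools WHERE tool_id = 0;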
If you cannot change the model (and I'm sure you will tell us that), then the only sensible way to work around this broken datamodel is to create a view that will give you a normalized view (pun intended) on the data:
create view normalized_people
as
select personid,
       ownedTool1 as toolid
from people
union all
select personid,
       ownedTool2 as toolid
from people
union all
select personid,
       ownedTool3 as toolid
from people
... you get the picture ...
Then your query is as simple as
select count(personid)
from normalized_people
where toolid = 0;
You received your (warranted) lectures about the database design.
As to your question, there is a simple way:
SELECT count(*) AS person_ct
FROM tbl t
WHERE translate((t)::text, '()', ',,')
~~ ('%,' || #desired_tool_id::text || ',%')
Or, if the first column is person_id and you want to exclude that one from the search:
SELECT count(*) AS person_ct
FROM tbl t
WHERE replace((t)::text, ')', ',')
~~ ('%,' || #desired_tool_id::text || ',%')
Explanation
Every table is accompanied by a matching composite type in PostgreSQL. So you can query any table this way:
SELECT (tbl) FROM tbl;
Yields one column per row, holding the whole row.
PostgreSQL can cast such a row type to text in one fell swoop: (tbl)::text
I replace both parens () with commas , so every value of the row is delimited by commas ,.
My second query does not translate the opening parenthesis, so the first column (person_id) is excluded from the search.
Now I can search all columns with a simple LIKE (~~) expression using the desired number delimited by commas: ~~ '%,17,%'
Voilà: all done with one simple command. This is reliable as long as you don't have columns like text or int[] in your table that could also hold ,17, within their values, or additional columns with numbers, which could lead to false positives.
It won't deliver performance wonders as it cannot use standard indexes. (You could create a GiST or GIN index on an expression using the pg_trgm module in pg 9.1, but that's another story.)
Anyway, if you want to optimize, you'd better start by normalizing your table layout as has been suggested.

How not to display columns which are NULL in a view

I've set up a view which combines all the data across several tables. Is there a way to write this so that only columns which contain non-null data are displayed, and those columns which contain all NULL values are not included?
ADDED:
Sorry, still studying and working on my first big project, so every day seems to be a new experience at the minute. I haven't been very clear, and that's partly because I'm not sure I'm going about things the right way! The client is an academic library, and the database records details of specific collections. The view I mentioned is to display all the data held about an item, so it brings together tables on publication, copy, author, publisher, language and so on. A small number of items in the collection are papers, so they have additional details over and above the standard bibliographic details. What I didn't want was for a user to get all the empty fields relating to papers when what was returned only consisted of books, in which case the paper-table fields would all be null. So I thought perhaps there would be a way to not show these. Someone has commented that this is the job of the client application rather than the database itself, so I can leave this until I get to that phase of the project.
There is no way to do this in SQL.
CREATE VIEW dbo.YourView
AS
SELECT (list of fields)
FROM dbo.Table1 t1
INNER JOIN dbo.Table2 t2 ON t1.ID = t2.FK_ID
WHERE t1.SomeColumn IS NOT NULL
AND t2.SomeOtherColumn IS NOT NULL
In your view definition, you can include WHERE conditions which can exclude rows that have certain columns that are NULL.
Update: you cannot really filter out columns. You define the list of columns that are part of your view in the view definition, and that list is fixed and cannot be changed dynamically.
What you might be able to do is use an ISNULL(column, '') construct to replace those NULLs with an empty string. Otherwise you need to handle excluding those columns in your display front end, not in the SQL view definition.
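For instance, a small sketch assuming a nullable PaperTitle column in the view's underlying tables:
SELECT t1.ID,
       ISNULL(t1.PaperTitle, '') AS PaperTitle   -- COALESCE(t1.PaperTitle, '') is the portable equivalent
FROM dbo.Table1 t1
The column is still part of the view either way; it just shows an empty string instead of NULL.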
The only thing I see you could do is make sure to select only those columns from the view that you know aren't NULL:
SELECT (list of non-null fields) FROM dbo.YourView
WHERE (column1 IS NOT NULL)
and so forth - but there's no simple or magic way to select all columns that aren't NULL in one SELECT statement...
You cannot do this in a view, but you can do it fairly easily using dynamic SQL in a stored procedure.
Of course, having a schema which shifts is not necessarily good for clients who consume the data, but it can be efficient if you have very sparse data AND the consuming client understands the varying schema.
If you have to have a view, you can put a "header" row in your view, which you can inspect client-side on the first row of your loop to decide whether to bother with the column in your grid (or whatever). You can do something like this:
SELECT * FROM (
-- This is the view code
SELECT 'data' as typ
,int_col
,varchar_col
FROM TABLE
UNION ALL
SELECT 'hdr' as typ
-- note that different types have to be handled differently
,CASE WHEN COUNT(int_col) = 0 THEN NULL ELSE 0 END
,CASE WHEN COUNT(varchar_col) = 0 THEN NULL ELSE '' END
FROM TABLE
) AS X
-- have to get header row first
ORDER BY typ DESC -- add other sort criteria here
If we're reading your question right, there won't be a way to do this in SQL. The output of a view must be a relation - in (over-)simplified terms, it must be rectangular. That is, each row must have the same number of columns.
If you can tell us more about your data and give us some idea of what you want to do with the output, we can perhaps offer more positive suggestions.
In general, add a WHERE clause to your query, e.g.
WHERE a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL
Here, a, b and c are your column names.
If you are joining tables together on potentially NULL columns, then use an INNER JOIN, and NULL values will not be included.
EDIT: I may have misunderstood - the above filters out rows, but you may be asking to filter out columns, e.g. you have several columns and you only want to display those that contain at least one non-NULL value across all the rows you are returning. Using dynamic SQL offers a solution, since the set of columns varies depending upon your data.
Here's a SQL query that builds another SQL query containing the appropriate columns. You could run this query, and then submit its result as another query. It assumes 'pk' is some column that is always non-null, e.g. a primary key; this means we can prefix each additional column name with a comma.
SELECT CONCAT("SELECT pk"
CASE (count(columnA)) WHEN 0 THEN '' ELSE ',columnA' END,
CASE (count(columnB)) WHEN 0 THEN '' ELSE ',columnB' END,
// etc..
' FROM (YourQuery) base')
FROM
(YourQuery) As base
The query works because of how COUNT(column) behaves: the aggregate function ignores NULL values, and so returns 0 for a column consisting entirely of NULLs. The query builder assumes that YourQuery uses aliases to ensure there are no duplicate column names.
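For example, if columnB turned out to be entirely NULL in YourQuery's result, the builder would emit something like:
SELECT pk,columnA FROM (YourQuery) base
and you would then run that text as the second-step query.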
While you can't put this into a view, you could wrap it up as a stored procedure that copies the data to another table - the result table. You may also set up a trigger so that the result table is updated whenever the base tables change.
I suspect what's going on is that an end user is running Crystal Reports and complaining about all the empty columns that have to be removed manually.
It would actually be possible to create a stored procedure that would create a view on the fly, leaving out dataless columns. But then you would have to run this proc before using the view.
Is that acceptable?