How can I create a temporary numbers table with SQL? - sql

So I came upon a question where someone asked for a list of unused account numbers. The query I wrote for it works, but it is kind of hacky and relies on the existence of a table with more records than existing accounts:
WITH tmp
AS (SELECT Row_number()
OVER(
ORDER BY cusno) a
FROM custtable
fetch first 999999 rows only)
SELECT tmp.a
FROM tmp
WHERE a NOT IN (SELECT cusno
FROM custtable)
This works because customer numbers are reused and there are significantly more records than unique customer numbers. But, like I said, it feels hacky and I'd like to just generate a temporary table with 1 column and x records that are numbered 1 through x. I looked at some recursive solutions, but all of it looked way more involved than the solution I wound up using. Is there an easier way that doesn't rely on existing tables?

I think the simple answer is no. To be able to make a determination of absence, the platform needs to know the expected data set. You can either generate that as a temporary table or data set at runtime - using the method you've used (or a variation thereof) - or you can create a reference table once, and compare against it each time. I'd favour the latter - a table with a single column of integers won't put much of a dent in your disk space and it doesn't make sense to compute an identical result set over and over again.
Here's a really good article from Aaron Bertrand that deals with this very issue:
https://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
(Edit: The queries in that article are TSQL specific, but they should be easily adaptable to DB2 - and the underlying analysis is relevant regardless of platform)

If you search all unused account number you can do it :
with MaxNumber as
(
select max(cusno) MaxID from custtable
),
RecurceNumber (id) as
(
values 1
union all
select id + 1 from RecurceNumber cross join MaxNumber
where id<=MaxID
)
select f1.* from RecurceNumber f1 exception join custtable f2 on f1.id=f2.cusno

Related

Any resources for this SQL filtering?

I have 100 tables each of size of order of few tenths of GB. The schema of each table is the following:
A: string | B: string | C: string
In each table I would like to retain only the rows for which the (B, C) appears at least 10 times in a concatenation of all 100 tables. Is there any efficient way to achieve this?
A very vague question, excluding your DBMS as well isn't helpful as SQL comes in different forms.
But first, you would have to join all of the tables together - there may be a faster way of doing this, but without knowing which flavor of SQL you are using it is hard to tell.
Something like this will work:
SELECT * FROM table_1
UNION
SELECT * FROM table_2
...
UNION
SELECT * FROM table_100
Once you have all of the data you do something like this:
WITH tables_with_counts as (SELECT
A,
B,
C,
COUNT(1) OVER(PARTITION BY(B, C)) AS bc_count
FROM
aggragated_tables)
SELECT
A,
B,
C
FROM
tables_with_counts
WHERE
bc_count >= 10
Here is my take:
Step 1 : Aggregate all tables into one. It would be bulky but if you are using Oracle database, I think it shouldn't be an issue.
Step 2: Create md5 checksum hash values for B,C columns like below :
SELECT APEX_ITEM.MD5_CHECKSUM(B,C) md5_cks,
A,B,C
FROM aggregated_tables
Step 3: take count based on checksum values and retain the rows where count > 10
Step 4: Get rid of duplicate data using rank() or dense rank() in delete statement.
The short answer, which I'm sure that you don't want to hear, is "no." In the context of relational databases there is no efficient query to merge 100 tables.
It is not all bad news though. If it were just one table (let's say it was named "combined" just to have concrete examples) you could use an elegant SQL using windowed functions
select A,B,C from (select A,B,C,count(1) over (partition by B,C) as counts from combined)counted where counts>=10
Option 1. So the question is how to get a "combined" table so that the snippet above works. If we stick with ANSI (standard) sql, you could use UNION ALL, which and collect it into a WITH clause to keep things neat.
Here is an example:
with
combined as (
select * from table_1
union all
select * from table_2),
counted as (
select
A,B,C,
count(1) over (partition by B,C) as counts
from
combined)
select A,B,C from counted where counts>=10;
I only included 2 tables, but the real query would extend that up to table_100. Thats a lot of typing and not very efficient with the programmer's time. Also unions and union all's are notoriously poor performing for databases, so this is not efficient in terms of system resources or time, either. Personally I would not do it this way, but it is an answer.
Option 2 There are other options which do not exactly match your question, but may be helpful to know. Any time you are tempted to create multiple tables with exactly the same schema, you will be better off creating a single table with multiple partitions. see MySQL, Postgres, Sql Server, Oracle, Hive. Every database platform has its own syntax for partitioning tables but they are all similar. For this table, each of the original tables becomes a single partition in the table, and the table name would be a really good candidate for the string value in the partition identifier (partition column)
If you are able to stuff all of your 100 tables into 100 partitions of one table then you can run the first query after all. The advantage is that the database can optimize that query because all modern databases are optimized to manage partitioned queries.
In addition, adding a partition to a table is really no more trouble than creating a new table instead, but supporting and maintaining one table is a lot less trouble than 100 tables.
A third option, since you tagged "big data" is to use a big data engine like Spark with SparkSQL. This would be objectively best because you can actually load a dataframe with 100 combined tables very efficiently with spark, and the SQL after that is not much different from the relational database sql we have been considering. That's kind of out of scope here, but worth considering. If you submit a more specific question and specifically for spark we could go into more details.

How do I generate a (dummy) column of arbitrary length with MonetDB?

I would like to run the equivalent of PostgreSQL's
SELECT * FROM GENERATE_SERIES(1, 10000000)
I've read this:
http://blog.jooq.org/2013/11/19/how-to-create-a-range-from-1-to-10-in-sql/
But most suggestions there don't really take an arbitrary length - the query depends on the length otherwise than by just replacing a number. Also, some suggestions do not apply in MonetDB. So, what's my best course of action (if any)?
Notes:
- I'm using a version from February 2013. Answers about more recent features are also welcome, but are exactly what I'm looking for.
- Assume the existing tables don't have enough lines; and do not assume that, say, a Cartesian product of the longest table with itself is sufficient (or alternatively, maybe that's too costly to perform).
Try with:
SELECT value
FROM sys.generate_series(initial_value, end_value, offset);
I have to report that the function is quite unstable on Jul2015 release as is causing the server process to crash. Hope you have better luck.
If you wants to generate an arbitrary numeric value you can use:
SELECT rand();
Forgive me; I've never worked with MonetDB before. But the documentation leads me to believe you can solve this with the ROW_NUMBER function and a pre-populated table like SYS.COLUMNS.
SELECT ROW_NUMBER() OVER () AS rownum
FROM SYS.COLUMNS;
This falls into jooq.org's category of just taking random records from a “large enough” table.
PostgreSQL's generate_series function is elegant, but non-standard. It's absent in other mainstream engines like SQL Server, Oracle, and MySQL. Your version of MonetDB doesn't have it either.
MonetDB does have the ROW_NUMBER function, a close equivalent in standard SQL. It assigns a sequential integer to rows in a result set. It will output the correct values, but it needs some rows in your database already. A chicken and egg problem!
SYS.COLUMNS is a system metadata table that contains one row for every column in your database. Most "empty" relational databases still have hundreds of system columns that appear in tables like these.
If the first query produces more rows than you need, you can push it into a subquery and filter the intermediate result.
SELECT rownum
FROM (
SELECT ROW_NUMBER() OVER () AS rownum
FROM SYS.COLUMNS
) AS tally
WHERE rownum >= 1 AND rownum <= 10;
But what if you need to generate more rows than you have in SYS.COLUMNS? Unfortunately, the shape of the query does depend on how many rows you want to generate.
A common workaround in the Microsoft SQL Server community would be to join SYS.COLUMNS to itself. This will produce an intermediate table containing the square of the number of rows in the table. In practice, it's probably more rows than you'll ever need.
With a self-join, the solution looks like this:
SELECT rownum
FROM (
SELECT ROW_NUMBER() OVER () AS rownum
FROM SYS.COLUMNS AS a, SYS.COLUMNS AS b
) AS tally
WHERE rownum >= 1 AND rownum <= 100000;
Hopefully these queries are also relevant in MonetDB world!

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (rows-IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there's no keys defined in the table itself ( a practice that is 20yrs out of date) but in that case there's usually a logical file with a unique key defined that is by the application as the de-facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

TSQL: Best way to get data from temp/scratch table to normalized version?

I'm working in with relatively large data sets; ~200GB. The data is coming from text files that are being imported to SQL via a script. They are being bulkcopy'd into a temp table with the normalized tables waiting to recieve the data.
My question comes from the fact that I'm mostly a scripter so my logic would be to loop through each row and do individual checks per row to put the data where it needs to go but I read a different post on SO saying that's really wrong for SQL.
So my question is, if I have one temp table (31 columns) that is to be normalized between 5 others, what's the best way to go about this?
Table relationship is as follows:
System - Table that contains machine information (e.g. name, domain, etc.)
File - File information (e.g. name, size, directory, etc.)
SystemFile - The many-to-many system<->file relationship table.
Metadata - File metadata (language, etc.) - has foreign key relationship to file primary key
DigitalSignature - File digital signature status - has foreign key relationship to file primary key
Thanks
Dont have any links, don't have enough experience with things like ssis etc to give a balanced view. but when doing the task you are talking about my normal process would be (generic, simple version):
1.look at normalised data set and consider the least dependant components in the data being imported (e.g. order headers created before order items)
2.create queries the select out the data i will have.. these often have this form:
select
t.x,t.y,t.z
from
temp_table as t
left outer join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
where
n.x is null
where temp_table may have lots of columns but these three represent whatever normalised nugget i want to add first, the left outer join and where null make sure i only get the new values - if merging is the same
verify that i am getting good information and that i am only getting the new rows i want. often you have to use group bys or distincts on the temp data to get accurate data for inserting.. something like:
select
t.x,t.y,t.z
from
(select
distinct x,y,z
from
temp_table ) as t
left outer join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
where
n.x is null
3.wrap that select in an insert:
insert into
normalise_table (x,y,z)
select
t.x,t.y,t.z
from
(select
distinct x,y,z
from
temp_table ) as t
left outer join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
where
n.x is null
in this way you are inserting sets of data.. the procedural part is doing this for each set to be inserted, but in general you are not iterating over rows.
BTW T-SQL has a merge command for when you may or may not have the data in the target table (and if you want to remove keys missing from the temp tables)
http://msdn.microsoft.com/en-us/library/bb510625.aspx
Some comments on foreign keys - these tend to be more specific to the situation:
Can you identify the relationship without the primary key? This is the easiest situation to deal with..
Imagine I have inserted my xyz object into a normalised table but it has 100 child rows (abc's) in another table (each child may have 100 children too.. this would mean 10000 rows in the de-normalised data for one xyz)
you would have to go through the validation before but your final query may look something like:
insert into
normalise_table_2 (parentID,a,b,c)
select
n.id,t.a,t.b,t.c
from
(select
distinct x,y,z,a,b,c
from
temp_table ) as t
inner join join normalise_table as n
on t.x=n.x
and t.y=n.y
and t.z=n.z
left outer join normalise_table_2 as n2
on n.id = n2.parentID
and t.a = n2.a
and t.b = n2.b
and t.c = n2.c
where
n2.a is null
or maybe a more readable way:
insert into normalise_table_2 (parentID,a,b,c)
select
*
from (
select distinct
n.id,t.a,t.b,t.c
from
normalise_table as n
inner join temp_table as t
on t.x = n.x
and t.y = n.y
and t.z = n.z
left outer join normalise_table_2 as n2
on t.a = n2.a
and t.b = n2.b
and t.c = n2.c
and n2.parentID = n.id
where
n2.id is null
) as x
If you are having trouble identifying the row without the id here are some points to consider
I often give a unique id to every row in the de-normalised/import data this makes it easier to track what has and has not been done. not to mention paying off in other ways (e.g. when source data has blanks if its they are to be the same as the row above)
I have created temp tables to track relationships like this as I go along.
sometimes (especially for less consistent data) these are not temp tables as they can be used after the fact for analysis what did and didn't import (and where it went), sometimes i have a comments column that the update queries populate with any details about exceptions relating to the import of that row.
sometimes you are lucky and there is some kind of source or oldId field in the target that can be used to link the de-normalised data and normalised version (this is particularly true of system migration type tasks as people often want to be able to look up items in the old system). sometimes this can be weird and wonderful - e.g. using the updated by or created by field looking for a special account that executes this particular process (though i would not particulary recommend that)
Sometimes it makes sense to update the source tables in some way.. e.g. replacing identifiers there
Sometimes you come up with ID ranges or similar that are used for import and you break normal rules about where ids are generated and your import process creates the ID.
this often means shutting down all other access to the target system while the import is executed. may sound mad but sometimes this is the best way for very complex uploads that require a lot of preparation
But often when you think about it there is a particular order you can add your data in and avoid this issue as you will always be able to identify the correct data. I have used the above techniques to make my life easier but I am not sure I have ever HAD to use them..
The only exception I can think of is generating IDs outside of the system which i have had to use, but this was so that IDs would be consistent across multiple trial loads and the final production load. Also data was coming from many sources with many people working on it, it made life easier that they could be in control of their own IDs - but it did bring other issues ;).
Generally I would try and leave the source data alone and ensure that if you re-run any of your scripts then they wont have any effect. this makes the whole system much more robust and gives everyone more confidence as you can re-import the same data or a file that has some of the same data and run everything again and nothing breaks.
note i have not tested any of these queries and just written them off the top of my head so sorry if they are not totally accurate.

SQL WHERE ID IN (id1, id2, ..., idn)

I need to write a query to retrieve a big list of ids.
We do support many backends (MySQL, Firebird, SQLServer, Oracle, PostgreSQL ...) so I need to write a standard SQL.
The size of the id set could be big, the query would be generated programmatically. So, what is the best approach?
1) Writing a query using IN
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
My question here is. What happens if n is very big? Also, what about performance?
2) Writing a query using OR
SELECT * FROM TABLE WHERE ID = id1 OR ID = id2 OR ... OR ID = idn
I think that this approach does not have n limit, but what about performance if n is very big?
3) Writing a programmatic solution:
foreach (var id in myIdList)
{
var item = GetItemByQuery("SELECT * FROM TABLE WHERE ID = " + id);
myObjectList.Add(item);
}
We experienced some problems with this approach when the database server is queried over the network. Normally is better to do one query that retrieve all results versus making a lot of small queries. Maybe I'm wrong.
What would be a correct solution for this problem?
Option 1 is the only good solution.
Why?
Option 2 does the same but you repeat the column name lots of times; additionally the SQL engine doesn't immediately know that you want to check if the value is one of the values in a fixed list. However, a good SQL engine could optimize it to have equal performance like with IN. There's still the readability issue though...
Option 3 is simply horrible performance-wise. It sends a query every loop and hammers the database with small queries. It also prevents it from using any optimizations for "value is one of those in a given list"
An alternative approach might be to use another table to contain id values. This other table can then be inner joined on your TABLE to constrain returned rows. This will have the major advantage that you won't need dynamic SQL (problematic at the best of times), and you won't have an infinitely long IN clause.
You would truncate this other table, insert your large number of rows, then perhaps create an index to aid the join performance. It would also let you detach the accumulation of these rows from the retrieval of data, perhaps giving you more options to tune performance.
Update: Although you could use a temporary table, I did not mean to imply that you must or even should. A permanent table used for temporary data is a common solution with merits beyond that described here.
What Ed Guiness suggested is really a performance booster , I had a query like this
select * from table where id in (id1,id2.........long list)
what i did :
DECLARE #temp table(
ID int
)
insert into #temp
select * from dbo.fnSplitter('#idlist#')
Then inner joined the temp with main table :
select * from table inner join temp on temp.id = table.id
And performance improved drastically.
First option is definitely the best option.
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
However considering that the list of ids is very huge, say millions, you should consider chunk sizes like below:
Divide you list of Ids into chunks of fixed number, say 100
Chunk size should be decided based upon the memory size of your server
Suppose you have 10000 Ids, you will have 10000/100 = 100 chunks
Process one chunk at a time resulting in 100 database calls for select
Why should you divide into chunks?
You will never get memory overflow exception which is very common in scenarios like yours.
You will have optimized number of database calls resulting in better performance.
It has always worked like charm for me. Hope it would work for my fellow developers as well :)
Doing the SELECT * FROM MyTable where id in () command on an Azure SQL table with 500 million records resulted in a wait time of > 7min!
Doing this instead returned results immediately:
select b.id, a.* from MyTable a
join (values (250000), (2500001), (2600000)) as b(id)
ON a.id = b.id
Use a join.
In most database systems, IN (val1, val2, …) and a series of OR are optimized to the same plan.
The third way would be importing the list of values into a temporary table and join it which is more efficient in most systems, if there are lots of values.
You may want to read this articles:
Passing parameters in MySQL: IN list vs. temporary table
I think you mean SqlServer but on Oracle you have a hard limit how many IN elements you can specify: 1000.
Sample 3 would be the worst performer out of them all because you are hitting up the database countless times for no apparent reason.
Loading the data into a temp table and then joining on that would be by far the fastest. After that the IN should work slightly faster than the group of ORs.
For 1st option
Add IDs into temp table and add inner join with main table.
CREATE TABLE #temp (column int)
INSERT INTO #temp (column)
SELECT t.column1 FROM (VALUES (1),(2),(3),...(10000)) AS t(column1)
Try this
SELECT Position_ID , Position_Name
FROM
position
WHERE Position_ID IN (6 ,7 ,8)
ORDER BY Position_Name