I would like to run the equivalent of PostgreSQL's
SELECT * FROM GENERATE_SERIES(1, 10000000)
I've read this:
http://blog.jooq.org/2013/11/19/how-to-create-a-range-from-1-to-10-in-sql/
But most suggestions there don't really scale to an arbitrary length: the shape of the query depends on the length, rather than the length being a single number you can replace. Also, some suggestions do not apply in MonetDB. So, what's my best course of action (if any)?
Notes:
- I'm using a version from February 2013. Answers about more recent features are also welcome, but are not exactly what I'm looking for.
- Assume the existing tables don't have enough rows; and do not assume that, say, a Cartesian product of the longest table with itself is sufficient (or alternatively, maybe that's too costly to perform).
Try with:
SELECT value
FROM sys.generate_series(initial_value, end_value, offset);
I have to report that the function is quite unstable on the Jul2015 release, as it is causing the server process to crash. I hope you have better luck.
If you want to generate an arbitrary numeric value, you can use:
SELECT rand();
Forgive me; I've never worked with MonetDB before. But the documentation leads me to believe you can solve this with the ROW_NUMBER function and a pre-populated table like SYS.COLUMNS.
SELECT ROW_NUMBER() OVER () AS rownum
FROM SYS.COLUMNS;
This falls into jooq.org's category of just taking random records from a “large enough” table.
PostgreSQL's generate_series function is elegant, but non-standard. It's absent in other mainstream engines like SQL Server, Oracle, and MySQL. Your version of MonetDB doesn't have it either.
MonetDB does have the ROW_NUMBER function, a close equivalent in standard SQL. It assigns a sequential integer to rows in a result set. It will output the correct values, but it needs some rows in your database already. A chicken and egg problem!
SYS.COLUMNS is a system metadata table that contains one row for every column in your database. Most "empty" relational databases still have hundreds of system columns that appear in tables like these.
If the first query produces more rows than you need, you can push it into a subquery and filter the intermediate result.
SELECT rownum
FROM (
SELECT ROW_NUMBER() OVER () AS rownum
FROM SYS.COLUMNS
) AS tally
WHERE rownum >= 1 AND rownum <= 10;
But what if you need to generate more rows than you have in SYS.COLUMNS? Unfortunately, the shape of the query does depend on how many rows you want to generate.
A common workaround in the Microsoft SQL Server community would be to join SYS.COLUMNS to itself. This will produce an intermediate table containing the square of the number of rows in the table. In practice, it's probably more rows than you'll ever need.
With a self-join, the solution looks like this:
SELECT rownum
FROM (
SELECT ROW_NUMBER() OVER () AS rownum
FROM SYS.COLUMNS AS a, SYS.COLUMNS AS b
) AS tally
WHERE rownum >= 1 AND rownum <= 100000;
Hopefully these queries are also relevant in MonetDB world!
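As a rough way to try the self-join idea without a MonetDB server, here is a minimal sketch using Python's built-in sqlite3 module. The `seed` table is a hypothetical stand-in for SYS.COLUMNS, the `tally` helper is made up for illustration, and the sketch assumes the bundled SQLite is 3.25 or newer (needed for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for a pre-populated catalog table such as SYS.COLUMNS.
conn.execute("CREATE TABLE seed (x INTEGER)")
conn.executemany("INSERT INTO seed VALUES (?)", [(i,) for i in range(20)])


def tally(conn, n):
    """Return 1..n by numbering the rows of seed x seed (at most 400 rows here)."""
    rows = conn.execute(
        """
        SELECT rownum
        FROM (
            SELECT ROW_NUMBER() OVER () AS rownum
            FROM seed AS a, seed AS b
        ) AS tally
        WHERE rownum BETWEEN 1 AND ?
        ORDER BY rownum
        """,
        (n,),
    ).fetchall()
    return [r[0] for r in rows]


numbers = tally(conn, 100)
print(numbers[:5], len(numbers))
```

Asking for more rows than the square of the seed table silently returns fewer, which is exactly the limitation described above: the query shape has to change once the self-join is too small.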
Related
I have 100 tables, each with a size on the order of a few tenths of a GB. The schema of each table is the following:
A: string | B: string | C: string
In each table I would like to retain only the rows for which the (B, C) pair appears at least 10 times in the concatenation of all 100 tables. Is there any efficient way to achieve this?
This is a very vague question, and leaving out your DBMS doesn't help, since SQL comes in different dialects.
But first, you would have to combine all of the tables. There may be a faster way of doing this, but without knowing which flavor of SQL you are using it is hard to tell.
Something like this will work:
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
...
UNION ALL
SELECT * FROM table_100
Once you have all of the data you do something like this:
WITH tables_with_counts AS (SELECT
A,
B,
C,
COUNT(1) OVER (PARTITION BY B, C) AS bc_count
FROM
aggregated_tables)
SELECT
A,
B,
C
FROM
tables_with_counts
WHERE
bc_count >= 10
Here is my take:
Step 1: Aggregate all tables into one. It would be bulky, but if you are using an Oracle database, I think it shouldn't be an issue.
Step 2: Create MD5 checksum hash values for the B and C columns, like below:
SELECT APEX_ITEM.MD5_CHECKSUM(B,C) md5_cks,
A,B,C
FROM aggregated_tables
Step 3: Take a count based on the checksum values and retain the rows where count >= 10.
Step 4: Get rid of duplicate data using rank() or dense_rank() in a delete statement.
The short answer, which I'm sure that you don't want to hear, is "no." In the context of relational databases there is no efficient query to merge 100 tables.
It is not all bad news though. If it were just one table (let's say it was named "combined" just to have concrete examples) you could use an elegant SQL using windowed functions
select A,B,C from (select A,B,C,count(1) over (partition by B,C) as counts from combined)counted where counts>=10
Option 1. So the question is how to get a "combined" table so that the snippet above works. If we stick with ANSI (standard) SQL, you could use UNION ALL and collect the result in a WITH clause to keep things neat.
Here is an example:
with
combined as (
select * from table_1
union all
select * from table_2),
counted as (
select
A,B,C,
count(1) over (partition by B,C) as counts
from
combined)
select A,B,C from counted where counts>=10;
I only included 2 tables, but the real query would extend that up to table_100. That's a lot of typing and not very efficient with the programmer's time. Also, UNION and UNION ALL are notoriously poor performers in databases, so this is not efficient in terms of system resources or time, either. Personally I would not do it this way, but it is an answer.
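To make the combined-count idea concrete, here is a small sketch in Python with the built-in sqlite3 module. It is only an illustration under assumptions: two tiny sample tables stand in for the 100 real ones, a threshold of 3 replaces the question's 10, and GROUP BY with HAVING is swapped in for the window function so that each original table can be filtered separately. The `frequent_rows` helper is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("table_1", "table_2"):
    conn.execute(f"CREATE TABLE {name} (A TEXT, B TEXT, C TEXT)")
conn.executemany("INSERT INTO table_1 VALUES (?, ?, ?)",
                 [("a1", "x", "y"), ("a2", "x", "y"), ("a3", "p", "q")])
conn.executemany("INSERT INTO table_2 VALUES (?, ?, ?)",
                 [("a4", "x", "y"), ("a5", "p", "q")])

THRESHOLD = 3  # the question uses 10; 3 keeps the sample data small


def frequent_rows(conn, table, threshold):
    """Rows of `table` whose (B, C) pair occurs >= threshold times overall."""
    # `table` must be a trusted identifier; it is interpolated, not bound.
    sql = f"""
        WITH combined AS (
            SELECT B, C FROM table_1
            UNION ALL            -- keep duplicates, otherwise counts are wrong
            SELECT B, C FROM table_2
        ),
        frequent AS (
            SELECT B, C FROM combined GROUP BY B, C HAVING COUNT(*) >= ?
        )
        SELECT t.A, t.B, t.C
        FROM {table} AS t
        JOIN frequent AS f ON f.B = t.B AND f.C = t.C
        ORDER BY t.A
    """
    return conn.execute(sql, (threshold,)).fetchall()


kept_1 = frequent_rows(conn, "table_1", THRESHOLD)
kept_2 = frequent_rows(conn, "table_2", THRESHOLD)
print(kept_1, kept_2)
```

Only the narrow (B, C) columns are unioned, so the wide A column is never copied into the combined set; whether that helps on a real system depends on the engine.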
Option 2. There are other options which do not exactly match your question, but may be helpful to know. Any time you are tempted to create multiple tables with exactly the same schema, you will be better off creating a single table with multiple partitions. See the partitioning documentation for MySQL, Postgres, SQL Server, Oracle, or Hive. Every database platform has its own syntax for partitioning tables, but they are all similar. For this table, each of the original tables becomes a single partition, and the original table name would be a really good candidate for the string value in the partition identifier (partition column).
If you are able to stuff all of your 100 tables into 100 partitions of one table then you can run the first query after all. The advantage is that the database can optimize that query because all modern databases are optimized to manage partitioned queries.
In addition, adding a partition to a table is really no more trouble than creating a new table instead, but supporting and maintaining one table is a lot less trouble than 100 tables.
A third option, since you tagged "big data", is to use a big data engine like Spark with SparkSQL. This would be objectively best because you can load a dataframe with 100 combined tables very efficiently with Spark, and the SQL after that is not much different from the relational database SQL we have been considering. That's kind of out of scope here, but worth considering. If you submit a more specific question about Spark, we could go into more detail.
So I came upon a question where someone asked for a list of unused account numbers. The query I wrote for it works, but it is kind of hacky and relies on the existence of a table with more records than existing accounts:
WITH tmp
AS (SELECT Row_number()
OVER(
ORDER BY cusno) a
FROM custtable
fetch first 999999 rows only)
SELECT tmp.a
FROM tmp
WHERE a NOT IN (SELECT cusno
FROM custtable)
This works because customer numbers are reused and there are significantly more records than unique customer numbers. But, like I said, it feels hacky and I'd like to just generate a temporary table with 1 column and x records that are numbered 1 through x. I looked at some recursive solutions, but all of it looked way more involved than the solution I wound up using. Is there an easier way that doesn't rely on existing tables?
I think the simple answer is no. To be able to make a determination of absence, the platform needs to know the expected data set. You can either generate that as a temporary table or data set at runtime - using the method you've used (or a variation thereof) - or you can create a reference table once, and compare against it each time. I'd favour the latter - a table with a single column of integers won't put much of a dent in your disk space and it doesn't make sense to compute an identical result set over and over again.
Here's a really good article from Aaron Bertrand that deals with this very issue:
https://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
(Edit: The queries in that article are TSQL specific, but they should be easily adaptable to DB2 - and the underlying analysis is relevant regardless of platform)
If you want to find all unused account numbers, you can do it like this:
with MaxNumber as
(
select max(cusno) MaxID from custtable
),
RecurseNumber (id) as
(
values 1
union all
select id + 1 from RecurseNumber cross join MaxNumber
where id < MaxID
)
select f1.* from RecurseNumber f1 exception join custtable f2 on f1.id=f2.cusno
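The recursive approach can be tried out without DB2. Below is a minimal sketch using Python's built-in sqlite3 module; it assumes SQLite's `WITH RECURSIVE` syntax and uses a `LEFT JOIN ... IS NULL` anti-join in place of the DB2-specific `exception join`. The sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE custtable (cusno INTEGER)")
# Customer numbers may repeat, as in the question.
conn.executemany("INSERT INTO custtable VALUES (?)",
                 [(1,), (2,), (2,), (5,), (7,)])

unused = [row[0] for row in conn.execute(
    """
    WITH RECURSIVE numbers(id) AS (
        SELECT 1
        UNION ALL
        SELECT id + 1 FROM numbers
        WHERE id < (SELECT MAX(cusno) FROM custtable)  -- stop at the max
    )
    SELECT n.id
    FROM numbers AS n
    LEFT JOIN custtable AS c ON c.cusno = n.id
    WHERE c.cusno IS NULL          -- keep only numbers with no customer
    ORDER BY n.id
    """
)]
print(unused)  # [3, 4, 6]
```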
I am looking at some code and came across something extremely unfamiliar to me; Google did not prove fruitful, so I was wondering if anyone could explain what the following code does. It does not refer to any of my tables or databases, so I assume it's general code and I needn't provide my database layout? Thanks very much.
The code :
SELECT ROW_NUMBER() OVER (ORDER BY Object_ID) AS weeks
FROM SYS.OBJECTS
It will select the numbers from 1 to N where N is the number of rows in sys.objects. It will not guarantee a sort-order.
Probably this code is intended to provide all week numbers (omg!) under the assumption that there are at least 52 rows in sys.objects.
This code, however, will return more than 52 rows and the result is not guaranteed to be ordered. I recommend you get rid of this nastiness.
Edit: As an alternative, I'd choose to create the following table: CREATE TABLE Weeks (WeekNumber TINYINT NOT NULL PRIMARY KEY) and fill it appropriately. This will be even faster than selecting from sys.objects because this custom table will be smaller and correctly sorted.
The developer is using records in the system view sys.objects to get a list of sequential numbers by using the ROW_NUMBER() window function and aliasing the column as "weeks". The number of rows in the sys.objects view will change based on the objects defined in the database, so why someone would do this is beyond me...
If a simple list of sequential numbers is needed there are more predictable ways to get them.
The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020?
Is there a way to create a SQL statement on Microsoft SQL Server that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you to specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
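To illustrate the difference between the two styles discussed here, the following sketch uses Python's built-in sqlite3 module with an invented 1,000-row `my_table`. It assumes gapless integer ids, which is the only case where a key range and an offset return the same rows; the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [(i, f"row {i}") for i in range(1, 1001)])


def by_key_range(conn, first_id, last_id):
    """Ask for rows by key: the primary-key index can seek straight to them."""
    return conn.execute(
        "SELECT id, payload FROM my_table WHERE id BETWEEN ? AND ? ORDER BY id",
        (first_id, last_id),
    ).fetchall()


def by_offset(conn, offset, count):
    """Ask for rows by position: the engine has to step past `offset` rows."""
    return conn.execute(
        "SELECT id, payload FROM my_table ORDER BY id LIMIT ? OFFSET ?",
        (count, offset),
    ).fetchall()


page_a = by_key_range(conn, 301, 310)
page_b = by_offset(conn, 300, 10)
print(page_a == page_b, page_a[0], page_a[-1])
```

Both calls carry an explicit ORDER BY, so the result does not depend on whatever order the engine happens to read rows in.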
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID, for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>… /* or * */
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
Take a look at the following white paper for more detailed information on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps; please feel free to pose further questions.
Cheers, John
I use wrapper queries to select from the core query and then isolate just the row numbers that I wish to take from it. This allows the SQL server to do all the heavy lifting inside the CORE query and pass out only the small part of the table that I have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are aliases for the derived (wrapper) tables that the SQL server builds around the core query.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
<!--- CORE QUERY START --->
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
<!--- CORE QUERY END --->
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query as fetching unnecessary data in these CORE queries can cost you serious overhead
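The wrapper pattern above can be exercised with Python's built-in sqlite3 module. This is a sketch under assumptions: the `items` table, its columns, and the `fetch_rows` helper are invented for illustration, the core query and order clause are fixed rather than templated, and SQLite 3.25 or newer is assumed for ROW_NUMBER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany("INSERT INTO items (name, price) VALUES (?, ?)",
                 [(f"item{i:02d}", float(i)) for i in range(1, 31)])


def fetch_rows(conn, start_row, end_row):
    """Return rows start_row..end_row (1-based) of the ordered core query."""
    sql = """
        SELECT w1.name, w1.price
        FROM (
            SELECT w2.*,
                   ROW_NUMBER() OVER (ORDER BY w2.price DESC) AS row_no
            FROM (
                -- core query: only the columns that are actually needed
                SELECT name, price FROM items WHERE price > 5
            ) AS w2
        ) AS w1
        WHERE w1.row_no BETWEEN ? AND ?
        ORDER BY w1.row_no
    """
    return conn.execute(sql, (start_row, end_row)).fetchall()


first_three = fetch_rows(conn, 1, 3)
tail = fetch_rows(conn, 24, 30)
print(first_three, tail)
```

The row-number column is named `row_no` rather than `ROW` here to stay clear of reserved words in engines that treat ROW as a keyword.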
Use TOP to select only a limited amount of rows, like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.
I know:
Firebird: FIRST and SKIP;
MySQL: LIMIT;
SQL Server: ROW_NUMBER();
Does anyone know an ANSI SQL way to perform result paging?
See Limit—with offset section on this page: http://troels.arvin.dk/db/rdbms/
BTW, Firebird also supports the ROWS clause since version 2.0
No official way, no.*
Generally you'll want to have an abstracted-out function in your database access layer that will cope with it for you; give it a hint that you're on MySQL or PostgreSQL and it can add a 'LIMIT' clause to your query, or rownum over a subquery for Oracle and so on. If it doesn't know it can do any of those, fall back to fetching the lot and returning only a slice of the full list.
*: eta: there is now, in ANSI SQL:2003. But it's not globally supported, it often performs badly, and it's a bit of a pain because you have to move/copy your ORDER into a new place in the statement, which makes it harder to wrap automatically:
SELECT * FROM (
SELECT thiscol, thatcol, ROW_NUMBER() OVER (ORDER BY mtime DESC, id) AS rownumber
FROM sometable -- placeholder table name
) AS numbered
WHERE rownumber BETWEEN 10 AND 20 -- care, 1-based index
ORDER BY rownumber;
There is also the "FETCH FIRST n ROWS ONLY" suffix in SQL:2008 (and DB2, where it originated). But like the TOP prefix in SQL Server, and the similar syntax in Informix, you can't specify a start point, so you still have to fetch and throw away some rows.
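The "abstracted-out function in your database access layer" mentioned in this answer could look roughly like the sketch below. Everything here is hypothetical: the `paginate` name, the dialect strings, and the choice to return None when no clause is known, which tells the caller to fall back to fetching everything and slicing on the client.

```python
def paginate(query, dialect, limit, offset=0):
    """Append a dialect-specific paging clause to an already ordered query.

    Returns None for an unknown dialect so the caller can fetch the full
    result and slice it client-side instead.
    """
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset >= 0")
    limit, offset = int(limit), int(offset)
    if dialect in ("mysql", "postgresql", "sqlite"):
        return f"{query} LIMIT {limit} OFFSET {offset}"
    if dialect == "standard":
        # OFFSET ... FETCH form, for engines that accept the standard syntax
        return f"{query} OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY"
    return None


base = "SELECT thiscol, thatcol FROM sometable ORDER BY mtime DESC, id"
print(paginate(base, "postgresql", 10, 20))
print(paginate(base, "standard", 10, 20))
print(paginate(base, "unknown-engine", 10, 20))
```

The helper only appends text, so it relies on the caller passing a query that already ends in its ORDER BY clause.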
Nowadays there is a standard way, though not necessarily an ANSI standard (people gave many answers; I think this is the least verbose one):
SELECT * FROM t1
WHERE ID > :lastId
ORDER BY ID
FETCH FIRST 3 ROWS ONLY
It's not supported by all databases, though. Below is a list of the databases that support it:
MariaDB: Supported since 5.1 (usually, limit/offset is used)
MySQL: Supported since 3.19.3 (usually, limit/offset is used)
PostgreSQL: Supported since PostgreSQL 8.4 (usually, limit/offset is used)
SQLite: Supported since version 2.1.0
Db2 LUW: Supported since version 7
Oracle: Supported since version 12c (uses subselects with the row_num function)
Microsoft SQL Server: Supported since 2012 (traditionally, top-N is used)
You can use the offset style too, of course, although you could run into performance issues:
SELECT * FROM t1
ORDER BY ID
OFFSET 0 ROWS
FETCH FIRST 3 ROWS ONLY
Support for this form differs:
MariaDB: Supported since 5.1
MySQL: Supported since 4.0.6
PostgreSQL: Supported since PostgreSQL 6.5
SQLite: Supported since version 2.1.0
Db2 LUW: Supported since version 11.1
Oracle: Supported since version 12c
Microsoft SQL Server: Supported since 2012
Yes (SQL ANSI 2003): feature E121-10, combined with feature F861, gives you:
ORDER BY column OFFSET n1 ROWS FETCH NEXT n2 ROWS ONLY;
Like:
SELECT Name, Address FROM Employees ORDER BY Salary OFFSET 2 ROWS
FETCH NEXT 2 ROWS ONLY;
Examples:
postgres:
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=e25bb5235ccce77c4f950574037ef379
oracle:
https://dbfiddle.uk/?rdbms=oracle_21&fiddle=07d54808407b9dbd2ad209f2d0fe7ed7
sqlserver:
https://dbfiddle.uk/?rdbms=sqlserver_2019l&fiddle=e25bb5235ccce77c4f950574037ef379
db2:
https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=e25bb5235ccce77c4f950574037ef379
YugabyteDB:
https://dbfiddle.uk/?rdbms=yugabytedb_2.8&fiddle=e25bb5235ccce77c4f950574037ef379
Unfortunately, MySQL does not support this syntax; you need something like:
ORDER BY column LIMIT n2 OFFSET n1
But MariaDB does:
https://dbfiddle.uk/?rdbms=mariadb_10.6&fiddle=e25bb5235ccce77c4f950574037ef379
I know I'm very, very late to this question, but it's still one of the top results for this issue.
However, one response missing from this question is that I believe the "correct" ANSI SQL method for paging, at least if you want maximum portability, is not to use LIMIT/OFFSET/FIRST etc. at all, but instead to do something like:
SELECT *
FROM MyTable
WHERE ColumnA > ?
ORDER BY ColumnA ASC
Where ? is a parameter using a library that supports them (such as PDO in PHP).
The idea here is simple: when fetching the first page we pass a parameter that will match every possible row, e.g. if ColumnA is text, we would pass an empty string (''). We then read in as many results as we want, and then release the rest. This may mean some extra rows are fetched behind the scenes, but our priority here is compatibility.
In order to fetch the next page, we take the value of ColumnA from the last row in our results, and pass it in as the parameter, this way we will only fetch values that appear after it. To run the same query in the other direction, just swap > for < and ASC for DESC.
There are some important caveats of this approach:
- Since we're using a condition, your DBMS is free to use an index to optimise the request, which can actually be faster than some "proper" pagination methods, as you eliminate rows rather than advancing past them.
- This form of paging is more tightly anchored than row number based methods. When using row number offsets, if you offset into the table, but new rows are added that sort earlier than the current page, then it will cause results to be shifted into later pages. For example, if your current page's last row is mango but since fetching it rows are added for apple and carrot, then mango may now appear on the next page as well, as it has been shifted in the sort order. By using a condition of ColumnA > 'mango' this can't happen. This can be very useful in cases where you are sorting by a DATETIME with frequent updates occurring.
- This trick can be made to work in both directions, by reversing the sort order as mentioned when going backwards (flip > to < and ASC to DESC) and passing in the value of ColumnA from the first row of each page of results, rather than the last. Note that if values were added to your table, it may mean that your first page may be shorter, but this is a fairly minor issue.
- To be sure you're on the last (or first) page, you should fetch N + 1 rows, where N is the number of rows you want per page; this way you can detect whether there are more rows to fetch.
- This method works best if you have a single column with only unique values, but it is still possible to use in more complex cases, so long as you can expand your ORDER BY clause (and WHERE condition) to include enough columns that every row is unique.
So it's not without a few catches, but it's by far the most compatible method as every SQL database will support it.
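The forward-paging steps described in this answer can be sketched with Python's built-in sqlite3 module. The sample rows and the `next_page` helper are invented for illustration, and ColumnA is assumed to be unique, as the answer recommends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ColumnA TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?)",
    [(w,) for w in ("apple", "banana", "carrot", "kiwi", "mango", "peach", "plum")],
)


def next_page(conn, after, page_size):
    """Return (rows, has_more) for the page that follows the key `after`."""
    cur = conn.execute(
        "SELECT ColumnA FROM MyTable WHERE ColumnA > ? ORDER BY ColumnA ASC",
        (after,),
    )
    # Read N + 1 rows, then release the rest of the result set.
    rows = [r[0] for r in cur.fetchmany(page_size + 1)]
    cur.close()
    return rows[:page_size], len(rows) > page_size


page1, more1 = next_page(conn, "", 3)          # '' sorts before every value
page2, more2 = next_page(conn, page1[-1], 3)   # continue after the last key
page3, more3 = next_page(conn, page2[-1], 3)
print(page1, more1)
print(page2, more2)
print(page3, more3)
```

The extra row read by `fetchmany` is never returned to the caller; it only answers the question of whether another page exists.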
Insert your results into a storage table, ordered how you'd like to display them, but with a new IDENTITY column.
Now SELECT from that table just the range of IDs you're interested in.
(Be sure to clean out the table when you're done)
Or do it on the client, as anything to do with presentation should not normally be done on the SQL Server (in my opinion)
SQL Server example (TOP is not part of ANSI SQL):
offset=41, fetchsize=10
SELECT TOP(10) *
FROM table1
WHERE table1.ID NOT IN (SELECT TOP(40) table1.ID FROM table1 ORDER BY table1.ID)
ORDER BY table1.ID
For paging we need a RowNo column to filter on (it should be computed over a field like id), together with two variables like #PageNo and #PageRows. So I use this query:
SELECT *
FROM (
SELECT *, (SELECT COUNT(1)
FROM aTable ti
WHERE ti.id <= t.id) As RowNo
FROM aTable t) tr
WHERE
tr.RowNo >= (#PageNo - 1) * #PageRows + 1
AND
tr.RowNo <= #PageNo * #PageRows
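Here is a small sketch of this counting approach using Python's built-in sqlite3 module. The table contents and the `page` helper are made up, and the row number is computed with `<=` so that it starts at 1 and lines up with the 1-based page bounds. Note that the correlated COUNT runs once per row, so this is likely to be slow on large tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aTable (id INTEGER PRIMARY KEY, label TEXT)")
# ids with gaps (10, 20, ..., 120) to show that RowNo is independent of id
conn.executemany("INSERT INTO aTable VALUES (?, ?)",
                 [(i * 10, f"r{i}") for i in range(1, 13)])


def page(conn, page_no, page_rows):
    """Return page `page_no` (1-based) with `page_rows` rows per page."""
    sql = """
        SELECT tr.id, tr.label
        FROM (
            SELECT t.*,
                   (SELECT COUNT(*) FROM aTable ti WHERE ti.id <= t.id) AS RowNo
            FROM aTable t
        ) tr
        WHERE tr.RowNo BETWEEN (? - 1) * ? + 1 AND ? * ?
        ORDER BY tr.RowNo
    """
    return conn.execute(sql, (page_no, page_rows, page_no, page_rows)).fetchall()


print([r[0] for r in page(conn, 1, 5)])
print([r[0] for r in page(conn, 3, 5)])
```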
BTW, Troels, PostgreSQL supports Limit/Offset