Get certain number of rows from a table - sql

I have a big table which has 700 millions of rows and dozens of columns. I need it to carry out a left join operation as the left table.
However, since it is so large, the time consumption is beyond affordable. So I'd like to split into few smaller tables and carry out the task in multiprocessing manner.
I know there is SSIS package is available for this, but I am constrained not to use it.
Also, I know it is a easy way to add a row id to each of the rows, but unfortunately, I cannot make change to the table.
So, may I know how to achieve my goal?
Many thanks.

Using CTE and row_number()
;WITH cte_tbl AS
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY [First Name]) AS RowNumber
FROM your_table
)
SELECT * FROM cte_tbl
WHERE RowNumber < 100
You can fetch records for any range.

Use fetch-offset if you need certain number of rows using sql-script.
like this:
SELECT First Name + ' ' + Last Name FROM big_table ORDER BY First Name OFFSET 15 ROWS;
SELECT First Name + ' ' + Last Name FROM big_table ORDER BY First Name OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY;

Create a #temp table, and insert a subset of the records into it, then join to the #temp table. You can create an additional ID column in the #temp table as well for your needs.

Related

How to find duplicate rows in Hive?

I want to find duplicate rows from one of the Hive table for which I was given two approaches.
First approach is to use following two queries:
select count(*) from mytable; // this will give total row count
second query is as below which will give count of distinct rows
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800 rows are duplicate. But this query doesn't tell which rows are duplicated.
My second approach to find duplicate is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
Above query should list of rows which are duplicated and how many times particular row is duplicated. but this query shows zero rows which means there are no duplicate rows in that table.
So I would like to know:
If my first approach is correct - if yes then how do I find which rows are duplicated
Why second approach is not providing list of rows which are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get list of duplicated rows.
analytic window function row_number() is quite useful and can provide the duplicates based upon the elements specified in the partition by clause. A simply in-line view and exists clause will then pinpoint what corresponding sets of records contain these duplicates from the original table. In some databases (like TD, you can forgo the inline view using a QUALIFY pragma option)
SQL1 & SQL2 can be combined. SQL2: If you want to deal with NULLs and not simply dismiss, then a coalesce and concatenation might be better in the
SELECT count(1) , count(distinct coalesce(keypart1 ,'') + coalesce(keypart2 ,'') )
FROM srcTable s
3) Finds all records, not just the > 1 records. This provides all context data as well as the keys so it can be useful when analyzing why you have dups and not just the keys.
select * from srcTable s
where exists
( select 1 from (
SELECT
keypart1,
keypart2,
row_number() over( partition by keypart1, keypart2 ) seq
FROM srcTable t
WHERE
-- (whatever additional filtering you want)
) t
where seq > 1
AND t.keypart1 = s.keypart1
AND t.keypart2 = s.keypart2
)
Suppose your want get duplicate rows based on a particular column ID here. Below query will give you all the IDs which are duplicate in table in hive.
SELECT "ID"
FROM TABLE
GROUP BY "ID"
HAVING count(ID) > 1

Select last N rows SQL Server 2012

I am currently working with Big Data. I am importing data into a table which is about 200 million records per import. I want to see how many records are loaded in for the current import. But currently my script is running through 1 billion records first to finally count the last imported data.
SELECT Datum, COUNT(Datum) AS recCount
FROM PF161DailyAggregates
GROUP BY Datum
That is my current code which shows the amount of rows per Date
I can make the code that it only shows the current import job but it will still go through all the other records.
Currently this query takes about an hour. How can I make this fast to only count the last N rows?
Thanks in advance
this will restrict result to 100 rows and you can get last rows by giving order by clause desc
SELECT Datum, COUNT(Datum) AS recCount
FROM PF161DailyAggregates
GROUP BY Datum
order by datum desc
OFFSET 1 ROWS
FETCH NEXT 100 ROWS ONLY;
Thats a hard one. I think as long as you want to find out the last records AFTER the import, you are required to use some ordering on the Datum column. You can try various tricks there but as long as this column does not have an index, you will be lost, since any ordering requires a full table scan. So my first suggestion is to make an index on that column, then you can use any technique that restricts your result to the last date like:
select top 1 Datum, count(Datum)
from PF161DailyAggregates
group by Datum
order by Datum desc
or
select count(*)
from PF161DailyAggregates
where Datum = (select top 1 Datum
from PF161DailyAggregates
order by Datum desc)
Another idea would be to break out of the box and make the import job write the number of records per Datum in a seperate table each time it runs. That would be much cheaper.
Fastest way to find count on single table,
SELECT T.name AS [TABLE NAME],
I.rows AS [ROWCOUNT]
FROM sys.tables AS T
INNER JOIN sys.sysindexes AS I
ON T.object_id = I.id
AND I.indid < 2
where T.name ='PF161DailyAggregates'
ORDER BY I.rows DESC
Alternatively,
you can create one identity column.
Before insert find max id== easy and fast
then after insert find SCOPE_IDENTITY() in variable.
then subtract these two .
If table already contain one rownumber type in sequence then
also you can use same technique using First_Value in sql server 2012

How to select bottom most rows?

I can do SELECT TOP (200) ... but why not BOTTOM (200)?
Well not to get into philosophy what I mean is, how can I do the equivalent of TOP (200) but in reverse (from the bottom, like you'd expect BOTTOM to do...)?
SELECT
columns
FROM
(
SELECT TOP 200
columns
FROM
My_Table
ORDER BY
a_column DESC
) SQ
ORDER BY
a_column ASC
It is unnecessary. You can use an ORDER BY and just change the sort to DESC to get the same effect.
Sorry, but I don't think I see any correct answers in my opinion.
The TOP x function shows the records in undefined order. From that definition follows that a BOTTOM function can not be defined.
Independent of any index or sort order. When you do an ORDER BY y DESC you get the rows with the highest y value first. If this is an autogenerated ID, it should show the records last added to the table, as suggested in the other answers. However:
This only works if there is an autogenerated id column
It has a significant performance impact if you compare that with the TOP function
The correct answer should be that there is not, and cannot be, an equivalent to TOP for getting the bottom rows.
Logically,
BOTTOM (x) is all the records except TOP (n - x), where n is the count; x <= n
E.g. Select Bottom 1000 from Employee:
In T-SQL,
DECLARE
#bottom int,
#count int
SET #bottom = 1000
SET #count = (select COUNT(*) from Employee)
select * from Employee emp where emp.EmployeeID not in
(
SELECT TOP (#count-#bottom) Employee.EmployeeID FROM Employee
)
It would seem that any of the answers which implement an ORDER BY clause in the solution is missing the point, or does not actually understand what TOP returns to you.
TOP returns an unordered query result set which limits the record set to the first N records returned. (From an Oracle perspective, it is akin to adding a where ROWNUM < (N+1).
Any solution which uses an order, may return rows which also are returned by the TOP clause (since that data set was unordered in the first place), depending on what criteria was used in the order by
The usefulness of TOP is that once the dataset reaches a certain size N, it stops fetching rows. You can get a feel for what the data looks like without having to fetch all of it.
To implement BOTTOM accurately, it would need to fetch the entire dataset unordered and then restrict the dataset to the final N records. That will not be particularly effective if you are dealing with huge tables. Nor will it necessarily give you what you think you are asking for. The end of the data set may not necessarily be "the last rows inserted" (and probably won't be for most DML intensive applications).
Similarly, the solutions which implement an ORDER BY are, unfortunately, potentially disastrous when dealing with large data sets. If I have, say, 10 Billion records and want the last 10, it is quite foolish to order 10 Billion records and select the last 10.
The problem here, is that BOTTOM does not have the meaning that we think of when comparing it to TOP.
When records are inserted, deleted, inserted, deleted over and over and over again, some gaps will appear in the storage and later, rows will be slotted in, if possible. But what we often see, when we select TOP, appears to be sorted data, because it may have been inserted early on in the table's existence. If the table does not experience many deletions, it may appear to be ordered. (e.g. creation dates may be as far back in time as the table creation itself). But the reality is, if this is a delete-heavy table, the TOP N rows may not look like that at all.
So -- the bottom line here(pun intended) is that someone who is asking for the BOTTOM N records doesn't actually know what they're asking for. Or, at least, what they're asking for and what BOTTOM actually means are not the same thing.
So -- the solution may meet the actual business need of the requestor...but does not meet the criteria for being the BOTTOM.
First, create an index in a subquery according to the table's original order using:
ROW_NUMBER () OVER (ORDER BY (SELECT NULL) ) AS RowIndex
Then order the table descending by the RowIndex column you've created in the main query:
ORDER BY RowIndex DESC
And finally use TOP with your wanted quantity of rows:
SELECT TOP 1 * --(or 2, or 5, or 34)
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL) ) AS RowIndex, *
FROM MyTable) AS SubQuery
ORDER BY RowIndex DESC
All you need to do is reverse your ORDER BY. Add or remove DESC to it.
The problem with ordering the other way is that it often does not make good use of indices. It is also not very extendable if you ever need to select a number of rows that are not at the start or the end. An alternative way is as follows.
DECLARE #NumberOfRows int;
SET #NumberOfRows = (SELECT COUNT(*) FROM TheTable);
SELECT col1, col2,...
FROM (
SELECT col1, col2,..., ROW_NUMBER() OVER (ORDER BY col1) AS intRow
FROM TheTable
) AS T
WHERE intRow > #NumberOfRows - 20;
The currently accepted answer by "Justin Ethier" is not a correct answer as pointed out by "Protector one".
As far as I can see, as of now, no other answer or comment provides the equivalent of BOTTOM(x) the question author asked for.
First, let's consider a scenario where this functionality would be needed:
SELECT * FROM Split('apple,orange,banana,apple,lime',',')
This returns a table of one column and five records:
apple
orange
banana
apple
lime
As you can see: we don't have an ID column; we can't order by the returned column; and we can't select the bottom two records using standard SQL like we can do for the top two records.
Here is my attempt to provide a solution:
SELECT * INTO #mytemptable FROM Split('apple,orange,banana,apple,lime',',')
ALTER TABLE #mytemptable ADD tempID INT IDENTITY
SELECT TOP 2 * FROM #mytemptable ORDER BY tempID DESC
DROP TABLE #mytemptable
And here is a more complete solution:
SELECT * INTO #mytemptable FROM Split('apple,orange,banana,apple,lime',',')
ALTER TABLE #mytemptable ADD tempID INT IDENTITY
DELETE FROM #mytemptable WHERE tempID <= ((SELECT COUNT(*) FROM #mytemptable) - 2)
ALTER TABLE #mytemptable DROP COLUMN tempID
SELECT * FROM #mytemptable
DROP TABLE #mytemptable
I am by no means claiming that this is a good idea to use in all circumstances, but it provides the desired results.
You can use the OFFSET FETCH clause.
SELECT COUNT(1) FROM COHORT; --Number of results to expect
SELECT * FROM COHORT
ORDER BY ID
OFFSET 900 ROWS --Assuming you expect 1000 rows
FETCH NEXT 100 ROWS ONLY;
(This is for Microsoft SQL Server)
Official documentation:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-offset-fetch/
"Tom H" answer above is correct and it works for me in getting Bottom 5 rows.
SELECT [KeyCol1], [KeyCol2], [Col3]
FROM
(SELECT TOP 5 [KeyCol1],
[KeyCol2],
[Col3]
FROM [dbo].[table_name]
ORDER BY [KeyCol1],[KeyCol2] DESC) SOME_ALAIS
ORDER BY [KeyCol1],[KeyCol2] ASC
Thanks.
try this.
declare #floor int --this is the offset from the bottom, the number of results to exclude
declare #resultLimit int --the number of results actually retrieved for use
declare #total int --just adds them up, the total number of results fetched initially
--following is for gathering top 60 results total, then getting rid of top 50. We only keep the last 10
set #floor = 50
set #resultLimit = 10
set #total = #floor + #resultLimit
declare #tmp0 table(
--table body
)
declare #tmp1 table(
--table body
)
--this line will drop the wanted results from whatever table we're selecting from
insert into #tmp0
select Top #total --what to select (the where, from, etc)
--using floor, insert the part we don't want into the second tmp table
insert into #tmp1
select top #floor * from #tmp0
--using select except, exclude top x results from the query
select * from #tmp0
except
select * from #tmp1
I've come up with a solution to this that doesn't require you to know the number of row returned.
For example, if you want to get all the locations logged in a table, except the latest 1 (or 2, or 5, or 34)
SELECT *
FROM
(SELECT ROW_NUMBER() OVER (ORDER BY CreatedDate) AS Row, *
FROM Locations
WHERE UserId = 12345) AS SubQuery
WHERE Row > 1 -- or 2, or 5, or 34
Querying a simple subquery sorted descending, followed by sorting on the same column ascending does the trick.
SELECT * FROM
(SELECT TOP 200 * FROM [table] t2 ORDER BY t2.[column] DESC) t1
ORDER BY t1.[column]
SELECT TOP 10*from TABLE1 ORDER BY ID DESC
Where ID is the primary key of the TABLE1.
SELECT columns FROM My_Table LIMIT 200 OFFSET (SELECT Count(*)-200 My_Table)

How to read the last row with SQL Server

What is the most efficient way to read the last row with SQL Server?
The table is indexed on a unique key -- the "bottom" key values represent the last row.
If you're using MS SQL, you can try:
SELECT TOP 1 * FROM table_Name ORDER BY unique_column DESC
select whatever,columns,you,want from mytable
where mykey=(select max(mykey) from mytable);
You'll need some sort of uniquely identifying column in your table, like an auto-filling primary key or a datetime column (preferably the primary key). Then you can do this:
SELECT * FROM table_name ORDER BY unique_column DESC LIMIT 1
The ORDER BY column tells it to rearange the results according to that column's data, and the DESC tells it to reverse the results (thus putting the last one first). After that, the LIMIT 1 tells it to only pass back one row.
If some of your id are in order, i am assuming there will be some order in your db
SELECT * FROM TABLE WHERE ID = (SELECT MAX(ID) FROM TABLE)
I think below query will work for SQL Server with maximum performance without any sortable column
SELECT * FROM table
WHERE ID not in (SELECT TOP (SELECT COUNT(1)-1
FROM table)
ID
FROM table)
Hope you have understood it... :)
I tried using last in sql query in SQl server 2008 but it gives this err:
" 'last' is not a recognized built-in function name."
So I ended up using :
select max(WorkflowStateStatusId) from WorkflowStateStatus
to get the Id of the last row.
One could also use
Declare #i int
set #i=1
select WorkflowStateStatusId from Workflow.WorkflowStateStatus
where WorkflowStateStatusId not in (select top (
(select count(*) from Workflow.WorkflowStateStatus) - #i ) WorkflowStateStatusId from .WorkflowStateStatus)
You can use last_value: SELECT LAST_VALUE(column) OVER (PARTITION BY column ORDER BY column)...
I test it at one of my databases and it worked as expected.
You can also check de documentation here: https://msdn.microsoft.com/en-us/library/hh231517.aspx
OFFSET and FETCH NEXT are a feature of SQL Server 2012 to achieve SQL paging while displaying results.
The OFFSET argument is used to decide the starting row to return rows from a result and FETCH argument is used to return a set of number of rows.
SELECT *
FROM table_name
ORDER BY unique_column desc
OFFSET 0 Row
FETCH NEXT 1 ROW ONLY
SELECT TOP 1 id from comission_fees ORDER BY id DESC
In order to retrieve the last row of a table for MS SQL database 2005, You can use the following query:
select top 1 column_name from table_name order by column_name desc;
Note: To get the first row of the table for MS SQL database 2005, You can use the following query:
select top 1 column_name from table_name;
If you don't have any ordered column, you can use the physical id of each lines:
SELECT top 1 sys.fn_PhysLocFormatter(%%physloc%%) AS [File:Page:Slot],
T.*
FROM MyTable As T
order by sys.fn_PhysLocFormatter(%%physloc%%) DESC
SELECT * from Employees where [Employee ID] = ALL (SELECT MAX([Employee ID]) from Employees)
This is how you get the last record and update a field in Access DB.
UPDATE compalints SET tkt = addzone &'-'& customer_code &'-'& sn where sn in (select max(sn) from compalints )
If you have a Replicated table, you can have an Identity=1000 in localDatabase and Identity=2000 in the clientDatabase, so if you catch the last ID you may find always the last from client, not the last from the current connected database.
So the best method which returns the last connected database is:
SELECT IDENT_CURRENT('tablename')
Well I'm not getting the "last value" in a table, I'm getting the Last value per financial instrument. It's not the same but I guess it is relevant for some that are looking to look up on "how it is done now". I also used RowNumber() and CTE's and before that to simply take 1 and order by [column] desc. however we nolonger need to...
I am using SQL server 2017, we are recording all ticks on all exchanges globally, we have ~12 billion ticks a day, we store each Bid, ask, and trade including the volumes and the attributes of a tick (bid, ask, trade) of any of the given exchanges.
We have 253 types of ticks data for any given contract (mostly statistics) in that table, the last traded price is tick type=4 so, when we need to get the "last" of Price we use :
select distinct T.contractId,
LAST_VALUE(t.Price)over(partition by t.ContractId order by created ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
from [dbo].[Tick] as T
where T.TickType=4
You can see the execution plan on my dev system it executes quite efficient, executes in 4 sec while the exchange import ETL is pumping data into the table, there will be some locking slowing me down... that's just how live systems work.
It is very simple
select top 10 * from TableName order by 1 desc
SELECT * FROM TABLE WHERE ID = (SELECT MAX(ID) FROM TABLE)
I am pretty sure that it is:
SELECT last(column_name) FROM table
Becaause I use something similar:
SELECT last(id) FROM Status

Delete all but top n from database table in SQL

What's the best way to delete all rows from a table in sql but to keep n number of rows on the top?
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Edit:
Chris brings up a good performance hit since the TOP 10 query would be run for each row. If this is a one time thing, then it may not be as big of a deal, but if it is a common thing, then I did look closer at it.
I would select ID column(s) the set of rows that you want to keep into a temp table or table variable. Then delete all the rows that do not exist in the temp table. The syntax mentioned by another user:
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Has a potential problem. The "SELECT TOP 10" query will be executed for each row in the table, which could be a huge performance hit. You want to avoid making the same query over and over again.
This syntax should work, based what you listed as your original SQL statement:
create table #nuke(NukeID int)
insert into #nuke(Nuke) select top 1000 id from article
delete article where not exists (select 1 from nuke where Nukeid = id)
drop table #nuke
Future reference for those of use who don't use MS SQL.
In PostgreSQL use ORDER BY and LIMIT instead of TOP.
DELETE FROM table
WHERE id NOT IN (SELECT id FROM table ORDER BY id LIMIT n);
MySQL -- well...
Error -- This version of MySQL does not yet support 'LIMIT &
IN/ALL/ANY/SOME subquery'
Not yet I guess.
Here is how I did it. This method is faster and simpler:
Delete all but top n from database table in MS SQL using OFFSET command
WITH CTE AS
(
SELECT ID
FROM dbo.TableName
ORDER BY ID DESC
OFFSET 11 ROWS
)
DELETE CTE;
Replace ID with column by which you want to sort.
Replace number after OFFSET with number of rows which you want to keep.
Choose DESC or ASC - whatever suits your case.
I think using a virtual table would be much better than an IN-clause or temp table.
DELETE
Product
FROM
Product
LEFT OUTER JOIN
(
SELECT TOP 10
Product.id
FROM
Product
) TopProducts ON Product.id = TopProducts.id
WHERE
TopProducts.id IS NULL
This really is going to be language specific, but I would likely use something like the following for SQL server.
declare #n int
SET #n = SELECT Count(*) FROM dTABLE;
DELETE TOP (#n - 10 ) FROM dTable
if you don't care about the exact number of rows, there is always
DELETE TOP 90 PERCENT FROM dTABLE;
I don't know about other flavors but MySQL DELETE allows LIMIT.
If you could order things so that the n rows you want to keep are at the bottom, then you could do a DELETE FROM table LIMIT tablecount-n.
Edit
Oooo. I think I like Cory Foy's answer better, assuming it works in your case. My way feels a little clunky by comparison.
I would solve it using the technique below. The example expect an article table with an id on each row.
Delete article where id not in (select top 1000 id from article)
Edit: Too slow to answer my own question ...
Refactored?
Delete a From Table a Inner Join (
Select Top (Select Count(tableID) From Table) - 10)
From Table Order By tableID Desc
) b On b.tableID = A.tableID
edit: tried them both in the query analyzer, current answer is fasted (damn order by...)
Better way would be to insert the rows you DO want into another table, drop the original table and then rename the new table so it has the same name as the old table
I've got a trick to avoid executing the TOP expression for every row. We can combine TOP with MAX to get the MaxId we want to keep. Then we just delete everything greater than MaxId.
-- Declare Variable to hold the highest id we want to keep.
DECLARE #MaxId as int = (
SELECT MAX(temp.ID)
FROM (SELECT TOP 10 ID FROM table ORDER BY ID ASC) temp
)
-- Delete anything greater than MaxId. If MaxId is null, there is nothing to delete.
IF #MaxId IS NOT NULL
DELETE FROM table WHERE ID > #MaxId
Note: It is important to use ORDER BY when declaring MaxId to ensure proper results are queried.