How to implement deduplication in a billion-row table (SSIS / SQL)

Which is the best option to implement a DISTINCT operation in SSIS?
I have a table with more than 200 columns that contains more than 10 million rows.
I need to get the distinct rows from this table. Is it wise to use an Execute SQL Task (with a SELECT query to deduplicate the rows), or is there another way to achieve this in SSIS?
I do understand that the SSIS Sort component can deduplicate rows, but it is a blocking component, so it is not at all a good idea to use. Please let me know your views on this.

I have done it in 3 steps this way:
1. Dump the MillionRow table into a HashDump table, which has only 2 columns: Id (int identity, PK) and Hash (varbinary(20)). This table should be indexed on its Hash column.
2. Dump the HashDump table into HashUni, ordered by the Hash column. In between is a Script Component that checks whether the current row's Hash value is the same as the previous row's. If it is, direct the row to the Duplicate output, otherwise to Unique. This way you can log the duplicates even if what you need is just the unique rows.
3. Dump the MillionRow table into a MillionUni table. In between is a Lookup Component that uses HashUni to tell which rows are unique.
This method allows me to log each duplicate with a message such as: "Row 1000 is a duplicate of row 100".
I have not found a better way than this. Earlier, I created a unique index on MillionUni and dumped MillionRow directly into it, but then I could not use "fast load", and without it the load was way too slow.
Here is one way to populate the Hash column:
// Requires: using System.Text; and using System.Security.Cryptography; at the top of the Script Component.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Concatenate the column values with a separator so "AB"+"C" and "A"+"BC" hash differently.
    StringBuilder sb = new StringBuilder();
    sb.Append(Row.Col1String_IsNull ? "" : Row.Col1String); sb.Append("|");
    sb.Append(Row.Col2Num_IsNull ? "" : Row.Col2Num.ToString()); sb.Append("|");
    sb.Append(Row.Col3Date_IsNull ? "" : Row.Col3Date.ToString("yyyy-MM-dd"));

    // SHA1 of the concatenated string: 20 bytes, matching the varbinary(20) Hash column.
    var sha1Provider = HashAlgorithm.Create("SHA1");
    Row.Hash = sha1Provider.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
}
If concatenating 200 columns proves to be a chore, part of this article may inspire you: it loops over all the column objects and appends their values into a single string.
And to compare the Hash, use this method:
// Requires: using System.Collections; (for StructuralComparisons).
// The input must arrive sorted by Hash for the previous-row comparison to work.
byte[] previousHash;
int previousRowNo;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Compare the byte arrays element by element, not by reference.
    if (StructuralComparisons.StructuralEqualityComparer.Equals(Row.Hash, previousHash))
    {
        Row.DupRowNo = previousRowNo;   // remember which row this one duplicates
        Row.DirectRowToDuplicate();
    }
    else
    {
        Row.DirectRowToUnique();
    }
    previousHash = Row.Hash;
    previousRowNo = Row.RowNo;
}
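As a side note, this is not part of the approach above, but if scripting the concatenation for 200 columns gets painful you could also compute the same kind of hash on the SQL Server side with HASHBYTES. A rough sketch, with placeholder names and only three columns shown (note that before SQL Server 2016, HASHBYTES accepts at most 8000 bytes of input):

-- Sketch only: dbo.MillionRow / dbo.HashDump and the column names are assumptions.
INSERT INTO dbo.HashDump (Hash)
SELECT HASHBYTES('SHA1',
           CONCAT(ISNULL(Col1String, ''), '|',
                  ISNULL(CONVERT(varchar(30), Col2Num), ''), '|',
                  ISNULL(CONVERT(varchar(10), Col3Date, 23), '')))   -- style 23 = yyyy-mm-dd
FROM dbo.MillionRow;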

I wouldn't bother with SSIS for this; a couple of queries will do. You also have a lot of data, so I suggest you check the execution plan before running the queries and optimize your indexes.
Check out a small article I wrote on the same topic: http://www.brijrajsingh.com/2011/03/delete-duplicate-record-but-keep.html
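The usual pattern for this kind of delete-duplicates-but-keep-one query is a ROW_NUMBER() based delete. A minimal sketch, with a placeholder table and only a few of the 200 columns listed:

-- Sketch only: dbo.MyTable and the column list are placeholders.
;WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3   -- list all the columns that define a duplicate
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.MyTable
)
DELETE FROM numbered
WHERE rn > 1;   -- keeps exactly one row per duplicate group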

As far as I know, the Sort component is the only transformation that lets you de-duplicate rows. Alternatively, you could use a SQL command.
If the sorting operation is a problem, then (assuming your source is a database) you should use "SQL Command" as the Data Access Mode. SELECT DISTINCT your data and that's it; you may also save a bit of time, as the ETL won't have to go through the Sort component.
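For example, with an OLE DB Source set to "SQL Command" data access mode, the command could be as simple as this (table and column names are placeholders):

SELECT DISTINCT Col1, Col2, Col3   -- list only the columns you actually need
FROM dbo.MillionRow;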

Related

Performance issue when using bind variable for a large list inside the IN clause

I'm using Sybase and had some code that looked like this:
String[] ids = ... an array containing 80-90k strings, which is retrieved from another table and varies.
for (String id : ids) {
// wrap every id with single-quotes
}
String idsAsString = String.join(",", ids);
String query = String.format("select * from someTable where idName in (%s)", idsAsString);
getNamedParameterJDBCTemplate().query(query, resultSetExtractor ->{
// do stuff with results
});
I've timed how long it took to get to the inner body of the resultSetExtractor and it never took longer than 4 seconds.
But to secure the code, I tried going the bind variable route. Thus, that code looked like the following:
String[] ids = ... an array containing 80-90k strings, which is retrieved from another table and varies.
String query = "select * from someTable where idName in (:ids)";
Map<String, Object> params = new HashMap<>();
params.put("ids", Arrays.asList(ids));
getNamedParameterJDBCTemplate().query(query, params, resultSetExtractor ->{
// do stuff with results
});
But doing it this way will take up to 4-5 minutes to finally spew out the following exception:
21-10-2019 14:04:01 DEBUG DefaultConnectionTester:126 - Testing a Connection in response to an Exception:
com.sybase.jdbc4.jdbc.SybSQLException: The token datastream length was not correct. This is an internal protocol error.
I also have other bits of code where I pass in arrays of sizes 1-10 as bind variables and noticed that those queries went from being instantaneous to taking up to 10 seconds.
I'm surprised doing it the bind variable way is at all different, let alone that drastically different. Can someone explain what is going on here? Is it that bind variable does something different underneath the hood as opposed to sending a formatted string through JDBC? Is there another way to secure my code without drastically slowing performance?
You should verify what's actually happening at the database end via a showplan/query plan, but an IN clause will at best usually do one index search for every value in the list: 10 values do ten searches, and 80k values do 80k of them, which is massively slower. Oracle actually prohibits putting more than 1000 values in an IN clause, and while Sybase is not so restrictive, that doesn't mean it's a good idea. You risk stack and other issues in your database by passing massive numbers of values this way; I've seen such a query take out a production database instance with a stack failure.
It's much better to create a temporary table, load the 80k values into it, and do an inner join between the temporary table and the main table on the column you previously searched with the IN clause.
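A minimal sketch of that approach in Sybase/T-SQL syntax, with placeholder names:

-- 1. Temp table to hold the ids (length and name are assumptions)
CREATE TABLE #ids (idName varchar(40) NOT NULL)
CREATE INDEX ix_ids ON #ids (idName)

-- 2. Load the 80-90k ids into #ids; since they already come from another table,
--    a single set-based insert avoids sending them over the wire at all
INSERT INTO #ids (idName)
SELECT idName FROM otherTable

-- 3. Join instead of the huge IN list
SELECT t.*
FROM someTable t
INNER JOIN #ids i ON i.idName = t.idName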

How to chain SQL, Text and scan queries in Apache Ignite

We have a clustered Ignite cache in which we plan to store a huge amount of data (in excess of 100 million records). We are currently using SQL queries to search for records using indices. But we have a requirement for some free-text searches, and we are planning to evaluate how Text Queries work. The free-text search will be combined with some SQL constraints so that the result set is not huge. I was hoping to find a way to use the Text search (and maybe the Scan search) on the result of a SQL search, which I think would give a lot more flexibility and power to Ignite's query framework. Is there a way to achieve this? We use native persistence and a replicated cache in our system.
All query kinds (Scan, SQL and Text) are independent of each other; you can't use SQL on top of a Text query result directly.
You can try to execute local Text queries on all nodes, and then filter the results manually (not using SQL, just Java code), e.g.:
Collection<List<Cache.Entry<Object, Object>>> results = ignite.compute().broadcast(() -> {
    IgniteCache<Object, Object> cache = Ignition.localIgnite().<Object, Object>cache("foo");
    TextQuery<Object, Object> qry = new TextQuery<Object, Object>(Value.class, "str").setLocal(true);

    try (QueryCursor<Cache.Entry<Object, Object>> cursor = cache.query(qry)) {
        return StreamSupport.stream(cursor.spliterator(), false)
            .filter(e -> needToReturnEntry(e))
            .collect(Collectors.toList());
    }
});

List<Cache.Entry<Object, Object>> combinedResults = results.stream()
    .flatMap(Collection::stream)
    .collect(Collectors.toList());
needToReturnEntry(e) here needs to be implemented to do the same filtering as SQL constraints would.
Another way is to retrieve a list of primary keys from the Text query, and then add that to the SQL query. This will work if the number of keys isn't too big.
select * from TABLE where pKey in (<keys from Text Query>) and <other constraints>

How to update and select records in the same sql query

As a typical scenario in any prod environment, we have multiple nodes that fetch and process items from the database (Oracle).
We want to make sure that each node fetches a unique set of items from the database and acts on it. To make this possible, we are looking at whether it is possible to update the records' status (e.g., from Idle to In-Process) and have the same update query return the records it updated. That way every node will act on its own set of records and not interfere with the others' sets.
We want to avoid PL/SQL for maintenance reasons. We tried SELECT ... FOR UPDATE, but in a few cases it led to database locks being held for long periods of time.
Any suggestions on how to achieve this through plain SQL or Hibernate (since we have the Hibernate option available as well)?
A couple of thoughts on this. First, in Oracle you can use the RETURNING clause as part of an UPDATE statement to return selected columns (such as the primary key) from the table being updated into a collection. This method does require PL/SQL, since you need to work with collections, although BULK operations mitigate some of the drawbacks of using PL/SQL.
Another option would be to add a column to your table indicating which node is processing each record, similar to your idea of a status column with Idle or Processing values. It would be NULL when a record is not being handled, or hold a value uniquely identifying the node or process working on the record.
A little extra research led to this post here on Stack about using Oracle's RETURNING INTO clause with Java. It also leads right back to Oracle's own documentation on its DML Returning feature as supported by Java.
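A minimal PL/SQL sketch of that RETURNING ... BULK COLLECT idea; the table and column names (my_queue, id, status) are made up for illustration:

DECLARE
  TYPE t_id_tab IS TABLE OF my_queue.id%TYPE;
  v_ids t_id_tab;
BEGIN
  -- Claim up to 100 idle rows and get their keys back in one statement
  UPDATE my_queue
     SET status = 'IN_PROCESS'
   WHERE status = 'IDLE'
     AND ROWNUM <= 100
  RETURNING id BULK COLLECT INTO v_ids;
  COMMIT;
  -- v_ids now holds the primary keys of the rows this node just claimed
END;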
Finally we were able to find a solution to our problem. Our problem statement: claim the top 100 items from the list, ordered by their creation time, i.e. FIFO. Each node picks up the top 100 items from the database and starts processing them, so each node works on its own set of items without overlapping with the others.
We achieved this by creating a TYPE in the Oracle database, and then using Hibernate to claim the items and store the claimed keys temporarily in that TYPE. Here is the code:
create type TMP_TYPE as table of VARCHAR2(1000);
//Hibernate code
String query = "BEGIN UPDATE OUR_TABLE SET OUR_TABLE_STATUS = 'IP' WHERE OUR_TABLE_STATUS = 'ID' AND ID_OUR_TABLE IN (SELECT ID_OUR_TABLE FROM (SELECT ID_OUR_TABLE FROM OUR_TABLE ORDER BY AGEING_SINCE ASC ) ) AND ROWNUM < 101 RETURNING UUID BULK COLLECT INTO ?;END;";
Connection connection = getSession().connection();
CallableStatement cStmt = connection.prepareCall(query);
cStmt.registerOutParameter(1, Types.ARRAY, "TMP_TYPE");
cStmt.execute();
String[] updateBulkCollectArr = (String[]) (cStmt.getArray(1).getArray());
Got the idea from here: Oracle Type and Bulk Collect.
Thanks @Sentinel

Query fast without search, slow with search, but with search fast in SSMS

I have this function that takes data from the database and also supports search. The problem is that when I search with Entity Framework it's slow, but if I take the same query from the log and run it in SSMS it's fast. I must also say that there are a lot of movies: 388,262. I also tried adding an index on Movie.title, but it didn't help.
Query I use in SSMS:
SELECT *
FROM Movie
WHERE title LIKE '%pirate%'
ORDER BY @@ROWCOUNT
OFFSET 0 ROWS FETCH NEXT 30 ROWS ONLY
Entity Framework code (_movieRepository.GetAll() returns an IQueryable, not all movies):
public IActionResult Index(MovieIndexViewModel vm) {
    IQueryable<Movie> query = _movieRepository.GetAll().AsNoTracking();

    if (!string.IsNullOrWhiteSpace(vm.Search)) {
        query = query.Where(m => m.title.ToLower().Contains(vm.Search.ToLower()));
    }

    vm.TotalItemCount = query.Count();
    vm.Movies = query.Skip(_pageSize * (vm.Page - 1)).Take(_pageSize);
    vm.PageSize = _pageSize;
    return View(vm);
}
Caveat: I don't have much experience with Entity Framework.
However, you might find useful debugging tips in the Entity Framework performance article from Simple Talk. Looking at what you've posted, you might be able to improve your query performance by:
Choosing only the specific columns you're interested in (it sounds like you're only interested in the title column).
Paying special attention to your data types. You might want to convert your NVARCHAR variables to VARCHAR(40) (or some appropriate character limit) so they match the column, as in the sketch below.
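As a sketch only (assuming Movie.title is a VARCHAR column; the names, lengths and ordering are illustrative), the SQL you want EF to end up sending would look roughly like this:

-- If Movie.title is VARCHAR, make sure the parameter is VARCHAR too,
-- otherwise SQL Server has to implicitly convert every row before comparing.
DECLARE @search varchar(40) = '%pirate%';

SELECT title                -- select only the columns you need
FROM Movie
WHERE title LIKE @search
ORDER BY title
OFFSET 0 ROWS FETCH NEXT 30 ROWS ONLY;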
Try removing all of the ToLower() calls:
if (!string.IsNullOrWhiteSpace(vm.Search)) {
    query = query.Where(m => m.title.Contains(vm.Search));
}
SQL Server (unlike C#) is not case-sensitive by default (though you can configure it to be). Your query is forcing SQL Server to lower-case every record in the table before doing the comparison.
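If you want to confirm the database really is case-insensitive before dropping the ToLower() calls, you can check its collation; a CI collation (for example SQL_Latin1_General_CP1_CI_AS) means comparisons ignore case:

SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;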

Improve update SQL query performance

I have a SQL database containing stock bars downloaded from Yahoo!. I'm trying to create some indicators to analyze these stocks (e.g. a Simple Moving Average). I am concerned about the performance of my query, which is simply UPDATE #stockname SET SMA = #value WHERE id = #n. Updating 2000 rows takes 2 minutes. I tried a stored procedure, but the result is almost the same.
for (int i = 0; i < closing_prices.Count - length; i++)
{
    // Compute the indicator value (e.g. SMA) for the current window
    double signalValue = Selector.SignalProcessor(Signal,
        closing_prices.GetRange(i, length), length);

    // Write the value into the database, one row at a time
    string location = Convert.ToString(i + length + 1);
    this.UpdateWithSingleCondition("_" + Instrument, columnName,
        signalValue.ToString(), "id", location, "=", sql_Connection);
}
This loop calls the stored procedure to update the SMA column each time a new value is generated. Is there any way to write the entire column to the database at once? I think that would save time. In any case, updating 500 rows in 2 minutes sounds very slow.
Could you tell me how to improve the execution time of my query?
Instead of writing values out one at a time, you could use a stored procedure with a table-valued parameter to ship all the data from your app to the DB in a single operation, then MERGE the data into your table, saving a lot of round-tripping.
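A rough sketch of that idea (every name here is a placeholder, not the poster's actual schema):

-- Table type holding (id, SMA) pairs to send in one round trip
CREATE TYPE dbo.SmaUpdate AS TABLE (id int PRIMARY KEY, sma float);
GO
CREATE PROCEDURE dbo.UpdateSma @rows dbo.SmaUpdate READONLY
AS
BEGIN
    MERGE dbo.StockBars AS target
    USING @rows AS src
       ON target.id = src.id
    WHEN MATCHED THEN
        UPDATE SET target.SMA = src.sma;
END
GO

From the application you would then fill a DataTable with the (id, SMA) pairs and pass it as a structured parameter in a single call, instead of issuing 2000 separate UPDATEs.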
Analyze your performance; you must have SOME bottleneck, because your update count is really low. You should easily be able to do 10-30 updates per second, which would translate to a lot more than you are seeing in 2 minutes, and that is on a stock computer, not even one specced for a database (which would mean many fast discs).
Do a performance analysis on SQL Server and find your bottlenecks. Do you have all the indexes you need?
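In particular, make sure the id column used in the WHERE clause is indexed (or is the primary key); without that, every single-row UPDATE scans the whole table. A hedged example with a placeholder table name:

-- Placeholder name; use the per-instrument table from the code above.
CREATE UNIQUE INDEX ix_stock_id ON dbo.YourStockTable (id);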
I would create a stored procedure that receives a string. This string is an XML or delimited string.
Then use one of the many string-to-table functions floating around:
(delimited string) http://blogs.x2line.com/al/articles/150.aspx
(XML) http://kennyshu.blogspot.com/2007/12/convert-xml-file-to-table-in-sql-2005.html
and convert the string into a temp table.
Then perform an insert from the temp table into the destination table.
This way you make one call to the DB server and avoid chatter. It's a LOT faster than multiple calls.
Avoid table-valued parameters, since you can't call them from code.
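For what it's worth, on SQL Server 2016 and later you no longer need a custom split function; the built-in STRING_SPLIT does the same job. A minimal sketch with placeholder names:

-- Sketch only: delimited string -> temp table -> single set-based insert.
DECLARE @csv varchar(max) = '101.5,102.3,99.8';

SELECT CAST(value AS float) AS sma
INTO #incoming
FROM STRING_SPLIT(@csv, ',');

INSERT INTO dbo.DestinationTable (SMA)
SELECT sma FROM #incoming;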
First disable the foreign key constraints, then enable them again afterwards:
To disable them, use "ALTER TABLE" with "NOCHECK CONSTRAINT ALL".
To enable them again, use "ALTER TABLE" with "WITH CHECK CHECK CONSTRAINT ALL".
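Spelled out, and with a placeholder table name, the statements look like this:

-- Disable all foreign key / check constraints on the table before the bulk update
ALTER TABLE dbo.YourStockTable NOCHECK CONSTRAINT ALL;

-- ... run the updates ...

-- Re-enable them afterwards, re-validating existing data
ALTER TABLE dbo.YourStockTable WITH CHECK CHECK CONSTRAINT ALL;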