How to chain SQL, Text and Scan queries in Apache Ignite

We have a clustered Ignite cache in which we plan to store a huge amount of data (in excess of 100 million records). We currently use SQL queries to search for records using indices, but we have a requirement for some free-text searches and were planning to evaluate how Text queries can work. The free-text search will be used in conjunction with SQL constraints so that the result set is not huge. I was hoping to find a way to use a Text query (and maybe a Scan query) on the result of a SQL query, which I think would give a lot more flexibility and power to Ignite's query framework. Is there a way to achieve this? We use native persistence and a replicated cache in our system.

All query kinds - Scan, SQL and Text - are independent from each other. You can't run SQL on top of a Text query result directly.
You can try to execute local Text queries on all nodes and then filter the results manually (not using SQL, just Java code). E.g.:
Collection<List<Cache.Entry<Object, Object>>> results = ignite.compute().broadcast(() -> {
    IgniteCache<Object, Object> cache = Ignition.localIgnite().cache("foo");
    // Run the Lucene text query against this node's local data only.
    TextQuery<Object, Object> qry = new TextQuery<Object, Object>(Value.class, "str").setLocal(true);
    try (QueryCursor<Cache.Entry<Object, Object>> cursor = cache.query(qry)) {
        return StreamSupport.stream(cursor.spliterator(), false)
            .filter(e -> needToReturnEntry(e))
            .collect(Collectors.toList());
    }
});
// Merge the per-node lists into one combined result.
List<Cache.Entry<Object, Object>> combinedResults = results.stream()
    .flatMap(Collection::stream)
    .collect(Collectors.toList());
needToReturnEntry(e) here needs to be implemented to apply the same filtering that your SQL constraints would; see the sketch below.
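For illustration only, a minimal sketch of such a filter might look like this. It assumes the cached value is your Value class and that it has isActive() and getPrice() accessors; those names are hypothetical, not part of the original code or the Ignite API.

static boolean needToReturnEntry(Cache.Entry<Object, Object> e) {
    // Re-apply the SQL WHERE constraints in plain Java.
    // Value, isActive() and getPrice() are assumed names for illustration.
    Value v = (Value) e.getValue();
    return v.isActive() && v.getPrice() < 100;
}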
Another way is to retrieve the list of primary keys from the Text query and then add them to the SQL query. This works if the number of keys isn't too big:
select * from TABLE where pKey in (<keys from Text Query>) and <other constraints>
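In Java, that key-chaining approach could look roughly like the sketch below. It assumes the same cache as above; _key is Ignite's built-in key column, but the Value table, the price column and the constraint value are invented placeholders for the "other constraints".

// Step 1: collect the primary keys matched by the text query.
List<Object> keys = new ArrayList<>();
try (QueryCursor<Cache.Entry<Object, Object>> cursor =
         cache.query(new TextQuery<Object, Object>(Value.class, "str"))) {
    for (Cache.Entry<Object, Object> e : cursor)
        keys.add(e.getKey());
}

// Step 2: bind the keys into the IN clause, one '?' placeholder per key.
// (Guard against an empty key list before running the SQL query.)
String placeholders = String.join(",", Collections.nCopies(keys.size(), "?"));
SqlFieldsQuery sql = new SqlFieldsQuery(
    "select * from Value where _key in (" + placeholders + ") and price < ?");
Object[] args = new Object[keys.size() + 1];
for (int i = 0; i < keys.size(); i++)
    args[i] = keys.get(i);
args[keys.size()] = 100; // placeholder for the "other constraints"
sql.setArgs(args);

try (QueryCursor<List<?>> rows = cache.query(sql)) {
    rows.forEach(System.out::println);
}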

Related

Performance issue when using bind variable for a large list inside the IN clause

I'm using Sybase and had some code that looked like this:
String[] ids = ... // an array of 80-90k strings, retrieved from another table; contents vary

// wrap every id in single quotes
for (int i = 0; i < ids.length; i++) {
    ids[i] = "'" + ids[i] + "'";
}

String idsAsString = String.join(",", ids);
String query = String.format("select * from someTable where idName in (%s)", idsAsString);
getNamedParameterJDBCTemplate().query(query, resultSetExtractor -> {
    // do stuff with results
});
I've timed how long it took to get to the inner body of the resultSetExtractor and it never took longer than 4 seconds.
But to secure the code, I tried going the bind variable route. Thus, that code looked like the following:
String[] ids = ... // the same 80-90k strings as above

String query = "select * from someTable where idName in (:ids)";
Map<String, Object> params = new HashMap<>();
params.put("ids", Arrays.asList(ids));
getNamedParameterJDBCTemplate().query(query, params, resultSetExtractor -> {
    // do stuff with results
});
But doing it this way, the query runs for 4-5 minutes and then finally spews out the following exception:
21-10-2019 14:04:01 DEBUG DefaultConnectionTester:126 - Testing a Connection in response to an Exception:
com.sybase.jdbc4.jdbc.SybSQLException: The token datastream length was not correct. This is an internal protocol error.
I also have other bits of code where I pass in arrays of sizes 1-10 as bind variables and noticed that those queries went from being instantaneous to taking up to 10 seconds.
I'm surprised that the bind variable way is at all different, let alone that drastically different. Can someone explain what is going on here? Do bind variables do something different under the hood, as opposed to sending a formatted string through JDBC? Is there another way to secure my code without drastically slowing performance?
You should verify what's actually happening at the database end via a showplan/query plan, but an IN clause will at best usually do one index search for every value in the list: 10 values do ten searches, and 80k values do 80k of them, which is massively slower. Oracle outright prohibits more than 1000 values in an IN clause, and while Sybase is not so restrictive, that doesn't mean it's a good idea. You risk stack and other issues in your database by passing massive value lists this way; I've seen such a query take out a production database instance with a stack failure.
It's much better to create a temporary table, load the 80k values into it, and do an inner join between the temporary table and the main table on the column you previously searched with the IN clause; see the sketch below.
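In plain JDBC terms, the temp-table approach might look like the following sketch. It assumes connection is an open java.sql.Connection to Sybase; #ids is a session-scoped temp table, and someTable/idName follow the question.

// 1. Create a session-scoped temp table for the ids.
try (Statement stmt = connection.createStatement()) {
    stmt.execute("create table #ids (id varchar(40) not null)");
}

// 2. Batch-insert the 80-90k ids; no giant IN list is ever built.
try (PreparedStatement ps = connection.prepareStatement("insert into #ids (id) values (?)")) {
    for (String id : ids) {
        ps.setString(1, id);
        ps.addBatch();
    }
    ps.executeBatch();
}

// 3. Join against the temp table instead of using an IN clause.
try (Statement stmt = connection.createStatement();
     ResultSet rs = stmt.executeQuery(
         "select t.* from someTable t inner join #ids i on t.idName = i.id")) {
    while (rs.next()) {
        // do stuff with results
    }
}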

HBase database : Is it possible to filter on a column qualifier using Java API/REST API?

I'm completely new to HBase and I'm used to RDBMSs, where I can use the WHERE clause to filter records.
So, is there something similar to that in the Java API or REST API exposed by HBase, to filter records by a column qualifier?
Yes, it's possible.
If you want to get only certain column qualifiers, use the addColumn(byte[] family, byte[] qualifier) method of your Get or Scan instance. This is the most efficient way to query the qualifiers you need, because only the HBase Stores holding the requested columns have to be accessed. Example of usage:
Get get = new Get(Bytes.toBytes("rowKey"));
get.addColumn(Bytes.toBytes("columnFamily"), Bytes.toBytes("Qual"));

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("columnFamily"), Bytes.toBytes("Qual1"));
scan.addColumn(Bytes.toBytes("columnFamily"), Bytes.toBytes("Qual2"));
If you need a more sophisticated way to filter your qualifiers, you can use the QualifierFilter class from the Java API. Here is how you can query all columns with a certain qualifier:
Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("columnQual")));

Get get = new Get(Bytes.toBytes("rowKey"));
get.setFilter(filter);

Scan scan = new Scan();
scan.setFilter(filter);
You can also read about the other HBase filters, and how to combine them, in the official HBase documentation; a small combination example follows.
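For instance, here is a minimal sketch (the qualifier names are made up) that combines two QualifierFilters with a FilterList so that a column matches if either qualifier matches:

// MUST_PASS_ONE ORs the filters together; MUST_PASS_ALL would AND them.
List<Filter> anyOf = new ArrayList<>();
anyOf.add(new QualifierFilter(CompareFilter.CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("QualA"))));
anyOf.add(new QualifierFilter(CompareFilter.CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("QualB"))));
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE, anyOf);

Scan scan = new Scan();
scan.setFilter(filterList);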

Query fast without search, slow with search, but with search fast in SSMS

I have a function that takes data from the database and also supports search. The problem is that when I search with Entity Framework it's slow, but if I take the same query from the log and run it in SSMS it's fast. I should also mention that there are a lot of movies: 388,262. I also tried adding an index on Movie's title column, but it didn't help.
Query I use in SSMS:
SELECT *
FROM Movie
WHERE title LIKE '%pirate%'
ORDER BY @@ROWCOUNT
OFFSET 0 ROWS FETCH NEXT 30 ROWS ONLY
Entity code (_movieRepository.GetAll() returns an IQueryable, not all movies):
public IActionResult Index(MovieIndexViewModel vm) {
    IQueryable<Movie> query = _movieRepository.GetAll().AsNoTracking();
    if (!string.IsNullOrWhiteSpace(vm.Search)) {
        query = query.Where(m => m.title.ToLower().Contains(vm.Search.ToLower()));
    }
    vm.TotalItemCount = query.Count();
    vm.Movies = query.Skip(_pageSize * (vm.Page - 1)).Take(_pageSize);
    vm.PageSize = _pageSize;
    return View(vm);
}
Caveat: I don't have much experience with Entity Framework.
However, you might find useful debugging tips in the Entity Framework Performance article from Simple Talk. Looking at what you've posted, you might be able to improve your query performance by:
Selecting only the specific columns you're interested in (it sounds like you only need the 'Title' column).
Paying special attention to your data types. You might want to convert your NVARCHAR variables to VARCHAR(40) (or some appropriate character limit).
Try removing all of the ToLower() calls:
if (!string.IsNullOrWhiteSpace(vm.Search)) {
    query = query.Where(m => m.title.Contains(vm.Search));
}
SQL Server (unlike C#) is not case sensitive by default (though you can configure it to be). Your query forces SQL Server to lower-case every record in the table and then do the comparison.

How to implement deduplication in a billion rows table ssis

Which is the best option to implement a distinct operation in SSIS?
I have a table with more than 200 columns that contains more than 10 million rows.
I need to get the distinct rows from this table. Is it wise to use an Execute SQL Task (with a SELECT DISTINCT query to deduplicate the rows), or is there another way to achieve this in SSIS?
I do understand that the SSIS Sort component can deduplicate rows, but it is a blocking component, so it is not a good idea to use it. Please let me know your views on this.
I have done it in 3 steps this way:
Dump the MillionRow table into a HashDump table, which has only 2 columns: Id int identity PK, and Hash varbinary(20). This table should be indexed on its Hash column.
Dump the HashDump table into HashUni, ordered by the Hash column. In between is a Script Component that checks whether the current row's Hash value is the same as the previous row's. If it is, the row is directed to the Duplicate output, otherwise to Unique. This way you can log the duplicates even if all you need is the uniques.
Dump the MillionRow table into a MillionUni table. In between is a Lookup Component that uses HashUni to tell which rows are unique.
This method allows me to log each duplicate with a message such as: "Row 1000 is a duplicate of row 100".
I have not found a better way than this. Earlier, I made a unique index on MillionUni so I could dump MillionRow directly into it, but then I was not able to use "fast load", which was way too slow.
Here is one way to populate the Hash column:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Concatenate the columns (null-safe, with a delimiter) and hash the result.
    StringBuilder sb = new StringBuilder();
    sb.Append(Row.Col1String_IsNull ? "" : Row.Col1String); sb.Append("|");
    sb.Append(Row.Col2Num_IsNull ? "" : Row.Col2Num.ToString()); sb.Append("|");
    sb.Append(Row.Col3Date_IsNull ? "" : Row.Col3Date.ToString("yyyy-MM-dd"));

    var sha1Provider = HashAlgorithm.Create("SHA1");
    Row.Hash = sha1Provider.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
}
If 200 columns prove to be a chore, part of this article may inspire you: it loops over all column objects and concatenates their values into a single string.
And to compare the hashes, use this method:
byte[] previousHash;
int previousRowNo;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Rows arrive sorted by Hash, so any duplicate is adjacent to the row it repeats.
    if (StructuralComparisons.StructuralEqualityComparer.Equals(Row.Hash, previousHash))
    {
        Row.DupRowNo = previousRowNo;
        Row.DirectRowToDuplicate();
    }
    else
    {
        Row.DirectRowToUnique();
    }
    previousHash = Row.Hash;
    previousRowNo = Row.RowNo;
}
I wouldn't bother with SSIS for this; a couple of queries will do. Also, since you have a lot of data, I suggest you check the execution plan before running the queries and optimize your indexes.
http://www.brijrajsingh.com/2011/03/delete-duplicate-record-but-keep.html
Check out the small article I wrote on the same topic, linked above.
As far as I know, the Sort component is the only transformation that can remove duplicates for you. Alternatively, you can use a SQL command.
If the sorting operation is the problem, then (assuming your source is a database) use "SQL Command" as the Data Access Mode and SELECT DISTINCT your data there. That's it; you may also save a bit of time, as the ETL won't have to go through the Sort component.

How to get elasticsearch to perform similar to SQL 'LIKE'

A SQL LIKE statement returns data even if it's only partially matched. For instance, if I'm searching for food and there's an item in my db called "raisins", a SQL query would return "raisins" even if my search only contained "rai". In Elasticsearch, the query won't return a record unless the entire name (in this case "raisins") is specified. How can I get Elasticsearch to behave similarly to the SQL statement? I'm using Rails 3.1.1 and PostgreSQL. Thanks!
While creating the Elasticsearch index for your model, use a tokenizer on the index that fulfils this requirement. For example:
tokenizer: {
  name_tokenizer: { type: "edgeNGram", max_gram: 100, min_gram: 3, side: "front" }
}
This creates tokens of size 3 to 100 from your fields, and since side is "front" it matches from the beginning of each word. You can get more details here: http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained