Determining whether a SPARQL query inserted anything

Say I have an update form such as:
INSERT DATA {
  # ... data
}
WHERE {
  FILTER EXISTS {
    # ... condition
  }
}
which may or may not insert data depending on whether the FILTER condition holds. As far as I can tell, the SPARQL 1.1 Update standard makes no recommendation about the response a SPARQL engine must return after successfully running this update. In other words, there is no way to tell whether data was inserted or not.
Of course, one could subsequently run a SELECT query to check whether rows have been inserted/changed, but this second query would not run as part of the same transaction as the INSERT, so false positives and negatives can be expected.
Am I missing something here? Is there some way, aside from vendor-specific solutions, to determine whether filter conditions matched or not? This seems like a pretty significant limitation.
The only hack I can think of is generating, with every insert, a triple marked with a unique UUID, which gets added to the graph provided that the FILTER condition holds. Then a subsequent SELECT for this UUID would determine conclusively whether the INSERT ran or not.
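A rough sketch of that workaround (the prefix, marker predicate, and UUID below are all made up, and it uses the legal INSERT { } WHERE { } form discussed in the answer below):

PREFIX ex: <http://example.org/>

INSERT {
  # ... the data you actually want to add ...
  ex:run1 ex:insertMarker "550e8400-e29b-41d4-a716-446655440000" .   # marker triple with a fresh UUID
}
WHERE {
  FILTER EXISTS {
    # ... condition ...
  }
}

A subsequent SELECT (or ASK) for that marker triple then tells you whether the update's WHERE clause matched.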

INSERT DATA { } WHERE { } isn't legal syntax.
There is INSERT DATA { } (for plain data, no variables) or INSERT { } WHERE { } (a template plus a pattern that binds variables).
INSERT DATA :: https://www.w3.org/TR/sparql11-update/#insertData
INSERT {} WHERE {} :: https://www.w3.org/TR/sparql11-update/#insert
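For illustration, a minimal sketch of the two legal forms (two separate updates, with made-up IRIs and data):

# INSERT DATA: ground triples only, no variables, no WHERE clause.
INSERT DATA {
  <http://example.org/book1> <http://example.org/title> "A new book" .
}

# INSERT ... WHERE: a template whose variables are bound by the WHERE pattern.
INSERT {
  ?s <http://example.org/reviewed> true .
}
WHERE {
  ?s <http://example.org/status> "pending" .
}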

Related

Abstract view of how distinct queries are implemented in NoSQL

I am developing a system using Google Datastore, where there is a Kind, Posts, which has 2 properties:
1. message (string)
2. hashtags (list)
I want to query the distinct hashtags along with their counts. For example,
say the posts are
[
  {
    "message": "msg1",
    "tags": ["abc", "cde", "efr"]
  },
  {
    "message": "msg2",
    "tags": ["abc", "efgh", "efk"]
  },
  {
    "message": "msg3",
    "tags": ["abc", "efgh", "efr"]
  }
]
The output should be
{
  "abc": 3,
  "cde": 1,
  "efk": 1,
  "efgh": 2,
  "efr": 2
}
But with the NoSQL implementation of Datastore I can't query this directly. To get it, I would have to load all the posts and compute the distinct tags myself, which would be time-consuming.
However, I have seen a distinct function, db.collection.distinct() (in MongoDB), which I think might help with this problem. If it has to be done on a NoSQL store, what would the optimal solution be?
Unfortunately, projection queries with 'distinct on' will only return a single result per distinct value (https://cloud.google.com/datastore/docs/concepts/queries#projection_queries). It will not provide a count of each distinct value. You'll need to do the count yourself, but you can use a projection query to save cost by only returning the tag values instead of the full entities.
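A rough sketch of that approach with the Python client (a sketch only: it assumes the google-cloud-datastore client, a kind named Posts, and a list property named tags, as in the question's example):

# Sketch, not a drop-in solution: counts distinct tag values client-side
# after a projection query that fetches only the tag property.
from collections import Counter
from google.cloud import datastore

client = datastore.Client()

query = client.query(kind="Posts")
query.projection = ["tags"]  # fetch only the tag values, not full entities

counts = Counter()
for entity in query.fetch():
    value = entity["tags"]
    # A projection on a list property yields one result per value; depending on the
    # client version the value may come back as a scalar or a one-element list.
    if isinstance(value, list):
        counts.update(value)
    else:
        counts[value] += 1

print(dict(counts))  # e.g. {"abc": 3, "efgh": 2, ...}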

Are rows processed even when one query parameter always evaluates to false

Here is an SQL statement for Sqlite and Android's Room, although this really applies to SQL in general:
SELECT rowid, firstName, lastName, photoUrl, occupation, summary
FROM Connections
WHERE (:wordCriteria != "") AND title MATCH :wordCriteria
In the WHERE clause, wordCriteria is a parameter that gets set before the query is run. If it is set to "", the query will not return any results. But even if it's "", does this mean that all the rows are processed anyway, or does the underlying code in the database recognize that the expression:
wordCriteria != ""
will be false and not bother processing the rows?
If all the rows are going to be read anyway, is there a way to prevent them from being read when wordCriteria is ""? I don't want to run the query at all if wordCriteria is set to "".
It's not possible to tell.
SQL is a declarative language, not an imperative one. That means you specify what you want, not how to do it. The SQL planner and optimizer are free to choose how to evaluate your predicates; you don't have control over it. And why would you care?
Good optimizers will probably detect the case you are talking about, but bad optimizers may not.
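If you want to see what a particular SQLite build actually does, one quick manual check (outside of Room) is to substitute literals for the parameter and ask SQLite for its plan, e.g. in the sqlite3 shell:

-- Table and column names come from the question; Connections is assumed to be an
-- FTS table since it is queried with MATCH. The plan output shows whether SQLite
-- still touches the table when the first condition is a constant false.
EXPLAIN QUERY PLAN
SELECT rowid, firstName, lastName, photoUrl, occupation, summary
FROM Connections
WHERE ('x' != 'x') AND title MATCH 'anything';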
In general, the underlying database would still scan the entire Connections table, even if the WHERE clause is trying to match title against an empty string. If you want to avoid running the query at all from your Java code, the easiest thing to do would be to use an if statement like this:
if (!"".equals(wordCriteria)) {
// then run your query
}
else {
// the result set would logically default to being empty
}
You could also check whether the word criteria is null or empty, using something like StringUtils.isEmpty from Apache Commons Lang (or Strings.isNullOrEmpty from Guava).

What is the difference between a result set and a return value in a SQL procedure? What do they signify?

I know that writing:
SELECT * FROM <ANY TABLE>
in a stored procedure will output a result set... so why do we have a return value separately in a stored procedure? Where do we use it?
If an error occurs, then the result set will be null, right?
First of all, you have two distinct ways to return something. You may return a result set (i.e. a table) as the result of the operation, as well as a return value indicating either some sort of error or the status of the result set.
Also, a return value is limited to a single 32-bit integer, whereas a result set can have as many rows and columns as the RDBMS allows.
My personal opinion is to use a stored procedure mainly to execute a task, and not to create a result set. But that is a matter of taste. However, using this paradigm, an action should inform the caller about success and, in case of a failure, about the reason. Some RDBMSs allow using exceptions, but if there is nothing exceptional to throw, a plain status is enough (e.g. 0, 1, 2 for 'data was new and was inserted', 'data existed and was updated', 'data could not be updated', etc.), and the return value is the natural place for it.
There is a third way to pass information back to the caller: by using output parameters. So you have three different possibilities for passing information back to the caller.
This is one more than with a 'normal' programming language. They usually have the choice of either returning a value (e.g. int Foo()) or using an output/ref parameter (void Foo(ref int bar)). But SQL introduces a new and very powerful way of returning data (i.e. tables).
In fact, you may return more than one table which makes this feature even more powerful.
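A small T-SQL sketch of all three channels side by side (the table, procedure, and column names are made up for illustration):

-- (1) a result set, (2) a return value, (3) an output parameter.
CREATE PROCEDURE dbo.GetOrdersForCustomer
    @CustomerId int,
    @OrderCount int OUTPUT              -- (3) output parameter
AS
BEGIN
    SELECT OrderId, OrderDate, Total    -- (1) result set: the actual data
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;

    SET @OrderCount = @@ROWCOUNT;

    IF @OrderCount = 0
        RETURN 1;                       -- (2) return value: a status, e.g. "no orders found"
    RETURN 0;                           -- success
END
GO

-- Caller: the return value and output parameter arrive separately from the rows.
DECLARE @rc int, @count int;
EXEC @rc = dbo.GetOrdersForCustomer @CustomerId = 42, @OrderCount = @count OUTPUT;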
Because if you use return values you can have more fine-grained control over the execution status and over what the error (if any) was, and you can return different error codes for malformed or invalid parameters etc., and hence add error control/checking on the calling side too.
If you just check for an empty result set you really don't know why the set might be empty (maybe you called the procedure with an invalid parameter).
The main difference between a result set and a return value is that the result set stores the data returned (if any) and the return code holds some kind of status information about the execution itself.
You can use the return value to pass back additional information from a stored procedure. This can be error codes, validation results, or any other custom information you may want to return. It gives you additional flexibility when coding stored procedures.
why do we have a return value separately in a stored procedure?
A stored procedure may return 0 or more resultsets. Insert, update, and delete normally don't produce a resultset, and a stored procedure may call select many times. In all cases, the resultset is data.
I suggest the best way to think of the "return value" is as status information: it indicates how the stored procedure worked out. You could return @@ROWCOUNT for an update. Sometimes it can be something simple, like the number of rows meeting some criteria, saving you the work of binding a variable to a single row to get the same answer. Or you could return 0 for success and nonzero for error; it's often easier to check the return status inline than in an error handler.
There's an analogy on the lines of the Unix cat utility that might help: it produces data on standard output, and returns an exit status to let the caller know whether or not it succeeded.

How to implement deduplication in a billion-row table in SSIS

Which is the best option to implement a distinct operation in SSIS?
I have a table with more than 200 columns that contains more than 10 million rows.
I need to get the distinct rows from this table. Is it wise to use an Execute SQL Task (with a SELECT query to deduplicate the rows), or is there any other way to achieve this in SSIS?
I do understand that the SSIS Sort component deduplicates rows, but it is a blocking component, so it is not at all a good idea to use it... Please let me know your views on this.
I did it in 3 steps this way:
1. Dump the MillionRow table into a HashDump table, which has only 2 columns: Id int identity PK, and Hash varbinary(20). This table should be indexed on its Hash column.
2. Dump the HashDump table into HashUni, ordered by the Hash column. In between there is a Script Component that checks whether the current row's Hash value is the same as the previous row's. If it is the same, direct the row to the Duplicate output, else to Unique. This way you can log the duplicates even if all you need is the unique rows.
3. Dump the MillionRow table into a MillionUni table. In between there is a Lookup Component that uses HashUni to tell which rows are unique.
This method allows me to log each duplicate with a message such as: "Row 1000 is a duplicate of row 100".
I have not found a better way than this. Earlier, I put a unique index on MillionUni so I could dump MillionRow directly into it, but then I could not use "fast load", which made it way too slow.
Here is one way to populate the Hash column:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Needs using System.Text; and using System.Security.Cryptography; at the top of the script.
    // Concatenate the (nullable) column values with a separator so "AB"+"C" and "A"+"BC" don't collide.
    StringBuilder sb = new StringBuilder();
    sb.Append(Row.Col1String_IsNull ? "" : Row.Col1String); sb.Append("|");
    sb.Append(Row.Col2Num_IsNull ? "" : Row.Col2Num.ToString()); sb.Append("|");
    sb.Append(Row.Col3Date_IsNull ? "" : Row.Col3Date.ToString("yyyy-MM-dd"));

    // Hash the combined string so the 200 columns collapse into one small, comparable value.
    var sha1Provider = HashAlgorithm.Create("SHA1");
    Row.Hash = sha1Provider.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
}
If typing out 200 columns proves to be a chore for you, part of this article may inspire you: it loops over all the column objects and concatenates their values into a single string.
And to compare the Hash, use this method:
byte[] previousHash;
int previousRowNo;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Needs using System.Collections; for StructuralComparisons.
    // Assumes the input is sorted by Hash, so duplicates arrive on adjacent rows.
    if (StructuralComparisons.StructuralEqualityComparer.Equals(Row.Hash, previousHash))
    {
        Row.DupRowNo = previousRowNo;
        Row.DirectRowToDuplicate();
    }
    else
    {
        Row.DirectRowToUnique();
    }
    previousHash = Row.Hash;
    previousRowNo = Row.RowNo;
}
I wouldn't bother with SSIS for this; a couple of queries will do. Also, you have a lot of data, so I suggest you check the execution plan before running the queries, and optimize your indexes.
Check out a small article I wrote on the same topic: http://www.brijrajsingh.com/2011/03/delete-duplicate-record-but-keep.html
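For example, a common pattern for "delete the duplicates but keep one" looks roughly like this (a sketch only: ColA/ColB/ColC stand in for the real column list or a computed hash, and MillionRow is the table name used in the answer above):

;WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ColA, ColB, ColC   -- the columns that define "duplicate"
                              ORDER BY (SELECT NULL)) AS rn
    FROM MillionRow
)
DELETE FROM numbered
WHERE rn > 1;   -- keeps one row per group, removes the rest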
As far as I know, the Sort component is the only transformation that lets you weed out the duplicates. Or you could use a SQL command.
If the sorting operation is the problem, then (assuming your source is a database) you should switch the source's Data Access Mode to "SQL Command". SELECT DISTINCT your data there and that's it; you may also save a bit of time, as the ETL won't have to go through the Sort component.
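For example, the source's SQL command could be as simple as the following sketch (the column and table names are placeholders):

-- Deduplicate at the source instead of in a blocking Sort component.
SELECT DISTINCT Col1, Col2, Col3   -- ... list all 200 columns explicitly
FROM MillionRow;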

CakePHP query additions in controller

I am migrating raw PHP code to CakePHP and have some problems. Since I have big problems with the query-to-ORM transformation, I am temporarily using raw SQL. All is going well, but I have run into some ugly code and don't really know how to make it clean. I made a DealersController and added a function advanced($condition = null) (it will be called from AJAX with parameters 1-15 and 69). The function looks like:
switch ($condition) {
    case '1':
        $cond_query = ' AND ( (d.email = \'\' OR d.email IS NULL) )';
        break;
    case '2':
        $cond_query = ' AND (d.id IN (SELECT dealer_id FROM dealer_logo))';
        break;
    // There are many cases, some long, some like these two
}

if ($user_group == 'group_1') {
    $query = 'LONG QUERY WITH 6+ TABLES JOINING' . $cond_query;
} elseif ($user_group == 'group_2') {
    $query = 'A LITTLE BIT DIFFERENT LONG QUERY WITH 6+ TABLES JOINING' . $cond_query;
} else {
    $query = 'A LITTLE MORE DIFFERENT LONG QUERY WITH 10+ TABLES JOINING' . $cond_query;
}

// THERE IS $this->Dealer->query($query); and so on
So, as you can see, the code looks ugly. I have two options:
1) Pull out the query additions and make model methods for every condition, separating these conditions into functions. But this is not DRY, because the 3 main big queries are almost the same, and if I need to change something in one, I will need to change 16+ queries.
2) Make small reusable model methods/queries which fetch small pieces of data from the DB, then avoid raw SQL and compose those methods instead. That would be nicer, but the performance will be lower, and I need it as high as possible.
Please give me advice. Thank you!
If you're concerned about how CakePHP makes a database query for every joined table, you might find that the Linkable behaviour can help you reduce the number of queries (where the joins are simple associations on the one table).
Otherwise, I find that creating simple database querying methods at the Model level to get your smaller pieces of information, and then combining them afterwards, is a good approach. It allows you to clearly outline what your code does (through inline documentation). If you can migrate to using CakePHP's find methods instead of raw queries, you will be using the conditions array syntax. So one way you could approach your problem is to have public functions on your Model classes which append their appropriate conditions to an inputted conditions array. For example:
class SomeModel extends AppModel {
    ...
    public function addEmailCondition(&$conditions) {
        $conditions['OR'] = array(
            'alias.email_address' => null,
            'alias.email_address =' => ''
        );
    }
}
You would call these functions to build up one large conditions array which you can then use to retrieve the data you want from your controller (or from the model if you want to contain it all at the model layer). Note that in the above example, the conditions array is being passed by reference, so it can be edited in place. Also note that any existing 'OR' conditions in the array will be overwritten by this function: your real solution would have to be smarter in terms of merging your new conditions with any existing ones.
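For example, a minimal usage sketch from the controller, assuming the model above (the find options are placeholders):

// Build one conditions array from several small model helpers, then query once.
$conditions = array();
$this->SomeModel->addEmailCondition($conditions);
// ... call other add*Condition() helpers as needed ...

$results = $this->SomeModel->find('all', array(
    'conditions' => $conditions,
));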
Don't worry about 'hypothetical' performance issues. If you've tried the queries and they're too slow, then you can worry about how to increase performance. But for starters, try to write the code as cleanly as possible.
You also might want to consider splitting up that function advanced() call into multiple Controller Actions that are grouped by the similarity of their condition query.
Finally, in case you haven't already checked it out, here's the Book's entry on retrieving data from models. There might be some tricks you hadn't seen before: http://book.cakephp.org/view/1017/Retrieving-Your-Data
If the base part of the query is the same, you could have a function to generate that part of the query, and then use other small functions to append the different where conditions, etc.
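A rough sketch of that idea in the Dealer model (the method names and query fragments are made up; the real base queries would be the three long joins from the question):

class Dealer extends AppModel {

    // Build the shared base query in one place...
    protected function baseQuery($userGroup) {
        if ($userGroup == 'group_1') {
            return 'SELECT ... FROM dealers d JOIN ... /* 6+ tables */ WHERE 1 = 1';
        }
        // ... other groups differ only in their joins ...
        return 'SELECT ... FROM dealers d JOIN ... /* 10+ tables */ WHERE 1 = 1';
    }

    // ...and keep each condition fragment in its own small helper.
    protected function emailMissingCondition() {
        return ' AND (d.email = \'\' OR d.email IS NULL)';
    }

    public function advancedSearch($userGroup, $condition) {
        $query = $this->baseQuery($userGroup);
        if ($condition == '1') {
            $query .= $this->emailMissingCondition();
        }
        return $this->query($query);
    }
}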