Frequently Updated CSDL query in DataSift - csdl

Our DataSift CSDL query looks this way:
List<string> keywords=dbAccess.GetAllKeywords(); // there are 100K+ of them
string csKwList="\""+String.Join(",", keywords)+"\"";
string csdl = "facebook.message contains_any "+csKwList;
DataSiftManager.Resubscribe(csdl); // this involves deleting the current subscription, recompiling the new CSDL, and subscribing anew
This works, but each time a couple of new keywords are added to the list, I have to pull the entire list from the DB. This is unacceptable.
My question is: is there a way to slightly modify a currently active subscription if I know exactly which keywords are being added to and removed from the CSDL query?

At present, when you need to 'modify' your CSDL, you are required to recompile the definition. This will mean grabbing your full list of keywords, and adding them to your CSDL definition.
DataSift is working to improve this process by allowing smarter management of large lists of keywords, though this feature is still in development.
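Until then, the database round trip (though not the recompile) can be avoided by keeping the keyword set in memory and applying only the changes. The following is a minimal sketch in Python, not DataSift client code: the `KeywordCache` class and its method names are hypothetical, and it assumes keywords contain no commas or double quotes. The resulting CSDL still has to be recompiled and resubscribed as described above.

```python
class KeywordCache:
    """In-memory copy of the keyword list (hypothetical helper, not a DataSift API)."""

    def __init__(self, keywords):
        # Load the full list once, for example at start-up.
        self._keywords = set(keywords)

    def apply_changes(self, added=(), removed=()):
        """Apply a delta instead of reloading every keyword from the database."""
        self._keywords.update(added)
        self._keywords.difference_update(removed)

    def to_csdl(self):
        """Rebuild the full CSDL text; it still needs to be recompiled and resubscribed."""
        # Assumption: keywords contain no commas or double quotes, so nothing is escaped.
        joined = ",".join(sorted(self._keywords))
        return 'facebook.message contains_any "%s"' % joined


cache = KeywordCache(["apple", "banana"])
cache.apply_changes(added=["cherry"], removed=["apple"])
print(cache.to_csdl())
```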


Is using comma separated field good or not

I have a table named buildings
each building has zero - n images
I have two solutions
the first one (the classic solution) using two tables:
buildings(id, name, address)
building_images(id, building_id, image_url)
and the second solution using only one table
buildings(id, name, address, image_urls_csv)
Given I won't need to search by image URL obviously,
I think the second solution (using image_urls_csv column) is easier to use, and no need to create another table just to keep the images, also I will avoid the hassle of multiple queries or joining.
the question is, if I don't really want to filter, search or group by the field value, can I just make it CSV?
On the one hand, simply having a column of image_urls_list avoids joins or multiple queries, yes. A single round-trip to the db is always a plus.
On the other hand, you then have a string of urls that you need to parse. What happens when a URL has a comma in it? Oh, I know, you quote it. But now you need a parser that is beyond a simple naive split on commas. And then, three months from now, someone will ask you which buildings share a given image, and you'll go through contortions to handle quotes, not-quotes, and entries that are at the beginning or end of the string (and thus don't have commas on either side). You'll start writing some SQL to handle all this and then say to heck with it all and push it up to your higher-level language to parse each entry and tell if a given image is in there, and find that this is slow, although you'll realise that you can at least look for %<url>% to limit it, ... and now you've spent more time trying to hack around your performance improvement of putting everything into a single entry than you saved by avoiding joins.
A year later, someone will give you a building with so many URLs that it overflows the text limit you put in for that field, breaking the whole thing. Or add some extra fields to each for extra metadata ("last updated", "expires", ...).
So, yes, you absolutely can put in a list of URLs here. And if this is postgres or any other db that has arrays as a first-class field type, that may be okay. But do yourself a favour, and keep them separate. It's a moderate amount of up-front pain, and the long-term gain is probably going to make you very happy you did.
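For what it's worth, the two-table layout does not have to cost an extra round trip. Here is a small sketch using Python's built-in sqlite3 module with the table names from the question; the sample rows and the `images_by_building` variable are made up for illustration, and a real schema would depend on your database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buildings (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE building_images (
        id INTEGER PRIMARY KEY,
        building_id INTEGER NOT NULL REFERENCES buildings(id),
        image_url TEXT NOT NULL
    );
    INSERT INTO buildings VALUES (1, 'Tower', '1 Main St'), (2, 'Annex', '2 Main St');
    INSERT INTO building_images (building_id, image_url) VALUES
        (1, 'http://example.com/a.jpg'),
        (1, 'http://example.com/b,c.jpg');
""")
# Note the second URL contains a comma; as its own row it needs no quoting or parsing.

# One round trip: LEFT JOIN keeps buildings that have zero images.
rows = conn.execute("""
    SELECT b.id, b.name, i.image_url
    FROM buildings AS b
    LEFT JOIN building_images AS i ON i.building_id = b.id
    ORDER BY b.id, i.id
""").fetchall()

images_by_building = {}
for building_id, name, image_url in rows:
    urls = images_by_building.setdefault(name, [])
    if image_url is not None:  # NULL means the building has no images
        urls.append(image_url)

print(images_by_building)
```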
Not
"Given I won't need to search by image URL obviously" is an assumption that you cannot make about a database. Even if you never do end up searching by url, you might add other attributes of building images, such as titles, alt tags, width, height, etc, so you would end up having to serialize all this data in that one column, and then you would not be able to index any of it. Plus, if you serialize it with one language, then you or whoever comes after you using a different language will either have to install some 3rd party library to deserialize your stuff or write their own deserialization function.
The only case that I can think of where you should keep serialized data in a database is when you inherit old software that you don't have time to fix yet.

Solr 5.3 implementation processes docs but doesn't return results

I have recently set up a local instance of Solr 5.3 in an effort to get it going for my company. As an initial test case I've set up a Data Import Handler (DIH) that returns PDFs stored within a file directory. When I execute the full import in the admin tool, the DIH processes all the files within the directory, and I'm able to run a general query (*:*) which returns all indexed fields for every record in the index.
When I switch to a specific query using a word definitely contained within the files, however, Solr returns no results. What connection am I not making here?
I can provide excerpts from the schema, solrconfig, and custom data config if needed, but I don't want to oversaturate this post.
The answer I came up with involved a simple newbie mistake combined with something I wasn't anticipating.
1) First, I didn't have my field set to indexed="true". I set that. Yeesh, it stinks being new to this!
2) I needed to make a change to solrconfig.xml for the core in question. Thanks to this article, I was able to determine that I needed to add a default field in the /select requestHandler. Uncommenting the relevant line in solrconfig and changing the field name did the trick-- I no longer need to supply the name in df to return results.
My carryover question for anyone coming across this question in the future is whether this latter point is the proper way to go about using default fields. I see that the corresponding default-field setting in schema.xml is deprecated (or heading in that direction) in 5.3.0. So is it alright to define df in solrconfig instead?
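For comparison, df can also be passed per request rather than set as a handler default. The snippet below is only a sketch of how such a request URL could be assembled in Python; the host, the core name `pdfs`, the field name `content`, and the helper `build_select_url` are all assumptions for illustration, not values from my setup.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, default_field=None):
    """Build a /select URL; df is only added when a default field is given."""
    params = {"q": query, "wt": "json"}
    if default_field is not None:
        params["df"] = default_field  # field searched when the query names none
    return "%s/%s/select?%s" % (base_url.rstrip("/"), core, urlencode(params))


# Hypothetical host, core, and field names.
url = build_select_url("http://localhost:8983/solr", "pdfs", "invoice",
                       default_field="content")
print(url)
```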

how to update field names automatically after updating SQL

I am changing the command text for a data set inside the .rdl file:
I would like to know how I can update the resulting fields that are returned by the select statement:
I know that these fields must be automatically generated, so I was wondering if it's possible to update them right after editing the SQL code inline?
Usually when someone wants to have a look at the data in command text they are wanting it for reference to an end user(from what I have seen). You may want to amend it but ultimately with reporting your first goal should be: "What am I doing this for?" If your goal is dynamic creation at runtime then I would avoid this and offer a few other suggestions:
Proceduralize it. Making a stored procedure, if you have the know-how in SQL Server, is a convenient and fast way to get what you want, and you can optimize it if you know what you are doing with your SQL FU to get good results. The downside would be that if you work with multiple environments you have to deploy your code for the TSQL as well as the RDL file.
Use an expression to build the dataset at runtime. In cases where I have been told that the query itself was not properly optimized by other developers, they have mentioned doing this. I myself do not always see the advantage of doing this versus just having your predicate construction work well with good indexing on the source engine. Regardless, you can build your dataset at runtime. It would be similar to hitting 'fx' next to the text and then putting in something like this (assuming you have a variable named @Start):
="Select thing
from table
Where >= " & Parameters!Start.Value
Again I have not really seen if this is really that much faster than:
Select thing
from table
Where >= @Start
But it is there if you just want to build it dynamically.
You can try to build your expression dynamically from parameters being PART of the select statement. SSRS is all about the 'expressions' and what you can do with them. Once you jump in and learn how they apply to everything you can go nuts so to speak on using them. A general rule though is the more of them you use and rely on the slower your reports will become.
I hope some of this helps. I would first ask whether the need for something dynamic comes from being event driven or is performance related.

ndb ComputedProperty filtering

I have a User ndb.Model which has a username StringProperty that allows upper and lower case letters. At some point I wanted to fetch users by username but have the case forced to lowercase for the filtering. Therefore I added a ComputedProperty to User: username_lower, which returns the lowercase version of the username as follows:
@ndb.ComputedProperty
def username_lower(self):
    return self.username.lower()
then I filter the query like so:
query = query.filter(User.username_lower==username_input.lower())
This works, but only for users created (put) after I added this to the model. Users created before don't get matched by this query. I first thought the ComputedProperty wasn't working for the older users. However, I tried this, and calling .username_lower on an old user does work.
Finally, I found that a solution is to fetch all users and just run a .put_multi(all_users)
So it seems like a ComputedProperty added later to the model works when you invoke it directly but doesn't filter at first. Does it not get indexed automatically? Or could it be a caching thing?
any insight to why it was behaving like this would be welcome
thanks
This is the expected behaviour. The value of a ComputedProperty (or any property for that matter, I guess) is indexed when the object is "put". The datastore does not do automatic schema updates or anything like that. When you update your schema you need to either allow for different schema versions in your code or update your entities individually. In the case of changes to indexing you have no choice but to update your entities. The MapReduce API can be used for updating entities to avoid request limitations and the like.
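To make the mechanism concrete, here is a toy model in plain Python. It is not the datastore or ndb, just an illustration of the idea that index rows are written at put time and that queries only read the index; `ToyDatastore`, `old_schema` and `new_schema` are invented names.

```python
class ToyDatastore:
    """Toy model only: index entries are written at put() time."""

    def __init__(self):
        self.entities = {}  # key -> stored dict
        self.index = {}     # (property name, value) -> set of keys

    def put(self, key, entity, indexed_props):
        # Drop index rows left over from any earlier version of this entity.
        for keys in self.index.values():
            keys.discard(key)
        self.entities[key] = entity
        # Compute and index each property as it is defined right now.
        for prop, compute in indexed_props.items():
            self.index.setdefault((prop, compute(entity)), set()).add(key)

    def query(self, prop, value):
        # Queries read the index only; they never recompute properties.
        return sorted(self.index.get((prop, value), set()))


old_schema = {"username": lambda e: e["username"]}
new_schema = dict(old_schema, username_lower=lambda e: e["username"].lower())

store = ToyDatastore()
store.put("u1", {"username": "Alice"}, old_schema)  # written before the change
store.put("u2", {"username": "Bob"}, new_schema)    # written after the change

before = store.query("username_lower", "alice")     # old entity has no index row
store.put("u1", store.entities["u1"], new_schema)   # re-put the old entity
after = store.query("username_lower", "alice")
print(before, after)
```

In this model the old entity only becomes visible to the new filter after it is put again, which matches what the put_multi workaround did.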

Query string with keys without values

What drawbacks can you think of if I design my REST API with query strings without parameter values? Like so:
http://host/path/to/page?edit
http://host/path/to/page?delete
http://host/path/to/page/+commentId?reply
Instead of e.g.:
http://host/api/edit?page=path/to/page
http://host/api/delete?page=path/to/page
http://host/api/reply?page=path/to/page&comment=commentId
( Edit: Any page-X?edit and page-X?delete links would trigger GET requests but wouldn't actually edit or delete the page. Instead, they show a page with a <form>, in which page-X can be edited, or a <form> with a Really delete page-X? confirmation dialog. The actual edit/delete requests would be POST or DELETE requests. In the same manner as host/api/edit?page=path/to/page shows a page with an edit <form>. /Edit. )
Please note that ?action is not how query strings are usually formatted. Instead, they are usually formatted like so: ?key=value;key2=v2;key3=v3
Moreover, sometimes I'd use URLs like this one:
http://host/path/to/page?delete;user=spammer
That is, I'd include a query string parameter with no value (delete) and one parameter with a value (user=spammer) (in order to delete all comments posted by the spammer)
My Web framework copes fine with query strings like ?reply. So I suppose what I'm mostly wondering about is: can you think of any client-side issues? Or any problems, should I decide to use another Web framework? (Do you know if the frameworks you use provide information on query string parameters without values?)
(My understanding from reading http://labs.apache.org/webarch/uri/rfc/rfc3986.html is that the query string format I use is just fine, but what does that matter to all clients and server frameworks everywhere.)
(I currently use the Lift-Web framework. I've tested Play Framework too and it was possible to get hold of the value-less query string parameters, so both Play and Lift-Web seem okay from my point of view.)
Here is a related question about query strings with no values. However, it deals with ASP.NET functions returning null in some cases: Access Query string parameters with no values in ASP.NET
Kind regards, Kaj-Magnus
Query parameters without value are no problem, but putting actions into the URI, in particular destructive ones, is.
Are you seriously thinking about "restful" design, and having a GET be a destructive action?
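On the framework side of the question, one thing to check is whether the parser you rely on keeps value-less keys at all. As a small illustration (Python's standard library, chosen only as an example of a server-side parser), `urllib.parse.parse_qs` drops them unless asked to keep blank values. The example uses `&` rather than `;` as the separator because how `;` is treated differs between Python versions.

```python
from urllib.parse import parse_qs

query = "delete&user=spammer"

# Default behaviour: a key without "=value" is silently dropped.
default_parse = parse_qs(query)

# With keep_blank_values=True the key is kept with an empty string as its value.
keep_blank = parse_qs(query, keep_blank_values=True)

print(default_parse)
print(keep_blank)
```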