I'm working on a large project that uses the Kohana 3 framework, and I need to improve it by adding a cache system to reduce the number of MySQL connections.
I'm thinking of developing a basic (but general) module that caches full query results, while managing the results for each table in a separate group.
For example:
cache groups: users, roles, roles_users, etc.
Each group contains all the query results from the corresponding table. So, if I want to get values from 'users', the cache system would automatically add the result to the cache, but if I update the 'users' table, all the keys in the 'users' group would be deleted. I know it's not very smart, but it's fast and safe (the system also generates user lists, and the results must be correct).
Then, my question is: where and how can I inject my code into the application tree?
First (to generate a hash key), I need the full query for a certain table (used as the group), plus the result of that query to store. Then, when another hash in that group matches a stored one, the value must be fetched from memcached.
So I need the table name, the query, and the result... I think it's possible by extending the Database class and implementing the cache in the execute() method, but I can't find it!
Am I on the right track? Where is the execute() method?
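To make the idea concrete, here is a minimal sketch of the behaviour I'm after, in Python, with a plain dict standing in for memcached (the version-counter trick is only one possible way to "delete" a whole group, since memcached can't enumerate keys):

import hashlib

cache = {}  # stand-in for memcached in this sketch

def _group_version(group):
    return cache.get("version:" + group, 0)

def cache_key(group, sql):
    # Hash the full query text; the group's version number is part of
    # the key, so bumping the version discards every key in the group.
    digest = hashlib.md5(sql.encode()).hexdigest()
    return "%s:%d:%s" % (group, _group_version(group), digest)

def get_cached(group, sql):
    return cache.get(cache_key(group, sql))

def set_cached(group, sql, result):
    cache[cache_key(group, sql)] = result

def invalidate_group(group):
    # Called after any INSERT/UPDATE/DELETE touching the group's table.
    cache["version:" + group] = _group_version(group) + 1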
I built a Kohana 3 module that accomplishes this, but it must be used with the query builder. It also uses Memcache to cache the queries. It invalidates on inserts/updates/deletes.
Here's a link:
Kohana Memcache Query Caching
Thanks for looking at my question. I have 20k+ unique identification IDs provided by a client, and I want to look up all of these IDs in MongoDB using a single query. I tried using $in, but it doesn't seem feasible to put all 20k IDs in $in and search. Is there a better way of achieving this?
If the id field is indexed, an $in query should be very fast. But I don't think it is a good idea to query with 20k IDs at one time, as it may consume quite a lot of resources, such as memory. You can split the IDs into multiple groups of a reasonable size and run the queries separately, and you can still run them in parallel at the application level.
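A rough sketch of that batching approach with pymongo (connection details, collection and field names are assumptions):

from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["users"]

def chunks(ids, size=1000):
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def find_batch(id_batch):
    # One $in query per reasonably sized batch.
    return list(collection.find({"user_id": {"$in": id_batch}}))

def find_all(ids):
    # Run the per-batch queries in parallel at the application level.
    with ThreadPoolExecutor(max_workers=8) as pool:
        batches = pool.map(find_batch, chunks(ids))
    return [doc for batch in batches for doc in batch]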
Consider importing your 20k+ IDs into a collection (say, using mongoimport, etc.). Then perform a $lookup from your root collection to the search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with your original operation that required $in.
Here is a Mongo playground for your reference.
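A rough sketch of such a pipeline with pymongo (database, collection, and field names are assumptions):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# "search_ids" is the collection the 20k+ IDs were imported into.
pipeline = [
    {"$lookup": {
        "from": "search_ids",       # imported ID collection
        "localField": "user_id",    # field on the root collection
        "foreignField": "user_id",  # field on the imported collection
        "as": "matched",
    }},
    # Keep only documents whose ID appeared in the imported collection.
    {"$match": {"matched": {"$ne": []}}},
]
for doc in db["users"].aggregate(pipeline):
    print(doc["user_id"])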
Question
I have a complex query that joins three tables and returns a set of rows, with each row having data from its sibling tables. How is it possible to represent this in a RESTful way?
FWIW I know there is not necessarily a "right" way to do it, but I'm interested in learning about what might be the most extensible and durable solution for this situation.
Background
In the past I've represented single tables that more or less mirror the literal structure of the url. For example, the url GET /agents/1/policies would result in a query like select * from policies where agent_id = 1.
Assumption
It seems like the url doesn't necessarily have to be so tightly coupled to the structure of the database layer. For example, if the complex query was something like:
select
agent.name as agent_name,
policy.status as policy_status,
vehicle.year as vehicle_year
from
policies as policy
join agents as agent on policy.agent_id = agent.id
join vehicles as vehicle on vehicle.policy_id = policy.id
where 1=1
and policy.status = 'active';
# outputs something like:
{ "agent_name": "steve", "policy_status": "single", "vehicle_year": "1999" }
I could represent this QUERY as a URL instead of TABLES as URLs. The URL for this could be /vehicles, and if someone wanted to query it (with an id or some other parameter like /vehicles?vehicle_color=red), I could just pass that value into a prepared statement.
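A minimal sketch of that handler in Python with sqlite3 (the vehicle.color column is my assumption; the point is that the user-supplied value is bound as a parameter, never concatenated):

import sqlite3

def find_vehicles(conn, vehicle_color):
    # Hypothetical handler body for GET /vehicles?vehicle_color=red.
    sql = (
        "select agent.name as agent_name, "
        "policy.status as policy_status, "
        "vehicle.year as vehicle_year "
        "from policies as policy "
        "join agents as agent on policy.agent_id = agent.id "
        "join vehicles as vehicle on vehicle.policy_id = policy.id "
        "where policy.status = 'active' and vehicle.color = ?"
    )
    return conn.execute(sql, (vehicle_color,)).fetchall()

conn = sqlite3.connect("insurance.db")  # assumed database file
rows = find_vehicles(conn, "red")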
Bonus Questions
Is this an antipattern?
Should my queries always be run against EXISTING tables instead of prepared statements?
Thanks for your help!
You want to step back from the database tables and queries and think about the basic resources. In your examples, these are clearly agent, customer, vehicle, and policy.
Resources vs Collections
One misstep I see in your examples is that you don't separate collections from resources using plurals, which can be useful when you are dealing with searching and, logistically, for your controller routes. In your example you have:
GET /agents/1/policies
Suppose instead, that this was GET /agent/1/policies.
Now you have a clear differentiation between location of an Idempotent resource: /agent/1, and finding/searching for a collection of agents: /agents.
Following this train of thought, you start to disassociate enumerating relationships from each side of the relationship in your API, which is inherently redundant.
In your example, clearly, policies are not specifically owned by an agent. A policy should be a resource that stands on its own, identifiable via some Idempotent URL using whatever ID uniquely identifies that policy for the purpose of finding it, i.e. /policy/{Id}
Searching Collections
What this now does for you is allow you to separate the finding of a policy through: /policies where returning only policies for a specific Agent is but one of a number of different ways you might access that collection.
So rather than having GET /agents/1/policies you would instead find the policies associated with an agent via: GET /policies?agent=1
The expected result of this would be a collection of resource identifiers for the matching policies:
{ "policies" : ["/policy/234234", "/policy/383282"] }
How do you then get the final result?
For a given policy, you would expect a complete return of associated information, as in your query, only without the limitations of the select clause. Since what you want is a filtered version, a way to handle that would be to include filter criteria.
GET /policy/234234?filter=agentName,policyStatus,vehicleYear
With that said, this approach has pitfalls, and I question it for a number of reasons. If you look at your original list of resources, each one can be considered an object. If you are building an object graph in the client, then the complete information for a policy would instead include resource locators for all the associated resources:
{ ... Policy data + "customer": "/customer/2834", "vehicle": "/vehicle/88328", "agent": "/agent/32" }
It is the job of the client to access the data for an agent, a vehicle and a customer, and not your job to regurgitate all that data redundantly anytime you need some view of that data.
This is better because it is RESTful and supports many of the aims of REST, such as idempotency, caching, etc.
This also better allows the client to cache the data for an Agent locally, and to determine whether it needs to fetch that data or can just use data it has already cached. In the worst case, maybe 3 or 4 REST calls need to be made.
Bonus questions
REST has some grey area. You have to interpret Fielding, and for that reason there are frequently different opinions about how to do things. While providing an API like GET /agents/1/policies to list the policies associated with an agent is a frequently used approach, in my experience there is a point where it becomes limiting and redundant, as it requires end users to become familiar with the way you model relationships between the underlying resources.
As for your question on queries, it makes no difference how you access the underlying data and organize it, so long as you are consistent. What often happens (for performance reasons) is that the API stops returning resource identifiers and starts returning the data directly, as I illustrated previously. This is a slippery slope where you are just turning your REST API into a frontend for a bunch of queries, and at that point your API might as well just be: GET /query?filter=agent.name,policy.status,vehicle.year&from=policies&join=agents,vehicles&where=...
I have the following scenario: a search returns a list of userid values (1, 2, 3, 4, 5, 6... etc.). If the search were run again, the results are guaranteed to change, given some time. However, I need to store a snapshot of the search results to be used in the future.
We have a current implementation (legacy), which creates a record for the search_id with the criteria and inserts every row returned into a different table with the associated search_id.
table search_results
search_id unsigned int FK, PK (clustered index)
user_id unsigned int FK
This is an unacceptable approach, as this table has grown to millions of records. I've considered partitioning the table, but then I would end up with numerous partitions (in the thousands).
I've optimized the existing tables so that search results expire unless they're used elsewhere; consequently, all remaining search results are referenced elsewhere.
In the current schema, I cannot store the results as serialized arrays or XML. I am looking to store the search result information efficiently, such that it can be accessed later without being burdened by the number of records.
EDIT: Thank you for the answers. I don't have any problems running the searches themselves, but the result set for a search is used in this case for recipient lists, which will be used over and over again; the purpose of storing is exactly to have a snapshot of the data at a given time.
The answer is don't store query results. It's a terrible idea!
It introduces statefulness, which is very bad unless you really (really really) need it
It isn't scalable (as you're finding out)
The data is stale as soon as it's stored
The correct approach is to fix your query/database so it runs acceptably quickly.
If you can't make the queries faster using better SQL and/or indexes etc., I recommend using Lucene (or any text-based search engine) and denormalizing your database into it. Lucene queries are incredibly fast.
I recently did exactly this on a large web site that was doing what you're doing: it was caching query results from the production relational database in the session object in an attempt to speed up queries (a decision made before my time), but it was a mess, and wasn't much faster anyway.
I put in Solr (a Java search server built on Lucene) and kept Solr up to date with the relational database (using work queues), and the web queries now take just a few milliseconds.
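The queue consumer itself can be small. A rough sketch with the pysolr client (URL, core name, and document fields are assumptions, not the actual schema):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/users", always_commit=True)

def on_user_changed(row):
    # Work-queue consumer: push a denormalized document into Solr
    # whenever the corresponding relational row changes.
    solr.add([{
        "id": row["user_id"],
        "name": row["name"],
        "roles": row["role_names"],  # denormalized from the roles join
    }])

def search_users(text):
    return solr.search("name:%s" % text, rows=50)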
Is there a reason why you need to store every search? Surely you would want the most up-to-date information available for the user?
I'll admit first, this isn't a great solution.
Setup another database alongside your current one [SYS_Searches]
The save script could use SELECT INTO [SYS_Searches].Results_{Search_ID}
The script that retrieves can do a simple SELECT out of the matching table.
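A rough sketch of those two scripts (SQL Server style; database, table, and criteria names are assumptions):

def save_search(cursor, search_id):
    # search_id comes from our own sequence, so embedding it in the
    # table name after an int() cast is safe; values stay parameterized.
    table = "[SYS_Searches].dbo.[Results_%d]" % int(search_id)
    cursor.execute(
        "SELECT user_id INTO " + table + " FROM users WHERE status = ?",
        ("active",),
    )

def load_search(cursor, search_id):
    table = "[SYS_Searches].dbo.[Results_%d]" % int(search_id)
    cursor.execute("SELECT user_id FROM " + table)
    return [row[0] for row in cursor.fetchall()]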
Benefits:
Every search is neatly packed into its own table [preferably in another DB]
The retrieval query is very simple
The retrieval time should be very quick, no massive table scans.
Drawbacks:
You will have a table for every x users × y searches a user can store.
This could get very silly very quickly unless there is management involved to expire results, or unless each user can only have one cached search result set.
Not pretty, but I can't think of another way.
For our web application (ASP.NET) we're using Fluent NHibernate (2.1.2) with 2nd-level caching, not only for entities but also for queries (generated with the criteria API). We're using the session-per-request pattern and one SessionFactory application-wide, so the cache serves all NHibernate sessions.
Problem:
We have to deal with different access rights per user on the data objects in our legacy database (Oracle); that is, views constrain the returned data per user's rights.
So there's a situation where, for example, the same view is queried by our criteria with the exact same query, but returns a different result set depending on the user's rights.
Now, to gain performance, the mentioned query is cached. But this gives us the problem that when the query is first fired from an action of user A, it caches the resulting IDs, which are the IDs to which user A has access rights. Shortly after, the same query is fired from an action of user B, and NHibernate then picks the cached IDs from the first call (from user A) and tries to get the corresponding entities, to which user B doesn't have access rights (or maybe not for all of them). We're checking the rights with event listeners, so our application throws an access-rights exception in the mentioned case.
Thoughts:
Not caching the queries could be an option, but performance is clearly an issue in our application, so it would be really desirable to have cached queries per user.
We even thought about a SessionFactory per user, to have a cache per user, sort of. But this clearly has an impact on resources, is somewhat of an overkill, and honestly isn't an option, because there are entities which have to be accessed, and are manipulated, by multiple users (think of a user group), creating issues with stale data in the "individual caches" and so on. So that's a no-go.
What would be a valid solution for this? Is there something like "best practice" for such a situation?
Idea:
As I was stuck on this yesterday, seeing no way out, I slept on it, and today I came up with some sort of a "hack".
As NHibernate caches the query by query text and parameters ("clauses"), I thought about a way to "smuggle" something user-dependent into the signature of the queries, so it would cache every query per user, but would not alter the query itself (concerning the result of the query).
So "creativity" guided me to this (example-code):
string userName = GetCurrentUser();
ICriteria criteria = session.CreateCriteria(typeof(EntityType))
    .SetCacheable(true)
    .SetCacheMode(CacheMode.Normal)
    .Add(Expression.Eq("PropertyA", 1))
    .Add(Expression.IsNotNull("PropertyB"))
    .Add(Expression.Sql(string.Format("'{0}' = '{0}'", userName)));
return criteria.List();
This line:
.Add(Expression.Sql(string.Format("'{0}' = '{0}'", userName)))
results in a WHERE clause which always evaluates to true, but "changes" the query from NHibernate's viewpoint, so it caches per separate "userName".
I know, it's kind of ugly and I'm not really pleased with it.
Does anybody know an alternative approach?
Thanks in advance.
A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the substitution API to avoid SQL injection, since the query elements depend on what the user decided to include in the query. I can't see how else to build this query other than by using string appends.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
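For illustration, a small sketch of that piece-by-piece construction in Python (DB-API style; the tables and the whitelist are made up): all syntax, including column names, comes from code, and user input only ever lands in the parameter list.

def build_query(filters):
    # filters holds user-supplied VALUES only, e.g. {"tag": "sql"}.
    # Column names are chosen from a fixed whitelist, never from input.
    allowed = {"tag": "tags.name", "user": "users.login"}
    clauses, params = [], []
    for key, value in filters.items():
        clauses.append(allowed[key] + " = ?")  # KeyError on unknown keys
        params.append(value)
    sql = ("SELECT posts.id FROM posts "
           "JOIN tags ON tags.post_id = posts.id "
           "JOIN users ON users.id = posts.user_id")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params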
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value belongs to a set of values of arbitrary cardinality do not need to be dynamic. Consider using either the instr function or a special filtering table which you join against (see the sketch after this list). This approach can easily be extended to multiple columns as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
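Here is the sketch promised in point 1: loading the arbitrary value set into a filtering table and joining against it, instead of building a dynamic IN (...) list (table and column names are made up):

def filter_by_ids(cursor, ids):
    # The value set's cardinality is arbitrary, but the query stays static.
    cursor.execute("CREATE TEMPORARY TABLE id_filter (id INT PRIMARY KEY)")
    cursor.executemany("INSERT INTO id_filter (id) VALUES (?)",
                       [(i,) for i in ids])
    cursor.execute(
        "SELECT u.* FROM users AS u JOIN id_filter AS f ON f.id = u.id")
    return cursor.fetchall()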
If you are using Python to access your database, I would suggest you use the Django model system. There are many similar APIs, both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk directly to the database with SQL.
From the example link:
# Model definition
from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __unicode__(self):
        return self.name
Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex: you pass around a query object, and you can add filters/sort elements to it. When you are finally ready to use the query, Django creates a SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
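For example (a sketch using the same Blog model; nothing touches the database until the loop starts iterating):

from mysite.blog.models import Blog

qs = Blog.objects.all()
qs = qs.filter(name__contains="Beatles")   # add a filter
qs = qs.exclude(tagline__isnull=True)      # add another condition
qs = qs.order_by("-name")                  # add a sort element

# Only now does Django emit a single SQL statement reflecting
# every adjustment made to the query object above.
for blog in qs:
    print(blog.name)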
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django can generate an automatic admin site from your models.
Well, the options have to map to something.
Concatenating a SQL query string isn't a problem if you still use parameters for the option values.
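For instance, a sketch of that mapping in Python (option names and fragments are made up): each UI option maps to a fixed, code-owned SQL fragment that gets concatenated, while the values stay parameterized.

ORDERINGS = {
    "newest": "ORDER BY created_at DESC",  # fixed, code-owned fragments
    "votes": "ORDER BY vote_count DESC",
}

def posts_query(tag, ordering):
    sql = "SELECT id, title FROM posts WHERE tag = ?"
    # Concatenating a whitelisted fragment is safe; the value is bound.
    sql += " " + ORDERINGS.get(ordering, ORDERINGS["newest"])
    return sql, (tag,)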