Multiple IDs in Conceptual Search - Is it an AND operation or an OR operation? - concept-insights

When I specify multiple IDs as query parameters in my conceptual search, would the results have only those documents which refer conceptually to all of the searched IDs? Or will it have documents that refer conceptually to any one of the IDs?
Thanks
Vipin

The intended behavior of passing multiple ids to the /conceptual_search query is a logical AND, so the search tries to find documents that are related to all the ids listed in the query.
The OR behavior cannot be performed through a single query, but can be emulated by performing separate queries to the individual ids, followed by a merge (with proper sorting) of the results on the client side.
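A minimal sketch of that client-side merge, assuming each per-id query has already returned a list of results. The `id` and `score` field names are illustrative assumptions, not the service's actual response schema, and keeping the highest score per document is just one reasonable merge policy.

```python
from typing import Dict, Iterable, List


def merge_or_results(result_lists: Iterable[List[dict]]) -> List[dict]:
    """Merge per-id result lists into one OR-style ranking.

    Each result is assumed to look like {"id": ..., "score": ...};
    these field names are hypothetical.
    """
    best: Dict[str, dict] = {}
    for results in result_lists:
        for doc in results:
            current = best.get(doc["id"])
            # Keep the highest score for a document returned by several queries.
            if current is None or doc["score"] > current["score"]:
                best[doc["id"]] = doc
    # Sort by descending score so the merged list is ranked like one query.
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)


# Stand-ins for the results of two separate single-id queries.
results_for_id_a = [{"id": "doc1", "score": 0.9}, {"id": "doc2", "score": 0.4}]
results_for_id_b = [{"id": "doc2", "score": 0.7}, {"id": "doc3", "score": 0.5}]

merged = merge_or_results([results_for_id_a, results_for_id_b])
print([d["id"] for d in merged])
```

Whether scores from different queries are directly comparable depends on the service, so the sort key may need adjusting.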

Related

Searching for a large number of IDs in MongoDB

Thanks for looking at my query. I have 20k+ unique identification IDs provided by a client, and I want to look up all of them in MongoDB using a single query. I tried $in, but it does not seem feasible to put all 20k IDs into $in and search. Is there a better way of achieving this?
If the id field is indexed, an $in query should be very fast, but I don't think it is a good idea to run a query with 20k IDs at one time, as it may consume quite a lot of resources such as memory. You can split the IDs into multiple groups of a reasonable size and run the queries separately; you can still run those queries in parallel at the application level.
Consider importing your 20k+ IDs into a collection (say, using mongoimport). Then perform a $lookup from your root collection to that search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with the original operation that required $in.
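The batching idea can be sketched roughly as follows. The `run_query` callable, the batch size, and the worker count are assumptions for illustration; `fake_query` is a stand-in so the example runs without a database.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def find_in_batches(ids: Sequence, run_query: Callable[[Sequence], List[dict]],
                    batch_size: int = 1000, workers: int = 4) -> List[dict]:
    """Run one query per batch in parallel and concatenate the results."""
    batches = list(chunked(ids, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the batches.
        per_batch = pool.map(run_query, batches)
        return [doc for docs in per_batch for doc in docs]


# Stand-in for a real query. With PyMongo this might be something like:
#   lambda batch: list(collection.find({"id": {"$in": list(batch)}}))
def fake_query(batch: Sequence) -> List[dict]:
    return [{"id": i} for i in batch if i % 2 == 0]


found = find_in_batches(list(range(10)), fake_query, batch_size=3)
print(found)
```

A suitable batch size depends on document size and server limits, so it is worth measuring rather than guessing.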
Here is Mongo playground for your reference.

How to retrieve identical entries from two different queries which are based on the same table

I think I got a knot in my line of thought, surely you can untie it.
Basically I have two working queries which are based on the same table and result in an identical structure (same as the source table). They are simply two different kinds of row filters. Now I would like to "stack" these filters, meaning that I want to retrieve all the entries which are in both query a and query b.
Why do I want that?
Our club is structured in several local groups and I need to hand different kinds of lists (e.g. members with email-entry) to these groups. In this example I would have a query "groupA" and a query "newsletter". Another could be "groupB" and "activemember", but also "groupB" and "newsletter". Unfortunately each query is based on a set of conditions, which imho would be stored best in a single query instead of copying the conditions several times to different queries (in case something changes).
Judging from the Venn diagrams, I suppose I need to use INNER JOIN but could not get it to work, neither with the LibreOffice Base query assistant nor with SQL code. I tried this:
SELECT groupA.*
FROM groupA
INNER JOIN newsletter
ON groupA.memberID = newsletter.memberID
The error code says: Cannot be in ORDER BY clause in statement
I suppose that the problem comes from the fact that both queries are based on the same table.
Maybe there is an even easier way of nesting queries?
I am hoping for something like
SELECT * FROM groupA
WHERE groupA.memberID = newsletter.memberID
Thank you and sorry if this already has a duplicate, I just could not find the right search terms.
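One possible way to express the "stacked" filter is an IN subquery against the second query. The sketch below uses SQLite views as stand-ins for the two saved LibreOffice Base queries; the members table, its columns, and the sample rows are all hypothetical, and the SQL dialect in Base may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (memberID INTEGER PRIMARY KEY, name TEXT,
                          local_group TEXT, email TEXT);
    INSERT INTO members VALUES
        (1, 'Ann', 'A', 'ann@example.org'),
        (2, 'Bob', 'A', NULL),
        (3, 'Cid', 'B', 'cid@example.org');
    -- Each saved filter keeps its conditions in exactly one place.
    CREATE VIEW groupA AS SELECT * FROM members WHERE local_group = 'A';
    CREATE VIEW newsletter AS SELECT * FROM members WHERE email IS NOT NULL;
""")

# Rows that pass both filters: in groupA AND in newsletter.
rows = conn.execute("""
    SELECT * FROM groupA
    WHERE memberID IN (SELECT memberID FROM newsletter)
    ORDER BY memberID
""").fetchall()
print(rows)
```

Because each condition set lives in a single view, changing a filter later only requires editing that one definition.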

How (and where) should I combine one-to-many relationships?

I have a user table, and then a number of dependent tables with a one to many relationship
e.g. an email table, an address table and a groups table. (i.e. one user can have multiple email addresses, physical addresses and can be a member of many groups)
Is it better to:
1. Join all these tables, and process the heap of data in code,
2. Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
3. Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well, in addition to the fact that grouping tends to be slow, especially on large datasets.
Option 3 is easiest to deal with but generates more queries overall. If your data set is small and you're not planning to scale much beyond your current needs, it's probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option, however: a middle-of-the-road approach which I use in my job, where we deal with a very similar situation. Instead of getting the related records for each row one at a time, use IN() to get all of the related records for your result set. Then loop in your code to match them to the appropriate record for display. If you cache search queries, you can cache that second query as well. It's only two queries and only one loop in the code (no parsing; use hashes to relate things by their key).
Personally, assuming my table indexes were up to scratch, I'd go with a table join, get all the data out in one go, and then process that to end up with a nested data structure. This way you're playing to each system's strengths.
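A rough sketch of the two-query approach, using an in-memory SQLite database. The users and emails tables and their contents are made up for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emails (user_id INTEGER, address TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
    INSERT INTO emails VALUES (1, 'ann@a.example'), (1, 'ann@b.example'),
                              (3, 'cid@a.example');
""")

# Query 1: the users in the current result set.
users = conn.execute(
    "SELECT id, name FROM users WHERE id <= 2 ORDER BY id").fetchall()
user_ids = [uid for uid, _ in users]

# Query 2: all related rows for those users in one IN() query.
emails_by_user = defaultdict(list)
if user_ids:
    placeholders = ", ".join("?" for _ in user_ids)
    query = ("SELECT user_id, address FROM emails "
             f"WHERE user_id IN ({placeholders}) ORDER BY address")
    for user_id, address in conn.execute(query, user_ids):
        # The dict plays the role of the hash keyed by the parent id.
        emails_by_user[user_id].append(address)

# One loop to attach the related rows to each parent record.
combined = [{"id": uid, "name": name, "emails": emails_by_user.get(uid, [])}
            for uid, name in users]
print(combined)
```

The same pattern repeats for each dependent table (addresses, groups), adding one query per table rather than one per user.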
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing. If there are extra delimiter characters in your GROUP_CONCAT'd values, this can be problematic. My preferred method is to put the GROUPed BY field into a $holder during the output loop. Compare that field to the $holder each time through and change your output accordingly.
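The holder technique might look roughly like this when the joined rows arrive sorted by the grouping field. The row shape (user, email) is an assumption for illustration.

```python
def group_rows(rows):
    """Group (user, email) rows that are already sorted by user.

    The rows are assumed to come from a JOIN with ORDER BY on the user field.
    """
    grouped = []
    holder = object()  # sentinel that never equals a real key
    for user, email in rows:
        if user != holder:
            # The grouping field changed, so start a new output record.
            grouped.append((user, []))
            holder = user
        if email is not None:  # a LEFT JOIN yields None for users with no email
            grouped[-1][1].append(email)
    return grouped


# The second address contains a comma, which would break a naive split
# of a comma-delimited GROUP_CONCAT value.
rows = [("ann", "ann@a.example"), ("ann", "ann,b@b.example"), ("bob", None)]
print(group_rows(rows))
```

Since values are never concatenated, delimiter characters inside them need no special handling.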

DB Design question: Tree (one table) vs. Two tables for tweets and retweets?

I've heard that on stackoverflow questions and answers are stored in the same DB table.
Suppose you were to build a Twitter-like service that only allows one level of commenting, i.e. one tweet and then comments/replies to that tweet, but no re-comments or re-replies.
Would you use two tables for tweets and retweets, or just one table where the field parent_tweet_id is optional?
I know this is an open question, but what are some advantages of either solution?
Retweets are still normal tweets as well. So one table. You wouldn't want to have to load from two tables to include the retweets.
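A minimal sketch of the one-table design, assuming a nullable parent_tweet_id column as described in the question. The other column names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        parent_tweet_id INTEGER REFERENCES tweets(id)  -- NULL for top-level
    );
    INSERT INTO tweets (id, body, parent_tweet_id) VALUES
        (1, 'first tweet', NULL),
        (2, 'a reply', 1),
        (3, 'second tweet', NULL);
""")

# Top-level tweets are the rows without a parent.
top_level = conn.execute(
    "SELECT id FROM tweets WHERE parent_tweet_id IS NULL ORDER BY id").fetchall()

# Replies to one tweet come from the same table, filtered by parent.
replies = conn.execute(
    "SELECT id FROM tweets WHERE parent_tweet_id = ? ORDER BY id", (1,)).fetchall()

print(top_level, replies)
```

The one-level limit is not enforced by this schema; it would need a constraint or application-level check.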
Advantages of one table:
You can search through all tweets and comments in a simple way.
You can use one identity column easily for all posts.
Every post has the same set of columns.
Advantages of two tables:
If it's more common to search or display only top-level tweets instead of tweets + comments, the table of tweets is that much smaller without comments.
Two tables can have different sets of columns, so if there are columns meaningful for one type of post but not the other, you can put these columns in the respective table without having to leave them null when not applicable.
Indexes can also be different on two tables, so if you have the need to search comments in different ways, you can make indexes specialized to that task.
In short, it depends on how you use the data, not only how it's structured. You haven't said much about the operations you need to do with the data.
Like all design questions, it depends.
I don't normally like to mix concepts in a single table. I find it can quickly damage the conceptual integrity of the database schema. For example, I would not put posts and replies in the same table because they are different entities.

How to create dynamic and safe queries

A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the parameter substitution API to avoid SQL injection, since the query elements will depend on what the user decided to include in the query. I can't see how else to build this query other than by appending strings.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
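The rule above (the user supplies values, the code supplies all syntax) can be sketched in a few lines. This is not how Zend_Db_Select works internally; it is only an illustration, and the posts table, the filter names, and the `build_query` helper are hypothetical.

```python
import sqlite3

# Whitelist: user-facing filter names mapped to SQL fragments written by
# the developer. User input never becomes part of the SQL text.
ALLOWED_FILTERS = {
    "tag": "tag = ?",
    "author": "author = ?",
}


def build_query(filters):
    """Return (sql, params) for the chosen filters, rejecting unknown names."""
    clauses, params = [], []
    for name, value in filters.items():
        if name not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {name}")
        clauses.append(ALLOWED_FILTERS[name])
        params.append(value)  # values travel separately as query parameters
    sql = "SELECT id FROM posts"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY id", params


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, tag TEXT, author TEXT);
    INSERT INTO posts VALUES (1, 'sql', 'ann'), (2, 'sql', 'bob'), (3, 'php', 'ann');
""")

sql, params = build_query({"tag": "sql", "author": "ann"})
rows = conn.execute(sql, params).fetchall()
print(sql, rows)
```

The query text is still assembled by string concatenation, but only from fragments the code owns, so a malicious value is treated as data rather than as SQL.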
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value must belong to a set of values of arbitrary cardinality do not need to be dynamic. Consider using either the instr function or a special filtering table that you join against. This approach can easily be extended to multiple columns as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
If you were using Python to access your database, I would suggest you use the Django model system. There are many similar APIs both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk directly to the database with SQL.
From the example link:
# Model definition
from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __str__(self):  # was __unicode__ in the original Python 2 example
        return self.name

# Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries can get much more complex: you pass around a query object and you can add filters and sort elements to it. When you are finally ready to use the query, Django creates an SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django inspects your models and creates an automatic admin site out of them.
Well the options have to map to something.
A SQL query string CONCAT isn't a problem if you still use parameters for the options.