Let's say I have a big Solr index that holds ~150M documents.
I also have 100,000 users, and each user has a set of documents that they saved.
My questions:
What is the best way to store those document IDs (the documents that each user saved)?
If I decide to store the IDs in Mongo or MySQL, what is the best way to allow the users to perform searches on their documents? That is, I would store only the IDs in Mongo/MySQL, while the actual information lives in Solr.
Thanks.
You can add a field username_s to each document that is being indexed. This field contains the username of the user who may access the document. You can also use a multivalued field of usernames if you would like to give more people access to the document.
Then in your backend you can add &fq=username_s:<User>. Even if there are 100 million documents indexed, only those that belong to the user are shown.
/core/select?q=*:*&fq=username_s:<User>
You could store all documents for all users in the same core; leave the field "id" blank and a unique id is automatically generated for you by Solr.
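As a minimal sketch of the backend side, here is how such a per-user filter query could be assembled. The core name `core`, the base URL, and the username are placeholder assumptions; only the `fq=username_s:...` parameter is the point.

```python
from urllib.parse import urlencode

def build_user_query(base_url: str, query: str, username: str) -> str:
    """Build a Solr select URL restricted to one user's documents.

    Assumes the index has a field `username_s` holding the owner, as
    described above. Base URL and core name are illustrative.
    """
    params = {
        "q": query,
        "fq": f"username_s:{username}",  # filter query: only this user's docs
        "wt": "json",
    }
    return f"{base_url}/core/select?{urlencode(params)}"

url = build_user_query("http://localhost:8983/solr", "*:*", "alice")
```

Because `fq` is a filter query, Solr can cache the per-user filter independently of the main query.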
Related
I have to migrate a complex TYPO3 v7.6.30 website to Drupal 8.
So far I have investigated how TYPO3's administration part works.
I've also been digging into the TYPO3 database to find the correct mapping pattern, but I just don't seem to be getting anywhere.
My question is whether there is a nice way to map/join all of the content with its images/files/categories, so I can get, row by row, all page content like:
title
description
text fields
images
documents
tables
...
So in the end I will end up with a joined table with all of the data for each page on a single row, which then I can map in the migration.
I need a smooth way to map the pages with their fields.
I need the same for users (haven't researched this one yet).
The same goes for the nesting of the pages, in order to recreate the menus in the new CMS.
Any help on this will be highly appreciated.
You need a detailed plan of the configuration and a good understanding of how TYPO3 works.
Here is a basic introduction:
All content is organized in records, and the main table is pages, which forms the page tree.
For nearly all records you have some common fields:
uid: unique identifier
pid: page ID (the 'page' in which the record is 'stored', important for editing; even pages are stored in pages to build the page tree)
title: name of the record
hidden, deleted, starttime, endtime, fe_group: for visibility
there are fields for
versioning and workspaces
language support
sorting
some records (especially tt_content) have type fields, which decide how the record is rendered and which of its fields are used
there are relations to files (which are represented by sys_file records, and other records like file metadata or categories).
Aside from the default content elements, where the data is stored in the tt_content record itself, you can have plugins which display other records (e.g. news, addresses, events, ...) or which get their data from another application or server.
You need to understand the complete configuration to capture everything.
What you might need is a special rendering of the pages.
That is doable with TYPO3: aside from the default HTML rendering you can define other page types where you can get the content in any form you define, e.g. XML, JSON, CSV, ...
This needs detailed knowledge of the individual TYPO3 configuration. So nobody can give you a full detailed picture of your installation.
And of course you need good knowledge of your Drupal target installation to answer the question 'what information should be stored where?'
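To make the pages/tt_content relationship concrete, here is a minimal sketch of the basic join, using an in-memory SQLite database as a stand-in for the TYPO3 MySQL schema. The columns are reduced to a handful; real TYPO3 tables have many more fields, and images would additionally go through sys_file_reference.

```python
import sqlite3

# Stand-in for the TYPO3 tables `pages` and `tt_content` (heavily reduced).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (uid INTEGER, pid INTEGER, title TEXT,
                    deleted INTEGER, hidden INTEGER);
CREATE TABLE tt_content (uid INTEGER, pid INTEGER, CType TEXT, header TEXT,
                         bodytext TEXT, deleted INTEGER, hidden INTEGER,
                         sorting INTEGER);
INSERT INTO pages VALUES (1, 0, 'Home', 0, 0);
INSERT INTO tt_content VALUES (10, 1, 'text', 'Welcome', 'Hello world', 0, 0, 256);
""")

# Content elements live on a page via tt_content.pid = pages.uid; the
# visibility flags must be respected or you will migrate deleted records.
rows = conn.execute("""
    SELECT p.uid, p.title, c.CType, c.header, c.bodytext
    FROM pages p
    JOIN tt_content c ON c.pid = p.uid
    WHERE p.deleted = 0 AND p.hidden = 0
      AND c.deleted = 0 AND c.hidden = 0
    ORDER BY p.uid, c.sorting
""").fetchall()
```

This only covers default content elements; plugin records and file relations need further joins specific to your installation's configuration.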
I am building a small social networking website, and I have a question regarding the database schema:
How should I store the posts (text) by a user?
I'll have a separate POST table and will link USERS table with it, through USERS_POST table.
But then, every time the posts on a user's profile are displayed, the system will have to search the entire USERS_POST table for the user's ID?
What else should I do?
Similarly how should I store the multiple places the user has worked or studied?
I understand it's broad but I am new to Database. :)
First, don't worry too much: start by making it work and see where you get performance problems. The database might be a lot quicker than you expect. Also, it is often much easier to see what the best solution is when you have an actual query that is too slow.
Regarding your design: if a post is never linked to more than one user, then forget the USERS_POST table and put the user id in the POST table. In either case an index on the user id would help (as in not having to read the whole table) when the database grows large.
Multiple places for a single user you would store in an additional table. For instance called USERS_PLACES, give it a column user_id to link it to USERS plus other columns for the data you wish to store per place.
BTW, in PostgreSQL you might want to keep all object names (tables, columns, ...) lowercase, because unless you take care to always quote them like "USERS", PostgreSQL will fold them to lowercase, which can be confusing.
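The suggested design could be sketched as follows, using SQLite in memory for illustration; table and column names are made up, and kept lowercase per the note above.

```python
import sqlite3

# user_id lives directly on the post table (no link table needed for a
# one-user-per-post relationship); an index avoids a full table scan when
# loading one user's posts. users_places holds the multiple work/study places.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY,
                    user_id INTEGER REFERENCES users(id),
                    body TEXT);
CREATE TABLE users_places (id INTEGER PRIMARY KEY,
                           user_id INTEGER REFERENCES users(id),
                           place TEXT, kind TEXT);
CREATE INDEX idx_posts_user_id ON posts(user_id);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO posts VALUES (1, 1, 'first post')")

# "all posts on a user's profile" becomes a simple indexed lookup:
posts = conn.execute("SELECT body FROM posts WHERE user_id = 1").fetchall()
```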
I know that, generally speaking, if a one-to-one relationship exists between two documents I should consider embedding one document within the other. I do, however, have a few scenarios where this doesn't feel right, primarily in scenarios where I need to query on properties of the embedded document. What I have done instead is to create a relationship between the ids (primary keys) of the two documents by using a convention.
For example, a User has a PasswordResetLog. The user and the log are represented by separate documents. If the user document id is 'users/123' then the corresponding password reset log document id is 'passwordresetlog/123'.
Since I almost always have access to the userId I can easily load documents associated with it in this manner.
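The convention itself is trivial to express; a small sketch (the collection names are the ones from the example above):

```python
def related_doc_id(user_id: str, collection: str) -> str:
    """Derive the id of a 1:1 related document from a user id by convention.

    E.g. 'users/123' -> 'passwordresetlog/123'.
    """
    numeric_part = user_id.split("/", 1)[1]
    return f"{collection}/{numeric_part}"
```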
My first question is: will this create fragmented indexes for the documents where I specifically set the ids by convention? The document ids are sequential, but I cannot always guarantee that they'll be created in sequential order.
My second question is: Instead of using this convention, should I just add a property UserId on each document that is in a one-to-one relationship with User, and add an index for this property?
I don't think this is a very obscure Lucene problem, but somehow I just don't seem to be able to find a good solution to it. I will use an example.
Let's say I am building a news articles website. Registered users can bookmark articles that they are interested in. I want to allow users to search only the articles that they have bookmarked. For the sake of the example, let's also assume that a user can potentially bookmark thousands of articles, and we have hundreds of thousands of users in our database. How do I build a scalable solution for this problem?
Thanks a lot!
This is a very typical Lucene problem, as it does not support joins. More specifically, there is no first-class support, and you have to find your way around it. I can suggest a few approaches:
You could have a database, which has users, articles and bookmarks tables (the latter would have foreign keys pointing to the first two). You would also have articles indexed in Lucene. When running a search against articles, you could write a Lucene Filter which would exclude all articles not bookmarked by the current user.
You could index all articles and bookmarks in Lucene - probably best if you do this using separate indices. Then you could run a query for bookmarks (to retrieve which articles current user has bookmarked) and then run another separate query for articles. Like in the previous example, you could use the results of the first query to exclude all other articles which are not bookmarked by the current user.
I personally prefer option #1, as this is a classical relational structure and databases are designed for exactly this purpose. With option #2 you would have to modify both the user storage and the Lucene index when a user gets deleted.
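Option #1 can be sketched as a two-step flow: fetch the bookmarked ids from the database, then restrict the full-text results to that set. Lucene's actual Filter API is Java; this Python sketch only models the data flow, with in-memory stand-ins for both the bookmarks table and the article index.

```python
def bookmarked_ids(db, user_id):
    # stand-in for: SELECT article_id FROM bookmarks WHERE user_id = ?
    return {aid for (uid, aid) in db if uid == user_id}

def search_articles(index, query):
    # stand-in for a Lucene full-text query over the article index
    return [doc_id for doc_id, text in index.items() if query in text]

def search_user_bookmarks(db, index, user_id, query):
    allowed = bookmarked_ids(db, user_id)  # the "filter" built from the DB
    return [d for d in search_articles(index, query) if d in allowed]

db = [(1, "a1"), (1, "a2"), (2, "a3")]
index = {"a1": "solr lucene search", "a2": "cooking tips", "a3": "lucene joins"}
results = search_user_bookmarks(db, index, 1, "lucene")
```

In a real Lucene Filter you would push the allowed-id set down into the query rather than post-filtering, so scoring and paging stay correct.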
Is there a way to store information about documents that are stored in Lucene such that I don't have to update the entire document to update certain attributes about the documents?
For instance, let's say I had a bunch of documents, and I wanted to update a permissions list of who was allowed to see the documents on a daily, or more frequent, basis. Would it be possible to update all the permissions each day without updating all the documents? I could do it by keeping track of exactly which permissions were added and removed, but I would rather just be able to take the final list of permissions and use that, rather than have to keep track of all the permission changes and post the entire documents to Lucene.
Updating individual fields is not implemented, see this related question. I agree with Karussell about not storing permissions in Solr, this seems more like a job for a RDBMS. Remember that Lucene indexes are really flat structures.
I think you will have to update the whole document, not just individual properties/fields.
For your problem I wouldn't store the permission stuff in Lucene/Solr. I would use a database to check whether a user is able to view a document. Alternatively, add the roles "admin", "default" and/or "anonymous" to every document in a multivalued field "role", and then, e.g., if a user is logged in as admin, filter all queries by the "admin" role.
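The role-based filter described above could be assembled like this; the field name "role" and the role values are the illustrative ones from the answer.

```python
from urllib.parse import urlencode

def role_filter(roles):
    """Build a Solr filter query over a multivalued 'role' field.

    A document matches if it carries any of the given roles.
    """
    clause = " OR ".join(f"role:{r}" for r in roles)
    return f"({clause})"

def build_query(base_url, q, roles):
    params = {"q": q, "fq": role_filter(roles)}
    return f"{base_url}/select?{urlencode(params)}"

fq = role_filter(["admin", "default"])
```

The coarser the roles, the cheaper this is; per-user permissions that change daily are exactly the case where a database check is the better fit.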