What is the best way to return most recent posts from Fauna DB for a blog, etc - faunadb

I am building a simple custom headless CMS with React to save data in Fauna via API Gateway and Lambda. To list my posts in the admin, I would like to get the data from my collection sorted by a date value.
When I create a new index to do this, I expected to get the same data/structure that is in the default index that is created. However, what I've found is that it returns only the data explicitly defined in the index without any keys to describe what values are present.
I asked this question without the context before and got a great response, but I would like to know more generally what the best and most performant practice would be to accomplish this in Fauna. I have not discovered a way to sort data outside of creating an index.
This default behavior is counter intuitive to me. It seems there would be a simpler way to return the data in reversed order. I would love to know why this is the default behavior. I'm sure there are good reasons for it rationalized by folks who are much smarter than I am. Thanks for any guidance.

there is indeed a very good reason for this. In contrast to many other databases, FaunaDB took the decisions not to allow you to do inefficient things in order to save you from unpleasant surprises. When you sort data in a database, it either uses an index typically one of two things happens:
There is an index defined because you knew you were going to do that, you care about performance and you thought about it. The index is used for the sort.
You forgot about the index, or your data is so small that you didn't care, the query engine is going to still do this yet is going to do this in a horribly inefficient way.
If you end up in the second case where you forgot and you do this on massive data, then you might have a performance problem, if that database is auto-scaling and pay-as-you-go than ok.. no problem.. the database should be be able to handle but since it's pay-as-you-go, it'll be expensive.
The same counts for sorting. Maybe a database has a clever way to reverse a sort order but it might just as well not use the index and do something super inefficient by running over that complete dataset until to the end and start reading in reverse order.
To avoid nasty pricing surprises, most things that you can do that requires an index to do efficiently will not be possible without defining that index in advance.
Is that the answer you were looking for?

Related

What do I need to know about databases in order to create a quality Django app?

I'm trying to optimize my site and found this nice little Django doc:
Database Access Optimization, which suggests profiling followed by indexing and the selection of proper fields as the starting point for database optimization.
Normally, the django docs explain things pretty well, even things that more experienced programmers might consider "obvious". Not so in this case. After no explanation of indexing, the doc goes on to say:
We will assume you have done the obvious things above.
Uhhh. Wait! What the heck is indexing?
Obviously I can figure out what indexing is via google, my question is: what is it that I need to know as far as database stuff goes in order to create a scalable website? What should I be aware of about the Django framework specifically? What other "obvious" things ought I know? Where can I learn them?
I'm looking to get pointed in a direction here. I don't need to learn anything and everything about SQL, I just want to be informed enough to build my app the right way.
Thanks in advance!
I encourage you to read all that the other answers suggest and whatever else you can find on the subject, because it's all good information to know and will make you a better programmer.
That said, one of the nice things about Django and other similar frameworks is that for the most part you don't have to know what's going on behind the scenes in the DB. Django adds indexes automatically for fields that need them. The encouragement to add more is based on the use cases of your app. If you continually query based on one particular field, you should ensure that that field is indexed. It might be already (if it's a foreign key, primary key, etc.), but other random fields typically aren't.
There's also various optimizations that are database client-specific. Django can't do much here because it's goal is to remain database independent. So, if you're using PostgreSQL, MySQL, whatever, read about optimizations and best practices concerning those particular clients.
Wikipedia database design, and database normalization http://en.wikipedia.org/wiki/Database_design, and http://en.wikipedia.org/wiki/Database_normalization are two very important concepts, in addition to indexing.
In addition to these, having a basic understanding of your database of choice is necessary. Being able to add users, set permissions, and create a database are key things that you should know.
Learning how to backup your data is also a crucial thing.
The list keeps getting longer, one should also be aware of the db relationships that django handles for you, OneToOne, ManyToMany, ManyToOne. https://docs.djangoproject.com/en/dev/topics/db/models/
The performance impact of JOINs shouldn't be ignored. Access model properties in django is so easy, but understanding that some of Foreign Key relationships could have huge performance impacts is something to consider too.
Once you have a basic understanding of these things you should be at a pretty good starting point for creating a non-trivial django app!
Wikipedia has a nice article about database indexes, they are similar(ish) to an index in a book i.e. lets you (the computer) find things faster because you just look at the index (probably a very bad example :-)
As for performance there are many things you can do and presumably as it is a very detailed subject in itself, and is something that is particular to each RDBMS then it would be distracting / irrelevant for them (django) to go into great detail. Best thing is really to google performance tips for your particular RDBMS. There are some general tips such as indexing, limiting queries to only return the required data etc.
I think one of the main things is a good design, sticking as much as possible to Normal Form and in general actually taking your database into consideration before programming your models etc (which clearly you seem to be doing). Naming conventions are also a big plus, remembering explicit is better then implicit :-)
To summarise:
Learn/understand the fundamentals such as the relational model
Decide on a naming convention
Design your database perhaps using an ERM tool
Prefer surrogate ID's
Use the correct data type of minimum possible size
Use indexes appropriately and don't over index
Avoid unecessary/over querying
Prioritise security and stability over raw performance
Once you have an up and running database 'tune' the database analysing/profiling settings, queries, design etc
Backup and archive regularly - cron
Hang out here :-)
If required advance into replication (master/slave - django supports this quite well too)
Consider upgrading your hardware
Don't get too hung up about it

Hibernate Search, Entities, and SQL VIEWs

I have a table that maintains rows of products that are for sale (tbl_products) using PostgreSQL 9.1. There are also several other tables that maintain ratings on the items, comments, etc. We're using JPA/Hibernate for ORM in a Seam application, and have the appropriate entities wired up properly. In an effort to provide better listings of these items, I've created a SQL VIEW (v_product_summary) that aggregates some of the basic product data (name, description, price, etc.) with data from the other tables (number of comments, average rating, etc.). This provides a nice concise view of the data, and I've created a corresponding JPA entity object that provides read-only access to the view data.
Everything is working fine with respect to running JPQL queries on either the Product object (tbl_products) or the ProductSummary (v_product_summary) objects. However, we'd like to provide a richer search experience using Hibernate Search and Lucene. The issue we're running into, though, is how do we query the ProductSummary objects using Hibernate Search? They're not indexed upon creation, because they're never really "created". They're obtained as read-only objects from the v_product_summary VIEW. An index entry is only created on Product when it's persisted to the database, and not for ProductSummary since it's never persisted.
Our thought is that we should be able to:
Persist our Product object to the database
Immediately query the corresponding ProductSummary object using the product's ID
Manually update the Hibernate Search index for the ProductSummary object
Is this possible? Is this even a good idea? I can see there will be a performance impact since we're executing a query for the ProductSummary object every time a new Product is persisted. However, products are not added to the database at a high volume, so I don't think this will be a huge issue.
We'd really like to find a better, more efficient way of accomplishing this. Can anyone provide any tips or recommendations? If we do go the route of updating the search index manually, is that even doable? Can anyone provide a resource explaining how we can add a single ProductSummary to the index?
Any help you can provide is GREATLY appreciated.
If I understand the question correctly, you're trying to do the normal thing of persisting an object and indexing it at that point, but you're dealing with 2 separate objects.
I find myself doing kludgey things in Hibernate all the time, it feels like it almost demands it of you. Yes, there'd be a performance impact, and as you say, it is probably not a big deal, so it might be worth profiling.
A part of me remembers there's a way you can refresh the object upon write, and wonders if there's a way you can wrap the Product and the ProductSummary and tweak the mapping so that you read part and write part of it (waves hands on syntax and mapping). Or create a Hibernate-facing object with readonly fields that can be split and merged into your two objects. I don't know if your design allows Hibernate-only objects, it's a common idiom in my system.
Either way could be useful if you had a lot of objects in this situation, if this is the only object you're searching in this way, your 3 steps look much clearer.
As for the syntax for adding an object manually, I think you're looking for something like this, after your fetch:
FullTextSession textSession = Search.getFullTextSession(session);
textSession.index(myProductSummary);
Was that all you wanted?
Since you are using postgresql, you could insert to the view and use a rule to redirect the insert to the appropriate table.
A postgresql rule is a way to change the query just before it gets executed. I used it in an application which needed a change in schema but required the old queries to still work for a little while.
You can check out the documentation about rules on insert queries on the postgresql site
Since you'll be inserting and updating to the view, hibernate search will work as usual.
EDIT
An easier strategy. You could insert and update ProductSummary when doing so on Product and tell PostgreSQL to ignore the inserts, updates and deletes on the view.
On the database side"
create RULE dontinsert AS ON insert to v_product_summary do instead nothing
create RULE dontupdate AS ON update to v_product_summary do instead nothing
create RULE dontdelete AS ON delete to v_product_summary do instead nothing
But I guess you will need to hack a little, since the jdbc call executeUpdate will return 0, and hibernate will probably freak.
Technically I think this would be possible, but I think your entire efficiency dilemma might be better solved using something like memcached, therefore making performance less of an issue, and perhaps increasing code maintainability depending on how you currently have it implemented at statement level. By updating the search index manually, do you mean the database index? That is not recommended, and I'm not sure if it's even doable. Why not index them on creation?

Storing multiple choice values in database

Say I offer user to check off languages she speaks and store it in a db. Important side note, I will not search db for any of those values, as I will have some separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with main member table when showing results on UI, I was thinking of storing languages for a user in the main table, having them comma separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
You might not be missing anything now, but when you're requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or the other will definitely come back and say.. "hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / post back. Which results in more code client side.
You lose all pretenses of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the sql going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't then you have to code deploy for each little change. bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for it's profile page?
I generally stay away at the solution you described, you asking for troubles when you store relational data in such fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the necessary performance necessary for a particular problem domain - one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view to this debate, look at wordpress. Tables are chocked full of delimited data, and it's a great, simple platform.

BASIC Object-Relation Mapping question asked by a noob

I understand that, in the interest of efficiency, when you query the database you should only return the columns that are needed and no more.
But given that I like to use objects to store the query result, this leaves me in a dilemma:
If I only retrieve the column values that I need in a particular situation, I can only partially populate the object. This, I feel, leaves my object in a non-ideal state where only some of the properties and methods are available. Later, if a situation arises where I would like to the reuse the object but find that the new situation requires a different but overlapping set of columns to be returned, I am faced with a choice.
Should I reuse the existing SQL and add to the list of selected columns the additional fields that are required by the new situation so that the same query and object mapping logic can be reused for both? Or should I create another method that results in the execution of a slightly different SQL which results in the populating of only those object properties that were returned by the 2nd query?
I strongly suspect that there is no magic answer and that the answer really "depends" on the situation but I guess I am looking for general advice. In general, my approach has been to either return all columns from the queried table or to add to the query the additional columns as they are needed but to reuse the same SQL (and mapping code) that is, until performance becomes a concern. In general, I find that unless you are retrieving a large number of row - and I usually am not - that the cost of adding additional columns to the output does not have a noticable effect on performance and that the savings in development time and the simplified API that result are a good trade off.
But how do you deal with this situation when performance does become a factor? Do you create methods like
Employees.GetPersonalInfo
Employees.GetLittleMorePersonlInfoButMinusSalary
etc, etc etc
Or do you somehow end up creating an API where the user of your API has to specify which columns/properties he wants populated/returned, thereby adding complexity and making your API less friendly/easy to use?
Let's say you want to get Employee info. How many objects would typically be involved?
1) an Employee object
2) An Employees collection object containing one Employee object for each Employee row returned
3) An object, such as EmployeeQueries that returns contains methods such as "GetHiredThisWeek" which returns an Employees collection of 0 or more records.
I realize all of this is very subjective, but I am looking for suggestions on what you have found works best for you.
I would say make your application correct first, then worry about performance in this case.
You could be optimizing away your queries only to realize you won't use that query anyway. Create the most generalized queries that your entire app can use, and then as you are confident things are working properly, look for problem areas if needed.
It is likely that you won't have a great need for huge performance up front. Some people say the lazy programmers are the best programmers. Don't over-complicate things up front, make a single Employee object.
If you find a need to optimize, you'll create a method/class, or however your ORM library does it. This should be an exception to the rule; only do it if you have reason to do so.
...the cost of adding additional columns to the output does not have a noticable effect on performance...
Right. I don't quite understand what "new situation" could arise, but either way, it would be a much better idea (IMO) to get all the columns rather than run multiple queries. There isn't much of a performance penalty at all for getting more columns than you need (although the queries will take more RAM, but that shouldn't be a big issue; besides, hardware is cheap). Also, you'd save yourself quite a bit of development time.
As for the second part of your question, it's really up to you. As an example, Rails takes more of a "usability first, performance last" approach, but that may not be what you want. It just depends on your needs. If you're willing to sacrifice a little usability for performance, by all means, go for it. I would.
If you are using your Objects in a "row at a time" CRUD type application, then, by all means copy all the columns into your object, the extra overhead is minimal, and you object becomes truly re-usable for any program wanting row access to the table.
However if your SQL is doing a complex join or returning a large set of rows, then request precisely and only what you want. You get two performance penalties here, one handling each column each time will eat up cpu for no benefit, and, two most DBMS systems have a bag of tricks for optimising queries (such as index only access) which can only be used if you specify precisely which columns you want.
There is no reuse issue in most of these cases as scan/search processes tend to very specific to a particular use case.

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
with 10K to 100K rows, number 1 is the clear winner to me. If it was <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different where all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look on how you could optimize it using caching. (Pre optimization is the root of all evil, Dijkstra once said).
Also, remember that if you would choose option 3, you'll be sending the complete table-contents over the network as well. This also has an impact on performance .
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
First of all, let us dismiss #2. Searching tables is data servers reason for existence, and they will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say 'filter the output as needed" without saying where that filter is been done. If it's in the application code as in #2, than, as with #2, than you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and do not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
if you do this:
SELECT * FROM users;
mysql should perform two queries: one to know fields in the table and another to bring back the data you asked for.
doing
SELECT id, email, password FROM users;
mysql only reach the data since fields are explicit.
about limits: always ss best query the quantity of rows you will need, no more no less. more data means more time to drive it