Preserving data integrity in Drupal

I happened to develop a module in Drupal and, due to some apparent Views limitations, had to use custom SQL. This led to some problems with node revisions, and I came to the conclusion that in Drupal it's best to use its native methods for working with any data. Otherwise, data integrity problems may arise.
And even if you want to optimize SQL queries in Drupal, apparently this should be done only in rare cases, for real bottlenecks.
What are your experiences with this dilemma: direct SQL queries vs. Drupal modules/functions?

When updating data you should always use the Drupal default, even if you need to do other queries afterwards for custom tables etc. It is not obvious (without digging into the code) what Drupal does on various actions, and if you copy the code for an action and put it in your own function, you will have to watch for changes in core from then on.
One trick with Views which may help you: if Views gets you almost what you want, you can look at the query it generates, copy that and put it in your own code. This removes the rest of the Views overhead and can be a big performance boost.

Related

Best way to migrate data from Access to SQL Server

The problem
Ok, sorry that my question is somewhat abstract and subjective, but will try to make it as specific as possible. So, the situation I am in is simple - I am remaking a very old MS Access application on a new website using ASP.NET MVC. As currently the MVC site is using SQL Server 2008 (for many well known reasons) I need to find a way to migrate the tables AND the data, because the information in the old database will be used in the new application.
Alright, so far so good, however there are a few problems. The old application is written in a different language, meaning that I want to translate table names, field names, and all other names that are there to English. Furthermore, I will be making some changes on the models themselves (change the type of some fields, add additional fields to some tables, remove old unnecessary ones and more). So technically I'll be 'having my way' with everything.
Researched solutions
With those things in mind I researched ways to migrate data from an Access database to SQL Server. Of course, there is a lot of information on the matter; on Stack Overflow alone there are more than a few questions and solutions. So why am I struggling to find the answer? Well, I found a few solutions that will be sufficient to some extent (actually they will definitely solve my problems), but I am writing to ask if someone experienced has a better perspective on it than I do. Alright, the solutions and why I am still looking for advice (I'll be listing just a couple of the most common and popular ones that I found; many of the others share the same capabilities and/or results):
Upsize Wizard (Access) - this is a tool devised specifically for migrating tables and data from Access. It is my favourite one for the moment, as I find it fairly straightforward to work with and it provides good overall results. I was able to migrate the tables to SQL Server (along with the data, of course), which more or less is what I am intending to do. It is fast, and it seems to allow you to migrate indexes, primary keys and even, to my knowledge, foreign keys (table relationships). The downsides of this tool, however, are that it ignores your queries (which I honestly don't really need) and it doesn't provide a way to change the model, names or types of the properties of the tables you migrate. That is what I would prefer, because I will have to make more than a few changes (adding, renaming, deleting, etc.) and then continue with the development process of the application, which will lead to a few additional minor changes. Finally, I would need to apply everything (migration + all changes) on the production server, which overall is prone to mistakes as I will be doing it by hand (and there are more than a few tables).
SQL Server Migration Assistant (SSMA) - ok, this is a separate tool (not included in Access) with the same idea: to migrate data from Access to ... possibly everywhere, I haven't researched that. Overall it offers more functionality and customization than the Upsize Wizard, but of course it does so in a more complicated way. I haven't put enough effort into making a migration with this tool yet, as it involves a lot of installations and additional work, but according to my research it provides almost all (if not all) of the functionality I require. The downside, however, comes with the naming. As I mentioned, it allows you to apply changes to the tables, schema, fields, indexes, keys and probably everything else, but the articles advise that I change the names in Access first, as it will be easier and the migration process will run more smoothly. I am not allowed to make changes to the original Access database, as it will remain functional until the 'renewed' project is published and the data inside it is in use, so a mere copy of the file is a solution I am not particularly fond of, because I might lose new records. Also, I can't predict the changes I will want to make during development (as I said, I believe I will want/need to apply some additional changes later on, when I find 'weaknesses' in my data design), so I find it to be a somewhat half-baked solution.
Conclusion
The options presented, the way I see them, are two:
Use the Upsize Wizard to migrate the access tables, then write a script that applies the changes I want to make. Then in the development process add any additional changes to the script. When ready to publish on the production server, reapply the migration with the wizard, run the changes script and pray everything is fine.
Get more involved with the SSMA tool and try producing an updated version of the tables with the migration process. (See how efficient the renaming is and decide whether to use copied file to rename and then find a way to migrate only new records or do it all in the SSMA). Then again write a script for the changes that occur in the development process and re-do and apply it all on the production server when ready and then pray everything is fine.
Option I have not yet seen, apply it and then pray everything is fine.
I have researched the matter for a couple of days now, and found a few more solutions that I do not believe are better than the ones mentioned. However, I allow for the possibility that I am missing the 'big red X on the map', a practical and easy solution which seems like it was designed specifically for me (though I doubt that a little). Anyway, reducing all the madness that I have written so far to a few simple questions:
Is anyone aware if my conclusions are correct? I am leaning towards option one as it is easier to accomplish.
Has anyone experienced/found a better way to do that, or just found some 'logic-leaps' in my writing, as I am overthinking the entire thing a little and may be making some obvious miscalculation?
Very sorry for asking a trivial question, and one that involves decision making that may require a deeper understanding of my project and situation, yet I am working with rather sensitive data and would appreciate feedback, even if only to improve my confidence in the chosen approach.
There is one other tool/method you might want to consider that seems to cater to your specific needs better. This would be to use the data import/export tool that ships with SQL Server to do a complete copy of all data into a temporary location within SQL Server, and then write custom queries to reorganize the names and make the other changes you want. It is a bit more work, but you could use the end product as a seed method for your migrations ;) (if you are doing code first anyway)
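To make the "copy everything to a temporary location, then reshape with queries" idea concrete, here is a minimal sketch. It uses Python with an in-memory SQLite database purely so it can run anywhere; on SQL Server the same shape would be plain T-SQL. The table and column names (`staging_klienti`, `ime`, `customers`, ...) are made up for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical staging table, as an import tool might leave it:
    -- original (non-English) names, loosely typed columns.
    CREATE TABLE staging_klienti (ime TEXT, grad TEXT, saldo TEXT);
    INSERT INTO staging_klienti VALUES ('Ivan', 'Sofia', '12.50'),
                                       ('Maria', 'Varna', '7');

    -- Target table with the English names and types the new app wants.
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        city    TEXT,
        balance REAL NOT NULL DEFAULT 0
    );
""")

def reseed(conn):
    """Rebuild the target table from staging; meant to be rerun after a fresh import."""
    with conn:  # one transaction: either everything is copied or nothing is
        conn.execute("DELETE FROM customers")
        conn.execute("""
            INSERT INTO customers (name, city, balance)
            SELECT ime, grad, CAST(saldo AS REAL) FROM staging_klienti
        """)

reseed(conn)
reseed(conn)  # rerunning does not duplicate rows
rows = conn.execute(
    "SELECT name, city, balance FROM customers ORDER BY name"
).fetchall()
```

Because the reshaping lives in a script rather than in manual steps, it can be extended as the model changes during development and replayed against a fresh copy of the Access data on the production server.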

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat-file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON formatted files and some arrays in a simple project like this -- beyond my own pre-conceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat-files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions) and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gains in feature-set and performance.
Then again, if the CMS is designed right, then switching from flat files to an RDBMS should be as simple as using a different data access file.
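As a rough illustration of what "a different data access file" could look like, here is a minimal sketch in Python (the project itself is PHP, so treat this as language-neutral pseudocode that happens to run). The class names `JsonPageStore` and `SqlPageStore` and the `save`/`load` interface are hypothetical, and SQLite stands in for whatever RDBMS would actually be used.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

class JsonPageStore:
    """Stores each page as <slug>.json in a directory."""
    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, slug, page):
        # The slug is assumed to be validated already (no path separators).
        path = self.directory / f"{slug}.json"
        path.write_text(json.dumps(page), encoding="utf-8")

    def load(self, slug):
        path = self.directory / f"{slug}.json"
        return json.loads(path.read_text(encoding="utf-8"))

class SqlPageStore:
    """Same interface, backed by a single table."""
    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (slug TEXT PRIMARY KEY, body TEXT)"
        )

    def save(self, slug, page):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO pages VALUES (?, ?)",
                (slug, json.dumps(page)),
            )

    def load(self, slug):
        row = self.conn.execute(
            "SELECT body FROM pages WHERE slug = ?", (slug,)
        ).fetchone()
        if row is None:
            raise KeyError(slug)
        return json.loads(row[0])

def render_title(store, slug):
    # Application code only sees save()/load(), never the storage details.
    return store.load(slug)["title"]

page = {"title": "About us", "body": "<p>Hello</p>"}
file_store = JsonPageStore(tempfile.mkdtemp())
sql_store = SqlPageStore(sqlite3.connect(":memory:"))
for store in (file_store, sql_store):
    store.save("about", page)
titles = [render_title(store, "about") for store in (file_store, sql_store)]
```

As long as the rest of the CMS only talks to `save` and `load`, swapping the backend later is a one-line change where the store is constructed.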
Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.
What would be cool with a simple site that was fed via JSON and jQuery is that the site wouldn't need to reload on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (ex. http://localhost/#about)
The problem is that if they are editing the raw JSON file they can mess it up pretty quickly. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more involved than the site (though isn't that always the case with dynamic sites?)
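One way the admin tools could "ensure nothing breaks" is to validate the input and then write the file in a single atomic step, so a failed save never leaves a half-written page behind. A minimal sketch, again in Python rather than PHP; the `REQUIRED_FIELDS` schema and the `save_page` helper are assumptions made up for this example.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("title", "body")  # hypothetical page schema

def save_page(path, page):
    """Validate a page dict, then write it as JSON without leaving a partial file."""
    missing = [f for f in REQUIRED_FIELDS if not isinstance(page.get(f), str)]
    if missing:
        raise ValueError(f"missing or non-text fields: {missing}")
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(page, handle, ensure_ascii=False, indent=2)
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        os.unlink(tmp_path)
        raise

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "about.json")
save_page(target, {"title": "About", "body": "<p>Hi</p>"})
try:
    save_page(target, {"title": "Broken"})  # rejected, the good file is untouched
    rejected = False
except ValueError:
    rejected = True
with open(target, encoding="utf-8") as handle:
    stored = json.load(handle)
```

The validation here is deliberately tiny; a real admin tool would check whatever structure the page templates rely on.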
What are the predicted data sizes for the CMS?
A large reason for using an RDBMS is quick, specific access to large amounts of data. The data format might not be large, but if there is a lot of data, then an RDBMS might be better in the long run.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up to the answers so far in an answer (sorry if this is bad form).
There may be instances where this is on a shared host.
Though the JSON files can technically be edited, this won't be the case. The admin interface will be robust enough to do all of the creating/editing of pages.
The size for each install will be relatively small -- 1 - 2 admins, 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
Security will be a big issue -- any other options or suggestions on this specifically?
Well, isn't the problem that they are distrustful of any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, you could just present them with some very simple CMS (like CMS Made Simple, which I've heard is really simple and quick to learn); if they see that everything is easy, then maybe they just won't care what's behind it, whether it's a database or whatever!
They might listen to arguments like better maintenance, lower cost of maintenance, much better handover to another webmaster than with proprietary solutions (they are not dependent on you), etc.

Strategies for Fixing Problems / Tweaking NHibernate Apps in Production

First off, I am not a DBA, but I do work in an environment where DBAs do tune/make changes in the production database from time to time in ways that do not cause the need for an application rebuild/redeployment. Usually these changes consist of reworking indexes, changing procs, and sometimes changing the table structure in minor ways (usually abstracted from the app via procs).
Obviously, a team should strive to catch performance problems with NHibernate before they get into production, using things like NHProf, SQL Profiler, and load tests. That being said, are there certain strategies that can be used to allow some amount of tweaking once the code is built and running in production? Using stored procedures 100% of the time seems like it would allow the most flexibility for the DBAs, but obviously that would really kill the efficiency of NHibernate. From what I've read, updatable views (in SQL Server) don't really work that well with NHibernate either (this may or may not be true).
I've read quite a bit about NHibernate and experimented with it over the years, but I have never put it into practice in a production environment. I have yet to come across a set of "best practices" to allow for maximum tweaking once deployed.
As an NHibernate user, how are you and your team dealing with issues if they arise in production? My production environment is made up of ASP.NET apps and SQL server, but I don't think the answers need to be restricted to that platform.
I am in a similar position, and in order to keep our DBA happy, I did the following:
Wrote some of the queries in HQL, some others in SQL (especially those perf-sensitive)
Externalized those queries to files, one file per query.
When your app needs to execute one of these queries, it just loads the appropriate file, optionally running it through a pre-processor, and runs it.
With this approach, the DBA could theoretically tweak the queries just by modifying those files. That's quite similar to having stored procedures.
In practice, it's up to you to decide if you'll really give the DBA access to those files (if you catch my drift...)
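The one-file-per-query setup described above can be sketched roughly like this. It is in Python with SQLite only to keep it self-contained (not NHibernate/.NET); the `QueryCatalog` class, the file layout and the `orders` table are all invented for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

class QueryCatalog:
    """Loads query text from <directory>/<name>.sql."""
    def __init__(self, directory):
        self.directory = Path(directory)

    def get(self, name):
        # Only simple names are accepted, so a caller cannot escape the directory.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"invalid query name: {name!r}")
        # Read on every call so an edited file takes effect without a redeploy.
        return (self.directory / f"{name}.sql").read_text(encoding="utf-8")

# A query file the DBA could tune later (hypothetical name and schema).
query_dir = Path(tempfile.mkdtemp())
(query_dir / "orders_by_customer.sql").write_text(
    "SELECT id, total FROM orders WHERE customer = :customer ORDER BY id",
    encoding="utf-8",
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders (customer, total)
    VALUES ('acme', 10.0), ('zeta', 5.0), ('acme', 2.5);
""")

catalog = QueryCatalog(query_dir)
rows = conn.execute(
    catalog.get("orders_by_customer"), {"customer": "acme"}
).fetchall()
```

The contract between app and file is the parameter names going in and the column shape coming out; as long as those stay fixed, the query text can change freely.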
IMHO the DBA should just use the DBMS's profiling tools and report her findings back to the devs (as in "there's this query that is running 20 times/sec and does 10 joins. Is that really necessary? Can it be cached? Do you really need all those joins? Can we denormalize this?" etc.).
I'm not in the deploy phase yet, but on my current project I've come up against this already and my solution presently has been to replace my queries with stored procs. As long as the shape of the data coming back from the DB remains the same it's not a big deal. Yeah you do lose some of that agility you enjoyed during development but I'm not sure it's as bad as it initially sounds. You'll have a code push when you first make the change of course, and then from that point it's just proc changes.
You can use a profiler like NHProf to see the SQL queries executed, so you can show them to a DBA. This tool can also detect some problems such as N+1 selects.
Using a second-level cache can be useful: http://web.archive.org/web/20110514214657/http://blogs.hibernatingrhinos.com/nhibernate/archive/2008/11/09/first-and-second-level-caching-in-nhibernate.aspx

How do I 'refactor' SQL Queries?

I have several MS Access queries (in views and stored procedures) that I am converting to SQL Server 2000 (T-SQL). Due to Access's limitations regarding sub-queries, and/or the limitations of the original developer, many views have been created that function only as sub-queries for other views.
I don't have a clear business requirements spec, except to 'do what the Access application does', and half a page of notes on reports/CSV extracts, but the Access application doesn't even do what I suspect is required properly.
I, therefore, have to take a bottom up approach, and 'copy' the Access DB to T-SQL, where I would normally have a better understanding of requirements and take a top down approach, creating new queries to satisfy well defined requirements.
Is there a method I can follow in doing this? Do I spread it all out and spend a few days 'grokking' it, or do I continue just copying the Access views and adopt an evolutionary approach to optimising the querying?
Work out what Access does with the queries, and then use this knowledge to check that you've transferred them properly. Only once you've done this can you think about refactoring. I'd start with slow queries and then go from there: work out what indexes you need and then progressively rewrite. This way you can deliver as soon as you've proved that you moved everything successfully (even if it is potentially a bit slower). That's much better than not being able to deliver at all because problem X came along.
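Checking that a query was "transferred properly" can be partly automated by running the old and the new version against the same data and comparing the rows. A minimal sketch follows, using Python with SQLite as a stand-in for both Access and SQL Server; the `same_result` helper and the `sales` example are hypothetical.

```python
import sqlite3
from collections import Counter

def same_result(old_conn, old_sql, new_conn, new_sql):
    """Compare two queries as multisets of rows (order ignored, duplicates counted)."""
    old_rows = Counter(old_conn.execute(old_sql).fetchall())
    new_rows = Counter(new_conn.execute(new_sql).fetchall())
    # Also return the rows that only one side produced, to help debugging.
    return old_rows == new_rows, old_rows - new_rows, new_rows - old_rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10), ('north', 5), ('south', 7);

    -- Legacy style: a helper view that exists only to feed another view.
    CREATE VIEW v_region_totals AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    CREATE VIEW v_big_regions AS
        SELECT region, total FROM v_region_totals WHERE total > 8;
""")

legacy = "SELECT region, total FROM v_big_regions"
rewritten = """
    SELECT region, SUM(amount) AS total FROM sales
    GROUP BY region HAVING SUM(amount) > 8
"""
matches, missing, extra = same_result(conn, legacy, conn, rewritten)
```

In a real migration the two connections would point at the old and the new database, and the comparison would run over a representative copy of production data.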
I'd probably start with the Access database, exercise the queries in situ and see what the resultset is. Often you can understand what the query accomplishes and then work back to your own design to accomplish it. (To be thorough, you'll need to understand the intent pretty completely anyway.) And that sounds like the best statement of requirements you're going to get - "Just like it's implemented now."
Other than that, your approach is the best I can think of. Once they are in SQL Server, just start testing and grokking.
When you are dealing with a problem like this it's often helpful to keep things working as they are while you make incremental changes. This is better from a risk management perspective.
I'd concentrate on getting it working, then checking the database performance and optimizing performance problems. Then, as you add features and fix bugs, clean up the code that's hard to maintain. As you said, a sub-query is really very similar to a view. So if it's not broken you may not need to change it.
This depends on your timeline. If you have to get the project running absolutely as soon as possible (I know this is true for EVERY project, but if it's REALLY true for you), then yes, duplicate the functionality and infrastructure from Access then do your refactoring either later or as you go.
If you have SOME time you can dedicate to it, then refactoring it now will give you two things:
You'll be happier with the code, and it will (likely) perform better, since actual analysis was done rather than the transcoding equivalent of a copy-paste
You'll likely gain a greater understanding of what the true business rules are, since you'll almost certainly come across things that aren't in the spec (especially considering how you describe them)
I would recommend copying the views to SQL Server immediately, and then use its sophisticated tools to help you grok them.
For example, SQL Server can tell you what views, stored procedures, etc., rely on a particular view, so you can see from there whether the view is a one-off or if it's actually used in more than one place. It will help you determine which views are more important than others.

PostgreSQL performance monitoring tool

I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for some database performance optimization tool/technique.
Database optimization is usually a combination of two things:
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the number of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external, more efficient datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit counters) and doesn't need ACID semantics, you can also think about moving it out of the Postgres database to more efficient data stores.
Optimizing the query runtime is more tricky: this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model, or changing the fundamental approach the application takes when it comes to working with the database. See for example the "Pagination done the Postgres way" talk by Markus Winand on how to rethink the concept of pagination to make it more database efficient.
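For the in-application caching case, the idea can be sketched as a small time-to-live cache in front of the query. This is only an illustration in Python; `TTLCache`, `get_or_load` and `load_online_users` are made-up names, and a shared store such as memcached or redis would be used instead when several processes need the same cache.

```python
import time

class TTLCache:
    """Keeps values in memory and reloads them after ttl_seconds."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no database round trip
        value = loader()
        self._entries[key] = (now + self.ttl, value)
        return value

calls = []

def load_online_users():
    calls.append(1)  # stands in for a real database query
    return ["alice", "bob"]

# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_load("online_users", load_online_users)
cache.get_or_load("online_users", load_online_users)  # served from memory
fake_now[0] = 31.0
users = cache.get_or_load("online_users", load_online_users)  # expired, reloads
```

The trade-off is staleness: "who is online" may be up to 30 seconds old here, which is usually fine for this kind of data.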
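One common way to rethink pagination is keyset (or "seek") pagination: instead of `OFFSET n`, the next page is requested as "rows after the last one I saw", which lets the database use an index rather than count past skipped rows. A minimal sketch, assuming that is the kind of technique meant; it runs on SQLite via Python for portability, and the `posts` table and `page_after` helper are hypothetical. The SQL itself would look the same on Postgres apart from the `%s` placeholder style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 11)],
)

def page_after(conn, last_seen_id, page_size):
    """Return the next page of posts, newest first, after the given id."""
    if last_seen_id is None:
        sql = "SELECT id, title FROM posts ORDER BY id DESC LIMIT ?"
        args = (page_size,)
    else:
        # Seek directly to the position instead of skipping rows with OFFSET.
        sql = "SELECT id, title FROM posts WHERE id < ? ORDER BY id DESC LIMIT ?"
        args = (last_seen_id, page_size)
    return conn.execute(sql, args).fetchall()

first = page_after(conn, None, 3)
second = page_after(conn, first[-1][0], 3)
```

The cost is that you can only move to the next (or previous) page, not jump to "page 57", which is often an acceptable trade for feeds and listings.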
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgfouine, which has already been mentioned earlier in this discussion. It has since been replaced by pgbadger, which is written in a more friendly language, is much faster and is more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query logging enabled, which can cause a noticeable performance hit on the database or get you into disk space trouble. On top of that, parsing the log with the tool can take quite some time and won't give you up-to-date insights into what is going on in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database: pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality: tracking how often a given "normalized query" (query string minus all expression literals) has been run and how long it took in total. Because this is done while the query is actually run, it is very efficient; the measurable overhead was less than 5% in synthetic benchmarks.
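To illustrate what "normalized query" tracking means, here is a toy version in Python. It is not how the extensions are implemented (as far as I know they work inside the server rather than on raw text); it only shows the idea of stripping literals so that the same statement with different values is counted as one entry. `normalize` and `QueryStats` are made-up names, and the regexes are deliberately crude.

```python
import re
from collections import defaultdict

_STRING = re.compile(r"'(?:[^']|'')*'")      # 'text', including '' escapes
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone numeric literals

def normalize(sql):
    """Replace literals with ? and collapse whitespace."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split())

class QueryStats:
    def __init__(self):
        # normalized query -> [calls, total_ms]
        self._stats = defaultdict(lambda: [0, 0.0])

    def record(self, sql, duration_ms):
        entry = self._stats[normalize(sql)]
        entry[0] += 1
        entry[1] += duration_ms

    def slowest(self):
        """Entries as (query, calls, total_ms), highest total time first."""
        rows = ((q, calls, total) for q, (calls, total) in self._stats.items())
        return sorted(rows, key=lambda row: row[2], reverse=True)

stats = QueryStats()
stats.record("SELECT * FROM users WHERE id = 42", 1.5)
stats.record("SELECT * FROM users WHERE id = 7", 2.5)
stats.record("SELECT * FROM posts WHERE author = 'o''brien'", 9.0)
report = stats.slowest()
```

Sorting by total time rather than per-call time is what surfaces the cheap query that runs thousands of times, which is often the real bottleneck.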
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There's been work on a third extension trying to address this fact and offer nicer representation of the data called pg_statsinfo (along with pg_stats_reporter), but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem I started working on a commercial project which is focussed around pg_stat_statements and pg_stat_plans and augments the information collected by lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres monitoring area, I also started compiling a list at the Postgres Wiki, which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema simpler. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's only one of two programs I'm willing to use VMWare on my Mac to use.
Munin is quite simple yet effective for getting trends of how the database is evolving and performing over time. In the standard kit of Munin you can, among other things, monitor the size of the database, number of locks, number of connections, sequential scans, size of transaction log and long running queries.
It is easy to set up and to get started with, and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
Other than that, I'm as interested in the answers to this question as you are.
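The "explain, look for sequential scans, add an index, explain again" loop mentioned above looks roughly like this. Note the big assumption: the sketch runs on SQLite through Python so it is self-contained, and SQLite's `EXPLAIN QUERY PLAN` output differs from Postgres's `EXPLAIN` (Postgres prints "Seq Scan" / "Index Scan" nodes instead). The `visits` table and `plan` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, user_id INTEGER, url TEXT)"
)
conn.executemany(
    "INSERT INTO visits (user_id, url) VALUES (?, ?)",
    [(i % 50, f"/page/{i}") for i in range(1000)],
)

def plan(conn, sql, args=()):
    """Return the human-readable plan steps for a query."""
    # The fourth column of each EXPLAIN QUERY PLAN row is the description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, args)]

query = "SELECT url FROM visits WHERE user_id = ?"

before = plan(conn, query, (7,))   # full table scan: no usable index yet
conn.execute("CREATE INDEX idx_visits_user ON visits (user_id)")
after = plan(conn, query, (7,))    # now a search that uses idx_visits_user
```

On Postgres the same check would be `EXPLAIN SELECT ...` typed into psql, run once before and once after creating the index.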
Check out Lightning Admin, it has a GUI for capturing log statements, not perfect but works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.