Is This a “Large” GraphViz Chart, and How to Fix it? - optimization

I started using GraphViz yesterday in order to visualize the relationships between some things—a project I’ve been wanting to tackle for quite a while now.
So far it’s mostly done, but there are a few things I’m struggling with: getting the chart to look good, and how long it takes to process the DOT file. Thank goodness I decided to use a computer to do it instead of drawing it by hand on paper!
It seems to me that my graph looks so cluttered and takes so long because there are so many entities and so many relationships in it (including many cyclical ones).
I’ve read of “large” graphs several times in the documentation and other resources on the Internet, but nothing that specifically indicates what is considered large, so I’m wondering if mine counts as large.
Simply put, I’ve got two groups of nodes (A and B), with relationships between items in A to items in B. Currently there are:
  123 entries in group A
  55 entries in group B
  278 relationships
  ? cyclical relationships
Plus, I’m also considering adding relationships between items in group A.
So does this count as “large”? I was under the impression that “large” was thousands of items/links. (Depending on some minor changes, the resulting image is about 8444x7463 pixels. I’ve included a—relatively huge—thumbnail.)
If it is large, are there any tips on making a messy, convoluted graph look good?
If it is not large, then why am I having trouble with it (I am currently using non-overlapped and neato)?
Thanks a lot.
Thumbnail of the graph: http://img340.imageshack.us/img340/5493/mygraph.png

I suggest you give Gephi a try, as I have found it easier to use than Graphviz.
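For reference, the setup described in the question (two groups of nodes, links from group A to group B, laid out by neato with overlap removal) could be written out as a DOT file roughly like this. It is only a sketch: the node names are made up, and the attribute values (`overlap=false`, `splines=true`, the two node shapes) are one plausible starting point, not a tuned recommendation.

```python
def quote(name):
    """Quote a node name as a DOT string, escaping embedded double quotes."""
    return '"' + name.replace('"', '\\"') + '"'


def node_list(names):
    """Render node names as a sequence of DOT node statements."""
    return " ".join(quote(name) + ";" for name in names)


def build_dot(group_a, group_b, edges):
    """Return DOT source for a two-group directed graph."""
    lines = [
        "digraph relationships {",
        "  layout=neato;",    # same layout engine as in the question
        "  overlap=false;",   # ask the engine to move nodes apart
        "  splines=true;",    # route edges around nodes where possible
        # Give each group its own shape so the two sets are easy to tell apart.
        "  { node [shape=box]; " + node_list(group_a) + " }",
        "  { node [shape=ellipse]; " + node_list(group_b) + " }",
    ]
    for source, target in edges:
        lines.append(f"  {quote(source)} -> {quote(target)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical data standing in for the 123 + 55 entries in the question.
dot_source = build_dot(["a1", "a2"], ["b1"], [("a1", "b1"), ("a2", "b1")])
print(dot_source)
```

Saved as mygraph.dot, the output could then be rendered with `neato -Tpng mygraph.dot -o mygraph.png`.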

Related

String interning with SQLAlchemy

I've been trying out various approaches to "string interning" in a database that's accessed primarily using SQLAlchemy ORM. I've tried a couple things, and so far I'm not really loving any of them. It seems like a common pattern, and I feel like I might be missing some obvious, elegant solution.
To elaborate: the situation is that my database (Postgres, if it matters) table is likely to contain many of the same strings, but they are still arbitrary, and not bounded in a way that a native enum type would be the right solution. I want to collect these strings in another table with an auto-incrementing PK and then reference them in the main table by FK. The goals here include both space savings and string "hygiene" (i.e. I'd like to be able to easily assess and track the growth of this string table.)
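To make the table layout concrete, here is a minimal sketch of it. Note the assumptions: it uses Python's built-in sqlite3 module rather than Postgres and the SQLAlchemy ORM, and the table names, column names and the `intern` helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per distinct string, with an auto-incrementing PK.
    CREATE TABLE interned_string (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        value TEXT NOT NULL UNIQUE
    );
    -- The main table stores only the FK, not the string itself.
    CREATE TABLE record (
        id       INTEGER PRIMARY KEY,
        label_id INTEGER NOT NULL REFERENCES interned_string(id)
    );
""")


def intern(conn, value):
    """Return the id for value, inserting it first if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO interned_string (value) VALUES (?)", (value,)
    )
    row = conn.execute(
        "SELECT id FROM interned_string WHERE value = ?", (value,)
    ).fetchone()
    return row[0]


for label in ["red", "green", "red", "red"]:
    conn.execute(
        "INSERT INTO record (label_id) VALUES (?)", (intern(conn, label),)
    )

# Tracking the growth of the string table is a single COUNT.
string_count = conn.execute("SELECT COUNT(*) FROM interned_string").fetchone()[0]
record_count = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
print(string_count, record_count)
```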
I've tried the obvious naive solution of creating a separate entity, but this seems to foist the mechanics of the string interning onto every consumer of the entity. That is, every consumer has to traverse the relationship to get the value, like this: obj.interned_property.value. And absent joinedload hints, it causes another database hit for every new access. (In general, I try to keep loading strategies out of the model itself, since different use cases often benefit from different loading strategies.) Adding a Python property to traverse the relationship is not a good approach because it can't participate in SQLAlchemy filtering/ordering operations.
I've tried using the AssociationProxy extension, but I've been generally disappointed with it. I discovered that AssociationProxy attributes don't follow the same metadata contract as other SA ORM attributes; they lack an info property, for instance. An info dictionary was relatively simple to graft on, but this was really just the first shoe to drop. After that, I discovered that you can't filter against them in a query (at least not with the LIKE operator). I've gotten to the point where I'm kinda sick of discovering the next thing that AssociationProxy attributes can't do.
The next thought I had was to do all the interning inside the database using triggers and updatable views, but that inherently hampers portability w/r/t database engine, and splits the logic between Python and PL/SQL which makes it harder for future developers coming into this code to figure out what's going on. And, it's a bunch of effort, so if I'm going to do it, I would like to feel more confident that it's the right way to go.
Anyway, it seems like this is a pretty common pattern, and I feel like someone must have figured out an elegant solution by now. So, I'd love to hear from someone who's been down this road before: what's the best way to handle string interning with SQLAlchemy?

How do I structure my database so that two tables that constitute the same "element" link to another?

I read up on database structuring and normalization and decided to remodel the database behind my learning thingie to reduce redundancy.
I have different types of entries that can be learned. Gap texts/cloze tests (one text, many gaps) and simple known-unknown (one question, one answer) types.
Now I'm in a bit of a pickle:
gaps need exactly the same columns in the user table as question-answer types
but they need fewer columns than question-answer types (all that info is in the clozetests table)
I'm wishing for a "magic" foreign key that can point both to the gap and the terms table. Of course their ids would overlap though. I don't like having both a term_id and gap_id in the user_terms; that seems inelegant (but it is the most elegant I can come up with after googling for a while, not knowing what name this pickle goes by).
I don't want a user_gaps analogue to user_terms, because then I'd be in the same pickle when it comes to the table user_terms_answers.
I put up this cardboard cutout collage of my schema. I didn't remove the stuff that isn't relevant for this question, but I can do that if anyone's confusion can be remedied like that. I think it looks super tidy already. Tidier than my mental concept of this at least.
Did I say any help would be greatly appreciated? Answerers might find themselves adulated for their wisdom.
Background story if you care, it's not really relevant to the question.
Before remodeling I had them all in one table (because I added the gap texts in a hurry), so that the gap texts were "normal" items without answers, while the gaps were items without questions. The application linked them together.
Edit
I added an answer after SO coughed up some helpful posts. I'm not yet 100% satisfied. I try to write views for common queries to this set up now and again I feel like I'll have to pull application logic for something that is database turf.
As mentioned in the comment, it is hard to answer without knowing the whole story. So, here is a story and a model to match. See if you can adapt this to your example.
A school of (foreign) languages offers exams for several levels of language proficiency. The school maintains many pre-made tests for each level of each language (LangLevelTestNo).
Each test contains several (many) questions. Each question can be simple or of the cloze-text type. Correct answers are stored for each simple question. Correct terms are stored for each gap of each cloze-text question.
A student can take an exam for a language level and is presented with one of the pre-made tests. For each student exam, an exam form is maintained which stores the student's answers for each question of the exam. Like a question, an answer may be simple or of the cloze-text type.
After I edited my question, Stack Overflow started suggesting the right related questions to me.
I knew this was a common problem, but I really couldn't find it, just couldn't come up with the right search terms, I guess.
The following threads address similar problems and I'll try to apply that logic to my own design. They all propose adding a higher-level description for (in my case terms and gaps) like items. That makes sense and reflects the logic behind my application.
Relation Database Design
Foreign Key on multiple columns in one of several tables
Foreign Key refering to primary key across multiple tables
And this good person illustrates how to retrieve the data once it's broken up across tables. He also clues me to the keyword class table inheritance, so now I know what to google.
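As a rough illustration of what class table inheritance could look like here: a shared items table holds one id per learnable thing, terms and gaps each reuse that id as their own primary key, and the user table needs only one ordinary foreign key. This is a sketch under assumptions, not my actual schema. It uses Python's sqlite3 module, and the column names and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    -- Supertype: one row per learnable item, whatever its kind.
    CREATE TABLE items (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL CHECK (kind IN ('term', 'gap'))
    );
    -- Subtypes share the supertype's id as their own primary key.
    CREATE TABLE terms (
        item_id  INTEGER PRIMARY KEY REFERENCES items(id),
        question TEXT NOT NULL,
        answer   TEXT NOT NULL
    );
    CREATE TABLE gaps (
        item_id      INTEGER PRIMARY KEY REFERENCES items(id),
        clozetest_id INTEGER NOT NULL,
        answer       TEXT NOT NULL
    );
    -- The "magic" foreign key is now a plain FK to items.
    CREATE TABLE user_items (
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL REFERENCES items(id),
        PRIMARY KEY (user_id, item_id)
    );
""")

conn.execute("INSERT INTO items (id, kind) VALUES (1, 'term'), (2, 'gap')")
conn.execute("INSERT INTO terms VALUES (1, 'Hund', 'dog')")
conn.execute("INSERT INTO gaps VALUES (2, 10, 'was')")
conn.execute("INSERT INTO user_items VALUES (7, 1), (7, 2)")

# Retrieve a user's items with the answer from whichever subtype applies.
rows = conn.execute("""
    SELECT ui.item_id, i.kind, COALESCE(t.answer, g.answer)
    FROM user_items AS ui
    JOIN items AS i ON i.id = ui.item_id
    LEFT JOIN terms AS t ON t.item_id = i.id
    LEFT JOIN gaps  AS g ON g.item_id = i.id
    WHERE ui.user_id = 7
    ORDER BY ui.item_id
""").fetchall()
print(rows)

# A link to a non-existent item is rejected by the foreign key.
try:
    conn.execute("INSERT INTO user_items VALUES (7, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

A table like user_terms_answers can reference items in the same way, so the problem does not repeat one level down.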
I'll post back with my edited schema once I've applied this. It does seem more elegant like this.
Edited schema

Using sqlite for very large merges and basic queries

I'm a new guy to databases, and I'm trying to figure out a good solution for dealing with large datasets. I mostly do statistical analyses using R, so I don't need a database as the backend of web pages or anything. My datasets are generally static - they are just big.
I was trying to do a simple left join of a ~10,000,000 record table on a ~1,400,000 table. The 1.4 m table had unique records. After churning for 3 hours, it quit on me. The query was specified correctly - I ran it limiting the retrievals to 1000 records and it returned exactly as I expected. Eventually, I found a way to split this up into 10 queries and it ran, but by this time, I was able to do that merge in R pretty quickly, without all the fancy calls to sqlite and indexing.
I've been looking to use databases because I thought they were faster/more effective for these basic data manipulations, but maybe I'm just overlooking something. In the above example, I had indexed the appropriate columns, and I'm surprised that sqlite could not handle it whilst R could.
Sorry if this question is a little foggy (I'm a little foggy on databases), but if anyone has any advice on something obvious I'm doing wrong that keeps me from taking advantage of the power of sqlite, that would be great. Or am I just expecting too much of it, and a 10 m x 1.4 m record merge is just too big to execute without breaking it up?
I would think that a database could outperform R in this respect?
thanks!
EXL
I am going through the same process. If you look through the questions I've asked recently, you may get some good pointers, or at least avoid a lot of the time I've wasted :). In short, here's what's been most helpful to me.
-- the RSQLite package
-- the RSQLite.extfuns package
-- the SQLite FAQ
I'm still a newbie, but in general, you should be using SQLite for subsetting data that is too large to bring in to RAM. I would think that if the data are small enough to handle in RAM, then you're better off using the native R tools for joins/subsets. If you find that you become more comfortable with SQL queries, then there is the sqldf package. Also, JD Long has a great discussion on using sqldf with large datasets.
I have to admit that I'm surprised that this has been a problem for you. SQLite has always worked well for me, at least speed-wise. However -- SQLite is easy because it is so flexible. SQLite can be dangerous because it is so flexible. SQLite tends to be very forgiving with data types. Sometimes this is an absolute god-send, when I don't want to spend a bunch of time tweaking things to perfection, but with great flexibility comes great responsibility.
I have noticed that I need to be careful moving data into SQLite. Text is easy. However, sometimes numbers get stored as text rather than numbers. Doing a JOIN on a column of numbers is faster than the same JOIN on a column of text. If your number columns are stored as text and then coerced into numbers for the comparison, you would lose most of the advantage of using an index.
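A quick way to check for this is SQLite's built-in typeof() function. The sketch below uses Python's sqlite3 module purely for illustration (the same SQL works from RSQLite), and the table and column names are made up. It shows the same text value ending up with a different storage type depending on the declared column type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE untyped (key)")        # no declared type
conn.execute("CREATE TABLE typed (key INTEGER)")  # INTEGER affinity

# The same text value goes into both tables.
conn.execute("INSERT INTO untyped VALUES ('42')")
conn.execute("INSERT INTO typed VALUES ('42')")

# Count the rows per storage type actually used in the column.
type_query = "SELECT typeof(key), COUNT(*) FROM {} GROUP BY typeof(key)"
untyped_types = conn.execute(type_query.format("untyped")).fetchall()
typed_types = conn.execute(type_query.format("typed")).fetchall()

print(untyped_types)  # the value stayed text
print(typed_types)    # the value was converted to an integer
```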
I don't know how you got your data into SQLite, so the first thing I would do is look at your table schemas and make sure they make sense. And while they may seem obvious, indexes can be tricky. Taking a look at the queries might also result in something useful.
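One way to look at both the schema and the index question at once is EXPLAIN QUERY PLAN, which reports whether SQLite intends to use an index for the join. A small sketch follows, again via Python's sqlite3 module and with invented table, column and index names standing in for the real ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big   (key INTEGER, payload TEXT);
    CREATE TABLE small (key INTEGER, value TEXT);
    -- The smaller table has unique records, so a unique index fits.
    CREATE UNIQUE INDEX idx_small_key ON small(key);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT big.payload, small.value
    FROM big LEFT JOIN small ON small.key = big.key
""").fetchall()

# The human-readable description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)
```

If the line for the joined table mentions the index name, the index is being used for the lookup; if it reports a full scan of that table for every outer row, that would go a long way towards explaining a join that runs for hours.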
Without being able to see the underlying structure and queries, answers to this question will be educated guesses.

How do you think while formulating Sql Queries. Is it an experience or a concept?

I have been working on SQL Server and front-end coding, and I have usually had problems formulating queries.
I understand most of the SQL concepts needed to formulate queries, but whenever some new functionality comes up that can be done with a SQL query, I usually fail to work it out.
I am very comfortable with select queries using joins and the like, but when it comes to DML operations I usually fail.
Every query of a kind I have never written before makes me uncomfortable while I'm creating it. Whenever I go for an interview I face this problem.
Is there some concept behind the approach to formulating SQL queries?
Eg.
I need to create an sql query such that
A table contain single column having duplicate record. I need to remove duplicate records.
I know I can find the solution to this query very easily by Googling, but I want to know how everyone arrives at the desired result.
Is it a matter of "practice makes perfect", i.e. once you have done it you will be able to formulate it next time, or is there some logic or concept behind it?
I could have gotten an answer to the above problem simply by posting it on Stack Overflow, and I would have had it within 5 to 10 minutes, but I want to know the reasoning. How do you work on any new kind of query? Is it mostly experience, or an application of concepts?
Whenever I learn something new in coding, I try to use it wherever I can. But here the situation seems different, perhaps because I am lacking some concepts.
EDIT
How could I test my knowledge and concepts in SQL and related SQL queries?
Typically, the first time you need to open a child proof bottle of pills, you have a hard time, but after that you are prepared for what it might/will entail.
So it is with programming (me thinks).
You find problems, research best practices, and beat your head against a couple of rocks, but in the process you will come to have a handy set of tools.
Also, reading what others tried/did is a good way to avoid major obstacles.
All in all, with a lot of practice/coding, you will see patterns quicker, and learn to notice where to make use of what tool.
I have a somewhat methodical method of constructing queries in general, and it is something I use elsewhere with any problem solving I need to do.
The first step is ALWAYS listing out any bits of information I have in a request. Information is essentially anything that tells me something about something.
"A table contain single column having duplicate record. I need to remove duplicate"
I have a table (I'll call it table1)
I have a column on table table1 (I'll call it col1)
I have duplicates in col1 on table table1
I need to remove duplicates.
The next step of my query construction is identifying the action I'll take from the information I have.
I'll look for certain keywords (e.g. remove, create, edit, show, etc...) along with the standard insert, update, delete to determine the action.
In the example this would be DELETE because of remove.
The next step is isolation.
Answer the question "the action determined above should only be valid for ______..?" This part is almost always the most difficult part of constructing any query because it's usually abstract.
In the above example you're listing "duplicate records" as a piece of information, but that's really an abstract concept of something (anything where a specific value is not unique in usage).
Isolation is also where I test my action using a SELECT statement.
Every new query I run gets thrown through a select first!
The next step is execution, or essentially the "how do I get this done" part of a request.
A lot of times you'll figure out the how during the isolation step, but in some instances (yours included) how you isolate something and how you fix it are not the same thing.
Showing duplicated values is different than removing a specific duplicate.
The last step is implementation. This is just where I take everything and make the query...
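Applied to the example, the steps could play out as in the sketch below. Some assumptions: it uses SQLite through Python's sqlite3 module rather than SQL Server, the sample rows are made up, and it relies on SQLite's implicit rowid to tell otherwise identical rows apart (other databases need a different way to pick the row to keep). table1 and col1 are the names chosen above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany(
    "INSERT INTO table1 (col1) VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("a",)],
)

# Isolation, part 1: which values are duplicated at all?
duplicates = conn.execute("""
    SELECT col1, COUNT(*) FROM table1
    GROUP BY col1
    HAVING COUNT(*) > 1
""").fetchall()

# Isolation, part 2: run the future DELETE's condition through a SELECT first.
# Keep the first row of every value, so target everything else.
surplus = """
    rowid NOT IN (SELECT MIN(rowid) FROM table1 GROUP BY col1)
"""
to_delete = conn.execute(
    f"SELECT rowid, col1 FROM table1 WHERE {surplus} ORDER BY rowid"
).fetchall()

# Execution/implementation: the same condition, now with the DELETE action.
conn.execute(f"DELETE FROM table1 WHERE {surplus}")
remaining = [row[0] for row in conn.execute("SELECT col1 FROM table1 ORDER BY col1")]

print(duplicates, to_delete, remaining)
```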
Summing it all up... for me to construct a query I'll pick out all information that I have in the request. Using the information I'll figure out what I need to do (the action), and what I need to do it on (isolation). Once I know what I need to do with what I figure out the execution.
Every single time I'm starting a new "query" I'll run it through these general steps to get an idea for what I'm going to do at an abstract level.
For specific implementations of an actual request you'll have to have some knowledge (or access to google) to go further than this.
Kris
I think in the same way I cook dinner. I have some ingredients (tables, columns etc.), some cooking methods (SELECT, UPDATE, INSERT, GROUP BY etc.) then I put them together in the way I know how.
Sometimes I will do something weird and find it tastes horrible, or that it is amazing.
Occasionally I will pick up new recipes from the internet or friends, then use parts of these in my own.
I also save my recipes in handy repositories, broken down into reusable chunks.
On the "Delete a duplicate" example, I'd come to the result by googling it. This scenario is so rare if the DB is designed properly that I wouldn't bother keeping this information in my head. Why bother, when there is a good resource available for me to look it up when I need it?
For other queries, it really is practice makes perfect.
Over time, you get to remember frequently used patterns just because they ARE frequently used. Rare cases should be kept in a reference material. I've simply got too much other stuff to remember.
Find good documentation for your software. I use MySQL a lot, and MySQL has an excellent documentation site with a decent search function, so you get many answers just by reading the docs. If you do NOT get your answer, at least you are learning something.
Then I set up an example database (or use the one I am working on) and gradually build my SQL. I tend to separate the problem into small pieces and solve it step by step. This is very successful if you are building queries with many JOINs: it is best to start with some particular case and "pollute" your SQL with many conditions like WHERE id = "123", which you take out as you work towards your solution.
The best and fastest way to learn good SQL is to work with someone else, preferably someone who knows more than you, but that is not a necessary condition. It can be replaced by studying mature code written by others.
Your example is a test of how well you understand the DISTINCT keyword and the GROUP BY clause, which are SQL's ways of dealing with duplicate data.
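As a small illustration of that point, the sketch below reads the unique values with DISTINCT, counts occurrences with GROUP BY, and builds a duplicate-free copy of the table from the DISTINCT result. It assumes SQLite via Python's sqlite3 module, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany(
    "INSERT INTO table1 (col1) VALUES (?)", [("a",), ("b",), ("a",)]
)

# DISTINCT collapses repeated values in the result set.
distinct_values = [
    row[0]
    for row in conn.execute("SELECT DISTINCT col1 FROM table1 ORDER BY col1")
]

# GROUP BY does the same collapsing, but can also report how many rows
# each value had.
grouped = conn.execute(
    "SELECT col1, COUNT(*) FROM table1 GROUP BY col1 ORDER BY col1"
).fetchall()

# A duplicate-free copy of the table, built from the DISTINCT rows.
conn.execute("CREATE TABLE table1_dedup AS SELECT DISTINCT col1 FROM table1")
dedup_count = conn.execute("SELECT COUNT(*) FROM table1_dedup").fetchone()[0]

print(distinct_values, grouped, dedup_count)
```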
Examples and experience. You look at other people's examples and you write your own code, and once you grok it, you don't need to think about it again.
I would have a look at the Mere Mortals book - I think it's the one by Hernandez. I remember that when I first started seriously with SQL Server 6.5, moving from manual ISAM databases and Access database systems using VB4, that it was difficult to understand the syntax, the joins and the declarative style. And the SQL queries, while powerful, were very intimidating to understand - because typically, I was looking at generated code in Microsoft Access.
However, once I had developed a relatively systematic approach to building queries in a consistent and straightforward fashion, my skills and confidence quickly moved forward.
From seeing your responses you have two options.
Have a copy of the specification for whatever you're working on (the SQL spec and the documentation for your SQL implementation: SQLite, SQL Server, etc.).
Use Google, SO, Books, etc.. as a resource to find answers.
You can't formulate an answer to a problem without doing one of the above. The first option is to become well versed in the capabilities of whatever you are working on.
The second option allows you to find answers to questions that you may not even fully know how to ask. Your example is fairly simple, so if you read the spec/implementation documentation you would know the answer right away. But there are times when, even if you read the spec/documentation, you don't know the answer. You only know that it IS possible, just not how to do it.
Remember that as far as jobs and supervisors go, being able to resolve a problem is important, but the faster you can do it the better which can often be done with option 2.

How should I organize complex SQL views in Rails?

I manage a research database with Ruby on Rails. The data that is entered is primarily used by scientists who prefer to have all the relevant information for a study in one single massive table for use in their statistics software of choice. I'm currently presenting it as CSV, as it's very straightforward to do and compatible with the tools people want to use.
I've written many views (the SQL kind, not the Rails HTML/ERB kind) to make the output they expect a reality. Some of these views are quite large and have a fair amount of complexity behind them. I wrote them in SQL because there are many calculations and comparisons that are more easily done with SQL. They're currently loaded into the database straight from a file named views.sql. To get the requested data, I do a select * from my_view;.
The views.sql file is getting quite large. Part of the problem is that we're still figuring out what the data we collect means, so there's a lot of changes being made to the views all the time -- and a ton of them are being created. Many of them need to be repeatable.
I've recently run into issues organizing and testing these views. Rails works great for user interface stuff and business logic, but I'm not aware of much existing structure for handling the reporting we require.
Some options I've thought of:
Should I move them into the most relevant models somehow? Several of the views interact with each other, which makes this situation more complex than just doing a single find_by_sql, so I don't know if they should only be part of the model.
Perhaps they should be treated as a "view" in the MVC sense? (That is, they could be moved into app/views/ and live alongside the HTML, perhaps as files named something like my_view.csv.sql which return CSV.)
How would you deal with a complex reporting problem like this?
UPDATE for Mladen Jablanović
It started with a couple of views for reporting purposes. My boss(es) decided they wanted more, so I started writing more. Some give a couple hundred columns of data, based on the requirements I've been given.
I have a couple thousand lines of views all shoved in a single file now. I don't like that situation, so I want to reorganize/refactor the code. I'd also like an easy way of providing CSVs -- I'm currently running queries and emailing them by hand, which could easily be automated. Finally, I would like to be able to write some tests on the output of the views, since a couple of regressions have already popped up.
I haven't worked much with SQL and views directly, so I can't help you there, but you can certainly build an ActiveRecord model on top of a view, very easily in fact. The book Enterprise Rails has a whole chapter on it (here it is at Google Books).
We use views in our DB extensively, and some of them are exposed as Rails models. You work with them as you would with tables, except that you can't update them, of course.
Also, some of the columns may be calculated from other columns (different ratios, for example), so we don't do that in the view but in the model instead (OK, not entirely true: we construct an SQL snippet and pass it to the :select => '' portion of the find call).
Presentation logic (such as date and number formatting) goes to Rails views.
I'm afraid I can't help you with more concrete advice, as the scope of the question is pretty wide.
EDIT:
Hundreds of columns doesn't sound reasonable. It sounds like an immense amount of data in one place. How do they use it at all? We have a web application where they can drill down and filter the results, narrow the timespan and time step, etc., so they never have more than 10-20 columns in the reports.
We store our views one view per SQL file. Also, you can combine it with a numerical prefix in order to ensure proper creation order (in case some of them depend on others). No migrations there, whole DB layer is app-agnostic.
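The one-file-per-view layout with numeric prefixes can be sketched roughly as follows. This is not our actual loader: it is a Python/SQLite stand-in, and the `load_views` helper, the file names, the measurements table and the two views are all invented for the illustration. The point is only that sorting by filename runs 010_... before 020_..., so a view is created after the view it depends on.

```python
import sqlite3
import tempfile
from pathlib import Path


def load_views(conn, directory):
    """Run every .sql file in directory in filename order; return the names."""
    loaded = []
    for path in sorted(Path(directory).glob("*.sql")):
        conn.executescript(path.read_text())
        loaded.append(path.name)
    return loaded


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (subject_id INTEGER, score REAL)")
conn.execute("INSERT INTO measurements VALUES (1, 2.0), (1, 4.0), (2, 5.0)")

with tempfile.TemporaryDirectory() as tmp:
    # Written out of order on purpose; the prefix decides the load order.
    Path(tmp, "020_high_scorers.sql").write_text(
        "CREATE VIEW high_scorers AS "
        "SELECT * FROM subject_means WHERE mean_score > 4;"
    )
    Path(tmp, "010_subject_means.sql").write_text(
        "CREATE VIEW subject_means AS "
        "SELECT subject_id, AVG(score) AS mean_score "
        "FROM measurements GROUP BY subject_id;"
    )
    loaded = load_views(conn, tmp)

high_scorers = conn.execute(
    "SELECT subject_id, mean_score FROM high_scorers"
).fetchall()
print(loaded, high_scorers)
```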
For CSV, you can either create a set of scripts that you invoke manually or from cron, or you can use FasterCSV from your Rails app and generate CSVs per HTTP request.
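The script variant might look something like the sketch below. It is an assumption-laden stand-in: Python's csv and sqlite3 modules instead of FasterCSV and your real database, a hypothetical `view_to_csv` helper, and an invented measurements table behind my_view. The final comparison is the kind of check on a view's output that the question asks about for catching regressions.

```python
import csv
import io
import sqlite3


def view_to_csv(conn, view_name):
    """Return the full contents of a view as CSV text, header row first.

    view_name is interpolated into the SQL, so it must come from trusted
    code, never from user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{view_name}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements (subject_id INTEGER, score REAL);
    INSERT INTO measurements VALUES (1, 2.0), (1, 4.0), (2, 5.0);
    CREATE VIEW my_view AS
        SELECT subject_id, AVG(score) AS mean_score
        FROM measurements
        GROUP BY subject_id
        ORDER BY subject_id;
""")

csv_text = view_to_csv(conn, "my_view")
print(csv_text)

# A regression check: known input rows must yield this exact output.
expected = "subject_id,mean_score\n1,3.0\n2,5.0\n"
matches_expected = csv_text == expected
```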