Large amount of data, I need help finding an approach - pandas

I have written an application in Python that handles a large amount of data used to generate graphics. I need help finding a technique that allows the user to scroll through the data, in both directions, while displaying a segment of it in a graphic. Ideally, only the data that would be visible gets read from the database.
I am currently using a combination of pandas, SQLAlchemy and bqplot, but if there are better packages for implementing the desired functionality I am willing to change.
SQLAlchemy seems like a good bet for handling the database, but I have found the documentation hard to understand and need a nudge in the right direction. Any help or advice will be appreciated.
Thanks,
Steve

You can store your data in a grid in which each rectangle is the size of the user's display. Because the visible window can overlap at most four such rectangles at once, you never need to load more than four of them.
As to the database, SQLAlchemy is great, but it does fancy ORM stuff you don't need. In fact, as a rule of thumb, if you're not going to do table joins (which, by your description, you're not), you don't need an SQL database at all, but rather a large key-value store. Here is a pymongo tutorial, for example, but just choose any key-value store that works for you.
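For what it's worth, here is a minimal sketch of the windowed-loading idea using the stack you already have (pandas + SQLAlchemy). The file name, table name and column are hypothetical:

    import pandas as pd
    from sqlalchemy import create_engine, text

    # Hypothetical SQLite file holding a table "samples" indexed on column "x".
    engine = create_engine("sqlite:///data.db")

    def load_window(x_min, x_max):
        """Read only the rows that fall inside the visible x-range."""
        query = text("SELECT * FROM samples WHERE x BETWEEN :lo AND :hi ORDER BY x")
        return pd.read_sql(query, engine, params={"lo": x_min, "hi": x_max})

    df = load_window(0, 100)     # initial viewport
    df = load_window(50, 150)    # after the user scrolls right

With an index on x, each scroll step stays cheap no matter how large the table grows.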

Related

Extract labels from serialized array using SQL

I do not have control over how this data is stored (I know normalized data would be better for SQL), because it is saved via the WordPress GravityForms plugin. The plugin uses a serialized array to define the question id (field_id) and question label (label). My goal is to extract these values in the following format:
field_id  label
1         1. I know my organization’s mission (what it is trying to accomplish).
2         2. I know my organization’s vision (where it is trying to go in the future).
Here is the serialized array.
Can anyone please provide a specific example of how to parse these values out with SQL?
A specific example, no. This kind of stuff is complex. If you are working with straight JSON-formatted data, here are several options, none of which are simple.
You can build your own parser. Yuck.
You can upgrade everything you have to the just-released SQL Server 2016 and hope that the built-in JSON tools do what you need. (I've heard iffy things about them, but don't know what their final form is like. And updating all your database servers right now? Oh, sure.)
Phil Factor over on SimpleTalk built a json T-SQL parser (https://www.simple-talk.com/sql/t-sql-programming/consuming-json-strings-in-sql-server/). It looks horrible and may run poorly, but it would do the needful.
Buried in the comments of that article are links to a CLR tool that John Galt built (at https://github.com/jgcoding/J-SQL). I have used this successfully, though I haven't done anything too complex. (If your JSON is relatively simple, this could do the trick.)
There are other JSON parsers for SQL out there, some free, some for sale. The key thing is not to try to write your own, but rather to find and use someone else's solution that addresses your requirements.
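If stepping outside SQL is an option, parsing the blob in application code is usually far less painful. Here is a rough Python sketch, assuming the GravityForms meta is stored as JSON (older plugin versions use PHP serialization instead, which packages such as phpserialize can handle); the sample blob is made up:

    import json

    # Made-up excerpt of a display_meta blob, assumed to be JSON.
    display_meta = '''
    {"fields": [
      {"id": 1, "label": "1. I know my organization's mission."},
      {"id": 2, "label": "2. I know my organization's vision."}
    ]}
    '''

    form = json.loads(display_meta)
    for field in form["fields"]:
        print(field["id"], field["label"])   # field_id, label pairs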

Sql Database structure for housing historical data and display changes

Good morning,
This is more of a concept question than anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date     Category           Activity
8/8/14   change position    position 1 -> position 2
8/9/14   change department  department a -> department b
...
The visual experience seems like it would benefit from an E-A-V design; however, I am designing the database to be easy to data-mine, and from my reading I think that E-A-V is not the right way to go.
Does it make sense to duplicate data just to display it?
If not, does anyone have a suggestion of how to query the history table and display it? (Currently using jQuery and PHP to leverage the DB... I suppose I could do something interesting from a coding perspective to get it done.)
thank you for your help,
Travis
Creating an efficient operational database environment and creating an 'easy-to-data-mine' environment are two separate (and often opposing) goals.
Others might disagree with me, but in my opinion it is best to design your database for operational readiness (this means using your E-A-V design, as mentioned above) and then worry about data transformation later. This may make it inconvenient to transform the data for easy mining later, but it accomplishes an incredibly important goal: eliminating the possibility of data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work, but from a data integrity perspective it is much safer than trying to create a system designed entirely for reporting. That's my personal opinion, at least.
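To make that concrete, here is a rough sketch (Python with SQLite, purely for illustration; all table and column names are made up) of a plain history table that yields exactly the Date / Category / Activity view described in the question, with no E-A-V gymnastics:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per change; names are illustrative, not prescriptive.
        CREATE TABLE user_history (
            user_id   INTEGER,
            changed   DATE,
            category  TEXT,
            old_value TEXT,
            new_value TEXT
        );
        INSERT INTO user_history VALUES
            (1, '2014-08-08', 'change position',   'position 1',   'position 2'),
            (1, '2014-08-09', 'change department', 'department a', 'department b');
    """)

    # Render the "Date / Category / Activity" view for one user.
    rows = conn.execute("""
        SELECT changed, category, old_value || ' -> ' || new_value AS activity
        FROM user_history WHERE user_id = ? ORDER BY changed
    """, (1,))
    for changed, category, activity in rows:
        print(changed, category, activity)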
You have to analyse the data you need to persist.
If you have only a couple of tables, with no relationships, you probably don't need a database.
In this case a database solution will probably be slower (connection/transmission/security overhead, etc.).
Well, if it's a few MBs of data, I would keep everything in one table.
You can easily load the whole data set into memory and do what you need to do.

What is the relational database equivalent of Factorial and the Fibonacci function?

When learning a new programming language, there are always a couple of traditional problems that are good to get yourself moving. For example, Hello World and Fibonacci will show how to read input, print output and compute functions (the bread and butter that will solve basically everything), and while they are really simple, they are nontrivial enough to be worth the time (and there is always some fun to be had by calculating the factorial of a ridiculously large number in a language with bignums).
So now I'm trying to get to grips with some SQL system and all the textbook examples I can think of involve mind-numbingly boring tables like "Student" or "Employee". What nice alternate datasets could I use instead? I am looking for something that (in order of importance) ...
1. Has data that can be generated by a straightforward algorithm (I don't want to have to enter things by hand) and lets me easily increase the size of my tables to stress efficiency, etc.
2. Can be used to showcase as much stuff as possible: selects, joins, indexing... you name it.
3. Can be used to get back some interesting results.
I can live with "boring" data manipulation if the data is real and has a use by itself, but I'd rather have something more interesting if I am creating the dataset from scratch.
In the worst case, I presume there is at least some sort of benchmark dataset out there that fits the first two criteria, and I would love to hear about that too.
The benchmark database in the Microsoft world is Northwind. A similar open source (EPL) one is Eclipse's Classic Models database.
You can't autogenerate either as far as I know.
However, Northwind "imports and exports specialty foods from around the world", while Classic Models sells "scale models of classic cars". Both are pretty interesting. :)
SQL is a query language, not a procedural language, so unless you will be playing with PL/SQL or something similar, your examples will be manipulating data.
So here is what was fun for me -- data mining! Go to:
http://usa.ipums.org/usa/
And download their micro-data (you will need to make an account, but it's free).
You'll need to write a little script to load the fixed-width file into your DB, which should be fun in itself. And you will need to write a little script to auto-create the fields (since there are many) by parsing their metadata file. That's fun, too.
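For illustration, the ingestion script can be very small. A sketch using pandas, where the file name and column layout are placeholders; in practice you would generate colspecs and names by parsing the IPUMS metadata file, since there are far too many fields to type by hand:

    import sqlite3
    import pandas as pd

    # Placeholder layout; the real byte ranges come from the IPUMS metadata.
    colspecs = [(0, 4), (4, 6), (6, 14)]
    names = ["year", "statefip", "hhincome"]

    df = pd.read_fwf("usa_00001.dat", colspecs=colspecs, names=names)

    conn = sqlite3.connect("ipums.db")
    df.to_sql("census", conn, if_exists="replace", index=False)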
Then, you can start asking questions. Suppose the questions are about house prices:
Say you want to look at the evolution of house prices for those with incomes in the top 10% of the population over the last 40 years. Then restrict that to people living in California. See if there is a correlation between income and mortgage payments as a proportion of income. Then group this by geographic area. Then see if there is a correlation between the areas with the highest mortgage burden and the percentage of units occupied by renters. Your DB will have some built-in statistical functions, but you can always program your own as well, so CORREL might be the equivalent of Fibonacci. Then write a little script to do the same thing in R, importing data from your DB, manipulating it, and storing the result.
The best way to learn about DBs is to use them for some other purpose.
Once you are done playing with IPUMS, take a look at GEO data with (depending on your database) something like PostGIS; the only difference is that IPUMS gives you resolution in terms of tracts, whereas GIS data has latitude/longitude coordinates. Then you can plot a heat map of mortgage burdens for the U.S., and evolve this heat map over different time scales.
Perhaps you can do something with chemistry. Input the 118 elements, or extract them from an online source. Use basic rules to combine them into molecules, which you can store in the database. Combine molecules into bigger molecules and perform more complex queries on them.
You will have a hard time finding database-agnostic tutorials. The main reason for that is that the SQL-92 standard, on which most examples are based, is plain old boring. There are updated standards, but most database-agnostic tutorials dumb it down to the lowest common denominator: SQL-92.
If you want to learn about databases as a software engineer, I would definitely recommend starting with Microsoft SQL Server. There are many reasons for that; some are facts, some are opinions. The primary reason, though, is that it's a lot easier to get a lot further with SQL Server.
As for sample data, Northwind has been replaced by AdventureWorks. You can get the latest versions from CodePlex. This is a much more realistic database and allows demonstrating far more than basic joins, filtering and roll-ups. The great thing, too, is that it is actually maintained for each release of SQL Server and updated to showcase some of the new features of the database.
Now, for your goal #1, I would consider the scaling-up an exercise. After you get through the basic and boring stuff, you should gradually be able to perform efficient large-scale data manipulation, and while not really generating data, you can at least copy/paste/modify your SQL data to grow it to whatever size you need.
Keep in mind, though, that benchmarking databases is not trivial. The performance and efficiency of a database depend on many aspects of your application. How it is used is just as important as how it is set up.
Good luck and do let us know if you find a viable solution outside this forum.
Implement your genealogical tree within a single table and print it. In itself this is not a very general problem, but the approach certainly is, and it should prove reasonably challenging.
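Here is a sketch of that exercise (SQLite via Python; the names and sample family are made up): store the tree as an adjacency list in a single table, then print it depth-first with a recursive CTE:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            id        INTEGER PRIMARY KEY,
            name      TEXT,
            parent_id INTEGER REFERENCES person(id)  -- NULL for the root ancestor
        );
        INSERT INTO person VALUES
            (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Carol', 1), (4, 'Dave', 2);
    """)

    # Walk the tree depth-first; the "path" column forces DFS ordering.
    rows = conn.execute("""
        WITH RECURSIVE tree(id, name, depth, path) AS (
            SELECT id, name, 0, printf('%04d', id)
            FROM person WHERE parent_id IS NULL
            UNION ALL
            SELECT p.id, p.name, t.depth + 1, t.path || '/' || printf('%04d', p.id)
            FROM person p JOIN tree t ON p.parent_id = t.id
        )
        SELECT name, depth FROM tree ORDER BY path
    """)
    for name, depth in rows:
        print("  " * depth + name)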
Geographic data can showcase a lot of SQL capabilities while being somewhat complicated (but not too complicated). It's also readily available from many sources online - international organizations, etc.
You could create a database with countries, cities, zip codes, etc. Mark the capitals of countries (remember that some countries have more than one capital city...). Include GIS data if you want to get really fancy. Also, consider how you might model different address information. Now what if the address information had to support international addresses? You can do the same with phone numbers as well. Once you get the hang of things, you could even integrate with Google Maps or something similar.
You'd likely have to do the database design and import work yourself, but really that's a pretty huge part of working with databases.
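As a starting point, here is a tiny sketch of such a schema (SQLite via Python; the sample rows are just for illustration), including the more-than-one-capital wrinkle:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE country (
            code TEXT PRIMARY KEY,   -- ISO 3166-1 alpha-2
            name TEXT
        );
        CREATE TABLE city (
            id           INTEGER PRIMARY KEY,
            country_code TEXT REFERENCES country(code),
            name         TEXT,
            is_capital   INTEGER DEFAULT 0   -- a country may flag several
        );
        INSERT INTO country VALUES ('ZA', 'South Africa'), ('NL', 'Netherlands');
        INSERT INTO city (country_code, name, is_capital) VALUES
            ('ZA', 'Pretoria', 1),      -- administrative capital
            ('ZA', 'Cape Town', 1),     -- legislative capital
            ('ZA', 'Bloemfontein', 1),  -- judicial capital
            ('NL', 'Amsterdam', 1),
            ('NL', 'Rotterdam', 0);
    """)

    # List every capital per country; some countries legitimately have several.
    for country, capital in conn.execute("""
        SELECT co.name, ci.name FROM country co
        JOIN city ci ON ci.country_code = co.code
        WHERE ci.is_capital = 1 ORDER BY co.name, ci.name
    """):
        print(country, capital)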
Eclipse's Classic Models database is the best open-source equivalent of the factorial and Fibonacci functions, and Microsoft's Northwind is another powerful alternative you can use.

Serialization or SQlite?

I'm making a patient database program in Visual C#. It will have forms and will consist of 3 tabs with information about the patient. It will also have add, save, previous and next buttons, and a search function. The most important thing is that each record will have around 60 items/columns/attributes, and the number of records could reach 50k-100k or more.
Now my question is: which is better for my program, SQLite or serialization/deserialization?
Thanks
The "database" word in the question strongly suggests that just serialization/deserialization isn't enough. Of course if you can fit all of your data into memory and you're happy to perform all the querying yourself, it could work - but you'll need to consider the cost of potentially reading everything into memory on startup, and possibly writing everything out whenever you change anything.
A database does sound like a better fit to me, to be honest. Whether SQLite is the most appropriate database for you or not is a different question though.
Having said all of this, for the C# in Depth website I keep all the information about comments / errata in a simple XML file, which is loaded lazily and saved every time I make a change. It works well, it's easy to manage, and the file is human readable in source control when I want it. However, I have vastly fewer records than you, and they're much simpler too. I don't have any search requirements - I just need to list everything and fetch by ID. My guess is that your needs are rather more complex, hence my recommendation to use a database.
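Translated into a rough Python sketch (the file name and element names are made up), the pattern is: load the file lazily on first access, rewrite the whole thing on every change:

    import os
    import xml.etree.ElementTree as ET

    PATH = "comments.xml"   # hypothetical data file
    _root = None            # loaded lazily on first access

    def _load():
        global _root
        if _root is None:
            if os.path.exists(PATH):
                _root = ET.parse(PATH).getroot()
            else:
                _root = ET.Element("comments")
        return _root

    def add_comment(comment_id, body):
        """Append a record and rewrite the whole file; fine at small scale."""
        root = _load()
        ET.SubElement(root, "comment", id=str(comment_id)).text = body
        ET.ElementTree(root).write(PATH)

    def get_comment(comment_id):
        """Fetch by ID, the only lookup this scheme really supports."""
        return _load().find(f"comment[@id='{comment_id}']")

    add_comment(1, "Typo on page 42")
    print(get_comment(1).text)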

Objective-C best choice for saving data

I'm currently looking for the best way to save data in my iPhone application; data that will persist between opening and closing of the application. I've looked into archiving using a NSKeyedArchiver and I have been successful in making it work. However, I've noticed that if I try to save multiple objects, they keep getting overwritten every time I save. (Essentially, the user will be able to create a list of things he/she wants, save the list, create a few more lists, save them all, then be able to go back and select any of those lists to load at a future date.)
I've heard about SQLite, Core Data, or using .plists to store multiple arrays of data that will persist over time. Could someone point me in the best direction to save my data? Thanks!
Core Data is very powerful and easy to use once you get over the initial learning curve. Here's a good tutorial to get you started: clicky
As an easy and powerful alternative to CoreData, look into ActiveRecord for Objective-C. https://github.com/aptiva/activerecord
I'd go with NSKeyedArchiver. It sounds like the problem is that you're not organizing your object graph properly.
You technically have a list of lists, but you're only saving the inner nested list.
You should be adding each list to a "super" list, and then archiving the super-list.
Core Data / SQL seems a bit much for what you described.
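The same pitfall expressed in Python's pickle terms, purely as an analogy (NSKeyedArchiver is the real tool here): dumping each inner list straight to the file clobbers the previous one, so you archive the outer "super" list instead:

    import pickle

    PATH = "lists.pickle"   # hypothetical archive file

    # Wrong: pickle.dump(current_list, open(PATH, "wb")) overwrites the file,
    # so only the most recent list survives (the symptom in the question).

    def save_list(new_list):
        """Load the super-list, append the new list, write everything back."""
        try:
            with open(PATH, "rb") as f:
                all_lists = pickle.load(f)
        except FileNotFoundError:
            all_lists = []
        all_lists.append(new_list)
        with open(PATH, "wb") as f:
            pickle.dump(all_lists, f)

    save_list(["milk", "eggs"])
    save_list(["hammer", "nails"])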
Also, you can try this framework. It's very simple and easy to use.
It's based on the ActiveRecord pattern and allows you to use migrations, relationships, validations, and more.
It uses sqlite3 only, without Core Data, but you don't need to write raw SQL or create tables manually.
Just describe your iActiveRecord and enjoy.
You want to check out this tutorial by Ray Wenderlich on getting started with Core Data. It's short and goes over the basics of Core Data.
Essentially, you only want to look at plists if you have a small amount of data to store: a simple list of settings or preferences. Anything larger than that and they break down, specifically around performance. There is a great video on iTunes U where the developers at LinkedIn describe their performance metrics comparing plists and Core Data.
Archiving works, but it is going to be a lot of work to store and retrieve your data, and it puts the performance challenge on your back, so I wouldn't go there. I would use Core Data. It's extremely simple to get started with, and if you understand the objects in this Stack Overflow question, then you know everything you need to get going.