Storing wide-form dataframes in datajoint table [closed] - pandas

Say I have some analysis that spits out a wide-form pandas dataframe with a MultiIndex on both the index and the columns. Depending on the analysis parameters, the number of columns may change. What is the best design pattern to use to store the outputs in a DataJoint table? The following come to mind, each with pros and cons:
Reshape to long-form and store single entries with index x column levels as primary keys (see the reshape sketch below)
Pros: Preserves the ability to query/constrain based on both index and columns
Cons: Each analysis would insert millions of rows into the table, and I may have to do hundreds of such analyses. Even adding this many rows seems to take several minutes per dataframe, and queries become slow
Keep as wide-form and store single rows as longblob with just index levels as primary keys
Pros: Retain ability to query based on index levels, results in tables with a more reasonable number of rows
Cons: Loses the ability to query based on column levels, and the columns would then also have to be stored somewhere in order to reconstruct the original dataframes. Since dataframes with different numbers of columns need to be stored in the same table, it is not feasible to explicitly encode all the columns in the table definition
Store the dataframe itself as e.g. an HDF5 file and store it in the database simply as a filepath or as an attachment
Pros: Does not result in large databases, simple to implement
Cons: Does not really feel in the "spirit" of DataJoint; loses the ability to perform constraints or queries
Are there any designs or pros/cons I haven't thought of?
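For concreteness, here is a minimal sketch of what option 1 could look like; the dataframe shape, level names, and values below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical wide-form analysis output: rows indexed by (session, trial),
# columns indexed by (unit, metric). All names here are illustrative only.
rows = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]], names=["session", "trial"])
cols = pd.MultiIndex.from_product([[10, 11], ["rate", "amplitude"]], names=["unit", "metric"])
df = pd.DataFrame(np.random.rand(len(rows), len(cols)), index=rows, columns=cols)

# Option 1: reshape to long form, one row per (session, trial, unit, metric).
long_form = df.stack(["unit", "metric"]).rename("value").reset_index()

# Each long-form row would become one entry in a table whose primary key is
# (session, trial, unit, metric), which is where the row-count explosion comes from.
print(long_form.head())
```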

Before providing a more specific answer, let's establish a few basics (also known as normal forms).
DataJoint implements the relational data model. Under the relational model, complex dataframes of the type you describe require normalization into multiple tables related to each other through their primary keys and foreign keys.
Each table will represent a single entity class: Units and Trials will be represented in separate tables.
All entities in a given table will have the same attributes (columns). They will be uniquely identified by the same attribute(s) comprising the primary key.
In addition to the primary key, tables may have additional secondary indexes to accelerate queries.
If you already know about normalization, we can talk about how to normalize your design. If not, we can refer you to a quick tutorial.
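As a rough illustration only (the schema name, entities, and attributes below are assumptions, not your actual design), a normalized DataJoint layout along these lines might look like:

```python
import datajoint as dj

# Hypothetical schema and attribute names -- adjust to your actual pipeline.
schema = dj.Schema('analysis_demo')


@schema
class Session(dj.Manual):
    definition = """
    session_id   : int            # one recording/analysis session
    ---
    session_date : date
    """


@schema
class Unit(dj.Manual):
    definition = """
    -> Session
    unit_id : int                 # unit within a session
    """


@schema
class Trial(dj.Manual):
    definition = """
    -> Session
    trial_id : int                # trial within a session
    ---
    trial_start : float           # seconds from session start
    """


@schema
class UnitTrialMetric(dj.Manual):
    definition = """
    # one row per (unit, trial, metric) -- the long-form entries
    -> Unit
    -> Trial
    metric_name : varchar(64)
    ---
    metric_value : float
    """
```

With a layout like this, restrictions then work on any of the keys, e.g. UnitTrialMetric & 'session_id = 1'.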

Related

How to normalize a database/dataset in Access or any other database? [closed]

I'm trying to build an OLAP database with this data set about the Olympics. The problem is that the datasets are in CSV format and are usually in one single table. I've imported the data into Access, as I was told that Access has a tool to split the data into different tables, but I have not found anything related to that. This is my current table:
Id1 is the one created in Access, so it could include the duplicated data; ID is the original one in the data set.
I want to normalize the data into the following schema:
I've tried to split the data manually, but since there is a lot of data, it's risky and prone to a lot of mistakes and errors.
Any idea on how to do this on Access or is there a better method to do it?
Since you have already imported your data into Access and that data still needs to be normalized, you can use an Access wizard under Database Tools > Analyze Table. This wizard will help you normalize a table by splitting the original table into multiple tables. Here is one link to get you started with the Table Analyzer:
https://support.office.com/en-us/article/normalize-your-data-using-the-table-analyzer-8edbb763-5bab-4fbc-b62d-c17b1a40bbe2
The table analyzer will create new tables and copy the data from the original table into the new tables resulting in a structure like the following:
The Table Analyzer will even save the query it uses so you can reuse it later. However, if you just choose the defaults, the wizard will not give you appropriate names for keys and tables. You might also want to adjust the relationship structure Access chooses. You can do all of these things in the wizard once you are familiar with it. In this case I just renamed all the tables and keys but left Seasons at the top of the relationship pile.
Alternatively, you can import the data one table at a time, but you will have to clean it first (particularly adding primary keys) or you will have problems. The data import wizard in Access has the option to skip variables under one of the advanced tabs.
You can skip the Table Analyzer wizard and create the tables and write the queries to transfer the data yourself, but the wizard is faster :)
Data cleaning commentary: under the heading "a picture is worth a thousand words", it helps if you post your data and what you want. I found the dataset online and have a couple of comments that may be helpful. ID has a one-to-many relationship with Country, so it cannot be used as a primary key; let Access provide the primary keys instead. Age has missing data, so a decision will need to be made on how to handle that; I just put the problem off by converting Age to a text variable.
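If it helps to see those two issues concretely, here is a small pandas sketch (not Access-specific; the file name and column names are assumptions based on the public Olympics CSV):

```python
import pandas as pd

# Hypothetical file and column names (ID, Team, Age) -- adjust to your CSV.
athletes = pd.read_csv("athlete_events.csv")

# Some IDs map to more than one Team/Country, so ID alone cannot be a primary key.
teams_per_id = athletes.groupby("ID")["Team"].nunique()
print("IDs with multiple teams:", (teams_per_id > 1).sum())

# Age has missing values; decide how to handle them before importing.
print("Rows with missing Age:", athletes["Age"].isna().sum())

# One simple option, mirroring the answer above: treat Age as text so blanks
# import cleanly (this just postpones the decision).
athletes["Age"] = athletes["Age"].astype("string")
```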

DB Schema: Why not create new table for each 'entity'? [closed]

Sorry about the vague title.
An example: I'm guessing SO has one large table that lists all answers, in a schema like:
[ Ques No, Ans No, Text, Points ]
[ 22, 0, "Win", 3 ],
[ 22, 1, "Tin", 4 ],
[ 23, 0, "Pin", 2 ]
My question is: would it be better if there were two tables, Table_Ques22 and Table_Ques23? Can someone please list the pros and cons?
What comes to my mind:
Cons of multiple tables: Overhead of meta storage.
Pros of multiple tables: Quickly answer queries like, find all answers to Ques 22. (I know there are indices, but they take time to build and space to maintain).
Databases are designed to handle large tables. Having multiple tables with the same structure introduces a lot of problems. These come to mind:
Queries that span multiple rows ("questions" in your example) become much more complicated and performance suffers.
Maintaining similar entities is cumbersome. Adding an index or partitioning a single table is one thing. Doing it to hundreds of tables is much harder.
Maintaining triggers is cumbersome.
When a new row appears (new question), you have to incur the overhead of creating a table rather than just adding to an existing table.
Altering a table, say to add a new column or rename an existing one, is very cumbersome.
Although putting all questions in one table does use a small additional amount of storage, you have to balance that against the overhead of having very small tables. A table with data has to occupy at least one data page, regardless of whether the data is 10 bytes or 10 Gbytes. If a data page is 16 kbytes, that is a lot of wasted space to support multiple tables for a single entity.
As for database limits: I'm not even sure a database could support a separate table for each question on Stack Overflow.
There is one case where having parallel table structures is useful. That is when security requirements require that the data be separated, perhaps for client confidentiality reasons. However, this is often an argument for separate databases, not just separate tables.
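To make the single-table-plus-index case concrete, a minimal sketch (SQLite is used here purely for illustration; the table and column names follow the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One answers table for all questions, as in the example schema above.
conn.execute("CREATE TABLE answers (ques_no INTEGER, ans_no INTEGER, text TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?)",
    [(22, 0, "Win", 3), (22, 1, "Tin", 4), (23, 0, "Pin", 2)],
)

# A single index on ques_no makes "all answers to question 22" a cheap lookup,
# with no need for a table per question.
conn.execute("CREATE INDEX idx_answers_ques ON answers (ques_no)")

print(conn.execute("SELECT ans_no, text, points FROM answers WHERE ques_no = 22").fetchall())
```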
What about: SQL servers are not made for people ignoring the basics of relational theory.
You will have a ton of problems with cross-question queries on your part, which will totally kill all the gains. Typical beginner mistake - I suggest a good book about SQL basics.

SQL large table VS. multiple smaller tables [closed]

I have the option to use a single table that will expand upwards of 1,000,000 records per year.
With that said, I could use a foreign key to break up this table into multiple smaller tables, which would reduce this growth to about 100,000 records per year in each smaller table.
Let's say 50% of the time users will query all of the records, while the other 50% of the time users will query the segmented, smaller data sets (think all geographic areas vs. specific geographic areas).
Using a database managed by a shared hosting account (think site5, godaddy, etc.), is it faster to use a single larger table or several smaller segmented tables in this situation?
Where each dataset is accessed 10%/90%, 20%/80%, 30%/70%, etc., at what point would using a single table vs. multiple smaller tables be the most/least efficient?
In general, do it so as to reduce the amount of duplicated information. If you are making smaller tables that would have many redundant columns, then it would be more efficient to have just one table. But otherwise, one table.
It also depends on what percent of the row is being used per query, and how your queries are structured. If you are adding lots of joins or subqueries, then it'll most likely be slower.
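As a sketch of how one table can serve both access patterns, with an index on the segmenting column (the column name region below is an assumption standing in for your geographic area):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table holding all records; 'region' stands in for the geographic segment.
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, region TEXT, value REAL)")
conn.execute("CREATE INDEX idx_records_region ON records (region)")

# ~50% of queries: everything (a plain scan of the single table).
conn.execute("SELECT * FROM records").fetchall()

# ~50% of queries: one segment; the region index keeps this fast without
# splitting the data into per-region tables.
conn.execute("SELECT * FROM records WHERE region = ?", ("north",)).fetchall()
```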

Which type of database structure design is better for performance? [closed]

MSSQL database. I have an issue creating a database using an old database's data. The old database structure is thousands of tables connected with each other by ID. In these tables, data is duplicated many times. The old database tables have more than 50,000 rows (users). The structure is like these tables:
Users (id, login, pass, register-date, update-date),
Users-detail (id, users_id, some data)
Users-some-data (id, users_id, some data)
and there are hundreds of tables of this kind.
And the question is which DB structure design to choose: one table with all of this data, or hundreds of tables separated by some theme.
Which type of DB structure would have better performance?
SELECT id, login, pass FROM ONE_BIG_TABLE
or
SELECT * FROM SMALL_ONLY_LOGINS_TABLE
The answer really depends on the use. No one can optimize your database for you without knowing the usage statistics.
Correct DB design dictates that an entity is stored inside a single table; that is, a client with their details, for example.
However, this rule can change when you only access/write some of the entity's data multiple times, and/or if there is optional info you store about a client (e.g., some long texts, biography, history, extra addresses, etc.), in which cases it would be optimal to store them in a child table.
If you find yourself with a bunch of columns holding all-null values, that means you should strongly consider a child table.
If you only need to check login credentials against the DB table, a stored procedure that returns a bool value depending on whether the username/password are correct will save you the round-trip of the data.
Without indexes, the SELECT on the smaller tables will be faster. But you can create the same covering index (id, login, pass) on both tables, so if you need only those three columns, performance will probably be the same on both tables.
The general question of which database structure is better cannot be answered without knowing the usage of your database.
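To illustrate the covering-index point, a small sketch (SQLite here for brevity; in SQL Server you would typically build a nonclustered index with INCLUDE columns instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT, pass TEXT, "
    "register_date TEXT, update_date TEXT)"
)
conn.execute("INSERT INTO users VALUES (1, 'alice', 'hash', '2020-01-01', '2020-06-01')")

# The covering index suggested above: it contains every column the query
# touches, so the engine can satisfy the query from the index alone.
conn.execute("CREATE INDEX idx_users_id_login_pass ON users (id, login, pass)")

print(conn.execute("SELECT id, login, pass FROM users").fetchall())
```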

What is better: to have many similar databases or one database with similar tables or one database with one table? [closed]

I need to work with several data samples, say N. The samples represent similar data but from different origins. For example, the history of orders in different shops. So the structure of all the samples is the same. To operate with the data I have several possibilities:
Use N databases with identical schema, one for each sample
Use one database, but N sets of tables. For example, User_1,..., User_N; Product_1, ..., Product_N, Order_1, ..., Order_N and so on.
Use one database with one set of tables User, Product, Order, but add to each table a helper column which represents a sample index. Clearly, this column should be an index.
The last variant seems to be the most convenient for use because all queries become simple. In the second case I need to send a table name to a query (stored procedure) as a parameter (is it possible?).
So which way would you advise? The performance is very important.
Step 1. Get a book on data warehousing -- since that's what you're doing.
Step 2. Partition your data into facts (measurable things like $'s, weights, etc.) and dimensions (non-measurable attributes like Product Name, Order Number, User Names, etc.)
Step 3. Build a fact table (e.g., order items) surrounded by dimensions of that fact. The order item's product, the order item's customer, the order item's order number, the order item's date, etc., etc. This will be one fact table and several dimension tables in a single database. Each "origin" or "source" is just a dimension of the basic fact.
Step 4. Use very simple "SELECT SUM() GROUP BY" queries to summarize and analyze your data.
This is the highest performance, most scalable way to do business. Buy Ralph Kimball's Data Warehouse Toolkit books for more details.
Do not build N databases with identical structure. Build one for TEST, and one for PRODUCTION, but don't build N.
Do not build N tables with identical structure. That's what keys are for.
Here is one example. Each row of the fact table in the example has one line item from the order. The OrderID field can be used to find all items from a specific order.
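A minimal sketch of that star-schema idea (the table and column names are made up; the origin dimension stands in for the shop/source of each sample), including the kind of SUM ... GROUP BY query from Step 4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables: one row per product, per origin (shop/source), etc.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE dim_origin  (origin_id  INTEGER PRIMARY KEY, name TEXT)")

# Fact table: one row per order line item, keyed by the dimensions,
# carrying the measurable values (quantity, amount).
conn.execute(
    "CREATE TABLE fact_order_item ("
    " order_id INTEGER, product_id INTEGER, origin_id INTEGER,"
    " quantity INTEGER, amount REAL)"
)

# Summarize sales per origin with a simple SUM ... GROUP BY.
query = """
SELECT o.name AS origin, SUM(f.amount) AS total_sales
FROM fact_order_item AS f
JOIN dim_origin AS o ON o.origin_id = f.origin_id
GROUP BY o.name
"""
print(conn.execute(query).fetchall())
```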
Well, if you separate the databases, you'll have smaller tables. That's usually more performant.
If you ever need to get to another database, that is possible with Microsoft SQL Server. If you need to get to a database on another server, that's possible too.
It will depend on how strongly correlated the data is.
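For the cross-database access mentioned above, a small sketch (the connection string, database, and table names are placeholders; three-part names reach another database on the same SQL Server instance, four-part names go through a linked server):

```python
import pyodbc

# Placeholder connection details -- adjust to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=ShopA;"
    "Trusted_Connection=yes;"
)

# Query another database on the same instance (three-part name).
same_instance = conn.execute("SELECT TOP 10 * FROM ShopB.dbo.Orders").fetchall()

# Query a database on another server via a linked server (four-part name).
other_server = conn.execute("SELECT TOP 10 * FROM OtherServer.ShopB.dbo.Orders").fetchall()
```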