We need to upload data from our logs to Google BigQuery and we have two subsets of the log data that will not overlap when queried.
Subset number one has a field "vendor_id" which will be used a lot in WHERE clauses.
Subset number two are the log entries that do not have "vendor_id"
We could make only one table with a nullable "vendor_id" field or make two different tables one for each subset. Is there any difference in the performance of these aproaches?
Regards
Leo
There will be little (if any) difference in query performance between the two options you mention. That said, the cost of queries is proportional to the amount of data read, so if you have two separate tables it will likely be less expensive, since each query will read a smaller amount of data.
Related
The data I want to store data that has this characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs it's own fields);
Here's how it would look like in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Bellow are the ideas I already had:
Do exactly as the tabular table (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or one that I have not thought about it) that is better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition, and you compare with =, a single GIN index on a jsonb might be better than having many B-tree indexes, because the index maintenance costs would be higher.
The definitive answer depends on the SQL statements that you plan to run on that table.
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Each of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility. It is easy to add new columns. But you don't need that. JSON has a cost in data size and data type flexibility (notably JSON doesn't support date/times explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
Depends on how much data you want to store, but as long as it is finite it shouldn't make a big difference if it contains a lot of null's or not
I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
Think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resource 1 table, as other users have appropriately pointed out databases can handle millions of rows no problem.....Well depends on the data that is in them. The row size can make a big difference... Such as storing VARCHAR(MAX), VARBINARY(MAX) and several per row......
there is no doubt writing queries, ETL (extract transform load) is significantly easier on a single table! And maintenance of that is easier too from a archival perspective.
But if you never access the data and you need the performance in the primary table some sort of archive might make since.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer and will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b) but..
with new column describing the time period (e.g. 04.2016, 05/2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put a index on the column and you can, probably, execute fast queries on it.
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write or for the database to execute? An Example would be nice for us to get an image of your actual requirements.
I'm looking to essentially have a centralized table with a number of lookup tables that surround it. The central table is going to be used to store 'Users' and the lookup tables will be user attributes, like 'Religion'. The central table will store an Id, like ReligionId, and the lookup table would contain a list of religions.
Now, I've done a lot of digging into this and I've seen many people comment saying that a UserAttribute table might be the best way to go, essentially using an EAV pattern. I'm not looking to do this. I realize that my strategy will be join-heavy and that's why I ask this question here. I'm looking for a way to optimize those joins.
If the table has 100 lookup tables, how could it be optimized to be faster than just doing a massive 100 table inner join? Some ideas come to mind like using many smaller joins, sub-selects and views. I'm open to anything, including a combination of these strategies. Again, just to note, I'm not looking to do anything that's EAV-related. I need the lookup tables for other reasons and I like normalized data.
All suggestions considered!
Here's a visual look:
Edit: Is this insane?
Optimization techniques will likely depend on the size of the center table and intended query patterns. This is very similar to what you get in data warehousing star schemas, so approaches from that paradigm may help.
For one, ensuring the size of each row is absolutely as small as possible. Disk space may be cheap, but disk throughput, memory, and CPU resources are potential bottle necks. You want small rows so that it can read them quickly and cache as much as possible in memory.
A materialized/indexed view with the joins already performed allows the joins to essentially be precomputed. This may not work well if you are dealing with a center table that is being written to alot or is very large.
Anything you can do to optimize a single join should be done for all 100. Appropriate indexes based on the selectivity of the column, etc.
Depending on what kind of queries you are performing, then other techniques from data warehousing or OLAP may apply. If you are doing lots of group by's then this is likely an area to look in to. Data warehousing techniques can be applied within SQL Server with no additional tooling.
Ask yourself why so many attributes are being queried and how they are being presented? For most analysis it is not necessary to join with lookup tables until the final step where you materialize a report, at which time you may only have grouped by on a subset of columns and thus only need some of the lookup tables.
Group By's generally should be able to group on the lookup Id's without needing the text/description from the lookup table so a join is not necessary. If your lookups have other information relevant to the query at hand then consider denormalizing it into the central table to eliminate the join and/or make that discreet value its own lookup, essentially splitting the existing lookup ID into another ID.
You could implement a master code table that combines the code tables into a single table with a CodeType column. This is not the same as EAV because you'd still have a column in the center table for each code type and a join for each, where as EAV is usually used to normalize out an arbitrary number of attributes. (Note: I personally hate master code tables.)
Lastly, consider normalization the center table if you are not doing data warehousing.
Are there lots of null values in certain lookupId columns? Is the table sparse? This is an indication that you can pull some columns out into a 1 to 1/0 relationships to reduce the size of the center table. For example, a Person table that includes address information can have a PersonAddress table pulled out of it.
Partitioning the table may improve performance if there's a large number of rows and you can determine that certain rows, perhaps with a certain old datetime from couple years in the past, would rarely be queried.
Update: See "Ask yourself why so many attributes are being queried and how they are being presented?" above. Consider a user wants to know number of sales grouped by year, department, and product. You should have id's for each of these so you can just group by those IDs on the center table and in an outer query join lookups for only what columns remain. This ensures the aggregation doesn't need to pull in unnecessary information from lookups that aren't needed anyway.
If you aren't doing aggregations, then you probably aren't querying large numbers of records at a time, so join performance is less of a concern and should be taken care of with appropriate indexes.
If you're querying large numbers of records at a time pulling in all information, I'd look hard at the business case for this. No one sits down at their desk and opens a report with a million rows and 100 columns in it and does anything meaningful with all of that data, that couldn't be accomplished in a better way.
The only case for such a query be a dump of all data intended for export to another system, in which case performance shouldn't be as much as a concern as it can be scheduled overnight.
Since you are set on your way. you can consider duplicating data in order to join less times in a similar way to what is done in olap database.
http://en.wikipedia.org/wiki/OLAP_cube
With that said I don't think this is the best way to do it if you have 100 properties.
Have you tried to export it to Microsoft Excel Power Pivot with Power Query? you can make fast data analysis with pretty awsome ways to show it with Power view video sample
I have a table with "orders", and "order lines" that come as JSON, and it is simple to store it as JSON in BigQuery. I can run a process to flatten the file to rows, but it is a burden, and makes the BigQUery table bigger.
What would be a best performance structure for BigQuery? Assuming I have queries on sum or products, and sales in order lines.
And what is the best practice to number of "records" (or "order lines") in a record column? Can it contain thousands or is it aimed for a few? Assuming I would query it like in a MongoDB document based database.
This will help me plan the right architecture.
BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, and in general can return query results as fast as it would if those records were flattened. In fact, in some cases, (depending on your data and the types of queries you are running) using already nested records can actually allow you to avoid subqueries that tack on an extra step.
Short answer: Don't worry about flattening, keep your data in the nested structure, the query performance will generally be the same either way.
However, as to your second question: Your record limit will be determined by how much data you can store in a single record. Currently BigQuery's per row maximum is 100MB. You can have many, many repeated fields in a single record, but they need to fit into this limit.
This question relates to the performance of querying data in BigQuery.
Any particular table or column settings that would affect query performance, or are all columns in a table in effect treated equally by BigQuery, such that the order of columns or any definitions applying to a column would not impact data fetching in any distinguishable way?
Thanks!
All columns are treated equally by BigQuery. The order doesn't matter at all. There aren't any settings that affect performance, but if you import your table in a very large number of small pieces (e.g. more than a few thousand) it can cause some slowdown.