I need help improving a table design in Teradata.
This table holds a huge amount of data, and we retrieve data from it heavily.
To improve data retrieval you may use the following (a combined sketch follows the list):
1. Partition the table (RANGE_N or CASE_N, typically on a date column).
2. Use COMPRESS on columns, so that frequently recurring values are stored once in the table header instead of in every row.
3. Choose your indexes (UPI/PI) wisely, since the primary index drives both data distribution and access paths.
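A minimal Teradata DDL sketch combining all three techniques; the table, columns, and compressed values are illustrative assumptions, not taken from the question:

    -- Hypothetical table; names and value lists are assumptions.
    CREATE TABLE sales_history
    (
        sale_id    BIGINT NOT NULL,
        store_id   INTEGER,
        sale_date  DATE NOT NULL,
        status_cd  CHAR(1) COMPRESS ('A', 'C', 'P'),  -- recurring codes kept in the table header
        amount     DECIMAL(12,2) COMPRESS (0.00)
    )
    PRIMARY INDEX (sale_id)                           -- pick a PI that distributes evenly and matches joins
    PARTITION BY RANGE_N (
        sale_date BETWEEN DATE '2015-01-01' AND DATE '2025-12-31'
        EACH INTERVAL '1' MONTH
    );

With this layout, queries that constrain on sale_date get partition elimination, and the COMPRESS lists shrink the row size for large scans.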
For streaming inserts, I want to use a template table (with a user ID suffix) which is itself a partitioned table. This way I can make my tables smaller than just using partitioned tables, and hence make my queries more cost-effective. Also, my query cost per user stays constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.
It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?
One more concern regarding my proposed architecture is the 4,000-partition limit that I saw at https://cloud.google.com/bigquery/docs/partitioned-tables. Does it mean that my partitioned table can't cover more than 4,000 days? Will I have to delete old partitions in that case, or will the last partition keep absorbing any subsequently streamed data?
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.
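As a concrete illustration, here is a minimal BigQuery standard SQL sketch of the combination described above; the dataset, table, and column names are assumptions:

    -- One table for all users: partitioned by day, clustered by user.
    CREATE TABLE mydataset.user_events
    (
        user_id  STRING,
        event_ts TIMESTAMP,
        payload  STRING
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY user_id;

    -- Filtering on both columns lets BigQuery prune partitions and skip blocks:
    SELECT payload
    FROM mydataset.user_events
    WHERE DATE(event_ts) = '2019-07-01'
      AND user_id = 'user_123';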
I have a situation in which I need to use a table (with up to 20 million rows) as a data source for Tableau.
Will table partitioning work, or is there another method worth trying, given that all the data must stay in this one table?
Please assist.
Make sure the performance checklist below is followed, and optimize your extract as much as possible:
https://www.interworks.com/blog/bfair/2015/02/23/tableau-performance-checklist
or
https://www.interworks.com/sites/default/files/InterWorks%27%20Tableau%20Performance%20Checklist.pdf
or
http://onlinehelp.tableau.com/current/pro/desktop/en-us/extracting_optimize.html
or
https://docs.powertoolsfortableau.com/best-practices-analyzer/#understanding-the-rules
Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data arrives very late. It doesn't get inserted when it occurs, but eventually, when it's provided to the server. At times this can be days or even months after the fact. Thus, the load-time-based _PARTITION_LOAD_TIME attribute is useless to us.
My question is: is there a way I can specify a column that would act like the _PARTITION_LOAD_TIME attribute and still keep the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Does anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each batch of data into its respective partition by referencing not just the table name but the table name with a partition decorator, like yourtable$20160718.
This way you can load data into the partition it belongs to.
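For example, with the bq command-line tool a backfill load might look like the sketch below; the dataset, bucket, and file names are assumptions:

    # Load a batch that occurred on 2016-07-18 into that day's partition,
    # regardless of when the load job actually runs.
    bq load --source_format=NEWLINE_DELIMITED_JSON \
        'mydataset.yourtable$20160718' \
        gs://mybucket/events_2016_07_18.json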
I've got a table that can contain a variety of different types and fields, and a table-definitions table that tells me which field contains which data. I need to select things from that table, so currently I build up a dynamic SELECT statement based on what's in the table-definitions table, select it all into a temp table, and then work from that.
The actual amount of data I'm selecting is quite big: over 5 million records. I'm wondering if a temp table is really the best way to go about this.
Are there other more efficient options of doing what I need to do?
If your data is static and report-like, cache the results of the most popular queries, preferably on the application server, or do multidimensional modeling (cubes). That is really the "more efficient option".
Temp tables, table variables, table data types... in any case you will be using your tempdb, so if you want to optimize your queries, try to optimize tempdb storage (after checking the IO statistics). You can also create indexes on your temp tables.
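A minimal T-SQL sketch of that last point; the source table and column names are assumptions standing in for the dynamically built SELECT:

    -- Materialize the dynamic result set once, then index the temp table
    -- so later lookups don't rescan all 5 million rows.
    SELECT RecordId, FieldName, FieldValue
    INTO #staging
    FROM dbo.FlexibleData;

    CREATE NONCLUSTERED INDEX IX_staging_field
        ON #staging (FieldName) INCLUDE (FieldValue);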
You can use table variables to achieve this functionality.
If you are using the same structure in multiple queries, you can go for user-defined table types as well.
http://technet.microsoft.com/en-us/library/ms188927.aspx
http://technet.microsoft.com/en-us/library/bb522526(v=sql.105).aspx
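A short T-SQL sketch of both options, under assumed column names:

    -- Table variable: scoped to the current batch or procedure.
    DECLARE @Results TABLE (
        RecordId   INT PRIMARY KEY,
        FieldName  SYSNAME,
        FieldValue NVARCHAR(MAX)
    );
    GO

    -- User-defined table type: declare the structure once, reuse it everywhere
    -- (including as a table-valued parameter to stored procedures).
    CREATE TYPE dbo.RecordSet AS TABLE (
        RecordId   INT PRIMARY KEY,
        FieldName  SYSNAME,
        FieldValue NVARCHAR(MAX)
    );
    GO

    DECLARE @Reusable dbo.RecordSet;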
I'm working on a program that works with SQL Server.
For storing data in a database table, which of the approaches below is correct?
1. Store many rows in just one table (10 million records).
2. Store fewer rows in several tables (500,000 records each), e.g. create one table per year.
It depends on how often you access the data. If you are not using the old records, you can archive them. Splitting the data across tables is not desirable, as it may confuse you while fetching data.
I would say store all the data in a single table, but implement table partitioning on the older data. Partitioning the data will improve query performance.
Here are some references:
http://www.mssqltips.com/sqlservertip/1914/sql-server-database-partitioning-myths-and-truths/
http://msdn.microsoft.com/en-us/library/ms188730.aspx
http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
Please note that this table partitioning functionality is only available in Enterprise Edition.
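To make the suggestion concrete, here is a minimal T-SQL partitioning sketch with per-year ranges; the function, scheme, table, and boundary dates are all assumptions:

    -- Partition function and scheme: one range per year.
    CREATE PARTITION FUNCTION pfYear (DATE)
        AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01', '2014-01-01');

    CREATE PARTITION SCHEME psYear
        AS PARTITION pfYear ALL TO ([PRIMARY]);

    -- The table stays logically one table, but rows are stored per partition.
    CREATE TABLE dbo.Orders
    (
        OrderId   BIGINT IDENTITY NOT NULL,
        OrderDate DATE NOT NULL,
        Amount    DECIMAL(12,2),
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderDate, OrderId)
    ) ON psYear (OrderDate);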
Well, it depends!
What are you going to do with the data? If you query this data a lot, it could be a better solution to split the data into (for example) per-year tables. That way you would get better performance, since you'd be querying smaller tables.
But on the other side, with a bigger table and well-written queries you might not even see a performance issue. If you only need to store this data, it would be better to just use one table.
BTW, for loading this data into the database you could use BCP (bulk copy), which is a fast way of inserting a lot of rows.
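For instance, BULK INSERT (the T-SQL counterpart of the bcp utility) loads a flat file in one pass; the table, file path, and delimiters below are assumptions:

    -- Bulk-load a CSV into the table; TABLOCK helps enable minimal logging.
    BULK INSERT dbo.Orders
    FROM 'C:\data\orders_2014.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);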