Thanks for taking the time to go through this!
I have a Redshift cluster with multiple tables within a schema, and all tables have a date field that says when a row was inserted into the table.
The name of the date field is the same for every table.
ex:
1. Schema = public
table name = packages
date field = timestamp
2. Schema = public
table name = binary ...
date field = timestamp
I want to be able to iterate over all tables in the aforementioned schema and get the maximum of the date fields.
Thanks!
First off, "iterate over all tables" implies that this is not done purely in SQL. So you need some layer to dynamically gather the list of all tables in the schema and loop on them. This loop can find the max date for each OR create the SQL that will UNION ALL the information and produce the single max value. I'd lean towards a loop that iterates each table as there are limits to how many tables can be addressed in a single SQL statement and the number of tables in a schema can be quite large.
Next decision is where to perform this loop. This can be done externally to Redshift OR internally with a stored procedure. I'd recommend that you do this externally as stored procedures are generally not the fastest way and come with restrictions that can limit what your code can do. AWS gives you many tools that have strengths different than Redshift and combining them opens up many new options for what you can do with Redshift. A Lambda function could be a good choice for performing this looping operation. This does imply moving beyond (or augmenting) the approach of "I have a single JDBC/ODBC connection to my data systems". If you can make this transition the rewards are generally worth the effort. If not you are looking at a stored procedure with the restrictions and speed it can provide.
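As a minimal sketch of the looping approach, the driving code (whether in Lambda or a stored procedure) can pull the table list from the system catalog and then issue one small MAX query per table. This assumes the schema is public and the column is literally named "timestamp", as in the example above; adjust to your actual names.
-- gather the list of tables in the schema
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE';
-- for each table_name returned, the loop generates and runs a statement like:
SELECT 'packages' AS table_name, MAX("timestamp") AS max_ts
FROM public.packages;
The loop then keeps the largest max_ts it has seen, or writes each per-table result somewhere and takes the overall MAX at the end.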
Related
Due to the way our database is stored, we have tables for each significant event that occurs within a product's life:
Acquired
Sold
Delivered
I need to go through and find the status of a product at any given time. In order to do so, I'd need to query all of the tables within the schema and find the most up-to-date record. I know this is possible by union-ing all tables and then finding the MAX timestamp, but I wonder if there's a more elegant solution?
Is it possible to query all tables by just querying the root schema or database? Is there a way to loop through all tables within the schema and substitute that into the FROM clause?
Any help is appreciated.
Thanks
You could write a Stored Procedure but, IMO, that would only be worth the effort (and more elegant) if the list of tables changed regularly.
If the list of tables is relatively fixed then creating a UNION statement is probably the most elegant solution and relatively trivial to create - if you plan to use it regularly then just create it as a View.
The way I always approach this type of problem (creating the same SQL for multiple tables) is to dump the list of tables out into Excel, generate the SQL statement for the first table using functions, copy this function down for all the table names, and then concatenate all these statements in a final function. You can then just paste this text back into your SQL editor.
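For the three event tables above, a rough sketch of such a UNION view might look like this (product_id and event_ts are assumed column names; substitute whatever your tables actually use):
CREATE VIEW product_events AS
SELECT product_id, event_ts, 'Acquired'  AS status FROM Acquired
UNION ALL
SELECT product_id, event_ts, 'Sold'      AS status FROM Sold
UNION ALL
SELECT product_id, event_ts, 'Delivered' AS status FROM Delivered;
-- latest status of a given product
SELECT status
FROM product_events
WHERE product_id = 42
ORDER BY event_ts DESC
LIMIT 1;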
Pretty much as the title states: What is the difference between CREATE TABLE and CREATE COLUMN TABLE?
Both seemingly create a table, so what is the difference?
SAP HANA supports tables that store data in a column store or a row store.
These refer to different ways of how the database (HANA) manages the data stored in the tables.
They do not affect how the data can be used in a SQL statement whatsoever.
Technically, the syntax for CREATE TABLE in HANA has been extended to include a way to choose which of the two table types should be created:
CREATE [COLUMN|ROW] TABLE <table_name> ...
This means one can (and probably should) include the desired table type in any CREATE TABLE command, but can also choose not to do so (i.e. to keep compatibility with standard SQL).
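For illustration, both variants side by side (table and column names are just placeholders):
-- explicitly requesting the column store (the recommended default)
CREATE COLUMN TABLE sales_items (id INT PRIMARY KEY, amount DECIMAL(15,2));
-- explicitly requesting the row store
CREATE ROW TABLE config_flags (flag_name NVARCHAR(50) PRIMARY KEY, flag_value NVARCHAR(50));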
The default setting
Now, which table type you get when not specifying the table type depends on a HANA parameter in the indexserver.ini configuration file.
If the parameter [sql] - [default_table_type] is set to row then not specifying the table type will get a table stored in the row store. This is also the default value for the parameter up until HANA 2 SPS 03 if I'm not mistaken.
With HANA 2 SPS 04 the default for the parameter was finally changed to column.
What you should use
This is important: with SAP HANA you want the table type to be COLUMN in nearly all cases.
Row store tables have very different performance and memory requirement characteristics and really only serve very specific data access and modification patterns.
Those patterns are for example:
always full row access by selecting the complete primary key.
high frequency of UPDATEs on records (think updating the same record thousands of times a second).
records with nearly distinct values in every (or most) columns that absolutely need to be in memory at all times.
For the vast majority of all use cases and data types CREATE COLUMN TABLE is the right choice in SAP HANA.
Column store tables support compression, partitioning, memory displacement, and many other techniques that are not available for row store tables.
The difference it makes to your programs
And yet, both table types "look and feel" the same to any SQL command.
To give an analogy, other DBMS support different table types like "cluster" or "heap" that affect how data gets stored internally while the tables can be used regardless of the chosen type.
The HANA setting for column or row store is a similar choice about internal storage.
All that (and a lot more) is of course documented (e.g. here) and explained in many different places (e.g. my book SAP HANA Administration).
What column and row store is not about
Note that the choice between column or row store has nothing to do with whether the table is a temporary or permanent table. Both table types permanently persist the data as one would expect.
Of course, one can always use CREATE TEMPORARY TABLE, which also comes with a whole range of options... CREATE TEMPORARY { ROW | COLUMN } TABLE | LOCAL TEMPORARY { ROW | COLUMN } TABLE, but for this answer let's pretend we didn't see that, to save our sanity.
Take away
It's quite important to understand that HANA has those two fundamentally different implementations of in-memory tables.
Make sure you don't accidentally (by using the default) create row store tables for your mass data analytics or really for most use cases.
Whenever you're unsure about the table type, start off with a column store table and see if that works for your use case. Should you actually have a use case for which row store is the better option, you can (nearly) always convert a table from one storage type to the other via an ALTER TABLE command.
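If I recall the syntax correctly, such a conversion looks roughly like this (schema and table name are placeholders):
-- convert an existing row store table to the column store
ALTER TABLE my_schema.my_table ALTER TYPE COLUMN;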
I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what range is available for querying. For example, for a specific machine I would like to first present to my users the relevant scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (each for a period) and I don't know which tables are available? How can I construct a query when I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I wouldn't know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query can get slower as the number of tables to query grow.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means that if you have daily tables, you won't be able to query 3 years of data (3*365 > 1000).
Remember that unions in BigQuery don't use the UNION keyword, but the "," that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
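In legacy BigQuery SQL that comma-union looks something like this (dataset, table, and column names are made up for the example):
SELECT log_time, message
FROM [mydataset.logs20131225], [mydataset.logs20131226]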
Table discovery
API: tables.list will list all tables in a dataset, through the API.
SQL: To query the list of tables within SQL... stay tuned.
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
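With a date-partitioned table, a query can restrict the data scanned via the _PARTITIONTIME pseudo column, for example (table and column names are illustrative):
SELECT log_time, message
FROM `mydataset.logs`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-07');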
I have an aggregate data set that spans multiple years. The data for each respective year is stored in a separate table named Data. The data is currently sitting in MS ACCESS tables, and I will be migrating it to SQL Server.
I would prefer that data for each year is kept in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year is approx. 1.5M records of 40ish fields.
I am trying to avoid having to do an excessive number of UNIONS in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding number of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utility? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?
If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you use this, why not put all the data into one table? Then if you wanted, you could create the views the other way around:
create view Data2001
as
(
select * from AllData
where CreateDate >= '1/1/2001'
and CreateDate < '1/1/2002'
)
A single table is likely the best choice for this type of query. However, you have to balance that against the other work the db is doing.
One choice you did not mention is creating a view that contains the unions and then querying on the view. That way, at least, you only have to add the union statement to the view each year, and all queries using the view will be correct. Personally, if I did that, I would write a creation query that creates the table and then adjusts the view to add the union for that table. Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
One way to do this is by using horizontal partitioning.
You basically create a partitioning function that informs the DBMS to create separate tables for each period, each with a constraint informing the DBMS that there will only be data for a specific year in each.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a schema is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem a lot, depending on your query plans it shouldn't be a big deal (for a decently specced SQL Server). Refer to the documentation.
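A rough sketch of what that setup looks like in SQL Server (object names, boundary dates, and columns are just examples):
-- one boundary per year; RANGE RIGHT puts '2002-01-01' into the 2002 partition
CREATE PARTITION FUNCTION pfYear (date)
AS RANGE RIGHT FOR VALUES ('2002-01-01', '2003-01-01', '2004-01-01');
CREATE PARTITION SCHEME psYear
AS PARTITION pfYear ALL TO ([PRIMARY]);
CREATE TABLE dbo.AllData
(
    Id INT NOT NULL,
    CreateDate date NOT NULL
    -- ... the other ~40 columns ...
) ON psYear (CreateDate);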
I can't add comments due to low rep, but definitely agree with 1 table, and partitioning is helpful for large data sets, and is supported in SQL Server, where the data will be getting migrated to.
If the data is heavily used and frequently updated then monthly partitioning might be useful, but if not, given the size, partitioning probably isn't going to be very helpful.
I am currently writing an application that needs to be able to select a subset of IDs from Millions of users...
I am currently writing software to select a group of 100.000 IDs from a table that contains the whole Brazilian population, 200.000.000 (200M) rows. I need to be able to do this in a reasonable amount of time... ID on Table = ID on XML
I am thinking of parsing the xml file and starting a thread that performs a SELECT statement on a database, I would need a connection for each thread, still this way seems like a brute force approach, perhaps there is a more elegant way?
1) what is the best database to do this?
2) what is a reasonable limit to the amount of db connections?
Making 100.000 queries would take a long time, and splitting up the work on separate threads won't help you much as you are reading from the same table.
Don't get a single record at a time; rather, divide the 100.000 items up into reasonably small batches, for example 1000 items each, which you can send to the database. Create a temporary table in the database with those ID values, and join it against the main table to get those records.
Using MS SQL Server for example, you can send a batch of items as an XML to a stored procedure, which can create the temporary table from that and query the database table.
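A hedged T-SQL sketch of that idea; the procedure name, table name, and XML shape are all assumptions:
CREATE PROCEDURE dbo.GetUsersByIds
    @ids XML   -- e.g. '<ids><id>1</id><id>2</id></ids>'
AS
BEGIN
    CREATE TABLE #ids (id INT PRIMARY KEY);
    -- shred the XML batch into the temp table
    INSERT INTO #ids (id)
    SELECT x.value('.', 'int')
    FROM @ids.nodes('/ids/id') AS t(x);
    -- join against the big table; the indexed ID makes this cheap
    SELECT u.*
    FROM dbo.Users AS u
    JOIN #ids AS i ON i.id = u.id;
END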
Any modern DBMS that can handle an existing 200M row table should have no problem comparing against a 100K row table (assuming your hardware is up to scratch).
Ideal solution: Import your XML (at least the IDs) into a new table, ensure the columns you're comparing are indexed correctly. And then query.
What language? If you're using .NET you could load your XML and SQL as data sources, and then I believe there are some enumerable functions that could be used to compare the data.
Do this:
Parse the XML and store the extracted IDs into a temporary table.¹
From the main table, select only the rows whose ID is also present in the temporary table:
SELECT * FROM MAIN_TABLE WHERE ID IN (SELECT ID FROM TEMPORARY_TABLE)
A decent DBMS will typically do the job quicker than you can, even if you employed batching/chunking and parallelization on your end.
¹ Temporary tables are typically created using CREATE [GLOBAL|LOCAL] TEMPORARY TABLE ... syntax, and you'll probably want it private to the session (check your DBMS's interpretation of GLOBAL vs. LOCAL for this). If your DBMS of choice doesn't support temporary tables, you can use "normal" tables instead, but be careful not to let concurrent sessions mess with that table while you are still using it.