Union tables with different schema - google-bigquery

I am referring to Jordan's answer regarding "Dealing with evolving schemas".
As I have a similar problem, I tried to query tables with different schemas and got the following results:
Select a,b,c
FROM (Select 1 as a, 2 as b), --Test_a
(Select 1 as a, 2 as b, 3 as c), --Test_b
runs fine...
I have put Test_a and Test_b into physical tables (all fields are nullable) and tried:
Select a,b,c
FROM (Select a,b from BI_WORKSPACE.Test_a),
(Select a,b,c from BI_WORKSPACE.Test_b)
It also runs fine.
But when I tried:
Select a,b,c
FROM BI_WORKSPACE.Test_a,
BI_WORKSPACE.Test_b
It failed...
Is this a bug, or am I doing something wrong?
The last sample is the one I am after, as it allows me to "evolve" my schema over time. I would like to avoid altering the schema of all existing tables whenever I add a column to support a new business need.
Many thanks for your help.
The reason for asking:
We hold our data in "Daily tables" so when querying we pay only for the period we are interested in.
As BQ doesn’t support "Dynamic SQL", we have created an offline process that takes a query template and generates a query for desired period. Something like:
Input:
Select a,b,c FROM [<Source>]
Output:
Select a,b,c FROM [MYDATASET.TABLE20140201], [MYDATASET.TABLE20140202], [MYDATASET.TABLE20140203], [MYDATASET.TABLE20140204] , [MYDATASET.TABLE20140205] , [MYDATASET.TABLE20140206] , [MYDATASET.TABLE20140207]
Our process is unaware of the query logic. Sometimes we add fields to support evolving business needs.
Using dynamic subselects would complicate things a lot, and altering the schema of hundreds of existing tables is expensive and error-prone.
Any suggestions?

I don't think the last query should work. You're asking for columns a,b, and c from two tables, but one of those tables doesn't have a column with that name. That looks like a query error to me, since you are explicitly asking for a column that doesn't exist on the table.
There is a workaround -- to use a subselect -- which you noticed, if you know that a field may be missing from one schema. The other workaround, of course, is to update the schema.
This seems like it is working as intended. If you don't agree, can you let me know why?

It's possible to select from a union of tables with different schemas.
The simple trick is to wrap each table in a subquery with an asterisk, building on the subselect workaround Jordan proposed. There's no need to alter the schema.
In your case this will work (legacy SQL dialect):
SELECT a,b,c
FROM ( SELECT * FROM BI_WORKSPACE.Test_a ),
( SELECT * FROM BI_WORKSPACE.Test_b )
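Since the offline process in the question already expands a template into a list of daily tables, the subselect wrapping can be generated in the same step. Here is a minimal sketch in Python, assuming the `[<Source>]` placeholder and the `MYDATASET.TABLEyyyymmdd` naming shown in the question. The function name `expand_template` and its parameters are hypothetical, not part of any BigQuery tooling.

```python
from datetime import date, timedelta

def expand_template(template, dataset, prefix, start, end):
    """Replace [<Source>] with one wrapped subselect per daily table.

    Wrapping each table in ( SELECT * FROM ... ) is the legacy SQL
    workaround described above for tables whose schemas differ.
    """
    days = (end - start).days + 1
    sources = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        table = f"[{dataset}.{prefix}{day:%Y%m%d}]"
        sources.append(f"( SELECT * FROM {table} )")
    return template.replace("[<Source>]", ",\n     ".join(sources))

# Hypothetical usage: three daily tables from the question's example period.
query = expand_template(
    "SELECT a,b,c FROM [<Source>]",
    "MYDATASET", "TABLE",
    date(2014, 2, 1), date(2014, 2, 3),
)
print(query)
```

The template itself stays unaware of the query logic, which matches the constraint in the question: only the source list changes.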

Related

Any resources for this SQL filtering?

I have 100 tables, each on the order of a few tenths of a GB in size. The schema of each table is the following:
A: string | B: string | C: string
In each table I would like to retain only the rows for which the pair (B, C) appears at least 10 times in the concatenation of all 100 tables. Is there an efficient way to achieve this?
This is a very vague question, and leaving out your DBMS doesn't help, as SQL comes in different dialects.
But first, you would have to combine all of the tables into one result set. There may be a faster way of doing this, but without knowing which flavor of SQL you are using it is hard to tell.
Something like this will work (note UNION ALL rather than UNION, since a plain UNION removes duplicate rows and would distort the counts):
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
...
UNION ALL
SELECT * FROM table_100
Once you have all of the data you do something like this:
WITH tables_with_counts AS (SELECT
A,
B,
C,
COUNT(1) OVER (PARTITION BY B, C) AS bc_count
FROM
aggregated_tables)
SELECT
A,
B,
C
FROM
tables_with_counts
WHERE
bc_count >= 10
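To see why the way the tables are combined matters for the count, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for whatever DBMS is actually in use. It assumes an SQLite build with window function support (3.25 or later). The two tiny tables and the threshold of 4 (instead of 10) are made up purely to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (A TEXT, B TEXT, C TEXT);
    CREATE TABLE table_2 (A TEXT, B TEXT, C TEXT);
    INSERT INTO table_1 VALUES ('a1', 'x', 'y'), ('a1', 'x', 'y'), ('a2', 'p', 'q');
    INSERT INTO table_2 VALUES ('a1', 'x', 'y'), ('a3', 'x', 'y');
""")

def rows_with_frequent_bc(combine, threshold):
    """Return rows whose (B, C) pair occurs at least `threshold` times."""
    sql = f"""
        WITH aggregated_tables AS (
            SELECT * FROM table_1
            {combine}
            SELECT * FROM table_2
        ),
        tables_with_counts AS (
            SELECT A, B, C,
                   COUNT(1) OVER (PARTITION BY B, C) AS bc_count
            FROM aggregated_tables
        )
        SELECT A, B, C FROM tables_with_counts
        WHERE bc_count >= ?
        ORDER BY A
    """
    return conn.execute(sql, (threshold,)).fetchall()

# UNION ALL keeps every row; UNION drops exact duplicates before counting.
keep_all = rows_with_frequent_bc("UNION ALL", 4)
deduped = rows_with_frequent_bc("UNION", 4)
print(len(keep_all), len(deduped))
```

With UNION ALL the pair ('x', 'y') is counted four times and its rows are kept. With a plain UNION the duplicate rows collapse first, the count falls below the threshold, and nothing is returned.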
Here is my take:
Step 1 : Aggregate all tables into one. It would be bulky but if you are using Oracle database, I think it shouldn't be an issue.
Step 2: Create md5 checksum hash values for B,C columns like below :
SELECT APEX_ITEM.MD5_CHECKSUM(B,C) md5_cks,
A,B,C
FROM aggregated_tables
Step 3: Take a count based on the checksum values and retain the rows where count >= 10 (the question asks for "at least 10").
Step 4: Get rid of duplicate data using rank() or dense_rank() in a delete statement.
The short answer, which I'm sure that you don't want to hear, is "no." In the context of relational databases there is no efficient query to merge 100 tables.
It is not all bad news though. If it were just one table (let's say it was named "combined" just to have a concrete example) you could use an elegant SQL query with window functions:
select A,B,C from (select A,B,C,count(1) over (partition by B,C) as counts from combined)counted where counts>=10
Option 1. So the question is how to get a "combined" table so that the snippet above works. If we stick with ANSI (standard) SQL, you could use UNION ALL and collect the result in a WITH clause to keep things neat.
Here is an example:
with
combined as (
select * from table_1
union all
select * from table_2),
counted as (
select
A,B,C,
count(1) over (partition by B,C) as counts
from
combined)
select A,B,C from counted where counts>=10;
I only included 2 tables, but the real query would extend that up to table_100. That's a lot of typing and not very efficient with the programmer's time. Also, UNION and UNION ALL over many large tables tend to perform poorly in databases, so this is not efficient in terms of system resources or time, either. Personally I would not do it this way, but it is an answer.
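The typing, at least, can be avoided by generating the statement. Below is a minimal sketch in Python that builds the same WITH ... UNION ALL query for any list of table names and runs it against SQLite (3.25 or later for window functions) as a stand-in database. The helper name `build_filter_query` is hypothetical, and the three one-row tables and the threshold of 3 are invented for the demonstration.

```python
import sqlite3

def build_filter_query(table_names, threshold):
    """Build the UNION ALL + window-count query for any number of tables."""
    combined = "\n    union all\n    ".join(
        f'select * from "{name}"' for name in table_names
    )
    return (
        "with\n"
        f"combined as (\n    {combined}),\n"
        "counted as (\n"
        "    select A,B,C, count(1) over (partition by B,C) as counts\n"
        "    from combined)\n"
        f"select A,B,C from counted where counts>={int(threshold)}"
    )

conn = sqlite3.connect(":memory:")
names = [f"table_{i}" for i in range(1, 4)]
for name in names:
    conn.execute(f'create table "{name}" (A text, B text, C text)')
    # Every table contributes one row with the same (B, C) pair.
    conn.execute(f'insert into "{name}" values (?, ?, ?)', (name, "x", "y"))
# One extra row whose (B, C) pair occurs only once overall.
conn.execute('insert into "table_1" values (?, ?, ?)', ("extra", "p", "q"))

result = conn.execute(build_filter_query(names, 3)).fetchall()
print(sorted(result))
```

This only saves the programmer's time. The database still has to scan and combine every table, so the performance concern above is unchanged.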
Option 2. There are other options which do not exactly match your question, but may be helpful to know. Any time you are tempted to create multiple tables with exactly the same schema, you will be better off creating a single table with multiple partitions. See the partitioning documentation for MySQL, Postgres, SQL Server, Oracle, or Hive. Every database platform has its own syntax for partitioning tables, but they are all similar. For this table, each of the original tables becomes a single partition, and the original table name would be a really good candidate for the string value in the partition identifier (partition column).
If you are able to stuff all of your 100 tables into 100 partitions of one table then you can run the first query after all. The advantage is that the database can optimize that query because all modern databases are optimized to manage partitioned queries.
In addition, adding a partition to a table is really no more trouble than creating a new table instead, but supporting and maintaining one table is a lot less trouble than 100 tables.
Option 3. Since you tagged "big data", a third option is to use a big data engine like Spark with SparkSQL. This would likely be the best choice, because Spark can load a dataframe with 100 combined tables very efficiently, and the SQL after that is not much different from the relational database SQL we have been considering. That's kind of out of scope here, but worth considering. If you submit a more specific question about Spark, we could go into more detail.

Trying to pull from multiple tables and multiple columns within tables

I have tried this query to pull data from multiple tables and columns. It runs, but the result comes back blank.
select * from onshore.contracting where code between 18789 and 18798;
select * from onshore.safety_incident where code between 18789 and 18798;
For your immediate problem, the following SQL will work if you really want all the data:
select * from onshore.contracting where code between 18789 and 18798;
select * from onshore.safety_incident where code between 18789 and 18798;
... and so on.
These tables probably have different columns so you need a separate select statement for each one.
If you are going to do more with SQL then it really would be worthwhile to learn it. There is a free resource here: https://www.w3schools.com/sql, my contribution is at http://www.thedatastudio.net and there are many others. It is a bit dangerous to use SQL without understanding it.

Difference between two tables, unknown fields

Is there a way in Access using SQL to get the difference between 2 tables?
I'm building an audit function and I want to return all records from table1 where a value (or values) doesn't match the corresponding record in table2. Primary keys will always match between the two tables. They will always contain the exact same number of fields, field names, and types, as each other. However, the number and name of those fields cannot be determined before the query is run.
Please also note, I am looking for an Access SQL solution. I know how to solve this with VBA.
Thanks,
There are several ways to compare fields with known names, but there is no way in SQL to access fields without knowing their names, mostly because SQL doesn't consider fields to have a specific order in a table.
So the only way to accomplish what you need in pure Access SQL would be a dedicated SQL command for it (something like the * placeholder for all fields). But there isn't one; see the Microsoft Access SQL Reference.
What you COULD do is build the SQL statement on the fly in VBA. (I know, you said you didn't want to do it in VBA, but this is still doing the comparison in SQL and only using VBA to create the SQL.)
Doing everything in VBA would probably take some time, but creating an SQL on the fly is very fast and you can optimize it to the specific table. Then executing the SQL is the fastest solution you can get.
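To make the "build the SQL on the fly" idea concrete, here is a sketch that swaps in Python and SQLite for VBA and Access, since that combination is easy to run anywhere. It reads the column names from the catalog at run time and generates one comparison per non-key column. The helper name `build_diff_query`, the sample tables, and the `id` key column are all assumptions for illustration. `IS NOT` is SQLite's NULL-safe inequality, so the Access version of the generated string would need a different comparison.

```python
import sqlite3

def build_diff_query(conn, table1, table2, key):
    """Build a query returning table1 rows that differ from table2.

    Column names are read from the catalog at run time, so they do not
    need to be known in advance. Assumes at least one non-key column.
    """
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table1}")')]
    compared = [col for col in columns if col != key]
    conditions = " OR ".join(f't1."{col}" IS NOT t2."{col}"' for col in compared)
    return (
        f'SELECT t1.* FROM "{table1}" AS t1 '
        f'INNER JOIN "{table2}" AS t2 ON t1."{key}" = t2."{key}" '
        f"WHERE {conditions}"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT, amount INTEGER);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, name TEXT, amount INTEGER);
    INSERT INTO table1 VALUES (1, 'a', 10), (2, 'b', 20), (3, 'c', NULL);
    INSERT INTO table2 VALUES (1, 'a', 10), (2, 'b', 99), (3, 'c', 30);
""")

sql = build_diff_query(conn, "table1", "table2", "id")
changed = conn.execute(sql + " ORDER BY t1.id").fetchall()
print(changed)
```

Rows 2 and 3 come back because `amount` differs, including the NULL versus 30 case that a plain `<>` comparison would miss.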
It's hard to say without seeing your table structure, but you can probably get that done using the NOT IN operator or WHERE NOT EXISTS, like:
select * from table1
where some_field not in (select some_other_field from table2);
(OR)
select * from table1 t1
where not exists (select 1 from table2 where some_other_field = t1.some_field);
SELECT A.*, B.* FROM A FULL JOIN B ON (A.C = B.C) WHERE A.C IS NULL OR B.C IS NULL;
If you have tables A and B, both with a column C, you can find the records which are present in table A but not in B. To get all the differences with a single query, a full join must be used, as above.
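Not every engine accepts FULL JOIN (Access is, to my knowledge, one that does not, and older SQLite versions don't either). A common substitute is two LEFT JOINs glued together with UNION ALL, one for each direction. The sketch below runs that against SQLite from Python. The tables A and B and the column C follow the answer above, while the `note` column and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (C INTEGER, note TEXT);
    CREATE TABLE B (C INTEGER, note TEXT);
    INSERT INTO A VALUES (1, 'only in A'), (2, 'in both');
    INSERT INTO B VALUES (2, 'in both'), (3, 'only in B');
""")

# First branch: rows of A without a match in B.
# Second branch: rows of B without a match in A.
sql = """
    SELECT A.C AS a_c, B.C AS b_c
    FROM A LEFT JOIN B ON A.C = B.C
    WHERE B.C IS NULL
    UNION ALL
    SELECT A.C AS a_c, B.C AS b_c
    FROM B LEFT JOIN A ON A.C = B.C
    WHERE A.C IS NULL
"""
unmatched = conn.execute(sql).fetchall()
print(unmatched)
```

Each returned row has a value on exactly one side, which tells you which table the unmatched record came from.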

Select * from n tables

Is there a way to write a query like:
select * from <some number of tables>
...where the number of tables is unknown? I would like to avoid using dynamic SQL. I would like to select all rows from all the tables that (the tables) have a specific prefix:
select * from t1
select * from t2
select * from t3
...
I don't know how many t(n) might there be (might be 1, might be 20, etc.) The t table column structures are not the same. Some of them have 2 columns, some of them 3 or 4.
It would not be hard using dynamic SQL, but I wanted to know if there is a way to do this using something like sys.tables.
UPDATE
Basic database design explained
N companies will register/log in to my application
Each company will set up ONE table with x columns
(x depends on the type of business the company is, can be different, for example think of two companies: one is a Carpenter and the other is a Newspaper)
Each company will fill its own table using an API built by me
What I do with the data:
I have a "processor", that will be SQL or C# or whatever.
If there is at least one row for one company, I will generate a record in a COMMON table.
So the final results will be all in one table.
Anybody from any of those N companies will log in and see the COMMON table filtered for their own company.
There would be no way to do that without Dynamic SQL. And having different table structures does not help that at all.
Update
There would be no easy way to return the desired output in one single result set (result set would have at least the same # of columns of the table with most columns and don't even get me started on data types compatibility).
However, you should check @KM.'s answer. That will return multiple result sets.
To list ALL tables you could try:
EXEC sp_msforeachtable 'SELECT * FROM ?'
You can programmatically include or exclude tables by doing something like:
EXEC sp_msforeachtable 'IF LEFT(''?'',9)=''[dbo].[xy'' BEGIN SELECT * FROM ? END ELSE PRINT LEFT(''?'',9)'
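The same "look the tables up in the catalog, then query each one" approach can be sketched outside SQL Server as well. The example below uses Python with SQLite, where `sqlite_master` plays the role of `sys.tables`. The helper name `select_from_prefixed_tables` and the sample tables are hypothetical. It is still dynamic SQL under the hood, and it returns one result set per table because the column structures differ.

```python
import sqlite3

def select_from_prefixed_tables(conn, prefix):
    """Run SELECT * on every table whose name starts with `prefix`.

    Returns {table_name: rows}, i.e. one result set per table, because
    the tables do not share a column structure.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if row[0].startswith(prefix)
    ]
    return {
        name: conn.execute(f'SELECT * FROM "{name}"').fetchall()
        for name in names
    }

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a, b);
    CREATE TABLE t2 (a, b, c);
    CREATE TABLE other (x);
    INSERT INTO t1 VALUES (1, 2);
    INSERT INTO t2 VALUES (3, 4, 5);
    INSERT INTO other VALUES (9);
""")

results = select_from_prefixed_tables(conn, "t")
print(results)
```

The prefix check is done in Python rather than with LIKE so that characters such as `_` in a prefix are not treated as wildcards.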

MySQL - Selecting data from multiple tables all with same structure but different data

Ok, here is my dilemma I have a database set up with about 5 tables all with the exact same data structure. The data is separated in this manner for localization purposes and to split up a total of about 4.5 million records.
A majority of the time only one table is needed and all is well. However, sometimes data is needed from 2 or more of the tables and it needs to be sorted by a user defined column. This is where I am having problems.
data columns:
id, band_name, song_name, album_name, genre
MySQL statement:
SELECT * from us_music, de_music where `genre` = 'punk'
MySQL spits out this error:
#1052 - Column 'genre' in where clause is ambiguous
Obviously, I am doing this wrong. Anyone care to shed some light on this for me?
I think you're looking for the UNION clause, a la
(SELECT * from us_music where `genre` = 'punk')
UNION
(SELECT * from de_music where `genre` = 'punk')
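Since the question also needs the combined rows sorted by a user-defined column, here is a sketch of how that could look when the query is assembled in application code. It uses Python with SQLite as a stand-in for MySQL. The function name `songs_by_genre`, the whitelist, and the sample rows are assumptions. It uses UNION ALL instead of UNION, which skips duplicate elimination, so rows that happen to be identical across tables are all kept.

```python
import sqlite3

# Only these column names may be interpolated into ORDER BY.
SORTABLE = {"id", "band_name", "song_name", "album_name", "genre"}

def songs_by_genre(conn, tables, genre, sort_column):
    """Fetch rows of one genre from several same-shaped tables, sorted."""
    if sort_column not in SORTABLE:
        raise ValueError(f"cannot sort by {sort_column!r}")
    selects = " UNION ALL ".join(
        f"SELECT * FROM {table} WHERE genre = ?" for table in tables
    )
    params = [genre] * len(tables)  # one placeholder per SELECT
    return conn.execute(f"{selects} ORDER BY {sort_column}", params).fetchall()

conn = sqlite3.connect(":memory:")
for table in ("us_music", "de_music"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(id INTEGER, band_name TEXT, song_name TEXT, album_name TEXT, genre TEXT)"
    )
conn.execute("INSERT INTO us_music VALUES (1, 'Band Z', 'Song 1', 'Album 1', 'punk')")
conn.execute("INSERT INTO us_music VALUES (2, 'Band M', 'Song 2', 'Album 2', 'jazz')")
conn.execute("INSERT INTO de_music VALUES (1, 'Band A', 'Song 3', 'Album 3', 'punk')")

rows = songs_by_genre(conn, ["us_music", "de_music"], "punk", "band_name")
print(rows)
```

The sort column is checked against a fixed set because column names cannot be passed as bound parameters. The table names are assumed to come from trusted code, not from the user.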
It sounds like you'd be happier with a single table. The five tables having the same schema, and sometimes needing to be presented as if they came from one table, point to putting it all in one table.
Add a new column which can be used to distinguish among the five languages (I'm assuming it's language that is different among the tables since you said it was for localization). Don't worry about having 4.5 million records. Any real database can handle that size no problem. Add the correct indexes, and you'll have no trouble dealing with them as a single table.
Any of the above answers are valid. An alternative is to qualify the column name with the table name, e.g.:
SELECT * from us_music, de_music where `us_music`.`genre` = 'punk' AND `de_music`.`genre` = 'punk'
The column is ambiguous because it appears in both tables. You would need to qualify the WHERE (or sort) field fully, such as us_music.genre or de_music.genre, but you'd usually only name two tables like this if you were going to join them together in some fashion. The structure you're dealing with is occasionally referred to as a partitioned table, although that is usually done to separate the dataset into distinct files as well, rather than just to split the dataset arbitrarily. If you're in charge of the database structure and there's no good reason to partition the data, I'd build one big table with an extra "origin" field that contains a country code. But you're probably doing it for a legitimate performance reason.
Either use a UNION to combine the tables you're interested in (http://dev.mysql.com/doc/refman/5.0/en/union.html) or use the MERGE storage engine (http://dev.mysql.com/doc/refman/5.1/en/merge-storage-engine.html).
Your original attempt to span both tables creates an implicit JOIN. This is frowned upon by most experienced SQL programmers because it separates the list of tables to be combined from the condition that says how to combine them.
The UNION is a good solution for the tables as they are, but there should be no reason they can't be put into the one table with decent indexing. I've seen adding the correct index to a large table increase query speed by three orders of magnitude.
The UNION statement can cause considerable delay on huge data sets. It is better to perform the select in 2 steps:
first select only the ids,
then select the rows from the main table using those ids