BQ - INSERT without listing columns - google-bigquery

I have 2 BQ tables that are very wide in terms of the number of columns. Note: all of the table columns are nullable, for flexibility.
Table A - 1000 cols - superset of B's cols
Table B - 500 cols - subset of A's cols, exactly the same names/types as the cols above
So rows from B's table should be insertable into A, where any column not present in the insert just gets a NULL, i.e. 500 cols get a value and the remaining 500 default to NULL because they are not present in the insert.
Since these tables are very wide, enumerating all the columns in an insert statement would take forever and be a maintenance nightmare.
Is there a way in standard SQL to insert without listing the column names in the insert statement, so that columns are automagically matched by name?
I really want to be able to do something like the statement below and have the columns from B matched to A for each row inserted. If not, is there another way I'm not seeing that could help with this?
thanks!
INSERT INTO `p.d.A` (
SELECT *
FROM `p.d.B` )
I actually tried enumerating the columns to see if referencing nested fields worked, and it seems it doesn't:
INSERT INTO `p.d.A` (x, y.z) (
SELECT x, y.z
FROM `p.d.B` )
I can't just say (x, y) because the y structs from the two tables aren't exactly the same; BQ complains that the structs don't match exactly, hence why I was trying y.z.

Sure, easy!
Prepare a dummy table p.d.b_ using the select below:
SELECT * FROM `p.d.a` WHERE FALSE
(Note: even though the result will be an empty table, the query above will scan the whole of table a. This is required just once, so it should be okay; if not, you can script it once and just create this table from the script.)
Ok, so now instead of using
SELECT * FROM `p.d.b`
you will use
SELECT * FROM `p.d.b*`
and this will do the trick for you (it did for me :o)
P.S. Of course, I assume you will make sure there are no other tables with names starting with b (or whatever the real name is) in that dataset.
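Putting the pieces together, here is a minimal sketch of the sequence described above (assuming standard SQL DDL is used to create the dummy table, and that p.d.b_ ends up created after p.d.b, since a wildcard table takes its schema from the most recently created matching table):
-- One-time setup: empty dummy table carrying a's full schema
CREATE TABLE `p.d.b_` AS
SELECT * FROM `p.d.a` WHERE FALSE;

-- b* matches both b and b_, so the SELECT exposes a's full column list;
-- columns that don't exist in b simply come back as NULL
INSERT INTO `p.d.a`
SELECT * FROM `p.d.b*`;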

Related

BigQuery insert values AS, assume nulls for missing columns

Imagine there is a table with 1000 columns.
I want to add a row with values for 20 columns and assume NULLs for the rest.
INSERT VALUES syntax can be used for that:
INSERT INTO `tbl` (
date,
p,
... # 18 more names
)
VALUES(
DATE('2020-02-01'),
'p3',
... # 18 more values
)
The problem with it is that it is hard to tell which value corresponds to which column. And if you need to change/comment out some value then you have to make edits in two places.
INSERT SELECT syntax also can be used:
INSERT INTO `tbl`
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
... # 980 more NULL AS column
Then if I need to comment out a column, only one line has to be commented out.
But obviously having to set 980 NULLs is an inconvenience.
What is the way to combine both approaches? To achieve something like:
INSERT INTO `tbl`
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
The query above doesn't work, the error is Inserted row has wrong column count; Has 20, expected 1000.
Your first version is really the only one you should ever be using for SQL inserts. It ensures that every target column is explicitly mentioned, and is unambiguous with regard to where the literals in the VALUES clause should go. You can use the version which does not explicitly mention column names. At first, it might seem that you are saving yourself some code. But realize that there is a column list which will be used, and it is the list of all the table's columns, in whatever their positions from definition are. Your code might work, but appreciate that any addition/removal of a column, or changing of column order, can totally break your insert script. For this reason, most will strongly advocate for the first version.
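To make the fragility concrete, here is a small illustration with a hypothetical table (the names are made up for the example): an insert without a column list binds values purely by position, so a later schema change silently changes its meaning or breaks it outright.
-- Hypothetical two-column table
CREATE TABLE `p.d.orders` (id INT64, amount FLOAT64);

-- No column list: values bind by position (id = 1, amount = 10.5)
INSERT INTO `p.d.orders` VALUES (1, 10.5);

-- Someone later widens the table...
ALTER TABLE `p.d.orders` ADD COLUMN customer_id INT64;

-- ...and the same positional insert now fails: 2 values supplied for 3 columns
INSERT INTO `p.d.orders` VALUES (2, 20.0);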
You can try the following solution; it is a combination of the 2 approaches you highlighted in your question:
INSERT INTO `tbl` (date, p, ... # 18 more column names)
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
A couple of things you should consider here:
The other 980 columns must be nullable, i.e. able to hold NULL values.
The columns listed in the INSERT and the expressions in the SELECT must be in the same order so that the data is inserted into the correct columns.
To avoid any confusion, use aliases in the SELECT that match the column names in the INSERT list; that removes any ambiguity.
Hopefully it will work for you.
In BigQuery, the best way to do what you're describing is to first load to a staging table. I'll assume you can get the values you want to insert into JSON format with keys that correspond to the target column names.
values.json
{"date": "2020-01-01", "p": "p3", "column": "value", ... }
Then generate a schema file for the target table and save it locally
bq show --schema project:dataset.tbl > schema.json
Load the new data to the staging table using the target schema. This gives you "named" null values for each column present in the target schema but missing from your json, bypassing the need to write them out.
bq load --replace --source_format=NEWLINE_DELIMITED_JSON \
project:dataset.stg_tbl values.json schema.json
Now the insert select statement works every time
insert into `project.dataset.tbl`
select * from `project.dataset.stg_tbl`
Not a pure SQL solution but I managed this by loading my staging table with data then running something like:
from google.cloud import bigquery

client = bigquery.Client()

# Fetch both tables and map column name -> schema field
table1 = client.get_table(f"{project_id}.{dataset_name}.table1")
table1_col_map = {field.name: field for field in table1.schema}
table2 = client.get_table(f"{project_id}.{dataset_name}.table2")
table2_col_map = {field.name: field for field in table2.schema}

# Merge the maps; table1's fields take precedence for shared column names
combined_schema = {**table2_col_map, **table1_col_map}

# Assign the combined schema back to table1 and push only the schema change
table1.schema = list(combined_schema.values())
client.update_table(table1, ["schema"])
Explanation:
This retrieves the schemas of both tables and converts each into a dictionary keyed by column name, with the actual field object from the SDK as the value. The two dictionaries are then combined with dictionary unpacking (the order of unpacking determines which table's columns take precedence when a column exists in both). Finally, the combined schema is assigned back to table 1 and used to update the table, adding the missing columns as NULLs.
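Once the schema update has gone through and both tables expose the same set of columns, the load itself can be a plain insert-select. A hedged sketch, assuming table1 ends up as the staging table with the full column set and table2 is the wide target (swap the names if it's the other way round in your setup), and that the shared columns end up in a compatible order; if they don't, list the columns explicitly or generate the list from the schema:
insert into `project.dataset.table2`
select * from `project.dataset.table1`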

UPDATE two columns with new value under large size table

We have a table like:
mytable (pid, string_value, int_value)
This table has more than 20M rows in total. We now have a feature that needs to mark all the rows in this table as invalid, so we need to update the columns to string_value = NULL and int_value = 0, which indicates an invalid row (we still want to keep the pid, as it is important to us).
So what is the best way?
I use the following SQL:
UPDATE Mytable
SET string_value = NULL,
int_value = 0;
but this query takes more than 4 minutes in my test env. Is there any better way to do this?
Updating all the rows can be quite expensive. Often, it is faster to empty the table and reload it.
In generic SQL this looks like:
create table mytable_temp as
select pid
from mytable;
truncate table mytable; -- back it up first!
insert into mytable (pid, string_value, int_value)
select pid, null, 0
from mytable_temp;
The creation of the temporary table may use different syntax, depending on your database.
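For instance, SQL Server doesn't support create table ... as select; its equivalent of the first step would be:
select pid
into mytable_temp
from mytable;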
Updates can take time to complete. Another way of achieving this is to follow these steps (a sketch follows below):
Add new columns with the values you need set as the default value
Drop the original columns
Rename the new columns with the names of the original columns.
You can then drop the default values on the new columns.
This needs to be tested, as different DBMSs allow different levels of table alters (i.e. not all DBMSs allow dropping a default or dropping a column).
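A rough sketch of those steps in SQL Server-flavored syntax (the column types and the constraint name are assumptions for the example; adapt to your DBMS):
-- 1. Add replacement columns whose "values" are the ones you want:
--    a nullable column with no default stays NULL for existing rows,
--    a NOT NULL column with a default is backfilled with that default
ALTER TABLE mytable ADD string_value_new varchar(255) NULL;
ALTER TABLE mytable ADD int_value_new int NOT NULL
    CONSTRAINT DF_mytable_int_value_new DEFAULT 0;

-- 2. Drop the original columns
ALTER TABLE mytable DROP COLUMN string_value;
ALTER TABLE mytable DROP COLUMN int_value;

-- 3. Rename the new columns to the original names
EXEC sp_rename 'mytable.string_value_new', 'string_value', 'COLUMN';
EXEC sp_rename 'mytable.int_value_new', 'int_value', 'COLUMN';

-- 4. Optionally drop the default afterwards
ALTER TABLE mytable DROP CONSTRAINT DF_mytable_int_value_new;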

SQL Command to copy data from 1 column in table and 1 column in another table into a new table?

I had to make a new table to get the Include statement working in Entity Framework since EF was looking for a table called be_PostTagbe_Posts. I was using EF Code First from DB. But now the question is about SQL. I added one row of data and now the include works. But what I am looking for is a SQL command that can copy data from 1 column in 1 table and 1 column in another into the new be_PostTagbe_Posts table. In the be_Posts table I need the data in PostRowID to go into be_Posts_PostRowID and PostTagId to go into be_PostTag_PostTagID. Both be_PostTag_PostTagID and be_Posts_PostRowID are in the new be_PostTagbe_Posts table. I am not very good with SQL so not sure how to do this.
Edit: Thanks for the answers. I tried 2 separate queries, but data was only inserted into be_PostTag_PostTagID while be_Posts_PostRowID remained null.
And I tried this query, which returned: The multi-part identifier "be_PostTag.PostID" could not be bound.
INSERT INTO be_PostTagbe_Posts(be_PostTag_PostTagID, be_Posts_PostRowID)
SELECT be_PostTag.PostTagID, be_Posts.PostRowID
WHERE be_PostTag.PostID = be_Posts.PostID
EDIT:
This only inserted half the data - even 2 inserts leave one column null
INSERT INTO be_PostTagbe_Posts (be_Posts_PostRowID)
SELECT PostRowID FROM be_Posts;
INSERT INTO be_PostTagbe_Posts (be_PostTag_PostTagID)
SELECT PostTagID FROM be_PostTag;
And yet management studio tells me the query executed successfully but one column is still null. Weird.
Here are screenshots of the tables:
SELECT PostTagID AS be_PostTag_PostTagID, PostRowID AS be_Posts_PostRowID
INTO be_PostTagbe_Posts
FROM be_PostTag
INNER JOIN be_Posts
ON be_PostTag.PostID = be_Posts.PostID
That command created the new table with the 2 columns populated.
If I understand you, you want to copy Table Z's Column A into Table X and Table Z's Column B into Table Y.
If so, your question is not clear about the table structure of TableX and TableY.
Assuming TableX and TableY are single-column tables (apart from the identity column), the queries would be:
INSERT INTO TableX
SELECT ColumnA FROM TableZ
INSERT INTO TableY
SELECT ColumnB FROM TableZ
Otherwise, post the entire structure of your tables to get more help, because these queries are based on assumptions.
There's not enough information in your question to give you a working example, but this would be the general syntax for INSERTing into a different table using a query SELECTing from two other tables.
INSERT INTO destination_table (wanted_value_1, wanted_value_2)
SELECT table_1.source_field_1, table_2.source_field_1
FROM table_1, table_2
WHERE table_1.matching_field = table_2.matching_field
There has to be some sort of relationship between the two tables for the WHERE clause to work in that statement. I'm guessing, based on the little information you provided, that there is a PostRowID field somewhere in the table that contains the tags, such that your data would look similar to this in the tag table:
PostRowID PostTagID
--------- ---------
1 1
1 2
1 3
1 4
2 1
2 2
3 3
4 4
It sounds like you should use two SQL statements:
Insert into `be_PostTagbe_Posts` (`be_PostTag_PostTagID`)
select `PostTagID` from POSTTAGIDTABLE
and
Insert into `be_PostTagbe_Posts` (`be_Posts_PostRowID`)
select `PostRowID` from POSTTAGIDTABLE
Unless the items have some sort of relationship; in that case, if you have a select statement that selects the merged data in two columns, you can just do:
Insert into `be_PostTagbe_Posts` (`be_PostTag_PostTagID`,`be_Posts_PostRowID`)
(select statement that selects the two items)

Character width exceeded

I created a master table with column A, column B, and column C. Whenever I try to insert rows from another table using the command:
INSERT INTO MASTER
select * from Table B
I get the error message "Character Width exceeded". I am not sure why.
Maybe one or more column sizes in Table B are larger than the respective column sizes in MASTER.
For example, Table B's column1 might be VARCHAR(255) while MASTER's column A is defined as something shorter than 255.
Check whether the structure of table MASTER is exactly the same as that of table B.
It's probably a problem where MASTER.colx is a string(20) and B.colx is a string(25), or maybe a charset problem (unicode/latin-1/utf-8/iso-8859-x).
Consider creating the table MASTER as:
select *
into master
from tableB
where 1=0
This will guarantee that the column datatypes are the same between the two tables. Then you can try your insert again.

SQL query select from table and group on other column

I'm phrasing the question title poorly as I'm not sure what to call what I'm trying to do but it really should be simple.
I've a link / join table with two ID columns. I want to run a check before saving new rows to the table.
The user can save attributes through a webpage, but I need to check that the same combination doesn't already exist before saving it. With one record it's easy: you just check whether that attributeId is already in the table, and if it is, don't allow them to save it again.
However, if the user chooses a combination of that attribute and another one then they should be allowed to save it.
Here's an image of what I mean:
So if a user now tried to save an attribute with an ID of 1, it will stop them, but I need it to also stop them if they tried IDs of 1 and 10, so long as both 1 and 10 have the same productAttributeId.
I'm confusing this in my explanation but I'm hoping the image will clarify what I need to do.
This should be simple so I presume I'm missing something.
If I understand the question properly, you want to prevent the combination of AttributeId and ProductAttributeId from being reused. If that's the case, simply make them a combined primary key, which is by nature UNIQUE.
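For example, assuming the join table is called MyJoinTable as in the snippet below and both ID columns are NOT NULL (use a UNIQUE constraint instead if they aren't):
-- Combined primary key: the same (AttributeId, ProductAttributeId) pair can only exist once
ALTER TABLE MyJoinTable
ADD CONSTRAINT PK_MyJoinTable PRIMARY KEY (AttributeId, ProductAttributeId);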
If that's not feasible, create a stored procedure that runs a query against the join table for instances of the AttributeId. If the query returns 0 instances, insert the row.
Here's some light code to present the idea (may need to be modified to work with your database):
DECLARE @matches INT;
SELECT @matches = COUNT(1) FROM MyJoinTable WHERE AttributeId = @RequestedId;
IF @matches = 0
BEGIN
INSERT INTO MyJoinTable ...
END
You can control your inserts via a stored procedure. My understanding is that
users can select a combination of Attributes, such as
just 1
1 and 10 together
1,4,5,10 (4 attributes)
These need to enter the table as a single "batch" against a (new?) productAttributeId
So if (1,10) was chosen, this needs to be blocked because 1-2 and 10-2 already exist.
What I suggest
The stored procedure should take the attributes as a single list, e.g. '1,2,3' (comma separated, no spaces, just integers)
You can then use a string splitting UDF or an inline XML trick (as shown below) to break it into rows of a derived table.
Test table
create table attrib (attributeid int, productattributeid int)
insert attrib select 1,1
insert attrib select 1,2
insert attrib select 10,2
Here I use a variable, but you can make it an input parameter of the SP.
declare @t nvarchar(max); set @t = '1,2,10'
select top(1)
t.productattributeid,
count(t.productattributeid) count_attrib,
count(*) over () count_input
from (select convert(xml,'<a>' + replace(@t,',','</a><a>') + '</a>') x) x
cross apply x.x.nodes('a') n(c)
cross apply (select n.c.value('.','int')) a(attributeid)
left join attrib t on t.attributeid = a.attributeid
group by t.productattributeid
order by count_attrib desc
Output
productattributeid count_attrib count_input
2 2 3
The 1st column gives you the productattributeid that has the most matches
The 2nd column gives you how many attributes were matched using the same productattributeid
The 3rd column is how many attributes exist in the input
If you compare the last 2 columns:
match - you can use the productattributeid to attach to the product which has all these attributes
don't match - then you need to do an insert to create a new combination
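A hedged sketch of the "don't match" branch, reusing @t from above; the max + 1 allocation is only for illustration and is not safe under concurrent inserts:
-- Allocate a new productattributeid and attach every attribute in the input list to it
declare @newpid int = (select isnull(max(productattributeid), 0) + 1 from attrib)

insert attrib (attributeid, productattributeid)
select a.attributeid, @newpid
from (select convert(xml,'<a>' + replace(@t,',','</a><a>') + '</a>') x) x
cross apply x.x.nodes('a') n(c)
cross apply (select n.c.value('.','int')) a(attributeid)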