A possible way to remove a BigQuery column - google-bigquery

I'm looking around for an approach to update an existing BigQuery table.
With the CLI I'm able to copy the table to a new one, and now I'm looking for an effective way to remove/rename a column.
It's said that it is not possible to remove a column. So is it possible, when copying table1 to table2, to exclude some columns?
Thanks,

You can do this by running a query that copies the old table to the new one. You should specify allowLargeResults:true and flattenSchema:false. The former allows you to have query results larger than 128MB, the latter prevents repeated fields from being flattened in the result.
You can write the results to the same table as the source table, but use writeDisposition:WRITE_TRUNCATE. This will atomically overwrite the table with the results. However, if you'd like to test out the query first, you could always write the results to a temporary table, then copy the temporary table over the old table when you're happy with it (using WRITE_TRUNCATE to atomically replace the table).
(Note: the flags I'm describing here are their names in the underlying API, but they have analogues in both the query options in the Web UI and the bq CLI.)
For example, if you have a table t1 with schema {a, b, c, d} and you want to drop field c and rename b to b2, you can run:
SELECT a, b as b2, d FROM t1
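As a rough illustration with the bq CLI, the in-place rewrite might look something like the following; mydataset is a made-up dataset name and the exact flag spellings are worth double-checking against your bq version:
bq query \
  --destination_table=mydataset.t1 \
  --replace \
  --allow_large_results \
  --noflatten_results \
  'SELECT a, b AS b2, d FROM [mydataset.t1]'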

Related

Create a table with a Boolean column generated based on other tables' column values?

I have tables A, B, C with millions of rows each. Tables B and C reference table A. The tables are mainly used for one query with multiple filters, but only one of those filters varies between queries. Since the constant parameters add significant time to the query execution, I was wondering if there is a way to precompute them into a new table. I was looking at materialized views, but the issue is that the computed type I want will be different from the original column type. To explain, I will give an example.
Let's say these tables represent a bookstore database. Table A contains general information and table B contains multiple codes for each book to indicate which categories it falls under, such as 406, 678, 252. I'm building a query to search for books that fall under only 3 of those categories. The variable here is the keyword search in the description of the book. I will always need books under those 3 categories (codes), so these are constants.
What I want to do is create a table with a column that tells me whether a given serial falls under those 3 codes or not. This can be done with a boolean type. I don't want to have to join these tables and filter for these 3 codes (and more in the real scenario) for every query. As I understand it, materialized views can't have generated fields?
What do you think is a good solution here?
You have multiple options.
Partial Index
PostgreSQL allows you to create an index with a where clause like so:
create index tableb_category on tableb (category)
where category in (406, 678, 252);
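Queries whose filter matches the index predicate can then pick up the partial index; a quick sanity check, reusing the assumed tableb/category names from above:
explain
select * from tableb
where category in (406, 678, 252);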
Create a view for those categories:
create view v_books_of_interest
as
select tablea.*, tableb.category
from tablea
inner join tableb
on tableb.bookid = tablea.bookid
and tableb.category in (406, 678, 252);
Now, your queries can use this v_books_of_interest view rather than the base tables. Frankly, I would start with this first. Query optimization with the right indexes goes a long way. Millions of rows in multiple tables are manageable.
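For example, the only variable part of each query would then be the keyword filter; description is an assumed column name on tablea here:
select *
from v_books_of_interest
where description ilike '%some keyword%';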
Materialized view
create materialized view mv_books_of_interest
as
select tablea.*, tableb.category
from tablea
inner join tableb
on tableb.bookid = tablea.bookid
and tableb.category in (406, 678, 252)
with no data;
Periodically, run a cron job (or the like) to refresh it:
refresh materialized view mv_books_of_interest;
Partitioning data
https://www.postgresql.org/docs/9.3/ddl-partitioning.html will get you started. If your team is on-board with table inheritance, great. Give it a shot and see how that works for your use case.
Trigger
Create a field is_interesting in tableA (or tableB, depending on how you want to access the data). Create a trigger that checks for certain criteria when data is inserted into the dependent tables and sets the book's flag to true/false, as sketched below. That will allow your queries to run faster but could slow down your inserts and updates.
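A minimal sketch of that trigger idea, assuming tablea(bookid, ...) carries the is_interesting flag and tableb(bookid, category) is the dependency being watched (names and codes as in the earlier examples):
alter table tablea add column is_interesting boolean not null default false;

create or replace function flag_interesting() returns trigger as $$
begin
  -- mark the book as interesting when a row with one of the target codes arrives
  if new.category in (406, 678, 252) then
    update tablea set is_interesting = true where bookid = new.bookid;
  end if;
  return new;
end;
$$ language plpgsql;

create trigger trg_flag_interesting
after insert or update on tableb
for each row execute procedure flag_interesting();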

Is it possible to update one table from another table without a join in Vertica?

I have two tables A(i,j,k) and B(m,n).
I want to update the 'm' column of table B by taking sum(j) from table A. Is it possible to do this in Vertica?
The following code can be used in Teradata, but does Vertica have this kind of flexibility?
Update B from (select sum(j) as m from A)a1 set m=a1.m;
The Teradata SQL syntax won't work with Vertica, but the following query should do the same thing:
update B set m = (select sum(j) from A)
Depending on the size of your tables, this may not be an efficient way to update data. Vertica is a WORM (write once, read many times) store, and is not optimized for updates or deletes.
An alternative would be to first move the data in the target table to another intermediate (but not temporary) table, then write a join query using the other table to produce the desired result, and finally use export table with that join query before dropping the intermediate table. Of course, this assumes you have partitioned your table in a way suitable for your update logic.
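A rough sketch of that move-aside idea, using a plain CTAS and INSERT...SELECT rather than an export, and assuming the A(i, j, k) / B(m, n) schema from the question:
create table B_staging as select * from B;   -- park the current rows
truncate table B;
insert into B (m, n)
select a1.m, s.n
from B_staging s
cross join (select sum(j) as m from A) a1;
drop table B_staging;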

Renaming two columns or swapping the values? Which one is better?

I have a table with more than 1.5 million records, in which I have two columns, A and B. Mistakenly, the values of column A got inserted into column B, and column B's values got inserted into column A.
We only found the issue recently. What is the best option to correct it? Renaming the columns to swap them (I don't know how that would even be possible, since if we rename A to B, B already exists), or swapping the values contained in the two columns?
Hi, you can use the query below to swap the column values:
UPDATE table_name SET A = B, B = A;
You do have a huge amount of data, so in this case renaming looks attractive, but renaming column names because of a data issue is not the right solution. So use the above update query to fix your data.
Before updating, take a backup of the table you are updating, using a query like:
CREATE TABLE table_name_bkp AS SELECT * FROM table_name;
Always have a backup when playing with original data, so that you don't mess it up.
15 lakh rows aren't a big deal for SQL Server. Switching column names has many cons in a relational DB, such as indexes and foreign keys, and you may have to deal with lots of other impacts. So I would suggest going the traditional path: simply do the update.

Hive to Hive ETL

I have two large Hive tables, say TableA and TableB (which get loaded from different sources).
These two tables have almost identical structure/columns, with the same partition column, a date stored as a string.
I need to filter records from each table based on certain (identical) filter criteria.
These tables have some columns containing "codes", which need to be looked up to get its corresponding "values".
There are eight to ten such lookup tables, say LookupA, LookupB, LookupC, etc.
Now, I need to:
do a union of those filtered records from TableA and TableB.
do a lookup into the lookup tables and replace those "codes" in the filtered records with their respective "values"; if a "code" or "value" is unavailable in the filtered records or the lookup table respectively, I need to substitute it with zero or an empty string
transform the dates in the filtered records from one format to another
I am a beginner in Hive. Please let me know how I can do it. Thanks.
Note: I can manage till union of the tables. Need some guidance on lookup and transformation.
To do a lookup, please follow the steps below.
You have to create a custom User Defined Function (UDF) that does the lookup work, meaning you have to write a Java program for the lookup, jar it, and add it to Hive, something like below:
ADD JAR /home/ubuntu/lookup.jar
You then have to add the lookup file containing the key-value pairs, as follows:
ADD FILE /home/ubuntu/lookupA;
You then have to create a temporary lookup function such as
CREATE TEMPORARY FUNCTION getLookupValueA AS 'com.LookupA';
Finally, call this lookup function in the SELECT query, which will populate the lookup value for the given lookup key.
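For example (illustrative names only: filtered_union and codeA stand in for the unioned, filtered result and its code column):
SELECT getLookupValueA(codeA) AS valueA
FROM filtered_union;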
The same thing can be achieved using a JOIN, but that will take a hit on performance.
Taking the join approach, you can join the source and lookup tables on the lookup code, something like:
select a.key, b.lookupvalue
from source_table a join lookuptable b
on a.key = b.lookupkey;
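If a code has no match in the lookup table (the zero / empty-string case in the question), a LEFT OUTER JOIN plus COALESCE covers it; the table and column names below are assumptions:
select f.codeA,
       coalesce(l.lookupvalue, '') as valueA   -- empty string when the code has no match
from filtered_union f
left outer join lookupA l
  on f.codeA = l.lookupkey;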
Now, for the date transformation, you can use the date functions in Hive.
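For instance, if the source dates are strings like '01/31/2020' and the target format is 'yyyy-MM-dd' (both formats and the column name are assumptions here), a combination of unix_timestamp and from_unixtime does the conversion:
SELECT from_unixtime(unix_timestamp(record_date, 'MM/dd/yyyy'), 'yyyy-MM-dd') AS record_date
FROM filtered_union;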
For the above problem, follow these steps:
Use a union to combine the two tables (the schemas must be the same).
For the above scenario you can try a Pig script.
The script would look like this (join TableA and TableB with the lookup tables and generate the appropriate columns):
a = join TableA by codesA left outer, LookupA by codesA;
b = join a by codesB left outer, LookupB by codesB;
Similarly for TableB.
Suppose some value of codesA does not have a match in the lookup table; then:
z = foreach b generate codesA as codesA, (valueA is null ? '0' : valueA) as valuesA;
(this will replace all null values of valueA with '0').
If you are using Pig 0.12 or later, you can use ToString(CurrentTime(),'yyyy-MM-dd')
I hope this solves your problem. Let me know if you have any concerns.

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify which columns I want to update. Unfortunately I am lazy, and the structure of people may change over time; while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work.
Is there a way to merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT will direct SQL to insert values by order; can the same methodology be applied here, and is it safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM people t
WHERE EXISTS( SELECT 1
              FROM people_backup b
              WHERE t.id = b.id );
INSERT INTO people
SELECT *
FROM people_backup;
That is slow and not particularly elegant (particularly if most of the data from the backup hasn't changed), but assuming the columns in the two tables match, it does allow you to not list out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.
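For completeness, the explicit-column MERGE that the answer prefers would look roughly like this; id comes from the question, while first_name and last_name are made-up example columns:
MERGE INTO people p
USING people_backup b
   ON (p.id = b.id)
 WHEN MATCHED THEN UPDATE
   SET p.first_name = b.first_name,
       p.last_name  = b.last_name
 WHEN NOT MATCHED THEN
   INSERT (id, first_name, last_name)
   VALUES (b.id, b.first_name, b.last_name);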