How do I continuously merge two datasets with the same schema together without duplicating rows? - sql

On Google Cloud Platform, I have multiple billing accounts. For each billing account, I created a scheduled export to BigQuery that executes multiple times a day.
However, I'd like to have an overview of all of my billing accounts. I want to create a master data table with all of my billing accounts combined.
All of the data tables have the exact same schema. Some sample fields:
cost:FLOAT
sku:STRING
service:STRING
I have already successfully combined my two data tables with a UNION ALL query:
SELECT * FROM `TABLE 1`
UNION ALL
SELECT * FROM `TABLE 2`
After I've made this query, I clicked "Save results" --> "BigQuery Table." However, I believe this is just a one-time export.
I'd like to update this on a regular basis (say, once every 3 hours) without duplicating the entries.
How do I continuously combine these data tables while making sure I don't have duplicate rows? In other words, for new entries that come into both tables, how do I just append only those new entries to my new master table?

Use a view:
create view v_t as
select * from `TABLE 1`
union all
select * from `TABLE 2`;
This will always be up-to-date, because the tables are referenced when you query them.
Note: You can create the view in the BigQuery query interface by running the query and selecting "create view". You don't actually need to run the query first, but I always do, just to be sure.
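If you do want a materialized master table refreshed on a schedule rather than a view, the usual pattern is to append only the rows that are not already present, via an anti-join. Here is a minimal sketch using SQLite as a local stand-in for BigQuery (the table names `table1`, `table2`, `master` and the sample rows are illustrative; in BigQuery itself you would typically express this as a scheduled query or a MERGE statement):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two per-account billing exports plus a combined master table (illustrative schema).
for t in ("table1", "table2", "master"):
    cur.execute(f"CREATE TABLE {t} (cost REAL, sku TEXT, service TEXT)")

cur.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                [(1.5, "sku-a", "Compute"), (2.0, "sku-b", "Storage")])
cur.executemany("INSERT INTO table2 VALUES (?, ?, ?)",
                [(3.0, "sku-c", "Network")])

def refresh_master():
    # Append only rows not already in master: UNION ALL the sources,
    # then anti-join against the master table with NOT EXISTS.
    cur.execute("""
        INSERT INTO master (cost, sku, service)
        SELECT cost, sku, service FROM (
            SELECT * FROM table1
            UNION ALL
            SELECT * FROM table2
        ) AS src
        WHERE NOT EXISTS (
            SELECT 1 FROM master m
            WHERE m.cost = src.cost AND m.sku = src.sku AND m.service = src.service
        )
    """)
    conn.commit()

refresh_master()
refresh_master()  # running again appends nothing: no duplicates
print(cur.execute("SELECT COUNT(*) FROM master").fetchone()[0])  # 3
```

Note that this compares whole rows; if two genuinely distinct billing entries can be identical on every column, you would need an export-provided unique key instead.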

Related

Google BigQuery: Why does a table resulting from a union of different tables not show any values in preview mode, only when running the query?

I have created the following table, called GDN All accounts, which resulted from the following query:
SELECT * FROM `GDNA`
UNION ALL
SELECT * FROM `GDNB`
UNION ALL
SELECT * FROM `GDNC`
UNION ALL
SELECT * FROM `GDND`
UNION ALL
SELECT * FROM `GDNE`
However, once I open the table in preview mode it does not show any values; it only does when I re-run the query.
Moreover, my final aim is to connect this table to Power BI, but once in Power BI and connected to the data source, no values show up, only nulls.
Can someone help me with this?
Thanks
Connect to and collect data separately from each table. Once this is done, first check whether all the tables contain the data you expect.
If all the tables contain the expected data, you can create a new table using the Append option in Power Query. This new table will contain all the data together, as you expect.
Remember, preview mode does not always show all the data when there is a large amount of data in the source. You will get the complete list in a table visual in the report.

How to delete customer information from hdfs

Suppose I have several customers today, so I am storing their information: customer_id, customer_name, customer_emailid, etc. If one of my customers leaves, he will want his personal information removed from my HDFS.
So I have below two approaches to achieve the same.
Approach 1:
1. Create an internal table on top of HDFS.
2. Create an external table from the first table using filter logic.
3. While creating the second table, apply UDFs on specific columns for further filtering.
Approach 2:
Spark=> Read, filter, write
Is there any other solution?
Approach 2 is also possible in Hive - read, filter, write.
Create a table on top of the directory in HDFS (external or managed does not matter in this context; external is better if you are going to drop the table later and keep the data as is). Then INSERT OVERWRITE the table or partition from a SELECT with a filter:
insert overwrite table mytable
select *
from mytable                   -- the same table
where customer_id not in (...) -- filter rows
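The overwrite-with-filter idea can be sketched locally with SQLite standing in for Hive (SQLite has no INSERT OVERWRITE, so the rewrite is done by creating a filtered copy and swapping it in; the table, columns, and customer data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers "
            "(customer_id INTEGER, customer_name TEXT, customer_emailid TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Alice", "alice@example.com"),
    (2, "Bob", "bob@example.com"),
    (3, "Carol", "carol@example.com"),
])

def forget_customers(ids):
    # Equivalent of Hive's INSERT OVERWRITE ... SELECT ... WHERE customer_id NOT IN (...):
    # rewrite the table, keeping only the rows that survive the filter.
    placeholders = ",".join("?" for _ in ids)
    cur.execute(f"CREATE TABLE customers_clean AS "
                f"SELECT * FROM customers WHERE customer_id NOT IN ({placeholders})", ids)
    cur.execute("DROP TABLE customers")
    cur.execute("ALTER TABLE customers_clean RENAME TO customers")
    conn.commit()

forget_customers([2])
print(cur.execute("SELECT customer_id FROM customers ORDER BY customer_id").fetchall())
# [(1,), (3,)]
```

In Hive the overwrite handles the swap for you; the point in both cases is that the "delete" is really a full rewrite of the table (or partition) minus the filtered rows.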

Why isn't "union all" doing what I expect?

I created 2 summary tables from the same source data for different date ranges.
Now that I have these multiple summary tables, I want to put those tables together
so that I will be able to run a summary on the combined table.
It's creating the summary table that is presenting the problem.
scratch.table_1 has 809,598 records.
scratch.table_2 has 1,228,176 records.
They both have the same set of fields from the source table,
plus a "record_number" field I created on each table using count(1).
The code I used to put these two tables together was:
create table scratch.table_1_and_2
select * from scratch.table_1
union all
select * from scratch.table_2
I assumed that there would be 809,598 + 1,228,176 records in the new table (2,037,774 records).
But there are only 1,960,769 records in the new table.
What am I doing wrong?
One way to troubleshoot would be to identify some of the missing records and see what might be different about the data in those that would cause them to be left out. A UNION ALL should include duplicate records so duplicates shouldn't be the issue. Maybe there is some data issue that's causing those records to be dropped. Also I'm assuming there isn't any funny business with Views going on in the underlying tables and that no data loads are affecting your record counts.
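The answer's premise is easy to verify: UNION ALL never drops rows, while plain UNION deduplicates. A small SQLite check with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (x INTEGER)")
cur.execute("CREATE TABLE t2 (x INTEGER)")
cur.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (2,)])
cur.executemany("INSERT INTO t2 VALUES (?)", [(2,), (3,)])

# UNION ALL: row counts add up exactly, duplicates included.
union_all = cur.execute(
    "SELECT COUNT(*) FROM (SELECT x FROM t1 UNION ALL SELECT x FROM t2)").fetchone()[0]
# UNION: duplicates collapsed, so the count can be smaller.
union = cur.execute(
    "SELECT COUNT(*) FROM (SELECT x FROM t1 UNION SELECT x FROM t2)").fetchone()[0]
print(union_all, union)  # 5 3
```

So if a CREATE TABLE ... UNION ALL produced fewer rows than the sum of the inputs, the shortfall came from somewhere else (an accidental plain UNION, different underlying data at the time the table was built, or a load in progress), not from UNION ALL itself.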

Update 200 tables in database

I have two databases with a couple hundred tables in them each, in SQL Server. The tables in the two databases are 90% the same, with about 20 different tables in each. I'm working on a stored procedure to update database2 with the data from the tables it shares in database1.
I'm thinking of truncating the tables and then inserting the records from the tables in the other database, like:
truncate table database2.dbo.table2

select *
into database2.dbo.table2
from database1.dbo.table1
Is this the best way to do this, and is there a better way to do it than writing a couple hundred of these statements?
This may give an error, because the table already exists in the database (your TRUNCATE command implies it does), and the given query creates a new table:
select *
into database2.dbo.table2 -- creates a new table
from database1.dbo.table1
If you want the same table structure and data, you should generate scripts for the schema and data and run those scripts on the other database (DB2):
Right-click the database and select Tasks --> Generate Scripts.
Next --> select the required table/tables.
Next --> click Advanced --> set "Types of data to script" to "Schema and data".
Also, change "Check for Existence" to True.
Next, and Finish.
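Rather than hand-writing a couple hundred statements, the statements can also be generated from metadata. In SQL Server that would mean reading the shared table names from sys.tables or INFORMATION_SCHEMA.TABLES; here is the same idea sketched with two attached SQLite databases (the database aliases, tables, and rows are illustrative):

```python
import sqlite3

# Two in-memory databases standing in for database1 (source) and database2 (target).
conn = sqlite3.connect(":memory:")        # plays database2, schema "main"
conn.execute("ATTACH ':memory:' AS db1")  # plays database1

conn.execute("CREATE TABLE db1.orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE main.orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE db1.only_in_db1 (id INTEGER)")
conn.executemany("INSERT INTO db1.orders VALUES (?, ?)", [(1, 9.5), (2, 3.25)])
conn.execute("INSERT INTO main.orders VALUES (99, 0.0)")  # stale row to be replaced

def shared_tables(conn):
    # The 90% of tables the two databases have in common, read from metadata.
    src = {r[0] for r in conn.execute(
        "SELECT name FROM db1.sqlite_master WHERE type='table'")}
    dst = {r[0] for r in conn.execute(
        "SELECT name FROM main.sqlite_master WHERE type='table'")}
    return sorted(src & dst)

# Generate and run truncate+insert for every table the two databases share.
for t in shared_tables(conn):
    conn.execute(f"DELETE FROM main.{t}")  # SQLite's equivalent of TRUNCATE
    conn.execute(f"INSERT INTO main.{t} SELECT * FROM db1.{t}")
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM main.orders").fetchone()[0])  # 2
```

In a stored procedure the same loop would be a cursor over the metadata view emitting TRUNCATE TABLE and INSERT ... SELECT per table; foreign keys and identity columns would need extra handling that this sketch ignores.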

bigquery dataset design, multiple vs single tables for storing the same type of data

I'm planning to build a new ads system, and we are considering using Google BigQuery.
I'll quickly describe my data flow:
Each user will be able to create multiple ads (1 user, N ads).
I would like to store the ad impressions, and I thought of 2 options.
Option 1: create one table for impressions, for example a table named Impressions with fields (userid, adsid, datetime, metadata fields...).
With this option, all of my impressions will be stored in a single table.
Main pros: I'll be able to run big-data queries quite easily.
Main cons: the table will be huge, and with multiple queries I'll end up paying too much (:
Option 2 is to create a table per ad.
For example, ad id 1 will create Impression_1 with fields (datetime, metadata fields).
Pros: queries are cheaper, and each data table is smaller.
Cons: to do a big data query I'll sometimes have to create a union, and things will get complex.
I wonder what your thoughts are regarding this?
In BigQuery it's easy to do this, because you can create a table per day, and you have the possibility to query only those tables.
And you have Table wildcard functions, which are a cost-effective way to query data from a specific set of tables. When you use a table wildcard function, BigQuery only accesses and charges you for tables that match the wildcard. Table wildcard functions are specified in the query's FROM clause.
Assuming you have some tables like:
mydata.people20140325
mydata.people20140326
mydata.people20140327
You can query like:
SELECT
name
FROM
(TABLE_DATE_RANGE([mydata.people],
TIMESTAMP('2014-03-25'),
TIMESTAMP('2014-03-27')))
WHERE
age >= 35
Also there are Table Decorators:
Table decorators support relative and absolute <time> values. Relative values are indicated by a negative number, and absolute values are indicated by a positive number.
To get a snapshot of the table at one hour ago:
SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle#-3600000]
There is also TABLE_QUERY, which you can use for more complex queries.
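TABLE_DATE_RANGE is essentially a UNION ALL generated from the table-name suffix. The mechanics can be sketched with SQLite (the `people` prefix, the YYYYMMDD suffix convention, and the rows are illustrative; in BigQuery you would use the wildcard functions above rather than building the SQL yourself):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
for d, name in [(date(2014, 3, 25), "Ann"), (date(2014, 3, 26), "Bob"),
                (date(2014, 3, 27), "Cat"), (date(2014, 3, 28), "Dan")]:
    table = f"people{d:%Y%m%d}"  # one table per day, e.g. people20140325
    conn.execute(f"CREATE TABLE {table} (name TEXT, age INTEGER)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (name, 40))

def query_date_range(conn, prefix, start, end):
    # Pick out the tables whose YYYYMMDD suffix falls inside [start, end]
    # and stitch them together with UNION ALL, like TABLE_DATE_RANGE does.
    days = [start + timedelta(n) for n in range((end - start).days + 1)]
    wanted = {f"{prefix}{d:%Y%m%d}" for d in days}
    existing = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    parts = [f"SELECT name FROM {t}" for t in sorted(wanted & existing)]
    return [r[0] for r in conn.execute(" UNION ALL ".join(parts))]

print(query_date_range(conn, "people", date(2014, 3, 25), date(2014, 3, 27)))
# ['Ann', 'Bob', 'Cat']
```

Only the tables inside the range are touched, which is exactly why the wildcard functions are cost-effective: BigQuery charges you only for the per-day tables the range selects.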