Append one table to another - google-bigquery

Is there currently a way to append data from one table to another via the API and PHP?
For instance:
I have two tables:
today
all_time
At the end of every day I want to append today into all_time; both tables use the same schema.

It's possible: you just need to pass in the async query configuration writeDisposition=WRITE_APPEND and set up the destination table.
Read about writeDisposition here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#resource
Alternatively, you can write the results of a query directly to a table in query mode, using the Destination Table option available under Show Options.
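If you'd rather express the append in SQL instead of job configuration, BigQuery DML gives an equivalent result. This is only a minimal sketch, assuming the dataset is named mydataset:
-- Append today's rows to the running table (both tables share the same schema).
INSERT INTO `mydataset.all_time`
SELECT * FROM `mydataset.today`;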

Related

Google BigQuery list tables

I need to list all the tables in my BigQuery dataset, but I don't know how to do it; I tried searching but didn't find anything about it.
I need to know whether a table exists: if it does, I search for the record; if not, I create the table and insert the record.
Depending on where/how you want to do this, you can use the CLI, API calls, or the client libraries. Here you have all the info about listing tables.
As an example, if you want to list them using the command-line interface, you can do it like this:
bq ls <project>:<dataset>
If you want to use normal SQL queries, you can use the INFORMATION_SCHEMA beta feature:
SELECT table_name FROM `<project>.<dataset>.INFORMATION_SCHEMA.TABLES`
(project is optional)
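If the goal is just to check whether a particular table exists before creating it, the same INFORMATION_SCHEMA view can answer that too; a small sketch, with my_table as a placeholder name:
-- Returns true if the table already exists in the dataset.
SELECT COUNT(*) > 0 AS table_exists
FROM `<project>.<dataset>.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'my_table';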

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the approaches I have seen so far for merging multiple ORC files suggest using ALTER TABLE with the CONCATENATE command.
But I need to merge multiple ORC files of the same table without ALTERing the table. Another option would be to create a copy of the existing table and then use ALTER TABLE on that, so that my original table remains unchanged. But I can't do that either, for space and data-redundancy reasons.
What I'm (ideally) trying to achieve is this: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way to merge the ORCs on the fly during the transfer to the cloud? Can this be achieved with or without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure the merge task; see details here: https://stackoverflow.com/a/45266244/2700344
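The linked answer comes down to enabling Hive's merge step on the job output; the relevant settings look roughly like this (a hedged sketch with illustrative thresholds you would tune for your cluster):
SET hive.merge.mapfiles=true;      -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles=true;   -- merge small files produced by map-reduce jobs
SET hive.merge.tezfiles=true;      -- the same, when running on Tez
SET hive.merge.smallfiles.avgsize=128000000; -- trigger a merge when average output file size is below this
SET hive.merge.size.per.task=256000000;      -- target size of the merged files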
Alternatively, you can force a single reducer. This method works well for files that are not too big. You can overwrite the same table with an ORDER BY; this forces a single reducer in the final ORDER BY stage. It will be slow, or may even fail, with big files because all the data is passed through a single reducer:
INSERT OVERWRITE TABLE your_table
SELECT * FROM your_table
ORDER BY some_col; -- this forces a single reducer
As a side effect, you will get a better-packed ORC file with an efficient index on the columns listed in the ORDER BY.

Create a partitioned table in BigQuery

Can anyone please suggest how to create a partitioned table in BigQuery?
Example: suppose I have log data in Google Cloud Storage for the year 2016, stored in one bucket and organized by year, month, and date. I want to create a table partitioned by date.
Thanks in advance.
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (filtering the data for a specific date) and writes to the corresponding partition of the table. For example, to load the data for May 1st, 2016, you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this. Please note that you'll be charged for each query job based on bytes processed.
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
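For reference, the partitioned table itself can also be created up front with Standard SQL DDL. This is only a sketch using a DATE partitioning column, with placeholder dataset, table, and column names (the table$20160501 decorator mentioned above addresses an individual daily partition):
-- Create a table partitioned by the log date.
CREATE TABLE `mydataset.YourLogs` (
  event_timestamp TIMESTAMP,
  payload STRING,
  log_date DATE
)
PARTITION BY log_date;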
There are two options:
Option 1
You can load each daily file into a separate table named YourLogs_YYYYMMDD.
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them using either table wildcard functions (Legacy SQL) or a wildcard table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples.
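As a sketch of how Option 1 is queried (Standard SQL; mydataset and the table prefix are placeholders), a wildcard table plus a _TABLE_SUFFIX filter covers the per-day tables:
-- Query all daily tables for January 2016 at once.
SELECT *
FROM `mydataset.YourLogs_*`
WHERE _TABLE_SUFFIX BETWEEN '20160101' AND '20160131';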
Option 2
You can create a date-partitioned table (just one table, YourLogs), but you will still need to load each daily file into its respective partition; see Creating and Updating Date-Partitioned Tables.
After the table is loaded you can easily Query Date-Partitioned Tables.
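As a sketch of Option 2 (assuming the column-partitioned layout from the DDL sketch above; for an ingestion-time partitioned table you would filter on _PARTITIONTIME instead), a query can then prune to a single day's partition:
-- Read only the partition for May 1st, 2016.
SELECT *
FROM `mydataset.YourLogs`
WHERE log_date = '2016-05-01';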
Having partitions on an external table is not allowed as of now. There is a feature request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.

Importing Excel data using SSIS with unique IDs

I have one Excel file that I want to import into two different tables, tblUni and tblUser.
I have a third table which contains the IDs from the other two tables:
tblUni_Students
Id
UniId
StudentId
What I need is that when I import the Excel data into the first two tables, for each record the newly created IDs are also inserted into the tblUni_Students table.
Using SSIS, I have managed to import the data into the two SQL destinations, but I cannot seem to take the new IDs from those destinations and insert them into the lookup table.
Can anyone advise, please? Thanks.
It's a bit difficult to answer without knowing the target database or the structure of the data, but generally speaking this would be much better done by adding the data to a "load" table, i.e. one whose sole purpose is to temporarily hold data while you process it. You would then update the tblUser, tblUni and tblUni_Students tables from the load area using SQL statements, either via a stored procedure or via an Execute SQL Task component.
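A hedged T-SQL sketch of that staging pattern (all table and column names beyond those in the question are assumptions, since the real schema isn't shown): load the spreadsheet into a staging table, insert into the two parent tables while capturing the generated identities with OUTPUT, and then join back to fill tblUni_Students.
-- Assumed staging table filled by SSIS: LoadStudents(UniName, StudentName)
DECLARE @NewUnis  TABLE (UniId INT, UniName NVARCHAR(200));
DECLARE @NewUsers TABLE (StudentId INT, StudentName NVARCHAR(200));
-- Insert universities and capture their generated identity values.
INSERT INTO tblUni (UniName)
OUTPUT inserted.Id, inserted.UniName INTO @NewUnis (UniId, UniName)
SELECT DISTINCT UniName FROM LoadStudents;
-- Insert students and capture their generated identity values.
INSERT INTO tblUser (StudentName)
OUTPUT inserted.Id, inserted.StudentName INTO @NewUsers (StudentId, StudentName)
SELECT StudentName FROM LoadStudents;
-- Link the generated IDs via the staged rows.
INSERT INTO tblUni_Students (UniId, StudentId)
SELECT nu.UniId, ns.StudentId
FROM LoadStudents ls
JOIN @NewUnis  nu ON nu.UniName     = ls.UniName
JOIN @NewUsers ns ON ns.StudentName = ls.StudentName;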
You'd do it with an OLE DB Command component, where the command inserts the values into the table. In the same component you'd output the generated identity: assign it to a new column in the output, and now you have all your data plus the generated identity in the data flow.
This will be processed one row at a time, so it will be slow. Personally I'd put it in a staging table and do it as Ciarán described.

Using SSIS to move tables and parts of tables from one database to another

I am currently using a database that is poorly designed, with a slow pipeline, so I decided to copy a small portion of the database (15 tables) and only bring over parts of those tables; for example, I want to bring over only the rows that have a certain ID.
But this is not a one-time move: everything that gets added to the old database needs to be added to the new one on an hourly basis. My research has led me to SSIS, which may have a way of accomplishing this, but I have found no clear examples of how it is done, if in fact it is possible. Thanks in advance.
Yes, it is possible. You can schedule your SSIS package through SQL Server Agent to run on an hourly basis.
For each table, you can drag a Data Flow Task onto the control flow. Inside the DFT, you need to place an OLE DB Source component, a Lookup, a Data Conversion (if the types differ between the source and target tables), and an OLE DB Destination.
OLE DB Source: create a variable of type String and, in its expression, write the SQL query that fetches the data based on the ID. Use this variable as the query in the source component.
Lookup: use the destination table as the reference and match it against the source rows on the primary key column. It acts much like an inner join. After matching on the primary key, select the columns you need from the source.
OLE DB Destination: simply select your target table and map the columns from the Lookup's no-match output. If you need to update values from the source, take the Lookup's match output and run an update query against the target (for example via an OLE DB Command, or by staging the matched rows and using an Execute SQL Task).
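To make the Lookup logic concrete, here is a hedged T-SQL sketch of what the two outputs amount to (database, table, and column names are all assumptions): matched rows are updated, unmatched rows are inserted.
-- Rows that already exist in the target (Lookup match output): update them.
UPDATE tgt
SET    tgt.SomeColumn = src.SomeColumn
FROM   NewDb.dbo.MyTable AS tgt
JOIN   OldDb.dbo.MyTable AS src ON src.Id = tgt.Id
WHERE  src.CertainId = 42;  -- only the rows with the certain ID
-- Rows not yet in the target (Lookup no-match output): insert them.
INSERT INTO NewDb.dbo.MyTable (Id, CertainId, SomeColumn)
SELECT src.Id, src.CertainId, src.SomeColumn
FROM   OldDb.dbo.MyTable AS src
WHERE  src.CertainId = 42
  AND NOT EXISTS (SELECT 1 FROM NewDb.dbo.MyTable t WHERE t.Id = src.Id);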
Please go through the link and the SO question below:
Scheduling of SSIS package