Adding a CSV with fewer columns than the schema to BigQuery - sql

I have a table in BigQuery with 100 columns. Now I want to append more rows to it via a Transfer, but the new CSV has only 99 columns. How should I proceed with this?
I tried creating a schema and adding that column as NULLABLE, but it didn't work.

I am presuming your CSV file is stored in a GCS bucket and that you are trying to use the BQ Data Transfer Service to load the data periodically on a schedule.
You cannot directly load/append the data into the BQ table because of the schema mismatch.
As an alternative, create a staging table named staging_table_csv with 99 columns and schedule a Data Transfer to load the CSV into this table in Overwrite mode.
Then write a query that appends the contents of this staging table, staging_table_csv, to the target BQ table.
The query might look like this:
#standardSQL
INSERT INTO `project.dataset.target_table`
SELECT
  *,
  <DEFAULT_VALUE> AS COL100
FROM
  `project.dataset.staging_table_csv`
Now schedule this query to run after the staging table is loaded.
Make sure to keep a buffer between the staging table load and the target table load. You can run a few trials to find a suitable buffer.
For example: if the Transfer is scheduled at 12:00, schedule the target table load query at 12:05 or 12:10.
Note: Creating an extra staging table will incur storage costs, but since it is overwritten on each load, no cost for historical data accrues.
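If the missing column is not the last one in the target schema, a safer variant lists the target columns explicitly instead of relying on SELECT * (the column names below are illustrative placeholders):
#standardSQL
INSERT INTO `project.dataset.target_table`
  (col1, col2, col99, col100)  -- list all 100 target columns explicitly
SELECT
  col1,
  col2,
  col99,                       -- the 99 columns present in the CSV
  NULL AS col100               -- or whatever default value you prefer
FROM
  `project.dataset.staging_table_csv`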

Related

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from a data lake into on-premises SQL, so I created a data flow to do the necessary data transformation and cleaning of the data.
After that, I sink all the final data to a staging data lake, stored in CSV format.
I am facing two kinds of issues here.
1. Whenever I trigger/debug the pipeline (the full data flow activity), the data is loaded into the CSV the first time; if I run the same pipeline a second time, the CSV file in the target data lake is loaded empty, meaning the column headers are written but I cannot see any values inside the file.
2. Coming to the copy activity, which is connected to the on-premises SQL Server: I am trying to load the data, but if I trigger this pipeline again and again, duplicate data is loaded. I want to load only incremental or updated data coming from the data lake CSV file. How do we handle this?
Kindly suggest.
When we want to incrementally load our data into a database table, we need to use the Upsert option in the Copy data tool.
Upsert helps you incrementally load the source data based on a key column (or columns). If the key column value is already present in the target table, it updates the rest of the column values; otherwise it inserts a new row with that key and the other values.
Look at the following demonstration to understand how upsert works. I used an Azure SQL database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source csv data (data I want to incrementally load):
I have taken an id which already exists in the target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create/select a dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which the upsert should happen.
Table after upsert using Copy data:
Now, after the upsert using Copy data, the id=1 row should be updated and the id=4 row inserted. The following is the final output achieved, which is in line with the expected output.
You can use the primary key of your target table (which is also present in your source CSV) as the key column in the Copy data sink configuration. Any other configuration (like a source filter by last-modified date) should not affect the process.
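Conceptually, the Upsert performed by the Copy activity on the key column is equivalent to a T-SQL MERGE like the one below (player_staging is only an illustrative name for where the CSV rows land):
-- Conceptual equivalent of the Copy data Upsert on key column "id".
-- player_staging is an illustrative name for the staged source rows.
MERGE INTO player AS tgt
USING player_staging AS src
    ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.gname = src.gname, tgt.team = src.team
WHEN NOT MATCHED THEN
    INSERT (id, gname, team) VALUES (src.id, src.gname, src.team);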

Best way to merge JSON blob files to SQL table using Azure Data Factory

I have a bunch of JSON files coming into Azure Data Lake Gen2; the JSON files contain new data as well as updates.
The data needs to be merged into a SQL table so I can start to do some reporting. The way I solved the problem was to create an Azure Data Factory pipeline that looks like this:
Create and copy to temp table:
First I use the Copy data activity to take the JSON, create a table from its schema, and dump the content into that table.
Create delivery table:
Creates a table with the right schema if it doesn't already exist.
Merge temp with delivery:
Here I use a merge clause to cast and merge the data from the table created in step 1 into the table from step 2 (sketched below).
Delete temp data:
Deletes the table from step 1
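The merge in step 3 looks roughly like this (the table names, columns, and casts here are illustrative placeholders for my actual schema):
-- Sketch of step 3: cast and merge the temp table into the delivery table.
-- Table and column names are placeholders; adjust the casts to your schema.
MERGE INTO dbo.Delivery AS tgt
USING (
    SELECT CAST(Id AS INT)                AS Id,
           CAST(Amount AS DECIMAL(18, 2)) AS Amount,
           CAST(UpdatedAt AS DATETIME2)   AS UpdatedAt
    FROM dbo.TempJson
) AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.Amount = src.Amount, tgt.UpdatedAt = src.UpdatedAt
WHEN NOT MATCHED THEN
    INSERT (Id, Amount, UpdatedAt) VALUES (src.Id, src.Amount, src.UpdatedAt);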
This data factory gets triggered each time there's a new file in the data lake.
The pipeline solves my problem but I feel like there's a lot of unnecessary overhead by creating and dropping a new table each time I process a file.
Is there a way to optimize this flow, maybe by merging the JSON directly to the "Delivery" table?
Thanks in advance

How does Hive create a table from a file present in HDFS?

I am new to HDFS and Hive. I got an introduction to both after reading some books and documentation. I have a question regarding the creation of a table in Hive for a file that is present in HDFS.
I have a file with 300 fields in HDFS. I want to create a table accessing this file in HDFS, but I only want to make use of, say, 30 fields from this file.
My questions are
1. Does hive create a separate file directory?
2. Do I have to create hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
My questions are
Does hive create a separate file directory?
YES, if you create a Hive table (managed or external) and load the data using the LOAD command.
NO, if you create an external table and point it at the existing file.
Do I have to create hive table first and import data from HDFS?
Not necessarily; you can create a Hive external table and point it at this existing file.
Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
You can do it easily using HiveQL. Follow the steps below (note: this is not the only approach):
1. Create an external table with 300 columns and point it at the existing file.
2. Create another Hive table with the desired 30 columns and insert data into this new table from the 300-column table using "insert into table30col select ... from table300col". Note: Hive will create the file with 30 columns during this insert operation.
Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Yes this can be an alternative.
I personally like the solution mentioned in question 3, as I don't have to recreate the file and I can do all of it in Hadoop without depending on some other system.
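A minimal HiveQL sketch of that approach (the HDFS location, delimiter, and column names are illustrative placeholders):
-- 1. External table over the existing 300-column file (no data is copied).
CREATE EXTERNAL TABLE table300col (
  col1 STRING,
  col2 STRING
  -- ... declare all 300 columns here ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/existing_file_dir';

-- 2. Table with only the 30 columns you need; Hive writes a new file
--    containing just these columns during the insert.
CREATE TABLE table30col (
  col1 STRING,
  col2 STRING
  -- ... the 30 desired columns ...
);

INSERT INTO TABLE table30col
SELECT
  col1,
  col2  -- ... list the 30 desired columns ...
FROM table300col;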
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match your existing table exactly. You must declare all 300 columns. There will be no data duplication; there is only one file, and Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout, and storage as your original table.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or going through an intermediate staging table created by one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external-reference staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. have only the needed 30 columns) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use this storage format because it will give you a tremendous query performance boost.

Hive insert vs Hive Load: What are the trade offs?

I'm learning Hadoop/big data technologies. I would like to ingest data in bulk into Hive. I started working with a simple CSV file, and when I tried to use the INSERT command to load it record by record, a single record insertion took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous, since it just copies the file into Hive's warehouse. I just want to know what trade-offs one has to face when opting for LOAD instead of INSERT.
Load - Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Insert - Query results are inserted into tables using the INSERT clause, which in turn runs MapReduce jobs, so it takes some time to execute.
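For illustration, a minimal sketch of both approaches (the file path and table names are assumptions):
-- LOAD: a pure file copy/move into the table's location, no transformation.
LOAD DATA INPATH '/user/me/input/records.csv' INTO TABLE my_table;

-- INSERT: runs a MapReduce (or Tez) job to compute and write the rows.
INSERT INTO TABLE my_table
SELECT * FROM my_staging_table;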
In case you want to optimize/tune the insert statements, below are some techniques:
1. Set the execution engine in hive-site.xml to Tez (if it is already installed):
set hive.execution.engine=tez;
2. Use the ORC file format:
CREATE TABLE A_ORC (
  customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Concurrent job runs in Hive can save overall job running time. To achieve that, the config below needs to be changed in hive-default.xml:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
For more info, you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Hope this helps.

loading data to hive dynamic partitioned tables

I have created a Hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using the "LOAD DATA" statement? Or do we have to depend on creating a non-partitioned intermediate table, loading the file data into it, and then inserting data from this intermediate table into the partitioned table, as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you could directly copy the files to the table location in HDFS under partition directories you create manually (OR just point to their current location in the case of an EXTERNAL table) and use the following ALTER command to ADD the partition. This way you can skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
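A filled-in version might look like this (the partition column, value, and location are illustrative):
-- Register an existing HDFS directory as a partition of the table.
ALTER TABLE events_partitioned
ADD PARTITION (event_date = '2023-01-01')
LOCATION '/data/events/event_date=2023-01-01';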
There is no other way: if we need to insert directly, we'll need to specify the partitions manually.
For dynamic partitioning, we need a staging table and then insert from there.
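A minimal sketch of the staging-table route (table and column names are illustrative):
-- Enable dynamic partition inserts.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Load the raw file into a non-partitioned staging table...
LOAD DATA INPATH '/user/me/input/events.csv' INTO TABLE events_staging;

-- ...then insert into the partitioned table; the partition column value
-- is taken from each row (dynamic partitioning).
INSERT INTO TABLE events_partitioned PARTITION (event_date)
SELECT user_id, payload, event_date
FROM events_staging;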