When using the BigQuery transfer service with a CSV, is it possible to transfer only certain columns, not all of them? - google-bigquery

I am setting up a BigQuery transfer service to transfer a CSV stored in a GCS bucket into BigQuery.
However, I don't need all the columns in the CSV file. Is there a way of limiting the columns I transfer without having to manually remove the columns before the transfer?
Or, if I limit the columns in my BQ table to the ones I need, will BQ just ignore the other columns in the CSV file?
I have read the relevant page in the documentation but there is no mention of limiting columns.

You can accomplish what you want if you manually specify the target table schema with only the columns that you need. Then, when you set up the transfer, set the option ignore_unknown_values to true.
Let's say I have a CSV on Google Cloud Storage with the following data:
"First"|"Second"|"Ignored"
"Third"|"Fourth"|"Ignored"
Then I have the table with the name test and schema like:
first_col STRING NULLABLE
second_col STRING NULLABLE
After configuring the transfer service in the web UI and checking the "Ignore unknown values" checkbox, I get the following data in the table:
first_col    second_col
First        Second
Third        Fourth
Read more about it in this section.
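If you want to test the schema/flag combination before wiring up the transfer, the same behaviour is available on an ordinary load job. Below is a minimal sketch using the google-cloud-bigquery Python client, assuming the pipe-delimited sample above sits at a placeholder GCS path and loads into a placeholder dataset:

from google.cloud import bigquery

client = bigquery.Client()

# Target table has only the two columns we care about; the third CSV
# column ("Ignored") is dropped because ignore_unknown_values is set.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",            # the sample above is pipe-delimited
    ignore_unknown_values=True,     # same effect as the "Ignore unknown values" checkbox
    schema=[
        bigquery.SchemaField("first_col", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("second_col", "STRING", mode="NULLABLE"),
    ],
)

# gs://my-bucket/data.csv and my_dataset are placeholders, not from the question.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv", "my_dataset.test", job_config=job_config
)
load_job.result()  # wait for the load to finish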

Related

Adding a dynamic column in copy activity of azure data factory

I am using a Data Flow in my Azure Data Factory pipeline to copy data from one Cosmos DB collection to another. I am using the Cosmos DB SQL API as the source and sink datasets.
The problem is that, when copying the documents from one collection to the other, I would like to add an additional column whose value is the same as one of the existing keys in the JSON. I am trying the Additional columns option in the source settings, but I cannot figure out how to reference an existing column's value there. Can anyone help with this?
In the case of a copy activity, you can assign an existing column's value to a new column under Additional columns by specifying $$COLUMN as the value and adding the name of the column to be duplicated.
If you are adding the new column in a data flow instead, you can achieve this using a Derived Column transformation.
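For reference, in the copy activity's JSON this ends up under the source's additionalColumns list. The fragment below is only a rough sketch, written as a Python dict purely so it can carry comments; "copiedKey" and "existingKey" are placeholder names, and the $$COLUMN:<column> form is my reading of the Data Factory documentation, so verify it against the JSON your pipeline actually generates:

# Sketch of the copy activity source's "additionalColumns" entry
# (shown as a Python dict only for the sake of inline comments).
additional_columns = [
    {
        "name": "copiedKey",              # placeholder: new column written to the sink
        "value": "$$COLUMN:existingKey",  # placeholder: duplicates the existing key's value
    }
]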

Trouble uploading single column csv to bigquery with split columns

I'm trying to upload a dataset to BigQuery so that I can query the data. The dataset is currently in a CSV, with all the data for each row in one column, separated by commas. I want the data split into columns using the comma as the delimiter.
When I try to upload using schema auto-detection, the 10 columns are detected, but they are named 'string_0, string_1, string_2', etc., and the rows still have all the data in the first column.
When I try to upload by manually entering the schema, I get these errors:
CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
CSV table references column position 9, but line starting at position:117 contains only 1 columns.
On both occasions I set header rows to skip = 1
Here's an image of the dataset.
Any help would be really appreciated!
I see three potential reasons for the error you're hitting:
A structural problem in the source CSV file - the file does not conform to the RFC 4180 specification, e.g. it uses unusual line breaks (row delimiters);
A BigQuery sink table schema mismatch - e.g. a dedicated column is missing for some of the input data;
A BigQuery schema type mismatch - a table column whose type differs from the type of the input data.
Please also see the documentation on BigQuery schema auto-detection and on loading CSV data, which can help you solve the issue above.
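As a quick check on the first point, you can sniff the file's real delimiter on a local sample and then pass it to the load job explicitly instead of relying on auto-detection. A minimal sketch with the Python client, assuming a local copy of the file and placeholder GCS/table names:

import csv

from google.cloud import bigquery

# Inspect a local sample of the file to see which delimiter it really uses
# (reason 1 above): a semicolon or tab here would explain the single column.
with open("sample.csv", newline="") as f:
    dialect = csv.Sniffer().sniff(f.read(4096))
print("Detected delimiter:", repr(dialect.delimiter))

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=dialect.delimiter,  # pass the real delimiter explicitly
    skip_leading_rows=1,                # the header row mentioned in the question
    autodetect=True,                    # or supply an explicit 10-column schema
)
client.load_table_from_uri(
    "gs://my-bucket/dataset.csv",       # placeholder path
    "my_dataset.my_table",              # placeholder table
    job_config=job_config,
).result()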

Create table schema and load data in bigquery table using source google drive

I am creating a table using Google Drive as the source and Google Sheets as the format.
I have selected "Drive" as the value for "Create table from". For "File format", I selected Google Sheet.
I also selected schema auto-detect and the input parameters.
It creates the table, but the first row of the sheet is loaded as data instead of being used as the table's field names.
Kindly tell me what I need to do to get the first row of the sheet treated as the table's column names, not as data.
It would have been helpful if you could include a screenshot of the top few rows of the file you're trying to upload, at least to see the data types you have in there. BigQuery, at least as of when this response was written, cannot differentiate between column names and data rows if both have similar data types while schema auto-detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically using the UI options alone) since all the headers and row data are Strings. The "Header rows to skip" option would not help with this.
Schema auto detection should be able to detect and differentiate column names from data rows when you have different data types for different columns though.
You have an option to skip the header row under Advanced options. Simply put 1 as the number of header rows to skip (your first row is where your header is). BigQuery will then skip that first row as data and use it for your header (the column names).
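If you are doing this outside the console, the same setup (a Drive-backed Google Sheets table with auto-detect and one header row skipped) can be expressed with the Python client roughly as below; the sheet URL and table ID are placeholders, and the client's credentials need the Google Drive scope:

from google.cloud import bigquery

# Note: the credentials used by the client must include the Google Drive
# scope, otherwise BigQuery cannot read the sheet.
client = bigquery.Client()

external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = [
    "https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID"  # placeholder URL
]
external_config.autodetect = True
external_config.options.skip_leading_rows = 1  # treat row 1 as column names, not data

table = bigquery.Table("my-project.my_dataset.my_table")  # placeholder table ID
table.external_data_configuration = external_config
client.create_table(table)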

Google BigQuery How to Load custom columns from csv?

I have a DoubleClick CSV file with 20 columns (Timestamp, AdvertiserId, ActionName, Browser, OSID ...) and no header row. I would like to ingest only the first 3 columns into a BQ table. Is there any way to achieve that without mapping each and every column one by one manually in BQ's UI (create new_table -> "Schema" section)?
Fields in the CSV are comma-separated and row breaks are defined by a semicolon ';'.
There are two possible ways to do that; see BigQuery: Load from CSV, skip columns.
In your case I would probably suggest the second approach. Set the ignoreUnknownValues flag and pass in a schema with just the first three columns. For example:
bq load --ignore_unknown_values dataset.new_table gs://path/to/file.csv ~/path/to/schema.json
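For completeness, here is a rough Python-client equivalent of that bq command; the three column names come from the question, but the types are assumptions:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    ignore_unknown_values=True,  # drop the 17 trailing columns not in the schema
    schema=[
        bigquery.SchemaField("Timestamp", "TIMESTAMP"),    # type assumed
        bigquery.SchemaField("AdvertiserId", "STRING"),    # type assumed
        bigquery.SchemaField("ActionName", "STRING"),      # type assumed
    ],
)
client.load_table_from_uri(
    "gs://path/to/file.csv", "dataset.new_table", job_config=job_config
).result()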

How to Load data in CSV file with a query in SSIS

I have a CSV file which contains millions of records/rows. The header row is like:
<NIC,Name,Address,Telephone,CardType,Payment>
In my scenario I want to load only the rows where "CardType" is equal to "VIP". How can I perform this operation without loading all the records in the file into a staging table?
I am not loading these records into a data warehouse. I only need to separate this data within the CSV file.
The question isn't super clear, but it sounds like you want to do some processing of the rows before outputting them back into another CSV file. If that's the case, then you'll want to make use of the various transforms available, notably Conditional Split. In there, you can look for rows where CardType == "VIP" and send those down one output (call it "Valid Rows"), and send the others into the default output. Connect up your valid rows output to your CSV destination and that should be it.
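Purely to illustrate the row-level test that the Conditional Split performs (CardType == "VIP"), here is the same filter as a small standalone Python sketch; the file names are placeholders, and this is not a replacement for the SSIS package:

import csv

# Stream rows from the source file and keep only CardType == "VIP",
# without staging anything in a table. File names are placeholders.
with open("input.csv", newline="") as src, open("vip_rows.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)  # header: NIC,Name,Address,Telephone,CardType,Payment
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["CardType"] == "VIP":
            writer.writerow(row)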