Google BigQuery How to Load custom columns from csv? - google-bigquery

I have a doubleclick csv file with 20 columns (Timestamp,AdvertiserId,ActionName,Brower,OSID ...) without any header. I would like to ingest only first 3 columns into a BQ table. Is there any way to achieve that without mapping each and every column one-by-one manually into BQ's UI (create new_table ->"Schema" section)?
Fields in CSV is comma separated and newlines are defined as semi-colon';'.

There are two possible ways to do that: BigQuery: Load from CSV, skip columns
In your case I would probably suggest the second approach. Set the ignoreUnknownValues flag and pass in a schema with just the first three columns. For example:
bq load --ignore_unknown_values dataset.new_table gs://path/to/file.csv ~/path/to/schema.json

Related

Processing CSV files with Metadata in S3 bucket in Pentaho

I have a CSV file that goes something like this:
Report Name: Stackoverflow parse data
Date of Report: 31 October, 2022
Col1, Col2, Col3,...
Data, Data, Data, ...
The values before Headers, essentially data that states what the CSV is for and when it was created (can contain multiple values, hence has dynamic number of rows), need to be removed from the CSV so I can parse it in Pentaho. Now, the CSV files are on an S3 bucket and I am fetching them using S3 CSV Input but I am not sure how to proceed with filtering the non-required data so I can successfully parse the CSV files.
You can read the complete file as a CSV with only one column, adding the rownumber to the output. Then you apply a filter to get rid of the first n rows, and then you use the Split fields step to separate the rows into columns.
You'll need more steps to transform numbers and dates into the correct format (using the Split fields you'll get strings), and maybe more operations to preformat some other columns.
Or you could create a temporal copy of with your S3 CSV file without the first n rows, and read that file instead of the original one.
Step1: In the Csv input, just adding rownumber
Step2:Use filter
Step3:Add a output component like csv or database.

When using BigQuery transfer service with a CSV is it possible to only transfer certain columns? Not, all columns?

I am setting up a BigQuery transfer service to transfer a CSV stored in a GCS bucket into BigQuery.
However, I don't need all the columns in the CSV file. Is there a way of limiting the columns I transfer without having to manually remove the columns before the transfer?
Or, if I limited the columns in my BQ table to the ones I need, will BQ just ignore the other columns in the CSV file?
I have read the relevant page in the documentation but there is no mention of limiting columns.
You can accomplish what you want if you manually specify the target table schema with the columns that you need. Then when you use the transfer service you need to set the option ignore_unknown_values to true.
Let's say I have a CSV on Google Cloud Storage with the following data:
"First"|"Second"|"Ignored"
"Third"|"Fourth"|"Ignored"
Then I have the table with the name test and schema like:
first_col STRING NULLABLE
second_col STRING NULLABLE
After configuring the transfer service with web UI and checking the checkbox "Ignore unknown values" I get the following data in the table:
first_col
second_col
First
Second
Third
Fourth
Read more about it in this section.

How do I load entire file content as a text into a column AzureSQLDW table?

I have a some file in an azure data lake 2 and I want to load them as a column value nvarchar(max) in AzureSQLDW. The table in AzureSQLDW is heap. I couldn't find any way to do it? All I see is column delimited when load them into multiple rows instead of one row in single column. How I achieve this?
I don't guarantee this will work, but try using COPY INTO and define non-present values for row and column delimiters. Make your target a single column table.
I would create a Source Dataset with a single column. You do this by specifying "No delimiter":
Next, go to the "Schema" tab and Import the schema, which should create a single column called "Prop_0":
Now the data should come through as a single string instead of delimited columns.

Trouble uploading single column csv to bigquery with split columns

I'm trying to upload a dataset to bigquery so that i can query the data. The dataset is currently in a csv, with all the data for each row in one column, split by commas. I want to have the data split into columns using the comma as a delimiter.
When trying to upload using autodetect schema, the 10 columns have been detected, but are called 'string_0, string_1, string_2 etc' and the rows still have all the data in the first column.
When trying to upload by manually inputting the schema, i get these errors:
CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
CSV table references column position 9, but line starting at position:117 contains only 1 columns.
On both occasions I set header rows to skip = 1
Here's an image of the dataset.
Any help would be really appreciated!
I see here a three potential reasons for the error you're hitting:
Source data CSV file structural problem - the file does not correspond to the RFC 4180 specification prerequisites, i.e. used untypical line-breaks(line delimiters);
Bigquery sink table schema mismatch - i.e. missing a
dedicated column for a particular input data;
Bigquery schema type mismatch - parsing a table column that owns a
type that differs from input one.
Please find also more particularities for Bigquery auto-detect schema method, loading CSV format data, that can help you to solve above mentioned issue.

How to Load data in CSV file with a query in SSIS

I have a CSV file which contain millions records/rows. The header row is like:
<NIC,Name,Address,Telephone,CardType,Payment>
In my scenario I want to load data "CardType" is equal to "VIP". How can I preform this operation without loading whole records in the file into a staging table?
I am not loading these records into a data warehouse. I only need to separate these data in CSV file.
The question isn't super-clear, but it sounds like you want to do some processing of the rows before outputting them back into another CSV file. If that's the case, then you'll want to make use of the various transforms available, notably Conditional Split. In there, you can look for rows where CardType == VIP and send those down one output (call it "Valid Rows"), and send the others into the default output. Connect up your valid rows output to your CSV destination and that should be it.