Trouble uploading a single-column CSV to BigQuery with split columns - sql

I'm trying to upload a dataset to BigQuery so that I can query the data. The dataset is currently in a CSV file, with all the data for each row in one column, separated by commas. I want the data split into columns using the comma as a delimiter.
When I upload with schema auto-detection, 10 columns are detected, but they are named 'string_0, string_1, string_2', etc., and the rows still have all their data in the first column.
When I upload with a manually entered schema, I get these errors:
CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
CSV table references column position 9, but line starting at position:117 contains only 1 columns.
On both occasions I set "Header rows to skip" to 1.
Here's an image of the dataset.
Any help would be really appreciated!

I see three potential reasons for the error you're hitting:
A structural problem in the source CSV file: the file does not conform to the RFC 4180 specification, e.g. it uses atypical line breaks (row delimiters);
A BigQuery sink table schema mismatch: the table is missing a dedicated column for some of the input data;
A BigQuery schema type mismatch: a table column is parsed with a type that differs from that of the input data.
You may also want to review the particulars of BigQuery's schema auto-detection when loading CSV data, which can help you resolve the issue above.
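If you end up specifying the schema yourself, a programmatic load can make the delimiter and schema explicit. Below is a minimal sketch using the Python client library; the bucket path, table ID, and column names are placeholders, and the file is assumed to really be comma-delimited (if all values keep landing in one column, double-check that the file isn't actually using another delimiter such as ';' or a tab).

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # same as "Header rows to skip = 1"
        field_delimiter=",",   # make the delimiter explicit
        schema=[
            bigquery.SchemaField("col_1", "STRING"),  # placeholder names
            bigquery.SchemaField("col_2", "STRING"),
            # ... one SchemaField per expected column, 10 in total
        ],
    )

    load_job = client.load_table_from_uri(
        "gs://your-bucket/your_file.csv",         # placeholder path
        "your_project.your_dataset.your_table",   # placeholder table ID
        job_config=job_config,
    )
    load_job.result()  # waits for completion; raises on CSV parse errors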

Related

Processing CSV files with Metadata in S3 bucket in Pentaho

I have a CSV file that goes something like this:
Report Name: Stackoverflow parse data
Date of Report: 31 October, 2022
Col1, Col2, Col3,...
Data, Data, Data, ...
The values before the headers, essentially metadata stating what the CSV is for and when it was created (it can contain multiple values, hence a dynamic number of rows), need to be removed from the CSV so I can parse it in Pentaho. Now, the CSV files are in an S3 bucket and I am fetching them using the S3 CSV Input step, but I am not sure how to proceed with filtering out the unneeded rows so I can successfully parse the CSV files.
You can read the complete file as a CSV with only one column, adding the rownumber to the output. Then you apply a filter to get rid of the first n rows, and then you use the Split fields step to separate the rows into columns.
You'll need more steps to transform numbers and dates into the correct format (using the Split fields you'll get strings), and maybe more operations to preformat some other columns.
Or you could create a temporary copy of your S3 CSV file without the first n rows, and read that file instead of the original one (a sketch of that preprocessing follows the steps below).
Step 1: In the CSV input, just add the row number.
Step 2: Use a Filter rows step to drop the metadata rows.
Step 3: Add an output step, such as a CSV file or database output.
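For the temporary-copy route, here is a minimal Python sketch of the same preprocessing done outside Pentaho; it assumes you can recognize the header row by a known prefix ("Col1" here), since the number of metadata rows is dynamic.

    import csv
    import io

    def strip_metadata(raw_text, header_prefix="Col1"):
        """Drop everything before the header row, then parse the rest as CSV."""
        lines = raw_text.splitlines()
        start = next(i for i, line in enumerate(lines)
                     if line.startswith(header_prefix))
        return list(csv.DictReader(io.StringIO("\n".join(lines[start:]))))

    # Example with the layout from the question (two metadata rows):
    raw = ("Report Name: Stackoverflow parse data\n"
           "Date of Report: 31 October, 2022\n"
           "Col1,Col2,Col3\n"
           "Data1,Data2,Data3\n")
    print(strip_metadata(raw))
    # [{'Col1': 'Data1', 'Col2': 'Data2', 'Col3': 'Data3'}]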

Create table schema and load data in bigquery table using source google drive

I am creating a table using Google Drive as the source and a Google Sheet as the format.
I selected "Drive" as the value for "Create table from". For File format, I selected Google Sheet.
I also selected auto-detect schema and the input parameters.
It creates the table, but the first row of the sheet is loaded as data instead of as field names.
Kindly tell me what I need to do to get the first row of the sheet used as the table's column names, not as data.
It would have been helpful if you could include a screenshot of the top few rows of the file you're trying to upload at least to see the data types you have in there. BigQuery, at least as of when this response was composed, cannot differentiate between column names and data rows if both have similar datatypes while schema auto detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically using the UI options alone) since all the headers and row data are Strings. The "Header rows to skip" option would not help with this.
Schema auto detection should be able to detect and differentiate column names from data rows when you have different data types for different columns though.
You have the option to skip the header row under Advanced options. Simply put 1 as the number of rows to skip (your first row is where your header is). BigQuery will then skip the first row and use its values as your column names.
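For reference, the same setting is available outside the UI. A minimal sketch with the Python client library, assuming a placeholder sheet URL and table ID (and client credentials that include the Google Drive scope):

    from google.cloud import bigquery

    client = bigquery.Client()  # credentials must carry the Drive scope

    external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
    external_config.source_uris = [
        "https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID"  # placeholder
    ]
    external_config.autodetect = True
    # Same effect as "Header rows to skip = 1" in the UI: the first
    # row supplies the column names instead of being loaded as data.
    external_config.options.skip_leading_rows = 1

    table = bigquery.Table("your_project.your_dataset.your_table")
    table.external_data_configuration = external_config
    client.create_table(table)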

How to combine two rows into one row with respective column values

I have a CSV file which contains newlines inside single rows, i.e. one row's data comes in on two lines, and I want to insert the second line's data into the respective columns. I've loaded the data into SQL, but now I want to fold the second row's data into the first row under the respective column values.
output details:
I wouldn't recommend fixing this in SQL, because this is an issue with the CSV file. The issue is that the file contains newlines, which cause rows to split.
I strongly encourage fixing the CSV files, if possible. It's going to be difficult to fix this in SQL, given there are going to be more cases like that.
If you're doing the import with SSIS (or have the option of doing it with SSIS if you're not currently), the package can be configured to handle embedded carriage returns.
Define your file import connection manager with the columns you're expecting.
In the connection manager's Properties window, set the AlwaysCheckForRowDelimiters property to False. The default value is True.
By setting the property to False, SSIS will ignore mid-row carriage return/line feeds and will parse your data into the required number of columns.
Credit to Martin Smith for helping me out when I had a very similar problem some time ago.
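If SSIS isn't available and you have to repair the file before import, here is a minimal Python sketch of one approach; it assumes every complete logical row has a fixed number of fields (10 here, as an example) and that the stray breaks never fall inside quoted fields.

    import csv

    EXPECTED_COLS = 10  # fields in one complete logical row (assumption)

    def merge_broken_rows(path):
        """Glue physical lines back together until each logical row
        has the expected number of fields."""
        fixed, buffer = [], ""
        with open(path, newline="") as f:
            for line in f:
                buffer += line.rstrip("\r\n")
                if not buffer:
                    continue
                fields = next(csv.reader([buffer]))
                if len(fields) >= EXPECTED_COLS:
                    fixed.append(fields)
                    buffer = ""
        return fixed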

BigQuery: schema autodetection of JSON couldn't recognize a field appearing later in the JSON input file

I found that BigQuery's schema auto-detection doesn't recognize a field if it doesn't appear near the beginning of an input JSON file.
I have a field named "details", which is a RECORD type. In the first 2,000 rows of the JSON input file, this field doesn't have any sub-fields. But then at line 2,698 of the input file, the field has a "report" sub-field for the first time. If I move that line to the top of the JSON file, it works fine.
How can I solve this issue? Explicitly specifying the schema is one way but I am wondering if there is a way to make the auto detection scan more rows or something like that.
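Schema auto-detection only scans an initial sample of the input (the documentation describes scanning up to the first 500 rows), so a sub-field that first appears thousands of lines in can be missed, and there is no load option to widen that sample. Explicitly specifying the schema, as you suggest, is the dependable fix. Here is a minimal sketch with the Python client library, where the top-level "id" field and the STRING type of "report" are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()

    # "details" is declared up front as a RECORD with its late-appearing
    # "report" sub-field; the first ~2K rows simply leave it NULL.
    schema = [
        bigquery.SchemaField("id", "STRING"),  # hypothetical top-level field
        bigquery.SchemaField(
            "details", "RECORD",
            fields=[bigquery.SchemaField("report", "STRING")],  # type assumed
        ),
    ]

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        schema=schema,
    )

    client.load_table_from_uri(
        "gs://your-bucket/input.json",            # placeholder path
        "your_project.your_dataset.your_table",   # placeholder table ID
        job_config=job_config,
    ).result()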

Google BigQuery How to Load custom columns from csv?

I have a DoubleClick CSV file with 20 columns (Timestamp, AdvertiserId, ActionName, Browser, OSID, ...) without any header. I would like to ingest only the first 3 columns into a BQ table. Is there any way to achieve that without mapping each and every column one-by-one manually in BQ's UI (create new_table -> "Schema" section)?
Fields in the CSV are comma-separated, and newlines (row delimiters) are defined as a semicolon ';'.
There are two possible ways to do that; see BigQuery: Load from CSV, skip columns.
In your case I would probably suggest the second approach: set the ignoreUnknownValues flag and pass in a schema with just the first three columns. For example:
bq load --ignore_unknown_values dataset.new_table gs://path/to/file.csv ~/path/to/schema.json
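The same load expressed with the Python client library, for comparison; the three column types here are assumptions based on the column names:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        ignore_unknown_values=True,  # trailing columns 4-20 are dropped
        schema=[                     # only the columns you want to keep
            bigquery.SchemaField("Timestamp", "TIMESTAMP"),   # type assumed
            bigquery.SchemaField("AdvertiserId", "INTEGER"),  # type assumed
            bigquery.SchemaField("ActionName", "STRING"),
        ],
    )

    client.load_table_from_uri(
        "gs://path/to/file.csv",
        "dataset.new_table",
        job_config=job_config,
    ).result()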