Problem replacing blank columns in BigQuery with NULL - google-bigquery

I was expecting --null_marker to replace blank STRING values with NULL, but it did not work. Any suggestions, please?
I tried using --null_marker="null":
$gcloud_dir/bq load $svc_ac --max_bad_records=10 --replace --source_format=CSV --null_marker="null" --field_delimiter=',' table source
The empty strings did not get replaced with NULL.

Google Cloud Support here!
After reading through the documentation, the description for the --null_marker flag states:
Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string.
Therefore, setting --null_marker="null" will not replace empty strings with NULL; it will only treat the literal string "null" as a null value. At this point you should either:
Replace the empty strings before uploading the CSV file, or
Once you have uploaded the CSV file, run a query that uses the REPLACE function.
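For the first option, a minimal preprocessing sketch in Python (the file names are placeholders) would rewrite empty fields as the literal string "null", so that --null_marker="null" then loads them as NULL:

import csv

# Rewrite empty CSV fields as the literal string "null" so that
# `bq load --null_marker="null"` treats them as NULL.
# "source.csv" and "source_nullified.csv" are placeholder file names.
with open("source.csv", newline="") as src, \
        open("source_nullified.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow([field if field != "" else "null" for field in row])

For the second option, a post-load query along the lines of NULLIF(column_name, '') can turn the empty strings into NULLs after the fact.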

Related

How to remove new line characters from data rows in Presto/AWS Athena?

I'm querying some tables on Athena (Presto SQL) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains newline characters that don't appear in the AWS interface, only in the CSV, and I need to get rid of them. I tried using the function replace(string, search, replace) → varchar to escape the newline character, replacing \n with \\n, without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contain \n anywhere; instead, it contains the actual newline character, which is represented by chr(10). I was able to achieve the expected behaviour using the replace function, passing chr(10) as a parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000

skipLeadingRows=1 in external table definition

In the example below, how can I set the skip-leading-rows option?
bq --location=US query --external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER#CSV=gs://mybucket/sales.csv 'SELECT Region,Total_sales FROM sales;'
Regards,
Sreekanth
The flag options can be found under the installation home folder (the flag you are looking for is --skip_leading_rows):
/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_jagged_rows: Whether to allow missing trailing optional columns in
CSV import data.
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import
data.
-E,--encoding: The character encoding used by the input
file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between
columns in the input file. "\t" and "tab" are accepted names for tab.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized
values in CSV or JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job
fails.
(default: '0')
(an integer)
--quote: Quote character to use to enclose records. Default is ". To indicate
no quote character at all, use an empty string.
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form
name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to
skip.
(an integer)
--source_format: Format of source data. Options include:
CSV
NEWLINE_DELIMITED_JSON
DATASTORE_BACKUP
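If you would rather not drive this through the CLI flags, the same setting is exposed by the google-cloud-bigquery Python client as CSVOptions.skip_leading_rows. A hedged sketch, reusing the schema and URI from the question (everything else is illustrative, not the bq command itself):

from google.cloud import bigquery

client = bigquery.Client()

# Describe the external CSV table and skip its header row.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://mybucket/sales.csv"]
external_config.schema = [
    bigquery.SchemaField("Region", "STRING"),
    bigquery.SchemaField("Quarter", "STRING"),
    bigquery.SchemaField("Total_sales", "INTEGER"),
]
external_config.options.skip_leading_rows = 1  # the skipLeadingRows=1 part

# Attach the definition under the name used in the query, then run it.
job_config = bigquery.QueryJobConfig(table_definitions={"sales": external_config})
for row in client.query("SELECT Region, Total_sales FROM sales", job_config=job_config):
    print(row)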

CSV to BQ: empty fields instead of null values

I have a pipeline that is loading a CSV file from GCS into BQ. The details are here: Import CSV file from GCS to BigQuery.
I'm splitting the CSV in a ParDo into a TableRow where some of the fields are empty.
String inputLine = c.element();
String[] split = inputLine.split(",");
TableRow output = new TableRow();
output.set("Event_Time", split[0]);
output.set("Name", split[1]);
...
c.output(output);
My question is, how can I have the empty fields show up as a null in BigQuery? Currently they are coming through as empty fields.
It's turning up in BigQuery as an empty String because when you use split(), it will return an empty String for ,, and not null in the Array.
Two options:
Check for empty String in your result array and don't set the field in output.
Check for empty String in your result array and explicitly set null for the field in output.
Either way will result in null for BigQuery.
Note: be careful splitting Strings in Java like this. split() will remove trailing empty strings. Use split(",", -1) instead. See here.
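The pipeline in the question is the Java SDK, but purely as an illustration of option 2 (and of the split behaviour noted above), here is a sketch of the row-building step in Python, using the column names from the question:

def to_table_row(line):
    # Python's split(",") keeps trailing empty fields; in Java you need
    # split(",", -1) to get the same behaviour, as noted above.
    fields = line.split(",")
    row = {}
    # Column names taken from the question; extend for the remaining fields.
    for name, value in zip(["Event_Time", "Name"], fields):
        # Option 2: map empty strings to None, which BigQuery loads as NULL.
        row[name] = value if value != "" else None
    return row

# to_table_row("2018-01-01T00:00:00,") -> {"Event_Time": "2018-01-01T00:00:00", "Name": None}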
BTW: unless you're doing some complex/advanced transformations in Dataflow, you don't have to use a pipeline to load your CSV files. You could just load them or read them directly from GCS.

psycopg2: export csv to database, dealing with e+ expression

I have a CSV file containing numbers like "1.456e+07", and I am using the copy_expert function to export the file to the database, but I am getting this error:
psycopg2.DataError: invalid input syntax for integer: "1.5637e+07"
I notice that I can insert "100" as an integer, but when I do "1.5637e+07" with quotes, it doesn't work.
I am using pandas DataFrame's to_csv to generate the CSV files. I'm not sure how to get rid of the quotes for integers like "1.5637e+07" only (I have string columns), or whether there is another solution.
I found the solution.
Normally, pandas doesn't put quotes around numbers. However, I had set the float_format parameter, which caused this. I set
quoting=csv.QUOTE_MINIMAL
in the function call and the quotes went away.
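As a minimal sketch of that fix (the DataFrame contents below are made up), the relevant part is passing quoting=csv.QUOTE_MINIMAL to to_csv so that numeric columns are written unquoted:

import csv
import pandas as pd

# Illustrative data only; the real DataFrame comes from the original pipeline.
df = pd.DataFrame({"big_int": [100, 15637000], "label": ["a", "b,c"]})

# QUOTE_MINIMAL only quotes fields that need it (e.g. values containing the
# delimiter), so the integer column is written without quotes and COPY can
# read it as an integer.
df.to_csv("out.csv", index=False, quoting=csv.QUOTE_MINIMAL)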

Why doesn't the PostgreSQL COPY command allow NULL values inside arrays?

I have the following table definition:
create table null_test (some_array character varying[]);
And the following SQL file containing data.
copy null_test from stdin;
{A,\N,B}
\.
When unnesting the data (with select unnest(some_array) from null_test), the second value is "N", when I am expecting NULL.
I have tried changing the data to look as follows (to use internal quotes on the array value):
copy null_test from stdin;
{"A",\N,"B"}
\.
The same non-null value "N" is inserted.
Why is this not working and is there a workaround for this?
EDIT
As per the accepted answer, the following worked. However, the two representations of a NULL value within a COPY command, depending on whether you're dealing with single values or arrays, are inconsistent.
copy null_test from stdin;
{"A",NULL,"B"}
\.
\N represents NULL as a whole value to COPY, not as part of another value, and \N isn't anything special to PostgreSQL itself. Inside an array, \N is just \N; COPY passes the array literal to the database rather than trying to interpret it using COPY's rules.
You simply need to know how to build an array literal that contains a NULL. From the fine manual:
To set an element of an array constant to NULL, write NULL for the element value. (Any upper- or lower-case variant of NULL will do.) If you want an actual string value "NULL", you must put double quotes around it.
So you could use these:
{A,null,B}
{"A",NULL,"B"}
...
to get NULLs in your arrays.
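Tying it back to the COPY in the question, a hedged psycopg2 sketch (the connection string is a placeholder) that feeds the corrected literal through copy_expert and checks the result:

import io
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("create table if not exists null_test (some_array character varying[])")
    # NULL (any case) inside the array literal is what gives a NULL element;
    # \N would only mean NULL for the whole column value.
    cur.copy_expert("copy null_test from stdin", io.StringIO('{"A",NULL,"B"}\n'))
    cur.execute("select unnest(some_array) from null_test")
    print(cur.fetchall())  # expected: [('A',), (None,), ('B',)]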