csv file missing data when loading into view - google-bigquery

I have loaded two CSV files into Google Cloud Storage and created a view to separate the columns and cast them into the correct data types.
There are two data columns (long and lat) that I need to cast as FLOAT64 to generate a point using the ST_GEOGPOINT function, which I will then match up against a polygon from another schema.
The first file loads correctly and I am able to push it into a table; however, the second file behaves differently. All the values in its columns are wrapped in double quotes ("), which the first file's were not, and it does not generate a point because the long and lat are not being picked up.
Could someone give me a solution or some advice?
The first CSV file is significantly smaller in size (449 KB vs. 4.7 MB for the second), which I don't think should be an issue as long as the columns are in the correct order and have the same names when I load them in.
I have also checked the files in Notepad++ for differences, and they appear to be similar.
This is my view:
SELECT
  Filename,
  SAFE_CAST(Date AS DATE) AS Date,
  SAFE_CAST(Lat AS FLOAT64) AS Lat,
  SAFE_CAST(Long AS FLOAT64) AS Long,
  ST_GEOGPOINT(SAFE_CAST(Long AS FLOAT64), SAFE_CAST(Lat AS FLOAT64)) AS Point
FROM (
  -- Split the single input column on commas and derive the date from the file name
  SELECT
    Filename,
    PARSE_DATE("%Y%m%d", ARRAY_REVERSE(SPLIT(REPLACE(Filename, '.', '_'), '_'))[SAFE_OFFSET(1)]) AS Date,
    SPLIT(input, ",")[SAFE_OFFSET(0)] AS Lat,
    SPLIT(input, ",")[SAFE_OFFSET(1)] AS Long
  FROM (
    SELECT DISTINCT
      _FILE_NAME AS Filename,
      *
    FROM
      xxx.xxx
  )
)
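If the second file's values really are wrapped in literal double-quote characters, SAFE_CAST returns NULL for them and ST_GEOGPOINT has nothing to work with. One possible workaround is to strip the quotes before casting; below is a minimal sketch against the same inner query, assuming the quotes are plain " characters stored in the data (worth confirming in Notepad++):
-- Sketch: remove surrounding double quotes so values like "51.5" still cast
SELECT
  Filename,
  SAFE_CAST(TRIM(Lat, '"') AS FLOAT64) AS Lat,
  SAFE_CAST(TRIM(Long, '"') AS FLOAT64) AS Long,
  ST_GEOGPOINT(
    SAFE_CAST(TRIM(Long, '"') AS FLOAT64),
    SAFE_CAST(TRIM(Lat, '"') AS FLOAT64)
  ) AS Point
FROM (
  SELECT
    _FILE_NAME AS Filename,
    SPLIT(input, ",")[SAFE_OFFSET(0)] AS Lat,
    SPLIT(input, ",")[SAFE_OFFSET(1)] AS Long
  FROM xxx.xxx
)
Alternatively, re-loading the second file with the quote character configured on the load job, so BigQuery strips the quotes itself, avoids the cleanup at query time.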

Related

How can I force BigQuery to autodetect schema with all strings?

Is there a way to use --autodetect in BigQuery while forcing all new fields to be treated as strings?
The issue is the following: I have a csv file separated by \t where all fields are quoted like this: '67.4'. Now, if I simply provide a schema, then the bq load breaks for reasons I cannot understand. If I do bq load --autodetect it works fine, but the values are still quoted. Now, I tried to do
bq load --autodetect --quote="'" --max_bad_records=10000 \
  --field_delimiter="\t" --source_format=CSV \
  repo:abc.2017 gs://abc/abc_2017-*.csv.gz
But it now breaks with
- gs://abc/abc_2017-04-16.csv.gz: Error while reading data,
error message: Could not parse '67.4' as int for field
int64_field_35 (position 35) starting at location 2138722
Here's one row, fields again are separated by tabs:
'333933353332333633383339333033333337' '31373335434633' 'pre' 'E' '1' '333933383335333833393333333333383338' '2017-02-01
05:13:59' '29' '333733333330333033323339333933313335333333303333333433393336' '333333353331333933363338333033373333333833323338333733323330' '3333343234313434' 'R' 'LC' '100' '-70.2' '-31.34' 'HSFC310' 'WOMT24I' '146' '1' '05'
Ideas?
Schema auto-detection samples up to the first 100 rows, so if a column contains only integers within those rows, its data type will be INTEGER. The purpose of the --quote flag is to specify the character that encloses column values.
Example:
Sample csv data:
col1, col2
1, "2"
If you don't specify --quote, it defaults to the double quote ("). The data type for col2 will then be INTEGER and the value will be 2.
If you specify a --quote character other than the default ", BigQuery treats that character as the enclosure instead. Example: with --quote="'", col2 will be of STRING type and the data value will be "2" (the double quotes themselves become part of the data value).
As of now you can't force schema auto-detection to make all of your columns a certain data type; otherwise it wouldn't be auto-detection after all. You may want to file a feature request to add another flag to bq load (and even to the UI) to make certain columns a certain data type (e.g. make columns 1, 2, 15, 100, xxx STRING, or make all columns STRING/INTEGER/NUMERIC, etc.).
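If the quote characters do end up stored inside the values, they can still be stripped at query time after everything has been loaded as STRING. A minimal sketch, using a hypothetical staging table and column name (abc.staging_2017, col2) rather than anything from the actual dataset:
-- Sketch: remove the surrounding single quotes, then cast ('67.4' -> 67.4)
SELECT
  SAFE_CAST(TRIM(col2, "'") AS FLOAT64) AS col2
FROM abc.staging_2017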

function to convert unicode in bigquery

I am trying out the NORMALIZE function with NFKC in BigQuery from the documentation, and I see that I can convert a string to a readable format. For example:
WITH EquivalentNames AS (
  SELECT name
  FROM UNNEST([
    'Jane\u2004Doe',
    '\u0026 Hello'
  ]) AS name
)
SELECT
  NORMALIZE(name, NFKC) AS normalized_str
FROM EquivalentNames
GROUP BY 1;
The ampersand character shows up correctly, but I have a table with a STRING column containing unicode characters in its values, and I'm not able to use NORMALIZE to convert it to a readable format.
I've also tried some of the other solutions presented in Decode Unicode's to Local language in Bigquery, but nothing is working yet.
Attached is an example of the data:
You posted a question about NORMALIZE, but didn't make your goals clear.
Here I'll answer the question about NORMALIZE - to point out that it probably doesn't do what you are expecting it to do, but at least it's acting as expected.
There are many ways to encode the same string in Unicode. NORMALIZE chooses one representation, while preserving the string.
See this query:
SELECT *, a=b ab, a=c ac, a=d ad, b=c bc, b=d bd, c=d cd
FROM (
SELECT NORMALIZE('hello ñá 😞', NFC) a
, NORMALIZE('hello ñá 😞', NFKC) b
, NORMALIZE('hello ñá 😞', NFD) c
, NORMALIZE('hello ñá 😞', NFKD) d
)
As you can see, you get the same string every time; they just have different non-visible representations.
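To see that the representations really do differ even though they display identically, the underlying code points can be compared; a small sketch using TO_CODE_POINTS:
SELECT
  TO_CODE_POINTS(NORMALIZE('ñ', NFC)) AS nfc_points, -- composed form, e.g. [241]
  TO_CODE_POINTS(NORMALIZE('ñ', NFD)) AS nfd_points  -- decomposed form, e.g. [110, 771]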
The \u2004 is a so-called "thick space", which is why you think it is not showing correctly - you just see a space.
But if you try some other code point - for example \u2020 - you will see it actually shows up even without any extra processing with the NORMALIZE function,
as in the example below:
#standardSQL
WITH EquivalentNames AS (
  SELECT name
  FROM UNNEST([
    'Jane\u2020Doe',
    '\u0026 Hello'
  ]) AS name
)
SELECT
  name, NORMALIZE(name, NFKC) AS normalized_str
FROM EquivalentNames
GROUP BY 1
with result
Row name normalized_str
1 Jane†Doe Jane†Doe
2 & Hello & Hello

Amazon Spectrum incremental load directly from string

I have a field such as 'filename Pro_180913_171842' in Spectrum.
I tried the following in SQL:
select
  fields
from spectrum.ex
where cast(SPLIT_PART('filename Pro_180913_171842', 'Pro_', 2) as timestamp)
      > cast('2018-09-12 15:13:54.0' as timestamp)
but it returned only empty rows!
Your field has no date component, so unless we add date information, it makes no sense to compare to a full timestamp. If you intend to compare only times, then try this:
SELECT fields
FROM spectrum.ex
WHERE SPLIT_PART('filename Pro_180913_171842', '_', 2) > '151354';
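If the suffix is actually a sortable date-and-time stamp (YYMMDD_HHMISS), a plain string comparison against a cutoff built the same way works without any casting at all; a sketch, assuming a hypothetical cutoff of 2018-09-12 15:13:54:
SELECT fields
FROM spectrum.ex
WHERE SPLIT_PART('filename Pro_180913_171842', 'Pro_', 2) > '180912_151354';
Because 'YYMMDD_HHMISS' sorts lexicographically in chronological order (within the same century), this compares the full date and time in one go.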

Cast in Google BigQuery not appropriate?

I have a #StandardSQL query
SELECT
  CAST(created_utc AS STRING),
  author
FROM
  `table`
WHERE
  something = "Something"
which gives me the following error,
Error: Cannot read field 'created_utc' of type STRING as INT64
An example of created_utc is 1517360483
If I understand that error (which I clearly don't), created_utc is stored as a string, but the query is unsuccessfully trying to convert it to an INT64. I would have hoped the CAST function would enforce that it is kept as a string.
What have I done wrong?
The problem is that you don't actually have a single table. In your question, you wrote table, but I suspect that you are querying table*, which matches multiple tables where one of them happens to have a different type for that column. Instead of using table*, your options are to:
Use UNION ALL with the individual tables, performing casts as appropriate in the SELECT lists (see the sketch after this list).
If you know which table(s) have that column as an INT64 instead of a STRING, and you are okay with excluding them, you can use a filter on _TABLE_SUFFIX to skip reading from certain tables.
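For example, here is a sketch of the first option, assuming two hypothetical tables where dataset.table_a stores created_utc as STRING and dataset.table_b stores it as INT64:
SELECT created_utc, author
FROM `dataset.table_a`
WHERE something = "Something"
UNION ALL
-- cast the INT64 column so both branches return the same type
SELECT CAST(created_utc AS STRING) AS created_utc, author
FROM `dataset.table_b`
WHERE something = "Something"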
As Elliott has already pointed out - some of your values actually cannot be cast to INT64 because they do not represent integers and contain characters other than digits.
Using the SELECT below you can identify such values, which will help you locate the problematic entries and decide on the next actions.
#standardSQL
SELECT created_utc, author
FROM `table`
WHERE something = "Something"
AND NOT REGEXP_CONTAINS(created_utc, r'^[0-9]+$')
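An equivalent check that leans on SAFE_CAST itself (a sketch; it flags any value the cast would turn into NULL):
#standardSQL
SELECT created_utc, author
FROM `table`
WHERE something = "Something"
AND SAFE_CAST(created_utc AS INT64) IS NULL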

Include numbers in the first column once and transpose them as column headers

I am working with data from the hospital, and when I add the .csv extension to my text files they output in the following way:
It would be much easier to manage if there were a way to include the numbers in the first column only once and transpose them as column headers, then go through the first ten values in the second column and transpose them underneath, then do the next ten, and so on. The final product would look like this:
I have tried transposing them manually, but since there are millions of files, the CSVs are quite extensive. I have looked for a way to do it in Excel, but I have found nothing.
Could someone help me with a macro for this?
An Excel formula could be used, if the numbers are repeated exactly.
If the data is in Columns A & B, the following formula could be placed in C2:
=INDEX($B:$B,(ROW(C1)-1)*10+COLUMN(A$1))
And then copied to the right and down as far as needed.
You didn't mention whether the sequence of row numbers (1,90,100,120...) is always the same for each "row". From your sample, I will assume that the numbers repeat the same way, ad infinitum.
First, import the CSV into Microsoft Access. Let's assume your first column is called RowID, and your second is called Description. RowID is an integer, and Description is a memo field.
Add a third column, an Integer, and call it "Ord" (no quotes).
In Access's VBA editor add a new module with this GroupIncrement function:
Function GroupIncrement(ByVal sGroup)
    ' Static variables keep their values between calls,
    ' so the counter and the last group value survive from row to row
    Static Num, LastGrp
    ' Reset the counter whenever the group value changes
    If (StrComp(LastGrp, Nz(sGroup, "")) <> 0) Then
        Num = 0
    End If
    Num = Num + 1
    GroupIncrement = Num
    LastGrp = sGroup
End Function
Create a new query, replacing MyTable with the name of your Access table containing the CSV data:
UPDATE (SELECT * FROM [MyTable] ORDER BY [RowID])
SET [Ord] = GroupIncrement([RowID])
Create a third query:
TRANSFORM First([Description])
SELECT [Ord]
FROM [MyTable]
GROUP BY [Ord]
PIVOT [RowID]
This should put the data into the format you want (with an extra column on the left, Ord).
In Access, highlight that query and choose External Data, and in the Export section, choose Excel. Export the query to Excel.
Open the file in Excel and delete the Ord column.