Flexible schema with Google BigQuery

I have around 1000 files that have seven columns. Some of these files have a few rows that have an eighth column (if there is data).
What is the best way to load this into BigQuery? Do I have to find and edit all these files to either
- add an empty eighth column in all files
- remove the eighth column from all files? (I don't care about the value in this column.)
Is there a way to specify eight columns in the schema and have a null inserted for the eighth column when no data is available?
I am using BigQuery APIs to load data if that might help.

You can use the 'allowJaggedRows' option, which treats missing values at the end of a row as nulls. So your schema can have eight columns, and every row that lacks the eighth value will simply get a null in that column.
This is documented here: https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowJaggedRows
I've filed a doc bug to make this easier to find.
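For reference, here is a minimal sketch of such a load using the Python google-cloud-bigquery client; the bucket URI, dataset/table names, and column names are placeholders, not anything from your setup.
from google.cloud import bigquery
client = bigquery.Client()
# Seven real columns plus a nullable eighth that jagged rows will leave as NULL.
schema = [bigquery.SchemaField(f"col{i}", "STRING") for i in range(1, 8)]
schema.append(bigquery.SchemaField("col8", "STRING", mode="NULLABLE"))
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=schema,
    allow_jagged_rows=True,  # maps to allowJaggedRows in the REST API
)
client.load_table_from_uri(
    "gs://my-bucket/my-files-*.csv",       # placeholder URI
    "my_project.my_dataset.my_table",      # placeholder table
    job_config=job_config,
).result()  # wait for the load job to finish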

If your logs are in JSON, you can define a nullable field, and if it does not appear in a record, it will simply be loaded as null.
I am not sure how this works with CSV, but I believe you have to supply all fields (even if empty).
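As a rough sketch with the Python client (placeholder names): a record that omits a NULLABLE field simply loads with NULL in that column.
import io, json
from google.cloud import bigquery
client = bigquery.Client()
schema = [bigquery.SchemaField(f"col{i}", "STRING") for i in range(1, 8)]
schema.append(bigquery.SchemaField("col8", "STRING", mode="NULLABLE"))
records = [
    {"col1": "a", "col2": "b", "col3": "c", "col4": "d", "col5": "e", "col6": "f", "col7": "g", "col8": "extra"},
    {"col1": "a", "col2": "b", "col3": "c", "col4": "d", "col5": "e", "col6": "f", "col7": "g"},  # no col8 -> NULL
]
data = io.BytesIO("\n".join(json.dumps(r) for r in records).encode("utf-8"))
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)
client.load_table_from_file(data, "my_project.my_dataset.my_table", job_config=job_config).result()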

There is a possible solution here if you don't want to bother changing the CSV files (which would otherwise be my recommendation).
If the number of rows with an eighth column is fairly small and you can afford to "sacrifice" those rows, you can pass a maxBadRecords parameter with a reasonable value. In that case, all the "bad" rows (i.e. the ones not conforming to the 7-column schema) are ignored and simply not loaded.
If you are using BigQuery for statistical analysis and you can afford to ignore those rows, this could solve your problem.
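A minimal sketch of that approach with the Python client (placeholder names; the threshold is up to you):
from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[bigquery.SchemaField(f"col{i}", "STRING") for i in range(1, 8)],
    max_bad_records=100,  # rows not matching the 7-column schema are skipped, up to this count
)
client.load_table_from_uri(
    "gs://my-bucket/my-files-*.csv",      # placeholder URI
    "my_project.my_dataset.my_table",     # placeholder table
    job_config=job_config,
).result()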

Found a workable "hack".
I ran a job for each file with the seven-column schema and then another job on all files with the eight-column schema. One of the two jobs would complete successfully for each file, saving me from editing each file individually and re-uploading 1000+ files.
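Roughly, the same hack could be scripted like this with the Python client (placeholder names; the exception handling is deliberately coarse):
from google.cloud import bigquery
client = bigquery.Client()
seven = [bigquery.SchemaField(f"col{i}", "STRING") for i in range(1, 8)]
eight = seven + [bigquery.SchemaField("col8", "STRING", mode="NULLABLE")]
def load(uri, schema):
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV, schema=schema)
    # .result() raises if the load job fails, which is what drives the fallback below
    client.load_table_from_uri(uri, "my_project.my_dataset.my_table", job_config=job_config).result()
for uri in ["gs://my-bucket/file0001.csv"]:  # placeholder list of ~1000 URIs
    try:
        load(uri, seven)
    except Exception:
        load(uri, eight)  # this file has at least one 8-column row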

Related

A date time field precise enough to differentiate between rows on bulk/mass insert

I am using SSIS to insert 500 to 3+ million rows into various tables. The data source is anything from a flat CSV file to another DB (Oracle, MySQL, SQL Server).
I am trying to create an "inserted_on" column that shows the date/time stamp of when the row was added, and I need it to be precise enough to differentiate between the previous and next row. In other words, every row should have a different date/time value, even if it's really close.
I tried a datetime2(7) field with a default value of (gettime()) but that doesn't seem precise enough.
As described in this answer, you should use the timestamp (rowversion) data type, whose values are automatically generated and unique within the database.
See the linked documentation for additional details.
Hope this helps.

How to detect jagged rows in BigQuery?

We have a data issue where we'd like to take a backup of a particular Kind and determine which rows are "jagged", so effectively I'm trying to detect which rows are missing a certain column (meaning the field does not exist on that row, which I'm distinguishing from being null). Is there a way to do this in BigQuery?
From the docs (https://cloud.google.com/bigquery/docs/reference/v2/jobs):
configuration.load.allowJaggedRows (boolean) [Optional] Accept rows that are missing trailing optional columns. The missing values are treated as nulls. Default is false, which treats short rows as errors. Only applicable to CSV, ignored for other formats.
This means missing values from jagged rows are loaded as nulls, so after loading you cannot tell them apart from genuine nulls. If preserving that distinction is important, you might need a different approach, such as ingesting each whole row as a single string and parsing it inside BigQuery, when possible.
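One rough sketch of that "ingest the whole row" idea (Python client; placeholder names; assumes the backup is available as CSV, the tab character never occurs in the data, and fields contain no quoted commas):
from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",  # a delimiter that never appears, so each line lands in one STRING column
    schema=[bigquery.SchemaField("raw_line", "STRING")],
)
client.load_table_from_uri(
    "gs://my-bucket/backup-*.csv",        # placeholder URI
    "my_project.my_dataset.raw_rows",     # placeholder table
    job_config=job_config,
).result()
# Rows whose raw line has fewer than the expected number of fields are the jagged ones.
query = """
SELECT raw_line
FROM `my_project.my_dataset.raw_rows`
WHERE ARRAY_LENGTH(SPLIT(raw_line, ',')) < 8
"""
for row in client.query(query).result():
    print(row.raw_line)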

Yahoo finance API field misalignment

I'm trying to use the Yahoo Finance API to create a custom csv but depending upon the stock there is field misalignment.
For instance, if I just want to download the "k3" field for yahoo which corresponds to last trade size, I would craft the url like so:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=k3
However, if you download that csv there are two columns of data.
Similarly, if I decide to get Float Shares, I want the URL:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6
However, that gives me 3 columns. Is there a way to get it in exactly one column? I want to automate this process, but the differing column counts make it very difficult: different rows end up with different lengths, and I cannot easily match up a column name with each row.
Bonus: if someone can explain where the 3 float-share numbers come from, that would be great; I can only match the first one up to the site...
Thank you for your help!
In short, you are describing known bugs that Yahoo isn't going to fix, as the feed is officially unsupported.
Specifically regarding Float (f6): the number returned is the entire float; it is not 3 CSV values. The commas are not delimiters; they are thousands separators. (I suspect the same is the case with k3, as it is with a couple of other known fields; see the link below.)
Two solutions:
(1) Write your own workaround using conditional statements (if or case) in your code.
(2) Download the buggy parameters separately from the clean ones.
See: Yahoo's official reply to your question.
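As a sketch of option (2) in Python (the feed has since been retired, so treat this as illustrative only): fetch the buggy field on its own and strip the separators yourself rather than letting a CSV parser split on them.
import urllib.request
url = "http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6"  # float shares only
line = urllib.request.urlopen(url).read().decode("utf-8").strip()
# e.g. "887,675,000" -> 887675000: drop any quotes and the thousands separators
float_shares = int(line.replace('"', "").replace(",", ""))
print(float_shares)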
The multiple columns appear because Excel (or whatever CSV viewer you are using) treats the thousands separator as a column separator. We had the same problem in a school project and found a hack, which is only worth using if you are calling this API for a hobby project and are not concerned about data usage.
The idea is, instead of treating the result as a normal CSV, to pick a static column (call it column A) whose value you know beforehand (e.g. column 's', the stock symbol), or put that value as the first column. When constructing the query, use this column to surround the columns with the formatting problem (the float-style columns). Once you get quotes.csv, manually separate the results on the column A value.
For example, using
http://download.finance.yahoo.com/d/quotes.csv?s=yhoo&f=sf6sa5sb6
will get you
"YHOO", 887,675,000,"YHOO",400,"YHOO",N/A
Then use ,"YHOO", to separate the results (excluding the first column).
It is not an elegant way to solve the problem, but at least it gives you the correct result.
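A rough Python sketch of this hack (again, the feed is retired, so this is illustrative only):
import urllib.request
symbol = "YHOO"
url = f"http://download.finance.yahoo.com/d/quotes.csv?s={symbol.lower()}&f=sf6sa5sb6"
line = urllib.request.urlopen(url).read().decode("utf-8").strip()
# e.g. '"YHOO", 887,675,000,"YHOO",400,"YHOO",N/A'
parts = line.split(f',"{symbol}",')               # split on the sentinel symbol column
parts[0] = parts[0].split(f'"{symbol}",', 1)[1]   # drop the leading symbol field
f6_value, a5_value, b6_value = (p.strip() for p in parts)  # in the order requested (f6, a5, b6)
print(f6_value, a5_value, b6_value)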

Testing a CSV - how far should I go?

I'm generating a CSV which contains several rows and columns.
However, when testing said CSV I feel like I am simply repeating the code that builds the file, since the test checks that each and every field is correct.
Question is, is this more sensible than it seems to me, or is there a better way?
A far simpler test is to just import the CSV into a spreadsheet or database and verify that the data is aligned to the proper fields: no extra columns or extra rows, the data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, and so on.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as:
Field contains a comma (or whatever your separator character)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the row-delimiter (new-row) character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code is handling them properly.
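A sketch of one such round-trip test in Python (pytest-style; assumes the code under test reads and writes CSV compatibly with Python's csv module):
import csv, io
TRICKY_ROWS = [
    ["plain", "field, with comma", "a,b,,c,multiple,commas"],
    ["embedded\nnewline", 'quote " inside', "naïve non-ASCII"],
]
def write_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
def test_round_trip_preserves_tricky_fields():
    text = write_csv(TRICKY_ROWS)
    parsed = list(csv.reader(io.StringIO(text)))
    assert parsed == TRICKY_ROWS  # no extra rows, no shifted or mangled fields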

Full Text Searching for single characters

I have a table with a TEXT column whose contents are just strings of CSV numbers, for example ",1,76,77,115,". Each string can contain an arbitrary number of values.
I am trying to set up Full Text Indexing so that I can search this column rapidly. This works great. Instead of running queries with
where MY_COL LIKE '%,77,%' and MY_COL LIKE '%,115,%'
I can do
where CONTAINS(MY_COL,'77 and 115')
However, when I try to search for a single character it doesn't work.
where CONTAINS(MY_COL,'1')
But I know that there should be records returned! I quickly found that I need to edit the Noise file and rebuild the index. But even after doing that it still doesn't work.
Working with relational databases that way is going to hurt.
Use a proper schema. Either store the values in different rows or use an array datatype for the column.
That will make solving the problem trivial.
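For illustration, a small Python sketch of the normalization step (table and column names would be yours; the INSERTs themselves would go through whatever DB driver you use):
def explode_csv_column(row_id: int, packed: str):
    """',1,76,77,115,' -> [(row_id, 1), (row_id, 76), (row_id, 77), (row_id, 115)]"""
    return [(row_id, int(v)) for v in packed.split(",") if v.strip()]
print(explode_csv_column(42, ",1,76,77,115,"))
# With one value per row in a child table, "contains 77 and 115" becomes a plain
# join or IN query instead of a full-text or LIKE search over the packed string.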
I fixed my own problem, although I'm not exactly sure what fixed it.
I dropped my table and populated a new one (my program does batch processing) and created a new Full Text Index. Maybe I wasn't being patient enough to allow the indexing to fully rebuild.
Agreed. With full text, how does a row containing 12,15,33 not get returned by a search for 1? Use an actual table schema to accomplish this.