How to detect jagged rows in BigQuery? - google-bigquery

We have a data issue where we'd like to take a backup of a particular Kind and determine which rows are "jagged", so effectively I'm trying to detect which rows are missing a certain column (meaning the field does not exist on that row, which I'm distinguishing from being null). Is there a way to do this in BigQuery?

From the docs:
configuration.load.allowJaggedRows boolean [Optional] Accept rows
that are missing trailing optional columns. The missing values are
treated as nulls. Default is false which treats short rows as errors.
Only applicable to CSV, ignored for other formats.
https://cloud.google.com/bigquery/docs/reference/v2/jobs
This means missing values from jagged rows are simply treated as nulls, so after the load you can no longer tell a missing trailing column apart from an explicit null. If preserving that distinction is important, you might need a different approach, such as ingesting each whole row as a single string column and parsing it inside BigQuery, when possible.
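As an illustration of the parse-inside-BigQuery idea, here is a minimal sketch in standard SQL, assuming each raw CSV row was ingested into a hypothetical single-STRING-column table raw_lines(line) and the full schema has eight columns:

-- A row is jagged if it has fewer delimiters than the schema implies.
-- Note: naive splitting; quoted fields containing commas need real CSV parsing.
SELECT line
FROM mydataset.raw_lines
WHERE ARRAY_LENGTH(SPLIT(line, ',')) < 8;  -- 8 = expected column count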

Related

Split multiple points in text format and switch coordinates in postgres column

I have a PostgreSQL column of type text that contains data like shown below
(32.85563, -117.25624)(32.855470000000004, -117.25648000000001)(32.85567, -117.25710000000001)(32.85544, -117.2556)
(37.75363, -121.44142000000001)(37.75292, -121.4414)
I want to convert this into another column of type text like shown below
(-117.25624, 32.85563)(-117.25648000000001,32.855470000000004 )(-117.25710000000001,32.85567 )(-117.2556,32.85544 )
(-121.44142000000001,37.75363 )(-121.4414,37.75292 )
As you can see, the values inside the parentheses have been switched around. Also note that I have shown two records here to indicate that not all fields have the same number of parenthesized figures.
What I've tried
I tried extracting the column to Java and performing my operations there. But due to the sheer number of records I have, I will run out of memory. I also cannot do this method in batches due to time constraints.
What I want
A SQL query or a sequence of SQL queries that will achieve the result that I have mentioned above.
I am using PostgreSQL 9.4 with pgAdmin III as the client.
This is the type of problem that shouldn't really be solved in SQL, but you are lucky to be using Postgres.
I suggest the following steps in defining your algorithm.
The first part will turn your strings into structured data; the second will transform the structured data back into a string in the format you require.
From string to data
First, you need to turn your bracketed values into an array, which can be done with string_to_array function.
Now you can turn this array into rows with unnest function, which will return a row per bracketed value.
Finally you need to split the values in each row into two fields.
From data to string
You need to group the results of the first query, wrapping them in the string_agg function, which will combine the numbers in the rows back into a string.
You will need to experiment with brackets to achieve exactly what you want.
PS. I am not providing a query here. Once you have some code that you have tried, let me know.
Assuming you also have a PK or some unique column, and possibly other columns, you can do as follows:
SELECT id, (...), string_agg(point(pt[1], pt[0])::text, '') AS col_reversed
FROM (
  SELECT id, (...), unnest(string_to_array(replace(col, ')(', ');('), ';'))::point AS pt
  FROM my_table) sub
GROUP BY id; -- assuming id is the PK or there are no other columns
PostgreSQL has the point type which you can use here. First you need to make sure you can properly divide the long string into individual points (insert ';' between the parentheses), then turn that into an array of individual points in text format, unnest the array into individual rows, and finally cast those rows to the point data type:
unnest(string_to_array(replace(col, ')(', ');('), ';'))::point AS pt
You can then create a new point from the point you just created, but with the coordinates reversed, turn that into a string and aggregate into your desired output:
string_agg(point(pt[1], pt[0])::text, '') AS col_reversed
But you might also move away from the text format and make an array of point values as that will be easier and faster to work with:
array_agg(point(pt[1], pt[0])) AS pt_reversed
As I put in the question, I tried extracting the column to Java and performing my operations there. But due to the sheer number of records I have, I will run out of memory, and I also cannot do this method in batches due to time constraints.
I ran out of memory here because I was putting everything in a HashMap of <my_primary_key, the_newly_formatted_text>. As the text was sometimes very long, and given the sheer number of records I had, it wasn't surprising that I got an OOM.
Solution that I used:
As suggested by many folks here, this problem was better solved with code. I wrote a small script that formatted the text to my liking and wrote the primary key and the newly formatted text to a file in TSV format. Then I imported the TSV into a new table and updated the original table from the new one.
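The final update from the imported table might look something like this (a sketch; the staging table and column names are assumptions):

-- staging(id, col_reversed) was loaded from the TSV file.
UPDATE my_table AS t
SET col = s.col_reversed
FROM staging AS s
WHERE t.id = s.id;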

Flexible schema with Google Bigquery

I have around 1000 files that have seven columns. Some of these files have a few rows that have an eighth column (if there is data).
What is the best way to load this into BigQuery? Do I have to find and edit all these files to either
- add an empty eighth column in all files, or
- remove the eighth column from all files? I don't care about the value in this column.
Is there a way to specify eight columns in the schema and add a null value for the eighth column when there is no data available?
I am using BigQuery APIs to load data if that might help.
You can use the 'allowJaggedRows' argument, which will treat non-existent values at the end of a row as nulls. So your schema could have 8 columns, and in all of the rows that lack the eighth value, that column will be null.
This is documented here: https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowJaggedRows
I've filed a doc bug to make this easier to find.
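Once loaded that way, the rows that were short can be found by checking the trailing column for NULL (a sketch; table and column names are assumptions, and this cannot distinguish a short row from an explicitly empty eighth value):

SELECT COUNT(*) AS jagged_rows
FROM mydataset.mytable
WHERE col8 IS NULL;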
If your logs are in JSON, you can define a nullable field, and if it does not appear in the record, it would remain null.
I am not sure how it works with CSV, but I think that you have to have all fields (even empty).
There is a possible solution here if you don't want to worry about having to change the CSV values (which would be my recommendation otherwise).
If the number of rows with an eighth column is fairly small and you can afford to "sacrifice" those rows, then you can pass a maxBadRecords param with a reasonable number. In that case, all the "bad" rows (i.e. the ones not conforming to the seven-column schema) would be ignored and wouldn't be loaded.
If you are using bigquery for statistical information and you can afford to ignore those rows, it could solve your problem.
Found a workable "hack".
I ran a job for each file with the seven-column schema and then ran another job on all files with the eight-column schema. One of the two jobs would complete successfully, saving me the time of editing each file individually and re-uploading 1000+ files.

MySQL command to search CSV (or similar array)

I'm trying to write an SQL query that would search within a CSV (or similar) array in a column. Here's an example:
insert into properties set
  bedrooms = '1,2,3',  -- or '1-3'
  title = 'nice property',
  price = 500;
I'd like to then search where bedrooms = 2+. Is this even possible?
The correct way to handle this in SQL is to add another table for a multi-valued property. It's against the relational model to store multiple discrete values in a single column. Since it's intended to be a no-no, there's little support for it in the SQL language.
The only workaround for finding a given value in a comma-separated list is to use regular expressions, which are in general ugly and slow. You have to deal with edge cases like when a value may or may not be at the start or end of the string, as well as next to a comma.
SELECT * FROM properties WHERE bedrooms RLIKE '[[:<:]]2[[:>:]]';
There are other types of queries that are easy when you have a normalized table, but hard with the comma-separated list. The example you give, of searching for a value that is equal to or greater than the search criteria, is one such case. Also consider:
How do I delete one element from a comma-separated list?
How do I ensure the list is in sorted order?
What is the average number of rooms?
How do I ensure the values in the list are even valid entries? E.g. what's to prevent me from entering "1,2,banana"?
If you don't want to create a second table, then come up with a way to represent your data with a single value.
More accurately, I should say I recommend that you represent your data with a single value per column, and Mike Atlas' solution accomplishes that.
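A sketch of the second-table design recommended above (all names here are illustrative):

CREATE TABLE property_bedrooms (
  property_id INT NOT NULL,
  bedrooms INT NOT NULL,
  PRIMARY KEY (property_id, bedrooms)
);

-- "two or more bedrooms" then becomes an ordinary join:
SELECT DISTINCT p.*
FROM properties AS p
JOIN property_bedrooms AS pb ON pb.property_id = p.id
WHERE pb.bedrooms >= 2;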
Generally, this isn't how you should be storing data in a relational database.
Perhaps you should have MinBedroom and MaxBedroom columns. E.g., to find properties with two or more bedrooms:
SELECT * FROM properties WHERE MaxBedroom >= 2;

TSQL: No value instead of Null

Due to a weird request, I can't put null in a database if there is no value. I'm wondering what I can put in the stored procedure for nothing instead of null.
For example:
insert into blah (blah1) values (null)
Is there something like nothing or empty for "blah1" instead of using null?
I would push back on this bizarre request. That's exactly what NULL is for in SQL, to denote a missing or inapplicable value in a column.
Is the requester experiencing grief over SQL logic with NULL?
edit: Okay, I've read your reply with the extra detail about this job assignment (btw, generally you should edit your original question instead of posting more information in an answer).
You'll have to declare all columns as NOT NULL and designate a special value in the domain of that column's data type to signify "no value." The appropriate value to choose might be different on a case by case basis, i.e. zero may signify nothing in a person_age column, but it might have significance in an items_in_stock column.
You should document the no-value value for each column. But I suppose they don't believe in documentation either. :-(
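A minimal sketch of what that could look like (the column names and the chosen no-value sentinels are purely illustrative):

CREATE TABLE person (
    person_id      INT NOT NULL PRIMARY KEY,
    middle_name    VARCHAR(50) NOT NULL DEFAULT '',  -- '' stands in for "no middle name"
    person_age     INT NOT NULL DEFAULT 0,           -- 0 stands in for "age unknown"
    items_in_stock INT NOT NULL DEFAULT -1           -- -1, because 0 is meaningful here
);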
It depends on the data type of the column. For numbers (integers, etc.) it could be zero (0), but for a varchar it could be the empty string ('').
I agree with other responses that NULL is best suited for this because it transcends all data types denoting the absence of a value. Therefore, zero and empty string might serve as a workaround/hack but they are fundamentally still actual values themselves that might have business domain meaning other than "not a value".
(If only the SQL language supported a "Not Applicable" (N/A) value type that would serve as an alternative to NULL...)
Is null a valid value for whatever you're storing?
Use a sentinel value like INT32.MaxValue, the empty string, or "XXXXXXXXXX" and assume it will never be a legitimate value.
Add a bit column 'Exists' that you populate with true at the same time you insert.
Edit: But yeah, I'll agree with the other answers that trying to change the requirements might be better than trying to solve the problem.
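For the bit-column idea, a hedged sketch against the question's table (the flag name and sample value are assumptions):

-- Record explicitly whether a real value was supplied, instead of using NULL.
ALTER TABLE blah ADD blah1_exists BIT NOT NULL DEFAULT 0;

-- Inserts then set the flag alongside the value:
INSERT INTO blah (blah1, blah1_exists) VALUES (42, 1);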
If you're using a varchar or equivalent field, then use the empty string.
If you're using a numeric field such as int then you'll have to force the user to enter data, else come up with a value that means NULL.
I don't envy you your situation.
There's a difference between NULLs as assigned values (e.g. inserted into a column) and NULLs as a SQL artifact (as for a field of a missing record in an OUTER JOIN, which might be a foreign concept to these users; lots of people use Access, or any database, just to maintain single-table lists). I wouldn't be surprised if naive users would prefer an alternative for assignments, and though repugnant, it should work OK. Just let them use whatever they want.
There is some validity to the requirement to not use NULL values. NULL values can cause a lot of headache when they are in a field that will be included in a JOIN or a WHERE clause or in a field that will be aggregated.
Some SQL implementations (such as MSSQL) restrict how NULLable fields behave in indexes; a unique index there permits only a single NULL, for example.
MSSQL especially behaves in unexpected ways when NULL is evaluated for equality. Does a NULL value in a PaymentDue field mean the same as zero when we search for records that are up to date? What if we have names in a table and somebody has no middle name? It is conceivable that either an empty string or a NULL could be stored, but how do we then get a comprehensive list of people who have no middle name?
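For instance, once both representations can occur, a comprehensive middle-name query has to check for both (a sketch; table and column names are assumptions):

SELECT *
FROM people
WHERE middle_name IS NULL   -- never recorded
   OR middle_name = '';     -- recorded as "none"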
In general I prefer to avoid NULL values. If you cannot represent what you want to store using either a number (including zero) or a string (including the empty string as mentioned before) then you should probably look closer into what you are trying to store. Perhaps you are trying to communicate more than one piece of data in a single field.

Force numerical order on a SQL Server 2005 varchar column, containing letters and numbers?

I have a column containing the strings 'Operator (1)' and so on until 'Operator (600)' so far.
I want to get them numerically ordered and I've come up with
select colname from table order by
cast(replace(replace(colname,'Operator (',''),')','') as int)
which is very very ugly.
Better suggestions?
It's that, CharIndex()/Substring(), changing 'Operator (1)' to 'Operator (001)', storing the n in 'Operator (n)' separately, or creating a computed column that hides the ugly string manipulation. What you have seems fine.
If you really have to leave the data in the format you have - and adding a numeric sort order column is the better solution - then consider wrapping the text manipulation up in a user defined function.
select colname from table order by dbo.udfSortOperator(colname)
It's less ugly and gives you some abstraction. There's the additional overhead of the function call, but on a table containing low thousands of rows in a not-too-heavily-hit database server it's not a major concern. Make notes in the function to optimise later as required.
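Such a function might look like this, reusing the question's expression (a sketch; the parameter size is arbitrary):

CREATE FUNCTION dbo.udfSortOperator (@colname VARCHAR(100))
RETURNS INT
AS
BEGIN
    -- Strip the 'Operator (' prefix and ')' suffix, leaving the number.
    RETURN CAST(REPLACE(REPLACE(@colname, 'Operator (', ''), ')', '') AS INT);
END;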
My answer would be to change the problem. I would add an operatorNumber field to the table if that is possible. Change the update/insert routines to extract the number and store it. That way the string conversion hit is only once per record.
The ordering logic would require the string conversion every time the query is run.
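A hedged sketch of that one-time migration (the table name is an assumption):

-- Backfill once; the insert/update routines maintain it from then on.
ALTER TABLE operators ADD operatorNumber INT;
GO
UPDATE operators
SET operatorNumber = CAST(REPLACE(REPLACE(colname, 'Operator (', ''), ')', '') AS INT);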
Well, first define the meaning of that column. Is operator a name so you can justify using chars? Or is it a number?
If the field is a name then you will use chars, and you would want to determine a fixed length. Pad all operator names with zeros on the left, and define naming rules for operators (i.e. no letters, or codes in a series like "A001").
An index will sort the physical data on the server, and a properly defined naming scheme will sort them in a query. You would want both.
If the operator is a number, then you got the data type for that column wrong and it needs to be changed.
Indexed computed column
If you find yourself ordering on or otherwise querying the operator column often, consider creating a computed column for its numeric value and adding an index for it. This will give you a computed, persisted column (which sounds like an oxymoron, but isn't).
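A sketch using the question's expression (the table name is an assumption):

-- PERSISTED stores the computed value, which is what allows indexing it.
ALTER TABLE operators
ADD operatorNumber AS CAST(REPLACE(REPLACE(colname, 'Operator (', ''), ')', '') AS INT) PERSISTED;

CREATE INDEX IX_operators_operatorNumber ON operators (operatorNumber);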