AVG is not taking null values into consideration - google-bigquery

I have loaded the following test data:
name, age,gender
"John", 33,m
"Sam", 33,m
"Julie",33,f
"Jimbo",, m
with schema: name:STRING,age:INTEGER,gender:STRING and I have confirmed that the Jimbo row shows a null for column "age" in the BigQuery Browser Tool > mydataset > Details > Preview section.
When I run this query :
SELECT AVG(age) FROM [peterprivatedata.testpeople]
I get 24.75 which is incorrect. I expected 33 because the documentation for AVG says "Rows with a NULL value are not included in the calculation."
Am I doing something wrong or is this a known bug? (I don't know if there's a public issues list to check). What's the simplest workaround to this?

This is a known bug where we coerce null numeric values to to 0 on import. We're currently working on a fix. These values do however, show up as not not defined (which for various reasons is different from null), so you can check for IS_EXPLICITLY_DEFINED. For example:
SELECT sum(if(is_explicitly_defined(numeric_field), numeric_field, 0)) /
sum(if(is_explicitly_defined(numeric_field), 1, 0))
AS my_avg FROM your_table
Alternately, you could use another column to represent is_null. Then the query would look like:
SELECT sum(if(numeric_field_is_null, 0, numeric_field)) /
sum(if(numeric_field_is_null, 0, 1))
AS my_avg FROM your_table

Related

SQL 2-table query results need to be concatenated

My current SQL:
SELECT B.MESSAGENO, B.LINENO, B.LINEDATA
FROM BILL.MESSAGE AS B, BILL.ACTIVITYAS A
WHERE A.MSGNO = D.MESSAGENO AND A.FUPTEAM = 'DBWB'
AND A.ACTIVITY = 'STOPPAY' AND A.STATUS = 'WAIT'
AND A.COMPANY = D.COMPANY
MESSAGENO LINENO LINEDATA
1234567 1 CHEQUE NO : 9999999 RUN NO : 55555
1234567 2 DATE ISSUED: 12/25/2020 AMOUNT : 710.51
1234567 3 PAYEE : LASTNAME, FIRSTNAME
1234567 4 ACCOUNT NO : 12345-67890
1234567 5 USER : USERNAME
there are 550 sets of 5 lines per MESSAGENO
What I am trying to figure out is how I can get something like where LINENO = 1, concatenate LINEDATA so I just get 9999999 as checkno, where LINENO = 2, concatenate LINEDATA so I get 710.51 as amount, where LINENO = 3, concatenate LINEDATA so I get LASTNAME, FIRSTNAME as payee, where LINENO = 4, concatenate LINEDATA so I get LASTNAME, FIRSTNAME as payee, and lastly, the same thing for USERNAME.
I just cannot seems to conceptualize this. Every time I try, my brain starts turning into macaroni. Any help is appreciated.
UPDATED ANSWER, extracts all fields from stored strings:
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=e22d26866053ea6088aa78dc23c4d809
Check this fiddle.
It uses a SUBSTRING_INDEX in the internal query to split the fields at the : or by a combination of : and " ". I used the two spaces because I wasn't sure what your actual whitespace was, and when I copied the data out of your post it was just spaces.
Then MAX is used in the outer query to get everything on one line, grouping by the messageNo. Since some lines have two pieces of data to extract, a second string parser was added. Here is the code from the fiddle, for reference. In this case, the table test was created from the output of OP's initial query, since I didn't have enough info to build both tables and completely recreate his output.
SELECT MESSAGENO,
MAX(if(LINENO = 1, extractFirst, null)) as checkNo,
MAX(if(LINENO = 1, extractLast, null)) as runNo,
MAX(if(LINENO = 2, extractFirst, null)) as issued,
MAX(if(LINENO = 2, extractLast, null)) as amount,
MAX(if(LINENO = 3, extractFirst, null)) as payee,
MAX(if(LINENO = 4, extractFirst, null)) as accountNo,
MAX(if(LINENO = 5, extractFirst, null)) as username
FROM (
SELECT MESSAGENO, LINENO,
trim(substring_index(substring_index(LINEDATA, ": ", -2), " ", 1)) as extractFirst,
trim(substring_index(LINEDATA, ":", -1)) as extractLast
FROM test
) z
GROUP BY MESSAGENO
Again, you will be much better off to alter your tables so that you can use simpler queries, as shown in the last fiddle.
===============================================
ORIGINAL ANSWER (demo of string parsing, suggestion for data model change)
You can achieve this with some string parsing BUT ABSOLUTELY DO NOT unless you have no choice. The reason you are having trouble is because your data shouldn't be stored this way.
I've included a fiddle incorporating this case statement and substring_index to extract the data. I have assumed mySQL 8 because you didn't specify SQL version.
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=068a49a2c819c08018691e54bcdf91e5
case when LINENO = 1 then trim(substring_index(substring_index(LINEDATA, "RUN NO", 1), ":", -1))
else trim(substring_index(LINEDATA, ":", -1)) end
as LDATA
See this fiddle for the full statement. I have just inserted the data from your join into a test table, instead of trying to recreate all your tables, since I don't have access to all the data I would need for that. In future, set up a fiddle like this one with some representative data and the SQL version, and it will be easier for people to help you.
=========================================
I think this is a better layout for you, with all data stored as the proper type and a field defined for each one and the extra text stripped out:
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=11d52189b005740cdc53175466374635

Extracting Values from Array in Redshift SQL

I have some arrays stored in Redshift table "transactions" in the following format:
id, total, breakdown
1, 100, [50,50]
2, 200, [150,50]
3, 125, [15, 110]
...
n, 10000, [100,900]
Since this format is useless to me, I need to do some processing on this to get the values out. I've tried using regex to extract it.
SELECT regexp_substr(breakdown, '\[([0-9]+),([0-9]+)\]')
FROM transactions
but I get an error returned that says
Unmatched ( or \(
Detail:
-----------------------------------------------
error: Unmatched ( or \(
code: 8002
context: T_regexp_init
query: 8946413
location: funcs_expr.cpp:130
process: query3_40 [pid=17533]
--------------------------------------------
Ideally I would like to get x and y as their own columns so I can do the appropriate math. I know I can do this fairly easy in python or PHP or the like, but I'm interested in a pure SQL solution - partially because I'm using an online SQL editor (Mode Analytics) to plot it easily as a dashboard.
Thanks for your help!
If breakdown really is an array you can do this:
select id, total, breakdown[1] as x, breakdown[2] as y
from transactions;
If breakdown is not an array but e.g. a varchar column, you can cast it into an array if you replace the square brackets with curly braces:
select id, total,
(translate(breakdown, '[]', '{}')::integer[])[1] as x,
(translate(breakdown, '[]', '{}')::integer[])[2] as y
from transactions;
You can try this :
SELECT REPLACE(SPLIT_PART(breakdown,',',1),'[','') as x,REPLACE(SPLIT_PART(breakdown,',',2),']','') as y FROM transactions;
I tried this with redshift db and this worked for me.
Detailed Explanation:
SPLIT_PART(breakdown,',',1) will give you [50.
SPLIT_PART(breakdown,',',2) will give you 50].
REPLACE(SPLIT_PART(breakdown,',',1),'[','') will replace the [ and will give just 50.
REPLACE(SPLIT_PART(breakdown,',',2),']','') will replace the ] and will give just 50.
Know its an old post.But if someone needs a much easier way
select json_extract_array_element_text('[100,101,102]', 2);
output : 102

Handling null or missing attributes in JSON using PostgreSQL

I'm learning how to handle JSON in PostgreSQL.
I have a table with some columns. One column is a JSON field. The data in that column has at least these three variations:
Case 1: {"createDate": 1448067864151, "name": "world"}
Case 2: {"createDate": "", "name": "hello"}
Case 3: {"name": "sky"}
Later on, I want to select the createDate.
TO_TIMESTAMP((attributes->>'createDate')::bigint * 0.001)
That works fine for Case 1 when the data is present and it is convertible to a bigint. But what about when it isn't? How do I handle this?
I read this article. It explains that we can add check constraints to perform some rudimentary validation. Alternatively, I could do a schema validation before the data is inserts (on the client side). There are pros and cons with both ideas.
Using a Check Constraint
CONSTRAINT validate_createDate CHECK ((attributes->>'createDate')::bigint >= 1)
This forces a non-nullable field (Case 3 fails). But I want the attribute to be optional. Furthermore, if the attribute doesn't convert to a bigint because it is blank (Case 2), this errors out.
Using JSON schema validation on the client side before insert
This works, in part, because the schema validation makes sure that what data comes in conforms to the schema. In my case, I can control which clients access this table, so this is OK. But it doesn't matter for the SQL later on since my validator will let pass all three cases.
Basically, you need to check if createDate attribute is empty:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp((attributes->>'createDate')::bigint * 0.001) FROM data
WHERE
(attributes->>'createDate') IS NOT NULL
AND
(attributes->>'createDate') != '';
Output:
to_timestamp
----------------------------
2015-11-20 17:04:24.151-08
(1 row)
Building on Dmitry's answer, you can also check the json type with the json_typeof function. Note the json operator: -> to get json instead of the ->> operator which always casts the value to string.
By doing the check in the SELECT with a CASE conditional instead of in the WHERE clause, we also keep the rows not having a createdDate. Depending on your usecase, this might be better.
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT
CASE WHEN (json_typeof(attributes->'createDate') = 'number')
THEN to_timestamp((attributes->>'createDate')::bigint * 0.001)
END AS created_date
FROM data
;
Output:
created_date
----------------------------
"2015-11-21 02:04:24.151+01"
""
""
(3 rows)

SQLite: decrement value if exists AND if not 0 else set default value

I'm a totally newbie in SQLite and also SQL is not my strength, so i could need help. I have a table like this:
--------------------------------------
try
--------------------------------------
id:INTEGER PRIMARY KEY AUTO_INCREMENT,
ip:VARCHAR(30),
number:INTEGER DEFAULT 10,
bdate:DATE
For what do i need this kind of table? I want to "ban" a person for 24 hours, if he gets 10 times false credential error. I just want to decrement the attribute number. So if the person types false logging information for first time, a new entry should be create with 10 as default number. This is what i have so far:
INSERT OR REPLACE INTO try(id, ip, number, bdate) VALUES(COALESCE((SELECT id FROM try WHERE ip="172.16.1.1"), NULL),"172.16.1.1" ,{DON'T KNOW WHAT TO FILL UP}, 06092013);
I'm almost done, except de decrement of the value. I thought about something like that:
COALESCE((SELECT number FROM try WHERE ip="172.16.1.1"), 10)
Any idea how to decrement the result from the selection only if its NOT 0(like the SET method from SQL)? If it's 0, a new statement should be done(like add new entry to table ban ip).
EDIT:
I got it with the decrement. Look my answer for the solution.
Now i want to use a if else statement. I googled and found the CASE WHEN statement, but i don't know how to use it. What do I do wrong?
SELECT number FROM try WHERE ip="172.16.1.1" AS number,
CASE WHEN number > 0
THEN
(INSERT OR REPLACE INTO try(id, ip, number, bdate) VALUES(COALESCE((SELECT id FROM try WHERE ip="172.16.1.1"), NULL),"172.16.1.1" ,COALESCE(number-1, 10), 06092013)
ELSE
{DO SOMETHING ELSE}
END;
I got it with the decrement on my own:
INSERT OR REPLACE INTO try(id, ip, number, bdate) VALUES(COALESCE((SELECT id FROM try WHERE ip="172.16.1.1"), NULL),"172.16.1.1" ,COALESCE((SELECT number FROM try WHERE ip="172.16.1.1")-1, 10), 06092013;
SQLite is an embedded database and is not designed to do program logic directly in SQL.
There is no IF/ELSE statement, or any other statement that behaves like that.
You should do the logic in whatever language you are using to access the database:
db.execute("BEGIN")
cursor = db.query("SELECT id, number FROM try WHERE ip = '172.16.1.1'")
if cursor.empty or cursor.number = 0:
db.execute("INSERT INTO try VALUES(NULL, '172.16.1.1', 10, '2013-09-06')")
else:
db.execute("UPDATE try SET number = ? WHERE id = ?",
[cursor.number - 1, cursor.id])
db.execute("COMMIT")
If you really want to do this in an incomprehensible SQL statement, try this:
INSERT OR REPLACE INTO try(id, ip, number, bdate)
VALUES ((SELECT id
FROM try
WHERE ip = '172.16.1.1'
AND number > 0),
'172.16.1.1',
COALESCE((SELECT number
FROM try
WHERE ip = '172.16.1.1'
AND number > 0
) - 1,
10)
'2013-09-06'
)

Logical operator sql query going weirdly wrong

I am trying to fetch results from my sqlite database by providing a date range.
I have been able to fetch results by providing 3 filters
1. Name (textfield1)
2. From (date)(textfield2)
3. To (date)(textfield3)
I am inserting these field values taken from form into a table temp using following code
Statement statement6 = db.createStatement("INSERT INTO Temp(date,amount_bill,narration) select date,amount,narration from Bills where name=\'"+TextField1.getText()+"\' AND substr(date,7)||substr(date,4,2)||substr(date,1,2) <= substr (\'"+TextField3.getText()+"\',7)||substr (\'"+TextField3.getText()+"\',4,2)||substr (\'"+TextField3.getText()+"\',1,2) AND substr(date,7)||substr(date,4,2)||substr(date,1,2) >= substr (\'"+TextField2.getText()+"\',7)||substr (\'"+TextField2.getText()+"\',4,2)||substr (\'"+TextField2.getText()+"\',1,2) ");
statement6.prepare();
statement6.execute();
statement6.close();
Now if i enter the following input in my form for the above filters
1.Ricky
2.01/02/2012
3.28/02/2012
It fetches date between these date ranges perfectly.
But now i want to insert values that are below and above these 2 date ranges provided.
I have tried using this code.But it doesnt show up any result.I simply cant figure where the error is
The below code is to find entries having date lesser than 01/02/2012 and greater than 28/02/2012.
Statement statementVII = db.createStatement("INSERT INTO Temp5(date,amount_rec,narration) select date,amount,narration from Bills where name=\'"+TextField1.getText()+"\' AND substr(date,7)||substr(date,4,2)||substr(date,1,2) < substr (\'"+TextField2.getText()+"\',7)||substr (\'"+TextField2.getText()+"\',4,2)||substr (\'"+TextField2.getText()+"\',1,2) AND substr(date,7)||substr(date,4,2)||substr(date,1,2) > substr (\'"+TextField3.getText()+"\',7)||substr (\'"+TextField3.getText()+"\',4,2)||substr (\'"+TextField3.getText()+"\',1,2)");
statementVII.prepare();
statementVII.execute();
statementVII.close();
Anyone sound on this,please guide.Thanks.
you need to use an Or clause together with brackets:
WHERE name='....' AND (yourDateField<yourLowerDate OR yourDateField>yourHigherDate)