pyspark count with condition with selectExpr - dataframe

I have a DataFrame with a column "age" and I want to count how many rows have age = 60, for example. I know how to solve this using select or df.count(), but I want to use selectExpr.
I tried
customerDfwithAge.selectExpr("count(when(col(age) = 60))")
but it returns
Undefined function: 'col'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.;
If I remove col, it returns
Invalid arguments for function when; line 1 pos 6
What is wrong?

If you want to use selectExpr, you need to provide a valid SQL expression.
when() and col() are functions from pyspark.sql.functions, not SQL expressions.
In your case, you should try:
customerDfwithAge.selectExpr("sum(case when age = 60 then 1 else 0 end)")
Bear in mind that I am using sum, not count. count counts every non-NULL value, and both 0 and 1 are non-NULL, so it would simply return the total number of rows in your DataFrame.
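The difference between sum and count here is ordinary SQL behaviour rather than anything Spark-specific. As a rough sketch, the following uses Python's built-in sqlite3 module as a stand-in for Spark SQL; the table name customers and the sample ages are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Spark SQL;
# the table name and the sample ages are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(60,), (25,), (60,), (41,)])

# sum adds up the 1s, so it counts only the matching rows.
sum_result = conn.execute(
    "SELECT sum(CASE WHEN age = 60 THEN 1 ELSE 0 END) FROM customers"
).fetchone()[0]

# count counts every non-NULL value, and both 0 and 1 are non-NULL.
count_result = conn.execute(
    "SELECT count(CASE WHEN age = 60 THEN 1 ELSE 0 END) FROM customers"
).fetchone()[0]

# Without ELSE, non-matching rows yield NULL, which count skips.
count_no_else = conn.execute(
    "SELECT count(CASE WHEN age = 60 THEN 1 END) FROM customers"
).fetchone()[0]

print(sum_result, count_result, count_no_else)  # 2 4 2
```

The last query suggests that count can also work, provided the CASE has no ELSE branch.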

Related

count number of times a regex pattern occurs in hive

I have a string variable stored in hive as follows
stringvar
AA1,BB3,CD4
AA12,XJ5
I would like to count (and filter on) how many times the regex pattern \w\w\d occurs. In the example, the first row obviously contains three such occurrences. How can I do that without resorting to lateral views and explosions of stringvar (too expensive)?
Thanks!
You can split the string by the pattern and take the size of the resulting array minus 1.
Demo:
select size(split('AA1,BB3,CD4','\\w\\w\\d'))-1 --returns 3
select size(split('AA12,XJ5','\\w\\w\\d'))-1 --returns 2
select size(split('AAxx,XJx','\\w\\w\\d'))-1 --returns 0
select size(split('','\\w\\w\\d'))-1 --returns 0
If the column is nullable, then special care should be taken. For example, like this (depending on what you need returned in the NULL case):
select case when col is null then 0
else size(split(col,'\\w\\w\\d'))-1
end
Or simply convert NULL to an empty string using the NVL function:
select size(split(NVL(col,''),'\\w\\w\\d'))-1
The solution above is the most flexible one: you can count the number of occurrences and use it for complex filtering, joins, etc.
If you only need to filter records with a fixed number of pattern occurrences, or with at least a fixed number, and do not need the exact count, then a simple RLIKE without splitting is the cheapest method.
For example, to check for at least 2 repeats:
select 'AA1,BB3,CD4' rlike('\\w\\w\\d+,\\w\\w\\d+') --returns true, can be used in WHERE
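The split-and-count trick above can be checked outside Hive. The sketch below reproduces it with Python's re module, on the assumption that re.split keeps leading and trailing empty pieces the same way the Hive demo results imply; the helper name count_matches is hypothetical.

```python
import re

PATTERN = r"\w\w\d"

def count_matches(value):
    """Count pattern occurrences as (number of split pieces) - 1; None counts as 0."""
    if value is None:
        return 0
    return len(re.split(PATTERN, value)) - 1

# Same inputs as the Hive demo, plus None standing in for a NULL column value.
for sample in ["AA1,BB3,CD4", "AA12,XJ5", "AAxx,XJx", "", None]:
    print(repr(sample), count_matches(sample))
```

Each match separates two pieces, which is why the number of pieces is always one more than the number of matches.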

How to get malformed or string type data from a numeric column in hive?

I have a column id (data type integer) containing the following records:
1
2
NULL
x
y
As Hive automatically converts x and y into NULL, I'm first casting the id column to a string. Now I want count(id) where id is not made up of digits [0-9] and is also not NULL. In my case, the count should be 2, but it is not working with x and y. I am also getting the NULLs counted, so in my example I get 3.
I have tried using LIKE, RLIKE, and also regexp_extract(id,'\&q=([^\&]+).
Can someone suggest how to achieve this?
I tried something similar and it is working for me. I created an external table with your data:
CREATE EXTERNAL TABLE temp_count (count STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LOCATION 'user/$username/data'
Now I am running a query like this:
(Edited)
select count(*) from (select (count - count) as value from temp_count where count != 'NULL')q1 where value is NULL;
and I am getting 2 as the output.
Let me know if I am missing something here
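The query above relies on arithmetic over a non-numeric string evaluating to NULL. As a way to sanity-check the expected answer, here is the same filtering logic sketched in plain Python; the float() call only approximates Hive's implicit string-to-number cast, and the helper name is_numeric is made up for this example.

```python
def is_numeric(text):
    """Approximate Hive's implicit string-to-number cast (an assumption of this sketch)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# The sample column from the question, read as strings.
ids = ["1", "2", "NULL", "x", "y"]

# Skip the literal 'NULL' marker, then keep only values that are not numeric.
malformed = [v for v in ids if v != "NULL" and not is_numeric(v)]

print(len(malformed), malformed)  # 2 ['x', 'y']
```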

Avoid division by zero in PostgreSQL

I'd like to perform division in a SELECT clause. When I join some tables and use an aggregate function, I often have either NULL or zero values as the divisors. So far I have only come up with this method of avoiding division by zero and NULL values.
(CASE(COALESCE(COUNT(column_name),1)) WHEN 0 THEN 1
ELSE (COALESCE(COUNT(column_name),1)) END)
I wonder if there is a better way of doing this?
You can use the NULLIF function, e.g.
something/NULLIF(column_name,0)
If the value of column_name is 0, the result of the entire expression will be NULL.
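To see what NULLIF evaluates to, here is a small sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. One caveat: SQLite returns NULL for a plain division by zero instead of raising an error, so the example shows the values NULLIF produces rather than the error it prevents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULLIF(x, 0) is NULL when x equals 0 and x otherwise;
# dividing by NULL yields NULL instead of failing.
row = conn.execute(
    "SELECT NULLIF(0, 0), NULLIF(4, 0), 10.0 / NULLIF(0, 0), 10.0 / NULLIF(4, 0)"
).fetchone()

print(row)  # (None, 4, None, 2.5)
```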
Since count() never returns NULL (unlike other aggregate functions), you only have to catch the 0 case (which is the only problematic case anyway). So, your query simplified:
CASE count(column_name)
WHEN 0 THEN 1
ELSE count(column_name)
END
Or simpler yet, with NULLIF(), as Yuriy suggested.
Quoting the manual about aggregate functions:
It should be noted that except for count, these functions return a
null value when no rows are selected.
I realize this is an old question, but another solution would be to make use of the greatest function:
greatest( count(column_name), 1 ) -- NULL and 0 are valid argument values
Note:
My preference would be to either return NULL, as in Erwin's and Yuriy's answers, or to solve this logically by detecting that the value is 0 before the division and returning 0. Otherwise, the data may be misrepresented by using 1.
Another way to avoid division by zero is to replace 0 with 1:
select column + (column = 0)::integer;
If you want the divider to be 1 when the count is zero:
count(column_name) + 1 * (count(column_name) = 0)::integer
The cast from true to integer is 1.
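The three "use 1 when the count is 0" variants from the answers above can be compared side by side. This is only a sketch on SQLite via Python's sqlite3 module, with two assumptions stated plainly: SQLite's scalar max() stands in for PostgreSQL's greatest(), and the ::integer cast is dropped because a comparison in SQLite already evaluates to 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

query = """
SELECT CASE :c WHEN 0 THEN 1 ELSE :c END,  -- CASE form
       max(:c, 1),                          -- scalar max() in place of greatest()
       :c + (:c = 0)                        -- comparison is already 0 or 1 in SQLite
"""

# :c plays the role of count(column_name); try a zero and two non-zero counts.
results = {c: conn.execute(query, {"c": c}).fetchone() for c in (0, 1, 7)}
print(results)  # {0: (1, 1, 1), 1: (1, 1, 1), 7: (7, 7, 7)}
```

All three agree for non-negative counts, so the choice is mostly a matter of readability.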

Problems with Postgresql CASE syntax

Here is my SQL query:
SELECT (CASE (elapsed_time_from_first_login IS NULL)
WHEN true THEN 0
ELSE elapsed_time_from_first_login END)
FROM (
SELECT (now()::ABSTIME::INT4 - min(AcctStartTime)::ABSTIME::INT4)
FROM radacct
WHERE UserName = 'test156') AS elapsed_time_from_first_login;
When I execute the above query, I get this error:
ERROR: CASE types record and integer cannot be matched
From the error message I understand that PostgreSQL takes the result of the second SELECT, i.e. elapsed_time_from_first_login, as a row, even though it will always be a single value (because of the min() function).
Question: do you have any suggestions on how to deal with this query?
I suppose, what you are actually trying to do should look like this:
SELECT COALESCE((SELECT now() - min(acct_start_time)
FROM radacct
WHERE user_name = 'test156')
, interval '0s')
While there is an aggregate function in the top SELECT list of the subselect, it cannot return "no row". The aggregate function min() converts "no row" to NULL, and the simple form below also does the trick.
Other problems with your query have already been pointed out. But this is the much simpler solution. It returns an interval rather than an integer.
Convert to integer
Simplified with input from artaxerxe.
Simple form does the job without check for "no row":
SELECT COALESCE(EXTRACT(epoch FROM now() - min(acct_start_time))::int, 0)
FROM radacct
WHERE user_name = 'test156';
Details about EXTRACT(epoch FROM INTERVAL) in the manual.
Aggregate functions and NULL
If you had used the aggregate function count() instead of min(), the outcome would be different. count() is a special case among standard aggregate functions in that it never returns NULL. If no value (or row) is found, it returns 0 instead.
The manual on aggregate functions:
It should be noted that except for count, these functions return a
null value when no rows are selected. In particular, sum of no rows
returns null, not zero as one might expect, and array_agg returns
null rather than an empty array when there are no input rows. The
coalesce function can be used to substitute zero or an empty array for
null when necessary.
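The behaviour described in the manual excerpt is easy to observe. The sketch below uses Python's sqlite3 module rather than PostgreSQL, on the assumption that SQLite treats these aggregates the same way over an empty input; the column type and the inserted row are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE radacct (user_name TEXT, acct_start_time INTEGER)")
conn.execute("INSERT INTO radacct VALUES ('someone_else', 1700000000)")

# No row matches 'test156', so every aggregate sees an empty input.
row = conn.execute(
    """
    SELECT count(acct_start_time),
           min(acct_start_time),
           sum(acct_start_time),
           coalesce(min(acct_start_time), 0)
    FROM radacct
    WHERE user_name = 'test156'
    """
).fetchone()

print(row)  # (0, None, None, 0)
```

Only count returns 0 by itself; min and sum need coalesce to turn the NULL into 0.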
Postgres is complaining that 0 and elapsed_time_from_first_login are not the same type.
Try this (also simplifying your select):
select
coalesce(elapsed_time_from_first_login::INT4, 0)
from ...
Here is how I reformatted the SQL, and now it is working:
SELECT coalesce(result, 0)
FROM (SELECT (now()::ABSTIME::INT4 - min(AcctStartTime)::ABSTIME::INT4) as result
FROM radacct WHERE UserName = 'test156') as elapsed_time_from_first_login;
The second SELECT is returning a table, named elapsed_time_from_first_login with one column and one row. You have to alias that column and use it in the CASE clause. You can't put a whole table (even if it is one column, one row only) where a value is expected.
SELECT (CASE (elapsed_time IS NULL)
WHEN true THEN 0
ELSE elapsed_time end)
FROM (SELECT (now()::ABSTIME::INT4 - min(AcctStartTime)::ABSTIME::INT4)
AS elapsed_time -- column alias
FROM radacct
WHERE UserName = 'test156'
) as elapsed_time_from_first_login; -- table alias
and you can shorten the CASE by using the COALESCE() function (and optionally add an alias for that column to be shown in the results):
SELECT COALESCE(elapsed_time, 0)
AS elapsed_time
FROM (SELECT (now()::ABSTIME::INT4 - min(AcctStartTime)::ABSTIME::INT4)
AS elapsed_time
FROM radacct
WHERE UserName = 'test156'
) as elapsed_time_from_first_login; -- table alias

SQLite equivalent of PostgreSQL's GREATEST function

PostgreSQL has a useful function called GREATEST. It returns the largest value of those passed to it as documented here.
Is there any equivalent in SQLite?
As a note, I only need it to work with 2 arguments.
SELECT MAX(1,2,..)
ref: https://sqlite.org/lang_corefunc.html#maxoreunc
max(X,Y,...)
The multi-argument max() function returns the argument with the maximum value, or return NULL if any argument is NULL. The multi-argument max() function searches its arguments from left to right for an argument that defines a collating function and uses that collating function for all string comparisons. If none of the arguments to max() define a collating function, then the BINARY collating function is used. Note that max() is a simple function when it has 2 or more arguments but operates as an aggregate function if given only a single argument.
Passing a second value, as in MAX(value1, value2), is the equivalent.
Example:
UPDATE products SET Quantity = MAX(Quantity - #value, 0)...
If (Quantity - #value) is negative, MAX(..., 0) returns 0, because 0 is greater than -1, -2, and so on.
Likewise, MAX(..., 1) returns 1 whenever (Quantity - #value) is 0 or negative.
If either Quantity or #value could be NULL, use the combination IFNULL(MAX(Quantity-#value,0),0).
IFNULL(..., 0) returns its second argument if the first one is NULL.
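A quick way to confirm this behaviour is to run the expressions through Python's built-in sqlite3 module. The literal numbers below are arbitrary stand-ins for Quantity and the subtracted value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

row = conn.execute(
    """
    SELECT max(5 - 7, 0),             -- negative result is clamped to 0
           max(5 - 2, 0),             -- positive result passes through
           max(NULL, 0),              -- any NULL argument makes the result NULL
           ifnull(max(NULL, 0), 0)    -- IFNULL substitutes 0 for that NULL
    """
).fetchone()

print(row)  # (0, 3, None, 0)
```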