PySpark question: making count results into a dataframe - pandas

I have PySpark code that looks like this:
spark.sql("select count(*) from student_table where student_id is NULL")
spark.sql("select count(*) from student_table where student_scores is NULL")
spark.sql("select count(*) from student_table where student_health is NULL")
I get results that look like:
+--------+
|count(1)|
+--------+
|       0|
+--------+

+--------+
|count(1)|
+--------+
|     100|
+--------+

+--------+
|count(1)|
+--------+
|   24145|
+--------+
What I want to do is turn these results into a single dataframe, using pandas or PySpark, with one null-count result per column.
Thanks in advance if someone can help me out.

You could use union between the 3 queries, but you can actually get all null counts for each column in one query:
spark.sql("""
SELECT SUM(INT(student_id IS NULL)) AS student_id_nb_null,
SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
SUM(INT(student_health IS NULL)) AS student_health_nb_null,
FROM student_table
""").show()
#+------------------+----------------------+----------------------+
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
#+------------------+----------------------+----------------------+
#|                 0|                   100|                 24145|
#+------------------+----------------------+----------------------+
Or by using DataFrame API with:
import pyspark.sql.functions as F
df.agg(
    F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
    F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
    F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)
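Since the question mentions pandas: either version can be brought into pandas with .toPandas(), which is safe here because the aggregation yields a single row. A minimal sketch reusing the SQL above:
# Collect the one-row aggregate into a pandas DataFrame on the driver.
null_counts_pdf = spark.sql("""
    SELECT SUM(INT(student_id IS NULL)) AS student_id_nb_null,
           SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
           SUM(INT(student_health IS NULL)) AS student_health_nb_null
    FROM student_table
""").toPandas()
print(null_counts_pdf)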

Use union all to combine all your queries in one spark.sql call.
Example:
spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()
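With the null counts from the question, show() would print something like:
+--------------+-----------+
|   column_name|null_result|
+--------------+-----------+
|    student_id|          0|
|student_scores|        100|
|student_health|      24145|
+--------------+-----------+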

Related

How to read a .txt file from S3 and use the resulting dataframe as a SQL query in pyspark

I have a txt file saved in an S3 folder. The content of that text file looks like:
select
fiscal year,
fiscal quarter,
refresh time,
...
from table 1
where condition1 and condition2.
I can use the following code to read that file:
hist_sql = spark.read.text('s3://team-test/history/sql/his.txt')
but the file is read into a Spark dataframe.
Instead, I want to convert that 'hist_sql' dataframe to a SQL query string like the one below,
sql="""
select
fiscal year,
fiscal quarter,
refresh time,
...
from table 1
where condition1 and condition2
"""
so I can feed that SQL query to the following code and query a Redshift database:
df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", jdbcUrl) \
    .option("query", sql) \
    .option("forward_spark_s3_credentials", True) \
    .load()
Can someone show the code to do it? Thanks.
Try this, hope it works for you:
sql="""
select
fiscal year,
fiscal quarter,
refresh time,
...
from table 1
where condition1 and condition2
"""
# the connector takes either "dbtable" or "query", not both; we pass "query" here
df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<redshift_host>:<port>/<dbname>") \
    .option("user", "<username>") \
    .option("password", "<password>") \
    .option("query", sql) \
    .option("tempdir", "s3a://<location>") \
    .load()
df.show()
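Note that the question actually asks how to turn the hist_sql dataframe into a query string. A minimal sketch, assuming the file is small enough to collect to the driver:
# spark.read.text() gives one row per line in a column named "value";
# collecting and joining the lines rebuilds the query as a single string
# (fine here since the whole file is one small partition, so line order is kept).
hist_sql = spark.read.text('s3://team-test/history/sql/his.txt')
sql = "\n".join(row.value for row in hist_sql.collect())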
Or, another way: you can read the data from AWS S3 directly; consider it as df2:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
conf.set('spark.hadoop.fs.s3a.access.key', <access_key>)
conf.set('spark.hadoop.fs.s3a.secret.key', <secret_key>)
conf.set('spark.hadoop.fs.s3a.session.token', <token>)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
df2 = spark.read.format("csv").load('s3a://<some_path_to_a_file>')
df2.show()
df2.createOrReplaceTempView("table1")
dfsql = spark.sql("select * from table1")
dfsql.show()

Join-Group PySpark - SQL to PySpark

I am trying to join 2 tables based on this SQL query using pyspark.
%sql
SELECT c.cust_id, avg(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b
ON c.pp = b.pp
GROUP BY c.cust_id
I tried the following in pyspark, but I am not sure it's the right way; I got stuck displaying my data, so I just chose .max:
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.avg(gender_score) as pub_masc
.groupBy('cust_id').max()
Any help would be appreciated.
Thanks in advance.
Your Python code contains an invalid line .avg(gender_score) as pub_masc. Also you should group by and then average, not the other way round.
import pyspark.sql.functions as F
df.select('cust_id', 'pp') \
  .join(pub_df, on=['pp'], how='left') \
  .groupBy('cust_id') \
  .agg(F.avg('gender_score').alias('pub_masc'))
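If you would rather keep the SQL from the question, you can register both dataframes as temp views and run it directly; a sketch, assuming df and pub_df are already defined:
df.createOrReplaceTempView("df")
pub_df.createOrReplaceTempView("pub_df")
result = spark.sql("""
    SELECT c.cust_id, AVG(b.gender_score) AS pub_masc
    FROM df c
    LEFT JOIN pub_df b ON c.pp = b.pp
    GROUP BY c.cust_id
""")
result.show()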

Delete arguments from array

There is a table with a column called Cars; in this column I have an array [Audi, BMW, Toyota, ..., VW].
I want to update this table and set Cars without a few elements from this array (Toyota, ..., BMW).
How can I do that? I want to pass another array and delete the elements that match.
You can unnest the array, filter, and reaggregate:
select t.*,
(select array_agg(car)
from unnest(t.cars) car
where car not in ( . . . )
) new_cars
from t;
If you want to keep the original ordering:
select t.*,
(select array_agg(u.car order by n)
from unnest(t.cars) with ordinality u(car, n)
where u.car not in ( . . . )
) new_cars
from t
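Since the question asks to update the table in place, the same filter-and-reaggregate trick can drive an UPDATE; a sketch, assuming the table is t, the column is cars, and the unwanted values are passed as an array:
UPDATE t
SET cars = (
    SELECT array_agg(car)
    FROM unnest(t.cars) AS car
    WHERE car <> ALL (ARRAY['Toyota', 'BMW'])
);
-- note: array_agg returns NULL (not an empty array) if every element is removed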
You could call array_remove several times:
SELECT array_remove(
array_remove(
ARRAY['Audi', 'BMW', 'Toyota', 'Opel', 'VW'],
'Audi'
),
'BMW'
);
array_remove
------------------
{Toyota,Opel,VW}
(1 row)
Maybe I can help using pandas in Python, assuming you want to delete all the rows containing the elements you'd like to remove. Let's say df is your dataframe; then:
import pandas as pd

vals_to_delete = df.loc[(df['cars'] == 'Audi') | (df['cars'] == 'VW')]
df = df.drop(vals_to_delete.index)
or you could also do
df1 = df.loc[(df['cars'] != 'Audi') & (df['cars'] != 'VW')]
In SQL, you could use
DELETE FROM table WHERE Cars IN ('Audi', 'VW');

Discriminate data from aggregate functions on PostgreSQL

On PostgreSQL I have a database for a pizza restaurant.
With this code:
SELECT command.id_command, array_agg(history_state_of_command.state)
FROM command JOIN history_state_of_command
ON command.id_command = history_state_of_command.id_command
GROUP BY command.id_command
I obtain these results, with the id of a command and the associated state of command:
id_command | state of command
1 | {Pizza_Order,Pizza_in_preparation,Pizza_prepared,Pizza_ready_for_delivery,Pizza_delivering,Pizza_deliver}
2 | {Pizza_Order,Pizza_in_preparation}
3 | {Pizza_Order,Pizza_in_preparation,Pizza_prepared,Pizza_ready_for_delivery,Pizza_delivering,Pizza_deliver,"Command cancelled"}
4 | {Pizza_Order,Pizza_in_preparation,Pizza_prepared,Pizza_ready_for_delivery,Pizza_delivering,Pizza_deliver}
I would like SQL that returns only the ids of commands where the pizza was never prepared:
id_command | state of command
2 | {Pizza_Order,Pizza_in_preparation}
Many thanks for your help !
You can use a correlated subquery to find this command:
select distinct h.id_command
from history_state_of_command h
where h.state in ('Pizza_Order', 'Pizza_in_preparation')
  and not exists (
      select 1
      from history_state_of_command i
      where i.id_command = h.id_command and i.state = 'Pizza_prepared'
  )
You can use aggregation as well:
select hsc.id_command
from history_state_of_command hsc
group by hsc.id_command
having count(*) filter (where hsc.state = 'Pizza_prepared') = 0;
Note: this assumes that commands have some row in the history. If not, then use not exists:
select c.*
from command c
where not exists (select 1
from history_state_of_command hsc
where hsc.id_command = c.id_command and hsc.state = 'Pizza_prepared'
);
This is probably the most efficient method, with appropriate indexes.
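With the sample data from the question, each of these variants returns only command 2:
 id_command
------------
          2
(1 row)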

Perform Aggregations using JSON1 and SQLite3 in Json Objects

I just started using SQLite 3 with JSON1 support.
I already have created a database that consists of several attributes.
One of these attributes is a json object.
What I want to do is perform aggregations within this object.
Running the following query:
select json_extract(report, '$.some_attribute')
from table1
group by var1
having date == max(date);
returns the following:
[{"count":55,"label":"A"},{"count":99,"label":"B"}, {"count":1,"label":"C"}]
[{"count":29,"label":"A"},{"count":285,"label":"B"},{"count":461,"label":"C"}]
[{"count":6642,"label":"A"},{"count":24859,"label":"B"},{"count":3031,"label":"C"}]
[{"count":489,"label":"A"},{"count":250,"label":"B"},{"count":74,"label":"C"}]
Now, what I want to do is group by the label key and, for example, sum the count key.
The output should be something like this:
[{"label": A, 'count': 7215},
{"label": B, 'count': 25493},
{"label": C, 'count': 3567}]
OR this:
A, B, C
7215, 25493, 3567
I've tried to implement the latter one like this:
select sum(A) as A, sum(B) as B, sum(C) as C
from (
    select json_extract(report, '$.some_attribute[0].count') as A,
           json_extract(report, '$.some_attribute[1].count') as B,
           json_extract(report, '$.some_attribute[2].count') as C
    from table1
    group by var1
    having date == max(date));
The thing is, how can you be sure that all the objects in the array are sorted the same way? This may cause problems.
Any solutions? thanks!
If you "un-nest" the json strings returned from the first json_extract,as with json_each, it becomes trivial. This worked in my repro:
WITH result AS (SELECT jsonString FROM jsonPlay)
SELECT json_extract(value, '$.label') AS label,
       SUM(json_extract(value, '$.count'))
FROM result, json_each(jsonString)
GROUP BY label
Giving this result:
A| 7215
B| 25493
C| 3567
Basically, your select json_extract(report, '$.some_attribute') block replaces select jsonString from jsonPlay
You could use this to "columnize" it, as in your OR option.
WITH result AS (SELECT jsonString FROM jsonPlay)
SELECT SUM(CASE WHEN json_extract(value, '$.label') = 'A' THEN json_extract(value, '$.count') END) AS A,
       SUM(CASE WHEN json_extract(value, '$.label') = 'B' THEN json_extract(value, '$.count') END) AS B,
       SUM(CASE WHEN json_extract(value, '$.label') = 'C' THEN json_extract(value, '$.count') END) AS C
FROM result, json_each(jsonString)
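Putting the two pieces together against the original table, per the substitution noted above (a sketch, assuming the report, var1 and date columns from the question):
WITH result AS (
    SELECT json_extract(report, '$.some_attribute') AS jsonString
    FROM table1
    GROUP BY var1
    HAVING date == max(date)
)
SELECT json_extract(value, '$.label') AS label,
       SUM(json_extract(value, '$.count')) AS count
FROM result, json_each(jsonString)
GROUP BY label;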