Please tell me how to add columns to a pivoted DataFrame.
This is the DataFrame to pivot:
|date |country|type|qty|
|----------|-------|----|---|
|2021/03/01|jp |A |10 |
|2021/03/01|en |C |20 |
|2021/03/01|jp |C |15 |
|2021/03/02|jp |A |10 |
|2021/03/02|en |A |20 |
|2021/03/02|en |C |15 |
After pivoting:
|     |2021/03/01|2021/03/01|2021/03/02|2021/03/02|
|     |jp        |en        |jp        |en        |
|-----|----------|----------|----------|----------|
| A   |10        | 0        |50        |30        |
| C   |15        |15        | 0        |75        |
I would like to add a "rate" column next to each count:
|     |2021/03/01|2021/03/01|2021/03/01|2021/03/01|2021/03/02|2021/03/02|2021/03/02|2021/03/02|
|     |jp        |jp        |en        |en        |jp        |jp        |en        |en        |
|     |cnt       |rate      |cnt       |rate      |cnt       |rate      |cnt       |rate      |
|-----|----------|----------|----------|----------|----------|----------|----------|----------|
| A   |10        | 0.4      | 0        | 0        |50        | 1        |30        | 0.26     |
| C   |15        | 0.6      |15        | 1        | 0        | 0        |85        | 0.74     |
You can use concat with the keys parameter, dividing the values by the column sums to get the rates, then reorder the levels with DataFrame.reorder_levels and sort the MultiIndex:
# adjust the aggregation function if necessary
df1 = df.pivot_table(index='type', columns=['date','country'], values='qty', fill_value=0)
print (df1)
date 2021/03/01 2021/03/02
country en jp en jp
type
A 0 10 20 10
C 20 15 15 0
df = (pd.concat([df1, df1.div(df1.sum())], axis=1, keys=('cnt','rate'))
.reorder_levels([1,2,0], axis=1)
.sort_index(axis=1))
print (df)
date 2021/03/01 2021/03/02
country en jp en jp
cnt rate cnt rate cnt rate cnt rate
type
A 0 0.0 10 0.4 20 0.571429 10 1.0
C 20 1.0 15 0.6 15 0.428571 0 0.0
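For reference, here is a minimal, self-contained sketch that rebuilds the sample data and runs the same steps end to end; the construction of df is an assumption based on the question's table, not code from the original post.

import pandas as pd

# sample data assumed from the question's table
df = pd.DataFrame({
    'date':    ['2021/03/01', '2021/03/01', '2021/03/01',
                '2021/03/02', '2021/03/02', '2021/03/02'],
    'country': ['jp', 'en', 'jp', 'jp', 'en', 'en'],
    'type':    ['A', 'C', 'C', 'A', 'A', 'C'],
    'qty':     [10, 20, 15, 10, 20, 15],
})

df1 = df.pivot_table(index='type', columns=['date', 'country'],
                     values='qty', fill_value=0)
# df1.div(df1.sum()) divides each (date, country) column by its column total,
# which is what produces the rate values
out = (pd.concat([df1, df1.div(df1.sum())], axis=1, keys=('cnt', 'rate'))
         .reorder_levels([1, 2, 0], axis=1)
         .sort_index(axis=1))
print(out)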
I want to know the number of goals scored away and at home for each team in each season
season |home_goal |away_goal |team_home |team_away|
-----------------------------------------------------
1 | 1 |0 |France |Spain |
1 | 1 |2 |Italie |Spain |
1 | 0 |1 |Spain |Italie |
1 | 1 |3 |France |Italie |
1 | 1 |4 |Spain |Portugal |
1 | 3 |4 |Portugal |Italie |
2 | 1 |2 |France |Portugal |
2 | 1 |0 |Spain |Italie |
2 | 0 |1 |Spain |Portugal |
2 | 3 |2 |Italie |Spain |
2 | 0 |1 |France |Portugal |
... | ... |... |... |... |
I want this output
season |hg |ag |team |
-------------------------------------------
1 | 2 |0 |France |
1 | 1 |8 |Italie |
1 | 1 |2 |Spain |
1 | 3 |4 |Portugal |
2 | 1 |0 |France |
... | ... |... |... |
With the query below I don't get the expected result; I only get the goals scored at home...
WITH all_match AS (SELECT
season,
match.team_home AS ht,
match.home_goal AS hg
FROM
match
UNION ALL
SELECT
season,
match.team_away AS ta,
match.away_goal AS ag
FROM
match)
SELECT
season,
ht,
SUM(hg)
FROM
all_match
group by 1,2
Use the approach below:
select * from (
select season, 'home' location, team_home as team, home_goal as goal
from your_table
union all
select season, 'away' location, team_away as team, away_goal as goal
from your_table
)
pivot (sum(goal) for location in ('home', 'away'))
If applied to the sample data in your question, the output is one row per season and team, with the summed home and away goals in separate columns.
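As a cross-check outside SQL (not part of the original answer), the same unpivot-then-aggregate idea can be sketched in pandas; the DataFrame name match and the column names are assumed from the question's table.

import pandas as pd

# reshape home and away rows into one long table (the UNION ALL step)
home = match[['season', 'team_home', 'home_goal']].rename(
    columns={'team_home': 'team', 'home_goal': 'goal'}).assign(location='home')
away = match[['season', 'team_away', 'away_goal']].rename(
    columns={'team_away': 'team', 'away_goal': 'goal'}).assign(location='away')
long_df = pd.concat([home, away])

# re-aggregate: one row per (season, team), home and away totals side by side
result = (long_df.pivot_table(index=['season', 'team'], columns='location',
                              values='goal', aggfunc='sum', fill_value=0)
                 .rename(columns={'home': 'hg', 'away': 'ag'})
                 .reset_index())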
How to calculate the counts of each distinct value in column for all the columns in a pyspark dataframe?
This is my input dataframe:
spark.table("table1").show()
+-------+---------+--------+--------+
|col1   | col2    | col3   | col4   |
+-------+---------+--------+--------+
|aa     | ss      | sss    | jjj    |
|bb     | 123     | 1203   | uuu    |
|null   | 123     | null   | zzz    |
|null   | 123     | 1203   | 6543   |
+-------+---------+--------+--------+
I need the final output data frame to be something like this:
+-----------+-------------+-------+-------+------------+
|table_name | Column_name | Value | count | percentage |
+-----------+-------------+-------+-------+------------+
|table1 | col1 | aa | 1 | |
|table1 | col1 | bb | 1 | |
|table1 | col1 | null | 2 | |
|table1 | col2 | ss | 1 | |
|table1 | col2 | 123 | 3 | |
|table1 | col3 | sss | 1 | |
|table1 | col3 | 1203 | 2 | |
|table1 | col3 | null | 1 | |
|table1 | col4 | jjj | 1 | |
|table1 | col4 | uuu | 1 | |
|table1 | col4 | zzz | 1 | |
|table1 | col4 | 6543 | 1 | |
+-----------+-------------+-------+-------+------------+
I have Python logic for calculating the percentage; the same needs to be implemented in PySpark:
percentages.append(excel['enum'][col].value_counts()[value]/excel['enum'][col].shape[0] * 100)
You can group by each column to count the distinct values and then union all the intermediary dataframes. Or count over a window partitioned by each column and then unpivot to get the desired output. Here's an example of the second approach:
from itertools import chain
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([
("aa", "ss", "sss", "jjj"), ("bb", "123", "1203", "uuu"),
(None, "123", None, "zzz"), (None, "123", "1203", "6543")
], ["col1", "col2", "col3", "col4"])
result = df.select(*[
    # for each column, pair each value with its count over a window partitioned by that value
    F.struct(
        F.col(c).alias("Value"),
        F.count(F.coalesce(F.col(c), F.lit("null"))).over(Window.partitionBy(c)).alias("count")
    ).alias(c) for c in df.columns
]).agg(*[
    # collect_set keeps one (Value, count) struct per distinct value of each column
    F.collect_set(c).alias(c) for c in df.columns
]).selectExpr(
    # unpivot with stack: one row per column, holding its array of structs and its name
    f"stack({len(df.columns)}," + ','.join(chain(*[(c, f"'{c}'") for c in df.columns])) + ")"
).selectExpr(
    "col1 as Column_name", "inline(col0)"  # explode the structs into Value / count columns
).withColumn(
"percentage",
F.round(F.col("count") / df.count() * 100, 2)
)
result.show()
# +-----------+-----+-----+----------+
# |Column_name|Value|count|percentage|
# +-----------+-----+-----+----------+
# |col1 |null |2 |50.0 |
# |col1 |bb |1 |25.0 |
# |col1 |aa |1 |25.0 |
# |col2 |123 |3 |75.0 |
# |col2 |ss |1 |25.0 |
# |col3 |null |1 |25.0 |
# |col3 |1203 |2 |50.0 |
# |col3 |sss |1 |25.0 |
# |col4 |jjj |1 |25.0 |
# |col4 |uuu |1 |25.0 |
# |col4 |zzz |1 |25.0 |
# |col4 |6543 |1 |25.0 |
# +-----------+-----+-----+----------+
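For completeness, here is a rough sketch of the first approach mentioned above (group by each column, then union the per-column counts); it is not part of the original answer and assumes the same df as in the example.

from functools import reduce
from pyspark.sql import functions as F

total = df.count()
per_column = [
    df.groupBy(F.col(c).alias("Value"))    # one group per distinct value (nulls included)
      .count()
      .withColumn("Column_name", F.lit(c))
      .withColumn("percentage", F.round(F.col("count") / total * 100, 2))
      .select("Column_name", "Value", "count", "percentage")
    for c in df.columns
]
result2 = reduce(lambda a, b: a.unionByName(b), per_column)
result2.show()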
I'm trying to set up a set of data using this query:
WITH CTE AS (
SELECT *
FROM UNNEST([("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)])
)
SELECT *
FROM CTE
The result I get is:
|Row|f0_|f1_ |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |
What I want is:
|Row| x | y |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |
Use a STRUCT with named fields for the first element; the remaining tuples inherit those field names:
WITH CTE AS (
SELECT *
FROM UNNEST([STRUCT("a" as x,NULL as y),("b",NULL),("c",1),("d",1),("e",2),("f",3),("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)])
)
SELECT *
FROM CTE
You can use this trick with almost any SQL dialect:
WITH CTE AS (
SELECT NULL AS A, NULL AS B
FROM (SELECT 1) T
WHERE FALSE
UNION ALL
SELECT *
FROM UNNEST([
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE;
I'm trying to coalesce multiple input columns into multiple output columns in either a PySpark DataFrame or a SQL table.
Each output column would contain the "first available" input value, and then "consume" it so the input value is unavailable for following output columns.
+----+-----+-----+-----+-----+-----+---+------+------+------+
| ID | in1 | in2 | in3 | in4 | in5 | / | out1 | out2 | out3 |
+----+-----+-----+-----+-----+-----+---+------+------+------+
| 1 | | | C | | | / | C | | |
| 2 | A | | C | | E | / | A | C | E |
| 3 | A | B | C | | | / | A | B | C |
| 4 | A | B | C | D | E | / | A | B | C |
| 5 | | | | | | / | | | |
| 6 | | B | | | E | / | B | E | |
| 7 | | B | | D | E | / | B | D | E |
+----+-----+-----+-----+-----+-----+---+------+------+------+
What's the best way to do this?
Edit (clarification): in1, in2, in3, etc. can be any value.
Here is one way: collect the non-null inputs into an array, then split the array into the output columns.
import pyspark.sql.functions as f
df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
cols = df.columns
cols.remove('ID')
# collect the input columns into an array and drop the nulls
df2 = df.withColumn('ins', f.array_except(f.array(*cols), f.array(f.lit(None))))
# take the first three remaining values as out1..out3
for i in range(0, 3):
    df2 = df2.withColumn('out' + str(i+1), f.col('ins')[i])
df2.show(10, False)
+---+----+----+----+----+----+---------------+----+----+----+
|ID |in1 |in2 |in3 |in4 |in5 |ins |out1|out2|out3|
+---+----+----+----+----+----+---------------+----+----+----+
|1 |null|null|C |null|null|[C] |C |null|null|
|2 |A |null|C |null|E |[A, C, E] |A |C |E |
|3 |A |B |C |null|null|[A, B, C] |A |B |C |
|4 |A |B |C |D |E |[A, B, C, D, E]|A |B |C |
|5 |null|null|null|null|null|[] |null|null|null|
|6 |null|B |null|null|E |[B, E] |B |E |null|
|7 |null|B |null|D |E |[B, D, E] |B |D |E |
+---+----+----+----+----+----+---------------+----+----+----+
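As a side note (not from the original answer): on Spark 3.1+ the nulls can also be dropped with the higher-order filter function, which keeps duplicates and the original left-to-right order of the input columns.

import pyspark.sql.functions as f

# same idea, but filtering the nulls out of the array instead of using array_except
df3 = df.withColumn('ins', f.filter(f.array(*cols), lambda x: x.isNotNull()))
for i in range(0, 3):
    df3 = df3.withColumn('out' + str(i+1), f.col('ins')[i])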
So I have a Request History table whose versions I would like to flag (a version is based on the end of a cycle). I was able to mark the end of each cycle, but somehow I couldn't assign the value associated with each cycle. Here is an example:
|history_id | Req_id | StatID | Time |EndCycleDate |
|-------------|---------|-------|---------- |-------------|
|1 | 1 |18 | 3/26/2017 | NULL |
|2 | 1 | 19 | 3/26/2017 | NULL |
|3 | 1 |20 | 3/30/2017 | NULL |
|4 |1 | 23 |3/30/2017 | NULL |
|5 | 1 |35 |3/30/2017 | 3/30/2017 |
|6 | 1 |33 |4/4/2017 | NULL |
|7 | 1 |34 |4/4/2017 | NULL |
|8 | 1 |39 |4/4/2017 | NULL |
|9 | 1 |35 |4/4/2017 | 4/4/2017 |
|10 | 1 |33 |4/5/2017 | NULL |
|11 | 1 |34 |4/6/2017 | NULL |
|12 | 1 |39 |4/6/2017 | NULL |
|13 | 1 |35 |4/7/2017 | 4/7/2017 |
|14 | 1 |33 |4/8/2017 | NULL |
|15 | 1 | 34 |4/8/2017 | NULL |
|16 | 2 |18 |3/28/2017 | NULL |
|17 | 2 |26 |3/28/2017 | NULL |
|18 | 2 |20 |3/30/2017 | NULL |
|19 | 2 |23 |3/30/2017 | NULL |
|20 | 2 |35 |3/30/2017 | 3/30/2017 |
|21 | 2 |33 |4/12/2017 | NULL |
|22 | 2 |34 |4/12/2017 | NULL |
|23 | 2 |38 |4/13/2017 | NULL |
Now what I would like to achieve is to derive a new column, namely VER, and update its value like the following:
|history_id | Req_id | StatID | Time |EndCycleDate | VER |
|-------------|---------|-------|---------- |-------------|------|
|1 | 1 |18 | 3/26/2017 | NULL | 1 |
|2 | 1 | 19 | 3/26/2017 | NULL | 1 |
|3 | 1 |20 | 3/30/2017 | NULL | 1 |
|4 |1 | 23 |3/30/2017 | NULL | 1 |
|5 | 1 |35 |3/30/2017 | 3/30/2017 | 1 |
|6 | 1 |33 |4/4/2017 | NULL | 2 |
|7 | 1 |34 |4/4/2017 | NULL | 2 |
|8 | 1 |39 |4/4/2017 | NULL | 2 |
|9 | 1 |35 |4/4/2017 | 4/4/2017 | 2 |
|10 | 1 |33 |4/5/2017 | NULL | 3 |
|11 | 1 |34 |4/6/2017 | NULL | 3 |
|12 | 1 |39 |4/6/2017 | NULL | 3 |
|13 | 1 |35 |4/7/2017 | 4/7/2017 | 3 |
|14 | 1 |33 |4/8/2017 | NULL | 4 |
|15 | 1 | 34 |4/8/2017 | NULL | 4 |
|16 | 2 |18 |3/28/2017 | NULL | 1 |
|17 | 2 |26 |3/28/2017 | NULL | 1 |
|18 | 2 |20 |3/30/2017 | NULL | 1 |
|19 | 2 |23 |3/30/2017 | NULL | 1 |
|20 | 2 |35 |3/30/2017 | 3/30/2017 | 1 |
|21 | 2 |33 |4/12/2017 | NULL | 2 |
|22 | 2 |34 |4/12/2017 | NULL | 2 |
|23 | 2 |38 |4/13/2017 | NULL | 2 |
One method that comes really close is a cumulative count:
select t.*,
count(endCycleDate) over (partition by req_id order by history_id) as ver
from t;
However, this doesn't get the value exactly right on the row where the EndCycleDate is defined, and the numbering starts at 0 instead of 1. Most of these problems are fixed with a windowing clause:
select t.*,
(count(endCycleDate) over (partition by req_id
order by history_id
rows between unbounded preceding and 1 preceding) + 1
) as ver
from t;
But that misses the value on the first row. So, here is a method that actually works. It enumerates the values backward and then subtracts from the total to get the versions in ascending order:
select t.*,
       (1 + count(endCycleDate) over (partition by req_id) -
            count(endCycleDate) over (partition by req_id
                                      order by history_id desc)
       ) as ver
from t;
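Not part of the original answer: a rough pandas sketch of the same backward-counting idea, useful as a sanity check against the sample data; it assumes a DataFrame hist with columns history_id, Req_id and EndCycleDate (NaT where the SQL value is NULL).

import pandas as pd

hist = hist.sort_values(["Req_id", "history_id"])
# total number of cycle ends per request: count(EndCycleDate) over (partition by Req_id)
total_ends = hist.groupby("Req_id")["EndCycleDate"].transform(lambda s: s.notna().sum())
# cycle ends at or after the current row: the descending cumulative count
ends_from_here = (hist.iloc[::-1]
                      .groupby("Req_id")["EndCycleDate"]
                      .transform(lambda s: s.notna().cumsum()))
hist["VER"] = 1 + total_ends - ends_from_here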