My Spark data looks like this:
area product score
a aa .39
a bb .03
a cc 1.1
a dd .5
b ee .02
b aa 1.2
b mm .5
b bb 1.3
I want the top 3 products per area, ranked by the score variable. My final output should be:
area product score rank
a cc 1.1 1
a dd .5 2
a aa .39 3
b bb 1.3 1
b aa 1.2 2
b mm .5 3
How do I do this in PySpark?
Here is what I have done so far:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("score"))
df = df.withColumn(
    "rank",
    psf.dense_rank().over(wA))
But it is not working for me.
Partition by area and filter rank <= 3 to get the desired results:
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("Test").master("local[*]") \
    .getOrCreate()
df = spark.createDataFrame([('a', 'aa', .39),
                            ('a', 'bb', .03),
                            ('a', 'cc', 1.1),
                            ('a', 'dd', .5),
                            ('b', 'ee', .02),
                            ('b', 'aa', 1.2),
                            ('b', 'mm', .5),
                            ('b', 'bb', 1.3)],
                           ['area', 'product', 'score'])
wA = Window.partitionBy("area").orderBy(psf.desc("score"))
df = df.withColumn("rank",
                   psf.dense_rank().over(wA))
df.filter("rank <= 3").show()
I have the following dataset:
id test date
1 A 2000-01-01
1 B 2000-01-01
1 C 2000-01-08
2 A 2000-01-01
2 A 2000-01-01
2 B 2000-01-08
3 A 2000-01-01
3 C 2000-01-01
3 B 2000-01-08
4 A 2000-01-01
4 B 2000-01-01
4 C 2000-01-01
5 A 2000-01-01
5 B 2000-01-01
5 C 2000-01-01
I would love to create a matrix figure showing how many individuals had each combination of tests taken on the same day.
For example: for one individual (id = 1) tests A and B were taken on the same day; for one individual (id = 3) tests A and C were taken on the same day; and for two individuals (id = 4 and 5) all three tests were taken on the same day.
So far I am doing the following:
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
df_tests_unique = df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
df_tests_unique = df_tests_unique[["id", "date", "test"]]
So the only thing left is to count the number of times the different test combinations occur within the same date.
Thanks for the fun exercise :) Below is a possible solution. I created a NumPy array and plotted it using seaborn. Note that it is quite hardcoded for the case where there are only tests A, B and C, but I am sure you will be able to generalize it. Also, seaborn's default color scheme is the opposite of what you intended, but that is easily fixable as well. Hope I helped!
The following script produces the resulting plot:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    'test': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'date': ['2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01']
})
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
df_test_with_patterns = (df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
                         .groupby(['id', 'date'])
                         .agg({'test': 'sum'})
                         .reset_index().groupby('test').count().reset_index()
                         .assign(pattern=lambda df: df.test.apply(lambda tst: [1 if x in tst else 0 for x in ['A', 'B', 'C']]))
                         )
pattern_mat = np.vstack(df_test_with_patterns.pattern.values.tolist())
ax = sns.heatmap(pattern_mat, xticklabels=['A', 'B', 'C'], yticklabels=df_test_with_patterns.id.values)
ax.set(xlabel='Test Type', ylabel='# of individuals that took in a single day')
plt.show()
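One possible way to generalize beyond the hardcoded ['A', 'B', 'C'] (a sketch, not part of the original answer; it reuses df and df_test_with_patterns from the script above) is to derive the test labels from the data:
# Sketch: derive the list of test labels instead of hardcoding it
tests = sorted(df['test'].unique())
df_test_with_patterns = df_test_with_patterns.assign(
    pattern=lambda d: d.test.apply(lambda tst: [1 if x in tst else 0 for x in tests])
)
pattern_mat = np.vstack(df_test_with_patterns.pattern.values.tolist())
ax = sns.heatmap(pattern_mat, xticklabels=tests,
                 yticklabels=df_test_with_patterns.id.values)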
Building on Erap's answer, this works too, and is maybe slightly faster:
out = pd.get_dummies(df.set_index(['date', 'id'], drop=True).sort_index()).groupby(level=[0,1]).sum()
and then iterate through the different dates to get the different charts
for i in out.index.levels[0]:
    d = out.loc[i]
    plt.figure()
    plt.title(f'test for date {i}')
    sns.heatmap(d.gt(0))
It should not be that hard, but I cannot get past this problem.
Imagine I have a long-format dataframe and want to calculate a weighted average of score per person, weighted by manager_weight, and keep it as a separate variable, 'w_mean_m'.
df['w_mean_m'] = df.groupby('person')['score'].transform(lambda x: np.average(x['score'], weights=x['manager_weight']))
This throws an error, and I have no idea how to fix it.
Because GroupBy.transform works with each column separately, it is not possible to select multiple columns there, so GroupBy.apply with Series.map is used for the new column:
s = (df.groupby('contact')
       .apply(lambda x: np.average(x['score'], weights=x['manager_weight'])))
df['w_mean_m'] = df['contact'].map(s)
One hack is possible: select the weights by their unique index inside the transform:
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, "manager_weight"])
df['w_mean_m1'] = df.groupby('contact')['score'].transform(f)
print (df)
manager_weight score contact w_mean_m1
0 1.0 1 a 1.282609
1 1.1 1 a 1.282609
2 1.2 1 a 1.282609
3 1.3 2 a 1.282609
4 1.4 2 b 2.355556
5 1.5 2 b 2.355556
6 1.6 3 b 2.355556
7 1.7 3 c 3.770270
8 1.8 4 c 3.770270
9 1.9 4 c 3.770270
10 2.0 4 c 3.770270
Setup:
df = pd.DataFrame(
    {
        "manager_weight": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
        "score": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        "contact": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
    })
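As a quick check (a sketch, assuming numpy is imported as np and using the Setup df above), the GroupBy.apply plus Series.map approach reproduces the w_mean_m1 values shown earlier:
# Weighted average of score per contact, broadcast back to the rows with map;
# expected: a -> 1.282609, b -> 2.355556, c -> 3.770270
s = df.groupby('contact').apply(lambda x: np.average(x['score'], weights=x['manager_weight']))
df['w_mean_m'] = df['contact'].map(s)
print (df)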
I am searching for a substring in a Pandas dataframe.
tmp = Metadata_sheet_0.apply(lambda row: row.astype(str).str.contains('sRNA spacer'), axis=1)
It returns a dataframe of the same size, with every element True or False. I would like the indexes of all Trues, not another dataframe of Trues/Falses.
How can I do this the Pandas way, without resorting to for loops?
Thank you!
Assuming an example such as this:
df = pd.DataFrame([[1,2,3],[4,1,2],[1,5,1]], columns=list('ABC'))
A B C
0 1 2 3
1 4 1 2
2 1 5 1
you can use a boolean mask and stack:
df.where(df.eq(1)).stack()
output:
0 A 1.0
1 B 1.0
2 A 1.0
C 1.0
dtype: float64
to only get the coordinates:
df.where(df.eq(1)).stack().index.to_list()
output:
[(0, 'A'), (1, 'B'), (2, 'A'), (2, 'C')]
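Applied to the boolean mask from the question (a sketch; Metadata_sheet_0 and the search string come from the original post), the same idea yields the (row, column) positions of the matches:
tmp = Metadata_sheet_0.apply(lambda row: row.astype(str).str.contains('sRNA spacer'), axis=1)
# stack the boolean mask into a Series, keep only the True cells,
# and read off their (row, column) index pairs
hits = tmp.stack()
matches = hits[hits].index.to_list()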
I have a dataframe with a grouping column (gr), a date column (c) (d1 means the current day, d6 means six days ago) and a value column (v). For every group, I want to find the most recent date when the value was lower (or higher) than the current value, in an expanding way.
Here is a toy example with my solution:
import pandas as pd
import operator
from functools import partial
df0 = pd.DataFrame({
    'gr': ['a', 'a', 'a', 'a', 'b', 'b'],
    'c': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6'],
    'v': [30, 10, 20, 5, 35, 5]
})

def last_time_op(df, col, t, op):
    # col - column with values
    # t - column with date
    # op - e.g. operator.gt for lower, operator.lt for higher
    value = df[col]
    series = [op(value.loc[x], value.loc[x+1:]) for x in value.index]
    seriesIndex = [x.where(x == True).first_valid_index() for x in series]
    r = df[t].reindex(seriesIndex)
    return r
df0['dateLower'] = df0.groupby('gr').apply(partial(last_time_op, col='v', t='c', op=operator.gt)).reset_index(drop=True)
df0['dateHigher'] = df0.groupby('gr').apply(partial(last_time_op, col='v', t='c', op=operator.lt)).reset_index(drop=True)
The result is:
gr c v dateLower dateHigher
0 a d1 30 d2 NaN
1 a d2 10 d4 d3
2 a d3 20 d4 NaN
3 a d4 5 NaN NaN
4 b d5 35 d6 NaN
5 b d6 5 NaN NaN
For example: 10 (row 1, c: d2) < 20 (row 2, c: d3), so dateHigher for row 1 is d3.
For higher, you need to pass operator.lt instead of operator.gt. The function last_time_op also works fine when there is no group-by, but when there is no real grouping, e.g.
df1 = pd.DataFrame({
    'gr': ['a', 'a', 'a', 'a', 'a', 'a'],  # pseudo-grouping
    'c': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6'],  # d1 - now, d6 - six days ago
    'v': [30, 10, 20, 5, 35, 5]
})
then you additionally need to unstack() to "coerce shapes":
df1['dateLower'] = df1.groupby('gr').apply(partial(last_time_op, col='v', t='c', op=operator.gt)).unstack().reset_index(drop=True)
Of course, I could check the number of unique values in the grouping column and, with an if, provide an implementation that also handles pseudo-grouping, but that looks ugly to me.
Also, my function last_time_op is not so simple...
I wonder whether a cleaner, less verbose and more idiomatic approach exists, either using pure pandas or some pandas extension.
The solution should be ready for multiple grouping columns and for date-time values in c.
You can do a cartesian product within each group, then filter out the rows where the c value on the right is not higher than the one on the left (c < c_: e.g. we only want to compare d3 to [d4, d5, d6]).
What remains is to find the lowest c_ where the value on the left v is lower/higher than the value on the right v_.
Something like this:
z = df0.merge(df0, on='gr', suffixes=['', '_']).query('c < c_')
df0.set_index(['gr', 'c']).assign(
    dateLower=z[z['v'].gt(z['v_'])].groupby(['gr', 'c'])['c_'].min(),
    dateHigher=z[z['v'].lt(z['v_'])].groupby(['gr', 'c'])['c_'].min()
).reset_index()
Output:
gr c v dateLower dateHigher
0 a d1 30 d2 NaN
1 a d2 10 d4 d3
2 a d3 20 d4 NaN
3 a d4 5 NaN NaN
4 b d5 35 d6 NaN
5 b d6 5 NaN NaN
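The same pattern should extend to several grouping columns and a real datetime column c; only the merge keys and the index change (a sketch with hypothetical column names gr1 and gr2, not tested against the original data):
# Sketch: gr1/gr2 stand in for multiple grouping columns; c is assumed to be datetime-typed.
z = df0.merge(df0, on=['gr1', 'gr2'], suffixes=['', '_']).query('c < c_')
res = df0.set_index(['gr1', 'gr2', 'c']).assign(
    dateLower=z[z['v'].gt(z['v_'])].groupby(['gr1', 'gr2', 'c'])['c_'].min(),
    dateHigher=z[z['v'].lt(z['v_'])].groupby(['gr1', 'gr2', 'c'])['c_'].min()
).reset_index()
Keep in mind that the self-merge is quadratic in the size of each group, which may matter for long per-group histories.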
I don't know why I'm struggling so hard with this one. I'm trying to do the Excel equivalent of an AVERAGEIFS calculation across a pandas dataframe.
I have the following:
df = pd.DataFrame(rng.rand(1000, 7), columns=['1/31/2019', '2/28/2019', '3/31/2019', '4/30/2019', '5/31/2019', '6/30/2019', '7/31/2019'])
I also have a column:
df['Doc_Number'] = ['A', 'B', 'C', 'B', 'C', 'B', 'A', 'A', 'D', 'G', 'G', 'D', 'G', 'B' ...]
I want to do the Excel equivalent of AVERAGEIFS on Doc_Number for each column of the df while maintaining the structure of the dataframe. So in each column I would calculate the mean where df['Doc_Number'] equals 'A', 'B', 'C', ..., but I would still keep the 1,000 rows, and I would apply the calculation to each individual column ('1/31/2019', '2/28/2019', '3/31/2019', ...).
For a single column, I would do something like:
df['AverageIfs'] = df.groupby('Doc_Number')['1/31/2019'].transform('mean')
But how would you apply the calc to each column of the df? In reality, I have many more columns to apply the calc across.
I'm a complete amateur so thanks for putting up with my questions.
You can remove ['1/31/2019'] after the groupby to process all columns into a new DataFrame, change the column names with add_suffix, and add them to the original with join:
# simplified df so the output is easy to check
np.random.seed(123)
df = pd.DataFrame(np.random.rand(14, 2), columns=['1/31/2019', '2/28/2019'])
df['Doc_Number'] = ['A', 'B', 'C', 'B', 'C', 'B', 'A', 'A', 'D', 'G', 'G', 'D', 'G', 'B']
print (df)
1/31/2019 2/28/2019 Doc_Number
0 0.696469 0.286139 A
1 0.226851 0.551315 B
2 0.719469 0.423106 C
3 0.980764 0.684830 B
4 0.480932 0.392118 C
5 0.343178 0.729050 B
6 0.438572 0.059678 A
7 0.398044 0.737995 A
8 0.182492 0.175452 D
9 0.531551 0.531828 G
10 0.634401 0.849432 G
11 0.724455 0.611024 D
12 0.722443 0.322959 G
13 0.361789 0.228263 B
df = df.join(df.groupby('Doc_Number').transform('mean').add_suffix('_mean'))
print (df)
1/31/2019 2/28/2019 Doc_Number 1/31/2019_mean 2/28/2019_mean
0 0.696469 0.286139 A 0.511029 0.361271
1 0.226851 0.551315 B 0.478146 0.548364
2 0.719469 0.423106 C 0.600200 0.407612
3 0.980764 0.684830 B 0.478146 0.548364
4 0.480932 0.392118 C 0.600200 0.407612
5 0.343178 0.729050 B 0.478146 0.548364
6 0.438572 0.059678 A 0.511029 0.361271
7 0.398044 0.737995 A 0.511029 0.361271
8 0.182492 0.175452 D 0.453474 0.393238
9 0.531551 0.531828 G 0.629465 0.568073
10 0.634401 0.849432 G 0.629465 0.568073
11 0.724455 0.611024 D 0.453474 0.393238
12 0.722443 0.322959 G 0.629465 0.568073
13 0.361789 0.228263 B 0.478146 0.548364
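If only the per-group mean columns are needed (still aligned to the original rows), a variant is to transform just the value columns and skip the join (a sketch, applied to the original df before the join above):
# Sketch: per-Doc_Number means for every date column, aligned row by row with df
value_cols = [c for c in df.columns if c != 'Doc_Number']
means = df.groupby('Doc_Number')[value_cols].transform('mean').add_suffix('_mean')
print (means)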