PySpark conditional function evaluation based on another column

I have a sample data set like below
sample_data = [('A', 'Chetna', 5, 'date_add(date_format(current_date(), \'yyyy-MM-dd\'), 7)'),
               ('B', 'Tanmay', 6, 'date_add(date_format(current_date(), \'yyyy-MM-dd\'), 1)'),
               ('C', 'CC', 2, 'date_add(date_format(current_date(), \'yyyy-MM-dd\'), 3)'),
               ('D', 'TC', 9, 'date_add(date_format(current_date(), \'yyyy-MM-dd\'), 5)')]
df = spark.createDataFrame(sample_data, ['id', 'name', 'days', 'applyMe'])
from pyspark.sql.functions import lit
df = df.withColumn("salary", lit('days * 60'))
I am trying to evaluate the expressions stored in the applyMe and salary columns.
So far I have tried doing it with expr and eval, but with no luck.
Could someone please point me in the right direction to achieve the desired output?
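To make the salary part concrete, this is roughly what the expr attempt looks like (an illustrative sketch, not my final code); expr handles the fixed 'days * 60' expression, but it only accepts a plain SQL string, so I don't see how to point it at the per-row SQL stored in applyMe:
from pyspark.sql.functions import expr
# evaluating a fixed SQL snippet works for the salary column
df = df.withColumn("salary", expr("days * 60"))
# expr() takes a Python string, not a Column, so something like expr(df["applyMe"])
# fails, and evaluating applyMe row by row is the part I'm stuck on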

Related

How to join two data frames using regexp_replace

I want to join two dataframes, removing the matching records, using the column cust_id. Some of the cust_id values have leading zeros, so I need to strip the zeros in the ON clause. I tried the query below, but it gives an error in a Databricks notebook.
PS: I don't want to create another DF1 with zeros removed.
Query:
df1 = df1.join(df2,[regexp_replace("cust_id", r'^[0]*','')], "leftanti")
py4j.Py4JException: Method and([class java.lang.String]) does not exist
The following works, but the output you provided will not be produced by a "leftanti" join: S15 matches S15 from the other table, so it is removed too. With the example data you provided, the "leftanti" join does not return any rows.
from pyspark.sql import functions as F
df1 = spark.createDataFrame(
    [(1, 'S15', 'AAA'),
     (2, '00767', 'BBB'),
     (3, '03246', 'CCC')],
    ['ID', 'cust_id', 'Name'])
df2 = spark.createDataFrame(
    [(1, 'S15', 'AAA'),
     (2, '767', 'BBB'),
     (3, '3246', 'CCC')],
    ['ID', 'cust_id', 'Name'])
df = df1.join(df2, df2.cust_id == F.regexp_replace(df1.cust_id, r'^0*', ''), "leftanti")
df.show()
# +---+-------+----+
# | ID|cust_id|Name|
# +---+-------+----+
# +---+-------+----+
No need for the square brackets [ ]; you can pass the join condition Column directly:
df1.join(df2, regexp_replace(df1["cust_id"], r'^[0]*', "") == df2["cust_id"], "leftanti")
See the documentation for regexp_replace.

How to group by multiple columns on a pandas series

The pandas.Series groupby method makes it possible to group by another series, for example:
data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])
grade.groupby(df['age']).mean()
However, this approach does not work for a groupby using two columns:
grade.groupby(df[['age','gender']])
ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
In the example, it is easy to add the column to the dataframe and get the desired result as follows:
df['grade'] = grade
y = df.groupby(['gender','age']).mean()
y.to_dict()
{'grade': {('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}}
But that can get quite ugly in real life situations. Is there any way to do this groupby on multiple columns directly on the series?
Since I don't know of any direct way to solve the problem, I've made a function that creates a temporary table and performs the groupby on it.
def pd_groupby(series, group_obj):
    df = pd.DataFrame(group_obj).copy()
    groupby_columns = list(df.columns)
    df[series.name] = series
    return df.groupby(groupby_columns)[series.name]
Here, group_obj can be a pandas Series or a pandas DataFrame. Starting from the sample code, the desired result can be achieved by:
y = pd_groupby(grade,df[['gender','age']]).mean()
y.to_dict()
{('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}
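For what it's worth, pandas also accepts a list of Series in Series.groupby, which avoids building the temporary frame; a minimal sketch with the sample data above:
# group the grade Series by two aligned Series at once
y = grade.groupby([df['gender'], df['age']]).mean()
y.to_dict()
# {('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}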

Plotting a multi-index dataframe with Altair

I have a dataframe which looks like:
data = {'ColA': {('A', 'A-1'): 0,
                 ('A', 'A-2'): 1,
                 ('A', 'A-3'): 1,
                 ('B', 'B-1'): 2,
                 ('B', 'B-2'): 2,
                 ('B', 'B-3'): 0,
                 ('C', 'C-1'): 1,
                 ('C', 'C-2'): 2,
                 ('C', 'C-3'): 2,
                 ('C', 'C-4'): 3},
        'ColB': {('A', 'A-1'): 3,
                 ('A', 'A-2'): 1,
                 ('A', 'A-3'): 1,
                 ('B', 'B-1'): 0,
                 ('B', 'B-2'): 2,
                 ('B', 'B-3'): 2,
                 ('C', 'C-1'): 2,
                 ('C', 'C-2'): 0,
                 ('C', 'C-3'): 3,
                 ('C', 'C-4'): 1}}
df = pd.DataFrame(data)
The values for every column are either 0, 1, 2, or 3. These values could just as easily be 'U', 'Q', 'R', or 'Z' ... i.e. there is nothing inherently numeric about them.
I would like to use Altair to plot this data.
First Set of Charts
I would like to get one bar chart per column.
The labels for the X-axis should be based on the unique values in the columns. The Y-axis should be the count of the unique values in the column.
Second Set of Charts
Similar to the first set, I would like to get one bar chart per row.
The labels for the X-axis should be based on the unique values in the row. The Y-axis should be the count of the unique values in the row.
This should be easy, but I am not sure how to do it.
All of Altair's APIs are column-based, and ignore indices unless you explicitly include them (see Including Index Data in Altair's documentation).
For the first set of charts (one bar chart per column) you can do this:
alt.Chart(df.reset_index()).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(['ColA', 'ColB'])
For the second set of charts (one bar chart per row) you can do something like this:
df_transposed = df.reset_index(0, drop=True).T
alt.Chart(df_transposed).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(list(df_transposed.columns), columns=5)
This is a bit of a strange visualization, though, so I suspect I'm misunderstanding what you're after: your data has ten rows, so one chart per row means ten charts.

Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get the requirements as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So I want to put the value '*' as a marker meaning the end of the table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns its index (the column and row names, or index numbers).
Is there any way I can find the index (or a list of indexes) of a certain value, like a coordinate?
For example, when the data frame is like below,
  | column_1 | column_2
--+----------+---------
1 | 'a'      | 'b'
2 | 'c'      | 'd'
how can I get 'column_2' and 2 from the value 'd'? It's something like the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still, I'd be surprised if there isn't a less clunky way.
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) index, not the custom (1-based) index you have. If you have a fixed offset you could just add a +1 or whatever to the output.
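If the original index labels are wanted instead of adding an offset, the positional hits can be mapped back through df.index; a small variation on the comprehension above (same df, sketch only):
# translate positional row indices into the original index labels
[(c, df.index[np.where(df[c] == 'd')[0]].tolist()) for c in df.columns if (df[c] == 'd').any()]
# [('column_2', [2])]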
If I understand correctly, you are looking for the (index value, column location) of a value in a dataframe. You can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution should contain row and column index.
Hope this helps!
To search single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
Also works when value occurs at multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
column_1 column_2
1 a b
2 NaN d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1 column_1 a
column_2 b
2 column_2 d
The result is a Series with a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
(1, 'column_2'),
(2, 'column_2')],
)
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Had a similar need and this worked perfectly
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the (positional) row index of the first matching row
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the (positional) column index of the first matching column
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# do whatever you want, e.g. replace the value in the cell above
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'

Pandas: Issue with min() on Categorical columns

I have the following df where columns A,B,C are categorical variables with strict ordering:
df = pd.DataFrame([[0, 1, 'PASS', 'PASS', 'PASS'],
                   [0, 2, 'CHAIN', 'FAIL', 'PASS'],
                   [0, 3, 'PASS', 'PASS', 'TATPG'],
                   [0, 4, 'FAIL', 'PASS', 'FAIL'],
                   [0, 5, 'FAIL', 'ATPG', 'FAIL']],
                  columns=['X', 'Y', 'A', 'B', 'C'])
for c in ['A', 'B', 'C']:
    df[c] = df[c].astype('category', categories=['CHAIN', 'ATPG', 'TATPG', 'PASS', 'FAIL'], ordered=True)
I want to create a new column D which is defined by the min('A', 'B', 'C'). For example, row 1 says 'CHAIN'. That is the smallest value. Hence, D[1] = CHAIN and so on. The D column should result as follows:
D[0] = PASS, D[1] = CHAIN, D[2] = TATPG, D[3] = PASS, D[4] = ATPG
I tried:
df['D'] = df[['A','B','C']].apply(min, axis=1)
However, this does not work as apply() makes the A/B/C column become of type object and hence min() sorts the values lexicographically instead of the ordering that I provided.
I also tried:
df['D'] = df[['A', 'B', 'C']].transpose().min(axis=0)
transpose() too results in the columns A/B/C getting changed to type object instead of category.
Any ideas on how to do this correctly? I'd rather not recast the columns as categorical a 2nd time if using apply(). In general, I'll be creating a bunch of indicator columns using this formula:
df[indicator] = df[[any subset of (A,B,C)]].min()
I have found a solution that applies sorted with a key function:
d = {'CHAIN': 0,
     'ATPG': 1,
     'TATPG': 2,
     'PASS': 3,
     'FAIL': 4}

def func(row):
    return sorted(row, key=lambda x: d[x])[0]

df['D'] = df[['A', 'B', 'C']].apply(func, axis=1)
It gives you the result you're looking for:
0 PASS
1 CHAIN
2 TATPG
3 PASS
4 ATPG
However, it does not make use of pandas' native ordering of categorical variables.
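An approach that does lean on the categorical ordering (a sketch, assuming A, B and C were created with the same ordered categories) is to take the row-wise minimum of the integer category codes, which follow the declared order, and map it back onto a categorical column:
# category codes follow the declared order: CHAIN=0, ATPG=1, ..., FAIL=4
codes = df[['A', 'B', 'C']].apply(lambda s: s.cat.codes)
min_codes = codes.min(axis=1)
# rebuild an ordered categorical column from the row-wise minimum codes
cats = df['A'].cat.categories  # assumes all three columns share these categories
df['D'] = pd.Categorical.from_codes(min_codes, categories=cats, ordered=True)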