Shift Column Values based on Index in a PySpark DataFrame

I HAVE:

INDEX  A_0   B_0   A_1   B_1   A_2   B_2   A_3   B_3
0      00a   00b   01a   01b   02a   02b   03a   03b
1      null  null  11a   11b   12a   12b   13a   13b
2      null  null  null  null  21a   22b   23a   23b
3      null  null  null  null  null  null  33a   33b
I SHOULD GET THIS WITHOUT USING UDF:

INDEX  A_0   B_0   A_1   B_1   A_2   B_2   A_3   B_3
0      00a   00b   01a   01b   02a   02b   03a   03b
1      11a   11b   12a   12b   13a   13b   null  null
2      21a   22b   23a   23b   null  null  null  null
3      33a   33b   null  null  null  null  null  null

Here's my approach:

from pyspark.sql import functions as psf

data = [
    (0, '00a', '00b', '01a', '01b', '02a', '02b', '03a', '03b'),
    (1, None, None, '11a', '11b', '12a', '12b', '13a', '13b'),
    (2, None, None, None, None, '21a', '22b', '23a', '23b'),
    (3, None, None, None, None, None, None, '33a', '33b'),
]
df = spark.createDataFrame(data, ['INDEX', 'A_0', 'B_0', 'A_1', 'B_1', 'A_2', 'B_2', 'A_3', 'B_3'])

# Collect the value columns into an array and drop the NULLs,
# which shifts the remaining values to the left.
df = df.withColumn(
    'to_array',
    psf.array_except(
        psf.array('A_0', 'B_0', 'A_1', 'B_1', 'A_2', 'B_2', 'A_3', 'B_3'),
        psf.array(psf.lit(None))
    )
)

# Extract the array elements back into the original column names.
df.select(
    psf.col('INDEX'),
    psf.col('to_array')[0].alias('A_0'),
    psf.col('to_array')[1].alias('B_0'),
    psf.col('to_array')[2].alias('A_1'),
    psf.col('to_array')[3].alias('B_1'),
    psf.col('to_array')[4].alias('A_2'),
    psf.col('to_array')[5].alias('B_2'),
    psf.col('to_array')[6].alias('A_3'),
    psf.col('to_array')[7].alias('B_3'),
).show()

Convert the columns into an array, drop the NULLs to "shift left", then extract the elements back into columns.
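One caveat: array_except() returns its result without duplicates, so repeated values within a row would be collapsed. A sketch of an alternative, assuming Spark 3.1 or newer for the higher-order psf.filter() function, with a list comprehension to avoid repeating the column names in the select:

cols = ['A_0', 'B_0', 'A_1', 'B_1', 'A_2', 'B_2', 'A_3', 'B_3']

# filter() only drops the NULLs; duplicates survive
df = df.withColumn('to_array', psf.filter(psf.array(*cols), lambda x: x.isNotNull()))

df.select(
    'INDEX',
    *[psf.col('to_array')[i].alias(c) for i, c in enumerate(cols)]
).show()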

Related

How to convert rows into columns by group?

I'd like to do association analysis using the apriori algorithm.
To do that, I have to reshape my dataset.
The data I have looks like this:
data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
order_number order_item
1 100145 27684535
2 100155 15755576
3 100155 1357954
4 100155 124776249
5 500002 12478324
6 500002 15755576
7 500002 13577
8 500007 27684535
and I want to transform the data into this:

data.frame("order_number" = c("100145", "100155", "500002", "500007"),
           "col1" = c("27684535", "15755576", "12478324", "27684535"),
           "col2" = c(NA, "1357954", "15755576", NA),
           "col3" = c(NA, "124776249", "13577", NA))
order_number col1 col2 col3
1 100145 27684535 <NA> <NA>
2 100155 15755576 1357954 124776249
3 500002 12478324 15755576 13577
4 500007 27684535 <NA> <NA>
Thank you for your help.
This is a case for pivot_wider() (or other functions for changing column layout). The first step is creating a row id variable recording whether each item is the 1st, 2nd or 3rd within its order, then reshaping into the dataframe you want:
df <- data.frame("order_number" = c("100145", "100155", "100155", "100155",
                                    "500002", "500002", "500002", "500007"),
                 "order_item" = c("27684535", "15755576", "1357954", "124776249",
                                  "12478324", "15755576", "13577", "27684535"))

library(tidyr)
library(dplyr)

df |>
  group_by(order_number) |>
  mutate(rank = row_number()) |>
  pivot_wider(names_from = rank, values_from = order_item,
              names_prefix = "col")
#> # A tibble: 4 × 4
#> # Groups: order_number [4]
#> order_number col1 col2 col3
#> <chr> <chr> <chr> <chr>
#> 1 100145 27684535 <NA> <NA>
#> 2 100155 15755576 1357954 124776249
#> 3 500002 12478324 15755576 13577
#> 4 500007 27684535 <NA> <NA>
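For reference, a minimal pandas sketch of the same reshape (frame and column names are taken from the question), using cumcount() for the per-group row id and pivot() to widen:

import pandas as pd

df = pd.DataFrame({
    'order_number': ['100145', '100155', '100155', '100155',
                     '500002', '500002', '500002', '500007'],
    'order_item': ['27684535', '15755576', '1357954', '124776249',
                   '12478324', '15755576', '13577', '27684535'],
})

# Per-order row id (1, 2, 3, ...), the analogue of row_number() above
df['rank'] = df.groupby('order_number').cumcount() + 1

# Widen: one row per order_number, one column per rank
wide = (df.pivot(index='order_number', columns='rank', values='order_item')
          .add_prefix('col')
          .reset_index())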

AI ML for Pattern Classification in Data Frame

I have a dataset to which I want to apply an AI/ML algorithm to classify patterns as shown below; please suggest ways I can do this efficiently.
Example Dataset-
Name Subset Value
A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565
The output would be:

Pattern No.  Instance  Pattern
1            1         A  X_1  55
                       A  X_A  89
                       A  X_B  45
             2         C  X_1  66
                       C  X_A  656
                       C  X_B  456
2            1         B  B_1  55
                       B  B_A  89
3            1         D  D_1  545
                       D  D_2  546
                       D  D_3  895
                       D  D_4  565
You need a clustering algorithm.
Define a distance function between instances
Try various clustering algorithms
I would recommend starting with Jaccard distance and agglomerative clustering.
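As a worked example of the distance: instances A and C have identical subset lists, so their Jaccard distance is 1 - 3/3 = 0, while A and B share no subsets, giving 1 - 0/5 = 1. With average linkage, a distance_threshold of 1 therefore only merges instances that share at least one subset label.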
Code example:

import re
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

def read_data():
    text = '''A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565'''
    data = [re.split(r'\s+', line) for line in text.split('\n')]
    return pd.DataFrame(data, columns=['Name', 'Subset', 'Value'])

df = read_data()

# One instance per Name, described by its list of Subset labels
instances = []
for name, name_df in df.groupby('Name'):
    instances.append({'name': name, 'subsets': name_df['Subset'].tolist()})

def jaccard_distance(list_1, list_2):
    set_1 = set(list_1)
    set_2 = set(list_2)
    return 1 - len(set_1.intersection(set_2)) / len(set_1.union(set_2))

# Note: on scikit-learn >= 1.2, pass metric='precomputed' instead of affinity=
clustering = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=1,
                                     affinity='precomputed',
                                     linkage='average')

# Pairwise distance matrix between all instances
distance_matrix = [
    [
        jaccard_distance(instance_1['subsets'], instance_2['subsets'])
        for instance_2 in instances
    ] for instance_1 in instances
]

clusters = clustering.fit_predict(distance_matrix)
for instance, cluster in zip(instances, clusters):
    instance['cluster'] = cluster
print(instances)
Output
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]

split pandas column to many columns

I have a dataframe like below:
ColumnA ColumnB ColumnC
0 usr usr1,usr2 X1
1 xyz xyz1,xyz2,xyz3 X2
2 abc abc1,abc2,abc3 X3
What I want to do is split column B by ",".
The problem is that some cells of column B have 3 values (xyz1,xyz2,xyz3), some have 6, etc. It is not stable.
Expected output:
ColumnA ColumnB usercol1 usercol2 usercol3 ColumnC
0 usr usr1,usr2 usr1 usr2 - X1
1 xyz xyz1,xyz2,xyz3 xyz1 xyz2 xyz3 X2
2 abc abc1,abc2,abc3 abc1 abc2 abc3 X3
Create a new dataframe using str.split() with expand=True.
Then concat the first two columns, the new expanded dataframe, and the third original dataframe column. This handles uneven list lengths dynamically.
import numpy as np
import pandas as pd

df1 = df['ColumnB'].str.split(',', expand=True).add_prefix('usercol')
df1 = pd.concat([df[['ColumnA', 'ColumnB']], df1, df[['ColumnC']]], axis=1).replace(np.nan, '-')
df1
Out[1]:
ColumnA ColumnB usercol0 usercol1 usercol2 ColumnC
0 usr usr1,usr2 usr1 usr2 - X1
1 xyz xyz1,xyz2,xyz3 xyz1 xyz2 xyz3 X2
2 abc abc1,abc2,abc3 abc1 abc2 abc3 X3
Technically, this could be done with one line as well:
df = pd.concat([df[['ColumnA', 'ColumnB']],
                df['ColumnB'].str.split(',', expand=True).add_prefix('usercol'),
                df[['ColumnC']]], axis=1).replace(np.nan, '-')
df
Out[1]:
ColumnA ColumnB usercol0 usercol1 usercol2 ColumnC
0 usr usr1,usr2 usr1 usr2 - X1
1 xyz xyz1,xyz2,xyz3 xyz1 xyz2 xyz3 X2
2 abc abc1,abc2,abc3 abc1 abc2 abc3 X3
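Note that add_prefix() numbers the new columns from 0, while the expected output uses usercol1..usercol3. If the numbering matters, one option is to rename the columns after splitting -- a sketch, starting again from the original df:

split = df['ColumnB'].str.split(',', expand=True)
# Renumber from 1 so the names match usercol1..usercolN
split.columns = [f'usercol{i + 1}' for i in range(split.shape[1])]
df1 = pd.concat([df[['ColumnA', 'ColumnB']], split, df[['ColumnC']]], axis=1).replace(np.nan, '-')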

How to groupby with the criterion that the groups' set intersection is not empty?

I have the following dataframe
df_testing = pd.DataFrame({
'Q': ['Q_0', 'Q_1', 'Q_2', 'Q_3', 'Q_4', 'Q_5', 'Q_5', 'Q_6', 'Q_7', 'Q_7', 'Q_8'],
'A': ['A_0', 'A_1', 'A_1', 'A_1', 'A_2', 'A_3', 'A_4', 'A_5', 'A_5', 'A_6', 'A_7']
})
Q A
0 Q_0 A_0
1 Q_1 A_1
2 Q_2 A_1
3 Q_3 A_1
4 Q_4 A_2
5 Q_5 A_3
6 Q_5 A_4
7 Q_6 A_5
8 Q_7 A_5
9 Q_7 A_6
10 Q_8 A_7
and after grouping by Q's:
# As with same Qs
as_with_same_qs = df_testing.groupby('Q', as_index=False).agg({'A': tuple})
Q A
0 Q_0 (A_0,)
1 Q_1 (A_1,)
2 Q_2 (A_1,)
3 Q_3 (A_1,)
4 Q_4 (A_2,)
5 Q_5 (A_3, A_4)
6 Q_6 (A_5,)
7 Q_7 (A_5, A_6)
8 Q_8 (A_7,)
I need to group again, but this time by A. The problem is that, by default, the groupby criterion is only that the values are equal. Here, I would instead like to put into the same group the rows whose sets of A's have any element in common. For example, the rows:
6 Q_6 (A_5,)
7 Q_7 (A_5, A_6)
have a common element, A_5, i.e. set(('A_5',)) & set(('A_5', 'A_6')) != set(). Because of that, I would like them to be grouped together so I can aggregate the Q's as I like afterwards. The problem is that I do not know how to define this custom groupby criterion.
Expected result:
A Q
(A_0,) (Q_0,)
(A_1,) (Q_1, Q_2, Q_3)
(A_2,) (Q_4,)
(A_3, A_4) (Q_5,)
(A_5, A_6) (Q_6, Q_7)
(A_7,) (Q_8,)
import numpy as np

# For each pair of rows, check whether one A-set is a subset of the other
a = np.array([[set(e).issubset(f) for f in as_with_same_qs.A] for e in as_with_same_qs.A])
# For each row, collect all A-sets it is a subset of ...
b = [np.array(as_with_same_qs.A)[e] for e in a]
# ... and pick the largest of them as the row's canonical A-set
c = [np.argmax([len(f) for f in np.array(as_with_same_qs.A)[e]]) for e in a]
as_with_same_qs['A'] = [b[i][v] for i, v in enumerate(c)]

# Now overlapping rows share the same A value, so a plain groupby works
as_with_same_qs.groupby('A', as_index=False).agg({'Q': tuple})
A Q
0 (A_0,) (Q_0,)
1 (A_1,) (Q_1, Q_2, Q_3)
2 (A_2,) (Q_4,)
3 (A_3, A_4) (Q_5,)
4 (A_5, A_6) (Q_6, Q_7)
5 (A_7,) (Q_8,)
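This works here because every overlapping pair of A-sets is in a subset relation. For arbitrary partial overlaps (say, (A_5, A_6) and (A_6, A_7)), a more general option is a small union-find over the shared elements -- a sketch, where find/union are hypothetical helpers defined inline:

# Union-find: rows whose A-sets are transitively connected share one root
parent = {}

def find(x):
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])  # path compression
    return parent[x]

def union(a, b):
    parent[find(a)] = find(b)

# Link every element of a tuple to the tuple's first element
for row in as_with_same_qs.A:
    for elem in row[1:]:
        union(row[0], elem)

# Group rows by the root of their first element
labels = as_with_same_qs.A.map(lambda t: find(t[0]))
print(as_with_same_qs.groupby(labels).agg({'Q': tuple, 'A': tuple}))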

How can I select the indexes where my dataframe has more than two entries?

I have a multilevel-indexed frame. It looks like:
actor title_year sum count
50 Cent 2005.0 30981850.0 1
A.J. Buckley 2015.0 123070338.0 1
Aaliyah 2002.0 30307804.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
Here, actor and title_year form a MultiIndex. How can I slice out the entries whose actor spans more than n years?
I think you need filter:
print (df)
sum count
actor title_year
50 Cent 2005.0 30981850.0 1
2006.0 30981850.0 1
2007.0 30981850.0 1
A.J. Buckley 2015.0 123070338.0 1
Aaliyah 2002.0 30307804.0 1
2002.0 30307804.0 1
2004.0 30307804.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
If you need to remove all actors with more than 2 rows:
df1 = df.groupby(level='actor').filter(lambda x: len(x) < 3)
print (df1)
sum count
actor title_year
A.J. Buckley 2015.0 123070338.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
If you need to remove all actors with more than 2 unique values in the title_year level:
df2 = (df.groupby(level='actor')
         .filter(lambda x: x.index.get_level_values('title_year').nunique() < 3))
print (df2)
sum count
actor title_year
A.J. Buckley 2015.0 123070338.0 1
Aaliyah 2002.0 30307804.0 1
2002.0 30307804.0 1
2004.0 30307804.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
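The question asks about actors spanning more than n years; if that means the range of years rather than their count, the same filter pattern works with max - min over the level -- a sketch with n = 2, keeping only actors whose span is below n:

n = 2
df3 = df.groupby(level='actor').filter(
    lambda x: x.index.get_level_values('title_year').max()
              - x.index.get_level_values('title_year').min() < n
)
print(df3)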