Shift Column Values based on Index in a PySpark DataFrame

I HAVE:

INDEX  A_0   B_0   A_1   B_1   A_2   B_2   A_3   B_3
0      00a   00b   01a   01b   02a   02b   03a   03b
1      null  null  11a   11b   12a   12b   13a   13b
2      null  null  null  null  21a   22b   23a   23b
3      null  null  null  null  null  null  33a   33b
I SHOULD GET THIS WITHOUT USING UDF:

INDEX  A_0   B_0   A_1   B_1   A_2   B_2   A_3   B_3
0      00a   00b   01a   01b   02a   02b   03a   03b
1      11a   11b   12a   12b   13a   13b   null  null
2      21a   22b   23a   23b   null  null  null  null
3      33a   33b   null  null  null  null  null  null

Here's my approach:

from pyspark.sql import functions as psf

data = [
    (0, '00a', '00b', '01a', '01b', '02a', '02b', '03a', '03b'),
    (1, None, None, '11a', '11b', '12a', '12b', '13a', '13b'),
    (2, None, None, None, None, '21a', '22b', '23a', '23b'),
    (3, None, None, None, None, None, None, '33a', '33b'),
]
df = spark.createDataFrame(data, ['INDEX', 'A_0', 'B_0', 'A_1', 'B_1', 'A_2', 'B_2', 'A_3', 'B_3'])

# Collect the value columns into an array and drop the NULLs,
# which shifts the remaining values to the left.
df = df.withColumn(
    'to_array',
    psf.array_except(
        psf.array('A_0', 'B_0', 'A_1', 'B_1', 'A_2', 'B_2', 'A_3', 'B_3'),
        psf.array(psf.lit(None))
    )
)

# Extract the array elements back into the original column names.
df.select(
    psf.col('INDEX'),
    psf.col('to_array')[0].alias('A_0'),
    psf.col('to_array')[1].alias('B_0'),
    psf.col('to_array')[2].alias('A_1'),
    psf.col('to_array')[3].alias('B_1'),
    psf.col('to_array')[4].alias('A_2'),
    psf.col('to_array')[5].alias('B_2'),
    psf.col('to_array')[6].alias('A_3'),
    psf.col('to_array')[7].alias('B_3'),
).show()

Convert the columns into an array, drop the NULLs to "shift left", then extract the elements back into columns.
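One caveat: array_except() returns its result without duplicates, so repeated values within a row would be collapsed. A sketch of an alternative, assuming Spark 3.1 or newer for the higher-order psf.filter() function, with a list comprehension to avoid repeating the column names in the select:

cols = ['A_0', 'B_0', 'A_1', 'B_1', 'A_2', 'B_2', 'A_3', 'B_3']

# filter() only drops the NULLs; duplicates survive
df = df.withColumn('to_array', psf.filter(psf.array(*cols), lambda x: x.isNotNull()))

df.select(
    'INDEX',
    *[psf.col('to_array')[i].alias(c) for i, c in enumerate(cols)]
).show()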

Related

How to convert rows into columns by group?

I'd like to do association analysis using the apriori algorithm.
To do that, I have to reshape my dataset.
The data I have looks like this:
data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
order_number order_item
1 100145 27684535
2 100155 15755576
3 100155 1357954
4 100155 124776249
5 500002 12478324
6 500002 15755576
7 500002 13577
8 500007 27684535
and I want to transform the data into this:

data.frame("order_number" = c("100145", "100155", "500002", "500007"),
           "col1" = c("27684535", "15755576", "12478324", "27684535"),
           "col2" = c(NA, "1357954", "15755576", NA),
           "col3" = c(NA, "124776249", "13577", NA))
order_number col1 col2 col3
1 100145 27684535 <NA> <NA>
2 100155 15755576 1357954 124776249
3 500002 12478324 15755576 13577
4 500007 27684535 <NA> <NA>
Thank you for your help.
This is a case for pivot_wider() (or other functions for changing column layout). The first step is creating a row id variable recording whether each item is the 1st, 2nd or 3rd within its order, then reshaping into the dataframe you want:
df <- data.frame("order_number" = c("100145", "100155", "100155", "100155",
                                    "500002", "500002", "500002", "500007"),
                 "order_item" = c("27684535", "15755576", "1357954", "124776249",
                                  "12478324", "15755576", "13577", "27684535"))

library(tidyr)
library(dplyr)

df |>
  group_by(order_number) |>
  mutate(rank = row_number()) |>
  pivot_wider(names_from = rank, values_from = order_item,
              names_prefix = "col")
#> # A tibble: 4 × 4
#> # Groups: order_number [4]
#> order_number col1 col2 col3
#> <chr> <chr> <chr> <chr>
#> 1 100145 27684535 <NA> <NA>
#> 2 100155 15755576 1357954 124776249
#> 3 500002 12478324 15755576 13577
#> 4 500007 27684535 <NA> <NA>
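For reference, a minimal pandas sketch of the same reshape (frame and column names are taken from the question), using cumcount() for the per-group row id and pivot() to widen:

import pandas as pd

df = pd.DataFrame({
    'order_number': ['100145', '100155', '100155', '100155',
                     '500002', '500002', '500002', '500007'],
    'order_item': ['27684535', '15755576', '1357954', '124776249',
                   '12478324', '15755576', '13577', '27684535'],
})

# Per-order row id (1, 2, 3, ...), the analogue of row_number() above
df['rank'] = df.groupby('order_number').cumcount() + 1

# Widen: one row per order_number, one column per rank
wide = (df.pivot(index='order_number', columns='rank', values='order_item')
          .add_prefix('col')
          .reset_index())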

AI ML for Pattern Classification in Data Frame

I have a dataset to which I want to apply an AI/ML algorithm to classify patterns as shown below; please suggest ways I can do this efficiently.
Example Dataset-
Name Subset Value
A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565
The output would be:

Pattern No.  Instance  Pattern
1            1         A  X_1  55
                       A  X_A  89
                       A  X_B  45
             2         C  X_1  66
                       C  X_A  656
                       C  X_B  456
2            1         B  B_1  55
                       B  B_A  89
3            1         D  D_1  545
                       D  D_2  546
                       D  D_3  895
                       D  D_4  565
You need a clustering algorithm.
Define a distance function between instances
Try various clustering algorithms
I would recommend starting with Jaccard distance and agglomerative clustering.
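As a worked example of the distance: instances A and C have identical subset lists, so their Jaccard distance is 1 - 3/3 = 0, while A and B share no subsets, giving 1 - 0/5 = 1. With average linkage, a distance_threshold of 1 therefore only merges instances that share at least one subset label.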
Code example:

import re
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

def read_data():
    text = '''A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565'''
    data = [re.split(r'\s+', line) for line in text.split('\n')]
    return pd.DataFrame(data, columns=['Name', 'Subset', 'Value'])

df = read_data()

# One instance per Name, described by its list of Subset labels
instances = []
for name, name_df in df.groupby('Name'):
    instances.append({'name': name, 'subsets': name_df['Subset'].tolist()})

def jaccard_distance(list_1, list_2):
    set_1 = set(list_1)
    set_2 = set(list_2)
    return 1 - len(set_1.intersection(set_2)) / len(set_1.union(set_2))

# Note: on scikit-learn >= 1.2, pass metric='precomputed' instead of affinity=
clustering = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=1,
                                     affinity='precomputed',
                                     linkage='average')

# Pairwise distance matrix between all instances
distance_matrix = [
    [
        jaccard_distance(instance_1['subsets'], instance_2['subsets'])
        for instance_2 in instances
    ] for instance_1 in instances
]

clusters = clustering.fit_predict(distance_matrix)
for instance, cluster in zip(instances, clusters):
    instance['cluster'] = cluster
print(instances)
Output
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]

split pandas column to many columns

I have a dataframe like below:
ColumnA ColumnB ColumnC
0 usr usr1,usr2 X1
1 xyz xyz1,xyz2,xyz3 X2
2 abc abc1,abc2,abc3 X3
What I want to do is split column B by ",".
The problem is that some cells of column B have 3 values (xyz1,xyz2,xyz3), some have 6, etc. It is not stable.
Expected output:
ColumnA ColumnB usercol1 usercol2 usercol3 ColumnC
0 usr usr1,usr2 usr1 usr2 - X1
1 xyz xyz1,xyz2,xyz3 xyz1 xyz2 xyz3 X2
2 abc abc1,abc2,abc3 abc1 abc2 abc3 X3
Create a new dataframe using str.split() with expand=True.
Then concat the first two columns, the new expanded dataframe, and the third original dataframe column. This handles uneven list lengths dynamically.
import numpy as np
import pandas as pd

df1 = df['ColumnB'].str.split(',', expand=True).add_prefix('usercol')
df1 = pd.concat([df[['ColumnA', 'ColumnB']], df1, df[['ColumnC']]], axis=1).replace(np.nan, '-')
df1
Out[1]:
ColumnA ColumnB usercol0 usercol1 usercol2 ColumnC
0 usr usr1,usr2 usr1 usr2 - X1
1 xyz xyz1,xyz2,xyz3 xyz1 xyz2 xyz3 X2
2 abc abc1,abc2,abc3 abc1 abc2 abc3 X3
Technically, this could be done with one line as well:
df = pd.concat([df[['ColumnA', 'ColumnB']],
                df['ColumnB'].str.split(',', expand=True).add_prefix('usercol'),
                df[['ColumnC']]], axis=1).replace(np.nan, '-')
df
Out[1]:
ColumnA ColumnB usercol0 usercol1 usercol2 ColumnC
0 usr usr1,usr2 usr1 usr2 - X1
1 xyz xyz1,xyz2,xyz3 xyz1 xyz2 xyz3 X2
2 abc abc1,abc2,abc3 abc1 abc2 abc3 X3
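Note that add_prefix() numbers the new columns from 0, while the expected output uses usercol1..usercol3. If the numbering matters, one option is to rename the columns after splitting -- a sketch, starting again from the original df:

split = df['ColumnB'].str.split(',', expand=True)
# Renumber from 1 so the names match usercol1..usercolN
split.columns = [f'usercol{i + 1}' for i in range(split.shape[1])]
df1 = pd.concat([df[['ColumnA', 'ColumnB']], split, df[['ColumnC']]], axis=1).replace(np.nan, '-')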

How to groupby with the criterion that the groups' set intersection is not empty?

I have the following dataframe
df_testing = pd.DataFrame({
'Q': ['Q_0', 'Q_1', 'Q_2', 'Q_3', 'Q_4', 'Q_5', 'Q_5', 'Q_6', 'Q_7', 'Q_7', 'Q_8'],
'A': ['A_0', 'A_1', 'A_1', 'A_1', 'A_2', 'A_3', 'A_4', 'A_5', 'A_5', 'A_6', 'A_7']
})
Q A
0 Q_0 A_0
1 Q_1 A_1
2 Q_2 A_1
3 Q_3 A_1
4 Q_4 A_2
5 Q_5 A_3
6 Q_5 A_4
7 Q_6 A_5
8 Q_7 A_5
9 Q_7 A_6
10 Q_8 A_7
and after grouping by Q's:
# As with same Qs
as_with_same_qs = df_testing.groupby('Q', as_index=False).agg({'A': tuple})
Q A
0 Q_0 (A_0,)
1 Q_1 (A_1,)
2 Q_2 (A_1,)
3 Q_3 (A_1,)
4 Q_4 (A_2,)
5 Q_5 (A_3, A_4)
6 Q_6 (A_5,)
7 Q_7 (A_5, A_6)
8 Q_8 (A_7,)
I need to group again, but this time by A. The problem is that, by default, the groupby criterion is only that the values are equal. Here, I would instead like to put into the same group the rows whose sets of A's have any element in common. For example, the rows:
6 Q_6 (A_5,)
7 Q_7 (A_5, A_6)
have a common element, A_5, i.e. set(('A_5',)) & set(('A_5', 'A_6')) != set(). Because of that, I would like them to be grouped together so I can aggregate the Q's as I like afterwards. The problem is that I do not know how to define this custom groupby criterion.
Expected result:
A Q
(A_0,) (Q_0,)
(A_1,) (Q_1, Q_2, Q_3)
(A_2,) (Q_4,)
(A_3, A_4) (Q_5,)
(A_5, A_6) (Q_6, Q_7)
(A_7,) (Q_8,)
import numpy as np

# For each pair of rows, check whether one A-set is a subset of the other
a = np.array([[set(e).issubset(f) for f in as_with_same_qs.A] for e in as_with_same_qs.A])
# For each row, collect all A-sets it is a subset of ...
b = [np.array(as_with_same_qs.A)[e] for e in a]
# ... and pick the largest of them as the row's canonical A-set
c = [np.argmax([len(f) for f in np.array(as_with_same_qs.A)[e]]) for e in a]
as_with_same_qs['A'] = [b[i][v] for i, v in enumerate(c)]

# Now overlapping rows share the same A value, so a plain groupby works
as_with_same_qs.groupby('A', as_index=False).agg({'Q': tuple})
A Q
0 (A_0,) (Q_0,)
1 (A_1,) (Q_1, Q_2, Q_3)
2 (A_2,) (Q_4,)
3 (A_3, A_4) (Q_5,)
4 (A_5, A_6) (Q_6, Q_7)
5 (A_7,) (Q_8,)
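This works here because every overlapping pair of A-sets is in a subset relation. For arbitrary partial overlaps (say, (A_5, A_6) and (A_6, A_7)), a more general option is a small union-find over the shared elements -- a sketch, where find/union are hypothetical helpers defined inline:

# Union-find: rows whose A-sets are transitively connected share one root
parent = {}

def find(x):
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])  # path compression
    return parent[x]

def union(a, b):
    parent[find(a)] = find(b)

# Link every element of a tuple to the tuple's first element
for row in as_with_same_qs.A:
    for elem in row[1:]:
        union(row[0], elem)

# Group rows by the root of their first element
labels = as_with_same_qs.A.map(lambda t: find(t[0]))
print(as_with_same_qs.groupby(labels).agg({'Q': tuple, 'A': tuple}))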

How can I select the indexes where my dataframe has more than two entries?

I have a multilevel-indexed frame. It looks like:
actor title_year sum count
50 Cent 2005.0 30981850.0 1
A.J. Buckley 2015.0 123070338.0 1
Aaliyah 2002.0 30307804.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
Here, actor and title_year form a MultiIndex. How can I slice out the entries whose actor spans more than n years?
I think you need filter:
print (df)
sum count
actor title_year
50 Cent 2005.0 30981850.0 1
2006.0 30981850.0 1
2007.0 30981850.0 1
A.J. Buckley 2015.0 123070338.0 1
Aaliyah 2002.0 30307804.0 1
2002.0 30307804.0 1
2004.0 30307804.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
If you need to remove all actors with more than 2 rows:
df1 = df.groupby(level='actor').filter(lambda x: len(x) < 3)
print (df1)
sum count
actor title_year
A.J. Buckley 2015.0 123070338.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
If you need to remove all actors with more than 2 unique values in the title_year level:
df2 = (df.groupby(level='actor')
         .filter(lambda x: x.index.get_level_values('title_year').nunique() < 3))
print (df2)
sum count
actor title_year
A.J. Buckley 2015.0 123070338.0 1
Aaliyah 2002.0 30307804.0 1
2002.0 30307804.0 1
2004.0 30307804.0 1
Aasif Mandvi 2008.0 13214030.0 1
Abbie Cornish 2009.0 4440055.0 1
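The question asks about actors spanning more than n years; if that means the range of years rather than their count, the same filter pattern works with max - min over the level -- a sketch with n = 2, keeping only actors whose span is below n:

n = 2
df3 = df.groupby(level='actor').filter(
    lambda x: x.index.get_level_values('title_year').max()
              - x.index.get_level_values('title_year').min() < n
)
print(df3)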