Create a dataframe from a series with a TimeSeriesIndex multiplied by another series - pandas

Let's say I have a series, ser1 with a TimeSeriesIndex length x. I also have another series, ser2 length y. How do I multiply these so that I get a dataframe shape (x,y) where the index is from ser1 and the columns are the indices from ser2. I want every element of ser2 to be multiplied by the values of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.

Try this using np.outer with pandas DataFrame constructor:
pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
a b c d e
2021-01-01 100 200 300 400 500
2021-01-02 105 210 315 420 525
2021-01-03 110 220 330 440 550
2021-01-04 114 228 342 456 570
2021-01-05 89 178 267 356 445

Related

Pandas equivalent of partition by [duplicate]

Trying to create a new column from the groupby calculation. In the code below, I get the correct calculated values for each date (see group below) but when I try to create a new column (df['Data4']) with it I get NaN. So I am trying to create a new column in the dataframe with the sum of Data3 for the all dates and apply that to each date row. For example, 2015-05-08 is in 2 rows (total is 50+5 = 55) and in this new column I would like to have 55 in both of the rows.
import pandas as pd
import numpy as np
from pandas import DataFrame
df = pd.DataFrame({
'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
group = df['Data3'].groupby(df['Date']).sum()
df['Data4'] = group
You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column:
In [74]:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
​
df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
Data2 Data3 Date Sym Data4
0 11 5 2015-05-08 aapl 55
1 8 8 2015-05-07 aapl 108
2 10 6 2015-05-06 aapl 66
3 15 1 2015-05-05 aapl 121
4 110 50 2015-05-08 aaww 55
5 60 100 2015-05-07 aaww 108
6 100 60 2015-05-06 aaww 66
7 40 120 2015-05-05 aaww 121
How do I create a new column with Groupby().Sum()?
There are two ways - one straightforward and the other slightly more interesting.
Everybody's Favorite: GroupBy.transform() with 'sum'
#Ed Chum's answer can be simplified, a bit. Call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.
# The setup.
df[['Date', 'Data3']]
Date Data3
0 2015-05-08 5
1 2015-05-07 8
2 2015-05-06 6
3 2015-05-05 1
4 2015-05-08 50
5 2015-05-07 100
6 2015-05-06 60
7 2015-05-05 120
df.groupby('Date')['Data3'].transform('sum')
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data3, dtype: int64
It's a tad faster,
df2 = pd.concat([df] * 12345)
%timeit df2['Data3'].groupby(df['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')
10.4 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map()
I stumbled upon an interesting idiosyncrasy in the API. From what I tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:
df.Date.map(df.groupby('Date')['Data3'].sum())
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Date, dtype: int64
Compare with
df.groupby('Date')['Data3'].transform('sum')
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data3, dtype: int64
My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.
(Left: v0.23, Right: v0.24)
Nice alternative to know, and better if you have smaller frames with smaller numbers of groups. . . but I would recommend transform as a first choice. Thought this was worth sharing anyway.
Benchmarking code, for reference:
import perfplot
perfplot.show(
setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
kernels=[
lambda df: df.groupby('A')['B'].transform('sum'),
lambda df: df.A.map(df.groupby('A')['B'].sum()),
],
labels=['GroupBy.transform', 'GroupBy.sum + map'],
n_range=[2**k for k in range(5, 20)],
xlabel='N',
logy=True,
logx=True
)
I suggest in general to use the more powerful apply, with which you can write your queries in single expressions even for more complicated uses, such as defining a new column whose values are defined are defined as operations on groups, and that can have also different values within the same group!
This is more general than the simple case of defining a column with the same value for every group (like sum in this question, which varies by group by is the same within the same group).
Simple case (new column with same value within a group, different across groups):
# I'm assuming the name of your dataframe is something long, like
# `my_data_frame`, to show the power of being able to write your
# data processing in a single expression without multiple statements and
# multiple references to your long name, which is the normal style
# that the pandas API naturally makes you adopt, but which make the
# code often verbose, sparse, and a pain to generalize or refactor
my_data_frame = pd.DataFrame({
'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
​
(my_data_frame
# create groups by 'Date'
.groupby(['Date'])
# for every small Group DataFrame `gdf` with the same 'Date', do:
# assign a new column 'Data4' to it, with the value being
# the sum of 'Data3' for the small dataframe `gdf`
.apply(lambda gdf: gdf.assign(Data4=lambda gdf: gdf['Data3'].sum()))
# after groupby operations, the variable(s) you grouped by on
# are set as indices. In this case, 'Date' was set as an additional
# level for the (multi)index. But it is still also present as a
# column. Thus, we drop it from the index:
.droplevel(0)
)
### OR
# We don't even need to define a variable for our dataframe.
# We can chain everything in one expression
(pd
.DataFrame({
'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
.groupby(['Date'])
.apply(lambda gdf: gdf.assign(Data4=lambda gdf: gdf['Data3'].sum()))
.droplevel(0)
)
Out:
Date
Sym
Data2
Data3
Data4
3
2015-05-05
aapl
15
1
121
7
2015-05-05
aaww
40
120
121
2
2015-05-06
aapl
10
6
66
6
2015-05-06
aaww
100
60
66
1
2015-05-07
aapl
8
8
108
5
2015-05-07
aaww
60
100
108
0
2015-05-08
aapl
11
5
55
4
2015-05-08
aaww
110
50
55
(Why are the python expression within parentheses? So that we don't need to sprinkle our code with backslashes all over the place, and we can put comments within our expression code to describe every step.)
What is powerful about this? It's that it is harnessing the full power of the "split-apply-combine paradigm". It is allowing you to think in terms of "splitting your dataframe into blocks" and "running arbitrary operations on those blocks" without reducing/aggregating, i.e., without reducing the number of rows. (And without writing explicit, verbose loops and resorting to expensive joins or concatenations to glue the results back.)
Let's consider a more complex example. One in which you have multiple time series of data in your dataframe. You have a column that represents a kind of product, a column that has timestamps, and a column that contains the number of items sold for that product at some time of the year. You would like to group by product and obtain a new column, that contains the cumulative total for the items that are sold for each category. We want a column that, within every "block" with the same product, is still a time series, and is monotonically increasing (only within a block).
How can we do this? With groupby + apply!
(pd
.DataFrame({
'Date': ['2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13'],
'Product': ['shirt','shirt','shirt','shoes','shoes','shoes'],
'ItemsSold': [300, 400, 234, 80, 10, 120],
})
.groupby(['Product'])
.apply(lambda gdf: (gdf
# sort by date within a group
.sort_values('Date')
# create new column
.assign(CumulativeItemsSold=lambda df: df['ItemsSold'].cumsum())))
.droplevel(0)
)
Out:
Date
Product
ItemsSold
CumulativeItemsSold
0
2021-03-11
shirt
300
300
1
2021-03-12
shirt
400
700
2
2021-03-13
shirt
234
934
3
2021-03-11
shoes
80
80
4
2021-03-12
shoes
10
90
5
2021-03-13
shoes
120
210
Another advantage of this method? It works even if we have to group by multiple fields! For example, if we had a 'Color' field for our products, and we wanted the cumulative series grouped by (Product, Color), we can:
(pd
.DataFrame({
'Date': ['2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13',
'2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13'],
'Product': ['shirt','shirt','shirt','shoes','shoes','shoes',
'shirt','shirt','shirt','shoes','shoes','shoes'],
'Color': ['yellow','yellow','yellow','yellow','yellow','yellow',
'blue','blue','blue','blue','blue','blue'], # new!
'ItemsSold': [300, 400, 234, 80, 10, 120,
123, 84, 923, 0, 220, 94],
})
.groupby(['Product', 'Color']) # We group by 2 fields now
.apply(lambda gdf: (gdf
.sort_values('Date')
.assign(CumulativeItemsSold=lambda df: df['ItemsSold'].cumsum())))
.droplevel([0,1]) # We drop 2 levels now
Out:
Date
Product
Color
ItemsSold
CumulativeItemsSold
6
2021-03-11
shirt
blue
123
123
7
2021-03-12
shirt
blue
84
207
8
2021-03-13
shirt
blue
923
1130
0
2021-03-11
shirt
yellow
300
300
1
2021-03-12
shirt
yellow
400
700
2
2021-03-13
shirt
yellow
234
934
9
2021-03-11
shoes
blue
0
0
10
2021-03-12
shoes
blue
220
220
11
2021-03-13
shoes
blue
94
314
3
2021-03-11
shoes
yellow
80
80
4
2021-03-12
shoes
yellow
10
90
5
2021-03-13
shoes
yellow
120
210
(This possibility of easily extending to grouping over multiple fields is the reason why I like to put the arguments of groupby always in a list, even if it's a single name, like 'Product' in the previous example.)
And you can do all of this synthetically in a single expression. (Sure, if python's lambdas were a bit nicer to look at, it would look even nicer.)
Why did I go over a general case? Because this is one of the first SO questions that pops up when googling for things like "pandas new column groupby".
Additional thoughts on the API for this kind of operation
Adding columns based on arbitrary computations made on groups is much like the nice idiom of defining new column using aggregations over Windows in SparkSQL.
For example, you can think of this (it's Scala code, but the equivalent in PySpark looks practically the same):
val byDepName = Window.partitionBy('depName)
empsalary.withColumn("avg", avg('salary) over byDepName)
as something like (using pandas in the way we have seen above):
empsalary = pd.DataFrame(...some dataframe...)
(empsalary
# our `Window.partitionBy('depName)`
.groupby(['depName'])
# our 'withColumn("avg", avg('salary) over byDepName)
.apply(lambda gdf: gdf.assign(avg=lambda df: df['salary'].mean()))
.droplevel(0)
)
(Notice how much synthetic and nicer the Spark example is. The pandas equivalent looks a bit clunky. The pandas API doesn't make writing these kinds of "fluent" operations easy).
This idiom in turns comes from SQL's Window Functions, which the PostgreSQL documentation gives a very nice definition of: (emphasis mine)
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
And gives a beautiful SQL one-liner example: (ranking within groups)
SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary DESC) FROM empsalary;
depname
empno
salary
rank
develop
8
6000
1
develop
10
5200
2
develop
11
5200
2
develop
9
4500
4
develop
7
4200
5
personnel
2
3900
1
personnel
5
3500
2
sales
1
5000
1
sales
4
4800
2
sales
3
4800
2
Last thing: you might also be interested in pandas' pipe, which is similar to apply but works a bit differently and gives the internal operations a bigger scope to work on. See here for more
df = pd.DataFrame({
'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
'Sym' : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
print(pd.pivot_table(data=df,index='Date',columns='Sym', aggfunc={'Data2':'sum','Data3':'sum'}))
output
Data2 Data3
Sym aapl aaww aapl aaww
Date
2015-05-05 15 40 1 120
2015-05-06 10 100 6 60
2015-05-07 8 60 8 100
2015-05-08 11 110 5 50

AI ML for Pattern Classification in Data Frame

I have a dataset on which I want to apply AI/ML algorithm to classify patterns as given below, please suggest to me the ways in which I can do it efficiently.
Example Dataset-
Name Subset Value
A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565
The output would be;
Pattern No. Instance Pattern
1 1 A X_1 55
A X_A 89
A X_B 45
2 C X_1 66
C X_A 656
C X_B 456
2 1 B B_1 55
B B_A 89
3 1 D D_1 545
D D_2 546
D D_3 895
D D_4 565
You need a clustering algorithm.
Define a distance function between instances
Try various clustering algorithms
I would recommend to start with Jaccard distance and Agglomerative clustering.
Code example
import re
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
def read_data():
text = '''A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565'''
data = [re.split('\s+', line) for line in text.split('\n')]
return pd.DataFrame(data, columns=re.split('\s+', 'Name Subset Value'))
df = read_data()
instances = []
for name, name_df in df.groupby('Name'):
instances.append({'name': name, 'subsets': name_df['Subset'].tolist()})
def jaccard_distance(list_1, list_2):
set_1 = set(list_1)
set_2 = set(list_2)
return 1 - len(set_1.intersection(set_2)) / len(set_1.union(set_2))
clustering = AgglomerativeClustering(n_clusters=None,
distance_threshold=1,
affinity='precomputed',
linkage='average')
distance_matrix = [
[
jaccard_distance(instance_1['subsets'], instance_2['subsets'])
for instance_2 in instances
] for instance_1 in instances
]
clusters = clustering.fit_predict(distance_matrix)
for instance, cluster in zip(instances, clusters):
instance['cluster'] = cluster
print(instances)
Output
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]

calculating average depends of the number of variables of raw data in python

I am stuck on calculating the average depends on the variables in python.
I have 8 variables as below.
Case 1.
Time df1 df2 df3 df4 df5 df6 df7 df8
2020-01-01 220 250 235 215 221 221 220 253
In this case I can just calculate average like below
df['dfaverage']=(df['df1']+df['df2']+df['df3']+df['df4']+df['df5']+df['df6']+df['df7']+df['df8'])/8
The output will be 229.4
but if one of the value is zero then how would i ignore that values and calculate?
Case 2.
Time df1 df2 df3 df4 df5 df6 df7 df8
2020-01-01 220 250 235 215 221 221 220 0
the output should be 226 for case2 but when I run the code I will get 197.8.
How can I ignore 0 when calculating the average?
You can replace 0 to missing values and use mean:
df['dfaverage'] = df.replace(0, np.nan).mean(axis=1)
print (df)
Time df1 df2 df3 df4 df5 df6 df7 df8 dfaverage
0 2020-01-01 220 250 235 215 221 221 220 0 226.0
Or if need specify column names use list:
cols = [ 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8']
df['dfaverage'] = df[cols].replace(0, np.nan).mean(axis=1)
print (df)
Time df1 df2 df3 df4 df5 df6 df7 df8 dfaverage
0 2020-01-01 220 250 235 215 221 221 220 0 226.0

Getting variable no of pandas rows w.r.t. a dictionary lookup

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
'qty': np.random.randn(data_size)}).sort_values('symbol')
How to get a dataframe with variable rows from a dictionary?
Tried using [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)]. It gives a RuntimeWarning and looks very incorrect.
You can use concat with list comprehension:
print (pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k,v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Adding another method using groupby+cumcount and df.query
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or same logic without assigning extra columns #thanks #jezrael
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012

Numpy delete repeated rows

I simply need to remove the rows that are repeated in an array but maintain one of them, I can't use unique because I need to maintain the order.
Example
1 a234 125
1 a123 265
1 a234 125
1 a145 167
1 a234 125
2 a189 547
2 a189 547
3 a678 567
3 a357 569
I need this output
1 a234 125
1 a123 265
1 a145 167
2 a189 547
3 a678 567
3 a357 569
I think this does what you want ,and uses np.unique with the return_index keyword argument:
import numpy as np
a = np.array([[1, 'a234', 125],
[2, 'b189', 547],
[1, 'a234', 125],
[3, 'c678', 567],
[1, 'a234', 125],
[2, 'b189', 547]])
b = a.ravel().view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))
_, unique_idx = np.unique(b, return_index=True)
new_a = a[np.sort(unique_idx)]
>>> new_a
array([['1', 'a234', '125'],
['2', 'b189', '547'],
['3', 'c678', '567']],
dtype='|S4')
The hackiest part is the view b, that turns each row into a single element of np.void dtype, so that full rows can be compared for equality by np.unique.