Kronecker product over the rows of a pandas dataframe - pandas

So I have these two dataframes and I would like to get a new dataframe which consists of the kronecker product of the rows of the two dataframes. What is the correct way to this?
As an example:
DataFrame1
c1 c2
0 10 100
1 11 110
2 12 120
and
DataFrame2
a1 a2
0 5 7
1 1 10
2 2 4
Then I would like to have the following matrix:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480
I hope my question is clear.
PS. I saw this question was posted here kronecker product pandas dataframes. However, the answer given is not the correct answer (I believe to mine and the original question, but definitely not to mine). The answer there gives a Kronecker product of both dataframes, but I only want it over the rows.

Create MultiIndex by MultiIndex.from_product, convert both columns to MultiIndex by DataFrame.reindex and multiple Dataframe, last flatten MultiIndex:
c = pd.MultiIndex.from_product([df1, df2])
df = df1.reindex(c, axis=1, level=0).mul(df2.reindex(c, axis=1, level=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
print (df)
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480

Use numpy for efficiency:
import numpy as np
pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
columns=pd.MultiIndex.from_product([df1, df2]).map(''.join)
)
Output:
c1a1 c1a2 c2a1 c2a2
0 50 70 500 700
1 11 110 110 1100
2 24 48 240 480

Related

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
to this dataframe, i want to add 72 rows to make the dataframe exactly 100 rows. So i need to fill loc[22:99] but with a certain value, let's say 100.
I tried something like this
uncon_dstn_2021['balance'].loc[22:99] = 100
but did not work. Any idea?
You can do reindex
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]

Is there another way to solve about pandas set option?

I'm analyzing a data-frame and want to check more detailed lists
but even though I searched some solutions from google,
I don't understand why the result is not changed.
what is the problem??
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Import data
df = df = pd.read_csv(r"C:\Users\Administrator\Desktop\medical.txt")
pd.set_option("display.max_rows", 50)
pd.set_option('display.max_columns', 15)
print(df)
id age gender height weight ap_hi ap_lo cholesterol gluc
0 0 18393 2 168 62.0 110 80 1 1
1 1 20228 1 156 85.0 140 90 3 1
2 2 18857 1 165 64.0 130 70 3 1
3 3 17623 2 169 82.0 150 100 1 1
4 4 17474 1 156 56.0 100 60 1 1
... ... ... ... ... ... ... ... ...
69995 99993 19240 2 168 76.0 120 80 1 1
69996 99995 22601 1 158 126.0 140 90 2 2
69997 99996 19066 2 183 105.0 180 90 3 1
69998 99998 22431 1 163 72.0 135 80 1 2
69999 99999 20540 1 170 72.0 120 80 2 1
Look at https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html at "Frequently used options" chapter.
You can see that if the "max_rows" is lower than the total number of rows in your dataframe then it is displayed like your results.
Below a copy past of the interesting part in the link that I gave you:
if there are a way to display enough columns
pd.set_option('display.width',1000)
or
pd.set_option('display.width',None)
but to rows may be you only use
df.head(50)
or
df.tail(50)
or follows to DisplayAll
pd.set_option("display.max_rows", None)
Why set that is useless:
The second parameter is not the maximum number of rows that can be viewed, but an internal template parameter
code as follows:
set_option = CallableDynamicDoc(_set_option, _set_option_tmpl)
CallableDynamicDoc:
class CallableDynamicDoc:
def __init__(self, func, doc_tmpl):
self.__doc_tmpl__ = doc_tmpl
self.__func__ = func
def __call__(self, *args, **kwds):
return self.__func__(*args, **kwds)
#property
def __doc__(self):
opts_desc = _describe_option("all", _print_desc=False)
opts_list = pp_options_list(list(_registered_options.keys()))
return self.__doc_tmpl__.format(opts_desc=opts_desc, opts_list=opts_list)

List of Pandas Dataframes: Merging Function Outputs

I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
df_temp = df.loc[df.Income_Quantile == level]
dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "ttest_ind", can be accessed from scipy.stats as follows:
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?
I think better is use GroupBy.apply:
cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
Score_1 Score_2 Score_3
Income_Quantile
correlation 1 NaN NaN NaN
2 NaN NaN NaN
3 6.837722e-01 0.000000e+00 1.000000e+00
4 4.337662e-01 6.238377e-01 4.818230e-03
5 2.000000e-01 2.000000e-01 2.000000e-01
p-value 1 8.190692e-03 8.241377e-03 8.194933e-03
2 5.887943e-03 5.880440e-03 5.888611e-03
3 3.606128e-13 3.603267e-13 3.604996e-13
4 5.584822e-14 5.587619e-14 5.586583e-14
5 3.861801e-06 3.862192e-06 3.864736e-06
t-statistic 1 1.098143e+01 1.094719e+01 1.097856e+01
2 1.297459e+01 1.298294e+01 1.297385e+01
3 2.391611e+02 2.391927e+02 2.391736e+02
4 1.090548e+02 1.090479e+02 1.090505e+02
5 1.594605e+01 1.594577e+01 1.594399e+01

Panda: multiindex vs groupby [duplicate]

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using MultiIndex like this:-
import pandas as pd
import numpy as np
np.arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(list(zip(*np.arrays))),columns=['A','B'])
df # This is the dataframe we have generated
A B
one 1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
two 1 -0.101713 -1.204458
2 0.958008 -0.455419
3 -0.191702 -0.915983
This df is simply a data structure of two dimensions
df.ndim
2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchal index.
For example.
In [44]: df.ix["one"]
Out[44]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.ix["one"].ix[1]
Out[45]:
A -0.732470
B -0.313871
Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.ix["one"].ix[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one')
Out[47]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1)
Out[48]:
one 1 -0.313871
2 -2.068794
3 0.471764
two 1 -1.204458
2 -0.455419
3 -0.915983
Name: B
Great post by #Calvin Cheng, but thought I'd take a stab at this as well.
When to use a MultiIndex:
When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”
Why (your core question) - at least these are the biggest benefits IMO:
Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering
Example:
Dollars Units
Date Store Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol Liqour 80480280024 154.77 7
Store 2 Alcohol Liqour 80480280024 82.08 4
Store 3 Alcohol Liqour 80480280024 259.38 9
Store 1 Alcohol Liquor 80432400630 477.68 14
674545000001 139.68 4
Store 2 Alcohol Liquor 80432400630 203.88 6
674545000001 377.13 13
Store 3 Alcohol Liquor 80432400630 239.19 7
674545000001 432.32 14
Store 1 Beer Ales 94922755711 65.17 7
702770082018 174.44 14
736920111112 50.70 5
Store 2 Beer Ales 94922755711 129.60 12
702770082018 107.40 10
736920111112 59.65 5
Store 3 Beer Ales 94922755711 154.00 14
702770082018 137.40 10
736920111112 107.88 12
Store 1 Beer Lagers 702770081011 156.24 12
Store 2 Beer Lagers 702770081011 137.06 11
Store 3 Beer Lagers 702770081011 119.52 8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 154.77 82.08 259.38 7 4 9
Liquor 80432400630 477.68 203.88 239.19 14 6 7
674545000001 139.68 377.13 432.32 4 13 14
Beer Ales 94922755711 65.17 129.60 154.00 7 12 14
702770082018 174.44 107.40 137.40 14 10 10
736920111112 50.70 59.65 107.88 5 5 12
Lagers 702770081011 156.24 137.06 119.52 12 11 8
2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:
Store Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 22.11 20.52 28.82
Liquor 80432400630 34.12 33.98 34.17
674545000001 34.92 29.01 30.88
Beer Ales 94922755711 9.31 10.80 11.00
702770082018 12.46 10.74 13.74
736920111112 10.14 11.93 8.99
Lagers 702770081011 13.02 12.46 14.94
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) and (df[col2] == val2) and (df[col3] == val3)]
format, we can instead .xs or .query (yes these work for regular dfs, but it's not very useful). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
More examples can be found in this tutorial notebook I put together.
The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.
Timinigs
import numpy as np
import pandas as pd
np.random.seed(2020)
inv = pd.DataFrame({
'store_id': np.random.choice(10000, size=10**7),
'product_id': np.random.choice(1000, size=10**7),
'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
Conclusion
The benefits from using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack() as mentioned in #ZaxR's answer; Performance is not a benefit, which seems like a real missed opportunity.
Based on the comment on this
answer it seems the
experiment was flawed. Here is my attempt at a correct experiment.
Timings
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
timeit(
lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
Conclusion
MultiIndex single-value retrieval is over 7 times faster than conventional
dataframe single-value retrieval.
The syntax for MultiIndex retrieval is much cleaner.

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
of which I want to add a column for each level 0 column, called "Value" which is the result of the following function;
def my_func(df, scale):
return df['QTY']*df['PRICE']*scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
Even if that wasn't had enough, I want to apply one "scale" value for the "OP" level 0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
#select second level of columns
df1 = df.xs('QTY', axis=1, level=1).values *df.xs('PRICE', axis=1, level=1) * scale
#create MultiIndex in columns
df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
#join to original
return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100
EDIT:
For multiple by scaled values different for each level is possible use list of values:
print (my_func(df, [10, 20]))
OP PK
PRICE QTY val PRICE QTY val
0 0 1 0 2 3 120
1 4 5 200 6 7 840
2 8 9 720 10 11 2200
3 12 13 1560 14 15 4200
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
OP PK
PRICE QTY value PRICE QTY value
0 0 1 0 2 3 60
1 4 5 200 6 7 420
2 8 9 720 10 11 1100
3 12 13 1560 14 15 2100