I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
For each row, I would like to calculate the product of that row's value and all the values above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200.
How do I do this?
Use cumprod.
Example:
df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note: since your example is just a single column, the cumulative product can be computed succinctly with a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to @bananafish for locating the built-in cumprod method.
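Running the DataFrame version end to end with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))

# cumulative product down the column: row k holds A[1] * ... * A[k]
df['cprod'] = df['A'].cumprod()
print(df)
```

Row 3 of `cprod` is 2200, exactly the 10*11*20 from the question.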
Related
I have a DataFrame like this. For column 2, I need to add 0.004 to the entire column so that the value in row 1 of column 2 becomes 0. Similarly, for column 3, I need to subtract 0.4637 from the entire column so that the value in row 1 of column 3 becomes 0. How do I do this efficiently?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i, j) in zip(range(0, 5999), range(1, len(df.columns))):
    if j == 1:
        df2.values[i, j] = df.values[i, j] + df.values[0, 1]
    elif j > 1:
        df2.iloc[i, j] = df.iloc[i, j] - df.iloc[0, j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (the column names of the dataframe), and the second column is the actual values of the first row of the dataframe.
We can convert it to a list to see its values more clearly:
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, each value is subtracted from the whole column it came from.
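Putting this together for the original question, a minimal runnable sketch (the column names and all values other than the two offsets are assumptions, since the actual data wasn't posted):

```python
import pandas as pd

# hypothetical data resembling the question: row 1 of column2 is -0.004 and
# row 1 of column3 is 0.4637, so subtracting the first row zeroes both out
df = pd.DataFrame({'column1': [1.0, 2.0, 3.0],
                   'column2': [-0.004, 0.010, 0.020],
                   'column3': [0.4637, 0.5000, 0.6000]})

# subtract the first row from every row via broadcasting
df2 = df - df.iloc[0]
print(df2)
```

Note that this also shifts column1; if only some columns should be adjusted, select them first, e.g. `df[['column2', 'column3']] - df[['column2', 'column3']].iloc[0]`.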
It was working great until it wasn't, and I have no idea what I'm doing wrong. I've reduced it to a very simple dataset t:
1 2 3 4 5 6 7 8
0 3 16 3 2 17 2 3 2
1 3 16 3 2 19 4 3 2
2 3 16 3 2 9 2 3 2
3 3 16 3 2 19 1 3 2
4 3 16 3 2 17 2 3 1
5 3 16 3 2 17 1 17 1
6 3 16 3 2 19 1 17 2
7 3 16 3 2 19 4 3 1
8 3 16 3 2 19 1 3 2
9 3 16 3 2 7 2 17 1
corr = t.corr()
corr
returns "__"
and
sns.heatmap(corr)
throws the following error "zero-size array to reduction operation minimum which has no identity"
I have no idea what's wrong. I've tried it with more rows etc., and double-checked that I don't have any missing values... what's going on? I had such a pretty heatmap earlier.
As mentioned above, change the type to float. Simply:
corr = t.astype('float64').corr()
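A minimal sketch that reproduces the failure and the fix (the data here is a stand-in, since the original t wasn't posted as code):

```python
import pandas as pd

# numbers stored as strings give the columns dtype 'object'; corr() with
# numeric_only=True silently drops non-numeric columns, leaving an empty result
t = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['2', '4', '6']})
print(t.corr(numeric_only=True))   # empty DataFrame

# casting to float first restores the correlation matrix
corr = t.astype('float64').corr()
print(corr)
```

(On recent pandas versions, `corr()` without `numeric_only=True` raises on object columns instead of returning an empty frame, but the underlying cause and the fix are the same.)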
The problem here is not the dataframe itself but its origin. I ran into the same problem after using drop or iloc on a dataframe. The key is the dataframe's overall dtype.
Let's say we have the following dataframe:
list_ex = [[1.1,2.1,3.1,4,5,6,7,8],[1.2,2.2,3.3,4.1,5.5,6,7,8],
[1.3,2.3,3,4,5,6.2,7,8],[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new=pd.DataFrame(list_ex)
You can calculate list_ex_new.corr() with no problem. If you inspect the dataframe's internal attributes with vars(list_ex_new), you'll obtain:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 4, dtype: float64, '_item_cache': {}}
where dtype is float64.
A new dataframe can be defined by list_new_new = list_ex_new.iloc[1:,:], and the correlations can be evaluated successfully. A check of the dataframe's attributes shows:
{'_is_copy': ,
'_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=1, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 3, dtype: float64,
'_item_cache': {}}
where dtype is still float64.
A third dataframe can be defined:
list_ex_w = [['a','a','a','a','a','a','a','a'],[1.1,2.1,3.1,4,5,6,7,8],
[1.2,2.2,3.3,4.1,5.5,6,7,8],[1.3,2.3,3,4,5,6.2,7,8],
[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new_w=pd.DataFrame(list_ex_w)
An evaluation of the dataframe's correlation will result in an empty dataframe, since list_ex_new_w's attributes look like:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: Index(['a', 1, 2, 3, 4], dtype='object')
ObjectBlock: slice(0, 8, 1), 8 x 5, dtype: object, '_item_cache': {}}
Now the dtype is 'object', since the dataframe is not consistent in its types: there are strings and floats together. Finally, a fourth dataframe can be generated:
list_new_new_w = list_ex_new_w.iloc[1:,:]
This generates the same dataframe but without the 'a's, apparently a perfectly valid dataframe for calculating correlations. However, corr() again returns an empty dataframe. A final check of the dataframe's attributes shows:
vars(list_new_new_w)
{'_is_copy': None, '_data': BlockManager
Items: Index([1, 2, 3, 4], dtype='object')
Axis 1: RangeIndex(start=0, stop=8, step=1)
ObjectBlock: slice(0, 4, 1), 4 x 8, dtype: object, '_item_cache': {}}
where the dtype is still object, so the corr method again returns an empty dataframe.
This problem can be solved by using astype(float):
list_new_new_w.astype(float).corr()
In summary, it seems that when corr or cov (among other methods) are called, pandas generates a new dataframe with the same attributes, ignoring the fact that the new dataframe now has a consistent overall type. I've checked the pandas source code, and I believe this is the correct interpretation of pandas' implementation.
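A quicker way to run the same diagnosis, without poking at pandas internals (the vars() output shown above is version-dependent), is to check dtypes directly:

```python
import pandas as pd

list_ex_w = [['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'],
             [1.1, 2.1, 3.1, 4, 5, 6, 7, 8],
             [1.2, 2.2, 3.3, 4.1, 5.5, 6, 7, 8],
             [1.3, 2.3, 3, 4, 5, 6.2, 7, 8],
             [1.4, 2.4, 3, 4, 5, 6.2, 7.3, 8.1]]
list_ex_new_w = pd.DataFrame(list_ex_w)

# every column is object because the first row held strings
print(list_ex_new_w.dtypes)

# slicing the string row away does not change the dtype...
list_new_new_w = list_ex_new_w.iloc[1:, :]
print(list_new_new_w.dtypes)       # still object

# ...so cast explicitly before correlating
corr = list_new_new_w.astype(float).corr()
print(corr.shape)                  # full 8 x 8 correlation matrix
```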
I have this SQL code and I want to write it in pandas. Every example I saw uses groupby and order by outside of the window function, and that is not what I want. I don't want my data to look grouped; instead I just need a cumulative sum of my new column (reg_sum), ordered by hour for each article_id.
SELECT
    *,
    SUM(registrations) OVER (PARTITION BY article_id ORDER BY time) AS cumulative_regs
FROM table
Data example of what I need to get (reg_sum column):
article_id time registrations reg_sum
A 7 6 6
A 9 5 11
B 10 1 1
C 10 2 2
C 11 4 6
If anyone can say what is the equivalent of this in Pandas, that would be great. Thanks!
Using groupby and cumsum, this should work:
import pandas as pd
import numpy as np
# generate data
df = pd.DataFrame({'article_id': np.array(['A', 'A', 'B', 'C', 'C']),
'time': np.array([7, 9, 10, 10, 11]),
'registrations': np.array([6, 5, 1, 2, 4])})
# compute cumulative sum of registrations sorted by time and grouped by article_id
df['reg_sum'] = df.sort_values('time').groupby('article_id').registrations.cumsum()
Output:
article_id time registrations reg_sum
0 A 7 6 6
1 A 9 5 11
2 B 10 1 1
3 C 10 2 2
4 C 11 4 6
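One detail worth noting: the assignment back to df['reg_sum'] aligns on the index, so even though sort_values('time') reorders the rows while computing, each cumulative sum lands back on its original row. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'article_id': ['A', 'A', 'B', 'C', 'C'],
                   'time': [7, 9, 10, 10, 11],
                   'registrations': [6, 5, 1, 2, 4]})

# cumsum is computed in time order, but the assignment aligns on the
# original index, so the frame keeps its original row order
df['reg_sum'] = df.sort_values('time').groupby('article_id')['registrations'].cumsum()
print(df)
```

The result matches the SQL window function row for row, without the data ever looking grouped.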
I have a pandas dataframe.
Each row of one of its columns contains a list of 60 elements (the length is constant across rows).
How do I convert each of these lists into a row of a new dataframe?
Just to be clearer: say A is the original dataframe with n rows, and one of its columns contains a list of 60 elements.
I need to create a new n x 60 dataframe.
My tentative:
def expand(x):
    return pd.DataFrame(np.array(x).reshape(-1, len(x)))

df["col"].apply(lambda x: expand(x))
It gives funny results...
The weird thing is that if I call the function "expand" on a single row, it does exactly what I expect from it:
expand(df["col"][0])
To ChootsMagoots: This is the result when I try to apply your suggestion. It does not work.
Sample data
df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df
Output:
col
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11, 12, 13, 14]
3 [15, 16, 17, 18, 19]
Now extract a DataFrame from col:
df.col.apply(pd.Series)
Output:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
Try this:
new_df = pd.DataFrame(df["col"].tolist())
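If alignment with the original dataframe matters (e.g. to join the expanded columns back onto A), passing the index explicitly keeps the rows matched. A small sketch, using made-up two-element lists in place of the 60-element ones:

```python
import pandas as pd

df = pd.DataFrame({'col': [[0, 1, 2], [3, 4, 5]]}, index=['x', 'y'])

# tolist() turns the column of lists into a list of lists;
# reusing df.index keeps each expanded row aligned with its source row
new_df = pd.DataFrame(df['col'].tolist(), index=df.index)
print(new_df)
```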
This is a little frankensteinish, but you could also try:
import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv')
You can try this as well:
newCol = pd.Series(yourList)
df['colD'] = newCol.values
The above code:
1. Creates a pandas Series.
2. Maps the Series values onto a new column of the original dataframe.
I have a group-by table as follows. I want to sort by index within the keys ['CPUCore', 'Offline_RetetionAge'] (I need to keep the structure of ['CPUCore', 'Offline_RetetionAge']). How should I do this?
I think the problem is that the dtype of your second level is object, which means strings, so sort_index sorts it alphanumerically:
df = pd.DataFrame({'CPUCore':[2,2,2,3,3],
'Offline_RetetionAge':['100','1','12','120','15'],
'index':[11,16,5,4,3]}).set_index(['CPUCore','Offline_RetetionAge'])
print (df)
index
CPUCore Offline_RetetionAge
2 100 11
1 16
12 5
3 120 4
15 3
print (df.index.get_level_values('Offline_RetetionAge').dtype)
object
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
100 11
12 5
3 120 4
15 3
#change multiindex - cast level Offline_RetetionAge to int
new_index = list(zip(df.index.get_level_values('CPUCore'),
df.index.get_level_values('Offline_RetetionAge').astype(int)))
df.index = pd.MultiIndex.from_tuples(new_index, names = df.index.names)
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
12 5
100 11
3 15 3
120 4
EDIT by comment:
print (df.reset_index()
.sort_values(['CPUCore','index'])
.set_index(['CPUCore','Offline_RetetionAge']))
index
CPUCore Offline_RetetionAge
2 12 5
100 11
1 16
3 15 3
120 4
I think what you mean is this:
import pandas as pd
from pandas import DataFrame

# create what I believe you tried to ask
df = DataFrame([[11, 'reproducible'], [16, 'example'], [5, 'a'],
                [4, 'create'], [9, '!']])
df.columns = ['index', 'bla']
df.index = pd.MultiIndex.from_arrays([[2]*4 + [3], [10, 100, 1000, 11, 512]],
                                     names=['CPUCore', 'Offline_RetentionAge'])

# sort by values, then by index; sort_remaining=False preserves
# the order of the second level within each CPUCore
df = df.sort_values('index').sort_index(level=0, sort_remaining=False)
print(df)
sort_values sorts the rows by the 'index' column, and sort_index then restores the grouping by the MultiIndex; sort_remaining=False keeps rows with the same CPUCore in their value-sorted order.
I don't know what a "group by table" is supposed to be. If you have a pd.GroupBy object, you won't be able to use sort_values() like that; you might have to rethink what you group by, or use functools.partial and DataFrame.apply.
Output:
index bla
CPUCore Offline_RetentionAge
2 11 4 create
1000 5 a
10 11 reproducible
100 16 example
3 512 9 !