Random sampling from a dataframe - pandas

I want to generate a 2x6 dataframe which represents a rack. Half of this dataframe is filled with storage items and the other half with retrieval items.
What I want to do is randomly choose half of these 12 items and mark them as storage, and the others as retrieval.
How can I choose randomly?
I tried random.sample, but that chooses random columns. I actually want to choose random items individually.

Assuming this input:
   0  1  2  3   4   5
0  0  1  2  3   4   5
1  6  7  8  9  10  11
You can craft a random numpy array to select/mask half of the values:
import numpy as np

# half True, half False (6 of each here), shuffled into the frame's shape
a = np.repeat([True, False], df.size // 2)
np.random.shuffle(a)
a = a.reshape(df.shape)
Then select your two groups:
df.mask(a)
     0    1    2    3   4     5
0  NaN  NaN  NaN  3.0   4   NaN
1  6.0  NaN  8.0  NaN  10  11.0
df.where(a)
     0  1    2    3   4    5
0  0.0  1  2.0  NaN NaN  5.0
1  NaN  7  NaN  9.0 NaN  NaN
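If you want explicit labels rather than two masked frames, a minimal sketch reusing the mask a from above (the 'storage'/'retrieval' strings are just illustrative names):

# True cells become storage, False cells retrieval
labels = pd.DataFrame(np.where(a, 'storage', 'retrieval'),
                      index=df.index, columns=df.columns)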
If you simply want 6 random elements, use numpy.random.choice:
np.random.choice(df.to_numpy().ravel(), 6, replace=False)
Example:
array([ 4, 5, 11, 7, 8, 3])
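For reproducible draws, the same idea with numpy's Generator API (the seed 0 here is an arbitrary choice):

rng = np.random.default_rng(0)  # seeded generator for reproducibility
rng.choice(df.to_numpy().ravel(), 6, replace=False)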

Related

pandas dataframe: auto-fill values if rows have the same value in specific columns [duplicate]

I have the data below. The new pandas version doesn't preserve the grouping columns after a fillna/ffill/bfill operation. Is there a way to keep them in the result?
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
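(This is expected with recent pandas: ffill on a groupby acts as a transformation, and transformations return only the non-grouping columns.)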
With the most recent pandas, if we would like to keep the groupby columns, we need to add apply here:
out = df.groupby(['one','two']).apply(lambda x: x.ffill())
print(out)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Is this what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes: set the index first and then group, which preserves the columns, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
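Another minimal sketch in the same spirit (out is my name; this assumes the original flat df, before set_index): reattach the grouping columns explicitly after the groupby ffill:

# join the filled 'three' column back onto the key columns by index
out = df[['one', 'two']].join(df.groupby(['one', 'two']).ffill())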

Pandas groupby calculation using values from different rows based on other column

I have the following dataframe; observations are grouped in pairs. NaN here represents a different product traded against A within the pair. I want to group by transaction and compute
A/NaN so that the value of each NaN row can be expressed in units of A.
transaction name value ...many other columns
1 A 3
1 NaN 5
2 NaN 7
2 A 6
3 A 4
3 NaN 3
4 A 10
4 NaN 9
5 C 8
5 A 6
..
Thus the desired df would be
transaction name value new_column ...many other columns
1 A 3 NaN
1 NaN 6 0.5
2 NaN 7 0.8571
2 A 6 NaN
3 A 4 1.333
3 NaN 3 NaN
4 A 10 1.111
4 NaN 9 NaN
5 C 8 0.75
5 A 6 NaN
...
First filter the rows with A and convert transaction to the index, so the rows with a missing name can be divided by their transaction's A value via Series.map:
m = df['name'].ne('A')
s = df[~m].set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN
1 1 NaN 5 0.600000
2 2 NaN 7 0.857143
3 2 A 6 NaN
4 3 A 4 NaN
5 3 NaN 3 1.333333
6 4 A 10 NaN
7 4 NaN 9 1.111111
8 5 NaN 8 0.750000
9 5 A 6 NaN
EDIT: There are multiple A values per group, not only one; a possible solution is to remove the duplicates first:
print (df)
transaction name value
0 1 A 3
1 1 A 4
2 1 NaN 5
3 2 NaN 7
4 2 A 6
5 3 A 4
6 3 NaN 3
7 4 A 10
8 4 NaN 9
9 5 C 8
10 5 A 6
# s = df[~m].set_index('transaction')['value']
# df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
# print (df)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
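(The error happens because Series.map looks values up by the mapping Series' index, and duplicate transaction labels make that lookup ambiguous.)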
m = df['name'].ne('A')
print (df[~m].drop_duplicates(['transaction','name']))
transaction name value
0 1 A 3
4 2 A 6
5 3 A 4
7 4 A 10
10 5 A 6
s = df[~m].drop_duplicates(['transaction','name']).set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN <- two A rows in group 1
1 1 A 4 NaN <- two A rows in group 1
2 1 NaN 5 0.600000
3 2 NaN 7 0.857143
4 2 A 6 NaN
5 3 A 4 NaN
6 3 NaN 3 1.333333
7 4 A 10 NaN
8 4 NaN 9 1.111111
9 5 C 8 0.750000
10 5 A 6 NaN
Assuming there are only two values per transaction, you can use agg and divide the first element by the last:
df.loc[df['name'].isna(), 'new_column'] = (
    df.sort_values(by='name')
      .groupby('transaction')['value']
      .agg(f='first', l='last')
      .agg(lambda x: x['f'] / x['l'], axis=1)
)
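A different sketch using transform, which sidesteps the index alignment (my own variant; it assumes at most one A row per transaction):

# broadcast each transaction's A value to every row of that transaction
a_val = df['value'].where(df['name'].eq('A')).groupby(df['transaction']).transform('first')
# compute the ratio only for the non-A rows
df['new_column'] = (a_val / df['value']).where(df['name'].ne('A'))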

Conditional aggregation after rolling in pandas

I am trying to calculate a rolling mean of a specific column based on a condition in another column.
The condition is to create three different rolling means for column A, as follows -
The rolling mean of A when column B is less than 2
The rolling mean of A when column B is equal to 2
The rolling mean of A when column B is greater than 2
Consider the following df with a window size of 2
A B
0 1 2
1 2 4
2 3 4
3 4 6
4 5 1
5 6 2
The desired output is the following:
rolling less rolling equal rolling greater
0 NaN NaN NaN
1 NaN 1 2
2 NaN NaN 2.5
3 NaN NaN 3.5
4 5 NaN 4
5 5 6 NaN
The main difficulty I encountered is that rolling works column-wise while apply works row-wise, and computing the rolling mean by hand felt too hard-coded.
Any ideas?
Thanks a lot.
You can create your 3 masked columns before rolling, then compute the rolling means:
out = df.join(df.assign(rolling_less=df.mask(df['B'] >= 2)['A'],
                        rolling_equal=df.mask(df['B'] != 2)['A'],
                        rolling_greater=df.mask(df['B'] <= 2)['A'])
                .filter(like='rolling').rolling(2, min_periods=1).mean())
print(out)
# Output
A B rolling_less rolling_equal rolling_greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
Another option is a row-wise apply that recomputes each window by slicing up to the current label (df1 is the input dataframe):

def function1(ss: pd.Series):
    # window of (at most) the last 2 rows ending at the current row's label
    df11 = df1.loc[:ss.name].tail(2)
    return pd.Series([
        df11.loc[lambda dd: dd.B < 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B == 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B > 2, 'A'].mean()
    ], index=['rolling less', 'rolling equal', 'rolling greater'], name=ss.name)

df1.join(df1.apply(function1, axis=1))
A B rolling less rolling equal rolling greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
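For comparison, the first approach can also be written as a plain loop over the three conditions (a compact sketch; conds and out are my names):

out = df.copy()
conds = {'rolling_less': df['B'] < 2,
         'rolling_equal': df['B'] == 2,
         'rolling_greater': df['B'] > 2}
for name, cond in conds.items():
    # keep A only where the condition holds, then take the rolling mean
    out[name] = df['A'].where(cond).rolling(2, min_periods=1).mean()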

Why does inserting NaNs result in an empty plot?

I think my toy example below is self-explanatory. Basically, I can plot a line based on 5 values, yet if I insert NaNs between them the resulting line plot is empty. I would expect matplotlib to still connect the existing discrete points in my data (which are all still present).
import numpy as np
import pandas as pd

a = pd.DataFrame([1, 2, 3, 4, 5], index=range(0, 10, 2), columns=['value'])
print(a)
value
0 1
2 2
4 3
6 4
8 5
a.plot()
b = pd.DataFrame([np.nan] * 5, index=range(1, 11, 2), columns=['value'])
print(pd.concat([a, b]).sort_index())
value
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 NaN
6 4.0
7 NaN
8 5.0
9 NaN
pd.concat([a, b]).sort_index().plot()
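What happens here: matplotlib only draws a line segment between two consecutive non-NaN points, and after the concat every valid value sits between two NaNs, so there is no segment to draw. Two possible fixes, sketched on the combined frame:

c = pd.concat([a, b]).sort_index()
c.interpolate().plot()  # fill the gaps so consecutive points exist again
c.plot(marker='o')      # or draw markers so the isolated points are visible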

Applying multiple functions to a pivot table (grouped) dataframe

I currently have a dataframe which looks like this:
df:
store item sales
0 1 1 10
1 1 2 20
2 2 1 10
3 3 2 20
4 4 3 10
5 3 4 15
...
I wanted to view the total sales of each item for each store, so I used a pivot table:
p_table = pd.pivot_table(df, index='store', values='sales', columns='item', aggfunc='sum')
which gives something like:
sales
item 1 2 3 4
store
1 20 30 10 8
2 10 14 12 13
3 1 23 29 10
....
What I want to do now is apply a function so that each item's total sales becomes a percentage of that store's total sales. For example, the value for item 1 at store 1 would become 20 / (20+30+10+8) * 100.
I am struggling to do this for the stacked dataframe. Any suggestions would be much appreciated.
Thanks
I think you need to divide by the row sums, using div with the Series created by sum:
print (p_table)
item 1 2 3 4
store
1 10.0 20.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN 20.0 NaN 15.0
4 NaN NaN 10.0 NaN
print (p_table.sum(axis=1))
store
1 30.0
2 10.0
3 35.0
4 10.0
dtype: float64
out = p_table.div(p_table.sum(axis=1), axis=0)
print (out)
item 1 2 3 4
store
1 0.333333 0.666667 NaN NaN
2 1.000000 NaN NaN NaN
3 NaN 0.571429 NaN 0.428571
4 NaN NaN 1.0 NaN
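And since the goal is percentages, multiply by 100 at the end (continuing the example above):

out = p_table.div(p_table.sum(axis=1), axis=0).mul(100)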