Random sampling from a dataframe - pandas

I want to generate a 2x6 dataframe which represents a rack. Half of this dataframe is filled with storage items and the other half with retrieval items.
What I want to do is randomly choose half of these 12 items and mark them as storage, and the others as retrieval.
How can I choose randomly?
I tried random.sample, but that chooses random columns. I actually want to choose random items individually.

Assuming this input:
   0  1  2  3   4   5
0  0  1  2  3   4   5
1  6  7  8  9  10  11
You can craft a random numpy array to select/mask half of the values:
import numpy as np

# half True, half False (6 of each here), shuffled into the frame's shape
a = np.repeat([True, False], df.size // 2)
np.random.shuffle(a)
a = a.reshape(df.shape)
Then select your two groups:
df.mask(a)
     0    1    2    3   4     5
0  NaN  NaN  NaN  3.0   4   NaN
1  6.0  NaN  8.0  NaN  10  11.0
df.where(a)
     0  1    2    3   4    5
0  0.0  1  2.0  NaN NaN  5.0
1  NaN  7  NaN  9.0 NaN  NaN
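If you want explicit labels rather than two masked frames, a minimal sketch reusing the mask a from above (the 'storage'/'retrieval' strings are just illustrative names):

# True cells become storage, False cells retrieval
labels = pd.DataFrame(np.where(a, 'storage', 'retrieval'),
                      index=df.index, columns=df.columns)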
If you simply want 6 random elements, use numpy.random.choice:
np.random.choice(df.to_numpy().ravel(), 6, replace=False)
Example:
array([ 4, 5, 11, 7, 8, 3])
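For reproducible draws, the same idea with numpy's Generator API (the seed 0 here is an arbitrary choice):

rng = np.random.default_rng(0)  # seeded generator for reproducibility
rng.choice(df.to_numpy().ravel(), 6, replace=False)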

Related

pandas dataframe: auto-fill values if rows have the same value in specific columns [duplicate]

I have the data below. The new pandas version doesn't preserve the grouping columns after a fillna/ffill/bfill operation. Is there a way to keep them in the result?
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
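(This is expected with recent pandas: ffill on a groupby acts as a transformation, and transformations return only the non-grouping columns.)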
With the most recent pandas, if we would like to keep the groupby columns, we need to add apply here:
out = df.groupby(['one','two']).apply(lambda x: x.ffill())
print(out)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Is this what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes: set the index first and then group, which preserves the columns, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
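Another minimal sketch in the same spirit (out is my name; this assumes the original flat df, before set_index): reattach the grouping columns explicitly after the groupby ffill:

# join the filled 'three' column back onto the key columns by index
out = df[['one', 'two']].join(df.groupby(['one', 'two']).ffill())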

Pandas groupby calculation using values from different rows based on other column

I have the following dataframe; observations are grouped in pairs. NaN here represents a different product traded against A within the pair. I want to group by transaction and compute
A/NaN so that the value of each NaN row can be expressed in units of A.
transaction name value ...many other columns
1 A 3
1 NaN 5
2 NaN 7
2 A 6
3 A 4
3 NaN 3
4 A 10
4 NaN 9
5 C 8
5 A 6
..
Thus the desired df would be
transaction name value new_column ...many other columns
1 A 3 NaN
1 NaN 6 0.5
2 NaN 7 0.8571
2 A 6 NaN
3 A 4 1.333
3 NaN 3 NaN
4 A 10 1.111
4 NaN 9 NaN
5 C 8 0.75
5 A 6 NaN
...
First filter the rows with A and convert transaction to the index, so the rows with a missing name can be divided by their transaction's A value via Series.map:
m = df['name'].ne('A')
s = df[~m].set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN
1 1 NaN 5 0.600000
2 2 NaN 7 0.857143
3 2 A 6 NaN
4 3 A 4 NaN
5 3 NaN 3 1.333333
6 4 A 10 NaN
7 4 NaN 9 1.111111
8 5 NaN 8 0.750000
9 5 A 6 NaN
EDIT: There are multiple A values per group, not only one; a possible solution is to remove the duplicates first:
print (df)
transaction name value
0 1 A 3
1 1 A 4
2 1 NaN 5
3 2 NaN 7
4 2 A 6
5 3 A 4
6 3 NaN 3
7 4 A 10
8 4 NaN 9
9 5 C 8
10 5 A 6
# s = df[~m].set_index('transaction')['value']
# df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
# print (df)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
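(The error happens because Series.map looks values up by the mapping Series' index, and duplicate transaction labels make that lookup ambiguous.)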
m = df['name'].ne('A')
print (df[~m].drop_duplicates(['transaction','name']))
transaction name value
0 1 A 3
4 2 A 6
5 3 A 4
7 4 A 10
10 5 A 6
s = df[~m].drop_duplicates(['transaction','name']).set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN <- two A rows in group 1
1 1 A 4 NaN <- two A rows in group 1
2 1 NaN 5 0.600000
3 2 NaN 7 0.857143
4 2 A 6 NaN
5 3 A 4 NaN
6 3 NaN 3 1.333333
7 4 A 10 NaN
8 4 NaN 9 1.111111
9 5 C 8 0.750000
10 5 A 6 NaN
Assuming there are only two values per transaction, you can use agg and divide the first element by the last:
df.loc[df['name'].isna(), 'new_column'] = (
    df.sort_values(by='name')
      .groupby('transaction')['value']
      .agg(f='first', l='last')
      .agg(lambda x: x['f'] / x['l'], axis=1)
)
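A different sketch using transform, which sidesteps the index alignment (my own variant; it assumes at most one A row per transaction):

# broadcast each transaction's A value to every row of that transaction
a_val = df['value'].where(df['name'].eq('A')).groupby(df['transaction']).transform('first')
# compute the ratio only for the non-A rows
df['new_column'] = (a_val / df['value']).where(df['name'].ne('A'))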

Conditional aggregation after rolling in pandas

I am trying to calculate a rolling mean of a specific column based on a condition in another column.
The condition is to create three different rolling means for column A, as follows -
The rolling mean of A when column B is less than 2
The rolling mean of A when column B is equal to 2
The rolling mean of A when column B is greater than 2
Consider the following df with a window size of 2
A B
0 1 2
1 2 4
2 3 4
3 4 6
4 5 1
5 6 2
The desired output is the following:
rolling less rolling equal rolling greater
0 NaN NaN NaN
1 NaN 1 2
2 NaN NaN 2.5
3 NaN NaN 3.5
4 5 NaN 4
5 5 6 NaN
The main difficulty I encountered is that rolling works column-wise while apply works row-wise, and computing the rolling mean by hand felt too hard-coded.
Any ideas?
Thanks a lot.
You can create your 3 masked columns before rolling, then compute the rolling means:
out = df.join(df.assign(rolling_less=df.mask(df['B'] >= 2)['A'],
                        rolling_equal=df.mask(df['B'] != 2)['A'],
                        rolling_greater=df.mask(df['B'] <= 2)['A'])
                .filter(like='rolling').rolling(2, min_periods=1).mean())
print(out)
# Output
A B rolling_less rolling_equal rolling_greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
Another option is a row-wise apply that recomputes each window by slicing up to the current label (df1 is the input dataframe):

def function1(ss: pd.Series):
    # window of (at most) the last 2 rows ending at the current row's label
    df11 = df1.loc[:ss.name].tail(2)
    return pd.Series([
        df11.loc[lambda dd: dd.B < 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B == 2, 'A'].mean(),
        df11.loc[lambda dd: dd.B > 2, 'A'].mean()
    ], index=['rolling less', 'rolling equal', 'rolling greater'], name=ss.name)

df1.join(df1.apply(function1, axis=1))
A B rolling less rolling equal rolling greater
0 1 2 NaN 1.0 NaN
1 2 4 NaN 1.0 2.0
2 3 4 NaN NaN 2.5
3 4 6 NaN NaN 3.5
4 5 1 5.0 NaN 4.0
5 6 2 5.0 6.0 NaN
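For comparison, the first approach can also be written as a plain loop over the three conditions (a compact sketch; conds and out are my names):

out = df.copy()
conds = {'rolling_less': df['B'] < 2,
         'rolling_equal': df['B'] == 2,
         'rolling_greater': df['B'] > 2}
for name, cond in conds.items():
    # keep A only where the condition holds, then take the rolling mean
    out[name] = df['A'].where(cond).rolling(2, min_periods=1).mean()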

Why does inserting NaNs result in an empty plot?

I think my toy example below is self-explanatory. Basically, I can plot a line based on 5 values, yet if I insert NaNs between them the resulting line plot is empty. I would expect matplotlib to still connect the existing discrete points in my data (which are all still present).
import numpy as np
import pandas as pd

a = pd.DataFrame([1, 2, 3, 4, 5], index=range(0, 10, 2), columns=['value'])
print(a)
value
0 1
2 2
4 3
6 4
8 5
a.plot()
b = pd.DataFrame([np.nan] * 5, index=range(1, 11, 2), columns=['value'])
print(pd.concat([a, b]).sort_index())
value
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 NaN
6 4.0
7 NaN
8 5.0
9 NaN
pd.concat([a, b]).sort_index().plot()
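What happens here: matplotlib only draws a line segment between two consecutive non-NaN points, and after the concat every valid value sits between two NaNs, so there is no segment to draw. Two possible fixes, sketched on the combined frame:

c = pd.concat([a, b]).sort_index()
c.interpolate().plot()  # fill the gaps so consecutive points exist again
c.plot(marker='o')      # or draw markers so the isolated points are visible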

Applying multiple functions to a pivot table (grouped) dataframe

I currently have a dataframe which looks like this:
df:
store item sales
0 1 1 10
1 1 2 20
2 2 1 10
3 3 2 20
4 4 3 10
5 3 4 15
...
I wanted to view the total sales of each item for each store, so I used a pivot table:
p_table = pd.pivot_table(df, index='store', values='sales', columns='item', aggfunc='sum')
which gives something like:
sales
item 1 2 3 4
store
1 20 30 10 8
2 10 14 12 13
3 1 23 29 10
....
What I want to do now is apply a function so that each item's total sales becomes a percentage of that store's total sales. For example, the value for item 1 at store 1 would become 20 / (20+30+10+8) * 100.
I am struggling to do this for the stacked dataframe. Any suggestions would be much appreciated.
Thanks
I think you need to divide by the row sums, using div with the Series created by sum:
print (p_table)
item 1 2 3 4
store
1 10.0 20.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN 20.0 NaN 15.0
4 NaN NaN 10.0 NaN
print (p_table.sum(axis=1))
store
1 30.0
2 10.0
3 35.0
4 10.0
dtype: float64
out = p_table.div(p_table.sum(axis=1), axis=0)
print (out)
item 1 2 3 4
store
1 0.333333 0.666667 NaN NaN
2 1.000000 NaN NaN NaN
3 NaN 0.571429 NaN 0.428571
4 NaN NaN 1.0 NaN
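And since the goal is percentages, multiply by 100 at the end (continuing the example above):

out = p_table.div(p_table.sum(axis=1), axis=0).mul(100)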