Pandas append row without specifying columns - pandas

I wanted to add or append a row (in the form of a list) to a dataframe. All the methods require that I turn the list into another dataframe first, e.g.
df = df.append(another_dataframe)
df = df.merge(another_dataframe)
df = pd.concat([df, another_dataframe])
I've found a trick that works if the index is a running number, at https://www.statology.org/pandas-add-row-to-dataframe/:
import pandas as pd
#create DataFrame
df = pd.DataFrame({'points': [10, 12, 12, 14, 13, 18],
'rebounds': [7, 7, 8, 13, 7, 4],
'assists': [11, 8, 10, 6, 6, 5]})
#view DataFrame
df
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
#add new row to end of DataFrame
df.loc[len(df.index)] = [20, 7, 5]
#view updated DataFrame
df
points rebounds assists
0 10 7 11
1 12 7 8
2 12 8 10
3 14 13 6
4 13 7 6
5 18 4 5
6 20 7 5
However, the dataframe must have a running-number index, or else the add/append will overwrite existing data.
So my question is: is there a simple, foolproof way to just append/add a list to a dataframe?
Thanks very much!!!

>>> df
points rebounds assists
3 10 7 11
1 12 7 8
2 12 8 10
If the indexes are "numbers" - you could add 1 to the max index.
>>> df.loc[max(df.index) + 1] = 'my', 'new', 'row'
>>> df
points rebounds assists
3 10 7 11
1 12 7 8
2 12 8 10
4 my new row
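Building on the answer above, here is a sketch that works even when the index is not a running number: wrap the list in a one-row DataFrame and concat it, picking a fresh label so nothing is overwritten (the label choice assumes a numeric index; the small demo frame is mine, not from the linked article):

```python
import pandas as pd

# a frame whose index is NOT a running number
df = pd.DataFrame({'points': [10, 12], 'rebounds': [7, 7], 'assists': [11, 8]},
                  index=[3, 1])

new_row = [20, 7, 5]
# pick a label that cannot clash with an existing one (assumes numeric index)
label = df.index.max() + 1 if len(df) else 0
df = pd.concat([df, pd.DataFrame([new_row], columns=df.columns, index=[label])])
```

Unlike `df.loc[label] = ...`, which silently replaces the row when `label` already exists, `pd.concat` always adds a new row.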

Related

Rolling count unique in dataframe's rows or ndarray

Given a dataframe df, how to calculate the rolling count of unique values along the rows, with window size = n?
Input data:
import pandas as pd
import numpy as np
data = {'col_0':[7, 8, 9, 10, 11, 12],
'col_1':[4, 5, 6, 7, 8, 9],
'col_2':[2, 5, 8, 11, 14, 15],
'col_3':[2, 6, 10, 14, 18, 21],
'col_4':[7, 5, 7, 5, 7, 5],
'col_5':[2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5
0 7 4 2 2 7 2
1 8 5 5 6 5 6
2 9 6 8 10 7 10
3 10 7 11 14 5 14
4 11 8 14 18 7 18
5 12 9 15 21 5 21
Expected output (with window size = 2):
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
For the example above with window size = 2.
At window 0's array we have row[0].
[[7 4 2 2 7 2]]
rolling_nunique[0] is 3 with the elements being [2, 4, 7].
At window 1's array we have row[0] & row[1].
[[7 4 2 2 7 2]
[8 5 5 6 5 6]]
rolling_nunique[1] is 6 with the elements being [2, 4, 5, 6, 7, 8].
At window 2's array we have row[1] & row[2].
[[ 8 5 5 6 5 6]
[ 9 6 8 10 7 10]]
rolling_nunique[2] is 6 with the elements being [5, 6, 7, 8, 9, 10].
etc.
Using sliding_window_view, you can customize how the values are aggregated in the sliding window. To get values for all rows before the sliding window is full (i.e., to emulate min_periods=1 in pd.rolling), we need to add some padding rows at the top; this can be done using vstack and full. At the end, we account for these added NaN values by filtering them out.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

w = 2
# pad (w - 1) rows of NaN on top so the first windows are partially filled
values = np.vstack([np.full([w - 1, df.shape[1]], np.nan), df.values])
m = sliding_window_view(values, w, axis=0).reshape(len(df), -1)
# count distinct values per window, ignoring the NaN padding
unique_count = [len(np.unique(r[~np.isnan(r)])) for r in m]
df['rolling_nunique'] = unique_count
Result:
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
I found it could be resolved by using sliding_window_view() from numpy.
Here's the approach:
import numpy as np
import pandas as pd

rolling = 2
ar = df.values  # turn into np.ndarray
length = ar.shape[1]
head_arrs = np.zeros((rolling - 1, rolling * length))
cuboid = np.lib.stride_tricks.sliding_window_view(ar, (rolling, length)).astype(float)
plane = cuboid.reshape(-1, rolling * length)
for i in range(rolling - 1, 0, -1):
    head_arr_l = plane[0, :i * length]
    head_arr_l = np.pad(head_arr_l.astype(float), (0, length * (rolling - i)),
                        'constant', constant_values=np.nan)
    head_arr_l = np.roll(head_arr_l, length * (rolling - i))
    head_arrs[i - 1, :] = head_arr_l
plane = np.insert(plane, 0, head_arrs, axis=0)
df['rolling_nunique'] = pd.DataFrame(plane).nunique(axis=1)
df
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
[reference] numpy.lib.stride_tricks.sliding_window_view
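For comparison, here is a plain-Python sketch without sliding_window_view, using the same data as the question: slice the trailing w rows at each position and count distinct values with np.unique. It is clearer, though not vectorized:

```python
import numpy as np
import pandas as pd

data = {'col_0': [7, 8, 9, 10, 11, 12],
        'col_1': [4, 5, 6, 7, 8, 9],
        'col_2': [2, 5, 8, 11, 14, 15],
        'col_3': [2, 6, 10, 14, 18, 21],
        'col_4': [7, 5, 7, 5, 7, 5],
        'col_5': [2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)

w = 2
# for each row i, look at rows max(0, i-w+1) .. i and count distinct values
df['rolling_nunique'] = [
    len(np.unique(df.values[max(0, i - w + 1): i + 1]))
    for i in range(len(df))
]
```

For large frames the sliding_window_view approaches above avoid the per-row Python loop, but for moderate sizes this reads closer to the problem statement.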

Length of passed values is 1, index implies 10

Why do I get this error, and what is it about? It shows "Length of passed
values is 1, index implies 10". I tried many times to run the
code and I come across the same error:
ser = pd.Series(np.random.randint(1, 50, 10))
result = np.argwhere(ser % 3==0)
print(result)
argwhere() operates on a numpy array, not a pandas Series. See below:
a = np.random.randint(1, 50, 12)
a = pd.Series(a)
print(a)
np.argwhere(a.values%3==0)
output
0 28
1 46
2 4
3 40
4 19
5 26
6 6
7 24
8 26
9 30
10 33
11 27
dtype: int64
Out[250]:
array([[ 6],
[ 7],
[ 9],
[10],
[11]])
Please read the documentation for numpy.random.randint. You will see that the parameters are (low, high, size).
In your case, you are passing (1, 50, 10), so 10 random numbers will be generated between 1 (inclusive) and 50 (exclusive).
If you want multiples of 3, then you need to do ser[ser % 3 == 0], not use np.argwhere.
See similar issue raised earlier and answered on Stack Overflow
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 50, 10))
print (ser)
result = ser[ser % 3==0]
print(result)
Output of this will be:
Original Series.
0 17
1 34
2 29
3 15
4 24
5 20
6 21
7 48
8 6
9 42
dtype: int64
Multiples of 3 will be:
3 15
4 24
6 21
7 48
8 6
9 42
dtype: int64
Use Index.tolist:
In [1374]: ser
Out[1374]:
0 44
1 5
2 35
3 10
4 16
5 20
6 25
7 9
8 44
9 16
dtype: int64
In [1372]: l = ser[ser % 3 == 0].index.tolist()
In [1373]: l
Out[1373]: [7]
where l will be a list of indexes of elements which are multiples of 3.
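If you specifically want argwhere-style positional output, one sketch is to operate on the underlying array via .values and flatten the 2-D result (fixed values here so the output is reproducible; your random series will differ):

```python
import numpy as np
import pandas as pd

ser = pd.Series([3, 4, 6, 7, 9])
# argwhere on the ndarray returns an (n, 1) array of positions; ravel flattens it
pos = np.argwhere(ser.values % 3 == 0).ravel()
# pos → array([0, 2, 4])
```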

Pandas Dataframe get trend in column

I have a dataframe:
np.random.seed(1)
df1 = pd.DataFrame({'day':[3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6],
'item': [1, 1, 2, 2, 1, 2, 3, 3, 4, 3, 4],
'price':np.random.randint(1,30,11)})
day item price
0 3 1 6
1 4 1 12
2 4 2 13
3 4 2 9
4 5 1 10
5 5 2 12
6 5 3 6
7 5 3 16
8 5 4 1
9 6 3 17
10 6 4 2
After the groupby code gb = df1.groupby(['day','item'])['price'].mean(), I get:
gb
day item
3 1 6
4 1 12
2 11
5 1 10
2 12
3 11
4 1
6 3 17
4 2
Name: price, dtype: int64
I want to get the trend from the groupby series, placed back into the dataframe column price. The trend is the variation of the item's price with respect to the previous day's price:
day item price
0 3 1 nan
1 4 1 6
2 4 2 nan
3 4 2 nan
4 5 1 -2
5 5 2 1
6 5 3 nan
7 5 3 nan
8 5 4 nan
9 6 3 6
10 6 4 1
Please help me to code the last step. A single/double line of code would be most helpful. As the actual dataframe is huge, I would like to avoid iterations.
Hope this helps!
#get the average values
mean_df=df1.groupby(['day','item'])['price'].mean().reset_index()
#rename columns
mean_df.columns=['day','item','average_price']
#sort by day and item in ascending order
mean_df=mean_df.sort_values(by=['day','item'])
#shift the price for each item and each day
mean_df['shifted_average_price'] = mean_df.groupby(['item'])['average_price'].shift(1)
#combine with original df
df1=pd.merge(df1,mean_df,on=['day','item'])
#replace the price by difference of previous day's
df1['price']=df1['price']-df1['shifted_average_price']
#drop unwanted columns
df1.drop(['average_price', 'shifted_average_price'], axis=1, inplace=True)

How to group consecutive values in other columns into ranges based on one column

I have the following dataframe:
I would like to get the following output from the dataframe
Is there any way to group the other columns ['B', 'index'] based on column 'A', using the groupby aggregate function or pivot_table in pandas?
I couldn't think of an approach to write the code.
Use:
df=df.reset_index() # if 'index' is not already a column
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
In pandas <0.25:
new_df=df.groupby(g,as_index=False).agg({'index':list,'A':'first','B':lambda x: list(x.unique())})
If you want to de-duplicate repeated values in the index, use the same function for the index column as for B:
new_df=df.groupby(g,as_index=False).agg(index=('index',lambda x: list(x.unique())),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
Here is an example:
df=pd.DataFrame({'index':range(20),
'A':[1,1,1,1,2,2,0,0,0,1,1,1,1,1,1,0,0,0,3,3]
,'B':[1,2,3,5,5,5,7,8,9,9,9,12,12,14,15,16,17,18,19,20]})
print(df)
index A B
0 0 1 1
1 1 1 2
2 2 1 3
3 3 1 5
4 4 2 5
5 5 2 5
6 6 0 7
7 7 0 8
8 8 0 9
9 9 1 9
10 10 1 9
11 11 1 12
12 12 1 12
13 13 1 14
14 14 1 15
15 15 0 16
16 16 0 17
17 17 0 18
18 18 3 19
19 19 3 20
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
index A B
0 [0, 1, 2, 3] 1 [1, 2, 3, 5]
1 [4, 5] 2 [5]
2 [6, 7, 8] 0 [7, 8, 9]
3 [9, 10, 11, 12, 13, 14] 1 [9, 12, 14, 15]
4 [15, 16, 17] 0 [16, 17, 18]
5 [18, 19] 3 [19, 20]

Add rows as columns in pandas

I'm trying to change my dataset by making all the rows into columns in pandas.
5 6 7
8 9 10
Needs to be changed as
5 6 7 8 9 10
with different headers of course, any suggestions??
Use pd.DataFrame([df.values.flatten()]) as follows:
In [18]: df
Out[18]:
0 1 2
0 5 6 7
1 8 9 10
In [19]: pd.DataFrame([df.values.flatten()])
Out[19]:
0 1 2 3 4 5
0 5 6 7 8 9 10
Explanation:
df.values returns numpy.ndarray:
In [18]: df.values
Out[18]:
array([[ 5, 6, 7],
[ 8, 9, 10]], dtype=int64)
In [19]: type(df.values)
Out[19]: numpy.ndarray
and numpy arrays have .flatten() method:
In [20]: df.values.flatten?
Docstring:
a.flatten(order='C')
Return a copy of the array collapsed into one dimension.
In [21]: df.values.flatten()
Out[21]: array([ 5, 6, 7, 8, 9, 10], dtype=int64)
Pandas.DataFrame constructor expects lists/arrays of rows:
If we try this:
In [22]: pd.DataFrame([ 5, 6, 7, 8, 9, 10])
Out[22]:
0
0 5
1 6
2 7
3 8
4 9
5 10
Pandas thinks that it's a list of rows, where each row has one element.
So I've enclosed that array in square brackets:
In [23]: pd.DataFrame([[ 5, 6, 7, 8, 9, 10]])
Out[23]:
0 1 2 3 4 5
0 5 6 7 8 9 10
which will be understood as one row with 6 columns.
or just in one line:
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.values.flatten()
#out: array([1, 2, 3, 4, 5, 6])
you can also use reduce():
import pandas as pd
from functools import reduce
df = pd.DataFrame([[5, 6, 7],[8, 9, 10]])
df = pd.DataFrame([reduce(lambda x, y: x + list(y[1]), df.iterrows(), [])])
df
0 1 2 3 4 5
0 5 6 7 8 9 10
Use the reshape function from numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame([[5, 6, 7],[8, 9, 10]])
nparray = np.array(df.iloc[:,:])
x = np.reshape(nparray, -1) # flatten to 1-D
df = pd.DataFrame([x]) # to convert back to a one-row dataframe
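Note that pd.DataFrame on a 1-D array builds a single column; reshaping to (1, -1) before constructing keeps everything in one row without wrapping in a list. A small sketch (wide is my own name):

```python
import pandas as pd

df = pd.DataFrame([[5, 6, 7], [8, 9, 10]])
# reshape(1, -1) keeps the array 2-D: one row, as many columns as needed
wide = pd.DataFrame(df.values.reshape(1, -1))
```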