How to broadcast a pandas DataFrame with a single-row DataFrame?

I have a DataFrame (dall), and I have a single-row DataFrame (row) that has the same columns.
How can I get d_result without writing a loop? I understand I can convert the DataFrames to NumPy arrays and broadcast, but I would imagine pandas has a way to do it directly. I have tried DataFrame.mul, but it gives me NaN results.
dall = pd.DataFrame([[5,4,3], [3,5,5], [6,6,6]], columns=['a','b','c'])
row = pd.DataFrame([[-1, 100, 0]], columns=['a','b','c'])
d_result = pd.DataFrame([[-5,400,0], [-3,500,0], [-6,600,0]], columns=['a','b','c'])
dall
a b c
0 5 4 3
1 3 5 5
2 6 6 6
row
a b c
0 -1 100 0
d_result
a b c
0 -5 400 0
1 -3 500 0
2 -6 600 0

We can use mul:
dall = dall.mul(row.loc[0], axis=1)
dall
Out[5]:
a b c
0 -5 400 0
1 -3 500 0
2 -6 600 0

You can do this by multiplying the DataFrame by a Series. Something like this:
dall * row.iloc[0]
I think this is essentially the same as @WeNYoBen's answer.
You can also multiply a DataFrame by another DataFrame, as below. But be careful: NaN values will not propagate, because they are replaced with 1.0 before the multiplication, so unmatched rows pass through unchanged instead of being broadcast.
dall.mul(row, axis='columns', fill_value=1.0)
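For completeness, the NaN results mentioned in the question come from index alignment: in dall * row (or dall.mul(row) without fill_value), row only has index 0, so rows 1 and 2 of dall have nothing to align with. A minimal sketch reproducing the problem and the Series-based fix:
import pandas as pd
dall = pd.DataFrame([[5, 4, 3], [3, 5, 5], [6, 6, 6]], columns=['a', 'b', 'c'])
row = pd.DataFrame([[-1, 100, 0]], columns=['a', 'b', 'c'])
# DataFrame * DataFrame aligns on both index and columns;
# row only has index 0, so rows 1 and 2 of the result are all NaN.
print(dall * row)
# Selecting the single row as a Series broadcasts it across every row of dall.
print(dall * row.iloc[0])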

Related

How to append to a data frame from multiple loops

I have code which takes in files from CSV and computes a price difference, but to make it simpler I made a reproducible example, as seen below. I want to append each result to the end of a specific column name. For example, the first loop will go through size 1 and minute 1, so it should append to column name 1;1, for file2, file3, file4. So the output should be:
1;1 1;2 1;3 2;1 2;2 2;3
0 0 0 (columns 2;1, 2;2, 2;3 contain the same values as the 1;x columns)
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
5 5 5
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
6 6 6
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
7 7 7
I am using a loop to set prefixed data frame columns, because in my original code the number of minutes, sizes, and files is inputted by the user.
import numpy as np
import pandas as pd
file =[1,2,3,4,5,6,6,2]
file2=[1,2,3,4,5,6,7,8]
file3=[1,2,3,4,5,6,7,9]
file4=[1,2,1,2,1,2,1,2]
size=[1,2]
minutes=[1,2,3]
list1=[file,file2,file3]
data=pd.DataFrame(file)
data2=pd.DataFrame(file2)
data3=pd.DataFrame(file3)
list1=(data,data2,data3)
datas=pd.DataFrame(file4)
col_names = [str(sizer)+';'+str(number) for sizer in size for number in minutes]
datanew=pd.DataFrame(columns=col_names)
for sizes in size:
    for minute in minutes:
        for files in list1:
            pricediff = files - data
            datanew[str(sizes)+';'+str(minute)] = datanew[str(sizes)+';'+str(minute)].append(pricediff, ignore_index=True)
print(datanew)
Edit: When trying this line: datanew=datanew.append({str(sizes)+';'+str(minute): df['pricediff']},ignore_index=True) it appends the data, but the result isn't "clean".
The result from my original data, gives me this:
111;5.0,1111;5.0
"0 4.5
1 0.5
2 8
3 8
4 8
...
704 3.5
705 0.5
706 11.5
707 0.5
708 9.0
Name: pricediff, Length: 709, dtype: object",
"price 0.0
0 0.0
Name: pricediff, dtype: float64",
"0 6.5
1 6.5
2 3.5
3 13.0
Name: pricediff, Length: 4, dtype: float64",
What you are looking for is:
datanew = datanew.append({str(sizes)+';'+str(minute): pricediff}, ignore_index=True)
This happens because you cannot change the length of a single column of a DataFrame without modifying the length of the whole frame.
Now consider the below as an example:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqr"), "b": [1,3,5,4,2,7], "c": list("pqrtuv")})
print(df)
#this will fail:
#df["c"]=df["c"].append("abc", ignore_index=True)
#print(df)
#what you can do instead:
df=df.append({"c": "abc"}, ignore_index=True)
print(df)
#you can even create new column that way:
df=df.append({"x": "abc"}, ignore_index=True)
Edit
To append a pd.Series, do literally the same:
abc=pd.Series([-1,-2,-3], name="c")
df=df.append({"c": abc}, ignore_index=True)
print(df)
abc=pd.Series([-1,-2,-3], name="x")
df=df.append({"x": abc}, ignore_index=True)

Why do I get different group sizes using pandas groupby() with or without column selection?

I am trying to use numpy.size() to count the group sizes for groups from a pandas DataFrame groupby(), and I get a strange result.
>>> df=pd.DataFrame({'A':[1,1,2,2], 'B':[1,2,3,4],'C':[0.11,0.32,0.93,0.65],'D':["This","That","How","What"]})
>>> df
A B C D
0 1 1 0.11 This
1 1 2 0.32 That
2 2 3 0.93 How
3 2 4 0.65 What
>>> df.groupby('A',as_index=False).agg(np.size)
A B C D
0 1 2 2.0 2
1 2 2 2.0 2
>>> df.groupby('A',as_index=False)['C'].agg(np.size)
A C
0 1 8
1 2 8
>>> df.groupby('A',as_index=False)[['C']].agg(np.size)
A C
0 1 2.0
1 2 2.0
>>> grouped = df.groupby('A',as_index=False)
>>> grouped['C','D'].agg(np.size)
A C D
0 1 2.0 2
1 2 2.0 2
In the code above, if we follow groupby() with ['C'], the reported group size is 8, which equals the correct group size times the number of columns, that is 2 * 4. If we follow groupby() with [['C']] or ['C','D'], the group size is right.
Why?
It seems that pandas executes the aggregation first and then does the actual column selection.
If you want to know the group size, use one of these:
grouped.size()
grouped.agg("size")
Note that len(grouped) returns the number of groups, not their sizes.
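For the example df above, a quick check with a plain groupby (without as_index=False) shows the per-group counts directly:
>>> df.groupby('A').size()
A
1    2
2    2
dtype: int64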

pandas column operation on certain row in succession

I have a pandas DataFrame like this:
second block
0 1 a
1 2 b
2 3 c
3 4 a
4 5 c
This is sequential data, and I would like to get a new column which is the time difference between the current block and the next time it repeats.
second block freq
0 1 a 3 //(4-1)
1 2 b 0 //(not repeating)
2 3 c 2 //(5-3)
3 4 a 0 //(not repeating)
4 5 c 0 //(not repeating)
I have tried getting the unique list of blocks, then a for loop that does as below:
for i in unique_block:
    df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
I do not know how to get 0 for row indices 1, 3, 4, and since the dataframe is very big, this is not efficient. It is not working.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block').diff(-1) * -1).fillna(0)
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
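An aside: diff(-1) above runs over every non-grouping column of the frame; it works here because second is the only one. If your real frame has more columns, selecting the column explicitly is safer:
df['freq'] = (df.groupby('block')['second'].diff(-1) * -1).fillna(0)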
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
Using diff and shift:
df.groupby('block').second.apply(lambda x: x.diff().shift(-1)).fillna(0)
Out[242]:
0    3.0
1    0.0
2    2.0
3    0.0
4    0.0
Name: second, dtype: float64

Assigning to a slice from another DataFrame requires matching column names?

If I want to set (replace) part of a DataFrame with values from another, I should be able to assign to a slice (as in this question) like this:
df.loc[rows, cols] = df2
Not so in this case, it nulls out the slice instead:
In [32]: df
Out[32]:
A B
0 1 -0.240180
1 2 -0.012547
2 3 -0.301475
In [33]: df2
Out[33]:
C
0 x
1 y
2 z
In [34]: df.loc[:,'B']=df2
In [35]: df
Out[35]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
But it does work with just a column (Series) from df2, which is not an option if I want multiple columns:
In [36]: df.loc[:,'B']=df2['C']
In [37]: df
Out[37]:
A B
0 1 x
1 2 y
2 3 z
Or if the column names match:
In [47]: df3
Out[47]:
B
0 w
1 a
2 t
In [48]: df.loc[:,'B']=df3
In [49]: df
Out[49]:
A B
0 1 w
1 2 a
2 3 t
Is this expected? I don't see any explanation for it in the docs or on Stack Overflow.
Yes, this is expected. Label alignment is one of the core features of pandas. When you use df.loc[:,'B'] = df2, pandas needs to align the two DataFrames:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN NaN x
1 NaN NaN y
2 NaN NaN z)
The above shows how each DataFrame looks when aligned as a tuple (the first one is df and the second one is df2). If your df2 also had a column named B with values [1, 2, 3], it would become:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN 1 x
1 NaN 2 y
2 NaN 3 z)
Since B's are aligned, your assignment would result in
df.loc[:,'B'] = df2
df
Out:
A B
0 1 1
1 2 2
2 3 3
When you use a Series, the alignment will be on a single axis (on index in your example). Since they exactly match, there will be no problem and it will assign the values from df2['C'] to df['B'].
You can either rename the labels before the alignment or use a data structure that doesn't have labels (a numpy array, a list, a tuple...).
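For instance, a minimal sketch of the rename route with the frames above (only the column label changes; the values are untouched):
df.loc[:, 'B'] = df2.rename(columns={'C': 'B'})
df
A B
0 1 x
1 2 y
2 3 z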
You can use the underlying NumPy array:
df.loc[:,'B'] = df2.values
df
A B
0 1 x
1 2 y
2 3 z
Pandas indexing is always sensitive to labeling of both rows and columns. In this case, your rows check out, but your columns do not. (B != C).
Using the underlying NumPy array makes the operation index-insensitive.
The reason this works when you assign a Series is that a Series has no concept of columns; the only alignment is on the rows, which match.
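On recent pandas versions, df2.to_numpy() is the documented equivalent of df2.values; either one strips the labels, making the assignment purely positional:
df.loc[:,'B'] = df2.to_numpy()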

How to insert a new integer index in pandas

I made a value-count DataFrame from another DataFrame,
for example:
freq
0 2
0.33333 10
1.66667 13
Automatically, its indexes are 0, 0.33333, 1.66667, and the indexes can vary, because I intend to make many DataFrames based on a specific value.
How can I insert an integer index? Like this:
freq
0 0 2
1 0.33333 10
2 1.66667 13
Thanks
The result you get back from value_counts is a Series, and to set a generic 0 ... n-1 integer index, you can use reset_index:
In [4]: s = pd.Series([0,0.3,0.3,1.6])
In [5]: s.value_counts()
Out[5]:
0.3 2
1.6 1
0.0 1
dtype: int64
In [9]: s.value_counts().reset_index(name='freq')
Out[9]:
index freq
0 0.3 2
1 1.6 1
2 0.0 1
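If you also want a friendlier name than index for the value column, rename works (the name value here is just an illustrative choice):
In [10]: s.value_counts().reset_index(name='freq').rename(columns={'index': 'value'})
Out[10]:
value freq
0 0.3 2
1 1.6 1
2 0.0 1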