Hi, I would like to combine the columns of a dataframe into a single column, such as:
col1 col2
1 1.4 1.5
2 2.3 2.6
3 3.6 6.7
to
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
Thanks for your help
Use stack, then drop the second index level with reset_index, and finally create a one-column DataFrame with to_frame:
df = df.stack().reset_index(level=1, drop=True).to_frame('col1&2')
print (df)
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
Or:
df = pd.DataFrame({'col1&2': df.values.ravel()}, index=np.repeat(df.index, df.shape[1]))
print (df)
col1&2
1 1.4
1 1.5
2 2.3
2 2.6
3 3.6
3 6.7
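For anyone who wants to run both approaches end to end, here is a minimal self-contained sketch. The frame construction is an assumption reconstructed from the sample in the question, and `n_cols` is a name introduced here so that the repeat count is not hard-coded.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question (index 1..3 is assumed).
df = pd.DataFrame({'col1': [1.4, 2.3, 3.6],
                   'col2': [1.5, 2.6, 6.7]},
                  index=[1, 2, 3])

# Approach 1: stack the columns row by row, then drop the column-name level.
stacked = df.stack().reset_index(level=1, drop=True).to_frame('col1&2')

# Approach 2: flatten the underlying array in row-major order and
# repeat each index label once per original column.
n_cols = df.shape[1]
flat = pd.DataFrame({'col1&2': df.to_numpy().ravel()},
                    index=np.repeat(df.index, n_cols))

print(stacked)
print(flat)
```

Both frames should hold the same values in the same order; the second approach skips the intermediate MultiIndex that stack creates.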
I have two dataframes: (i) one has a two-level index and two-level column headers, and (ii) the other has a single-level index and a single-level header. The second level of each axis in the first dataframe corresponds to the matching axis of the second dataframe. I need to multiply the two dataframes based on that relation between the axes.
Dataframe 1:
Dataframe 2:
Expected result (multiplication by index/header):
Try using pd.DataFrame.mul with the level parameter:
import pandas as pd
df = pd.DataFrame([[9,10,2,1,6,5],
[4, 0,3,4,6,6],
[9, 3,9,1,2,3],
[3, 5,9,3,9,0],
[4,4,8,5,10,5],
[5, 3,1,8,5,6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3+[2021]*3,[1,2,3,1,2,3]])
df.index = pd.MultiIndex.from_arrays([[1]*3+[2]*3,[1,2,3,1,2,3]])
print(df)
print('\n')
df2 = pd.DataFrame([[.1,.3,.6],[.4,.4,.3],[.5,.4,.1]], index=[1,2,3], columns=[1,2,3])
print(df2)
print('\n')
df_out = df.mul(df2, level=1)
print(df_out)
Output:
2020 2021
1 2 3 1 2 3
1 1 9 10 2 1 6 5
2 4 0 3 4 6 6
3 9 3 9 1 2 3
2 1 3 5 9 3 9 0
2 4 4 8 5 10 5
3 5 3 1 8 5 6
1 2 3
1 0.1 0.3 0.6
2 0.4 0.4 0.3
3 0.5 0.4 0.1
2020 2021
1 2 3 1 2 3
1 1 0.9 3.0 1.2 0.1 1.8 3.0
2 1.6 0.0 0.9 1.6 2.4 1.8
3 4.5 1.2 0.9 0.5 0.8 0.3
2 1 0.3 1.5 5.4 0.3 2.7 0.0
2 1.6 1.6 2.4 2.0 4.0 1.5
3 2.5 1.2 0.1 4.0 2.0 0.6
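To see what the level-based alignment is doing, the same product can be spelled out by hand. This is only a sketch for illustration, not a replacement for mul: the names `inner_rows`, `inner_cols`, `factors` and `manual` are introduced here, and the approach assumes every second-level label of df exists in df2.

```python
import pandas as pd

df = pd.DataFrame([[9, 10, 2, 1, 6, 5],
                   [4, 0, 3, 4, 6, 6],
                   [9, 3, 9, 1, 2, 3],
                   [3, 5, 9, 3, 9, 0],
                   [4, 4, 8, 5, 10, 5],
                   [5, 3, 1, 8, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([[2020]*3 + [2021]*3, [1, 2, 3, 1, 2, 3]])
df.index = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, [1, 2, 3, 1, 2, 3]])
df2 = pd.DataFrame([[.1, .3, .6], [.4, .4, .3], [.5, .4, .1]],
                   index=[1, 2, 3], columns=[1, 2, 3])

# Second-level labels of each axis of df; these are the keys into df2.
inner_rows = df.index.get_level_values(1)
inner_cols = df.columns.get_level_values(1)

# Look up the df2 factor for every cell of df (a 6x6 array).
factors = df2.loc[inner_rows, inner_cols].to_numpy()

# Multiply cell by cell and restore the original MultiIndex labels.
manual = pd.DataFrame(df.to_numpy() * factors,
                      index=df.index, columns=df.columns)
print(manual)
```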
I have a dict with 3 keys, and each value is a list of numpy arrays.
I'd like to append this dictionary to an empty dataframe so that the first element of each numpy array goes into column 'x', the second element goes into column 'y', and the dict key goes into the final column 'z'. The dict looks like this:
my_dict = {0: [array([5.4, 3.9, 1.3, 0.4]), array([4.9, 3. , 1.4, 0.2]),array([4.6, 3.6, 1. , 0.2]), array([4.6, 3.2, 1.4, 0.2]), array([4.7, 3.2, 1.6, 0.2])],
1: [array([6.1, 2.9, 4.7, 1.4]), array([5.9, 3. , 4.2, 1.5]), array([7.4, 2.8, 6.1, 1.9])],
2: [array([7. , 3.2, 4.7, 1.4]), array([5.6, 2.7, 4.2, 1.3])]}
I'd like to get the below df:
x y z
0 5.4 3.9 0
1 4.9 3. 0
2 4.6 3.6 0
3 4.6 3.2 0
4 4.7 3.2 0
5 6.1 2.9 1
6 5.9 3. 1
7 7.4 2.8 1
8 7. 3.2 2
9 5.6 2.7 2
It's a bit tricky. How can I do it?
This will do it:
data = [j[:2].tolist() + [k] for k, v in my_dict.items() for j in v]
df = pd.DataFrame(data, columns=list('xyz'))
df
x y z
0 5.4 3.9 0
1 4.9 3.0 0
2 4.6 3.6 0
3 4.6 3.2 0
4 4.7 3.2 0
5 6.1 2.9 1
6 5.9 3.0 1
7 7.4 2.8 1
8 7.0 3.2 2
9 5.6 2.7 2
Try this:
target_df = pd.DataFrame(columns=['x', 'y', 'z'])  # empty dataframe
for k, v in my_dict.items():
    for val in v:
        # one-row frame: first two array elements plus the dict key
        d = {'x': [val[0]], 'y': [val[1]], 'z': [k]}
        target_df = pd.concat([target_df, pd.DataFrame(d)], ignore_index=True)
print(target_df)
This gives the desired dataframe:
x y z
0 5.4 3.9 0
1 4.9 3.0 0
2 4.6 3.6 0
3 4.6 3.2 0
4 4.7 3.2 0
5 6.1 2.9 1
6 5.9 3.0 1
7 7.4 2.8 1
8 7.0 3.2 2
9 5.6 2.7 2
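Calling pd.concat inside a loop copies the accumulated frame on every iteration, so a common alternative is to collect plain records first and build the frame once. A minimal sketch, assuming a shortened version of the dict from the question (the `records` name is introduced here):

```python
import numpy as np
import pandas as pd

# Shortened sample of the question's dict, for illustration only.
my_dict = {0: [np.array([5.4, 3.9, 1.3, 0.4]), np.array([4.9, 3.0, 1.4, 0.2])],
           1: [np.array([6.1, 2.9, 4.7, 1.4])],
           2: [np.array([7.0, 3.2, 4.7, 1.4]), np.array([5.6, 2.7, 4.2, 1.3])]}

# One record per array: first element -> x, second -> y, dict key -> z.
records = [{'x': arr[0], 'y': arr[1], 'z': key}
           for key, arrays in my_dict.items()
           for arr in arrays]

# Build the frame in a single call instead of growing it row by row.
target_df = pd.DataFrame(records, columns=['x', 'y', 'z'])
print(target_df)
```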
I have the following issue with groupby aggregation: I want the output to include groups that are not present in the dataframe but should appear in the desired result. An example:
import pandas as pd
from io import StringIO
csvdata = StringIO("""day,sale
1,1
2,4
2,10
4,7
5,2.3
7,4.4
2,3.4""")
# days 3 and 6 are intentionally missing here, but I'd like them in the output
df = pd.read_csv(csvdata, sep=",")
df1=df.groupby(['day'])['sale'].agg('sum').reset_index().rename(columns={'sale':'dailysale'})
df1
How can I get the following? Thank you!
1 1.0
2 17.4
3 0.0
4 7.0
5 2.3
6 0.0
7 4.4
You can chain Series.reindex with the full range of days after aggregating with sum:
df1 = (df.groupby(['day'])['sale']
.sum()
.reindex(range(1, 8), fill_value=0)
.reset_index(name='dailysale'))
print (df1)
day dailysale
0 1 1.0
1 2 17.4
2 3 0.0
3 4 7.0
4 5 2.3
5 6 0.0
6 7 4.4
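Another idea is to convert the column to an ordered categorical, so that the sum aggregation also returns rows for the missing categories (observed=False is passed explicitly because its default changes in newer pandas versions):
df['day'] = pd.Categorical(df['day'], categories=range(1, 8), ordered=True)
df1 = df.groupby(['day'], observed=False)['sale'].sum().reset_index(name='dailysale')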
print (df1)
day dailysale
0 1 1.0
1 2 17.4
2 3 0.0
3 4 7.0
4 5 2.3
5 6 0.0
6 7 4.4
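If the day range should not be hard-coded, it can be derived from the data. The sketch below is self-contained and assumes the missing days always lie between the smallest and largest day present; `all_days` is a name introduced here.

```python
from io import StringIO

import pandas as pd

csvdata = StringIO("""day,sale
1,1
2,4
2,10
4,7
5,2.3
7,4.4
2,3.4""")
df = pd.read_csv(csvdata)

# Every day from the first to the last one observed, inclusive.
all_days = range(df['day'].min(), df['day'].max() + 1)

df1 = (df.groupby('day')['sale']
         .sum()
         .reindex(all_days, fill_value=0)   # add the missing days with 0
         .rename_axis('day')                # make sure the index is named
         .reset_index(name='dailysale'))
print(df1)
```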
Is there a way using pandas functions to add values/rows by a particular increment?
For example:
This is what I have:
df = pd.DataFrame([1.1,2,2.8])
df
value other1 other2
zebra 0.3 250
bunny 0.7 10
rat 1.0 35
cat 1.1 100
dog 2.0 150
mouse 2.8 125
EDIT 1:
This is what I want, where ideally the inserted rows' index are whatever is easiest but the previous row names are preserved.
df_goal = pd.DataFrame([1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8])
df_goal
value other1 other2
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
1 1.2
2 1.3
3 1.4
4 1.5
5 1.6
6 1.7
7 1.8
8 1.9
dog 2.0 150
10 2.1
11 2.2
12 2.3
13 2.4
14 2.5
15 2.6
16 2.7
mouse 2.8 125
EDIT 2:
Also I would like to keep the values of other columns that were there previously and any new rows are simply empty or zero.
I think you can use reindex by numpy.arange:
# move the row labels into an 'index' column and index by the value column
df = df.reset_index().set_index('value')
# reindex onto a regular float grid; the tolerance absorbs floating-point error
s = 0.1
a = np.arange(df.index.min(),df.index.max() + s, step=s)
df = df.reindex(a, tolerance=s/2., method='nearest')
# replace NaN in the other columns with empty strings
cols = df.columns.difference(['index'])
df[cols] = df[cols].fillna('')
# fill the missing row labels with the row's position
s = pd.Series(np.arange(len(df.index)), index=df.index)
df['index'] = df['index'].combine_first(s)
# turn value back into a column and restore the labels as the index
df = df.reset_index().set_index('index')
print (df)
value other1 other2
index
zebra 0.3 250
1 0.4
2 0.5
3 0.6
bunny 0.7 10
5 0.8
6 0.9
rat 1.0 35
cat 1.1 100
9 1.2
10 1.3
11 1.4
12 1.5
13 1.6
14 1.7
15 1.8
16 1.9
dog 2.0 150
18 2.1
19 2.2
20 2.3
21 2.4
22 2.5
23 2.6
24 2.7
mouse 2.8 125
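A variant that sidesteps float comparison altogether is to work on an integer grid of steps and convert back at the end. This is a sketch under assumptions, not the method above: the sample frame is reconstructed from the question with a single extra column, and the names `ticks`, `full` and `out` are introduced here. New rows are left as NaN rather than empty strings.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question (only one extra column kept).
df = pd.DataFrame({'value': [0.3, 0.7, 1.0, 1.1, 2.0, 2.8],
                   'other1': [250, 10, 35, 100, 150, 125]},
                  index=['zebra', 'bunny', 'rat', 'cat', 'dog', 'mouse'])

step = 0.1
# Express each value as a whole number of steps (0.3 -> 3, 2.8 -> 28).
ticks = np.rint(df['value'] / step).astype(int).to_numpy()
# Every step between the smallest and largest value, inclusive.
full = np.arange(ticks.min(), ticks.max() + 1)

out = (df.rename_axis('name')      # keep the row labels as a column
         .reset_index()
         .assign(tick=ticks)
         .set_index('tick')
         .reindex(full))           # inserted rows are filled with NaN

# Rebuild the value column from the integer grid.
out['value'] = np.round(full * step, 1)
print(out)
```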
I have a pandas dataframe like
y m A B
1990 1 3.4 5
2 4 4.9
...
1990 12 4.0 4.5
...
2000 1 2.3 8.1
2 3.7 5.0
...
2000 12 2.4 9.1
I would like to select months 2-12 from the second index level (m) and years 1991-2000 from the first. I don't seem to get the MultiIndex slicing correct. E.g. I tried
idx = pd.IndexSlice
dfa = df.loc[idx[1:,1:],:]
but that does not seem to slice on the first index. Any suggestions on an elegant solution?
Cheers, Mike
Without sample code to reproduce your df it is difficult to guess, but if your df is similar to:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(""" y m A B
1990 1 3.4 5
1990 2 4 4.9
1990 12 4.0 4.5
2000 1 2.3 8.1
2000 2 3.7 5.0
2000 12 2.4 9.1"""), sep=r'\s+')
df
y m A B
0 1990 1 3.4 5.0
1 1990 2 4.0 4.9
2 1990 12 4.0 4.5
3 2000 1 2.3 8.1
4 2000 2 3.7 5.0
5 2000 12 2.4 9.1
Then this code will extract what you need:
# range() excludes its end point, so 2001 and 13 cover years up to 2000 and months up to 12
print(df.loc[df['y'].isin(range(1991, 2001)) & df['m'].isin(range(2, 13))])
      y   m    A    B
4  2000   2  3.7  5.0
5  2000  12  2.4  9.1
If, however, your df is indexed by y and m, then this will do the same:
df.set_index(['y','m'], inplace=True)
years = df.index.get_level_values(0).isin(range(1991, 2001))
months = df.index.get_level_values(1).isin(range(2, 13))
df.loc[years & months]
           A    B
y    m
2000 2   3.7  5.0
     12  2.4  9.1
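The question asked for a pd.IndexSlice solution. The attempt there did not narrow the first level because .loc slices by label rather than by position, so idx[1:, 1:] selects every label from 1 upward on both levels. A minimal sketch for years 1991 to 2000 and months 2 to 12, assuming a hypothetical sample frame with illustrative values and a sorted (y, m) MultiIndex:

```python
import pandas as pd

# Hypothetical sample with a (y, m) MultiIndex; the values are illustrative.
df = pd.DataFrame(
    {'A': [3.4, 4.0, 4.0, 1.0, 2.0, 3.0, 2.3, 3.7, 2.4],
     'B': [5.0, 4.9, 4.5, 1.5, 2.5, 3.5, 8.1, 5.0, 9.1]},
    index=pd.MultiIndex.from_product([[1990, 1991, 2000], [1, 2, 12]],
                                     names=['y', 'm']))

# Label slices on a MultiIndex need a lexsorted index.
df = df.sort_index()

idx = pd.IndexSlice
# Label-based slices include both end points:
# years 1991..2000 on level y, months 2..12 on level m.
dfa = df.loc[idx[1991:2000, 2:12], :]
print(dfa)
```

Unlike the isin approach, this keeps the selection in a single .loc call, at the cost of requiring the index to be sorted first.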