How can I change a MultiIndex data to a Stata data file? - pandas

I have a MultiIndex DataFrame in pandas, and I want to export it so that Stata can import the file directly. The DataFrame looks as follows:
fund
NAME_CITY 七台河市 万宁市 三亚市 三明市 三沙市
year nsfc
2000 B 0.0 0.0 0.0 0.0 0.0
C 0.0 0.0 0.0 0.0 0.0
D 0.0 0.0 0.0 0.0 0.0
E 0.0 0.0 0.0 0.0 0.0
F 0.0 0.0 0.0 0.0 0.0
I have three indexes: year, nsfc, and the names of China's cities (七台河市, 万宁市, ...). How can I drop the first row, which holds "fund", and set 七台河市, 万宁市, etc. as the column names, so that the file satisfies Stata's variable-name requirements?
I know the simplest method is to export the file to CSV and edit it manually, but I wonder how to do it with pandas.
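A minimal sketch of one way to do this in pandas, assuming the layout above (a row MultiIndex of (year, nsfc) and a columns index named fund). The city labels here are ASCII stand-ins, since pandas' Stata writer may rename variable names containing non-ASCII characters:

```python
import pandas as pd

# Toy frame mimicking the layout above: row MultiIndex (year, nsfc),
# columns collectively named "fund" (ASCII stand-ins for the city labels)
cities = ["qitaihe", "wanning", "sanya"]
idx = pd.MultiIndex.from_product([[2000], list("BCDEF")], names=["year", "nsfc"])
df = pd.DataFrame(0.0, index=idx, columns=pd.Index(cities, name="fund"))

# Turn the index levels into ordinary columns and drop the "fund" label
flat = df.reset_index()
flat.columns.name = None

# Write a .dta file that Stata can open directly
flat.to_stata("out.dta", write_index=False)
```

With the real data, the Chinese city names may need ASCII aliases before `to_stata`, depending on the pandas and Stata versions in use.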

Related

Matplotlib eventplot - raster plot from binary values

I have created a dataframe where each column is an equal-length series of 1.0s and 0.0s. There is nothing else in the dataframe. I want to create a raster-style plot from this data where each column would be a horizontal line stacked up along the y-axis and each tick on the x-axis would correspond to a row index value.
However, when I try to do this, I get an "axis -1 is out of bounds for array of dimension 0" error. None of the other entries for this or very similar errors seem to relate to eventplot. I thought the type of data I had would be perfect for eventplot (a discrete black dash wherever there's a 1.0, otherwise nothing), but maybe I'm very wrong.
Here's a toy example of the kind of dataframe I'm trying to pass plus the function as I'm calling it:
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
plt.eventplot(df, colors='black', lineoffsets=1,
              linelengths=1, orientation='vertical')
Any help appreciated, thank you.
Edit: If I convert my df into an np.array and pass that instead, I no longer get that particular error, but I don't at all get the result I'm looking for. I do get the correct values on the x-axis (in my real data, this is 0-22), but I don't get each column of data represented as a separate line, and I'm having no luck advancing in that direction.
When using eventplot, the array passed to positions needs to contain the row numbers of the ones in each column. Here is an example with your toy data:
import io
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import data into dataframe
data = """
SP1 SP3 SP5 SP7 SP9 SP11
0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 1.0 0.0
5 0.0 1.0 0.0 0.0 1.0 1.0
"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Create series of row-index positions of the ones in each column
positions = df.apply(lambda x: df.index[x == 1])
# Create raster plot with inverted y-axis to display columns in ascending order
plt.eventplot(positions, lineoffsets=range(positions.index.size),
              linelengths=0.75, colors='black')
plt.yticks(range(positions.index.size), positions.index)
plt.gca().invert_yaxis()
plt.show()

How do I 'merge' information on a user during different periods in a dataset?

So I'm working with a dataset as an assignment / personal project right now. Basically, I have about 15k entries on about 5k unique IDs and I need to make a simple YES/NO prediction for each ID. Each row is some info on an ID during a certain period (1, 2 or 3) and has 43 attributes.
My question is, what's the best approach in this situation? Should I just merge the 3 periods for each ID into 1 row and have 129 attributes? Is there a better approach? Thanks in advance.
Here's an example of my dataset
PERIOD ID V_1 V_2 V_3 V_4 V_5 V_6 V_7 V_8 V_9 V_10 V_11 V_12 V_13 V_14 V_15 V_16 V_17 V_18 V_19 V_20 V_21 V_22 V_23 V_24 V_25 V_26 V_27 V_28 V_29 V_30 V_31 V_32 V_33 V_34 V_35 V_36 V_37 V_38 V_39 V_40 V_41 V_42 V_43
0 1 1 27.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 NaN 27.0 2.0 63.48 230.43 226.18 3.92 0.0 0.0 0.33 0.0 0.0 0.0 0.0 92.77 82.12 10.65 0.0 0.0 117.0 112.0 2.0 NaN 35.0 30.0 NaN 0.0 0.0 45.53 1.0550 0.0 0.0 45.53 0.0 0.0
1 2 1 19.0 0.0 NaN 1.0 1.0 0.0 1.0 0.0 NaN 19.0 2.0 NaN 134.75 132.03 2.03 0.0 0.0 0.69 1.0 0.0 0.0 0.0 162.48 162.48 0.00 0.0 NaN 54.0 48.0 2.0 0.0 44.0 44.0 0.0 0.0 0.0 48.00 NaN NaN 0.0 48.00 0.0 0.0
2 3 1 22.0 0.0 0.0 NaN 1.0 0.0 0.0 0.0 0.0 22.0 1.0 21.98 159.08 158.08 1.00 0.0 0.0 0.00 0.0 NaN 0.0 0.0 180.90 180.90 0.00 0.0 0.0 39.0 38.0 1.0 0.0 33.0 33.0 0.0 0.0 NaN 46.59 0.0000 0.0 0.0 46.59 0.0 0.0
3 1 2 NaN NaN 0.0 1.0 1.0 NaN 0.0 NaN 0.0 NaN 4.0 2.20 175.97 164.92 11.00 0.0 0.0 0.05 NaN 0.0 0.0 0.0 281.68 259.63 22.05 NaN 0.0 109.0 103.0 4.0 0.0 152.0 143.0 9.0 0.0 0.0 157.50 3.3075 0.0 0.0 157.50 0.0 0.0
4 2 2 28.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 28.0 8.0 73.93 367.20 339.73 27.47 0.0 0.0 NaN 0.0 0.0 0.0 0.0 504.13 479.53 24.60 0.0 0.0 233.0 222.0 11.0 0.0 288.0 279.0 NaN 0.0 0.0 157.50 3.6400 0.0 0.0 157.50 0.0 0.0
Here's an example of an output
ID OUTPUT
1 1.0
2 0.0
3 0.0
4 0.0
5 1.0
6 1.0
...
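The "merge the 3 periods into one row per ID" idea can be sketched with a pivot to wide format. A minimal sketch with made-up values, assuming column names PERIOD, ID, V_1, ... as in the sample:

```python
import pandas as pd

# Minimal stand-in for the real data: 2 IDs x 3 periods, 2 feature columns
df = pd.DataFrame({
    "PERIOD": [1, 2, 3, 1, 2, 3],
    "ID":     [1, 1, 1, 2, 2, 2],
    "V_1":    [27.0, 19.0, 22.0, None, 28.0, 30.0],
    "V_2":    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# One row per ID, one column per (feature, period) pair
wide = df.pivot(index="ID", columns="PERIOD")

# Flatten the resulting column MultiIndex to names like V_1_p1, V_1_p2, ...
wide.columns = [f"{var}_p{period}" for var, period in wide.columns]
wide = wide.reset_index()
```

With three periods per ID this turns the 43 attributes into 3 × 43 wide columns (plus ID); an ID that is missing a period simply gets NaN in that period's columns.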

Python: How to replace non-zero values in a Pandas dataframe with values from a series

I have a dataframe 'A' with 3 columns and 4 rows (X1..X4). Some of the elements in 'A' are non-zero. I have another dataframe 'B' with 1 column and 4 rows (X1..X4). I would like to create a dataframe 'C' so that wherever 'A' has a non-zero value, it takes the value from the equivalent row in 'B'.
I've tried a.where(a != 0, c) ... obviously wrong, as c is not a scalar.
A = pd.DataFrame({'A':[1,6,0,0],'B':[0,0,1,0],'C':[1,0,3,0]},index=['X1','X2','X3','X4'])
B = pd.DataFrame({'A':{'X1':1.5,'X2':0.4,'X3':-1.1,'X4':5.2}})
These are the expected results:
C = pd.DataFrame({'A':[1.5,0.4,0,0],'B':[0,0,-1.1,0],'C':[1.5,0,-1.1,0]},index=['X1','X2','X3','X4'])
np.where():
If you want to assign back to A (this assumes import numpy as np):
A[:] = np.where(A.ne(0), B, A)
For a new df:
final = pd.DataFrame(np.where(A.ne(0), B, A), columns=A.columns)
A B C
0 1.5 0.0 1.5
1 0.4 0.0 0.0
2 0.0 -1.1 -1.1
3 0.0 0.0 0.0
Usage of fillna
A = A.mask(A.ne(0)).T.fillna(B.A).T
A
Out[105]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Or
A = A.mask(A != 0, B.A, axis=0)
Out[111]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Use:
A.mask(A != 0, B['A'], axis=0, inplace=True)
print(A)
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0

How do I sum each column based on a condition of another column, without iterating over the columns, in a pandas dataframe

I have a data frame as below:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 1.0 85.0 66.0 29.0 0.0 0.0
1 8.0 183.0 64.0 0.0 0.0 0.0
2 1.0 89.0 66.0 23.0 94.0 1.0
3 0.0 137.0 40.0 35.0 168.0 1.0
4 5.0 116.0 74.0 0.0 0.0 1.0
I would like a Pythonic way to sum each column separately, based on a condition on one of the columns. I could do it by iterating over the df columns, but I'm sure there is a better way I'm not familiar with.
Specific to my data, I'd like to sum each column's values where the last column 'Outcome' is equal to 1. In the end, I should get this:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Any ideas?
Here is a solution to get the expected output:
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0
Output:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Documentation:
loc: access a group of rows / columns by labels or a boolean array
sum: by default, sum each column over all rows and return a Series indexed by the columns
to_frame: convert a Series to a DataFrame
.T: accessor for the transpose, which transposes the DataFrame
use np.where
df1[np.where(df1['Outcome'] == 1, True, False)].sum().to_frame().T
Output
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 3.0
Will these work for you?
df1.loc[~(df1['Outcome'] == 0)].groupby('Outcome').agg('sum').reset_index()
or
df1.loc[df1.Outcome == 1.0].sum().to_frame().T
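The loc + sum approach, made runnable end to end on the question's sample data:

```python
import pandas as pd

# Toy data copied from the question
df = pd.DataFrame({
    "Preg":          [1.0, 8.0, 1.0, 0.0, 5.0],
    "Glucose":       [85.0, 183.0, 89.0, 137.0, 116.0],
    "BloodPressure": [66.0, 64.0, 66.0, 40.0, 74.0],
    "SkinThickness": [29.0, 0.0, 23.0, 35.0, 0.0],
    "Insulin":       [0.0, 0.0, 94.0, 168.0, 0.0],
    "Outcome":       [0.0, 0.0, 1.0, 1.0, 1.0],
})

# Sum each column over the rows where Outcome == 1, then reset Outcome to 0
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0
```

This produces a one-row frame matching the expected output in the question.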

Python: group by with sum on specific columns while keeping the initial rows too

I have a df:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 0.0 3.0 0.0 0.0
I would like to change the values in row #6 (Passat in the Car column) by adding the values from rows #2, #3 and #4 (Golf, Tiguan, Touareg), while also keeping rows #2, #3 and #4 as they initially were.
Passat includes Golf, Touareg and Tiguan, so I need to add the values of the Golf, Touareg and Tiguan rows to the Passat row.
I tried to do it the following code:
car_list = ['Golf', 'Tiguan', 'Touareg']
for car in car_list:
    df['Car'][df['Car'] == car] = 'Passat'
and after that I used groupby on Car and the sum() function:
df1 = df.groupby(['Car'])[['Jan17', 'Jun18', 'Dec18', 'Apr19']].sum().reset_index()
As a result, df1 doesn't have the initial (Golf, Tiguan, Touareg) rows. So this way is wrong.
Expected result is df1:
ID Car Jan17 Jun18 Dec18 Apr19
0 Nissan 0.0 1.7 3.7 0.0
1 Porsche 10.0 0.0 2.8 3.5
2 Golf 0.0 1.7 3.0 2.0
3 Tiguan 1.0 0.0 3.0 5.2
4 Touareg 0.0 0.0 3.0 4.2
5 Mercedes 0.0 0.0 0.0 7.2
6 Passat 1.0 4.7 9.0 11.4
I'd appreciate any ideas. Thanks!
First we use .isin to get the correct Cars, then we use .filter to get the correct value columns, finally we sum the values and put them in our variable sums.
Then we select the Passat row and add the values to that row:
sums = df[df['Car'].isin(car_list)].filter(regex=r'\w{3}\d{2}').sum()
df.loc[df['Car'].eq('Passat'), 'Jan17':] += sums
Output
ID Car Jan17 Jun18 Dec18 Apr19
0 0 Nissan 0.0 1.7 3.7 0.0
1 1 Porsche 10.0 0.0 2.8 3.5
2 2 Golf 0.0 1.7 3.0 2.0
3 3 Tiguan 1.0 0.0 3.0 5.2
4 4 Touareg 0.0 0.0 3.0 4.2
5 5 Mercedes 0.0 0.0 0.0 7.2
6 6 Passat 1.0 4.7 9.0 11.4
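The two lines above, made self-contained on the question's data (a toy copy of the frame; car_list as defined in the question):

```python
import pandas as pd

# Toy copy of the question's data
df = pd.DataFrame({
    "ID":  [0, 1, 2, 3, 4, 5, 6],
    "Car": ["Nissan", "Porsche", "Golf", "Tiguan", "Touareg",
            "Mercedes", "Passat"],
    "Jan17": [0.0, 10.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "Jun18": [1.7, 0.0, 1.7, 0.0, 0.0, 0.0, 3.0],
    "Dec18": [3.7, 2.8, 3.0, 3.0, 3.0, 0.0, 0.0],
    "Apr19": [0.0, 3.5, 2.0, 5.2, 4.2, 7.2, 0.0],
})
car_list = ["Golf", "Tiguan", "Touareg"]

# Sum the month columns over the rows that feed into Passat
sums = df[df["Car"].isin(car_list)].filter(regex=r"\w{3}\d{2}").sum()

# Add those sums into the Passat row, leaving the other rows untouched
df.loc[df["Car"].eq("Passat"), "Jan17":] += sums
```

This reproduces the expected df1 from the question while keeping the Golf, Tiguan and Touareg rows unchanged.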
A solution in the form of a function:
car_list = ['Golf', 'Tiguan', 'Touareg', 'Passat']

def updateCarInfoBySum(df, car_list, name, id):
    req = df[df['Car'].isin(car_list)].copy()
    req.set_index(['Car', 'ID'], inplace=True)
    req.loc[('new_value', '000'), :] = req.sum(axis=0)
    req.reset_index(inplace=True)
    req = req[req.Car != name]
    req.loc[req['Car'] == 'new_value', 'Car'] = name
    req.loc[req['ID'] == '000', 'ID'] = id
    req.set_index(['Car', 'ID'], inplace=True)
    df_final = df.copy()
    df_final.set_index(['Car', 'ID'], inplace=True)
    df_final.update(req)
    return df_final