Filling empty rows with dictionary values via a for loop in pandas

I have a dictionary that looks like this:
my_dict = {2078:'T20',2153:'T20',2223:'T21',2219:'T21'}
My data frame:
Date Code Fleet KM
2021-20-03 2078 T20 20
2021-21-03 2078 NaN 22
2021-21-03 2153 T20 23
2021-21-03 2153 NaN 23
2021-22-03 2223 NaN 28
2021-22-03 2223 NaN 30
2021-22-03 2219 T21 23
2021-23-03 2219 NaN 23
I want to use the values of the dictionary to fill the empty rows in the Fleet column in my df.
So I wrote the code:
for index, row in df.iterrows():
    if pd.isnull(row['Fleet']):
        row['Fleet'] = my_dict.row['Fleet']
However, when I check df.info() I can see that the change was not applied, even though the code runs.
Could someone tell me what I am doing wrong?

Assigning to row inside iterrows never changes df, because row is a copy of each row; also, my_dict.row['Fleet'] is not valid dictionary access (it would need to be my_dict[row['Code']]). Instead, use Series.map with the dictionary and replace the missing values with Series.fillna:
df['Fleet'] = df['Fleet'].fillna(df['Code'].map(my_dict))
Or Series.combine_first:
df['Fleet'] = df['Fleet'].combine_first(df['Code'].map(my_dict))
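A minimal, self-contained sketch of the first approach on a reduced version of the sample data (names taken from the question):
import numpy as np
import pandas as pd

my_dict = {2078: 'T20', 2153: 'T20', 2223: 'T21', 2219: 'T21'}
df = pd.DataFrame({'Code': [2078, 2078, 2153, 2223],
                   'Fleet': ['T20', np.nan, np.nan, np.nan]})

# Build a Fleet value for every Code from the dictionary, then use it
# only where the original Fleet value is missing
df['Fleet'] = df['Fleet'].fillna(df['Code'].map(my_dict))
print(df)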

Related

Join values in different dataframes

I am trying to join two dataframes so that the result contains information from both of them. My dataframes are similar to:
>> df_1
user_id hashtag1 hashtag2 hashtag3
0000 '#breakfast' '#lunch' '#dinner'
0001 '#day' '#night' NaN
0002 '#breakfast' NaN NaN
The second dataframe contains a unique identifier of the hashtags and their respective score:
>> df_2
hashtag1 score
'#breakfast' 10
'#lunch' 8
'#dinner' 9
'#day' -5
'#night' 6
I want to add a set of columns on my first dataframe that contain the scores of each hashtag used, such as:
user_id hashtag1 hashtag2 hashtag3 score1 score2 score3
0000 '#breakfast' '#lunch' '#dinner' 10 8 9
0001 '#day' '#night' NaN -5 6 NaN
0002 '#breakfast' NaN NaN 10 NaN NaN
I tried to use df.join() but I get an error: "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat"
My code is as follows:
new_df = df_1.join(df_2, how='left', on='hashtag1')
I appreciate any help, thank you
You should try pandas.merge:
pandas.merge(df_1, df_2, on='hashtag1', how='left')
If you want to use .join, you need to set the index of df_2.
df_1.join(df_2.set_index('hashtag1'), on='hashtag1', how='left')
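To get one score column per hashtag column (score1, score2, score3), as in the desired output, one option is to map every hashtag column against a lookup Series built from df_2; a minimal sketch, assuming the hashtags in df_2 are unique:
scores = df_2.set_index('hashtag1')['score']   # Series: hashtag -> score
for i in (1, 2, 3):
    # unmatched or NaN hashtags simply stay NaN
    df_1[f'score{i}'] = df_1[f'hashtag{i}'].map(scores)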
Some resources:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
Trouble with df.join(): ValueError: You are trying to merge on object and int64 columns

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stocks data.
I have a list ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the buy, sell columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from the given list, e.g. for 3MINDIA.NS.csv the symbol is 3MINDIA
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
After this I tried to concatenate each trades dataframe generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am doing this so that whenever I call result_df[symbol]
# I can see the trade dates for that particular symbol.
But I get a result_df that has a lot of NaN values, because each trades dataframe has a variable number of rows.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as a higher index level, without getting all the NaN values in my result_df?
So I found a way to get what I wanted.
First, I added this code inside the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this, I used concat again on result_df and trades:
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted
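A minimal sketch of the whole loop under this approach (stock_files and build_trades are hypothetical stand-ins for reading the CSVs and generating each per-stock trades dataframe):
frames = []
for filename in stock_files:                    # e.g. '3MINDIA.NS.csv'
    symbol = filename.split('.')[0]             # '3MINDIA'
    trades = build_trades(filename)             # hypothetical helper returning the trades dataframe
    # add the symbol as an outer index level named 'Stocks'
    frames.append(pd.concat([trades], keys=[symbol], names=['Stocks']))

result_df = pd.concat(frames, axis=0)
result_df.loc['3MINDIA']                        # trades for one symbol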

Using python pandas, how can I merge datasets together and create a column that has the unique modifier? [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 1 year ago.
Here is the current dataset that I am working with.
df contains Knn, Kss, and Ktt in three separate columns.
What I have been unable to figure out is how to merge the three into a single column and have a column that has a label.
Here is what I currently have:
df_CohBeh = pd.concat([pd.DataFrame(Knn),
pd.DataFrame(Kss),
pd.DataFrame(Ktt)],
keys=['Knn', 'Kss', 'Ktt'],
ignore_index=True)
Which looks like this:
display(df_CohBeh)
Knn Kss Ktt
0 24.579131 NaN NaN
1 21.673524 NaN NaN
2 25.785409 NaN NaN
3 20.686215 NaN NaN
4 21.504863 NaN NaN
.. ... ... ...
106 NaN NaN 27.615440
107 NaN NaN 27.636029
108 NaN NaN 26.215347
109 NaN NaN 27.626850
110 NaN NaN 25.473380
This in essence keeps them in separate columns, but I would rather have a single column of values plus a string label column ("Knn", "Kss", "Ktt") that I can use for plotting the various distributions on the same seaborn graph.
I'm not sure how to create that label column.
If df looks like that:
>>> df
Knn Kss Ktt
0 96.054660 72.301166 15.355594
1 36.221933 72.646999 41.670382
2 96.503307 78.597493 71.959442
3 53.867432 17.315678 35.006592
4 43.014227 75.122762 83.666844
5 63.301808 72.514763 64.597765
6 0.201688 1.069586 98.816202
7 48.558265 87.660352 9.140665
8 64.353999 43.534200 15.202242
9 41.260903 24.128533 25.963022
10 63.571747 17.474933 47.093538
11 91.006290 90.834753 37.672980
12 61.960163 87.308155 64.698762
13 87.403750 86.402637 78.946980
14 22.238364 88.394919 81.935868
15 56.356764 80.575804 72.925204
16 30.431063 4.466978 32.257898
17 21.403800 46.752591 59.831690
18 57.330671 14.172341 64.764542
19 54.163311 66.037043 0.822948
Try df.melt to merge the three into a single column and get a column with the label:
variable value
0 Knn 96.054660
1 Knn 36.221933
2 Knn 96.503307
3 Knn 53.867432
4 Knn 43.014227
5 Knn 63.301808
...
20 Kss 72.301166
21 Kss 72.646999
22 Kss 78.597493
23 Kss 17.315678
24 Kss 75.122762
25 Kss 72.514763
...
40 Ktt 15.355594
41 Ktt 41.670382
42 Ktt 71.959442
43 Ktt 35.006592
44 Ktt 83.666844
45 Ktt 64.597765
...
Alternatively, you could build a single pandas Series by concatenating the flattened values (note that + would add the arrays element-wise rather than concatenate them):
import numpy as np
knn = pd.DataFrame({...})
kss = pd.DataFrame({...})
ktt = pd.DataFrame({...})
l = np.concatenate([knn.values.flatten(), kss.values.flatten(), ktt.values.flatten()])
s = pd.Series(l, name="K")

How to replace NA in columns with its values in other rows based on when it was recorded and reduce the size of the dataframe in pandas?

I have a huge pandas data frame containing data about hospital encounters. This data frame has the following columns: hospital encounter id (hadm_id), a datetime object indicating the time a variable was charted (ce_charttime), and the values of the recorded variables. There are many variables, but for simplicity I am currently working with only two: heart rate (hr) and respiratory rate (resp). Here is the head of the data frame:
hadm_id ce_charttime hr resp
0 100020 2142-11-30 23:06:00 62.0 NaN
1 100020 2142-11-30 23:06:00 NaN 13.0
2 100021 2109-08-21 20:00:00 134.0 NaN
3 100021 2109-08-21 19:30:00 133.0 NaN
4 100021 2109-08-21 20:00:00 NaN 18.0
If you notice, the encounter with hadm_id=100020 has two rows. However, both rows have the same ce_charttime value, 2142-11-30 23:06:00, which means it should really be one row with values for both hr and resp: ce_charttime=2142-11-30 23:06:00, hr=62.0, resp=13.0.
Similarly, for the encounter with hadm_id=100021 there are 3 rows, but there really need to be only 2. After sorting by time, the first row would have the values ce_charttime=2109-08-21 19:30:00, hr=133.0, resp=NaN and the second row would have the values ce_charttime=2109-08-21 20:00:00, hr=134.0, resp=18.0.
Essentially, I need the data frame to look like this:
hadm_id ce_charttime hr resp
0 100020 2142-11-30 23:06:00 62.0 13.0
1 100021 2109-08-21 19:30:00 133.0 NaN
2 100021 2109-08-21 20:00:00 134.0 18.0
This is just a sample of the dataframe; the full dataframe has more than 30 variables and more than 8000 unique encounters, with many rows containing this kind of redundant information. Is there a way to collapse it?
Any help is appreciated. Please let me know if further information is needed.
Thanks.
Use GroupBy.sum with min_count=1 to keep NaN values:
df.groupby(['hadm_id','ce_charttime']).sum(min_count = 1).reset_index()
This works as long as each group has at most one non-missing value per column (hr, resp).
Output:
hadm_id ce_charttime hr resp
0 100020 2142-11-30 23:06:00 62.0 13.0
1 100021 2109-08-21 19:30:00 133.0 NaN
2 100021 2109-08-21 20:00:00 134.0 18.0
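A self-contained sketch of the same idea on the sample rows (an equivalent option here is .first(), which also takes the first non-missing value per group):
import pandas as pd

df = pd.DataFrame({
    'hadm_id': [100020, 100020, 100021, 100021, 100021],
    'ce_charttime': ['2142-11-30 23:06:00', '2142-11-30 23:06:00',
                     '2109-08-21 20:00:00', '2109-08-21 19:30:00',
                     '2109-08-21 20:00:00'],
    'hr': [62.0, None, 134.0, 133.0, None],
    'resp': [None, 13.0, None, None, 18.0]})

# min_count=1 keeps NaN for groups where a column has no values at all
out = df.groupby(['hadm_id', 'ce_charttime']).sum(min_count=1).reset_index()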

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't alphabetical but makes logical sense, e.g.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent the df.append() function from ordering the columns alphabetically?
Seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# assign directly
merged.columns = cols
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
from numpy.random import randn
df1 = pd.DataFrame(columns=['Org_Goals_1','Calc_Goals_1','Diff_Goals_1'], data=randn(5,3))
df2 = pd.DataFrame(columns=['Org_Goals_2','Calc_Goals_2','Diff_Goals_2'], data=randn(5,3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged.columns = cols
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.028935 NaN -0.687143 NaN 1.528579
1 0.943432 NaN -2.055357 NaN -0.720132
2 0.035234 NaN 0.020756 NaN 1.556319
3 1.447863 NaN 0.847496 NaN -1.458852
4 0.132337 NaN -0.255578 NaN -0.222660
0 NaN 0.131085 NaN 0.850022 NaN
1 NaN -1.942110 NaN 0.672965 NaN
2 NaN 0.944052 NaN 1.274509 NaN
3 NaN -1.796448 NaN 0.130338 NaN
4 NaN 0.961545 NaN -0.741825 NaN
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.727619
1 0.022209
2 -0.350757
3 1.116637
4 1.947526
The same alpha sorting of the columns happens with concat also so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found label-based indexing to work quite well (note that .ix has since been removed from pandas; .loc does the same thing here):
df2 = df.loc[:, cols_in_desired_order]  # cols_in_desired_order is a list of column names
As I see it, the order is lost, but when appending, the original data should have the correct order. To maintain that, assuming a dataframe 'alldata' and a dataframe to be appended 'newdata', appending while keeping the column order of 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')
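Note that DataFrame.append was removed in pandas 2.0; a minimal sketch of the same column-order-preserving idea with pd.concat (keeping the column order of alldata and putting any columns that only exist in newdata at the end):
cols = list(alldata) + [c for c in newdata.columns if c not in alldata.columns]
alldata = pd.concat([alldata, newdata], ignore_index=True)[cols]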