Inserting an empty line between every two elements of a column (pandas DataFrame) - pandas

My data frame looks something like this:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
import pandas as pd
df = pd.read_csv('weekone.txt')
df.columns = ['Games']
I'm trying to put a blank line in between every two elements (teams).
So I want it to look like this:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
But when I'm using this loop
for i in df2.index:
    if df2.index[i] % 2 == 1:
        df2.Games[i] = df2.Games[i] + '\n'
    else:
        df2.Games[i] = df2.Games[i]
I'm getting an output like this:
Games
0 CAR 20
1 DEN 21\n
2 TB 31
3 ATL 24\n
4 SD 27
5 KC 33\n
6 CIN 23
7 NYJ 22\n
What am I doing wrong? Thanks.

You can do it this way:
In [172]: x
Out[172]:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
In [173]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[''])
rslt = x.loc[:1]
g = x.groupby(x.index // 2)
for i in range(1, len(g)):
    rslt = pd.concat([rslt, empty_line, g.get_group(i)])
## -- End pasted text --
In [174]: rslt
Out[174]:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
The index's dtype is object now:
In [178]: rslt.index.dtype
Out[178]: dtype('O')
Or, with -1 as the index for the empty lines:
In [175]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[-1])
rslt = x.loc[:1]
g = x.groupby(x.index // 2)
for i in range(1, len(g)):
    rslt = pd.concat([rslt, empty_line, g.get_group(i)])
## -- End pasted text --
In [176]: rslt
Out[176]:
Games
0 CAR 20
1 DEN 21
-1
2 TB 31
3 ATL 24
-1
4 SD 27
5 KC 33
-1
6 CIN 23
7 NYJ 22
index dtype:
In [181]: rslt.index.dtype
Out[181]: dtype('int64')
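Alternatively, the gap positions can be computed directly and the frame reindexed over them. A minimal sketch of that idea (my own variant, not from the answer above; it assumes a default RangeIndex as in the question):
import pandas as pd

df = pd.DataFrame({'Games': ['CAR 20', 'DEN 21', 'TB 31', 'ATL 24',
                             'SD 27', 'KC 33', 'CIN 23', 'NYJ 22']})

# shift row i to position i + i//2, which leaves a gap after every pair,
# then reindex over the full range so the gaps become blank rows
pos = df.index + df.index // 2
out = df.set_axis(pos).reindex(range(pos[-1] + 1), fill_value='')
print(out)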

Related

Find Max Gradient by Row in For Loop Pandas

I have a df of 15 x 4, and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row, using the "S" and "N" values for each min or max in the rows below. I'm not sure that this is the most pythonic way to do this. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = (ms.maxSlats.iloc[i] - ms.minNlats.iloc[i]) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T  # need to transpose
print(grad)
I obtain the correct answer, but I'm wondering if there is a cleaner way to arrive at the same result, shown below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient, and keep only the last occurrence of any duplicated index.
grad = np.where(ms.maxSlats > ms.maxNlats,
                (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms) + 1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
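For reference, a toy illustration (my own, not from the question's data) of how Index.duplicated(keep='last') behaves; it mirrors what the original loop did implicitly, since assigning grad[gr] = [i+1, i] overwrites any earlier column with the same key:
import pandas as pd

toy = pd.DataFrame({'A': [1, 2, 3]}, index=[-3045.6, 483.1, -3045.6])
# keeps the 483.1 row and only the last -3045.6 row
print(toy[~toy.index.duplicated(keep='last')])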

How to sum values under GroupBy and consecutive date conditions?

Given table:
ID  LINE  SITE  DATE         UNITS  TOTAL
1   X     AAA   02-May-2017  12     30
2   X     AAA   03-May-2017  10     22
3   X     AAA   04-May-2017  22     40
4   Z     AAA   20-MAY-2017  15     44
5   Z     AAA   21-May-2017  8      30
6   Z     BBB   22-May-2017  10     32
7   Z     BBB   23-May-2017  25     52
8   K     CCC   02-Jun-2017  6      22
9   K     CCC   03-Jun-2017  4      33
10  K     CCC   12-Aug-2017  11     44
11  K     CCC   13-Aug-2017  19     40
12  K     CCC   14-Aug-2017  30     40
For each row, if ID, LINE, and SITE equal those of the previous row (day), the "Last day" and "Last 3 days" columns need to be calculated as below.
Note that the dates need to be verified as consecutive within each "groupby" of the ID, LINE, and SITE columns.
ID  LINE  SITE  DATE         UNITS  TOTAL  Last day  Last 3 days
1   X     AAA   02-May-2017  12     30     0         0
2   X     AAA   03-May-2017  10     22     12/30     12/30
3   X     AAA   04-May-2017  22     40     10/22     (10+12)/(30+22)
4   Z     AAA   20-MAY-2017  15     44     0         0
5   Z     AAA   21-May-2017  8      30     15/44     15/44
6   Z     BBB   22-May-2017  10     32     0         0
7   Z     BBB   23-May-2017  25     52     10/32     10/32
8   K     CCC   02-Jun-2017  6      22     0         0
9   K     CCC   03-Jun-2017  4      33     6/22      6/22
10  K     CCC   12-Aug-2017  11     44     4/33      0
11  K     CCC   13-Aug-2017  19     40     11/44     11/44
12  K     CCC   14-Aug-2017  30     40     19/40     (11+19)/(44+40)
In these cases I usually do a for loop with groupby:
import pandas as pd
import numpy as np

# copied your table
table = pd.read_csv('/home/fm/Desktop/stackover.csv')
table.set_index('ID', inplace=True)
table[['Last day', 'Last 3 days']] = np.nan

for i, r in table.groupby(['LINE', 'SITE']):
    # first, find where the dates stop being sequential
    limits_interval = pd.to_datetime(r['DATE']).diff() != '1 days'
    # the first element is a false positive, as it is impossible to
    # calculate past days for the first day
    limits_interval.iloc[0] = False
    ids_subset = r.index[limits_interval].to_list()
    ids_subset.append(r.index[-1] + 1)  # so the last block is considered too
    id_start = 0
    for id_end in ids_subset:
        r_sub = r.loc[id_start:id_end - 1, :].copy()
        id_start = id_end
        # shift all values one day off; if the data is as in your
        # example (one line per day), this won't cause problems
        r_shifted = r_sub.shift(1)
        r_sub['Last day'] = r_shifted['UNITS'] / r_shifted['TOTAL']
        aux_units_cumsum = r_shifted['UNITS'].cumsum()
        aux_total_cumsum = r_shifted['TOTAL'].cumsum()
        r_sub['Last 3 days'] = aux_units_cumsum / aux_total_cumsum
        r_sub.fillna(0, inplace=True)
        table.loc[r_sub.index, :] = r_sub.copy()
You could make this cleaner and more elegant by defining a function and applying it within the groupby: see Apply function to pandas groupby.
Hope this helps, good luck!
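As a rough sketch of that apply-based variant (my own illustration, assuming rows are sorted by date within each group with one row per day, and reading "Last 3 days" as a rolling 3-day window over the shifted values):
import pandas as pd

def add_ratios(r):
    r = r.copy()
    prev = r[['UNITS', 'TOTAL']].shift(1)  # previous day's values
    r['Last day'] = prev['UNITS'] / prev['TOTAL']
    r['Last 3 days'] = (prev['UNITS'].rolling(3, min_periods=1).sum()
                        / prev['TOTAL'].rolling(3, min_periods=1).sum())
    return r.fillna(0)

# a counter that increments whenever the date is not exactly one day
# after the previous row, so each block only spans consecutive days
dates = pd.to_datetime(table['DATE'])
block = (dates.diff() != pd.Timedelta('1 days')).cumsum().rename('block')

result = table.groupby(['LINE', 'SITE', block], group_keys=False).apply(add_ratios)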

'Series' objects are mutable, thus they cannot be hashed trying to sum columns and datatype is float

I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
dataframe is formated as:
ID G1 Z1 Z2 ...ZN
0 50 13 12 ...62
1 51 62 23 ...19
dataframe with summed column:
ID G1 Z1 Z2 ...ZN D3counts
0 50 13 12 ...62 sum(Z1:ZN in row 0)
1 51 62 23 ...19 sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
You only need this part:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
The error in your version comes from passing that inner Series to the outer day3prep.sum() call: its first argument is the axis, and pandas raises 'Series' objects are mutable, thus they cannot be hashed when it tries to hash the Series to resolve the axis.
With some random numbers:
import pandas as pd
import random

random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5),
                         'G1': random.sample(range(10), 5),
                         'Z1': random.sample(range(10), 5),
                         'Z2': random.sample(range(10), 5),
                         'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
ID G1 Z1 Z2 Z3 D3counts
0 1 2 0 8 8 16
1 0 1 9 0 6 15
2 4 8 1 3 3 7
3 9 4 7 5 7 19
4 6 3 6 6 4 16
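As a side note, if the Z columns should be picked by name rather than by position, label-based slicing gives the same sum (a sketch assuming the columns are ordered as shown above, with Z1 the third column and the Z columns running to the end):
# equivalent to iloc[:, 2:] for this column layout
day3prep['D3counts'] = day3prep.loc[:, 'Z1':].sum(axis=1)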

How to create a partially filled column in pandas

I have a df_trg with, say 10 rows numbered 0-9.
From various sources, I get values for an additional column foo, each covering only a subset of the rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
    df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with updating data) and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation by adding a foo column, initially filled with an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
    df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer and shorter than yours.
Alternative: fill the foo column initially with NaN; but then the updating values will be converted to float (a side effect of using NaN).
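If NaN (rather than an empty string) is acceptable in the unfilled rows, an even shorter variant is possible, a sketch assuming the source indices do not overlap: concatenate the sources and let index alignment place the values.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])

# assigning a Series aligns on the index; rows 5 and 8 stay NaN
df['foo'] = pd.concat([s1, s2])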

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.loc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the L column, then pandas will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
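To get the second behaviour the question asks about (a new frame klmn11, leaving klmn1 untouched), one option is a sketch based on Series.map rather than index alignment:
# map each row's L value to the matching group mean from m0, then subtract
klmn11 = klmn1.copy()
klmn11['M'] = klmn1['M'] - klmn1['L'].map(m0)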
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)