Pandas Normalization using groupby

I have a DataFrame with two columns: the first column, date, ranges from 2015-01-01 to 2019-01-01, and the second column has some random values. I want to create a new column that should look like the example below.
The data currently looks like this:
A1 B1
2015-01-01 A
2015-02-01 A
2015-03-01 A
2015-04-01 A
2015-01-01 B
2015-02-01 B
...
and I want a new column like below
A1 B1 B
2015-01-01 A 0
2015-02-01 A 1
2015-03-01 A 2
2015-04-01 A 3
2015-01-01 B 0
2015-02-01 B 1
I think I am supposed to use the groupby function on B1, but I am not sure how to do that.

groupby.cumcount
df.assign(B=df.groupby('B1').cumcount())
A1 B1 B
0 2015-01-01 A 0
1 2015-02-01 A 1
2 2015-03-01 A 2
3 2015-04-01 A 3
4 2015-01-01 B 0
5 2015-02-01 B 1
In place
df['B'] = df.groupby('B1').cumcount()
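A minimal runnable sketch of the same idea, assuming a small frame shaped like the question's data:
import pandas as pd

df = pd.DataFrame({'A1': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                                         '2015-04-01', '2015-01-01', '2015-02-01']),
                   'B1': ['A', 'A', 'A', 'A', 'B', 'B']})

# cumcount numbers the rows within each B1 group, starting at 0,
# in the order they appear in the frame
df['B'] = df.groupby('B1').cumcount()
print(df)
Note that cumcount follows row order, so if the dates within a group can arrive unsorted, sort first (e.g. df.sort_values(['B1', 'A1'])) before counting.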

Related

excel index match with pandas

I am trying to replicate Excel's INDEX/MATCH in pandas to produce a new column that copies the date of the first occurrence where the value in colB is exceeded or matched by a value in colC.
   date                 colA  colB  colC  colD  desired_output
0  2020-04-01 00:00:00        2     1     e     2020-04-02 00:00:00
1  2020-04-02 00:00:00  8     4     4     d     2020-04-02 00:00:00
2  2020-04-03 00:00:00        1     2     a     2020-04-03 00:00:00
3  2020-04-04 00:00:00  4     2     3     b     2020-04-04 00:00:00
4  2020-04-05 00:00:00  5     3     1     c     2020-04-07 00:00:00
5  2020-04-06 00:00:00  9     4     1     m
6  2020-04-07 00:00:00  5     3     3     c     2020-04-07 00:00:00
Here is the code that I have tried so far, unsuccessfully:
col_6 = []
for ind in df3.index:
    if df3['colC'][ind] >= df3['colB'][ind]:
        col_6.append(df3['date'][ind])
    else:
        col_6.append('')
df3['desired_output'] = col_6
and have also tried:
col_6 = []
for ind in df3.index:
    if df3['colB'][ind] <= df3['colC'][ind]:
        col_6.append(df3['date'][ind])
    else:
        col_6.append('')
df3['desired_output'] = col_6
This second attempt has come the closest, but it only produces results when the 'if' condition is met within the same index row of the dataframe. For instance, the value of colB in index row 4 is exceeded by the value of colC in index row 6, but my code fails to capture that sort of occurrence.
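A minimal sketch of the lookup being described, assuming df3 holds the columns shown above: for each row, scan the frame from that row onward for the first colC value that meets or exceeds the row's colB, and copy that row's date (first_match_date is a hypothetical helper name, not an existing API):
import numpy as np
import pandas as pd

def first_match_date(df):
    # for each row i, find the first row j >= i with colC[j] >= colB[i]
    colB = df['colB'].to_numpy()
    colC = df['colC'].to_numpy()
    dates = []
    for i in range(len(df)):
        hits = np.nonzero(colC[i:] >= colB[i])[0]
        dates.append(df['date'].iloc[i + hits[0]] if hits.size else pd.NaT)
    return dates

df3['desired_output'] = first_match_date(df3)
This covers the row-4 case from the question: colB there is 3, and the scan keeps going until row 6, where colC first reaches 3.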

Sybase ASE looping over two tables for data calculation

I'm not well versed in SQL and am wondering how to do this in a Sybase ASE stored procedure. I would appreciate any guidance.
I have table-1 (t1) and table-2 (t2) that I need to loop over for calculations like (t1.c4+t2.c3)*2+(t1.c5+t2.c4)*5.
Steps:
1. Get all the rows from table-1 whose datetime column value is within a user-given datetime range.
2. For each row from table-1, get the row(s) from table-2 where the datetime value from the table-1 row falls between the start-datetime and end-datetime column values in table-2.
3. If only one row matched from table-2, take the values from the table-1 row and the table-2 row, do the calculations, and go to step 7.
4. If more than one row matched, find the row from table-2 whose start datetime exactly matches the datetime from the table-1 row.
5. If no exact match is found, flag an error in the table-1 row and proceed with the next row from table-1.
6. If exactly one match is found, do the calculations and go to step 7.
7. Insert the calculation result into the current row of table-1.
8. Go back to step 2 until no more rows are left from table-1.
What is the optimal approach to do this? Should I use cursors or temporary tables?
T1
------------------------------------------------
C1 C2 C3 C4 C5 C6
------------------------------------------------
ABC 10 15 2019-03-01 00:30
XYZ 12 13 2019-03-01 01:00
DEF 5 7 2019-03-01 02:00
IJK 17 3 2019-03-02 01:00
T2
------------------------------------------------
C1 C2 C3 C4 C5
------------------------------------------------
LMN 1 5 2019-03-01 00:30 2019-03-02 00:00
OPQ 2 3 2019-03-01 01:00 2019-03-01 01:30
STU 4 2 2019-03-01 01:30 2019-03-01 03:00
KJF 3 1 2019-03-01 02:30 2019-03-01 03:00
User input: 2019-03-01 00:00 to 2019-03-01 00:30 (ABC & LMN rows match)
Expected output:
------------------------------------------------------------
C1 C2 C3 C4 C5 C6
--------------------------------------------------------------
ABC 10 15 2019-03-01 00:30 (10*1)+(15*5)
XYZ 12 13 2019-03-01 01:00
DEF 5 7 2019-03-01 02:00
IJK 17 3 2019-03-02 01:00
User input: 2019-03-01 01:00 to 2019-03-01 01:30 (XYZ & OPQ rows match)
Expected output:
------------------------------------------------------------
C1 C2 C3 C4 C5 C6
--------------------------------------------------------------
ABC 10 15 2019-03-01 00:30
XYZ 12 13 2019-03-01 01:00 (12*2)+(13*3)
DEF 5 7 2019-03-01 02:00
IJK 17 3 2019-03-02 01:00
User input: 2019-03-01 23:59 to 2019-03-02 01:00 (IJK row; no matching rows in t2)
Expected output:
----------------------------------------------------
C1 C2 C3 C4 C5 C6
----------------------------------------------------
ABC 10 15 2019-03-01 00:30
XYZ 12 13 2019-03-01 01:00
DEF 5 7 2019-03-01 02:00
IJK 17 3 2019-03-02 01:00 No matching row
User input: 2019-03-01 01:30 to 2019-03-01 02:30 (DEF row)
Expected output:
Though DEF's datetime falls within both the STU and KJF date ranges, neither row's start datetime (C4) matches DEF's datetime exactly, so no unique match is found.
----------------------------------------------------
C1 C2 C3 C4 C5 C6
----------------------------------------------------
ABC 10 15 2019-03-01 00:30
XYZ 12 13 2019-03-01 01:00
DEF 5 7 2019-03-01 02:00 No unique match
IJK 17 3 2019-03-02 01:00

Converting dataframe object to date using to_datetime

I have a data set that looks like this:
date id
0 2014-01-01 11000929
1 2014-01-01 11000190
2 2014-01-01 11000216
3 2014-01-01 11000822
4 2014-01-01 11000971
5 2014-01-01 11000721
6 2014-01-01 11000970
7 2014-01-01 11000574
8 2014-01-01 11000967
9 2014-01-01 11000172
10 2014-01-01 11000208
11 2014-01-01 11000966
12 2014-01-01 11000344
13 2014-01-01 11000965
14 2014-01-01 11000935
15 2014-01-01 11000964
16 2014-01-01 11000741
17 2014-01-01 11000868
18 2014-01-01 11000035
19 2014-01-01 11000203
20 2014-01-02 11000574
As you can see, there are a lot of duplicate datetimes for different products. I will merge this table with another table, which requires me to convert the date column, currently an object, to datetime64[ns].
I tried
df_date_id.date = pd.to_datetime(df_date_id.date)
but I end up having the error:
TypeError: <class 'pandas._libs.tslibs.period.Period'> is not convertible to datetime
p.s: the table I am going to merge with looks like this:
date id score
0 2014-01-01 11000035 75
1 2014-01-02 11000035 84
2 2014-01-03 11000035 55
so the date format of both tables looks the same to me.
Thanks in advance.
I think it is necessary to convert the periods to datetimes with to_timestamp:
df['date'] = df['date'].dt.to_timestamp()
print (df['date'].dtypes)
datetime64[ns]
Another solution is to convert the date column in the other DataFrame to periods instead:
df2['date'] = df2['date'].dt.to_period('d')
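A minimal sketch that reproduces the error and the fix, assuming the date column carries period[D] values (for example, after an earlier dt.to_period('d') call, as the error in the question suggests):
import pandas as pd

# build a frame whose 'date' column has period[D] dtype; pd.to_datetime
# on this column raises "Period is not convertible to datetime"
df = pd.DataFrame({'date': pd.period_range('2014-01-01', periods=3, freq='D'),
                   'id': [11000929, 11000190, 11000216]})
print(df['date'].dtype)   # period[D]

df['date'] = df['date'].dt.to_timestamp()
print(df['date'].dtype)   # datetime64[ns]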
Works for me by specifying the format (%m is the month; %M would be minutes):
df.date = pd.to_datetime(df.date, format='%Y-%m-%d')
date id
0 2014-01-01 11000929
1 2014-01-01 11000190
2 2014-01-01 11000216
3 2014-01-01 11000822
4 2014-01-01 11000971
If that does not work, try converting the column to string first:
df.date = pd.to_datetime(df.date.astype(str), format='%Y-%m-%d')

Pandas Group by before outer Join

I have two tables with the following formats:
Table1: key = Date, Index
Date Index Value1
0 2015-01-01 A -1.292040
1 2015-04-01 A 0.535893
2 2015-02-01 B -1.779029
3 2015-06-01 B 1.129317
Table2: Key = Date
Date Value2
0 2015-01-01 2.637761
1 2015-02-01 -0.496927
2 2015-03-01 0.226914
3 2015-04-01 -2.010917
4 2015-05-01 -1.095533
5 2015-06-01 0.651244
6 2015-07-01 0.036592
7 2015-08-01 0.509352
8 2015-09-01 -0.682297
9 2015-10-01 1.231889
10 2015-11-01 -1.557481
11 2015-12-01 0.332942
Table2 has more rows, and I want to join Table1 into Table2 on Date so I can do stuff with the Values. However, I also want to bring in Index and, for each index, fill in all the Dates it doesn't have, like this:
Result:
Date Index Value1 Value2
0 2015-01-01 A -1.292040 2.637761
1 2015-02-01 A NaN -0.496927
2 2015-03-01 A NaN 0.226914
3 2015-04-01 A 0.535893 -2.010917
4 2015-05-01 A NaN -1.095533
5 2015-06-01 A NaN 0.651244
6 2015-07-01 A NaN 0.036592
7 2015-08-01 A NaN 0.509352
8 2015-09-01 A NaN -0.682297
9 2015-10-01 A NaN 1.231889
10 2015-11-01 A NaN -1.557481
11 2015-12-01 A NaN 0.332942
.... and so on with Index B
I suppose I could manually filter each Index value out of Table1 and join it to Table2, but that would be tedious and troublesome if I didn't actually know all the indexes. I essentially want to do a "Table1 group by Index and right join to Table2 on Date" at the same time, but I'm stuck on how to express this.
Running the latest versions of Pandas and Jupyter.
EDIT: I have a program to fill in the NaNs, so they're not a problem right now.
It seems you want to merge 'Value1' of df1 with df2 on 'Date', while assigning the Index to every date. You can use pd.concat with a list comprehension:
import pandas as pd
pd.concat([df2.assign(Index=i).merge(gp, how='left') for i, gp in df1.groupby('Index')],
          ignore_index=True)
Output:
Date Value2 Index Value1
0 2015-01-01 2.637761 A -1.292040
1 2015-02-01 -0.496927 A NaN
2 2015-03-01 0.226914 A NaN
3 2015-04-01 -2.010917 A 0.535893
4 2015-05-01 -1.095533 A NaN
5 2015-06-01 0.651244 A NaN
6 2015-07-01 0.036592 A NaN
7 2015-08-01 0.509352 A NaN
8 2015-09-01 -0.682297 A NaN
9 2015-10-01 1.231889 A NaN
10 2015-11-01 -1.557481 A NaN
11 2015-12-01 0.332942 A NaN
12 2015-01-01 2.637761 B NaN
13 2015-02-01 -0.496927 B -1.779029
14 2015-03-01 0.226914 B NaN
15 2015-04-01 -2.010917 B NaN
16 2015-05-01 -1.095533 B NaN
17 2015-06-01 0.651244 B 1.129317
18 2015-07-01 0.036592 B NaN
19 2015-08-01 0.509352 B NaN
20 2015-09-01 -0.682297 B NaN
21 2015-10-01 1.231889 B NaN
22 2015-11-01 -1.557481 B NaN
23 2015-12-01 0.332942 B NaN
By not specifying the merge keys, it's automatically using the intersection of columns, which is ['Date', 'Index'] for each group.
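An alternative sketch that gives the same result without the loop, assuming pandas >= 1.2 for how='cross': build the full Date x Index grid first, then left-merge Table1 onto it.
import pandas as pd

# cross join: every df2 row paired with every distinct Index value
grid = df2.merge(df1[['Index']].drop_duplicates(), how='cross')

# the left merge pulls in Value1 where (Date, Index) exists in df1, NaN elsewhere
result = (grid.merge(df1, on=['Date', 'Index'], how='left')
              .sort_values(['Index', 'Date'], ignore_index=True))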

Insert 0 in pandas series for timeseries gaps

In order to properly plot data, I need the missing values to be shown as 0. I do not want to have a 0 value for each missing day, as that bloats the storage. How do I insert a 0 value, for each type, on each gap's first and last day? I do not need 0 inserted before and after the whole sequence. Bonus: what if the timeseries is monthly or weekly data (date set to the first of the month, or to every Monday)?
For example, this timeseries contains one gap, between the 3rd and the 9th of January, for type A. I need to insert a 0 value on the 4th and the 8th of January.
from datetime import datetime, timedelta
import numpy as np
from pandas import DataFrame

df = DataFrame({"date": [datetime(2015, 1, 1) + timedelta(days=x) for x in list(range(0, 3)) + list(range(8, 13)) + list(range(2, 9))], "type": ['A']*8 + ['B']*7, "value": np.random.randint(10, 100, size=15)})
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89 <-- last date before the gap
3 2015-01-09 A 31 <-- first day after the gap
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Desired result (the row indexes would be different):
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89
. 2015-01-04 A 0 <-- gap starts - new value
<-- do NOT insert any more values for 05--07
. 2015-01-08 A 0 <-- gap ends - new value
3 2015-01-09 A 31
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Maybe an inelegant solution, but it seems easiest to split the dataframe up, fill in the missing dates, and recombine, like so:
# with pandas imported as pd
dfA = df[df.type == 'A'].copy()  # copy to avoid SettingWithCopyWarning
new_axis = pd.date_range(df.date.min(), df.date.max())
dfA.set_index('date', inplace=True)
missing_dates = list(set(new_axis).difference(dfA.index))
dfA.loc[min(missing_dates)] = 'A', 0  # first day of the gap
dfA.loc[max(missing_dates)] = 'A', 0  # last day of the gap
df = pd.concat([df[df.type == 'B'].set_index('date'), dfA])
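For the bonus part, a sketch that pads every type the same way, assuming at most one gap per type (as in the example data); for weekly or monthly series, pass freq='W-MON' or freq='MS' (pad_gaps is a hypothetical helper, not a pandas function):
import pandas as pd

def pad_gaps(df, freq='D'):
    out = []
    for t, grp in df.groupby('type'):
        grp = grp.set_index('date').sort_index()
        full = pd.date_range(grp.index.min(), grp.index.max(), freq=freq)
        missing = full.difference(grp.index)
        if len(missing):
            # 0 only on the gap's first and last day, not the whole gap
            pads = pd.DataFrame({'type': t, 'value': 0},
                                index=[missing.min(), missing.max()])
            grp = pd.concat([grp, pads]).sort_index()
        out.append(grp)
    return pd.concat(out).rename_axis('date').reset_index()

df = pad_gaps(df)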