Unexpected groupby result: some rows are missing - pandas

I am facing an issue with transforming my data using Pandas' groupby. I have a table (several million rows and 3 variables) that I am trying to group by the "Date" variable.
Snippet from a raw table:
Date V1 V2
07_19_2017_17_00_06 10 5
07_19_2017_17_00_06 20 6
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 30 1
01_07_2019_14_06_59 40 2
The goal is to group rows with the same value of "Date", applying a mean over V1 and a sum over V2, so that the expected result resembles:
Date V1 V2
07_19_2017_17_00_06 15 11 # This row has changed
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 35 3 # and this one too!
My code:
df = df.groupby(['Date'], as_index=False).agg({'V1': 'mean', 'V2': 'sum'})
The output I am getting, however, is totally unexpected and I can't find a reasonable explanation of why it happens. It seems like Pandas is only processing data from 01_01_2018_00_00_01 to 12_31_2018_23_58_40, instead of 07_19_2017_17_00_06 to 01_07_2019_14_06_59.
Date V1 V2
01_01_2018_00_00_01 30 3
01_01_2018_00_00_02 20 4
...
12_31_2018_23_58_35 15 3
12_31_2018_23_58_40 16 11
If you have any clue, I would really appreciate your input. Thank you!

I suspect that the issue stems from Pandas not recognizing the date format that I used. The solution turned out to be quite simple: convert all of the dates into UNIX time format, divide by 60, and then repeat the groupby procedure.
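One possible explanation, not confirmed in the post: groupby sorts the group keys by default, and here the keys are strings in MM_DD_YYYY order, so the sorted output starts with "01_01_..." and ends with "12_31_..." even when no rows are lost. The sketch below is an illustration under that assumption. The small frame is made up from rows shown in the question, and parsing with pd.to_datetime is a swapped-in alternative to the UNIX-time conversion described above.

```python
import pandas as pd

# Illustrative data only: a few rows taken from the snippets in the question.
df = pd.DataFrame({
    "Date": ["07_19_2017_17_00_06", "07_19_2017_17_00_06", "07_19_2017_17_00_08",
             "01_01_2018_00_00_01", "01_07_2019_14_06_59", "01_07_2019_14_06_59"],
    "V1": [10, 20, 15, 30, 30, 40],
    "V2": [5, 6, 3, 3, 1, 2],
})

# Grouping on the raw strings: keys come back in lexicographic (string) order,
# so the 2017 rows land in the middle of the result instead of at the top.
by_string = df.groupby("Date", as_index=False).agg({"V1": "mean", "V2": "sum"})
print(by_string)

# Parsing the strings first makes the keys sort chronologically.
df["Date"] = pd.to_datetime(df["Date"], format="%m_%d_%Y_%H_%M_%S")
by_time = df.groupby("Date", as_index=False).agg({"V1": "mean", "V2": "sum"})
print(by_time)
```

Comparing len(by_string) with df["Date"].nunique() on the real table is a quick way to check whether any groups are actually missing.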

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I changed the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (L/100 km = constant / mpg, with a constant of about 235.21 for US gallons; the code here uses the approximation 235.15):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
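For comparison, here is a minimal sketch that renames the column and converts the values on the same frame, keeping the column in its original position and using vectorized division instead of apply. The constant name L_PER_100KM_PER_MPG is made up for this example, and US gallons are assumed.

```python
import pandas as pd

# Assumption: US gallons, so L/100 km = 235.214583 / mpg.
L_PER_100KM_PER_MPG = 235.214583

df = pd.DataFrame({"Mpg": [18, 17, 19, 21, 16, 15]})

# Rename first (the column keeps its position), then convert in one vectorized step.
df = df.rename(columns={"Mpg": "litre per 100 km"})
df["litre per 100 km"] = L_PER_100KM_PER_MPG / df["litre per 100 km"]
print(df)
```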

Pivoting with groupby?

I wonder if you can help me to find a solution for the following problem. Given a data frame df1 like this
d1={'L':['aaa','bbb','ccc','aaa','bbb','ddd'],
'w':[1,5,9,13,17,21],
'x':[2,6,10,14,18,22],
'y':[3,7,11,15,19,23],
'z':[4,8,12,16,20,24]}
df1=pd.DataFrame(d1)
and two dictionaries to define grouping over columns and rows
dctRowGroups={'aaa':'A','bbb':'B','ccc':'A','ddd':'B'}
dctColGroups={'w':'ALPHA','x':'BETA','y':'ALPHA','z':'BETA'}
I wanted to aggregate over columns as a first step. Applying
g2=df1.groupby(dctColGroups,axis=1)
g2.sum()
results in
but I wanted to keep the 'L' column for the next step row-wise aggregation, i.e. the result should be a dataframe df2 more like this:
What do I need to code to make this happen?
As a next step, I want to aggregate df2 over the rows using the dctRowGroups dictionary
g3=df2.groupby(dctRowGroups,axis=0)
g3.sum()
to get a final result like this:
In what way can I do all these steps in as few lines of code as possible?
Appreciate your advice on this.
Thanks a lot
Willfried.
You can do this in two steps.
First, create df2 and add the 'L' column, for example with the insert() method:
df2=df1.groupby(dctColGroups,axis=1).sum()
df2.insert(0,'L',df1['L'])    # use insert() when the column position matters
# OR (use either insert or assign, not both)
df2=df2.assign(L=df1['L'])    # otherwise use assign(), which puts 'L' last
Finally, use the assign(), map() and groupby() methods:
result=df2.assign(L=df2['L'].map(dctRowGroups)).groupby('L').sum()
Outputs:
df2:
L ALPHA BETA
0 aaa 4 6
1 bbb 12 14
2 ccc 20 22
3 aaa 28 30
4 bbb 36 38
5 ddd 44 46
result:
ALPHA BETA
L
A 52 58
B 92 98
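Recent pandas releases (2.1 and later) deprecate the axis=1 argument of DataFrame.groupby, so the following is a sketch of the same two-way aggregation without it. It is an alternative, not the answerer's method: it moves 'L' into the index, groups the rows first, and then groups the columns by transposing. Because both steps are sums, the order of the two aggregations does not change the totals.

```python
import pandas as pd

df1 = pd.DataFrame({
    "L": ["aaa", "bbb", "ccc", "aaa", "bbb", "ddd"],
    "w": [1, 5, 9, 13, 17, 21],
    "x": [2, 6, 10, 14, 18, 22],
    "y": [3, 7, 11, 15, 19, 23],
    "z": [4, 8, 12, 16, 20, 24],
})
dctRowGroups = {"aaa": "A", "bbb": "B", "ccc": "A", "ddd": "B"}
dctColGroups = {"w": "ALPHA", "x": "BETA", "y": "ALPHA", "z": "BETA"}

# Row-wise aggregation: the dict maps index labels ('aaa', ...) to groups.
by_row = df1.set_index("L").groupby(dctRowGroups).sum()

# Column-wise aggregation: transpose so the column names become index labels,
# group them with the second dict, then transpose back.
result = by_row.T.groupby(dctColGroups).sum().T
print(result)
```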

Creating a Nested/Loop Calculation in Vertica (?)

So maybe I'm just way over-thinking things, but is there any way to replicate a nested/loop calculation in Vertica with just SQL syntax?
Explanation -
In Column AP I have remaining values per month by an attribute key, in column CHANGE_1M I have an attribution value to apply.
The goal is to fill in the future AP values: each future row takes the preceding row's AP and applies that row's own CHANGE_1M, so the calculation chains from one period to the next.
For reference I have 15,000 Keys Per Period and 60 Periods Per Year in the full-data set.
Sample Calculation
Period 5 =
(Period4_AP * Period5_CHANGE_1M)+Period4_AP
Period 6 =
(((Period4_AP * Period5_CHANGE_1M)+Period4_AP)*Period6_CHANGE_1M)
+
((Period4_AP * Period5_CHANGE_1M)+Period4_AP)
etc.
Sample Data on Top
Expected Results below
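To make the sample calculation concrete, here is a small sketch in Python (not Vertica SQL) with made-up numbers. It shows that the period-by-period loop, AP_n = AP_(n-1) * CHANGE_n + AP_(n-1), gives the same values as the starting AP multiplied by a running product of (1 + CHANGE). The names ap_start and change are hypothetical.

```python
import numpy as np

ap_start = 100.0                          # hypothetical Period4_AP
change = np.array([0.05, -0.02, 0.10])    # hypothetical CHANGE_1M for periods 5, 6, 7

# Loop form, exactly as in the sample calculation.
ap_loop = []
prev = ap_start
for c in change:
    prev = prev * c + prev
    ap_loop.append(prev)

# Closed form: a running product of (1 + CHANGE_1M) applied to the starting AP.
ap_closed = ap_start * np.cumprod(1.0 + change)

print(ap_loop)
print(ap_closed)
```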
Vertica does not have (yet?) the RECURSIVE WITH clause, which you would need for the recursive calculation you seem to be needing here.
The only possible workaround would be tedious: write (or generate, using Perl or Python, for example) as many nested queries as you need iterations.
I'll only detail this if you want to go down that path.
Long time no see - I should have returned to answer this question earlier.
I got so stuck on thinking of a programmatic way to solve this issue that I forgot it is a math equation, and where you have math functions you have solutions.
Basically, this question revolves around multiplying values down the rows of a table (a running product).
The solution is simply to use the LOG/LN functions to turn the multiplication into a sum, and convert back using EXP.
Hope this helps other lost souls: don't forget your math background, and don't spiral into a whirlpool of self-defeat.
Snippet of the simple solution:
EXP(SUM(LN(DEGREDATION)) OVER (ORDER BY PERIOD_NUMBER ASC ROWS UNBOUNDED PRECEDING)) AS DEGREDATION_RATE
** The stratification is controlled by whichever factors/attributes you put in a PARTITION BY clause.
Basically, instead of starting at the retention PX/P0, I back into it with the degradation ratios P1/P0, P2/P1, etc.
PERIOD_NUMBER  DEGRADATION  DEGREDATION_RATE  DEGREDATION_RATE x 100000
0              100.00%      100.00%           100000.00
1              57.72%       57.72%            57715.18
2              60.71%       35.04%            35036.59
3              70.84%       24.82%            24820.66
4              76.59%       19.01%            19009.17
5              79.29%       15.07%            15071.79
6              83.27%       12.55%            12550.59
7              82.08%       10.30%            10301.94
8              86.49%       8.91%             8910.59
9              89.60%       7.98%             7984.24
10             86.03%       6.87%             6868.79
11             86.00%       5.91%             5907.16
12             90.52%       5.35%             5347.00
13             91.89%       4.91%             4913.46
14             89.86%       4.41%             4414.99
15             91.96%       4.06%             4060.22
16             89.36%       3.63%             3628.28
17             90.63%       3.29%             3288.13
18             92.45%       3.04%             3039.97
19             94.95%       2.89%             2886.43
20             92.31%       2.66%             2664.40
21             92.11%       2.45%             2454.05
22             93.94%       2.31%             2305.32
23             89.66%       2.07%             2066.84
24             94.12%       1.95%             1945.26
25             95.83%       1.86%             1864.21
26             92.31%       1.72%             1720.81
27             96.97%       1.67%             1668.66
28             90.32%       1.51%             1507.18
29             90.00%       1.36%             1356.46
30             94.44%       1.28%             1281.10
31             94.12%       1.21%             1205.74
32             100.00%      1.21%             1205.74
33             90.91%       1.10%             1096.13
34             90.00%       0.99%             986.52
35             94.44%       0.93%             931.71
36             100.00%      0.93%             931.71
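The LN/EXP trick above can be checked outside the database. The sketch below, in Python rather than Vertica SQL, uses the first few rounded DEGRADATION values from the table and confirms that exp of a running sum of logs equals a running product. One caveat worth stating: the logarithm is only defined for strictly positive factors, so a zero or negative value would need separate handling.

```python
import numpy as np

# First few DEGRADATION values from the table, as rounded fractions.
degradation = np.array([1.0, 0.5772, 0.6071, 0.7084])

# Same shape as the SQL: EXP(SUM(LN(x)) OVER (... ROWS UNBOUNDED PRECEDING)).
rate_via_logs = np.exp(np.cumsum(np.log(degradation)))

# Direct running product, for comparison.
rate_via_product = np.cumprod(degradation)

print(rate_via_logs)
print(rate_via_product)
```

Because the inputs here are rounded to four digits, the results only approximately match the DEGREDATION_RATE column.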

transform data frame in time series for date type POSIXct

I have a data frame with the following two variables:
amount: num 1213.5 34.5 ...
txn_date: POSIXct, format "2017-05-01 12:13:30" ...
I want to transform it into a time series using ts().
I started using this code:
Z <- zoo(data$amount, order.by=as.Date(as.character(data$txn_date), format="%Y/%m/%d %H:%M:%S"))
But the problem is that in Z I lose the dates. In fact, all the dates are reported as NA.
How can I solve it?
For my analysis it is important to keep the date in the format %Y/%m/%d %H:%M:%S,
for example 2017-05-01 12:13:30. I don't want to remove the time component in the variable txn_date.
Thank you for your help,
Andrea
I think your problem comes from the way you're manipulating your data frame. Could you post more details about it, please?
I think I have a fix for you.
Data frame I used:
> df1
$data
value
1 1.9150
2 3.1025
3 6.7400
4 8.5025
5 11.0025
6 9.8025
7 9.0775
8 7.0900
9 6.8525
10 7.4900
$date
%Y-%m-%d
1 1974-01-01
2 1974-01-02
3 1974-01-03
4 1974-01-04
5 1974-01-05
6 1974-01-06
7 1974-01-07
8 1974-01-08
9 1974-01-09
10 1974-01-10
> class(df1$data$value)
[1] "numeric"
> class(df1$date$`%Y-%m-%d`)
[1] "POSIXct" "POSIXt"
Then I can create a time series by calling zoo like this:
> Z<-zoo(df1$data,order.by=(as.POSIXct(df1$date$`%Y-%m-%d`)))
> Z
value
1974-01-01 1.9150
1974-01-02 3.1025
1974-01-03 6.7400
1974-01-04 8.5025
1974-01-05 11.0025
1974-01-06 9.8025
1974-01-07 9.0775
1974-01-08 7.0900
1974-01-09 6.8525
1974-01-10 7.4900
The important thing here is that I use df1$date$`%Y-%m-%d` instead of just
df1$date
In fact, if I try it the way you did, I get NA values too:
> Z<-zoo(df1$data,order.by=as.POSIXct(as.Date(as.character(df1$date),format("%Y-%m-%d"))))
> Z
value
<NA> 1.915
To get the name of the column inside data$txn_date, you can use names(data$txn_date), then try my solution with your data frame and that name.
> names(df1$date)
[1] "%Y-%m-%d"
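An observation that is not made in the thread, offered as an assumption: the question's format string uses slashes ("%Y/%m/%d") while the sample value uses dashes ("2017-05-01 12:13:30"), and a format that does not match the text is a common reason for parsers to return missing values. The sketch below shows the same effect in Python with pandas, purely as an analogy to the R code; the sample strings are made up.

```python
import pandas as pd

raw = pd.Series(["2017-05-01 12:13:30", "2017-05-02 08:00:00"])

# Format with slashes does not match the dashed text: every value becomes NaT.
mismatched = pd.to_datetime(raw, format="%Y/%m/%d %H:%M:%S", errors="coerce")

# Format that matches the text: the date and the time component are both kept.
matched = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

print(mismatched)
print(matched)
```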

SSRS Chart with Grouping like in Excel

I wasn't able to find anything like this yet, but here is what I need to do:
I have a query result like this:
ID Data1 Data2 Data3 Data4 ... Data7
1 12 13 15 1 ... 12
2 12 13 15 1 ... 12
3 12 13 15 1 ... 12
4 12 13 15 1 ... 12
I need to make a bar chart with 2 values: one is the first row (ID=1), the other is the last row (ID=4). The column headers DataX are what I need the series to be paired by.
Example:
ID Insured Uninsured Rejected
1 12 3 0
4 16 9 2
In the bar chart I need to see the number of Insured for ID=1 and ID=2 next to each other, and the same for the numbers of Uninsured and Rejected.
I feel like I have tried every way possible, but was not able to get anything besides a bar chart where all values of ID=1 were displayed and then all values for ID=2 were displayed next to each other.
I'm sure this was a very confusing way to describe it, but I hope someone can understand what I am looking for.
NOTE: I tried to do this in Excel, and it worked within 2 minutes. I set the Series filter to the 2 rows that I wanted, and set the Categories to the DataX columns as described, and everything looked great. When I tried to translate this into SSRS, I was able to do all the same things in the Series and Categories, but then I had to put in values, and that screwed everything up.
PLEASE HELP!
I bet you need to add a grouping to your values by a spanning factor.