Pandas Groupby Sum to Columns - pandas

I am doing a groupby and can get the sum for a single column fine, but how do I get the sum of two columns combined?
detail['debit'] = df.groupby('type')['debit'].sum()
detail['credit'] = df.groupby('type')['credit'].sum()
Now I need (credit - debit) per group as well, something like this:
detail['profit'] = df.groupby('type')(['credit'] - ['debit']).sum()
Obviously, that does not work.
Thanks.

As @IanS suggested, I would first save the result in a new column and apply the groupby afterwards:
df['profit'] = df['credit'] - df['debit']
detail = df.groupby('type').sum()[['profit', 'credit', 'debit']]
I also combined the two groupby actions into one.
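For illustration, a minimal sketch with made-up data (the type, credit, and debit values here are hypothetical):

import pandas as pd

# hypothetical sample data, just to show the shape of the result
df = pd.DataFrame({
    'type':   ['A', 'A', 'B'],
    'credit': [100, 50, 80],
    'debit':  [30, 20, 10],
})

df['profit'] = df['credit'] - df['debit']
detail = df.groupby('type').sum()[['profit', 'credit', 'debit']]
print(detail)
#       profit  credit  debit
# type
# A        100     150     50
# B         70      80     10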

Have you tried:
detail['profit'] = df.groupby('type')['credit'].sum() - df.groupby('type')['debit'].sum()
Since summation is linear, the difference of the per-group sums equals the per-group sum of (credit - debit).


Combine conditions in table expression?

I am using the line_index function and would like to search for two values: not only carrid but also connid. Is it possible? If so, how?
Because right now, this works:
lv_index = line_index( lt[ carrid = 'LH' ] ).
But after adding the code [ connid = '2407' ] like this:
lv_index = line_index( lt[ carrid = 'LH' ][ connid = '2407' ] ).
I get a syntax error:
LT[ ] is not an internal table
Specify all fields (conditions) one after the other inside a single pair of brackets:
lv_index = line_index( lt[ carrid = 'LH'
                           connid = '2407' ] ).
I'd like to comment on the chaining of table expressions.
The answer for the OP's example is that a single table expression must be used (itab[...]) with as many components as needed, not a chain of table expressions as was done (itab[...][...]).
lt[ carrid = 'LH' ][ connid = '2407' ] can never be valid: connid = '2407' would require each line of LT to itself be an internal table, while carrid = 'LH' contradicts that by requiring each line of LT to be a structure.
But other syntaxes of chained table expressions can be valid, for example (provided that the internal tables are defined adequately):
itab[ 1 ][ comp1 = 'A' ]
itab[ comp1 = 'A' ][ 1 ]
itab[ comp1 = 'A' ]-itabx[ compx = 42 ]
Here is an example that you can play with:
TYPES: BEGIN OF ty_structure,
         connid TYPE c LENGTH 4,
       END OF ty_structure,
       ty_table TYPE STANDARD TABLE OF ty_structure WITH EMPTY KEY,
       BEGIN OF ty_structure_2,
         carrid TYPE c LENGTH 2,
         table  TYPE ty_table,
       END OF ty_structure_2,
       ty_table_2 TYPE STANDARD TABLE OF ty_structure_2 WITH EMPTY KEY,
       ty_table_3 TYPE STANDARD TABLE OF ty_table_2 WITH EMPTY KEY.

DATA(lt) = VALUE ty_table_3( ( VALUE #( ( carrid = 'LH'
                                          table  = VALUE #( ( connid = '2407' ) ) ) ) ) ).

DATA(structure) = lt[ 1 ][ carrid = 'LH' ]-table[ connid = '2407' ].
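As a cross-language analogy (a hypothetical Python sketch, not ABAP), chaining subscripts only works when the containers are actually nested, while several conditions on one row belong in a single lookup:

# a table of structures: one lookup, several conditions
flights = [{'carrid': 'LH', 'connid': '2407'}]
hit = next(r for r in flights if r['carrid'] == 'LH' and r['connid'] == '2407')

# chaining two subscripts requires each "line" to itself be a table
nested = [[{'connid': '2407'}]]  # a table of tables of structures
inner = nested[0][0]['connid']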

Collapse elements of array of structs in BigQuery

I have an array of structs in BigQuery that looks like:
"categories": [
{
"value": "A",
"question": "Q1",
},
{
"value": "B",
"question": "Q2",
},
{
"value": "C",
"question": "Q3",
}
]
I'd like to collapse the values "A", "B" and "C" into a separate column, and the value for this particular row should be something like "A - B - C".
How can I do this with a query in BigQuery?
Consider below:
select id,
  (select string_agg(value, ' - ')
   from t.questions_struct) values
from questions t
If applied to the sample data in your question:
with questions as (
  SELECT 1 AS id,
  [
    STRUCT("A" as value, "Q1" as question),
    STRUCT("B" as value, "Q2" as question),
    STRUCT("C" as value, "Q3" as question)
  ] AS questions_struct
)
the output is:
id    values
1     A - B - C
Assuming this is an array of structs, you can use:
select (select q.value from unnest(ar) q where q.question = 'Q1') as q1,
       (select q.value from unnest(ar) q where q.question = 'Q2') as q2,
       (select q.value from unnest(ar) q where q.question = 'Q3') as q3
from t;
I think it can be done with the following code:
with questions as (
  SELECT 1 AS id,
  [
    STRUCT("A" as value, "Q1" as question),
    STRUCT("B" as value, "Q2" as question),
    STRUCT("C" as value, "Q3" as question)
  ] AS questions_struct
),
unnested as (
  select * from questions, unnest(questions_struct) as questions_struct
)
select id, string_agg(value, ' - ') from unnested group by 1
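For intuition only, the unnest-then-aggregate pattern of the last answer maps to explode plus a string join in pandas; a hypothetical mirror of the sample data:

import pandas as pd

questions = pd.DataFrame({
    'id': [1],
    'questions_struct': [[
        {'value': 'A', 'question': 'Q1'},
        {'value': 'B', 'question': 'Q2'},
        {'value': 'C', 'question': 'Q3'},
    ]],
})

# explode ~ UNNEST: one row per array element, then aggregate per id
unnested = questions.explode('questions_struct')
unnested['value'] = unnested['questions_struct'].apply(lambda s: s['value'])
print(unnested.groupby('id')['value'].agg(' - '.join))  # 1    A - B - C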

Power BI DAX - find repeatability

Given data such as:
Month  ValueA
1      T
2      T
3      T
4      F
Is there a way to make a measure that determines, for each month, whether the last three values were True?
So the output would be (F, F, T, F)?
That would probably mean that my actual problem is solvable, which is finding, from:
Month  ValueA  ValueB  ValueC
1      T       F       T
2      T       T       T
3      T       T       T
4      F       T       F
the count of those booleans for each row, so the output would be (0,0,2[A and C],1[B])
EDIT:
Okay, I managed to solve the first part with this:
Previous =
VAR PreviousDate =
    MAXX(
        FILTER(
            ALL( 'Table' ),
            EARLIER( 'Table'[Month] ) > 'Table'[Month]
        ),
        'Table'[Month]
    )
VAR PreviousDate2 =
    MAXX(
        FILTER(
            ALL( 'Table' ),
            EARLIER( 'Table'[Month] ) - 1 > 'Table'[Month]
        ),
        'Table'[Month]
    )
RETURN
    IF(
        CALCULATE(
            MAX( 'Table'[Value] ),
            FILTER( 'Table', 'Table'[Month] = PreviousDate )
        ) = "T"
            && CALCULATE(
                MAX( 'Table'[Value] ),
                FILTER( 'Table', 'Table'[Month] = PreviousDate2 )
            ) = "T"
            && 'Table'[Value] = "T",
        TRUE,
        FALSE
    )
But is there a way to make this work with an unknown number of columns, without hard-coding every column name? Like a loop or something.
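For reference, the intended computation is easy to sanity-check outside DAX; a hypothetical pandas sketch that handles any number of Value columns without hard-coding names (using the sample data above):

import pandas as pd

df = pd.DataFrame({
    'Month':  [1, 2, 3, 4],
    'ValueA': ['T', 'T', 'T', 'F'],
    'ValueB': ['F', 'T', 'T', 'T'],
    'ValueC': ['T', 'T', 'T', 'F'],
}).set_index('Month')

# True where the last three months (inclusive) were all 'T'
three_true = df.eq('T').astype(int).rolling(3).sum().eq(3)
print(three_true['ValueA'].tolist())    # [False, False, True, False]
print(three_true.sum(axis=1).tolist())  # [0, 0, 2, 1]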
I would redo the data table in Power Query (unpivoting the ValueX columns) and change T/F to 1/0, then add a dim table with a relationship to Month.
Then add a measure like this:
Three Consec T =
var maxMonth = MAX('Data'[Month])
var tempTab =
    FILTER(
        dimMonth;
        'dimMonth'[MonthNumber] <= maxMonth && 'dimMonth'[MonthNumber] > maxMonth - 3
    )
var sumMonth =
    MAXX(
        'dimMonth';
        CALCULATE(
            SUM('Data'[OneOrZero]);
            tempTab
        )
    )
return
    IF(
        sumMonth >= 3;
        "3 months in a row";
        "No"
    )
Then, with a slicer indicating which time window I'm looking at, a table visual shows whether there have been three consecutive Ts or not.

SQL function not displaying two decimal places although input parameter value is float

I have a function, below, that rounds to the nearest value in SQL. When I pass my value in and run the function manually, it works as expected. However, when I use it within a SELECT statement, it removes the decimal places.
E.g. I expect the output to be 9.00, but instead I only see 9.
CREATE FUNCTION [dbo].[fn_PriceLadderCheck]
    (@CheckPrice FLOAT,
     @Jur VARCHAR(10))
RETURNS FLOAT
AS
BEGIN
    DECLARE @ReturnPrice FLOAT

    IF (@Jur = 'SE')
    BEGIN
        SET @ReturnPrice = (SELECT [Swedish Krona ]
                            FROM tbl_priceladder_swedishkrona
                            WHERE [Swedish Krona ] = @CheckPrice +
                                      (SELECT MIN(ABS([Swedish Krona ] - @CheckPrice))
                                       FROM tbl_priceladder_swedishkrona)
                               OR [Swedish Krona ] = @CheckPrice -
                                      (SELECT MIN(ABS([Swedish Krona ] - @CheckPrice))
                                       FROM tbl_priceladder_swedishkrona))
    END

    IF (@Jur = 'DK')
    BEGIN
        SET @ReturnPrice = (SELECT [Danish Krone ]
                            FROM tbl_priceladder_danishkrone
                            WHERE [Danish Krone ] = @CheckPrice +
                                      (SELECT MIN(ABS([Danish Krone ] - @CheckPrice))
                                       FROM tbl_priceladder_danishkrone)
                               OR [Danish Krone ] = @CheckPrice -
                                      (SELECT MIN(ABS([Danish Krone ] - @CheckPrice))
                                       FROM tbl_priceladder_danishkrone))
    END

    RETURN @ReturnPrice
END
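For intuition, what the function computes is simply "snap the price to the nearest value on the ladder"; a minimal Python sketch of that logic, with hypothetical ladder values:

def nearest_ladder_price(check_price, ladder):
    # same idea as the SQL's MIN(ABS(ladder - price)) lookup
    return min(ladder, key=lambda p: abs(p - check_price))

print(nearest_ladder_price(10.3615384615385, [9.00, 10.00, 10.50, 11.00]))  # 10.5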
Run SQL manually:
declare @checkprice float
set @checkprice = '10.3615384615385'

SELECT [Swedish Krona ]
FROM tbl_priceladder_swedishkrona
WHERE [Swedish Krona ] = @CheckPrice +
      (SELECT MIN(ABS([Swedish Krona ] - @CheckPrice))
       FROM tbl_priceladder_swedishkrona)
   OR [Swedish Krona ] = @CheckPrice -
      (SELECT MIN(ABS([Swedish Krona ] - @CheckPrice))
       FROM tbl_priceladder_swedishkrona)
When I use this function in a SQL SELECT statement, for some reason it removes the two decimal places.
SELECT
    Article, Colour,
    dbo.fn_PriceLadderCheck([New Price], 'se') AS [New Price]
FROM
    #temp2 t
[New Price] on its own is, for example, 10.3615384615385.
Any ideas?
Cast the result to a DECIMAL and specify the scale. Since the function is declared RETURNS FLOAT, change the return type to DECIMAL(16,2) as well, otherwise the value is converted straight back to FLOAT on return. See the example below.
RETURN CAST(@ReturnPrice AS DECIMAL(16,2))

mark rows with timestamp between times

I need to mark rows in a time series where the timestamps fall within given time-of-day blocks. For example, when I have:
values = (['motorway'] * 5000) + (['link'] * 300) + (['motorway'] * 7000)
df = pd.DataFrame.from_dict({
    'timestamp': pd.date_range(start='2018-1-1', end='2018-1-2', freq='s').tolist()[:len(values)],
    'road_type': values,
})
df.set_index('timestamp', inplace=True)
I need to add a column rush that marks rows where the timestamp is between 06:00 and 09:00 or between 15:30 and 19:00. I've seen between_time, but I don't know how to apply it here.
Edit: based on this answer, I managed to put together:
df['rush'] = (df.index.isin(df.between_time('00:00:15', '00:00:20', include_start=True, include_end=True).index)
              | df.index.isin(df.between_time('00:00:54', '00:00:59', include_start=True, include_end=True).index))
but I wonder whether there isn't a more elegant way.
One alternative using between
from datetime import time as t

values = (['motorway'] * 5000) + (['link'] * 300) + (['motorway'] * 7000)
df = pd.DataFrame.from_dict({
    'timestamp': pd.date_range(start='2018-1-1', end='2018-1-2', freq='s').tolist()[:len(values)],
    'road_type': values,
})

# note: the sample spans only ~3.4 hours, so these windows are minute-scale
# stand-ins (00:06-00:09 and 00:15:30-00:19) for the 06:00-09:00 and 15:30-19:00 blocks
time = df['timestamp'].dt.time
df['rush'] = (time.between(t(0, 6, 0), t(0, 9, 0)) | time.between(t(0, 15, 30), t(0, 19, 0))).values
Or slicing the df using datetime.time:
df = df.set_index(df.timestamp.dt.time)
df['rush'] = df.index.isin(df[t(0, 6, 0):t(0, 9, 0)].index | df[t(0, 15, 30):t(0, 19, 0)].index)
df = df.reset_index(drop=True)
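A possibly tidier variant for the original 06:00-09:00 and 15:30-19:00 windows, sketched on the assumption that df keeps the DatetimeIndex from the question, uses DatetimeIndex.indexer_between_time:

import numpy as np

# positional indices of rows inside each time-of-day window
rush = np.zeros(len(df), dtype=bool)
rush[df.index.indexer_between_time('06:00', '09:00')] = True
rush[df.index.indexer_between_time('15:30', '19:00')] = True
df['rush'] = rush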