Keep timestamp information when using daily groupby - pandas

Please see the code example and output below. Using groupby I am able to identify the max "Ask Volume Bid Volume Total" value for each day.
I also want to see what time of day this happened for each day. Basically I don't want to lose the time from the timestamp. How do I do this please?

Use idxmax on the groupby object to index back into your original df so you can see the full resolution timestamps associated with those max values:
data.loc[grouped['Ask Volume Bid Volume Total'].idxmax()]
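A minimal, runnable pandas sketch of that pattern (the data here is made up; only the column name mirrors the question):

import pandas as pd

# Toy intraday data: two days, several timestamps per day.
idx = pd.to_datetime([
    "2023-01-02 09:30", "2023-01-02 11:00", "2023-01-02 15:45",
    "2023-01-03 10:15", "2023-01-03 14:30",
])
data = pd.DataFrame({"Ask Volume Bid Volume Total": [10, 40, 25, 55, 30]}, index=idx)

# Group by calendar day; idxmax() returns the full timestamp of each day's
# max, and .loc pulls those complete rows back out of the original frame.
grouped = data.groupby(data.index.date)
print(data.loc[grouped["Ask Volume Bid Volume Total"].idxmax()])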

Related

How to resample and extend datetime data in this pandas DataFrame?

If I have a DataFrame like the one in the photo (https://i.stack.imgur.com/5u3WR.png), I would like to have, for each grid point, the same time series (repeated over and over again), namely:
t_index_np = np.arange('2013-01-01', '2022-12-31 23:00:00', dtype='datetime64[h]')
The frequency is hourly.
You have to take into account that for many grid points there is only one associated date.
What I have tried so far is a for loop with resample and pd.merge, but the problem there is that it doesn't work for such points (those with only a single date). As for the Total Power column, it must be forward-filled.
Thanks in advance!
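One hedged sketch of a reindex-and-forward-fill approach that also handles points with a single date (the column names grid_point, time, and Total Power are assumptions based on the description, since the actual frame is only shown in the screenshot):

import numpy as np
import pandas as pd

t_index_np = np.arange('2013-01-01', '2022-12-31 23:00:00', dtype='datetime64[h]')
full_index = pd.DatetimeIndex(t_index_np)

# Toy frame; 'B' stands in for a grid point with only one associated date.
df = pd.DataFrame({
    'grid_point': ['A', 'A', 'B'],
    'time': pd.to_datetime(['2013-01-01 00:00', '2013-06-01 12:00', '2013-01-01 00:00']),
    'Total Power': [1.0, 2.0, 5.0],
})

# Reindex each grid point's series onto the full hourly index and
# forward-fill Total Power; no resample/merge is needed, so single-date
# points work the same as the rest.
def expand(group):
    return (group.drop(columns='grid_point')
                 .set_index('time')
                 .reindex(full_index)
                 .ffill())

result = df.groupby('grid_point').apply(expand)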

Is it possible to create a rolling average in SQL that resets based on another column value?

I am working with data that contains gas concentrations from different locations. My goal is to create a rolling average for each location's two-week sample periods. My problem is that I know how to use the window function, but I don't know how to make the window reset whenever the location ID changes.
Picture of the desired output data
Ideally, there would be a moving average for each location ID that restarts when it hits the next location ID. I have torn apart the internet looking for a solution, so I'm hoping that a specific question might help me find the answer.
If you need more information, please let me know and I'll give whatever else I can. Thank you!
SELECT
    start,
    "end",  -- "end" is a reserved word in many dialects, so it may need quoting
    concentration,
    location,
    AVG(concentration) OVER (PARTITION BY location ORDER BY start) AS rolling_average
FROM <your_table>
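If it helps to sanity-check the window logic, here is a rough pandas equivalent of the same per-location running average (the column names simply mirror the question):

import pandas as pd

df = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "start": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-01-29",
                             "2023-01-01", "2023-01-15"]),
    "concentration": [1.0, 3.0, 5.0, 10.0, 20.0],
})

# expanding().mean() matches AVG(...) OVER (PARTITION BY location ORDER BY start):
# a running average that restarts whenever the location changes. For a fixed
# window (e.g. the last N samples) you would use .rolling(N) instead.
df = df.sort_values(["location", "start"])
df["rolling_average"] = (df.groupby("location")["concentration"]
                           .expanding().mean()
                           .reset_index(level=0, drop=True))
print(df)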

Double or triple timestamp issue

I am using SQL assistant and my data brings in snapshots from a huge database in the form of timestamps. Occasionally the snapshots bring in multiples per hour. The data is correct, multiple snapshots do happen from time to time within an hour, not always but it does happen.
I am bringing this into Spotfire and viewing by hour, and when more than one snapshot happens in the hour, the data shows as doubled.
I only want to display one per hour, preferably the last (max) timestamp for the hour. Example: for the 7 am hour the data has a snapshot for 7:10 am and one for 7:55 am.
These are correct, but I only want to display the last (max) timestamp, 7:55 am in this case. I can't figure the issue out in Spotfire, so I am leaning towards a fix in SQL. How can I display only one row for each hour?
You'd do this similarly to how you'd probably do it in SQL: using a ranking/row-number function.
The basic way Rank in Spotfire works is Rank(Order columns, order direction, partitioned columns, tie method)
You need to partition by the combination of Date and Hour, and then sort descending by your timestamp column.
So the code to identify the rows that you want to isolate should be something along the lines of:
Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first")
What you do with it from here depends on how you plan to use the data. For example, you can Limit Data Using Expression and set the expression above = 1, which will limit your table accordingly (helpful if you don't want your users to accidentally forget to filter), or you can create a calculated column which turns it into a flag of some form, like this:
If(Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first") = 1, "Latest", "Duplicate")
Which allows your users to filter by this property. This way, they have the option to look at the extra rows.
Ultimately, though, if you want to only ever see these rows, and have no use for the earlier records, I'd probably do it in SQL, if you have that ability. This reduces the number of rows you have to load into your analytic.
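For illustration, if you do end up pre-filtering upstream of Spotfire and the data passes through pandas at any point, the "one row per hour, latest timestamp wins" rule is a short sketch (TimestampColumn is just a placeholder name):

import pandas as pd

df = pd.DataFrame({
    "TimestampColumn": pd.to_datetime(["2023-05-01 07:10", "2023-05-01 07:55",
                                       "2023-05-01 08:20"]),
    "value": [1, 2, 3],
})

# Bucket each snapshot into its hour, then keep only the latest row per bucket.
df["hour"] = df["TimestampColumn"].dt.floor("h")
latest_per_hour = df.sort_values("TimestampColumn").groupby("hour").tail(1)
print(latest_per_hour)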

Google Sheets monthly sum code optimization

Good day everybody, we are making a spreadsheet for incoming cash monitoring, and I feel like the method I used to achieve the monthly sum is possibly the worst.
So I was wondering if some of you have a shorter solution.
=SUM(FILTER('Dashboard'!D2:D;'Dashboard'!E2:E="Incoming";'Dashboard'!C2:C>=DATE(text(today()-text(today();"dd");"yyyy");(text(today()-text(today();"dd");"mm"));(text(today()-text(today();"dd");"dd")));'Dashboard'!C2:C<=DATE(text(today();"yyyy");(text(today();"mm"));(text(today();"dd")))))
So since this looks like a cluster**** of code, I will try to annotate it:
=SUM(FILTER('Dashboard'!D2:D;'Dashboard'!E2:E="Incoming"
Sort by only the incoming cash and not outgoing
;'Dashboard'!C2:C>=DATE(text(today()-text(today();"dd");"yyyy");(text(today()-text(today();"dd");"mm"));(text(today()-text(today();"dd");"dd")));'Dashboard'!C2:C<=DATE(text(today();"yyyy");(text(today();"mm"));(text(today();"dd")))
The range is from the 1st day of the month to today's date.
Method: get today's date and subtract the day-of-month from it, to get the first day of the month.
Which isn't even a true monthly sum, but rather a month-to-date sum.
I'm really sorry, but due to company policy I can't link the file itself; the sheet is rather simple.
The columns are:
Date, Sum, "Incoming/Outgoing", "Cash/Credit"
I also have a weekly sum, but I feel like that formula is somewhat decent:
=query(filter('Dashboard'!C2:D;'Dashboard'!E2:E="Incoming";weeknum('Dashboard'!C2:C;1)=weeknum(today();1));"Select Sum (Col2) label Sum(Col2)''";-1)
There's no need to format the date to 'yyyy-mm-dd'. You can use EOMONTH to get the last day of last month, so anything strictly greater than EOMONTH(TODAY();-1) falls on or after the 1st of the current month.
=SUM(FILTER('Dashboard'!D2:D;'Dashboard'!E2:E="Incoming";'Dashboard'!C2:C>EOMONTH(TODAY();-1);'Dashboard'!C2:C<=TODAY()))
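The same month-to-date idea, expressed as a rough pandas sketch in case it clarifies the EOMONTH boundary (column names follow the sheet layout described above; the dates are made up):

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-05-28", "2023-06-01", "2023-06-10"]),
    "Sum": [100, 50, 75],
    "Direction": ["Incoming", "Outgoing", "Incoming"],
})

today = pd.Timestamp("2023-06-15")   # stand-in for TODAY()
month_start = today.replace(day=1)   # EOMONTH(TODAY();-1) is the day before this

mtd = df.loc[(df["Direction"] == "Incoming")
             & (df["Date"] >= month_start)
             & (df["Date"] <= today), "Sum"].sum()
print(mtd)  # 75: only the incoming June row up to "today"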

BigQuery Google Analytics sessionsWithEvent metric

I'm having trouble creating a BigQuery query that will allow for me to fetch the Google Analytics ga:sessionsWithEvent metric.
This is what I tried:
SELECT
EXACT_COUNT_DISTINCT(concat(fullvisitorid, string(visitid))) AS distinctVisitIds
FROM
(TABLE_DATE_RANGE([xxxxxxxx.ga_sessions_], TIMESTAMP('2016-11-30'), TIMESTAMP('2016-12-26')))
WHERE
hits.type='EVENT'
The logic in the query above seems sound: get all the rows that have a hits.type of 'EVENT' and count the exact number of distinct fullVisitorId/visitId combinations, i.e. the number of unique sessions with an event.
But the numbers I get from here are close to, but higher than, what I get using the Query Explorer.
Thank you.
EDIT: addressing the comment below about using a wider table date range with an explicit date filter.
With the table range widened by ±5 days, the query becomes:
SELECT
EXACT_COUNT_DISTINCT(concat(fullvisitorid, string(visitid))) AS distinctVisitIds
FROM
(TABLE_DATE_RANGE([xxxxxxxx.ga_sessions_], TIMESTAMP('2016-11-25'), TIMESTAMP('2016-12-31')))
WHERE
hits.type='EVENT'
AND ('20161130'<=date AND date<='20161226')
Unfortunately I still get the same number.
Don't rely on the table dates alone: even later tables can contain data from previous days. Instead, use a larger date range in the FROM clause and an exact date range on the date column.
AFAIK the Query Explorer also does approximations.