Pandas multi index/groupby

I have a data frame containing several records of sold products, with revenues, prices, and dates:
DATA CLIENT PRODUCT PRICE
2020-08-28 xxxxxxx RIOT 20.0
I want to group my information by year/month and product. I am running a groupby with to_period that extracts exactly the information I need:
dfgmv = dfgift[['PRODUCT','PRICE']].groupby([dfgift.DATA.dt.to_period("M"), 'PRODUCT']).agg(['count','sum'])
This is the output :
PRICE
count sum
DATA PRODUCT
2020-08 RIOT 2 40.00
The problem is that, when I export to Excel, the date column is not interpreted as a date (yyyy-mm). I am trying to convert the yyyy-mm to something like yyyy-mm-dd so Excel understands it.
I've read several questions about multi-indexes, but my knowledge wasn't enough to apply that information here. I tried to change my index to datetime, but when I ran it, I lost the second index level (product).
dfgmv.index = pd.to_datetime(dfgmv.index.get_level_values(0).astype('datetime64[ns]'))
This was the result:
PRICE
count sum
DATA
2020-08-01 2 40.00
So, how can I change the date format without losing my index?

Index.set_levels is designed to allow for the setting of specific index level(s).
dfgmv.index = (
dfgmv.index.set_levels(dfgmv.index.levels[0].astype('datetime64[ns]'), level=0)
)
Result
PRICE
count sum
DATA PRODUCT
2020-08-01 RIOT 1 20.0
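Since the first index level here is a PeriodIndex (produced by to_period("M")), an equivalent sketch uses PeriodIndex.to_timestamp(), which maps each monthly period to its month-start timestamp:
# Sketch: replace the period level with month-start timestamps (yyyy-mm-01)
dfgmv.index = dfgmv.index.set_levels(dfgmv.index.levels[0].to_timestamp(), level=0)
Either way, only level 0 is replaced, so the PRODUCT level is preserved.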

You can change your groupby to use the start of the month for each date:
dfgmv = dfgift[['PRODUCT','PRICE']].groupby([dfgift.DATA.astype('datetime64[M]'), 'PRODUCT']).agg(['count','sum'])
dfgmv
PRICE
count sum
DATA PRODUCT
2020-08-01 RIOT 1 20.0
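Note that recent pandas versions reject the 'datetime64[M]' cast (only second through nanosecond units are supported), so here is a version-agnostic sketch of the same month-start normalization:
# Sketch: round each date down to the first of its month, keeping datetime dtype
month_start = dfgift.DATA.dt.to_period('M').dt.to_timestamp()
dfgmv = dfgift[['PRODUCT','PRICE']].groupby([month_start, 'PRODUCT']).agg(['count','sum'])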

Related

How to filter by last max value/date on DataStudio?

I have a BigQuery dataset that updates at irregular times (can be once or twice a week, or less). The data is structured as follows.
id Column1 Column2 data_date(timestamp)
0 Datapoint0 Datapoint00 2022-01-01
1 Datapoint1 Datapoint01 2022-01-01
2 Datapoint2 Datapoint02 2022-01-03
3 Datapoint3 Datapoint03 2022-01-03
4 Datapoint4 Datapoint04 2022-02-01
5 Datapoint5 Datapoint05 2022-02-01
6 Datapoint6 Datapoint06 2022-02-15
7 Datapoint7 Datapoint07 2022-02-15
Timestamp is a string in 'YYYY-MM-DD' format.
I want to make a chart and a pivot table in Google DataStudio that automatically filters by the latest datapoints ('2022-02-15' in the example). All the solutions I tried are either sub-optimal or just don't work:
Creating a support column doesn't work, because I need to mix aggregated and non-aggregated fields (data_date and the latest data_date).
Adding a filter to the charts allows me to specify only a specific day; I would need to edit the chart regularly every time the underlying data is updated.
Using a dropdown filter allows me to dynamically filter whatever date I need, but I consider it suboptimal because I can't have it automatically select the latest date. A date-range filter can make it dynamic, but since the update time is not regular, it may select a range with multiple timestamps or none at all, so it's also a sub-optimal solution.
Honestly I'm out of ideas. I stupidly thought it was possible to add a column saying data_date = (select max(data_date) from dataset), but it seems that is not possible, since max needs to work on aggregated data.
One possible solution could be creating a view that contains only the latest data point, and referencing the view from Data Studio.
CREATE OR REPLACE VIEW `project_id.dataset_id.table_name` AS
SELECT *
FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide`
ORDER BY date DESC # or timestamp DESC
LIMIT 1
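Note that LIMIT 1 keeps exactly one row; to keep every row belonging to the latest date, the view would instead filter where data_date equals the maximum. As a sanity check of that logic outside BigQuery, here is a pandas sketch (the DataFrame is a hypothetical stand-in for the table; since data_date is a 'YYYY-MM-DD' string, its lexicographic max is also its chronological max):
import pandas as pd
# Hypothetical stand-in for the BigQuery table
df = pd.DataFrame({'id': [0, 2, 6, 7],
                   'data_date': ['2022-01-01', '2022-01-03', '2022-02-15', '2022-02-15']})
# Keep every row carrying the latest date
latest = df[df['data_date'] == df['data_date'].max()]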

Creating hash value from multiple columns

I have an example product sales table which consists of around 15 columns and a couple thousand rows. The columns I'm most interested in look like this:
product_type currency amount order_time delivered_time
hoodie GBP 60.0 2021-03-10 14:32:07 2021-03-13 16:05:52
shirt EUR 30.0 2021-03-20 19:22:32 2021-03-24 11:18:46
...
There currently is a unique identifier, but it isn't useful for broad analysis: there can be multiple products in an order, but they'd all have different identifiers, so you can't match them up.
What I want to do is create a new identifier column using a hash function. I've used the code below, and an example of the output I get is shown:
SELECT *, Md5(product_type||currency||amount)
FROM sales
product_type currency amount identifier
Coat GBP 100.0 825be52c31f1d92584720466d743e2cf
Coat GBP 100.0 825be52c31f1d92584720466d743e2cf
This code works for the 3 columns that I've included in the hash function, but I also want to include the two DATETIME columns, and that doesn't work. With the code below, the query runs, but the hash values I get are completely different from each other even when all the values in the columns match up:
SELECT *, Md5(product_type||currency||amount||TRUNC(order_time)||TRUNC(delivered_time))
I've used the TRUNC function on the two date columns as I'm not too concerned about the exact minutes or seconds, mainly interested in just the date itself. How could I include the two datetime columns without it messing up the hash function?
Use to_char(date, 'YYYY-MM-DD') instead of TRUNC()
Md5(product_type||currency||amount||to_char(order_time,'YYYY-MM-DD')||to_char(delivered_time,'YYYY-MM-DD'))
Or if the datatype is string/varchar, use SUBSTRING(date,1,10)
Md5(product_type||currency||amount||SUBSTRING(order_time,1,10)||SUBSTRING(delivered_time,1,10))
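For prototyping, the same date-truncated hash can be reproduced in Python (a sketch using hashlib; the helper name and the string-typed timestamps are assumptions, not part of the original schema):
import hashlib

def row_identifier(product_type, currency, amount, order_time, delivered_time):
    # Keep only the first 10 characters ('YYYY-MM-DD') of each timestamp
    key = f"{product_type}{currency}{amount}{order_time[:10]}{delivered_time[:10]}"
    return hashlib.md5(key.encode()).hexdigest()

row_identifier('Coat', 'GBP', 100.0, '2021-03-10 14:32:07', '2021-03-13 16:05:52')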

How to select by 1 xbar date/second in kdb+

I am trying to run a select on a table whose data ranges across multiple days, so it does not conform to the daily data that the documentation alludes to.
Applying the xbar selection across multiple days obviously results in data that is not ordered; i.e., select last size, last price by 1 xbar time.second on data that includes 2 days would result in:
second | size price
====================
00:00:01 | 400 555.5
00:00:01 | 600 606.0
00:00:02 | 400 555.5
00:00:02 | 600 606.0
How can one add the date to the selection so that the result stays ordered across multiple days, as it would in pandas, e.g. 2019-09-26 16:34:40?
Furthermore, how does one achieve this whilst maintaining a date format that is compatible with pandas once stored in csv?
NB: It is easiest for us to assist you if you provide code that can replicate a sample of the kind of table that you are working with. Otherwise we need to make assumptions about your data.
Assuming that your time column is of timestamp type (e.g. 2019.09.03D23:11:54.711811000), a simple solution is to xbar by one as a timespan, rather than using the time.second syntax:
select last size, last price by 0D00:00:01 xbar time from data
Using xbar keeps the time column as a timestamp rather than casting it to second type.
If your time column is of some other temporal type then you can still use this method if you have a date column in your table that you can use to cast time to a timestamp. This would look something like:
select last size, last price by 0D00:00:01 xbar date+time from data
I would suggest grouping by both date and second, and then summing them:
update time: date+time from
select last size, last price
by date: `date$time, time: 1 xbar `second$time from data
Or, the shorter and more efficient option is to sum date and second right in the by clause:
select last size, last price by time: (`date$time) + 1 xbar `second$time from data
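Once the result is written to csv with full date+time stamps, the one-second bucketing is easy to verify on the pandas side (a sketch with hypothetical sample data; dt.floor('s') plays the role of 0D00:00:01 xbar):
import pandas as pd
# Hypothetical sample spanning two days
df = pd.DataFrame({'time': pd.to_datetime(['2019-09-26 16:34:40.1',
                                           '2019-09-26 16:34:40.9',
                                           '2019-09-27 16:34:40.2']),
                   'size': [400, 600, 400],
                   'price': [555.5, 606.0, 555.5]})
# Floor each timestamp to the second and keep the last trade per bucket
out = df.groupby(df['time'].dt.floor('s'))[['size', 'price']].last()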

Calculation for month number in time series data

The data I am working with is oil and gas production data. The production table uniquely identifies each well and contains a time series of production values. I want to be able to calculate a column that contains the month number occurrence of production for every well in the production table. This needs to be a calculation, so I can graph the production for various wells based on the production month, not the calendar month. (I want to compare well performance across wells over the life of wells.) Also note that there could be gaps in the production data so you can't depend on having twelve months of sequential production for each well.
I tried using the answer in this post (RankValues), but the calculation would never finish. I have over 4 million rows of production data.
In the table shown below, the values in ProdMonth are what I need to calculate, based on the time occurrence shown in ProdDate. This needs to be performed as a row calculation for each unique WellID.
Thanks.
WellID ProdDate ProdMonth
1 12/1/2011 1
1 1/1/2012 2
1 2/1/2012 3
1 3/1/2012 4
… … …
1 11/1/2012 12
2 3/1/2014 1
2 4/1/2014 2
2 5/1/2014 3
2 6/1/2014 4
2 7/1/2014 5
… … …
2 2/1/2015 12
I would create a new date table that has a row for each day (the granularity of your data). I would then add to that table the ProdMonth column. This will ensure you have dates for all days (even if there are gaps in the well reporting data). Then you can use a relationship between the well production data and the Date table on the ProdDate field. Then if you pull in the ProdMonth from the date table, you'll have a list of all of the ProdMonths (hint: you may need to select 'show values with no data' on the field right click menu in the fields well). Then if you add to the same visualization WellID you should be able to see which wells were active in which ProdMonth. If WellID is a number, you might need do use the 'do not summarize' feature on the WellID to get the result you desire.
I posted this question on PowerPivotPro and Tom Allan provided the DAX formula I needed. The first step was to calculate a field that concatenated Year and Month (YearMonth). Then I utilized the RANKX function as such:
= RANKX ( FILTER ( Data, [WellID] = EARLIER ( [WellID] ) ), [YearMonth], , 1, DENSE )
That did the trick and performed fairly quickly on 12mm rows.
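For comparison, the same dense per-well rank can be prototyped in pandas (a sketch with hypothetical sample rows; rank(method='dense') plays the role of RANKX with DENSE ties):
import pandas as pd
df = pd.DataFrame({'WellID': [1, 1, 1, 2, 2],
                   'ProdDate': pd.to_datetime(['2011-12-01', '2012-01-01', '2012-02-01',
                                               '2014-03-01', '2014-04-01'])})
# Dense rank of dates within each well = the production month number
df['ProdMonth'] = df.groupby('WellID')['ProdDate'].rank(method='dense').astype(int)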

SSRS Multiple Dates returned from Dataset, want to display 1 date per column

I apologize if this is a stupid question, as I am new to SSRS. I have a dataset that returns about 15 dates, e.g.:
01/01/2013
01/05/2013
01/20/2013
01/25/2013
..etc
and I want to put each one of those dates in a new column next to itself like the following:
Day1 Day2 Day3 Day4
01/01/2013 01/05/2013 01/20/2013 01/25/2013
Any idea on how to do so? I would really appreciate the help.
Build a table/matrix and create a column group that contains your date field. It will expand the dates out horizontally when it renders. Here is a link that contains instructions to add a column group to an existing table.
If you have a fixed number of dates (15 in your case), then you can use PIVOT in SQL and have your dataset return the dates in horizontal format (i.e. 1 row with 15 columns); otherwise you can use column grouping to achieve this.
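To visualize the pivoted shape that approach produces, here is a quick pandas mock-up (purely illustrative; the actual pivot would happen in SQL or via the column group):
import pandas as pd
dates = ['01/01/2013', '01/05/2013', '01/20/2013', '01/25/2013']
# One row, one column per date: Day1..DayN
wide = pd.DataFrame([dates], columns=[f'Day{i+1}' for i in range(len(dates))])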