My dataset has two columns: DATE and RATE.
I want to get the mean of RATE for each day (as you can see from the dataset, there are multiple rate values for the same day). I have about 1,000 rows, so I am looking for an easy way to calculate the mean for each day and then save the results to a data frame.
You have to group by date, then aggregate:
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
In your case:
df.groupby('DATE').agg({'RATE': ['mean']})
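A side note: passing a list of functions like ['mean'] gives the result MultiIndex columns. If you just want a single flat column, named aggregation (available since pandas 0.25) is a handy alternative; a minimal sketch, reusing the question's df (RATE_mean is just an illustrative name):

# DATE and RATE are the columns from the question; RATE_mean is an assumed output name
daily = df.groupby('DATE').agg(RATE_mean=('RATE', 'mean')).reset_index()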
You can group by the date and perform a mean operation.
new_df = df.groupby('DATE').mean()
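Since the question also asks to save the results to a data frame, here is a minimal end-to-end sketch with made-up toy data; reset_index() turns the grouped result back into an ordinary data frame:

import pandas as pd

# toy data mirroring the question's layout: several RATE values per DATE
df = pd.DataFrame({
    'DATE': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'RATE': [1.0, 3.0, 5.0, 7.0],
})

new_df = df.groupby('DATE')['RATE'].mean().reset_index()
print(new_df)   # one row per DATE with the mean RATE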
This is how the data frame looks. I want to group by day and then do the following to get merged intervals for each group. Is there an elegant way to shrink/merge overlapping intervals for each day?
(df["startTime"]>df["endTime"].shift()).cumsum()
I know that I can add a column that denotes the partition by day, like so:
df["partition"]=df.groupby(["actualDay","weekDay"]).ngroup()
but how do I make a shift exclusively within the group?
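One way to keep the shift inside each group is to call shift() on the grouped column itself, so values never leak across day boundaries. A minimal sketch, using the column names from the question and assuming the frame is sorted by the grouping keys and startTime (if intervals can be fully nested, compare against a running cummax of endTime instead of a plain shift):

# endTime of the previous row *within the same day group*; NaN at each group's first row
prev_end = df.groupby(["actualDay", "weekDay"])["endTime"].shift()

# a new interval block starts wherever the current start is past the previous end
df["interval_id"] = (df["startTime"] > prev_end).cumsum()

# collapse each block into one merged interval per day
merged = (df.groupby(["actualDay", "weekDay", "interval_id"])
            .agg(startTime=("startTime", "min"), endTime=("endTime", "max"))
            .reset_index())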
I have a table that has each transaction along with a field that shows how many units were cancelled in the order. If I filter the table on cancelled_units > 0, I can pull all transactions that were cancelled. There is also detailed date information for each transaction, but I think I only need the date. I need to calculate total cancelled orders / total orders to get a cancellation rate, and then spread that out across every week for the past 12 months. I was thinking maybe of using a CASE statement with some sort of counter in place? Also, I am using Databricks, so maybe there is some built-in function or operator that would make this easier. Appreciate you taking a look at my question.
From the context provided, you have a data frame with the list of transactions. It is also clear that there is a transaction column, a timestamp indicating when the order was placed, and the number of units cancelled in each transaction. So, when you filter your data frame on the condition cancelled_units > 0 and count the rows, you get the number of cancelled orders.
Using Spark in Databricks:
Now, to find the rate of cancellation (cancelled_orders / total_orders) for every week in the past 12 months: I was able to find a way to calculate the cancellation rate within a particular year (not a rolling 12 months), so gather all the records for the year of interest first.
Since the timestamp indicating when the order was placed is already available in the data frame, we can use it to find out in which week of the year the transaction was made. You can do this as follows (the syntax is similar for both PySpark and Spark with Scala).
df.withColumn("order_placed_week",date_format(col("transaction_date"), "w")).show()
Here transaction_date is a timestamp. If it is stored as a date string instead, then use the following method.
df.withColumn("order_placed_week", date_format(to_date("transaction_date", "dd/mm/yyyy"), "w")).show()
The to_date() function lets you specify the format in which transaction_date is stored in your data frame (note that the month is uppercase MM; lowercase mm would mean minutes).
The libraries that are required to be imported are:
For pyspark:
from pyspark.sql.functions import to_date, date_format, col
Reference: https://www.datasciencemadesimple.com/get-month-year-and-quarter-from-date-in-pyspark/
For spark:
import org.apache.spark.sql.functions._
Reference: https://sparkbyexamples.com/spark/spark-how-to-get-a-day-and-week-of-year/
After completing this process, you can use the resulting data frame with the order_placed_week column to get the cancellation rate.
Get the count of orders for each week number, and then the count of orders with cancelled units, using groupBy and filter. Dividing count_of_cancelled / total_count for each week gives the desired result.
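A minimal PySpark sketch of that last step, assuming the column names used above (transaction_date, cancelled_units) and a data frame df already restricted to the year of interest:

from pyspark.sql import functions as F

weekly = (
    df.withColumn("order_placed_week", F.date_format(F.col("transaction_date"), "w"))
      .groupBy("order_placed_week")
      .agg(
          F.count("*").alias("total_orders"),
          # count() ignores nulls, so this only counts rows with cancelled units
          F.count(F.when(F.col("cancelled_units") > 0, True)).alias("cancelled_orders"),
      )
      .withColumn("cancellation_rate", F.col("cancelled_orders") / F.col("total_orders"))
)
weekly.show()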
I have a dataframe with 19M rows for different customers (~10K customers) and their daily consumption over different date ranges. I have resampled this data into weekly consumption, and the resulting dataframe is 2M rows. I want to know the ranges of consecutive dates for each customer and select those with the maximum range. Any ideas? Thank you!
It would be great if you could post some example code, so the replies can be more specific.
You probably want to do something like earliest = df.groupby('Customer_ID').min()['Consumption_date'] to get the earliest consumption date per customer, latest = df.groupby('Customer_ID').max()['Consumption_date'] for the latest consumption date, and then take the difference time_span = latest - earliest to get the time span per customer.
Knowing the specific df and variable names would be great.
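A minimal sketch of that idea, with Customer_ID and Consumption_date assumed as above; note that this measures the span between the first and last date per customer, so if you need runs of strictly consecutive weeks you would add a gap check on the sorted dates:

# span between first and last consumption date per customer
grouped = df.groupby('Customer_ID')['Consumption_date']
time_span = (grouped.max() - grouped.min()).rename('time_span')

# customer(s) with the longest span
print(time_span[time_span == time_span.max()])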
I have an hourly (timestamp) dataset of events from the past month.
I would like to check the performance of events that occurred between certain hours, group them together and average the results.
For example: AVG income of the hours 23:00-02:00 per user:
So if I have the data set below, I'd like to summarise the coloured rows and then average them (the result should be 218).
I tried NTILE, but it couldn't divide the data properly while ignoring the irrelevant hours.
Is there a good way to create these custom buckets using SQL?
(dataset screenshot omitted)
From the description I'm not exactly sure how you want to aggregate; if you provide an example dataset, I can update the answer.
However, you can easily achieve this with AVG and an IF statement.
AVG(IF(EXTRACT(HOUR FROM timestamp_field) BETWEEN 0 AND 4, value, NULL)) AS avg_value
Using the above, you can then group by either day or month to get the aggregation level you want.
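If you end up doing this in pandas instead of SQL, the same conditional-average idea looks like the sketch below. The column names and the wrap-around handling of the 23:00-02:00 window are assumptions based on the question; with the two night-time rows at 200 and 236, the result comes out to the 218 mentioned above.

import pandas as pd

# toy events table; user_id / timestamp / income are illustrative names
events = pd.DataFrame({
    'user_id':   [1, 1, 1, 2],
    'timestamp': pd.to_datetime(['2023-05-01 23:30', '2023-05-02 01:10',
                                 '2023-05-02 13:00', '2023-05-01 00:45']),
    'income':    [200, 236, 500, 150],
})

hour = events['timestamp'].dt.hour
in_window = (hour >= 23) | (hour < 2)    # the 23:00-02:00 window wraps past midnight

avg_per_user = events[in_window].groupby('user_id')['income'].mean()
print(avg_per_user)    # user 1 -> 218.0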
I have a dataset that includes time (hh:mm:ss) and temperature.
I want to aggregate the temperature with respect to the time.
For each minute in a specific hour there are a number of temperature records, and I want to calculate their average so that there is a single value for each minute.
Thanks in advance.
Use date functions ( http://www.w3schools.com/sql/ ) to get a more general (less precise) time [i.e. hour and minute only], group by that, and use the AVG SQL function to get your average value.
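If you are working in pandas rather than SQL, a minimal sketch of the same per-minute averaging (the column names time and temperature are assumptions):

import pandas as pd

# toy readings: several temperatures within the same minute
df = pd.DataFrame({
    'time':        ['10:15:03', '10:15:41', '10:16:07'],
    'temperature': [21.0, 21.4, 22.0],
})

df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
# truncate each timestamp to the minute, then average the readings that share it
per_minute = df.groupby(df['time'].dt.floor('min'))['temperature'].mean()
print(per_minute)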