I am adding a column to a dataframe that calculates, for each customer, the number of days between each date and the previous one, using the following formula, but I end up running out of memory:
lapsed['Days'] = lapsed[['Customer Number', 'GL Date']].groupby('Customer Number').diff()
The dataframe contains more than 1 million records.
Customer Number is an int64, and I was thinking of running the above statement within ranges of customer numbers, but I do not know if this is the best approach.
Any suggestions?
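A sketch of one lower-memory variation, assuming the column names from the question: sort once, then diff only the 'GL Date' Series instead of a two-column frame, and convert the timedeltas to integer day counts.

import pandas as pd

# Toy frame standing in for the real data; column names taken from the question.
lapsed = pd.DataFrame({
    'Customer Number': [1, 1, 2, 2, 2],
    'GL Date': pd.to_datetime(['2020-01-01', '2020-01-05',
                               '2020-01-02', '2020-01-09', '2020-01-20']),
})

# Sort so each customer's dates are consecutive and ascending.
lapsed = lapsed.sort_values(['Customer Number', 'GL Date'])

# Diff only the date Series (not a two-column frame) and store plain ints,
# which avoids materializing an intermediate DataFrame.
lapsed['Days'] = (lapsed.groupby('Customer Number')['GL Date']
                        .diff()
                        .dt.days)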
My dataset is as follows (two columns: DATE and RATE).
I want to get the mean of the RATE for each day (as you can see from the dataset, there are multiple rate values for the same day). I have about 1,000 rows, so I am trying to find an easy way to calculate the mean for each day and then save the results to a data frame.
You have to group by date, then aggregate:
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
In your case:
df.groupby('DATE').agg({'RATE': ['mean']})
You can group by the date and perform a mean operation:
new_df = df.groupby('DATE').mean()
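A minimal runnable sketch with made-up data, showing both suggested approaches side by side:

import pandas as pd

# Made-up data: multiple RATE readings per day.
df = pd.DataFrame({'DATE': ['2021-01-01', '2021-01-01', '2021-01-02'],
                   'RATE': [1.0, 3.0, 2.0]})

# agg() leaves room to add more statistics later...
print(df.groupby('DATE').agg({'RATE': ['mean']}))

# ...while mean() is the shortest path to a per-day average.
print(df.groupby('DATE').mean())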
I have a dataframe with 19M rows for different customers (~10K customers) and their daily consumption over different date ranges. I have resampled this data into weekly consumption, and the resulting dataframe is 2M rows. I want to find the ranges of consecutive dates for each customer and select those with the maximum range. Any ideas? Thank you!
It would be great if you could post some example code, so the replies can be more specific.
You probably want to do something like earliest = df.groupby('Customer_ID')['Consumption_date'].min() to get the earliest consumption date per customer, and latest = df.groupby('Customer_ID')['Consumption_date'].max() for the latest, and then take the difference time_span = latest - earliest to get the time span per customer.
Knowing the specific df and variable names would be great.
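Since the question asks for runs of consecutive dates rather than the overall span, here is a sketch of one way to find the longest run of consecutive weekly dates per customer, with Customer_ID and Consumption_date as assumed column names:

import pandas as pd

# Hypothetical weekly consumption frame; column names are assumptions.
df = pd.DataFrame({
    'Customer_ID': [1, 1, 1, 1, 2, 2],
    'Consumption_date': pd.to_datetime(['2021-01-04', '2021-01-11',
                                        '2021-02-01', '2021-02-08',
                                        '2021-01-04', '2021-01-11']),
})
df = df.sort_values(['Customer_ID', 'Consumption_date'])

# A new run starts whenever the gap to the previous week is not exactly 7 days.
gap = df.groupby('Customer_ID')['Consumption_date'].diff()
run_id = (gap != pd.Timedelta(days=7)).cumsum().rename('run')

# Start, end, and length of each consecutive run, then the longest per customer.
runs = df.groupby(['Customer_ID', run_id])['Consumption_date'].agg(['min', 'max'])
runs['length'] = runs['max'] - runs['min']
longest = runs.loc[runs.groupby('Customer_ID')['length'].idxmax()]
print(longest)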
[Example data picture]
My main data table is constructed in the following way:
1. State
2. Product
3. Account Name
4. Jan-20
5. Feb-20
...
N. Recent Month - Recent Year
My goal is to get 6 total sums based on 6 different contiguous column ranges that are user-selected. For example, if someone wanted the value of an Account given a State and Product for FY-2020, they would sum columns 4 through 15 (twelve months).
I am going to be running joins and queries off the State, Product, Account combinations (the first three columns), but I need a method to sum the data table given a list of column numbers.
At this point, I am not looking to put non-contiguous columns in a selected time period (i.e., all time period selections will run from Col.Beg_TPn through Col.End_TPn inclusive). The data table houses monthly data and will gain one new column every month. The column numbers should stay consistent, as we are not looking back further than FY-2020.
This is a much easier problem in Excel, as you can do a simple SUMIFS with an INDEX of the column range as the sum range and then filter on columns 1, 2, and 3. My data table is about 30,000 rows, though, so Excel freezes on me when I run calculations and functions on the entire data set.
What is the best way to go about this in Microsoft Access? Ideally, I would like to create a CTE_TimePeriodTotals that houses the first 3 columns (State, Product, Account) and then 6 TP columns (tp-1, tp-2, ...) that hold the sum of each time period for each row, based on each time period's column start and column end.
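The question is about Access, but the core operation, summing a user-selected contiguous range of month columns per row, can be sketched in pandas (the tool used elsewhere in this thread) just to show the shape of the computation; the table layout and column positions below are made up:

import pandas as pd

# Made-up table mirroring the layout: 3 key columns, then month columns.
df = pd.DataFrame({
    'State': ['TX', 'CA'],
    'Product': ['A', 'B'],
    'Account Name': ['Acct1', 'Acct2'],
    'Jan-20': [1, 2], 'Feb-20': [3, 4], 'Mar-20': [5, 6],
})

# Hypothetical user-selected time period, as 0-based column positions:
# columns 3..5 are Jan-20 through Mar-20.
beg, end = 3, 5

# Sum the contiguous slice of month columns row by row.
df['TP_1'] = df.iloc[:, beg:end + 1].sum(axis=1)
print(df)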
Hi, I have the following dataframe, and I want to count the number of times that each year repeats:
df = pd.DataFrame({'year':[1958,1963,1958,1963],'title':['a','g','z','e']})
How can I group by the year and count how many times each year appears? I would like to create an additional column with the count.
Check with value_counts:
out = df['year'].value_counts()
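Since the question mentions wanting the count as an additional column rather than a standalone Series, a transform-based sketch (the column name year_count is made up):

import pandas as pd

df = pd.DataFrame({'year': [1958, 1963, 1958, 1963],
                   'title': ['a', 'g', 'z', 'e']})

# Standalone counts per year:
out = df['year'].value_counts()

# Or attach the count to every row as a new column:
df['year_count'] = df.groupby('year')['year'].transform('size')
print(df)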
This question is similar to my previous one: Shifting elements of column based on index given condition on another column
I have a dataframe (df) with 2 columns and 1 index.
The index is a datetime index in the format 2001-01-30, etc. It is ordered by date, there are thousands of identical dates, and the dates are monthly. Column A is the company name (which corresponds to the date), and Column B holds the share prices for the companies in Column A on the dates in the index.
There are multiple companies in Column A for each date, and the companies vary over time (so the data is not fully predictable).
I want to create a column C that holds the 3-day rolling exponentially weighted average of the price for a particular company in Column A, using the current date and the 2 dates before it for that company.
I have tried a few methods but have failed. Thanks.
Try:
df['C'] = df.groupby('ColumnA')['ColumnB'].transform(lambda s: s.ewm(span=3).mean())
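A minimal runnable sketch with made-up monthly data, keeping the ColumnA/ColumnB names from the answer; transform keeps the original row order, so the result assigns straight back as column C:

import pandas as pd

# Made-up monthly data: two companies sharing each date.
idx = pd.to_datetime(['2001-01-30', '2001-01-30',
                      '2001-02-28', '2001-02-28',
                      '2001-03-30', '2001-03-30'])
df = pd.DataFrame({'ColumnA': ['AAA', 'BBB'] * 3,
                   'ColumnB': [10.0, 20.0, 11.0, 19.0, 12.0, 21.0]},
                  index=idx)

# 3-period exponentially weighted mean within each company.
df['C'] = (df.groupby('ColumnA')['ColumnB']
             .transform(lambda s: s.ewm(span=3).mean()))
print(df)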