Summation of some filtered rows in a dataframe

I have a dataframe and I want some specific information out of it. I want to filter some rows and sum a couple of columns, but I am not sure how to implement the code.
This is the dataframe that I have:
df
My data look like the above and continue up to 2022. What I want is code that gives the summation of stdedc + aglivc for those time values which end in .33. Would you please help me figure it out?
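A minimal sketch of one way to do this, assuming df has a float time column (values such as 2020.33) alongside stdedc and aglivc; the sample rows below are made up:

import pandas as pd

# Made-up data standing in for the real df described in the question.
df = pd.DataFrame({
    "time":   [2020.00, 2020.33, 2020.67, 2021.33],
    "stdedc": [1.0, 2.0, 3.0, 4.0],
    "aglivc": [10.0, 20.0, 30.0, 40.0],
})

# Keep rows whose time ends in .33; compare the fractional part with a
# small tolerance because of floating-point representation.
mask = ((df["time"] % 1) - 0.33).abs() < 1e-6
total = (df.loc[mask, "stdedc"] + df.loc[mask, "aglivc"]).sum()
print(total)  # 66.0 for the sample rows above

If time is stored as text instead, df["time"].astype(str).str.endswith(".33") gives the same filter.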

Related

Highest count in a data frame pandas

I'm trying to extract 10 columns from a CSV file with pandas in Google Colab, so I'm using this line:
firstten = data.iloc[:, 0:10] # first ten columns of data frame with all rows
After that I used firstten.count() to count the number of values in each column. What I want to know is this: if one of those columns has 80 values and the others have fewer, I want to get the highest count among those ten columns. I used max but it didn't work, so some help with this please. I don't need to know which column it is, just the number. Keep in mind that the highest count (e.g. 80) may be repeated, say 4 times, across the columns. I also thought about using the sort function, but maybe there is another option.
Thanks for your help.
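A minimal sketch of one way to get that single number, assuming the CSV has already been read into data (the file name below is a stand-in):

import pandas as pd

data = pd.read_csv("data.csv")         # stand-in file name
firstten = data.iloc[:, 0:10]          # first ten columns of the data frame, all rows

counts = firstten.count()              # number of non-null values per column
print(counts.max())                    # the highest count among the ten columns
print((counts == counts.max()).sum())  # how many columns share that highest count

counts.max() returns just the number, not the column name, and the last line shows how many of the ten columns hit that maximum (e.g. 4 columns with 80 values).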

Running a for loop on a date range using PySpark

I have a dataframe where one of the columns contains dates ranging from 2019-01-01 to 2019-02-01 in yyyy-mm-dd format.
Is there a way to loop over the dataframe day by day, selecting a day and then filtering by that day? I would like to do some calculations on the filtered dataframe, since each day has multiple records.
Since this is distributed computing, one method I came across is to insert a row-number column using row_number() over a window spanning the entire dataframe and then run the for loop. But I feel that this is counterproductive, since I would be forcing the entire dataframe onto a single node, and my dataframe has millions of rows.
Is there a way to run a for or while loop over the PySpark dataframe without using a window function?
Your expert insights would be greatly welcomed!
Thank You
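One way to avoid the window approach, sketched under the assumption that the date column is called event_date (a made-up name): collect only the distinct dates to the driver and filter the big dataframe once per day.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("events.parquet")   # stand-in for the real dataframe

# Only the distinct dates (a small list) are collected to the driver.
dates = [row["event_date"] for row in df.select("event_date").distinct().collect()]
for day in sorted(dates):
    daily = df.filter(F.col("event_date") == day)
    # ... per-day calculations on the filtered dataframe ...
    print(day, daily.count())

If the per-day calculation can be expressed as an aggregation, df.groupBy("event_date").agg(...) avoids the driver-side loop entirely and stays fully distributed.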

Manipulating time series with different start dates

I'm a novice Python programmer and I have an issue that I was hoping you could help me with.
I have two time series in pandas, but they start at different dates. Let's say one starts in 1989 and the other in 2002. Now I want to compare the cumulative growth of the two by indexing both series to 100 in 2002 (the first period where I have data for both) and calculating the ratio.
What is the best way to go about it? Ideally, the script should check what the earliest available data for a pair of series is and index both to 100 from that point onward.
Thank you in advance!
A practical solution may be to split the dataframe into two columns, one for each time series, and add a 'monthyear' column to each dataframe that only lists the month and year (e.g. 05-2015). Then you can use pd.merge on both dataframes on that month variable, keeping only the rows whose months overlap. The call would be pd.merge(df1, df2, on='monthyear', how='inner').
You can split the pandas dataframe by creating a new dataframe and loading in only one column (or row, depending on what your dataframe looks like): df1 = pd.DataFrame(original_dataframe[0]) and df2 = pd.DataFrame(original_dataframe[1]).
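For the indexing-to-100 part of the question, a sketch under the assumption that the two series are pandas Series with a DatetimeIndex (the sample data below are invented):

import pandas as pd

s1 = pd.Series(range(1, 401), index=pd.date_range("1989-01-01", periods=400, freq="MS"))
s2 = pd.Series(range(1, 241), index=pd.date_range("2002-01-01", periods=240, freq="MS"))

# Earliest date for which both series have data.
start = max(s1.index.min(), s2.index.min())

# Rebase both series to 100 at that common start date.
s1_idx = s1.loc[start:] / s1.loc[start] * 100
s2_idx = s2.loc[start:] / s2.loc[start] * 100

ratio = s1_idx / s2_idx   # ratio of the two cumulative-growth indices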

Pandas Dataframe Functionality

I would like to create dataframes using Excel spreadsheets as source data. I need to transform the data series from the format used to store the data in the Excel spreadsheets into the dataframe end product.
I would like to know if users have experience in using various Python methods to accomplish the following:
- data series transform: I have a series that includes one data value per month, but I would like to expand the table of values to include one value per day, using an index (or perhaps a column with date values). So if table1 has a month-based index and table2 has a daily index, how can I convert table1's values to the table2-based index?
- dataframe sculpting: the data I am working with are not all the same length; some data sets are longer than others. By what methods is it possible to find the shortest series in a column, in the context of a multi-column dataframe?
Essentially, I would like to take individual tables from workbooks and combine them into a single dataframe that uses a single index value as the basis for their presentation. My workbook tables may have data point frequencies of daily, weekly, or monthly and I would like to build a dataframe that uses the daily index as a basis for the table elements while including an element for each day in series that are weekly and monthly.
I am looking at the pandas library, but perhaps there are other libraries with additional functionality that I have overlooked.
Thanks for helping!
For your first question, try something like:
df1 = df1.resample('1d').first()  # upsample the monthly (or weekly) frame to a daily index; new days are NaN
df2.merge(df1, left_index=True, right_index=True)  # join on the shared daily index
That will upsample your monthly or weekly dataframes and merge them with your daily dataframe. Take a look at the interpolate method to fill in the missing values. To get the name of the shortest column, try this:
df.count().idxmin()  # column with the fewest non-null values
Hope that helps!
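For reference, a slightly fuller sketch of the same idea, with made-up monthly and daily frames (the column names are assumptions):

import pandas as pd

monthly = pd.DataFrame(
    {"rate": [1.0, 1.1, 1.3]},
    index=pd.date_range("2021-01-01", periods=3, freq="MS"),
)
daily = pd.DataFrame(
    {"price": range(90)},
    index=pd.date_range("2021-01-01", periods=90, freq="D"),
)

# Upsample the monthly table to a daily index, then interpolate the gaps.
monthly_daily = monthly.resample("1d").first().interpolate()

# Join on the shared daily index, keeping every daily row.
combined = daily.merge(monthly_daily, left_index=True, right_index=True, how="left")

# Name of the column with the fewest non-null values (the shortest series).
print(combined.count().idxmin())   # 'rate' in this example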

Checking for same value in rows and calculate corresponding total

I have a large spreadsheet. Values in column A correspond to values in columns B, C and D.
I need to combine the rows which have the same value in column A and automatically calculate the total of the column B values across all the corresponding rows.
Then I need to delete all the unnecessary rows.
Any ideas how I can do this with some code?
I think that you can use Power Query or VBA. You could probably achieve this with formulas, but it would not be flexible. With Power Query you can combine data from multiple sources, clean and transform it, and even load it directly into a Power Pivot model. If you have more detailed information, please let me know. If you can upload a sample workbook with your data, I will be able to provide some more information.
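If you are open to doing this in Python instead of Power Query or VBA, here is a minimal pandas sketch of the same idea (group rows by column A, total column B, keep one row per group); the file name and column headers are assumptions:

import pandas as pd

df = pd.read_excel("data.xlsx")   # stand-in for the real workbook

# One row per distinct value in column A: B is totalled, C and D keep the
# first value found in each group.
combined = df.groupby("A", as_index=False).agg({"B": "sum", "C": "first", "D": "first"})
combined.to_excel("combined.xlsx", index=False)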