Is there a way to make separate dataframes from one dataframe?

I have a dataframe containing stock prices from different dates ranging from 2017 to 2022. The dates are set as the index and they are formatted as YYYY-MM-DD. I'm trying to create separate dataframes for each year. Is there a way I can filter my data to do so?
I am new to Python and am not sure how to do this. I tried to use .iloc to pick the specific year I wanted, but it didn't work since the dates are formatted as YYYY-MM-DD.
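One approach, sketched here with toy data since the real frame isn't shown, is to group on the index's year attribute and collect one dataframe per year:

```python
import pandas as pd

# Toy prices with a DatetimeIndex, standing in for the real data.
prices = pd.DataFrame(
    {"close": [10.0, 11.0, 12.0, 13.0]},
    index=pd.to_datetime(["2017-03-01", "2017-06-01", "2018-03-01", "2022-12-30"]),
)

# One dataframe per year, keyed by the year number.
frames = {year: group for year, group in prices.groupby(prices.index.year)}

frames[2017]  # just the 2017 rows
```

For a single year, partial string indexing also works on a DatetimeIndex: `prices.loc["2017"]` returns only the 2017 rows.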

Related

Column with multiple data types in Power BI Matrix Table

My issue is similar to this one Multiple data types in a Power BI matrix but I've got a bit of a different setup that's throwing everything off.
What I'm trying to do is create a matrix table with several metrics that are categorized as Current (raw data values) and Prior (year over year percent growth/decline). I've created some dummy data in Excel to get the format the way I want it in PowerBI (see below):
Desired Format
As you can see, the Current values come in as integers and the Prior % numbers as percentages, which is exactly what I want; however, I was able to accomplish this through a custom column with the following formula:
Revenue2 = IF(Scorecard2[Current_Prior] = "Current", FORMAT(FIXED(Scorecard2[Revenue],0), "$#,###"), FORMAT(Scorecard2[Revenue], "Percent"))
The problem is that the data comes from a SQL query and you can't use the FORMAT() function in DirectQuery. Is there a way I can have two different datatypes in the same column of data? See below for how the SQL data comes into PowerBI (I can change this if need be):
SQL
Create two separate measures, one for Current and a second for Prior, and format these measures.
You can probably also use a CASE expression in the SQL query to format the data and bring it in as a STRING.
What I wound up doing was reformatting the SQL code to look like this:
Solution
That way Current/Prior have two separate values and the "metric" is categorical.
I got the idea from this post:
Simple way to transpose columns and rows in SQL?

Merging two dataframes in pandas, multiple same column names. Can't figure out syntax

I have merged my two dataframes in Pandas but I am unable to figure out how to merge on two columns with the same name (Country and Year). I am able to merge on either Country OR Year, but not both.
Whenever I merge on, say, Country, my year columns become Year_x and Year_y by default, and vice versa.
Here is my syntax:
merged = pd.merge(left=df, right=df1, left_on='Year', right_on='Year')
Is there a way using this method that I can have both Year and Country? I tried to find the answer online, used different permutations in the code, such as adding both Country and Year, but I receive syntax errors every time.
Thank you for any assistance.
It's not quite clear what you want to achieve. If you need to merge two dataframes that have 2 identically named columns (Year and Country), something like this may help:
merged = pd.merge(left=df, right=df1, on=["Year", "Country"])
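To make that concrete, here is a runnable sketch with made-up columns (gdp and pop are illustrative); passing both keys to on= means rows must match on both, so no _x/_y suffixes are created:

```python
import pandas as pd

# Toy frames standing in for df and df1.
df = pd.DataFrame(
    {"Country": ["US", "US", "FR"], "Year": [2019, 2020, 2020], "gdp": [1, 2, 3]}
)
df1 = pd.DataFrame({"Country": ["US", "FR"], "Year": [2020, 2020], "pop": [330, 67]})

# Rows match on BOTH keys, so Year and Country each appear once in the result.
merged = pd.merge(df, df1, on=["Year", "Country"])
```

The suffixes in the original attempt appeared because only one of the two shared columns was listed as a key, leaving the other as an ordinary duplicated column.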

running a for loop on a daterange using pyspark

I have a dataframe where one of the columns contains a date range from 2019-01-01 to 2019-02-01 where the format is:
yyyy-mm-dd. Is there a way to loop over the dataframe day by day, selecting each day and then filtering on it? I would like to run some calculations on each filtered dataframe, since each day has multiple records.
Since this is distributed computing, one method I came across is to insert a row number column using row_number() over a window of the entire dataframe and then run the for loop. But I feel that this is counterproductive, since I would be forcing the entire dataframe into a single node and my dataframe has millions of rows.
Is there a way to run a for or while loop over the PySpark dataframe without using a window function?
Your expert insights would be greatly welcomed!
Thank You

Manipulating time series with different start dates

I'm a novice Python programmer and I have an issue that I was hoping you could help me with.
I have two time series in Pandas, but they start at different dates. Let's say one starts in 1989 and the other in 2002. Now I want to compare the cumulative growth of the two by indexing both series to 2002 (the first period where I have data for both) and calculating the ratio.
What is the best way to go about it? Ideally, the script should find the earliest date for which both series have data and index both to 100 from that point onward.
Thank you in advance!
A practical solution may be to split the dataframe into two, one for each time series, and add a 'monthyear' column to each that lists only the month and year (e.g. 05-2015). Then you can use pd.merge on that column, keeping only the rows whose months overlap: pd.merge(df1, df2, on='monthyear', how='inner')
You can split the pandas dataframe by creating a new dataframe and loading in only one column (or row, depending on what your dataframe looks like): df1 = pd.DataFrame(original_dataframe[0]) and df2 = pd.DataFrame(original_dataframe[1])
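The rebasing step itself can be sketched directly: find the latest start date of the two series, divide each by its value on that date, and scale to 100 (a minimal example with made-up annual numbers):

```python
import pandas as pd

# Two toy series with different start years, standing in for the real data.
s1 = pd.Series([100.0, 110.0, 121.0, 133.0],
               index=pd.period_range("2001", periods=4, freq="Y"))
s2 = pd.Series([50.0, 55.0, 60.0],
               index=pd.period_range("2002", periods=3, freq="Y"))

# First period where both series have data.
start = max(s1.index.min(), s2.index.min())

# Rebase each series to 100 at the common start date.
r1 = s1.loc[start:] / s1.loc[start] * 100
r2 = s2.loc[start:] / s2.loc[start] * 100

# Ratio of cumulative growth; 1.0 means both grew equally since `start`.
ratio = r1 / r2
```

Because both rebased series share the same index from `start` onward, the division aligns automatically and no manual merge is needed.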

Pandas Dataframe Functionality

I would like to create dataframes using excel spreadsheets as source data. I need to transform the data series from the format used to store the data in the excel spreadsheets to the dataframe variable end product.
I would like to know if users have experience in using various python methods to accomplish the following:
-data series transform: I have a series with one data value per month, but would like to expand the table to one value per day using an index (or perhaps a column of date values). So if table1 has a month-based index and table2 has a daily index, how can I convert table1's values to table2's index?
-dataframe sculpting: the data I am working with is not all the same length; some data sets are longer than others. How can I find the column holding the shortest series in a multicolumn dataframe?
Essentially, I would like to take individual tables from workbooks and combine them into a single dataframe that uses a single index value as the basis for their presentation. My workbook tables may have data point frequencies of daily, weekly, or monthly and I would like to build a dataframe that uses the daily index as a basis for the table elements while including an element for each day in series that are weekly and monthly.
I am looking at the Pandas library, but perhaps there are other libraries that I have overlooked with additional functionality.
Thanks for helping!
For your first question, try something like:
df1 = df1.resample('1D').first()
df2 = df2.merge(df1, left_index=True, right_index=True, how='left')
That will upsample your monthly or weekly dataframe and merge it with your daily dataframe. Take a look at the interpolate method to fill in missing values. To get the name of the shortest column, try this out:
df.count().idxmin()
Hope that helps!
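Putting the pieces together, here is a self-contained sketch with made-up daily and monthly tables (the column names price and rate are illustrative): upsample the monthly table to daily frequency, join on the shared daily index, then find the column with the fewest non-null values:

```python
import pandas as pd

# Hypothetical daily and monthly series, standing in for the workbook tables.
daily = pd.DataFrame(
    {"price": range(90)},
    index=pd.date_range("2020-01-01", periods=90, freq="D"),
)
monthly = pd.DataFrame(
    {"rate": [1.0, 1.1, 1.2]},
    index=pd.date_range("2020-01-01", periods=3, freq="MS"),
)

# Upsample the monthly table to daily frequency, carrying each value forward.
monthly_daily = monthly.resample("D").ffill()

# Join both tables on the shared daily index.
combined = daily.join(monthly_daily, how="left")

# Name of the column with the fewest non-null values (the "shortest" series).
shortest = combined.count().idxmin()
```

Using ffill() repeats each monthly value across its days; swap in interpolate() if a smooth transition between monthly points is preferred.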