Put rows in between existing rows, and interpolate the values in the data frame - pandas

Hello.
I would like to ask how to put new rows (60 new rows) in between every pair of existing rows.
What I am thinking of is shown in the picture.
I want to put new rows between every pair of existing rows, and interpolate the values.
Can you guide me on how to do this?
I am using the Pandas and Numpy libraries.
Thank you.

You can multiply the index values by 3, then use DataFrame.reindex to add the missing rows, and finally DataFrame.interpolate to fill them:
# solution working with a default index
df = df.reset_index(drop=True)
df = df.rename(lambda x: x * 3).reindex(np.arange(df.index.max() * 3 + 1)).interpolate()
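For example, a minimal sketch with a hypothetical 3-row frame (the factor 3 inserts 2 interpolated rows between each pair of originals; for the 60 new rows in the question the factor would be 61):
import numpy as np
import pandas as pd

# hypothetical sample data: 3 rows, one value column
df = pd.DataFrame({'value': [0.0, 3.0, 9.0]})

# spread the original rows out, reindex to create NaN gaps, then fill linearly
df = df.reset_index(drop=True)
df = df.rename(lambda x: x * 3).reindex(np.arange(df.index.max() * 3 + 1)).interpolate()
print(df)
#    value
# 0    0.0
# 1    1.0
# 2    2.0
# 3    3.0
# 4    5.0
# 5    7.0
# 6    9.0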

Related

How to broadcast a list of data into a dataframe (or MultiIndex)

I have a big dataframe, about 200k rows and 3 columns (x, y, z). Some rows don't have y and z values and just have an x value. I want to make a new column where the first set of data with a z value is 1, the second one 2, then 3, etc. Or make a MultiIndex in the same format.
The following image shows what I mean.
I made a new column called "NO." and put zero as the initial value. Then I tried to record the indices where I want the new column to get a new value, with the following code:
df = pd.read_fwf(path, header=None, names=['x', 'y', 'z'])
df['NO.'] = 0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed) - 1):
    df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i+1]] = i + 1
df['NO.'].iloc[index_NO_changed[-1]:] = len(index_NO_changed)
But the problem is I get the warning:
"A value is trying to be set on a copy of a slice from a DataFrame"
I was wondering: is there any better way? Is creating a MultiIndex instead of adding another column easier, considering the size of the dataframe?
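For reference, the warning comes from chained indexing (df['NO.'].iloc[...] = ...), and a single vectorized assignment avoids it entirely. A minimal sketch, assuming (as in the loop above) that each null-z row starts a new block:
# each null-z row starts a new block, so the block number is just a
# running count of null-z rows seen so far (rows before the first
# null-z row keep 0, matching the loop above)
df['NO.'] = df['z'].isnull().cumsum()

# or, as a MultiIndex instead of an extra column:
df = df.set_index(df['z'].isnull().cumsum().rename('NO.'), append=True)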

pandas vertical concat not working as expected

I have 2 dataframes which I am trying to merge/stack vertically. The first dataframe has 25 columns and the second dataframe has 13 columns, of which I only want to select 1.
When I execute the code, I get more records than expected.
I don't understand where the problem lies.
To understand this, I tried loading the data again into a fresh pandas dataframe.
input_df = pd.read_csv()
print(input_df.shape)
(8809, 11)
filtered_df = input_df[input_df['label'] != -1] # try to filter based on label column
print(filtered_df.shape)
(6603, 11)
But when I simply print the filtered_df, I can still see 8809 records.
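For reference, a minimal sketch of the vertical stack described above, with hypothetical frames df1 and df2 and a shared column name 'label' (names not from the original post). Selecting the single wanted column from the second frame before concatenating, and passing ignore_index=True, keeps the row count at len(df1) + len(df2):
import pandas as pd

# hypothetical stand-ins for the 25-column and 13-column frames
df1 = pd.DataFrame({'label': [0, 1, 1], 'other': ['a', 'b', 'c']})
df2 = pd.DataFrame({'label': [1, 0], 'extra': [10, 20]})

# keep only the shared column from the second frame, then stack vertically;
# ignore_index=True rebuilds the index so duplicate labels cannot inflate later lookups
stacked = pd.concat([df1, df2[['label']]], ignore_index=True)
print(len(stacked) == len(df1) + len(df2))  # True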

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) for the top 10 values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation and keep the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values. Sorting first and then reassigning keeps all columns and ensures the groupby key aligns with the sorted rows:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column to look for the largest values in.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data)
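As a quick check of the nlargest approach, a sketch with synthetic hourly data (only the column name 'totaldemand' is taken from the question):
import numpy as np
import pandas as pd

# synthetic hourly data spanning two years
rng = pd.date_range('2019-01-01', '2020-12-31 23:00', freq='h')
df = pd.DataFrame({'totaldemand': np.random.rand(len(rng)),
                   'other': np.arange(len(rng))}, index=rng)

# top 10 rows (all columns) per year by totaldemand
top10 = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(10, 'totaldemand'))
print(top10.shape)  # (20, 2): 10 full rows for each of the 2 years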

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1.
I want to extract an equal number of rows that have 0s and 1s in the "target" column. I was thinking of using the pandas sampling function but am not sure how to declare the equal number of samples I want from both classes for the dataframe based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: Use DataFrameGroupBy.sample
You get the same result more directly using DataFrameGroupBy.sample (available since pandas 1.1):
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method: give each class a weight inversely proportional to its frequency, so the sample is balanced in expectation:
counts = df['target'].value_counts()
df['weights'] = np.where(df['target'] == 1, 1 / counts[1], 1 / counts[0])
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percentage of data you want back from the original dataframe.
You will have to run df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 by boolean indexing on the target column, as sketched below. If you provide sample data I can help you construct that logic.
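A minimal sketch of that split-and-combine approach, reusing the synthetic data from the first answer:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randn(12000),
                   'target': np.random.randint(low=0, high=2, size=12000)})

# split by class with boolean indexing, sample each half, then recombine
# (n must not exceed the size of the smaller class)
df0 = df[df['target'] == 0].sample(n=5000, random_state=1)
df1 = df[df['target'] == 1].sample(n=5000, random_state=1)
dfsample = pd.concat([df0, df1], ignore_index=True)
print(dfsample['target'].value_counts())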

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (e.g., I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
## set index to a very high number to accommodate shifting the rows down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis=1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question, you could do something like this to initialize the DataFrame with the desired dimensions:
pd.DataFrame(columns=range(300), index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
df = pd.DataFrame(
    index=pd.RangeIndex(500),
    columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)
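As for the shifting problem itself, presetting dimensions may not even be necessary: extending the index by one row before shifting keeps the last value from being cut off. A minimal sketch, with hypothetical data standing in for bolus_raw:
import pandas as pd

# hypothetical stand-in for the 'Activity (mCi)' column
bolusCI = pd.DataFrame({'Activity (mCi)': [1.0, 2.0, 3.0]})

# add one empty row at the end, then shift: the old last value
# lands in the new row instead of being dropped
extended = bolusCI.reindex(range(len(bolusCI) + 1))
activity_copy = extended.shift(1)
print(pd.concat([extended, activity_copy], axis=1))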