Update information based on bins / cutting - pandas

I'm working on a dataset which has a large amount of missing information.
I understand I could use fillna, but I'd like to base my updates on the binned values of another column.
Selection of missing data:
missing = train[train['field'].isnull()]
Bin the data (this works correctly):
filter_values = [0, 42, 63, 96, 118, 160]
labels = [1,2,3,4,5]
out = pd.cut(missing['field2'], bins=filter_values, labels=labels)
counts = out.value_counts()  # the Series method; the top-level pd.value_counts is deprecated
print(counts)
Now, based on the bin assignments, I would like to write the matching bin label into train['field'] for every missing row assigned to that bin.

IIUC:
You just need fillna; because out was built from the rows where 'field' is missing, its index lines up with train, so:
train['field'] = train['field'].fillna(out)
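For reference, here is a minimal end-to-end sketch of that flow (column names and bin edges taken from the question; the toy values are made up for illustration):
import numpy as np
import pandas as pd

# Toy frame standing in for `train`
train = pd.DataFrame({'field': [10.0, np.nan, 30.0, np.nan],
                      'field2': [20, 50, 70, 130]})
missing = train[train['field'].isnull()]
filter_values = [0, 42, 63, 96, 118, 160]
labels = [1, 2, 3, 4, 5]
out = pd.cut(missing['field2'], bins=filter_values, labels=labels)
# `out` keeps the index of the missing rows, so fillna aligns the labels
# back onto exactly those rows of train
train['field'] = train['field'].fillna(out)
print(train)  # rows 1 and 3 now hold bin labels 2 and 5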


Add new columns to excel file from multiple datasets with Pandas in Google Colab

I'm trying to add some columns to an Excel file after some data, but I'm not having good results: I just keep overwriting what I have. Let me give you some context: I'm reading a csv, and for each column I'm using a for loop to compute value_counts and then create a frame from it. Here is the code for just one column:
import pandas as pd
data= pd.read_csv('responses.csv')
datatoexcel = data['Music'].value_counts().to_frame()
datatoexcel.to_excel('savedataframetocolumns.xlsx') #Name of the file
This works, and with that code for only one column I have the format that I actually need for Excel.
But the problem is when I try to do it with a for loop for all the columns and then "append" the resulting dataframes to Excel using this code:
for columnName in df:
    datasetstoexcel = df.value_counts(columnName).to_frame()
    print(datasetstoexcel)
    # Here is my problem, with the following line, the .to_excel
    x.to_excel('quickgraph.xlsx')  # I tried more code lines but I'll leave this one as base
The result that I want to reach is this one:
I'm really close to finishing this code; any help is appreciated!
How about this?
Sample data:
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, 6, 7, 8],
    "col3": [9, 9, 11, 12],
    "col4": [13, 14, 15, 16],
})
Find the value counts of each column and collect the resulting frames in a list:
li = []
for i in range(len(df.columns)):  # iterate over the columns, not the rows
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    li.append(value_counts)
Concatenate all the dataframes inside li and write to Excel:
pd.concat(li, axis=1).to_excel("result.xlsx")
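If instead you want each column's counts on its own sheet (so nothing gets overwritten), one option is pd.ExcelWriter; a sketch, assuming an Excel engine such as openpyxl is installed:
with pd.ExcelWriter("result.xlsx") as writer:
    for col in df.columns:
        df[col].value_counts().to_frame().to_excel(writer, sheet_name=col)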

Matplotlib plot with x-axis as binned data and y-axis as the mean value of various variables in the bin?

My apologies if this is rather basic; I can't seem to find a good answer yet because everything refers only to histograms. I have circular data, with a degrees value as the index. I am using pd.cut() to create bins of a few degrees in order to summarize the dataset. Then, I use df.groupby() and .mean() to calculate mean values of all columns for the respective bins.
Now - I would like to plot this, with the bins on the x-axis, and lines for the columns.
I tried to iterate over the columns, adding them as:
for i in df.columns:
    ax.plot(df.index, df[i])
However, this gives me the error: "float() argument must be a string or number, not 'pandas._libs.interval.Interval'".
Therefore, I assume it wants the x-axis values to be numbers or strings and not intervals. Is there a way I can make this work?
To get the dataframe containing the mean values of each variable with respect to bins, I used:
bins = np.arange(0, 360, 5)
df = df.groupby(pd.cut(df['Dir'], bins)).mean()
Here is what df looks like at the point of plotting: each column holds the mean values of one variable (0, 1, 2, etc.) for each bin, which I would like plotted on the y-axis, and "Dir" is the index containing the bins.
0 1 2 3 4 5
Dir
(0, 5] 37.444135 2922.848675 3244.325904 4203.001446 36.262371 37.493497
(5, 10] 42.599494 3248.194328 3582.355759 4061.098517 36.351476 37.148341
(10, 15] 47.277694 2374.379517 2709.435714 2932.064076 36.537377 36.878293
(15, 20] 52.345712 2626.774240 2659.391040 3087.324800 36.114965 36.603918
(20, 25] 57.318976 2207.845000 2228.002353 2811.066176 36.279392 37.165979
(25, 30] 62.454386 2436.117405 2839.255696 3329.441772 36.762896 37.861577
(30, 35] 67.705955 3138.968411 3462.831977 4007.180620 36.462313 37.560977
(35, 40] 72.554786 2554.552620 2548.955581 3079.570159 36.256386 36.819579
(40, 45] 77.501479 2862.703066 2965.408491 2857.901887 36.170788 36.140976
(45, 50] 82.386679 2973.858188 2539.348967 2000.606359 36.067776 37.210645
We have multiple options; for example, we can obtain the middle of each bin, as shown below. You can also access the left and right edges of the bins, as described here. Let me know if you need any further help.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(data={'x': np.random.uniform(low=0, high=10, size=10),
                        'y': np.random.exponential(size=10)})
bins = range(0, 360, 5)
df['bin'] = pd.cut(df['x'], bins)
agg_df = df.groupby(by='bin').mean()
# This is the important step: recover the interval index from the categorical
# grouping and take the midpoint of each interval.
mids = pd.IntervalIndex(agg_df.index.get_level_values('bin')).mid
# To apply for plots, iterate over the aggregated frame:
for col in agg_df.columns:
    plt.plot(mids, agg_df[col])
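Alternatively, if you want the interval labels themselves on the x-axis, converting them to strings also works, since matplotlib treats strings as categorical tick labels:
labels = agg_df.index.astype(str)
for col in agg_df.columns:
    plt.plot(labels, agg_df[col])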

How to calculate average of values of a column for a particular value in another column?

I have a data frame that looks like this.
How can I get the average doc/duration for each window into another data frame?
I need it in the following way:
The dataframe should contain only one column, i.e. mean. If there are 3000 windows, there should be 3000 rows along axis 0, one per window, and the mean column should contain the average value. If a particular window is not present in the initial data frame, the corresponding value for that window needs to be 0.
Use the .groupby() method and then compute the mean:
import pandas as pd

df = pd.DataFrame({'10s_windows': [304, 374, 374, 374, 374, 3236, 3237, 3237, 3237],
                   'doc/duration': [0.1, 0.1, 0.2, 0.2, 0.12, 0.34, 0.32, 0.44, 0.2]})
new_df = df.groupby('10s_windows').mean()
Which results in:
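             doc/duration
10s_windows
304              0.100000
374              0.155000
3236             0.340000
3237             0.320000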
Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
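The question also asks for windows that are absent from the original frame to show up with value 0; a reindex with fill_value covers that (assuming the window ids are meant to run from 0 up to the largest one seen):
all_windows = range(df['10s_windows'].max() + 1)
new_df = new_df.reindex(all_windows, fill_value=0)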

Create pandas dataframe from set of dictionaries

I need to create a pandas data frame from different dictionaries, where the keys must act as column names inside the data frame. If the data frame doesn't have a key listed as a column, then it has to create it dynamically and attach it as a new column to the data frame.
I expect the input as:
1st dict-> {'mse': 0.04, 'accuracy': 0.91, 'mean':0.75}
2nd dict-> {'mse': 0.04, 'accuracy': 0.91}
3rd dict-> {'mse': 0.04, 'accuracy': 0.91, 'f1-score':0.95}
And the output should be built as follows:
On the 1st iteration of the loop it takes the keys as column names and, if no data frame is present yet, creates one with the values as the 1st row.
On the 2nd iteration it checks whether the keys are present as columns in the data frame, inserts the values as the 2nd row if they are, and creates any missing column first otherwise.
I don't know exactly how to run the loop dynamically in python. Can anyone please help me in resolving the issue? Thanks in advance!
Here are the docs for from_records:
import pandas as pd

dict1 = {'mse': 0.04, 'accuracy': 0.91, 'mean': 0.75}  # avoid shadowing the builtin `dict`
dict2 = {'mse': 0.04, 'accuracy': 0.91}
dict3 = {'mse': 0.04, 'accuracy': 0.91, 'f1-score': 0.95}
mydicts = [dict1, dict2, dict3]
df = pd.DataFrame.from_records(mydicts).fillna(0)
print(df)
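This prints:
    mse  accuracy  mean  f1-score
0  0.04      0.91  0.75      0.00
1  0.04      0.91  0.00      0.00
2  0.04      0.91  0.00      0.95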
Or simply, as was said in the comments:
pd.DataFrame(mydicts)

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1.
I want to extract an equal number of rows with 0's and 1's in the "target" column. I was thinking of using the pandas sampling function, but I'm not sure how to declare the equal number of samples I want from both classes.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
I'm not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randn(12000),
                   'target': np.random.randint(low=0, high=2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop=True)
new_df.target.value_counts()
1    5000
0    5000
Edit: use GroupBy.sample
You get the same result more directly with the DataFrameGroupBy.sample method:
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method, though weighted sampling only balances the classes in expectation, not exactly:
# weight each class inversely to its size so both classes carry equal total weight
counts = df['target'].value_counts()
df['weights'] = df['target'].map({0: 1 / counts[0], 1: 1 / counts[1]})
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the fraction of the original dataframe you want back.
You will have to run df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 with boolean indexing on the target column. If you provide sample data I can help you construct that logic.
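A minimal sketch of that split-and-sample approach (column name and sample sizes taken from the question):
df0 = df[df['target'] == 0].sample(n=5000, random_state=1)
df1 = df[df['target'] == 1].sample(n=5000, random_state=1)
# combine the two halves and shuffle so the classes are interleaved
dfsample = pd.concat([df0, df1]).sample(frac=1, random_state=1)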