Scaling down high-dimensional pandas DataFrame data using sklearn - pandas

I am trying to scale the values in a pandas data frame. The problem is that I have 291 dimensions, so scaling them one at a time is time-consuming if we do it as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit/transform expect 2-D input, so select the column with double brackets
scaler = scaler.fit(dataframe[['dimension_1']])
dataframe['dimension_1'] = scaler.transform(dataframe[['dimension_1']]).ravel()
Problem: this handles only one dimension. How can we do this for all 291 dimensions in one shot?

You can pass in a list of the columns that you want to scale instead of individually scaling each column.
# replace the values 0 and 1 with False and True so that binary columns become boolean
df.replace({0: False, 1: True}, inplace=True)
# make a copy of dataframe
scaled_features = df.copy()
# take the numeric columns, i.e. those which are not of type object or bool
col_names = df.select_dtypes(exclude=['object', 'bool']).columns.to_list()
features = scaled_features[col_names]
# Use the scaler of your choice; here StandardScaler is used
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
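A more compact variant of the same idea is sketched below. It assumes a small made-up frame (the column names and values are hypothetical stand-ins for the 291-column data) and uses `select_dtypes(include="number")` to pick the numeric columns, which also skips boolean columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the 291-column dataframe
dataframe = pd.DataFrame({
    "dimension_1": [1.0, 2.0, 3.0, 4.0],
    "dimension_2": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "a", "b"],  # non-numeric column, left untouched
})

# Select every numeric column in one go
num_cols = dataframe.select_dtypes(include="number").columns

# fit_transform returns a NumPy array with one column per input column,
# so it can be assigned back to the same columns of a copy
scaled = dataframe.copy()
scaled[num_cols] = StandardScaler().fit_transform(dataframe[num_cols])
```

Working on a copy keeps the unscaled values available in `dataframe` for comparison.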

I normally use a Pipeline, since it can chain multi-step transformations.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scale', StandardScaler())])
transformed_dataframe = num_pipeline.fit_transform(dataframe)
If you need more transformation steps, e.g. filling NA values,
you just add them to the list passed to Pipeline.
Note: the above code works only if the datatype of all columns is numeric. If not, we need to
1. select only the numeric columns,
2. pass them into the pipeline, then
3. put the result back into the original dataframe.
Here is the code for the 3 steps:
num_col = dataframe.select_dtypes(exclude=['object', 'bool']).columns.to_list()
df_num = dataframe[num_col] #1
transformed_df = num_pipeline.fit_transform(df_num) #2
dataframe[num_col] = transformed_df #3
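As an illustration of adding a step to the pipeline, here is a minimal sketch that fills missing values before scaling. The toy data, the step name `fill_na` and the choice of a median `SimpleImputer` are assumptions made for the example, not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with one missing value and one text column
dataframe = pd.DataFrame({
    "dimension_1": [1.0, np.nan, 3.0, 5.0],
    "dimension_2": [2.0, 4.0, 6.0, 8.0],
    "name": ["w", "x", "y", "z"],
})

# Steps run in order: impute first, then standardise
num_pipeline = Pipeline([
    ("fill_na", SimpleImputer(strategy="median")),
    ("std_scale", StandardScaler()),
])

# Apply the pipeline to the numeric columns only and write the result back
num_col = dataframe.select_dtypes(include="number").columns.to_list()
dataframe[num_col] = num_pipeline.fit_transform(dataframe[num_col])
```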

Related

How to build columns in Plotly with multiple values sorted by value?

I have a dataframe (code below) with 3 columns: date, system and number. When I build a bar graph in Plotly I get two bars in which I cannot sort the segments by value; they are automatically sorted by name.
import pandas as pd
import numpy as np
data = [('2022-10-01','Pay1',644), ('2022-10-01','Pay2',1460), ('2022-10-01','Pay3',1221), ('2022-10-01','Pay4',1623),\
('2022-10-01','Pay5',1904), ('2022-10-01','Pay6',1853), ('2022-10-01','Pay7',1826), ('2022-10-01','Pay8',247),\
('2022-10-01','Pay9',713), ('2022-10-01','Pay10',1159), ('2022-10-02','Pay1',755), ('2022-10-02','Pay2',786),\
('2022-10-02','Pay3',623), ('2022-10-02','Pay4',1766), ('2022-10-02','Pay5',1141), ('2022-10-02','Pay6',362),\
('2022-10-02','Pay7',1097), ('2022-10-02','Pay8',655), ('2022-10-02','Pay9',1569), ('2022-10-02','Pay10',796)]
data = pd.DataFrame(data,columns=['date','system','number'])
import plotly.express as px
fig = px.bar(data, x='date', y='number',
color='system')
fig.show()
I want each bar to be sorted by value, from smallest to largest.
The expected graph is a stacked bar chart that uses the same color for each categorical value, with the segments of each bar ordered by numerical value. To give each category a consistent color, create a dictionary that maps each value in the system column to a color from the default discrete color sequence, then add a color column to each data frame. Extract a data frame for each date, sort it by number, and loop through it row by row, adding one trace per row.
import plotly.graph_objects as go
import plotly.express as px
colors = px.colors.qualitative.Plotly
system_name = data['system'].unique()
colors_dict = {k:v for k,v in zip(system_name, colors)}
# print(colors_dict)
fig = go.Figure()
dff = data.query('date =="2022-10-01"')
dff = dff.sort_values('number',ascending=False)
dff['color'] = dff['system'].map(colors_dict)
for row in dff.itertuples():
    fig.add_trace(go.Bar(x=[row.date], y=[row.number], name=row.system, marker_color=row.color))
fig.update_layout(barmode='stack')
dfm = data.query('date =="2022-10-02"')
dfm = dfm.sort_values('number',ascending=False)
dfm['color'] = dfm['system'].map(colors_dict)
for row in dfm.itertuples():
    fig.add_trace(go.Bar(x=[row.date], y=[row.number], name=row.system, marker_color=row.color))
fig.update_layout(barmode='stack')
names = set()
fig.for_each_trace(
    lambda trace:
        trace.update(showlegend=False)
        if (trace.name in names) else names.add(trace.name))
fig.show()
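The per-date blocks above repeat the same steps, so they could be generalised to any number of dates. The sketch below does only the pandas part: `build_trace_specs` is a hypothetical helper that returns one dictionary per bar segment, and the two-color palette is a placeholder. Note that it sorts ascending, which puts the smallest segment at the bottom of each stack, whereas the code above sorts descending.

```python
import pandas as pd

def build_trace_specs(data, palette):
    """Return one dict per bar segment, in the order the traces should be added."""
    systems = data["system"].unique()
    # One fixed color per system, cycling through the palette if needed
    colors = {name: palette[i % len(palette)] for i, name in enumerate(systems)}
    seen = set()
    specs = []
    for _, group in data.groupby("date", sort=True):
        # Smallest value first, so it ends up at the bottom of the stack
        for row in group.sort_values("number").itertuples():
            specs.append({
                "x": [row.date],
                "y": [row.number],
                "name": row.system,
                "marker_color": colors[row.system],
                # Show each system in the legend only once
                "showlegend": row.system not in seen,
            })
            seen.add(row.system)
    return specs

# Hypothetical reduced data set
data = pd.DataFrame(
    [("2022-10-01", "Pay1", 644), ("2022-10-01", "Pay2", 1460),
     ("2022-10-02", "Pay1", 755), ("2022-10-02", "Pay2", 623)],
    columns=["date", "system", "number"],
)
specs = build_trace_specs(data, palette=["#636EFA", "#EF553B"])
```

Each dictionary could then be passed to `go.Bar(**spec)` and added with `fig.add_trace`, followed by `fig.update_layout(barmode='stack')`.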

How Can I Add A Regression Line to pandas.plot(kind='bar')?

I'd like to add a regression line for each flavor below. How can I do that? Do I need to use subplots? Is it possible using pandas.plot or do I need to use the full matplotlib?
import pandas as pd
# initialize list of lists
data = [[1,157.842730083188,202.290991182781,244.849416438322],
[2,234.516775578511,190.104435611797,202.157088214941],
[3,198.279130213755,193.075780258345,194.112394276613],
[4,156.285653517235,198.382900113055,185.380696178104],
[5,190.653607667334,208.807038546447,202.662790911701],
[6,192.027054343382,168.768097007287,179.315293388299],
[7,144.927513854729,166.183469310198,157.338388768229],
[8,194.096584739985,177.710332802887,188.006211652239],
[9,131.613923150861,112.503607632448,128.947939049068],
[10,139.545538050778,129.935716833166,139.334073132085]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['DensityDecileRank', 'Flavor1', 'Flavor2', 'Flavor3'])
df.plot(x='DensityDecileRank',
kind='bar',
stacked=False)
If you don't mind using numpy to calculate the regression values explicitly,
the following code snippet based on this can be used as a quick solution:
import numpy as np
import matplotlib.pyplot as plt
ax = df.plot(x='DensityDecileRank', kind='bar', stacked=False)
rank, flavors = df.columns[0], df.columns[1:]
for flavor in flavors:
    # fit a degree-1 polynomial (straight line) to this flavor
    reg_func = np.poly1d(np.polyfit(df[rank], df[flavor], 1))
    ax.plot(reg_func(df[rank]))
plt.show()
The code above derives the function reg_func for each flavor, which can be used for calculating the regression values based on the rank values.
The regression lines are plotted in the order of the flavor columns to match the colors. Further formatting can be added to ax.plot.
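To make the fitting step concrete, here is a small sketch using made-up, perfectly linear values (an assumption chosen so the result is easy to check). `np.polyfit` with degree 1 returns the coefficients with the highest power first, i.e. slope and then intercept.

```python
import numpy as np

# Hypothetical data: ranks 1..10 and values lying exactly on y = 3x + 5
rank = np.arange(1, 11)
values = 3.0 * rank + 5.0

# Degree-1 fit: coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(rank, values, 1)

# poly1d wraps the coefficients in a callable polynomial
reg_func = np.poly1d([slope, intercept])
fitted = reg_func(rank)
```

In the plotting code, `ax.plot` receives only the fitted values, so they are drawn at positions 0 to n-1, which is where pandas places the bars of a bar plot.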

Extracting column from Array in python

I am a beginner in Python and I am stuck with data that is an array of 32763 numbers, separated by commas. Please find the data here.
I want to convert this into two columns: the first from (0:16382) and the second from (2:32763). In the end I want to plot column 1 on the x axis and column 2 on the y axis. I tried the following code, but I am not able to extract the columns:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
df = pd.DataFrame(data.flatten())
print(df)
Then I want to write the data to a file, say data1, in the format shown in the attached picture.
It is hard to answer without seeing the format of your data, but you can try
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
print(data.shape) # here we check we got something useful
# this should split data into x,y at position 16381
x = data[:16381]
y = data[16381:]
# now you can create a dataframe and print to file
df = pd.DataFrame({'x':x, 'y':y})
df.to_csv('data1.csv', index=False)
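One thing to watch: with an odd number of values (the question mentions 32763), the two halves have different lengths, and `pd.DataFrame` raises an error when its columns are unequal. A minimal sketch of one way around this follows; it assumes the array is one-dimensional and that dropping the trailing value is acceptable, and it uses a short made-up array in place of the file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded data, with an odd number of values
data = np.arange(11, dtype=float)

# Split at the midpoint and keep both halves the same length
half = len(data) // 2
x = data[:half]
y = data[half:2 * half]  # drops the trailing value when the length is odd

df = pd.DataFrame({"x": x, "y": y})
```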
Try this.
# Input: dataframe df and a chunk_size of your choice; output: a list of chunks.
def split_dataframe(df, chunk_size = 16382):
    chunks = list()
    num_chunks = len(df) // chunk_size + 1
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
or
np.array_split
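For reference, a short sketch of `np.array_split` on a made-up array: unlike `np.split`, it accepts a length that does not divide evenly and makes the earlier pieces one element longer.

```python
import numpy as np

# Hypothetical array with an odd number of elements
values = np.arange(7)

# Split into 2 pieces; the first piece gets the extra element
first, second = np.array_split(values, 2)
```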

How do I pre-process the dataset if the feature ranges are too wide?

I have a dataset with 5 features and each column being in a different range of numbers. I have tried using MinMaxScaler and StandardScaler but the accuracy for this multi-class problem is too low.
If StandardScaler and MinMaxScaler don't have the desired effect, then another thing to check for is skewed data:
import pandas as pd
from scipy.stats import skew
# Check the skew of all numerical features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
Skew values closer to zero are better. If the absolute skew is high, you can use a transform (log, Box-Cox, etc.) to make the data distribution more normal in shape.
Correcting for skew:
# keep only the features whose absolute skew exceeds 0.75
skewness = skewness[skewness['Skew'].abs() > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam_f = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam_f)
Other things to try: either remove outliers or use RobustScaler(), or apply PowerTransformer().
Reference: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
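As a rough illustration of the PowerTransformer suggestion, the sketch below generates right-skewed synthetic data (a lognormal sample, chosen purely as an assumption for the example) and compares the skew before and after the transform.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature; a stand-in for a real column
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Yeo-Johnson works for positive and negative values and, by default,
# also standardises the output to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson")
X_t = pt.fit_transform(X)

skew_before = skew(X[:, 0])
skew_after = skew(X_t[:, 0])
```

Whether this helps accuracy depends on the model; tree-based models, for instance, are largely insensitive to monotonic feature transforms.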

Creating dataframe boxplot from dataframe with row and column multiindex

I have the following Pandas data frame and I'm trying to create a boxplot of the "dur" value for both client and server organized by qdepth (qdepth on x-axis, duration on y-axis, with two variables client and server). It seems like I need to get client and server as columns. I haven't been able to figure this out trying combinations of unstack and reset_index.
Here's some dummy data I recreated since you didn't post yours aside from an image:
qdepth,mode,runid,dur
1,client,0x1b7bd6ef955979b6e4c109b47690c862,7.0
1,client,0x45654ba030787e511a7f0f0be2db21d1,30.0
1,server,0xb760550f302d824630f930e3487b4444,19.0
1,server,0x7a044242aec034c44e01f1f339610916,95.0
2,client,0x51c88822b28dfa006bf38603d74f9911,15.0
2,client,0xd5a9028fddf9a400fd8513edbdc58de0,49.0
2,server,0x3943710e587e3932adda1cad8eaf2aeb,30.0
2,server,0xd67650fd984a48f2070de426e0a942b0,93.0
Load the data: df = pd.read_clipboard(sep=',', index_col=[0,1,2])
Option 1:
df.unstack(level=1).boxplot()
Option 2:
df.unstack(level=[0,1]).boxplot()
Option 3:
Using seaborn:
import seaborn as sns
sns.boxplot(x="qdepth", hue="mode", y="dur", data=df.reset_index(),)
Update:
To answer your comment, here's a very approximate way (could be used as a starting point) to recreate the seaborn option using only pandas and matplotlib:
import matplotlib as mpl
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(12,6))
#bp = df.unstack(level=[0,1])['dur'].boxplot(ax=ax, return_type='dict')
bp = df.reset_index().boxplot(column='dur',by=['qdepth','mode'], ax=ax, return_type='dict')['dur']
# Now fill the boxes with desired colors
boxColors = ['darkkhaki', 'royalblue']
numBoxes = len(bp['boxes'])
for i in range(numBoxes):
    box = bp['boxes'][i]
    boxX = []
    boxY = []
    for j in range(5):
        boxX.append(box.get_xdata()[j])
        boxY.append(box.get_ydata()[j])
    boxCoords = list(zip(boxX, boxY))
    # Alternate between Dark Khaki and Royal Blue
    k = i % 2
    boxPolygon = mpl.patches.Polygon(boxCoords, facecolor=boxColors[k])
    ax.add_patch(boxPolygon)
plt.show()
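To see what the reshaping in Option 1 and the seaborn option actually produces, here is a sketch that loads the dummy data from a string instead of the clipboard. The run ids are shortened to hypothetical placeholders (`r1` to `r8`) to keep the example compact.

```python
import io
import pandas as pd

# Same layout as the dummy data above, with shortened placeholder run ids
csv_text = """qdepth,mode,runid,dur
1,client,r1,7.0
1,client,r2,30.0
1,server,r3,19.0
1,server,r4,95.0
2,client,r5,15.0
2,client,r6,49.0
2,server,r7,30.0
2,server,r8,93.0
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2])

# Option 1: move the "mode" index level into the columns,
# giving one column each for client and server
wide = df.unstack(level=1)

# seaborn option: flatten the index into ordinary columns (long format)
long = df.reset_index()
```

Because every run id is unique, `wide` keeps one row per run and fills the other mode's column with NaN; the boxplot methods ignore those missing values.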