Why doesn't this converters argument work? - pandas

I am trying to use read_excel and want to apply a multiplication to one column, so I pass converters, but it doesn't work: the column doesn't change. Am I using it the wrong way? The code is below:
import pandas as pd
import numpy as np
df = pd.read_excel('Energy Indicators.xls', sheetname='Energy', header=1,
                   skiprows=16, skip_footer=38, index_col=None,
                   names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
                   parse_cols='C,D,E,F',
                   converters={'Energy Supply': (lambda x: x * 1000000)})
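A likely cause (an assumption, not confirmed for this exact combination of arguments) is that converters keys are matched against the column labels as they appear in the file, while names renames the columns, so the key 'Energy Supply' never matches anything during parsing. A minimal workaround sketch is to apply the multiplication after reading:
import pandas as pd

df = pd.read_excel('Energy Indicators.xls', sheetname='Energy', header=1,
                   skiprows=16, skip_footer=38, index_col=None,
                   names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'],
                   parse_cols='C,D,E,F')
# multiply after the read instead of via converters
df['Energy Supply'] = df['Energy Supply'] * 1000000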

Related

How to access a dataframe in a Python list of dataframes via a date from the dataframe's date column

I have created a list (df) which contains some dataframes after importing CSV files. Instead of accessing these dataframes as df[0], df[1], etc., I would like to access them in a much easier way, with something like df['20/04/22'] or df[date == '20/04/22'] or something similar. I am really new to Python and programming, thank you very much in advance. I attach the simplified code (it contains only 2 items in the list) for simplicity.
I came up with two ways of achieving this, but I have trouble realising either of them:
1. Through my directory path names: each CSV file name includes the date, something like "5f05d5d83a442d4f78db0a19_2022-04-01.csv".
2. Each CSV (dataframe) includes a date column (object type) which I have changed to datetime64 so I can work with plots. So I thought that maybe what I am asking would be possible through this column.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
from datetime import date
from datetime import time
from pandas.tseries.offsets import DateOffset
import glob
import os
path = "C:/Users/dsdadsdsaa/"
all_files = glob.glob(path + '*.csv')
df = []
for filename in all_files:
    dataframe = pd.read_csv(filename, index_col=None, header=0)
    df.append(dataframe)
for i in range(0, 2):
    df[i]['date'] = pd.to_datetime(df[i]['date'])
    df[i]['time'] = pd.to_datetime(df[i]['time'])
df[0]
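One way to get date-based access (a sketch, not the only option; the filename parsing assumes the date always sits after the underscore, as in the example name above) is to store the dataframes in a dict keyed by the date string taken from each file name:
import os
import glob
import pandas as pd

path = "C:/Users/dsdadsdsaa/"
dfs = {}
for filename in glob.glob(path + '*.csv'):
    # "5f05d5d83a442d4f78db0a19_2022-04-01.csv" -> "2022-04-01"
    date_key = os.path.basename(filename).split('_')[1].replace('.csv', '')
    dfs[date_key] = pd.read_csv(filename, index_col=None, header=0)

dfs['2022-04-01']  # access by date instead of by position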

How to most efficiently use Pandas UDF in Spark with multiple Series as inputs

I have some PySpark code that aims to run a machine learning model, trained in sklearn, on a PySpark dataframe. It looks like this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pyspark.sql.functions import pandas_udf

X = np.random.rand(1000, 100)
y = np.random.randint(2, size=1000)
tree = RandomForestRegressor(n_jobs=4)
tree.fit(X, y)

pdf = pd.DataFrame(X)
df = spark.createDataFrame(pdf)

@pandas_udf('double')
def pandas_plus_one(*args):
    # input/output are both pandas.Series of doubles
    return pd.Series(tree.predict(pd.concat([args[i] for i in range(100)], axis=1)))

df = df.withColumn('result', pandas_plus_one(*[df[i] for i in range(100)]))
My question: is this the most efficient way to do this in PySpark? In particular, I would like to avoid the pd.concat, which copies all the Series (which were probably adjacent in memory anyway) into a new pandas DataFrame inside the UDF. The ideal solution would be for the Pandas UDF to accept a DataFrame as input, but I haven't found a way to make that work.
Note: I am not looking for solutions that involve SparkML scikit-spark etc.
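One possibility is a sketch along these lines, assuming Spark 3.x: mapInPandas is not a pandas_udf but a related mechanism whose function receives whole pandas DataFrames batch by batch, so no per-Series pd.concat is needed.
from typing import Iterator
import pandas as pd

def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # each batch is a pandas DataFrame holding a chunk of the Spark dataframe
    for batch in batches:
        out = batch.copy()
        out['result'] = tree.predict(batch)  # tree is captured from the driver
        yield out

# the output schema is the input schema plus the new 'result' column
result = df.mapInPandas(predict_batches, schema=df.schema.add('result', 'double'))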

Good alternative to exec in python 2.7

I have code where I need to create pandas dataframes with names taken from a list. I know this can be achieved with the exec() function, but it looks like it is slowing down my app. Is there a better alternative?
import pandas as pd
df_names = ["first","second","third"]
col_names = ['A','B','C']
for names in df_names:
    exec("%s=pd.DataFrame(columns=col_names)" % (names))
I found the method below and it works for me:
import pandas as pd
df_names = ["first","second","third"]
col_names = ['A','B','C']
d={}
for names in df_names:
    d[names] = pd.DataFrame(columns=col_names)
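The dict version also makes each lookup explicit; for example, d['first'] returns the dataframe that exec would have bound to the bare name first:
# access by name, no exec needed
first_df = d['first']
print(first_df.columns.tolist())  # ['A', 'B', 'C']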

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same result for the intercept and coefficients when I do not convert it.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# if I use the dataframe, train[['ENGINESIZE']] for 'x' and
# train[['CO2EMISSIONS']] for 'y' below, I get the same result
regr.fit(train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a one-column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way, the fit is done with arrays derived from the dataframe.
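To see this concretely, here is a small check (a sketch reusing the names from the question): fitting a second model directly on the dataframe columns should reproduce the same coefficients.
# fit on the dataframes directly; check_array converts them internally
regr2 = linear_model.LinearRegression()
regr2.fit(train[['ENGINESIZE']], train[['CO2EMISSIONS']])

print(regr2.coef_, regr.coef_)            # expected to match
print(regr2.intercept_, regr.intercept_)  # expected to match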

Rolling multidimensional function in pandas

Let's say, I have the following code.
import numpy as np
import pandas as pd
x = pd.DataFrame(np.random.randn(100, 3)).rolling(window=10, center=True).cov()
For each index, I have a 3x3 matrix. I would like to calculate the eigenvalues and then some function of those eigenvalues. Or, perhaps, I might want to compute some function of the eigenvalues and eigenvectors. The point is that if I take x.loc[0], I have no problem computing anything from that matrix. How do I do it in a rolling fashion for all the matrices?
Thanks!
You can use the eigenvector/eigenvalue routines in scipy.linalg.
import numpy as np
import pandas as pd
from scipy import linalg as LA

x = pd.DataFrame(np.random.randn(100, 3)).rolling(window=10, center=True).cov()

# the rolling cov result has a MultiIndex; the outer level is the original index
for i in x.index.get_level_values(0).unique():
    try:
        e_vals, e_vec = LA.eig(x.loc[i])  # the 3x3 covariance block at index i
        print(e_vals, e_vec)
    except ValueError:
        continue  # edge windows are all NaN, which LA.eig rejects
If there are no NaN values present, you do not need the try/except; a plain for loop will do.
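Building on that, here is a sketch (the helper top_eigenvalue is hypothetical, just for illustration) that collects one function of the eigenvalues, the largest one, into a Series aligned with the original index:
def top_eigenvalue(mat):
    # edge windows have undefined covariances; return NaN for those
    if mat.isna().any().any():
        return np.nan
    return np.linalg.eigvalsh(mat).max()  # eigvalsh: covariance matrices are symmetric

idx = x.index.get_level_values(0).unique()
top = pd.Series([top_eigenvalue(x.loc[i]) for i in idx], index=idx)
print(top.head(10))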