Computing for the mean of a given column from a dataframe - pandas

I need to find the arithmetic mean of each columns by returning res?
def ave(df, name):
df = {
'Courses':["Spark","PySpark","Python","pandas",None],
'Fee' :[20000,25000,22000,None,30000],
'Duration':['30days','40days','35days','None','50days'],
'Discount':[1000,2300,1200,2000,None]}
#CODE HERE
res = []
for i in df.columns:
res.append(col_ave(df, i))
I tried individually creating codes for the mean but Im having trouble

Related

Change number format in Excel using names of headers - openpyxl [duplicate]

I have an Excel (.xlsx) file that I'm trying to parse, row by row. I have a header (first row) that has a bunch of column titles like School, First Name, Last Name, Email, etc.
When I loop through each row, I want to be able to say something like:
row['School']
and get back the value of the cell in the current row and the column with 'School' as its title.
I've looked through the OpenPyXL docs but can't seem to find anything terribly helpful.
Any suggestions?
I'm not incredibly familiar with OpenPyXL, but as far as I can tell it doesn't have any kind of dict reader/iterator helper. However, it's fairly easy to iterate over the worksheet rows, as well as to create a dict from two lists of values.
def iter_worksheet(worksheet):
# It's necessary to get a reference to the generator, as
# `worksheet.rows` returns a new iterator on each access.
rows = worksheet.rows
# Get the header values as keys and move the iterator to the next item
keys = [c.value for c in next(rows)]
for row in rows:
values = [c.value for c in row]
yield dict(zip(keys, values))
Excel sheets are far more flexible than CSV files so it makes little sense to have something like DictReader.
Just create an auxiliary dictionary from the relevant column titles.
If you have columns like "School", "First Name", "Last Name", "EMail" you can create the dictionary like this.
keys = dict((value, idx) for (idx, value) in enumerate(values))
for row in ws.rows[1:]:
school = row[keys['School'].value
I wrote DictReader based on openpyxl. Save the second listing to file 'excel.py' and use it as csv.DictReader. See usage example in the first listing.
with open('example01.xlsx', 'rb') as source_data:
from excel import DictReader
for row in DictReader(source_data, sheet_index=0):
print(row)
excel.py:
__all__ = ['DictReader']
from openpyxl import load_workbook
from openpyxl.cell import Cell
Cell.__init__.__defaults__ = (None, None, '', None) # Change the default value for the Cell from None to `` the same way as in csv.DictReader
class DictReader(object):
def __init__(self, f, sheet_index,
fieldnames=None, restkey=None, restval=None):
self._fieldnames = fieldnames # list of keys for the dict
self.restkey = restkey # key to catch long rows
self.restval = restval # default value for short rows
self.reader = load_workbook(f, data_only=True).worksheets[sheet_index].iter_rows(values_only=True)
self.line_num = 0
def __iter__(self):
return self
#property
def fieldnames(self):
if self._fieldnames is None:
try:
self._fieldnames = next(self.reader)
self.line_num += 1
except StopIteration:
pass
return self._fieldnames
#fieldnames.setter
def fieldnames(self, value):
self._fieldnames = value
def __next__(self):
if self.line_num == 0:
# Used only for its side effect.
self.fieldnames
row = next(self.reader)
self.line_num += 1
# unlike the basic reader, we prefer not to return blanks,
# because we will typically wind up with a dict full of None
# values
while row == ():
row = next(self.reader)
d = dict(zip(self.fieldnames, row))
lf = len(self.fieldnames)
lr = len(row)
if lf < lr:
d[self.restkey] = row[lf:]
elif lf > lr:
for key in self.fieldnames[lr:]:
d[key] = self.restval
return d
The following seems to work for me.
header = True
headings = []
for row in ws.rows:
if header:
for cell in row:
headings.append(cell.value)
header = False
continue
rowData = dict(zip(headings, row))
wantedValue = rowData['myHeading'].value
I was running into the same issue as described above. Therefore I created a simple extension called openpyxl-dictreader that can be installed through pip. It is very similar to the suggestion made by #viktor earlier in this thread.
The package is largely based on source code of Python's native csv.DictReader class. It allows you to select items based on column names using openpyxl. For example:
import openpyxl_dictreader
reader = openpyxl_dictreader.DictReader("names.xlsx", "Sheet1")
for row in reader:
print(row["First Name"], row["Last Name"])
Putting this here for reference.

In a Pandas UDF, is it possible to return a Series containing nested lists?

I would like to create a Pandas UDF that returns a series containing a list of lists. This list represents the normalized pixel values of an image. Is it possible to return this datatype from a Pandas UDF? I tried adding ArrayType(FloatType()) to the Pandas UDF decorator, but am getting the error: could not convert <nested lists with floating point values> with type list: tried to convert to float 32. Would be great to hear your thoughts on this, thanks!
#pandas_udf(ArrayType(FloatType()))
def base64_to_arr(base64_images: pd.Series) -> pd.Series:
def base64_to_arr(img):
img_bytes = base64.b64decode(img)
img_array = np.load(BytesIO(img_bytes))
## Resize
pil_img = Image.fromarray(img_array)
resized_img = pil_img.resize((32, 32))
resized_image_arr = np.array(resized_img)
##
normalized_img = resized_image_arr.astype("float32") / 255
formatted_img = normalized_img.tolist()
return formatted_img
arr_images = base64_images.map(base64_to_arr)
return arr_images

How to display output of IBM tone analyzer in pandas frame

My dataset has following features: "description", "word_count", "char_count", "stopwords". The feature "description" has datatype as string which contains some text. I am doing IBM tone_analysis on this feature which gives me correct output and looks like this:
[{'document_tone': {'tones': [{'score': 0.677676,
'tone_id': 'analytical',
'tone_name': 'Analytical'}]}},
{'document_tone': {'tones': [{'score': 0.620279,
'tone_id': 'analytical',
'tone_name': 'Analytical'}]}},
The code for above is given as below:
result =[]
for i in new_df['description']:
tone_analysis = ta.tone(
{'text': i},
# 'application/json'
).get_result()
result.append(tone_analysis)
I need to keep the above output in pandas data frame.
Use lambda function in Series.apply:
new_df['new'] = new_df['description'].apply(lambda i: ta.tone({'text': i}).get_result())
EDIT:
def f(i):
x = ta.tone({'text': i}).get_result()['document_tone']['tones']
return pd.Series(x[0])
new_df = new_df.join(new_df['description'].apply(f).drop('tone_id', axis=1))
print (new_df)
If need also remove description column:
new_df = new_df.join(new_df.pop('description').apply(f).drop('tone_id', axis=1))

How to merge 4 dataframes into one

I have created a functions that returns a dataframe.Now, i want merge all dataframe into one. First, i called all the function and used reduce and merge function.It did not work as expected.The error i am getting is "cannot combine function.It should be dataframe or series.I checked the type of my df,it is dataframe not functions. I don't know where the error is coming from.
def func1():
return df1
def func2():
return df2
def func3():
return df3
def func4():
return df4
def alldfs():
df_1 = func1()
df_2 = func2()
df_3 = func3()
df_4 = func4()
result = reduce(lambda df_1,d_2,df_3,df_4: pd.merge(df_1,df_2,df_3,df_4,on ="EMP_ID"),[df1,df2,df3,df4)
print(result)
You could try something like this ( assuming that EMP_ID is common across all dataframes and you want the intersection of all dataframes ) -
result = pd.merge(df1, df2, on='EMP_ID').merge(df3, on='EMP_ID').merge(df4, on='EMP_ID')

Assigning values to dataframe columns

In the below code, the dataframe df5 is not getting populated. I am just assigning the values to dataframe's columns and I have specified the column beforehand. When I print the dataframe, it returns an empty dataframe. Not sure whether I am missing something.
Any help would be appreciated.
import math
import pandas as pd
columns = ['ClosestLat','ClosestLong']
df5 = pd.DataFrame(columns=columns)
def distance(pt1, pt2):
return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
for pt1 in df1:
closestPoints = [pt1, df2[0]]
for pt2 in df2:
if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
closestPoints = [pt1, pt2]
df5['ClosestLat'] = closestPoints[1][0]
df5['ClosestLat'] = closestPoints[1][0]
df5['ClosestLong'] = closestPoints[1][1]
print ("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
From the look of your code, you're trying to populate df5 with a list of latitudes and longitudes. However, you're making a couple mistakes.
The columns of pandas dataframes are Series, and hold some type of sequential data. So df5['ClosestLat'] = closestPoints[1][0] attempts to assign the entire column a single numerical value, and results in an empty column.
Even if the dataframe wasn't ignoring your attempts to assign a real number to the column, you would lose data because you are overwriting the column with each loop.
The Solution: Build a list of lats and longs, then insert into the dataframe.
import math
import pandas as pd
columns = ['ClosestLat','ClosestLong']
df5 = pd.DataFrame(columns=columns)
def distance(pt1, pt2):
return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
lats, lngs = [], []
for pt1 in df1:
closestPoints = [pt1, df2[0]]
for pt2 in df2:
if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
closestPoints = [pt1, pt2]
lats.append(closestPoints[1][0])
lngs.append(closestPoints[1][1])
df['ClosestLat'] = pd.Series(lats)
df['ClosestLong'] = pd.Series(lngs)