Python 3: handling numpy arrays and export via openpyxl

I am working with an array consisting of several lists. Of each sublist, I want to take the mean and the std. deviation, and write them in an excel sheet.
The code I have does its job, but it gives me a headache, as I feel I'm not using Python efficiently at all, especially in step (2), where I use numpy in a step-by-step manner. Also, I don't get why I have to do the modification in step (4) in order to bring the data ("total") into a form that I can feed to the openpyxl writer ("total_list"). I would appreciate any help in making it more elegant; here is my code:
import numpy as np
from openpyxl import Workbook
from itertools import chain
# (1) Make up sample array:
arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]
# (2) Make up lists containing average values and std. deviations
avg = []
dev = []
for i in arr:
    avg.append(np.mean(i))
    dev.append(np.std(i))
# (3) Make an alternating list (avg 1, dev 1, avg 2, dev 2, ...)
total = chain.from_iterable( zip( avg, dev ) )
# (4) Make an alternative list that can be fed to the xlsx writer
total_list = []
for i in total:
    total_list.append(i)
# Write to Excel file
wb = Workbook()
ws = wb.active
ws.append(total_list)
wb.save("temp.xlsx")
I would like to have the format shown in the attached picture. It is important that all data are in one row.

Improvements on the numpy code:
In [272]: arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]
Make an array from this list. This isn't required since np.mean does it under the covers, but it should help visualize the action.
In [273]: arr = np.array(arr)
In [274]: arr
Out[274]:
array([[1, 1, 3],
       [3, 4, 2],
       [4, 4, 5],
       [6, 6, 5]])
Now calculate mean and std for the whole array; use axis=1 to act on rows. This way you don't have to iterate over the sublists of arr.
In [277]: m=np.mean(arr, axis=1)
In [278]: s=np.std(arr, axis=1)
In [279]: m
Out[279]: array([ 1.66666667, 3. , 4.33333333, 5.66666667])
In [280]: s
Out[280]: array([ 0.94280904, 0.81649658, 0.47140452, 0.47140452])
There are various ways of turning these 2 arrays into the interleaved array. One is to stack them vertically, and then transpose. This is the numpy answer to the list zip(*...) trick.
In [281]: data=np.vstack([m,s])
In [282]: data
Out[282]:
array([[ 1.66666667,  3.        ,  4.33333333,  5.66666667],
       [ 0.94280904,  0.81649658,  0.47140452,  0.47140452]])
In [283]: data=data.T.ravel()
In [284]: data
Out[284]:
array([ 1.66666667,  0.94280904,  3.        ,  0.81649658,  4.33333333,
        0.47140452,  5.66666667,  0.47140452])
I don't have openpyxl, but can write a csv with savetxt:
In [296]: np.savetxt('test.txt',[data],fmt='%f', delimiter=',',header='#mean1 std1 ...')
In [297]: cat test.txt
# #mean1 std1 ...
1.666667,0.942809,3.000000,0.816497,4.333333,0.471405,5.666667,0.471405
I used [data] because data, as calculated, is 1d, and savetxt would save that as a column; savetxt iterates over the 'rows' of its input.
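If openpyxl is available, the interleaved 1d array computed above can be fed straight to the writer from the question; a minimal sketch (assuming the data variable from above and the same Workbook/append usage the question already uses):
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws.append(data.tolist())   # tolist() turns the 1d numpy array into a plain list of Python floats
wb.save("temp.xlsx")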

I would use the Pandas module, as it can do all the mentioned tasks pretty easily:
import pandas as pd
df = pd.DataFrame(arr)
In [250]: df
Out[250]:
   0  1  2
0  1  1  3
1  3  4  2
2  4  4  5
3  6  6  5
In [251]: df.T
Out[251]:
   0  1  2  3
0  1  3  4  6
1  1  4  4  6
2  3  2  5  5
In [252]: df.T.mean()
Out[252]:
0 1.666667
1 3.000000
2 4.333333
3 5.666667
dtype: float64
In [253]: df.T.std(ddof=0)
Out[253]:
0 0.942809
1 0.816497
2 0.471405
3 0.471405
dtype: float64
you can also easily save your DataFrame as Excel file:
df.to_excel(r'/path/to/file.xlsx', index=False)
Altogether:
In [260]: avg = df.mean(axis=1)
In [261]: dev = df.std(axis=1, ddof=0)  # compute both from the original columns, so 'avg' doesn't skew 'dev'
In [262]: df['avg'] = avg
In [263]: df['dev'] = dev
In [264]: df
Out[264]:
   0  1  2       avg       dev
0  1  1  3  1.666667  0.942809
1  3  4  2  3.000000  0.816497
2  4  4  5  4.333333  0.471405
3  6  6  5  5.666667  0.471405
In [265]: df.to_excel('d:/temp/result.xlsx', index=False)
result.xlsx:
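If the strict single-row layout from the question (avg 1, dev 1, avg 2, dev 2, ...) is needed rather than one row per sublist, the two columns can be interleaved and written as one row; a minimal sketch building on the df above (the output file name is just an example):
row = df[['avg', 'dev']].to_numpy().ravel()          # interleaves avg1, dev1, avg2, dev2, ...
pd.DataFrame([row]).to_excel('result_one_row.xlsx', index=False, header=False)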


Pandas aggregate to a list of dicts [duplicate]

I have a pandas data frame df like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get second column as lists in rows:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
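As a minimal illustration of a custom aggregation (a sketch on the df defined above, not taken from the linked notebook), several callables and strings can be combined in one agg call; the lambda column comes out with an auto-generated name like '<lambda_0>':
df.groupby('a').agg({'b': [list, 'sum', lambda x: sorted(set(x))]})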
If performance is important, go down to the numpy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})
def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives an index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
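From there, getting the lists themselves is a short step; a minimal sketch using the groups object defined above (just one possible way, iterating over the GroupBy):
lists = {key: group.N.tolist() for key, group in groups}
# {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}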
It is time to use agg instead of apply.
Given
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[1,2,5,5,4,6]})
If you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as lists, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column, so use the DataFrame form in the multi-column case.
Just a supplement: pandas.pivot_table is much more universal and seems more convenient:
"""data"""
df = pd.DataFrame({'a':['A','A','B','B','B','C'],
                   'b':[1,2,5,5,4,6],
                   'c':[1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
If you are looking for a unique list while grouping multiple columns, this could probably help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1).
This solution can also deal with multi-indices; however, it is not heavily tested, so use with caution.
If performance is important, go down to the numpy level:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df, col_names):
    if not isinstance(col_names, list):
        col_names = [col_names]
    values = df.sort_values(col_names).values.T
    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]
    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs, :]
    vals = values[other_col_idcs, :]
    # list of tuples of key pairs
    multikeys = list(zip(*keys))
    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)
    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)
    # the resulting list of subarrays has the same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #   rows = number of non-grouped data columns
    #   cols = number of data points grouped into that unique key pair
    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)
    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1]  # first entries are the subarrays from above
        col_name = tup[-1]   # last entry is the data-column name
        list_agg_vals[col_name] = col_vals
    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
Tests:
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results:
for the random seed 0 one would get:
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function.
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
Let us use df.groupby with a dict comprehension and the Series constructor:
pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]:
A [1, 2]
B [5, 5, 4]
C [6]
dtype: object
Here I have grouped elements with "|" as a separator
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace = True)
df['Area']=df['Area'].apply(lambda x:x.lower().strip())
print(df.columns)
df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})
df_op.to_csv('output.csv')
Out[2]:
df_op
Area Keywords
A [1| 2]
B [5| 5| 4]
C [6]
Answer based on @EdChum's comment on his answer. The comment is this:
groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think
Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million rows, as mentioned in the question.
df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)
print(gp_df.shape)
gp_df.head()
The above code takes 2 minutes for 20 million rows and 500k categories in the first column.
Sorting consumes O(n log(n)) time, which is the most time-consuming operation in the solutions suggested above.
For a simple solution (a single column), pd.Series.to_list would work and can be considered more efficient, unless other frameworks are considered.
e.g.
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])
df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 s, and a lambda function, which takes about 20.6 s.
Just to add to the previous answers: in my case, I want the list and other functions like min and max. The way to do that is:
df = pd.DataFrame({
    'a': ['A','A','B','B','B','C'],
    'b': [1,2,5,5,4,6]
})
df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})
#then flattening and renaming if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'},inplace=True)
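A hedged alternative sketch: with named aggregation (available in newer pandas, roughly 0.25+), the flattening and renaming steps can be skipped because the output column names are given up front:
import pandas as pd
df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6]})
# each keyword gives (source column, aggregation); the result columns are b_min, b_max, b_list
out = df.groupby('a').agg(b_min=('b', 'min'), b_max=('b', 'max'), b_list=('b', list))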
It's a bit old, but I was directed here. Is there any way to group it by multiple different columns?
"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99
to this:
"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]

Easiest way to ignore or drop one header row from first page, when parsing table spanning several pages

I am parsing a PDF with tabula-py, and I need to ignore the first two tables, but then parse the rest of the tables as one, and export to a CSV. On the first relevant table (index 2) the first row is a header-row, and I want to leave this out of the csv.
See my code below, including my attempt at dropping the relevant row from the Pandas frame.
What is the easiest/most elegant way of achieving this?
tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)
f = open('output.csv', 'w')
# tables[2].drop(index=0) # tried this, but makes no difference
for df in tables[2:]:
df.to_csv(f, index=False, sep=';')
f.close()
Given the following toy dataframes:
import pandas as pd
tables = [
pd.DataFrame([[1, 3], [2, 4]]),
pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),
]
for table in tables:
    print(table)
# Output
   0  1
0  1  3
1  2  4
   0  1
0  a  b   <<< Unwanted row in tables[1]
1  1  3
2  2  4
You can drop the first row of the second dataframe either by reassigning the resulting dataframe (preferable way):
tables[1] = tables[1].drop(index=0)
Or inplace:
tables[1].drop(index=0, inplace=True)
And so, in both cases:
print(tables[1])
# Output
0 1
1 1 3
2 2 4
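Applied back to the original tabula-py snippet, a hedged sketch (keeping the asker's file names and separator; iloc[1:] is an equivalent way to drop the header row of the first relevant table):
import tabula
tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)
tables[2] = tables[2].iloc[1:]          # reassign so the dropped row actually sticks
with open('output.csv', 'w') as f:
    for df in tables[2:]:
        df.to_csv(f, index=False, sep=';')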

How to use pandas rename() on multi-index columns?

How can I simply rename a MultiIndex column of a pandas DataFrame, using the rename() function?
Let's look at an example and create such a DataFrame:
import pandas
df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B":["min","max"],"C":"mean"})
print(df)
    B          C
  min max   mean
A
1   0   2    1.0
2   3   4    3.5
I am able to select a given MultiIndex column by using a tuple for its name:
print(df[("B","min")])
A
1 0
2 3
Name: (B, min), dtype: int64
However, when using the same tuple naming with the rename() function, it does not seem to be accepted:
df.rename(columns={("B","min"):"renamed"},inplace=True)
print(df)
    B          C
  min max   mean
A
1   0   2    1.0
2   3   4    3.5
Any idea how rename() should be called to deal with Multi-Index columns?
PS: I am aware of the other options to flatten the column names beforehand, but this prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
renamed=('B', 'min'),
B_max=('B', 'max'),
C_mean=('C', 'mean'),
)
print(df)
renamed B_max C_mean
A
1 0 2 1.0
2 3 4 3.5
For more info, you can see the pandas docs and some related other questions.
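If keeping the MultiIndex is fine and only the label itself should change, a hedged sketch on the question's original grouped df would be rename() with its level argument (note this renames "min" wherever it appears on that level, not only under "B"):
df.rename(columns={"min": "renamed"}, level=1, inplace=True)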

Rolling Second highest in a pandas dataframe

I am trying to find the top and second highest value
I can get the highest using
df['B'] = df['A'].rolling(window=3).max()
But how do I get the second highest please?
Such that df['C'] will display as per below
  A   B  C
  1
  6
  5   6  5
  4   6  5
 12  12  5
Generic n-highest values in rolling/sliding windows
Here's one using np.lib.stride_tricks.as_strided to create sliding windows that lets us choose any generic N highest value in sliding windows -
# https://stackoverflow.com/a/40085052/ #Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))
# Return N highest nums in rolling windows of length W off array ar
def N_highest(ar, W, N=1):
    # ar : Input array
    # W  : Window length
    # N  : Get us the N-highest in sliding windows
    A2D = strided_app(ar, W, 1)
    idx = (np.argpartition(A2D, -N, axis=1) == A2D.shape[1]-N).argmax(1)
    return A2D[np.arange(len(idx)), idx]
Sample runs -
In [634]: a = np.array([1,6,5,4,12]) # input array
In [635]: N_highest(a, W=3, N=1) # highest in W=3
Out[635]: array([ 6, 6, 12])
In [636]: N_highest(a, W=3, N=2) # second highest
Out[636]: array([5, 5, 5])
In [637]: N_highest(a, W=3, N=3) # third highest
Out[637]: array([1, 4, 4])
Another shorter way based on strides would be with direct sorting, like so -
np.sort(strided_app(ar,W,1), axis=1)[:,-N]
Solving our case
Hence, to solve our case, we need to concatenate NaNs with the result from the above-mentioned function, like so -
W = 3
df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
Based on direct sorting, we would have -
df['C'] = np.r_[ [np.nan]*(W-1), np.sort(strided_app(df.A,W,1), axis=1)[:,-2]]
Sample run -
In [578]: df
Out[578]:
A
0 1
1 6
2 5
3 4
4 3 # <== Different from given sample, for variety
In [619]: W = 3
In [620]: df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
In [621]: df
Out[621]:
A C
0 1 NaN
1 6 NaN
2 5 5.0
3 4 5.0
4 3 4.0 # <== Second highest from the last group off : [5,4,3]
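If raw speed matters less, a simpler pandas-only sketch (an assumption on my part, not part of the answer above) would be rolling().apply with a sorting lambda; raw=True hands each window to the lambda as a numpy array:
import numpy as np
df['C'] = df['A'].rolling(window=3).apply(lambda w: np.sort(w)[-2], raw=True)   # NaN for the first 2 rows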

pandas faster series of lists unrolling for one-hot encoding?

I'm reading from a database that has many array-type columns, for which pd.read_sql gives me a dataframe with columns that are dtype=object, containing lists.
I'd like an efficient way to find which rows have arrays containing some element:
s = pd.Series(
[[1,2,3], [1,2], [99], None, [88,2]]
)
print s
..
0 [1, 2, 3]
1 [1, 2]
2 [99]
3 None
4 [88, 2]
I'm building 1-hot-encoded feature tables for an ML application, and I'd like to end up with tables like:
contains_1 contains_2, contains_3 contains_88
0 1 ...
1 1
2 0
3 nan
4 0
...
I can unroll a series of arrays like so:
s2 = s.apply(pd.Series).stack()
0 0 1.0
1 2.0
2 3.0
1 0 1.0
1 2.0
2 0 99.0
4 0 88.0
1 2.0
which gets me to the point of being able to find the elements meeting some test:
>>> print s2[(s2==2)].index.get_level_values(0)
Int64Index([0, 1, 4], dtype='int64')
Woot! This step:
s.apply(pd.Series).stack()
produces a great intermediate data structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many tens of seconds for a single column with 500k rows with lists of tens of items), and I have many columns.
Update: It seems likely that having the data in a series of lists to begin with is quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull the array data into a better structure?
import numpy as np
import pandas as pd
import cytoolz
s0 = s.dropna()
v = s0.values.tolist()
i = s0.index.values
l = [len(x) for x in v]
c = cytoolz.concat(v)
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)
k = np.arange(len(c)) - n
s1 = pd.Series(c, [i.repeat(l), k])
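To get from the unrolled s1 back to the one-hot table the question asks for, one hedged sketch (an assumption, not part of the answer above) is get_dummies plus a groupby on the original index level:
one_hot = pd.get_dummies(s1, prefix='contains').groupby(level=0).max()
# one row per original index that had a list; the rows removed by dropna() (the None row) simply won't appear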
UPDATE: What worked for me...
def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s
It should be possible to replace:
s = pd.Series(c)
s.index = [i.repeat(lens), k]
with:
s = pd.Series(c, index=[i.repeat(lens), k])
But this doesn't work. (It is said to be ok here.)
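A likely explanation, stated as an assumption: when the data passed to pd.Series is itself a Series, the index argument reindexes (aligns on labels) rather than relabels, so the new MultiIndex finds no matching labels and everything comes out NaN. Passing the raw values instead should behave like the two-step assignment:
s = pd.Series(c.values, index=[i.repeat(lens), k])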