Python 3: handling numpy arrays and export via openpyxl
I am working with an array consisting of several lists. For each sublist, I want to take the mean and the standard deviation, and write them to an Excel sheet.
The code I have does its job, but it gives me a headache, as I feel I'm not using Python efficiently at all, especially in step (2), where I use numpy in a step-by-step manner. Also, I don't get why I have to do the modification in step (3) in order to bring the data ("total") into a form that I can feed to the openpyxl writer ("total_list"). I would appreciate any help in making it more elegant; here is my code:
import numpy as np
from openpyxl import Workbook
from itertools import chain
# (1) Make up sample array:
arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]
# (2) Make up lists containing average values and std. deviations
avg = []
dev = []
for i in arr:
    avg.append(np.mean(i))
    dev.append(np.std(i))
# (3) Make an alternating list (avg 1, dev 1, avg 2, dev 2, ...)
total = chain.from_iterable( zip( avg, dev ) )
# (4) Convert the iterator into a list that can be fed to the xlsx writer
total_list = []
for i in total:
    total_list.append(i)
# Write to Excel file
wb = Workbook()
ws = wb.active
ws.append(total_list)
wb.save("temp.xlsx")
I would like to have the format shown in the attached picture. It is important that all data are in one row.
Improvements on the numpy code:
In [272]: arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]
Make an array from this list. This isn't required since np.mean does it under the covers, but it should help visualize the action.
In [273]: arr = np.array(arr)
In [274]: arr
Out[274]:
array([[1, 1, 3],
       [3, 4, 2],
       [4, 4, 5],
       [6, 6, 5]])
Now calculate mean and std for the whole array; use axis=1 to act on rows, so you don't have to iterate over the sublists of arr.
In [277]: m=np.mean(arr, axis=1)
In [278]: s=np.std(arr, axis=1)
In [279]: m
Out[279]: array([ 1.66666667,  3.        ,  4.33333333,  5.66666667])
In [280]: s
Out[280]: array([ 0.94280904, 0.81649658, 0.47140452, 0.47140452])
There are various ways of turning these 2 arrays into the interleaved array. One is to stack them vertically, and then transpose. This is the numpy answer to the list zip(*...) trick.
In [281]: data=np.vstack([m,s])
In [282]: data
Out[282]:
array([[ 1.66666667,  3.        ,  4.33333333,  5.66666667],
       [ 0.94280904,  0.81649658,  0.47140452,  0.47140452]])
In [283]: data=data.T.ravel()
In [284]: data
Out[284]:
array([ 1.66666667,  0.94280904,  3.        ,  0.81649658,  4.33333333,
        0.47140452,  5.66666667,  0.47140452])
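Another of those ways (a small sketch, equivalent to the vstack-and-transpose above) is np.column_stack, which places the two arrays side by side as columns, so a plain ravel already yields the interleaved order:

data = np.column_stack([m, s]).ravel()   # avg1, dev1, avg2, dev2, ...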
I don't have openpyxl, but I can write a csv with savetxt:
In [296]: np.savetxt('test.txt',[data],fmt='%f', delimiter=',',header='#mean1 std1 ...')
In [297]: cat test.txt
# #mean1 std1 ...
1.666667,0.942809,3.000000,0.816497,4.333333,0.471405,5.666667,0.471405
I used [data] because data, as calculated, is 1d, and savetxt would save that as a column (it iterates over the 'rows' of data).
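For completeness, a minimal sketch of feeding the same interleaved array to openpyxl (which the question already uses); untested here, since I don't have that library installed. ws.append expects a plain sequence, so converting with tolist() should be enough:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(data.tolist())   # data is the interleaved 1-d array from above
wb.save("temp.xlsx")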
I would use the Pandas module, as it can do all the mentioned tasks pretty easily:
import pandas as pd
df = pd.DataFrame(arr)
In [250]: df
Out[250]:
0 1 2
0 1 1 3
1 3 4 2
2 4 4 5
3 6 6 5
In [251]: df.T
Out[251]:
0 1 2 3
0 1 3 4 6
1 1 4 4 6
2 3 2 5 5
In [252]: df.T.mean()
Out[252]:
0 1.666667
1 3.000000
2 4.333333
3 5.666667
dtype: float64
In [253]: df.T.std(ddof=0)
Out[253]:
0 0.942809
1 0.816497
2 0.471405
3 0.471405
dtype: float64
You can also easily save your DataFrame as an Excel file:
df.to_excel(r'/path/to/file.xlsx', index=False)
Altogether:
In [260]: df['avg'] = df.mean(axis=1)
In [261]: df['dev'] = df[[0, 1, 2]].std(axis=1, ddof=0)   # use only the original data columns, so the freshly added avg column is not included
In [262]: df
Out[262]:
   0  1  2       avg       dev
0  1  1  3  1.666667  0.942809
1  3  4  2  3.000000  0.816497
2  4  4  5  4.333333  0.471405
3  6  6  5  5.666667  0.471405
In [263]: df.to_excel('d:/temp/result.xlsx', index=False)
result.xlsx:
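If the single-row layout from the question (avg 1, dev 1, avg 2, dev 2, ...) is required, one possible sketch (the file name is just a placeholder) interleaves the two columns and writes them out as a one-row DataFrame:

row = df[['avg', 'dev']].to_numpy().ravel()   # avg1, dev1, avg2, dev2, ...
pd.DataFrame([row]).to_excel('result_one_row.xlsx', index=False, header=False)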
Related
Pandas aggregate to a list of dicts [duplicate]
I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get the second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'], 'b': [1, 2, 5, 5, 4, 6]})
        df
Out[1]:
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]:
   a          new
0  A       [1, 2]
1  B    [5, 5, 4]
2  C          [6]
A handy way to achieve this would be:

df.groupby('a').agg({'b': lambda x: list(x)})

Look into writing custom aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
If performance is important go down to numpy level:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
   ...:                    'b': [1, 2, 5, 5, 4, 6],
   ...:                    'c': [3, 3, 3, 4, 4, 4]})

In [6]: df
Out[6]:
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
           b          c
a
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z']
})

df
   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use:

df.groupby('a').agg({'b': list})  # 4.42 ms
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object
As you were saying, the groupby method of a pd.DataFrame object can do the job.

Example:

L = ['A', 'A', 'B', 'B', 'B', 'C']
N = [1, 2, 5, 5, 4, 6]

import pandas as pd
df = pd.DataFrame(zip(L, N), columns=list('LN'))

groups = df.groupby(df.L)

groups.groups
    {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups. To get elements of single groups, you can do, for instance:

groups.get_group('A')

   L  N
0  A  1
1  A  2

groups.get_group('B')

   L  N
2  B  5
3  B  5
4  B  4
It is time to use agg instead of apply. Given:

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 5, 5, 4, 6]})

If you want multiple columns stacked into lists, the result is a pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)

If you want a single column as a list, the result is a pd.Series:

df.groupby('a')['b'].agg(list)
# or
df.groupby('a')['b'].apply(list)

Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column, so use the DataFrame form only for the multi-column case.
Just a supplement. pandas.pivot_table is much more universal and seems more convenient:

"""data"""
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 1, 1, 1, 6]})
print(df)

   a  b  c
0  A  1  1
1  A  2  2
2  B  5  1
3  B  5  1
4  B  4  1
5  C  6  6

"""pivot_table"""
pt = pd.pivot_table(df, values=['b', 'c'], index='a',
                    aggfunc={'b': list, 'c': set})
print(pt)

           b       c
a
A     [1, 2]  {1, 2}
B  [5, 5, 4]     {1}
C        [6]     {6}
If looking for a unique list while grouping multiple columns, this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Building upon B.M.'s answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1). This solution can also deal with multi-indices. However, it is not heavily tested, so use with caution.

If performance is important go down to numpy level:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1, 2, 3]*30,
                   'c': list('abcefghij')*10, 'd': list('hij')*30})

def f_multi(df, col_names):
    if not isinstance(col_names, list):
        col_names = [col_names]

    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs, :]
    vals = values[other_col_idcs, :]

    # list of tuples of key pairs
    multikeys = list(zip(*keys))

    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)

    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #   rows = number of non-grouped data columns
    #   cols = number of data points grouped into that unique key pair

    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1]  # first entries are the subarrays from above
        col_name = tup[-1]   # last entry is the data-column name

        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2

Tests:

In [227]: %timeit f_multi(df, ['a', 'd'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df.groupby(['a', 'd']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results: for the random seed 0 one would get:
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function:

df.groupby('a').agg(b=('b', 'unique'), c=('c', 'unique'))
Let us use df.groupby with list and the Series constructor:

pd.Series({x: y.b.tolist() for x, y in df.groupby('a')})

Out[664]:
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object
Here I have grouped elements with "|" as a separator:

import pandas as pd

df = pd.read_csv('input.csv')

df
Out[1]:
  Area  Keywords
0    A         1
1    A         2
2    B         5
3    B         5
4    B         4
5    C         6

df.dropna(inplace=True)
df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
print(df.columns)
df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x)})
df_op.to_csv('output.csv')

Out[2]:
df_op
        Keywords
Area
A         [1| 2]
B      [5| 5| 4]
C            [6]
Answer based on @EdChum's comment on his answer. The comment is this:

    groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think

Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million, as mentioned in the question.

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()

# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))

# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))

# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']

# Now create the final list_b column, using the min and max indexes for each category of a
# and slicing the list of b
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)

print(gp_df.shape)
gp_df.head()

The code above takes about 2 minutes for 20 million rows and 500k categories in the first column.
Sorting consumes O(n log n) time, which is the most time-consuming operation in the solutions suggested above. For a simple solution (containing a single column), pd.Series.to_list would work and can be considered more efficient unless considering other frameworks. For example:

import pandas as pd
from string import ascii_lowercase
import random

def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])

df = pd.DataFrame({'num_val': [random.randint(0, 100) for _ in range(20000000)],
                   'string_val': [generate_string() for _ in range(20000000)]})

%timeit df.groupby('string_val').agg({'num_val': pd.Series.to_list})

For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 s, and a lambda function, which takes about 20.6 s.
Just to add to the previous answers: in my case, I want the list and other functions like min and max. The way to do that is:

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6]
})

df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})

# then flatten and rename if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min',
                   ('b', 'max'): 'b_max',
                   ('b', '<lambda_0>'): 'b_list'}, inplace=True)
It's a bit old, but I was directed here. Is there any way to group by multiple different columns?

"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99

to this:

"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]
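A possible sketch for that multi-column case (not taken from the answers above): pass both key columns to groupby and aggregate the remaining column to a list:

df = pd.DataFrame({
    'column1': ['foo', 'foo', 'foo', 'bar'],
    'column2': ['val1', 'val2', 'val2', 'other'],
    'column3': [3, 0, 3, 99],
})
df.groupby(['column1', 'column2'])['column3'].agg(list).reset_index()

#   column1 column2 column3
# 0     bar   other    [99]
# 1     foo    val1     [3]
# 2     foo    val2  [0, 3]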
Easiest way to ignore or drop one header row from the first page when parsing a table spanning several pages
I am parsing a PDF with tabula-py, and I need to ignore the first two tables, but then parse the rest of the tables as one and export them to a CSV. In the first relevant table (index 2) the first row is a header row, and I want to leave this out of the CSV. See my code below, including my attempt at dropping the relevant row from the Pandas frame. What is the easiest/most elegant way of achieving this?

tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)

f = open('output.csv', 'w')

# tables[2].drop(index=0)  # tried this, but makes no difference
for df in tables[2:]:
    df.to_csv(f, index=False, sep=';')

f.close()
Given the following toy dataframes:

import pandas as pd

tables = [
    pd.DataFrame([[1, 3], [2, 4]]),
    pd.DataFrame([["a", "b"], [1, 3], [2, 4]]),
]

for table in tables:
    print(table)

# Output
   0  1
0  1  3
1  2  4

   0  1
0  a  b    <<< Unwanted row in tables[1]
1  1  3
2  2  4

You can drop the first row of the second dataframe either by reassigning the resulting dataframe (the preferable way):

tables[1] = tables[1].drop(index=0)

Or in place:

tables[1].drop(index=0, inplace=True)

And so, in both cases:

print(tables[1])

# Output
   0  1
1  1  3
2  2  4
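Applied to the loop from the question, a possible sketch (assuming index 2 really is the table with the unwanted header row) is simply to reassign the result of drop before writing, since drop() returns a new frame rather than modifying the existing one:

tables = tabula.read_pdf('input.pdf', pages='all', multiple_tables=True)
tables[2] = tables[2].drop(index=0)   # reassign; the original attempt discarded the result

with open('output.csv', 'w') as f:
    for df in tables[2:]:
        df.to_csv(f, index=False, sep=';')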
How to use pandas rename() on multi-index columns?
How can I simply rename a MultiIndex column of a pandas DataFrame, using the rename() function? Let's look at an example and create such a DataFrame:

import pandas

df = pandas.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg({"B": ["min", "max"], "C": "mean"})
print(df)

    B         C
  min max  mean
A
1   0   2   1.0
2   3   4   3.5

I am able to select a given MultiIndex column by using a tuple for its name:

print(df[("B", "min")])

A
1    0
2    3
Name: (B, min), dtype: int64

However, when using the same tuple naming with the rename() function, it does not seem to be accepted:

df.rename(columns={("B", "min"): "renamed"}, inplace=True)
print(df)

    B         C
  min max  mean
A
1   0   2   1.0
2   3   4   3.5

Any idea how rename() should be called to deal with MultiIndex columns?

PS: I am aware of the other options to flatten the column names first, but this prevents one-liners, so I am looking for a cleaner solution (see my previous question).
This doesn't answer the question as worded, but it will work for your given example (assuming you want them all renamed with no MultiIndex):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': range(5), 'C': range(5)})
df = df.groupby("A").agg(
    renamed=('B', 'min'),
    B_max=('B', 'max'),
    C_mean=('C', 'mean'),
)
print(df)

   renamed  B_max  C_mean
A
1        0      2     1.0
2        3      4     3.5

For more info, you can see the pandas docs and some related other questions.
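As a complementary sketch (not part of the answer above, and applied to the MultiIndex frame built in the question rather than the flattened one here): rename() applies the mapper to each level of a MultiIndex separately, so a tuple key is never matched; passing level= lets you rename a label within one level, and rebuilding the columns from tuples covers the fully general case:

# either: rename the 'min' label within level 1 -> the column becomes ('B', 'renamed')
df.rename(columns={"min": "renamed"}, level=1, inplace=True)

# or (fully general): rebuild the MultiIndex, replacing one specific tuple
# df.columns = pd.MultiIndex.from_tuples(
#     [("B", "renamed") if c == ("B", "min") else c for c in df.columns]
# )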
Rolling Second highest in a pandas dataframe
I am trying to find the top and second highest value in a rolling window. I can get the highest using:

df['B'] = df['A'].rolling(window=3).max()

But how do I get the second highest, please? Such that df['C'] will display as per below:

 A   B  C
 1
 6
 5   6  5
 4   6  5
12  12  5
Generic n-highest values in rolling/sliding windows

Here's one using np.lib.stride_tricks.as_strided to create sliding windows that lets us choose any generic N highest value in sliding windows -

# https://stackoverflow.com/a/40085052/ @Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

# Return N highest nums in rolling windows of length W off array ar
def N_highest(ar, W, N=1):
    # ar : Input array
    # W  : Window length
    # N  : Get us the N-highest in sliding windows
    A2D = strided_app(ar, W, 1)
    idx = (np.argpartition(A2D, -N, axis=1) == A2D.shape[1]-N).argmax(1)
    return A2D[np.arange(len(idx)), idx]

Sample runs -

In [634]: a = np.array([1,6,5,4,12])  # input array

In [635]: N_highest(a, W=3, N=1)  # highest in W=3
Out[635]: array([ 6,  6, 12])

In [636]: N_highest(a, W=3, N=2)  # second highest
Out[636]: array([5, 5, 5])

In [637]: N_highest(a, W=3, N=3)  # third highest
Out[637]: array([1, 4, 4])

Another shorter way based on strides would be with direct sorting, like so -

np.sort(strided_app(ar, W, 1), axis=1)[:, -N]

Solving our case

Hence, to solve our case, we need to concatenate NaNs with the result from the above mentioned function, like so -

W = 3
df['C'] = np.r_[[np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]

Based on direct sorting, we would have -

df['C'] = np.r_[[np.nan]*(W-1), np.sort(strided_app(df.A, W, 1), axis=1)[:, -2]]

Sample run -

In [578]: df
Out[578]:
   A
0  1
1  6
2  5
3  4
4  3  # <== Different from given sample, for variety

In [619]: W = 3

In [620]: df['C'] = np.r_[[np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]

In [621]: df
Out[621]:
   A    C
0  1  NaN
1  6  NaN
2  5  5.0
3  4  5.0
4  3  4.0  # <== Second highest from the last group of [5,4,3]
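If raw speed is not critical, a much shorter sketch (not from the answer above, noticeably slower on large data, and raw=True needs a reasonably recent pandas) stays in pandas and just sorts each rolling window:

df['C'] = df['A'].rolling(window=3).apply(lambda x: np.sort(x)[-2], raw=True)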
pandas faster series of lists unrolling for one-hot encoding?
I'm reading from a database that has many array-type columns, which pd.read_sql gives me as a dataframe with columns that are dtype=object, containing lists. I'd like an efficient way to find which rows have arrays containing some element:

s = pd.Series([[1,2,3], [1,2], [99], None, [88,2]])
print s

0    [1, 2, 3]
1       [1, 2]
2         [99]
3         None
4      [88, 2]

I'm building 1-hot-encoded feature tables for an ML application, and I'd like to end up with tables like:

   contains_1  contains_2  contains_3  contains_88
0           1  ...
1           1
2           0
3         nan
4           0  ...

I can unroll a series of arrays like so:

s2 = s.apply(pd.Series).stack()

0  0     1.0
   1     2.0
   2     3.0
1  0     1.0
   1     2.0
2  0    99.0
4  0    88.0
   1     2.0

which gets me to being able to find the elements meeting some test:

>>> print s2[(s2==2)].index.get_level_values(0)
Int64Index([0, 1, 4], dtype='int64')

Woot! This step:

s.apply(pd.Series).stack()

produces a great intermediate data-structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many 10's of seconds for a single column with 500k rows with lists of 10's of items), and I have many columns.

Update: It seems likely that having the data in a series of lists to begin with is quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull array data into a better structure?
import numpy as np
import pandas as pd
import cytoolz

s0 = s.dropna()
v = s0.values.tolist()
i = s0.index.values
l = [len(x) for x in v]
c = list(cytoolz.concat(v))   # materialize the iterator so len(c) works below
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)
k = np.arange(len(c)) - n

s1 = pd.Series(c, [i.repeat(l), k])

UPDATE: What worked for me...

def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s

It should be possible to replace:

s = pd.Series(c)
s.index = [i.repeat(lens), k]

with:

s = pd.Series(c, index=[i.repeat(lens), k])

But this doesn't work. (It says it is ok here.)
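As an alternative sketch for the one-hot goal itself (it pulls in scikit-learn, which the question doesn't mention, and encodes the None row as all zeros rather than NaN): MultiLabelBinarizer consumes a column of lists directly, so the None entries just need to be replaced by empty lists first:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

s = pd.Series([[1, 2, 3], [1, 2], [99], None, [88, 2]])

mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(
    mlb.fit_transform(s.apply(lambda x: x if x is not None else [])),
    index=s.index,
    columns=['contains_%s' % c for c in mlb.classes_],
)
# columns: contains_1, contains_2, contains_3, contains_88, contains_99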