Matching IP Address to IP Network and returning associated column - pandas

I have two dataframes in pandas.
import pandas as pd
inp1 = [{'network':'1.0.0.0/24', 'A':1, 'B':2}, {'network':'5.46.8.0/23', 'A':3, 'B':4}, {'network':'78.212.13.0/24', 'A':5, 'B':6}]
df1 = pd.DataFrame(inp1)
print("df1", df1)
inp2 = [{'ip':'1.0.0.10'}, {'ip':'blahblahblah'}, {'ip':'78.212.13.249'}]
df2 = pd.DataFrame(inp2)
print("df2", df2)
Output:
          network  A  B
0      1.0.0.0/24  1  2
1     5.46.8.0/23  3  4
2  78.212.13.0/24  5  6
              ip
0       1.0.0.10
1   blahblahblah
2  78.212.13.249
The ultimate output I want would appear as follows:
              ip    A    B
0       1.0.0.10    1    2
1   blahblahblah  NaN  NaN
2  78.212.13.249    5    6
I want to iterate through each cell in df2['ip'] and check if it belongs to a network in df1['network']. If it belongs to a network, it would return the corresponding A and B column for the specific ip address. I have referenced this article and considered netaddr, IPNetwork, IPAddress, ipaddress but cannot quite figure it out.
Help appreciated!

You can do it using netaddr + apply(). Here is an example:
import pandas as pd
from netaddr import IPNetwork, IPAddress, AddrFormatError

network_df = pd.DataFrame([
    {'network': '1.0.0.0/24', 'A': 1, 'B': 2},
    {'network': '5.46.8.0/23', 'A': 3, 'B': 4},
    {'network': '78.212.13.0/24', 'A': 5, 'B': 6}
])
ip_df = pd.DataFrame([{'ip': '1.0.0.10'}, {'ip': 'blahblahblah'}, {'ip': '78.212.13.249'}])

# parse all networks once with netaddr (a list, not a generator,
# so it can be iterated again for every IP)
networks = [IPNetwork(n) for n in network_df.network.to_list()]

def find_network(ip):
    # return an empty string for a bad/malformed IP
    try:
        ip_address = IPAddress(ip)
    except AddrFormatError:
        return ''
    # return the network as a string if the IP falls inside it
    for network in networks:
        if ip_address in network:
            return str(network.cidr)
    return ''

# add a network column, looked up from the ip column
ip_df['network'] = ip_df['ip'].apply(find_network)

# merge on the network column (a plain string in both dataframes)
result = pd.merge(ip_df, network_df, how='left', on='network')

# the network column isn't needed in the expected output
result = result.drop(columns=['network'])
print(result)
# ip A B
# 0 1.0.0.10 1.0 2.0
# 1 blahblahblah NaN NaN
# 2 78.212.13.249 5.0 6.0
See comments. Hope this helps.
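If you would rather avoid the extra dependency, a minimal sketch of the same approach with the standard-library ipaddress module (assuming the network_df and ip_df defined above) could look like this:
import ipaddress

# parse the CIDR strings once
networks = [ipaddress.ip_network(n) for n in network_df['network']]

def find_network(ip):
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:  # raised for malformed addresses such as 'blahblahblah'
        return ''
    for net in networks:
        if addr in net:
            return str(net)  # matches the original CIDR string, e.g. '1.0.0.0/24'
    return ''

ip_df['network'] = ip_df['ip'].apply(find_network)
result = ip_df.merge(network_df, how='left', on='network').drop(columns=['network'])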

If you're willing to use R instead of Python, I've written an ipaddress package which can solve this problem. There's still an underlying loop, but it's implemented in C++ (much faster!)
library(tibble)
library(ipaddress)
library(fuzzyjoin)
addr <- tibble(
  address = ip_address(c("1.0.0.10", "blahblahblah", "78.212.13.249"))
)
#> Warning: Problem on row 2: blahblahblah
nets <- tibble(
  network = ip_network(c("1.0.0.0/24", "5.46.8.0/23", "78.212.13.0/24")),
  A = c(1, 3, 5),
  B = c(2, 4, 6)
)
fuzzy_left_join(addr, nets, c("address" = "network"), is_within)
#> # A tibble: 3 x 4
#>   address       network            A     B
#>   <ip_addr>     <ip_netwk>     <dbl> <dbl>
#> 1 1.0.0.10      1.0.0.0/24         1     2
#> 2 NA            NA                NA    NA
#> 3 78.212.13.249 78.212.13.0/24     5     6
Created on 2020-09-02 by the reprex package (v0.3.0)

Related

Pandas aggregate to a list of dicts [duplicate]

I have a pandas data frame df like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get the second column as lists in rows:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
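If you then want a plain dict mapping each key to its list (as the question title hints), a small follow-up sketch:
In [4]: df.groupby('a')['b'].apply(list).to_dict()
Out[4]: {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}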
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
If performance is important go down to numpy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})
def f(df):
    # sort by the key column, then split the value column at each new key
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6], 'c': [3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use either of:
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives an index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
It is time to use agg instead of apply.
Given
df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
if you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as lists, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you aggregate only a single column, so use the DataFrame form in the multi-column case.
Just a supplement. pandas.pivot_table is more universal and seems more convenient:
"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
'b':[1,2,5,5,4,6],
'c':[1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
If looking for a unique list while grouping multiple columns this could probably help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Building upon B.M.'s answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1).
This solution can also deal with multi-indices.
However, this is not heavily tested, so use with caution.
If performance is important, go down to the numpy level:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df, col_names):
    if not isinstance(col_names, list):
        col_names = [col_names]

    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs, :]
    vals = values[other_col_idcs, :]

    # list of tuples of key pairs
    multikeys = list(zip(*keys))

    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)

    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # the resulting list has as many subarrays as there are unique key pairs
    # each subarray has the following shape:
    #   rows = number of non-grouped data columns
    #   cols = number of data points grouped into that unique key pair

    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1]  # first entries are the subarrays from above
        col_name = tup[-1]   # last entry is the data-column name
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
Tests:
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results:
for the random seed 0 one would get:
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function:
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
Let us use df.groupby with tolist() and the Series constructor:
pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]:
A [1, 2]
B [5, 5, 4]
C [6]
dtype: object
Here I have grouped elements with "|" as a separator
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace=True)
df['Area'] = df['Area'].apply(lambda x: str(x).lower().strip())
print(df.columns)
df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})
df_op.to_csv('output.csv')
df_op
Out[2]:
     Keywords
Area
a         1|2
b       5|5|4
c           6
Answer based on EdChum's comment on his answer. The comment is this -
groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think
Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million rows, as mentioned in the question.
df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)
print(gp_df.shape)
gp_df.head()
The above code takes 2 minutes for 20 million rows and 500k categories in the first column.
Sorting consumes O(n log n) time, which is the most time-consuming operation in the solutions suggested above.
For a simple solution (a single column), pd.Series.to_list works and can be considered more efficient, unless other frameworks are being considered.
For example:
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])
df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records it takes about 17.2 s, compared to apply(list), which takes about 19.2 s, and a lambda function, which takes about 20.6 s.
Just to add to the previous answers: in my case, I wanted the list as well as other functions like min and max. The way to do that is:
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6]
})

df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})

# then flatten and rename if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'}, inplace=True)
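A sketch of the same result using named aggregation (available since pandas 0.25), which skips the flatten-and-rename step; it starts from the original df, and the output column names b_min/b_max/b_list are chosen only for illustration:
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# output column name = (input column, aggregation function)
out = df.groupby('a').agg(b_min=('b', 'min'),
                          b_max=('b', 'max'),
                          b_list=('b', list))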
It's a bit old, but I was directed here. Is there any way to group it by multiple different columns? (See the sketch after the example below.)
"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99
to this:
"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]

Pandas create categorical column filled in with string values from other variable

I have multiple files with foreign key relations (in csv). So one file will refer to a page as number 123, and there is a separate file that maps that number to '/homepage'. The mapping file is not ordered and not zero-based.
I can't figure out from the documentation how to use astype('category') with a lookup dict or something.
Any help?
# the file with the foreign keys
lookup_df = pd.DataFrame({'page':[123,2,3], 'name':['/homepage','/search','/checkout']})
# the file with the pages
df1 = pd.DataFrame({'pages':[2,3,123]})
# wanted df1, with categorical 'pages' column
# pages
# 0 /search
# 1 /checkout
# 2 /homepage
# but instead, of course, I get
# pages
# 0 2
# 1 3
# 2 123
You could try this:
from functools import cache

import pandas as pd

lookup_df = pd.DataFrame(
    {"page": [123, 2, 3], "name": ["/homepage", "/search", "/checkout"]}
)
df1 = pd.DataFrame({"pages": [2, 3, 123, 4, 3, 123, 2]})

@cache
def match(value):
    # look up the page number; fall back to "/unknown" if it is not in the mapping
    try:
        return lookup_df.loc[lookup_df["page"] == value, "name"].values[0]
    except IndexError:
        return "/unknown"

df1["name"] = df1["pages"].apply(match)
print(df1)
# Output
pages name
0 2 /search
1 3 /checkout
2 123 /homepage
3 4 /unknown
4 3 /checkout
5 123 /homepage
6 2 /search
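As a follow-up, a minimal sketch of the same lookup using Series.map, which avoids a per-row search (the same lookup_df and df1 as above are assumed):
# build a page -> name mapping once, then map it over the pages column
mapping = lookup_df.set_index("page")["name"]
df1["name"] = df1["pages"].map(mapping).fillna("/unknown")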

Crosstab using multi-element calculation

I would like to create a crosstab from a dataframe df, comparing each record of df to every other record, i.e. pairwise, and calculating one number from several elements of the rows of df. As an example, let's take the following dataframe and calculate the (squared) distance between the points:
import pandas as pd
df = pd.DataFrame({"Point": ["A", "B", "C"], "x": [10, 20, 30], "y": [1, 2, 3]})
df["XX"] = 1
result = (
df.merge(df, on="XX")
.assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
.loc[:, ["Point_x", "Point_y", "distance"]]
.pivot(index="Point_x", columns="Point_y")
)
yielding the desired result:
        distance
Point_y        A    B    C
Point_x
A              0  101  404
B            101    0  101
C            404  101    0
Is there a better way to do this without resorting to adding a dummy field XX and merging on it? I tried multiple variations of
df = df.drop("XX", axis=1)
result = pd.crosstab(index=df["Point"], columns=df["Point"])
with values= and aggfunc= parameters, but to no avail. Possibly there is also an easier way using numpy?
"cross" merge
Assuming pandas 1.2.0+, you can avoid the dummy XX column by merging with how="cross":
cross: creates the cartesian product from both frames, preserves the order of the left keys (new in version 1.2.0)
(df.merge(df, how="cross")
.assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
.loc[:, ["Point_x", "Point_y", "distance"]]
.pivot(index="Point_x", columns="Point_y"))
# distance
# Point_y A B C
# Point_x
# A 0 101 404
# B 101 0 101
# C 404 101 0
numpy broadcasting
You can do the pairwise calculations in numpy by using singleton dimensions (None or np.newaxis):
x = (df.x.values[:, None] - df.x.values) ** 2
y = (df.y.values[:, None] - df.y.values) ** 2
pd.DataFrame(x + y, index=df.Point, columns=df.Point)
# Point A B C
# Point
# A 0 101 404
# B 101 0 101
# C 404 101 0
scipy squareform
If you compute a vector of pairwise values (e.g., result of pdist), you can use squareform to crosstab the vector:
from scipy.spatial.distance import squareform, pdist
pd.DataFrame(squareform(pdist(df[["x", "y"]]) ** 2), columns=df.Point, index=df.Point)
# Point A B C
# Point
# A 0.0 101.0 404.0
# B 101.0 0.0 101.0
# C 404.0 101.0 0.0
As another option, using euclidean_distances from sklearn:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(df[['x', 'y']], df[['x', 'y']], squared=True)
Output:
array([[ 0., 101., 404.],
[101., 0., 101.],
[404., 101., 0.]])
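To get the same labelled crosstab as the other approaches, the array can be wrapped in a DataFrame; a small sketch (the second argument defaults to the first, so it can be dropped):
from sklearn.metrics.pairwise import euclidean_distances

dist = euclidean_distances(df[['x', 'y']], squared=True)  # pairwise within df
pd.DataFrame(dist, index=df.Point, columns=df.Point)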

Rolling Second highest in a pandas dataframe

I am trying to find the highest and the second highest value over a rolling window.
I can get the highest using
df['B'] = df['a'].rolling(window=3).max()
But how do I get the second highest please?
Such that df['C'] will display as per below:
 A   B  C
 1
 6
 5   6  5
 4   6  5
12  12  5
Generic n-highest values in rolling/sliding windows
Here's one using np.lib.stride_tricks.as_strided to create sliding windows, which lets us pick any generic N-th highest value in the sliding windows -
import numpy as np

# https://stackoverflow.com/a/40085052/ @Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

# Return the N-th highest numbers in rolling windows of length W off array ar
def N_highest(ar, W, N=1):
    # ar : Input array
    # W  : Window length
    # N  : Get us the N-th highest in sliding windows
    A2D = strided_app(ar, W, 1)
    idx = (np.argpartition(A2D, -N, axis=1) == A2D.shape[1] - N).argmax(1)
    return A2D[np.arange(len(idx)), idx]
Sample runs -
In [634]: a = np.array([1,6,5,4,12]) # input array
In [635]: N_highest(a, W=3, N=1) # highest in W=3
Out[635]: array([ 6, 6, 12])
In [636]: N_highest(a, W=3, N=2) # second highest
Out[636]: array([5, 5, 5])
In [637]: N_highest(a, W=3, N=3) # third highest
Out[637]: array([1, 4, 4])
Another shorter way based on strides would be with direct sorting, like so -
np.sort(strided_app(ar, W, 1), axis=1)[:, -N]
Solving our case
Hence, to solve our case, we need to prepend NaNs to the result from the above-mentioned function, like so -
W = 3
df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
Based on direct sorting, we would have -
df['C'] = np.r_[ [np.nan]*(W-1), np.sort(strided_app(df.A,W,1), axis=1)[:,-2]]
Sample run -
In [578]: df
Out[578]:
A
0 1
1 6
2 5
3 4
4 3 # <== Different from given sample, for variety
In [619]: W = 3
In [620]: df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
In [621]: df
Out[621]:
A C
0 1 NaN
1 6 NaN
2 5 5.0
3 4 5.0
4 3 4.0 # <== Second highest from the last group off : [5,4,3]
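For completeness, a simpler (though slower) sketch that stays in pandas, using rolling().apply with a sort; it assumes the values live in column A as in the sample above:
import numpy as np

# second highest value in each window of 3 (NaN until the window is full)
df['C'] = df['A'].rolling(window=3).apply(lambda w: np.sort(w)[-2], raw=True)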

Python 3: handling numpy arrays and export via openpyxl

I am working with an array consisting of several lists. Of each sublist, I want to take the mean and the std. deviation, and write them in an excel sheet.
The code I have does its job, but it gives me a headache, as I feel I'm not using Python efficiently at all, especially in step (2), where I use numpy in a step-by-step manner. Also, I don't get why I have to do the modification in step (3) in order to bring the data ("total") into a form that I can feed to the openpyxl writer ("total_list"). I would appreciate any help in making it more elegant; here is my code:
import numpy as np
from openpyxl import Workbook
from itertools import chain
# (1) Make up a sample array:
arr = [[1, 1, 3], [3, 4, 2], [4, 4, 5], [6, 6, 5]]

# (2) Make up lists containing average values and std. deviations
avg = []
dev = []
for i in arr:
    avg.append(np.mean(i))
    dev.append(np.std(i))

# (3) Make an alternating sequence (avg 1, dev 1, avg 2, dev 2, ...)
total = chain.from_iterable(zip(avg, dev))

# (4) Turn it into a list that can be fed to the xlsx writer
total_list = []
for i in total:
    total_list.append(i)

# Write to Excel file
wb = Workbook()
ws = wb.active
ws.append(total_list)
wb.save("temp.xlsx")
I would like to have the format shown in the picture attached. It is important that all the data are in one row.
Improvements on the numpy code:
In [272]: arr = [[1,1,3], [3,4,2], [4,4,5], [6,6,5]]
Make an array from this list. This isn't required since np.mean does it under the covers, but it should help visualize the action.
In [273]: arr = np.array(arr)
In [274]: arr
Out[274]:
array([[1, 1, 3],
[3, 4, 2],
[4, 4, 5],
[6, 6, 5]])
Now calculate mean and std for the whole array; use axis=1 to act on rows. This way you don't have to iterate on the sublists of arr.
In [277]: m=np.mean(arr, axis=1)
In [278]: s=np.std(arr, axis=1)
In [279]: m
Out[279]: array([ 1.66666667, 3. , 4.33333333, 5.66666667])
In [280]: s
Out[280]: array([ 0.94280904, 0.81649658, 0.47140452, 0.47140452])
There are various ways of turning these 2 arrays into the interleaved array. One is to stack them vertically, and then transpose. This is the numpy answer to the list zip(*...) trick.
In [281]: data=np.vstack([m,s])
In [282]: data
Out[282]:
array([[ 1.66666667, 3. , 4.33333333, 5.66666667],
[ 0.94280904, 0.81649658, 0.47140452, 0.47140452]])
In [283]: data=data.T.ravel()
In [284]: data
Out[284]:
array([ 1.66666667, 0.94280904, 3. , 0.81649658, 4.33333333,
0.47140452, 5.66666667, 0.47140452])
I don't have openpyxl, but I can write a csv with savetxt:
In [296]: np.savetxt('test.txt',[data],fmt='%f', delimiter=',',header='#mean1 std1 ...')
In [297]: cat test.txt
# #mean1 std1 ...
1.666667,0.942809,3.000000,0.816497,4.333333,0.471405,5.666667,0.471405
I used [data] because data, as calculated, is 1d, and savetxt would save that as a column; it iterates on the 'rows' of [data].
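Since the question already uses openpyxl's Workbook/append pattern, a minimal sketch of writing the interleaved row with it (using the data array computed above) would be:
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(data.tolist())  # the interleaved 1d array from above, as plain Python floats
wb.save("temp.xlsx")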
I would use the pandas module, as it can do all the mentioned tasks pretty easily:
import pandas as pd
df = pd.DataFrame(arr)
In [250]: df
Out[250]:
0 1 2
0 1 1 3
1 3 4 2
2 4 4 5
3 6 6 5
In [251]: df.T
Out[251]:
0 1 2 3
0 1 3 4 6
1 1 4 4 6
2 3 2 5 5
In [252]: df.T.mean()
Out[252]:
0 1.666667
1 3.000000
2 4.333333
3 5.666667
dtype: float64
In [253]: df.T.std(ddof=0)
Out[253]:
0 0.942809
1 0.816497
2 0.471405
3 0.471405
dtype: float64
you can also easily save your DataFrame as an Excel file:
df.to_excel(r'/path/to/file.xlsx', index=False)
Altogether:
In [260]: avg, dev = df.mean(axis=1), df.std(axis=1, ddof=0)
In [261]: df['avg'], df['dev'] = avg, dev
In [262]: df
Out[262]:
   0  1  2       avg       dev
0  1  1  3  1.666667  0.942809
1  3  4  2  3.000000  0.816497
2  4  4  5  4.333333  0.471405
3  6  6  5  5.666667  0.471405
In [263]: df.to_excel('d:/temp/result.xlsx', index=False)
result.xlsx: