Large sampling with replacement by index level of a pandas MultiIndexed DataFrame

Imagine a dataframe with the structure below:
>>> print(pair_df)
                                 0         1
centre param h pair_ind
0      x1    1 (0, 1)     2.244282  2.343915
               (1, 2)     2.343915  2.442202
               (2, 3)     2.442202  2.538162
               (3, 4)     2.538162  2.630836
               (4, 5)     2.630836  2.719298
...                             ...       ...
9      x3    7 (1, 8)     1.407902  1.417398
               (2, 9)     1.407953  1.422860
             8 (0, 8)     1.407896  1.417398
               (1, 9)     1.407902  1.422860
             9 (0, 9)     1.407896  1.422860

[1350 rows x 2 columns]
What is the most efficient way to sample (with replacement) this dataframe by the index level centre (10 values here) a large number of times (e.g., 1000 bootstrap rounds) and put the samples all together?
I have found two solutions:
1)
import numpy as np
import pandas as pd

idx = pd.IndexSlice
bootstrap_rand = np.random.choice(list(range(0, 10)), size=10 * 1000, replace=True).tolist()
sampled_df = pd.concat([pair_df.loc[idx[i, :, :, :], :] for i in bootstrap_rand])
2)
sampled_df = pair_df.unstack(['param', 'h', 'pair_ind']).\
    sample(10 * 1000, replace=True).\
    stack(['param', 'h', 'pair_ind'])
Any more efficient ideas?
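One more idea worth trying, as a hedged sketch (my addition, not benchmarked): compute the integer row positions of each centre block once with groupby, build the resampled positions in NumPy, and do a single positional lookup.
import numpy as np
import pandas as pd

# assumes pair_df's first index level is named 'centre', as in the printout above
rng = np.random.default_rng()

groups = pair_df.groupby(level='centre').indices            # dict: centre value -> integer row positions
draws = rng.choice(list(groups), size=10 * 1000, replace=True)  # resampled centre values
rows = np.concatenate([groups[c] for c in draws])            # stack their row positions
sampled_df = pair_df.iloc[rows]                               # one positional selection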

Julia: sort two arrays (like lexsort in numpy)

Python example
In Numpy there is lexsort to sort one array within another:
Given multiple sorting keys, which can be interpreted as columns in a
spreadsheet, lexsort returns an array of integer indices that
describes the sort order by multiple columns.
So taking the following example:
import numpy as np
a = np.array([1,1,1,2,2,2])
b = np.array([10,8,11,4,8,0])
sorted_idx = np.lexsort((b,a))
print(b[sorted_idx])
# [ 8 10 11 0 4 8]
So this sorts b within a, as we can see:
1 1 1 2 2 2
8 10 11 0 4 8
I can't find anything similar in Julia, so I wonder how this can be achieved. In my case two columns are sufficient.
Julia
So let's take the same data to figure that out:
a = [1, 1, 1, 2, 2, 2]
b = [10, 8, 11, 4, 8, 0]
Little benchmark
using StructArrays
using DataFrames
using BenchmarkTools

a = rand(100000)
b = rand(100000)

function f1(a, b)
    return sortperm(StructArray((a, b)))
end

function f2(a, b)
    return sortperm(DataFrame(a=a, b=b, copycols=false))
end

function f3(a, b)
    return sortperm(collect(zip(a, b)))
end

@btime f1(a, b)
@btime f2(a, b)
@btime f3(a, b)
Giving:
6.075 ms (8 allocations: 781.50 KiB)
13.808 ms (8291 allocations: 5.93 MiB)
15.892 ms (7 allocations: 2.29 MiB)
So StructArray is almost twice as fast as the other two and uses less memory.
Similar to the DataFrames solution but a bit more lightweight in terms of dependencies, a nice option is the StructArrays package. It lets you treat a pair of arrays as if it were an array of tuples, without copying the data, and tuples sort lexicographically, so you can then sort directly:
using StructArrays
i = sortperm(StructArray((a, b)))
Instead of finding the permutation array i and doing b[i], you can also do:
sort!(StructArray((a, b)))
which sorts both a and b in-place lexicographically by (a[j], b[j]).
Use sort and sortperm functions with a vector of tuples:
julia> a = [1, 1, 1, 2, 2, 2];
julia> b = [10, 8, 11, 4, 8, 0];
julia> x = collect(zip(a, b))
6-element Vector{Tuple{Int64, Int64}}:
(1, 10)
(1, 8)
(1, 11)
(2, 4)
(2, 8)
(2, 0)
julia> sort(x)
6-element Vector{Tuple{Int64, Int64}}:
(1, 8)
(1, 10)
(1, 11)
(2, 0)
(2, 4)
(2, 8)
julia> sortperm(x) #indices
6-element Vector{Int64}:
2
1
3
6
4
5
With DataFrames.jl it can be a bit shorter to write:
using DataFrames
sortperm(DataFrame(a=a, b=b, copycols=false))
copycols=false avoids an unnecessary copy of the vectors when creating the data frame. If you do not care about performance and just want short code, you can even write:
sortperm(DataFrame(; a, b))

Using TF-IDF with a pandas data frame

I am trying to use TF-IDF with a pandas data set containing two columns: the first column contains text data and the other one contains categorical data, which looks like below:
summary                                             type of attack
unknown african american assailants fired seve...  Armed Assault
unknown perpetrators detonated explosives paci...  Bombing
karl armstrong member years gang threw firebom...  Infrastructure
karl armstrong member years gang broke into un...  Infrastructure
unknown perpetrators threw molotov cocktail in...  Infrastructure
I want to use TF-IDF to convert the first column and then use it to build a model for predicting the second column, which contains the attack type.
Here is a short example showing how to process your df into X and y, ready to be trained.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

data = {'summary': ['unknown african american assailants fired',
                    'Armed Assault unknown perpetrators detonated explosives',
                    'Bombing karl armstrong member years gang threw'],
        'type of attack': ['bullet', 'explosion', 'gang']}
df = pd.DataFrame(data)

# tfidf
tf = TfidfVectorizer()
X = tf.fit_transform(df['summary'])

# label encoding
le = LabelEncoder()
y = le.fit_transform(df['type of attack'])

# your X and y, ready to be trained
print('X----')
print(X)
print('y----')
print(y)
Output
X----
(0, 9) 0.4673509818107163
(0, 4) 0.4673509818107163
(0, 1) 0.4673509818107163
(0, 0) 0.4673509818107163
(0, 15) 0.35543246785041743
(1, 8) 0.4233944834119594
(1, 7) 0.4233944834119594
(1, 13) 0.4233944834119594
(1, 5) 0.4233944834119594
(1, 2) 0.4233944834119594
(1, 15) 0.3220024178194947
(2, 14) 0.37796447300922725
(2, 10) 0.37796447300922725
(2, 16) 0.37796447300922725
(2, 12) 0.37796447300922725
(2, 3) 0.37796447300922725
(2, 11) 0.37796447300922725
(2, 6) 0.37796447300922725
y----
[0 1 2]
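To go from X and y to the attack-type prediction the question asks for, any scikit-learn classifier can be fit on these arrays. A minimal sketch (assuming LogisticRegression, which is my choice and not part of the original answer):
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)                                   # train on TF-IDF features and encoded labels

new_text = ['unknown perpetrators threw molotov cocktail']
pred = clf.predict(tf.transform(new_text))      # reuse the fitted vectorizer on new text
print(le.inverse_transform(pred))               # map the encoded label back to the attack type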

Sum of data entries with given indexes in a pandas dataframe

I am trying to get the sum of every possible combination of the given data in a pandas dataframe. To do this I use itertools.combinations to get all possible combinations, then sum each of them in a loop.
Is there any way to do this without using the loop?
Please check the following script that I created to show what I want.
import pandas as pd
import itertools as it

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
df = pd.DataFrame({'A': A})

listNew = []
for i in range(1, len(df.A) + 1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)

for data in listNew:
    print(df.A[list(data)].sum())
Output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, using reindex:
# convert your list of tuples to a data frame and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then reindex df based on that order, using column A
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the index already encodes the group, sum with level is simpler
s = s.Value.sum(level=0)
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64
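Note that on recent pandas versions Series.sum(level=...) has been removed; a sketch of the equivalent groupby call (my addition, not part of the original answer), replacing the line s = s.Value.sum(level=0) above:
# level-wise sum via groupby, equivalent to s.Value.sum(level=0)
s = s['Value'].groupby(level=0).sum()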

Selection of second-column data based on a match of the first column with another text file in Python

I have little knowledge of numpy arrays and iterations. I have two input files. The first column of both files represents time in milliseconds. Input file 1 is the reference (simulated) value; input file 2 is the obtained (test) value. I want to compare (plot second column vs. first column) the second column of input file 2 with the second column of input file 1, but only where the time in the first column matches between the two files. I am trying to do it with iterations but have not found proper results yet. How do I find the indexes where there is a match?
import numpy as np

my_file = np.genfromtxt('path/input1.txt')
Sim_val = np.genfromtxt('path/input2.txt')

inp1 = my_file[:, 0]
inp12 = my_file[:, 1]
inpt2 = Sim_val[:, 0]
inpt21 = Sim_val[:, 1]

xarray = np.array(inp1)
yarray = np.array(inp12)
data = np.array([xarray, yarray])
ldata = data.T

zarray = np.array(inpt2)
tarray = np.array(inpt21)
mdata = np.array([zarray, tarray])
kdata = mdata.T

i = np.searchsorted(kdata[:, 0], ldata[:, 0])
print(i)
My input file 2 and input file 1 are:
# Input file 2 (obtained value)    # Input file 1 (simulated value)
0    5                              0    5
100  6                              50   6
200  10                             200  15
300  12                             350  12
400  15                             400  15
500  20                             500  25
600  0                              650  0
700  11                             700  11
800  12                             850  8
900  19                             900  19
1000 10                             1000 3
I am really having a hard time with numpy arrays and iterations. Can anybody please suggest how I can solve the above problem? In fact I have other columns too, but all the manipulation depends on the match of the first column (time match). Once again, many thanks in advance.
Did you mean something like this?
import numpy as np
simulated = np.array([
(0, 5),
(100, 6),
(200, 10),
(300, 12),
(400, 15),
(500, 20),
(600, 0),
(700, 11),
(800, 12),
(900, 19),
(1000, 10)
])
actual = np.array([
(0, 5),
(50, 6),
(200, 15),
(350, 12),
(400, 15),
(500, 25),
(650, 0),
(700, 11),
(850, 8),
(900, 19),
(1000, 3)
])
def indexes_where_match(A, B):
    """An iterator over the indexes where the entries in A's first column and B's first column match."""
    return (i for i, (a, b) in enumerate(zip(A, B)) if a[0] == b[0])

def main():
    for i in indexes_where_match(simulated, actual):
        print(simulated[i][1], 'should be compared to', actual[i][1])

if __name__ == '__main__':
    main()
You could also use column-slicing, like this:
simulated_time, simulated_values = simulated[..., 0], simulated[..., 1:]
actual_time, actual_values = actual[..., 0], actual[..., 1:]
indexes_where_match = (i for i, (a, b) in enumerate(zip(simulated_time, actual_time)) if a == b)
for i in indexes_where_match:
    print(simulated_values[i], 'should be compared to', actual_values[i])
# outputs:
# [5] should be compared to [5]
# [10] should be compared to [15]
# [15] should be compared to [15]
# [20] should be compared to [25]
# [11] should be compared to [11]
# [19] should be compared to [19]
# [10] should be compared to [3]
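If the matching timestamps are not guaranteed to sit on the same row in both files, a sketch using np.intersect1d could work (my addition, not part of the original answer):
# indices of rows whose times appear in both files, regardless of row position
common_times, sim_idx, act_idx = np.intersect1d(
    simulated[:, 0], actual[:, 0], return_indices=True)

for t, i, j in zip(common_times, sim_idx, act_idx):
    print(t, ':', simulated[i, 1], 'should be compared to', actual[j, 1])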

Python, Numpy: all UNIQUE combinations of a numpy.array() vector

I want to get all unique combinations of a numpy.array vector (or a pandas.Series). I used itertools.combinations but it's very slow. For an array of size (1000,) it takes many hours. Here is my code using itertools (actually I use combination differences):
import itertools
import numpy as np
import pandas as pd

def a(array):
    temp = pd.Series([])
    for i in itertools.combinations(array, 2):
        temp = temp.append(pd.Series(np.abs(i[0] - i[1])))
    temp.index = range(len(temp))
    return temp
As you can see, there is no repetition.
sklearn.utils.extmath.cartesian is really fast, but it produces repetitions, which I do not want. I need help rewriting the above function without itertools and with much more speed for large vectors.
You could take the upper triangular part of a matrix formed on the Cartesian product with the binary operation (here subtraction, as in your example):
import numpy as np
n = 3
a = np.random.randn(n)
print(a)
print(a - a[:, np.newaxis])
print((a - a[:, np.newaxis])[np.triu_indices(n, 1)])
gives
[ 0.04248369 -0.80162228 -0.44504522]
[[ 0. -0.84410597 -0.48752891]
[ 0.84410597 0. 0.35657707]
[ 0.48752891 -0.35657707 0. ]]
[-0.84410597 -0.48752891 0.35657707]
with n=1000 (and output piped to /dev/null) this runs in 0.131s
on my relatively modest laptop.
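Wrapping that idea into a drop-in replacement for the question's a(array) might look like the following sketch (it returns a plain ndarray rather than a pandas.Series; the function name is mine):
def pairwise_abs_diffs(array):
    array = np.asarray(array)
    diffs = np.abs(array - array[:, np.newaxis])   # full n x n pairwise difference matrix
    return diffs[np.triu_indices(len(array), 1)]   # keep each unordered pair exactly once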
For a random array of ints:
import numpy as np
import pandas as pd
import itertools as it
b = np.random.randint(0, 8, ((6,)))
# array([7, 0, 6, 7, 1, 5])
pd.Series(list(it.combinations(np.unique(b), 2)))
it returns:
0 (0, 1)
1 (0, 5)
2 (0, 6)
3 (0, 7)
4 (1, 5)
5 (1, 6)
6 (1, 7)
7 (5, 6)
8 (5, 7)
9 (6, 7)
dtype: object
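If what is actually wanted are the unique pairwise differences of the unique values, the two answers above can be combined; a small sketch of that idea (my addition):
u = np.unique(b)                                   # unique values, sorted
unique_diffs = np.abs(u - u[:, np.newaxis])[np.triu_indices(len(u), 1)]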