I have a dataset as shown below; each sample has X and Y values and the corresponding result:
Sr.  X  Y   Result
1    2  12  Positive
2    4  3   Positive
....
Visualization: the samples plotted on a 12 * 8 grid (red = positive, blue = negative).
How can I calculate, for each sample, the distance to the nearest red (positive) point?
Sr.  X  Y   Result    Nearest-distance-red
1    2  23  Positive  ?
2    4  3   Negative  ?
....
It's a lot easier when there is sample data; make sure to include that next time.
I'll generate some random data:
import numpy as np
import pandas as pd

# build a 50 x 50 grid of sample points
x = np.linspace(1, 50)
y = np.linspace(1, 50)
GRID = np.meshgrid(x, y)

# mark roughly 20% of the points as red (1), the rest as blue (0)
grid_colors = 1 * (np.random.random(GRID[0].size) > .8)
sample_data = pd.DataFrame({'X': GRID[0].flatten(),
                            'Y': GRID[1].flatten(),
                            'grid_color': grid_colors})

sample_data.plot.scatter(x='X', y='Y', c='grid_color', colormap='bwr', figsize=(10, 10))
BallTree (or KDTree) can create a tree to query with:
from sklearn.neighbors import BallTree

red_points = sample_data[sample_data.grid_color == 1]
blue_points = sample_data[sample_data.grid_color != 1]

# build the tree from the red (positive) points only
tree = BallTree(red_points[['X', 'Y']], leaf_size=15, metric='minkowski')
and query it with:
# distance to, and index of, the nearest red point for every sample
distance, index = tree.query(sample_data[['X', 'Y']], k=1)
Now add it to the DataFrame:
sample_data['nearest_point_distance'] = distance
sample_data['nearest_point_X'] = red_points.X.values[index]
sample_data['nearest_point_Y'] = red_points.Y.values[index]
which gives:
     X    Y  grid_color  nearest_point_distance  nearest_point_X  nearest_point_Y
0  1.0  1.0           0                     2.0              3.0              1.0
1  2.0  1.0           0                     1.0              3.0              1.0
2  3.0  1.0           1                     0.0              3.0              1.0
3  4.0  1.0           0                     1.0              3.0              1.0
4  5.0  1.0           1                     0.0              5.0              1.0
Modification so that red points do not find themselves: query the nearest k=2 instead of k=1:
distance, index = tree.query(sample_data[['X','Y']], k=2)
Then, with the help of NumPy indexing, make the red points use the second match instead of the first (grid_color is 0 for blue and 1 for red, so it picks the correct column of the k=2 results directly):
sample_size = GRID[0].size
sample_data['nearest_point_distance'] = distance[np.arange(sample_size),sample_data.grid_color]
sample_data['nearest_point_X'] = red_points.X.values[index[np.arange(sample_size),sample_data.grid_color]]
sample_data['nearest_point_Y'] = red_points.Y.values[index[np.arange(sample_size),sample_data.grid_color]]
The output format is the same, but due to randomness it won't agree with the picture made earlier.
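As a quick sanity check (a small sketch of my own, not part of the original answer): every red point should now report a strictly positive distance, namely the distance to its nearest other red point.

# minimum distance among red points; should be > 0 once they skip themselves
print(sample_data.loc[sample_data.grid_color == 1, 'nearest_point_distance'].min())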
cKDTree from scipy can calculate that distance for you. Note that query() returns a (distances, indices) pair, so something along these lines should work:
from scipy.spatial import cKDTree

dist, _ = cKDTree(coordinates_of_red_points).query(df[['x', 'y']], k=1)
df['Distance_To_Red'] = dist
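A minimal self-contained sketch of the same idea (the column names and sample values here are assumptions, not from the question):

import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame({'x': [2, 4, 7], 'y': [12, 3, 9],
                   'Result': ['Positive', 'Negative', 'Positive']})

# coordinates of the red (positive) points
red = df.loc[df['Result'] == 'Positive', ['x', 'y']].to_numpy()

# nearest-red distance for every sample
dist, _ = cKDTree(red).query(df[['x', 'y']].to_numpy(), k=1)
df['Distance_To_Red'] = dist
print(df)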
import pandas as pd
import recordlinkage as rl

lst_left = [...]
lst_right = [...]

df_left = pd.DataFrame(lst_left, columns=pd.Index(["city_id", "street_name"]))
df_right = pd.DataFrame(lst_right, columns=pd.Index(["city_id", "street_name"]))

# candidate pairs: only compare rows that share the same city_id
indexer = rl.Index()
indexer.block("city_id")
pairs = indexer.index(df_left, df_right)

# fuzzy-compare street names within each candidate pair
compare = rl.Compare(indexing_type="label")
compare.string("street_name", "street_name", method="damerau_levenshtein", threshold=0.7)
features = compare.compute(pairs, df_left, df_right)

matches = features[features[0] == 1.0]
And I get the matched pairs as a MultiIndex:
Out[4]:
0
0 0 1.0
1 1 1.0
2 2 1.0
4 3 1.0
6 5 1.0
7 6 1.0
8 7 1.0
10 8 1.0
12 9 1.0
13 10 1.0
14 11 1.0
15 12 1.0
Now I want to left join (SQL left outer join) the df_left and df_right DataFrames based on those matched pairs, keeping the unmatched rows from df_left.
How can I do that?
P.S. To get only the matched records I use:
(df_left.loc[matches.index.get_level_values(0)].reset_index()
        .merge(df_right.loc[matches.index.get_level_values(1)].reset_index(),
               how="left", left_index=True, right_index=True))
But I don't know how to do the merge while keeping the unmatched rows from the left DataFrame.
Thank You
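One possible approach (a sketch, not from the original thread): turn the matches MultiIndex into a left-index-to-right-index mapping, join it onto df_left so unmatched rows keep a NaN, and then pull in df_right. If a left row matched several right rows it will be duplicated, as in a normal left outer join.

# left index -> right index mapping derived from the matches MultiIndex
link = pd.Series(matches.index.get_level_values(1),
                 index=matches.index.get_level_values(0),
                 name="right_idx")

result = (df_left.join(link)                      # unmatched left rows get NaN in right_idx
                 .merge(df_right, how="left",
                        left_on="right_idx", right_index=True,
                        suffixes=("_left", "_right")))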
I am new to Python and lost as to how to approach this problem: I have a dataframe where the information I need is mostly grouped in layers of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filtering by each ID and unstacking the rows into a new dataframe, but I cannot get much out of the unstack or groupby functions. Is there an easy function, or combination of functions, that can do this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has the information in 4 rows, but not always. The resulting dataframe should have only one row per ID occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value; GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which will probably give me more power for handling the data. In addition, these open() commands have trouble with non-ASCII characters that I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With an unequal number of rows per ID
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution: number the rows within each ID group, then pivot those row numbers into columns
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
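To apply the same idea directly to the semicolon-separated file from the question, a hedged sketch (the file name, the encoding, and the use of the first column as the ID are assumptions; adjust them to the real data):

import pandas as pd

# read the raw file; pandas handles the quoting and lets you pick an encoding
df = pd.read_csv('measurements.csv', sep=';', header=None, encoding='latin-1')
df = df.rename(columns={0: 'ID'})

# one row per ID, remaining columns unstacked side by side
g = df.groupby('ID').cumcount()
wide = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'{a}_{b + 1}' for a, b in wide.columns]
wide.to_csv('output.csv', sep=';')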
I have a dataframe with dates and a value per day. I want to see the gradient of the value, i.e. whether it is growing or declining. The best way is to apply a linear regression with day as x and value as y:
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'day': [1, 2, 4, 2, 3, 4],
                   'value': [1.5, 2.4, 3.6, 1.5, 1.3, 1.1]})
df:
customer day value
0 a 1 1.5
1 a 2 2.4
2 a 4 3.6
3 b 2 1.5
4 b 3 1.3
5 b 4 1.1
By hand I can do a linear regression:
from sklearn.linear_model import LinearRegression

def gradient(x, y):
    return LinearRegression().fit(x, y).coef_[0]

xa = df[df.customer == 'a'].day.values.reshape(-1, 1)
ya = df[df.customer == 'a'].value.values.reshape(-1, 1)
xb = df[df.customer == 'b'].day.values.reshape(-1, 1)
yb = df[df.customer == 'b'].value.values.reshape(-1, 1)
print(gradient(xa, ya), gradient(xb, yb))
result: [0.68571429] [-0.2]
But I would like to use a groupby as in
df.groupby('customer').agg({'value':['mean','sum','gradient']})
with an output like:
value
mean sum gradient
customer
a 2.5 7.5 0.685
b 1.3 3.9 -0.2
The issue is that the gradient needs two columns as input.
You can do:
# calculate the gradient per customer
# (y is passed 1-D so that coef_[0] is a scalar rather than a length-1 array)
v = (df
     .groupby('customer')
     .apply(lambda x: gradient(x['day'].to_numpy().reshape(-1, 1),
                               x['value'].to_numpy())))
v.name = 'gradient'

# calculate mean, sum
d1 = df.groupby('customer').agg({'value': ['mean', 'sum']})

# join the results
d1 = d1.join(v)

# flatten the column names
d1.columns = d1.columns.str.join('')

print(d1)
print(d1)
valuemean valuesum gradient
customer
a 2.5 7.5 0.685714
b 1.3 3.9 -0.200000
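An alternative sketch (my own suggestion, not from the answer above): np.polyfit gives the slope of a degree-1 fit, which is the same gradient, and a single apply can return all three statistics at once.

import numpy as np

def stats(g):
    # slope of a first-degree polynomial fit = gradient of value over day
    return pd.Series({'mean': g['value'].mean(),
                      'sum': g['value'].sum(),
                      'gradient': np.polyfit(g['day'], g['value'], 1)[0]})

print(df.groupby('customer').apply(stats))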
I am trying to fill all NaN values in each column with the mean of that column.
import numpy as np
import pandas as pd

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})
def impute_missing_values(table):
    for column in table:
        for value in column:
            if value == 'NaN':
                value = column.mean(skipna=True)
            else:
                value = value
impute_missing_values(table)
table
Why am I getting an error with this code?
IIUC:
table.fillna(table.mean())
Output:
A B C
0 1.0 3.0 4
1 2.0 3.0 5
2 1.5 3.0 6
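If only some columns should be filled, the same idea works column by column, for example:

table['A'] = table['A'].fillna(table['A'].mean())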
Okay, I am adding this as another answer because it isn't something I recommend at all. Using pandas methods vectorizes operations for better performance; loops should be avoided whenever possible.
However, here is a quick fix to your code:
import pandas as pd
import numpy as np
import math

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})

def impute_missing_values(df):
    for column in df:
        # .items() in current pandas (.iteritems() in older versions)
        for idx, value in df[column].items():
            if math.isnan(value):
                df.loc[idx, column] = df[column].mean(skipna=True)
            else:
                pass
    return df
impute_missing_values(table)
table
Output:
A B C
0 1.0 3.0 4
1 2.0 3.0 5
2 1.5 3.0 6
You can try the SimpleImputer from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) with the mean strategy.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

table = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [3, np.nan, np.nan],
                      'C': [4, 5, 6]})
print(table, '\n')

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns a plain ndarray, so reattach the original column labels
table_means = pd.DataFrame(imp.fit_transform(table), columns=table.columns)
print(table_means)
The print commands result in:
     A    B  C
0  1.0  3.0  4
1  2.0  NaN  5
2  NaN  NaN  6

     A    B    C
0  1.0  3.0  4.0
1  2.0  3.0  5.0
2  1.5  3.0  6.0
To correct your code (as per my comment below):
def impute_missing_values(table):
    for column in table:
        table.loc[:, column] = np.where(table[column].isna(), table[column].mean(), table[column])
    return table
I have a NumPy user-item matrix; each row corresponds to a user and each column corresponds to an item.
I want to convert the matrix into a pandas DataFrame like the following:
user item rating
0 1 1907 4.0
1 1 1028 5.0
2 1 608 4.0
3 1 2692 4.0
4 1 1193 5.0
I use the following code to generate a DataFrame:
predictions = pd.DataFrame(data=pred)
predictions = predictions.stack().reset_index(name='rating')
predictions.columns = ['user', 'item', 'rating']
and I obtain a df like this:
user item rating
0 0 0 5.000000
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.000000
Is there a way in pandas to map each value in the user and item columns to a value stored in a list? A user with value 0 should be mapped to the 1st element of the users list, a user with value 5 to the 6th element, and so on.
I'm trying:
predictions[["user"]].apply(lambda value: users[value])
but I get an IndexError I don't understand, because my users list has size 96:
IndexError: ('index 96 is out of bounds for axis 1 with size 96', 'occurred at index user')
My fault was in this code:
while not session.should_stop():
    predictions = session.run(decoder_op)
    pred = np.vstack((pred, predictions))
I just replaced it with:
np.vstack((pred, predictions))
and it works like a charm with:
predictions['user'] = predictions['user'].map(lambda value: users[value])
predictions['item'] = predictions['item'].map(lambda value: items[value])
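An alternative sketch (my own suggestion, assuming users and items are lists whose lengths match pred's shape): label the matrix before stacking, so the mapping step isn't needed at all.

# attach the real user/item labels as index and columns, then stack to long format
predictions = (pd.DataFrame(pred, index=users, columns=items)
                 .stack()
                 .reset_index(name='rating'))
predictions.columns = ['user', 'item', 'rating']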