I want to use values from dataframeA as upper and lower bounds to filter dataframeB

I have two dataframes A and B.
Dataframe A has 4 columns with 2 sets of maximum and minimums that I want to use as upper and lower bounds for 2 columns in dataframe B.
latitude = data['y']
longitude = data['x']
upper_lat = coords['lat_max']
lower_lat = coords['lat_min']
upper_lon = coords['long_max']
lower_lon = coords['long_min']
def filter_data_2(filter, upper_lat, lower_lat, upper_lon, lower_lon, lat, lon):
    v = filter[(lower_lat <= lat <= upper_lat ) & (lower_lon <= lon <= upper_lon)]
    return v
newdata = filter_data_2(data, upper_lat, lower_lat, upper_lon, lower_lon, latitude, longitude)
ValueError: Can only compare identically-labeled Series objects

MWE:
import pandas as pd
a = {'lower_lon': [2,4,6], 'upper_lon': [4,6,10], 'lower_lat': [1,3,5], 'upper_lat': [3,5,7]}
constraints = pd.DataFrame(data=a)
constraints
   lower_lon  upper_lon  lower_lat  upper_lat
0          2          4          1          3
1          4          6          3          5
2          6         10          5          7
b = {'lon' : [3, 5, 7, 9, 11, 13, 15], 'lat': [2, 4, 6, 8, 10, 12, 14]}
to_filter = pd.DataFrame(data=b)
to_filter
   lon  lat
0    3    2
1    5    4
2    7    6
3    9    8
4   11   10
5   13   12
6   15   14
lat = to_filter['lat']
lon = to_filter['lon']
lower_lon = constraints['lower_lon']
upper_lon = constraints['upper_lon']
lower_lat = constraints['lower_lat']
upper_lat = constraints['upper_lat']
v = to_filter[(lower_lat <= lat) & (lat <= upper_lat) & (lower_lon <= lon) & (lon <= upper_lon)]
Expected Results
v
   lon  lat
0    3    2
1    5    4
2    7    6

The overall filter is the union of the rows matched by each constraint. In pandas you could do:
v = pd.DataFrame()
for i in constraints.index:
    # Current constraints
    min_lon, max_lon, min_lat, max_lat = constraints.loc[i, :]
    # Apply filter; each comparison needs its own parentheses because & binds tighter than >= and <=
    df = to_filter[(to_filter.lon >= min_lon) & (to_filter.lon <= max_lon) & (to_filter.lat >= min_lat) & (to_filter.lat <= max_lat)]
    # Join previous and current filter outcomes in a single df
    v = pd.concat([v, df])
# Remove duplicates, if any
v = v.drop_duplicates()
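As an alternative to looping over the constraints, the same union of boxes can be sketched with NumPy broadcasting. This is only a sketch on the MWE data from the question; the names `lon`, `lat` and `inside` are introduced here for illustration, and the bounds are assumed to be inclusive as in the loop version.

```python
import pandas as pd

constraints = pd.DataFrame({
    'lower_lon': [2, 4, 6], 'upper_lon': [4, 6, 10],
    'lower_lat': [1, 3, 5], 'upper_lat': [3, 5, 7],
})
to_filter = pd.DataFrame({
    'lon': [3, 5, 7, 9, 11, 13, 15],
    'lat': [2, 4, 6, 8, 10, 12, 14],
})

# Shape (n_points, 1) so each point is compared against every constraint row.
lon = to_filter['lon'].to_numpy()[:, None]
lat = to_filter['lat'].to_numpy()[:, None]

# inside[i, j] is True when point i lies within box j (bounds inclusive).
inside = (
    (lon >= constraints['lower_lon'].to_numpy())
    & (lon <= constraints['upper_lon'].to_numpy())
    & (lat >= constraints['lower_lat'].to_numpy())
    & (lat <= constraints['upper_lat'].to_numpy())
)

# Keep a point if it falls inside at least one box.
v = to_filter[inside.any(axis=1)]
print(v)
```

The intermediate mask has one cell per (point, constraint) pair, so this trades memory for speed; with very many constraints the loop may still be the safer choice.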

Related

How can I aggregate strings from many cells into one cell?

Say I have two classes with a handful of students each, and I want to think of the possible pairings in each class. In my original data, I have one line per student.
What's the easiest way in Pandas to turn this dataset
   Class Students
0      1        A
1      1        B
2      1        C
3      1        D
4      1        E
5      2        F
6      2        G
7      2        H
Into this new stuff?
    Class Students
0       1      A,B
1       1      A,C
2       1      A,D
3       1      A,E
4       1      B,C
5       1      B,D
6       1      B,E
7       1      C,D
8       1      C,E
9       1      D,E
10      2      F,G
11      2      F,H
12      2      G,H

Try this:
import itertools
import pandas as pd
cla = [1, 1, 1, 1, 1, 2, 2, 2]
s = ["A", "B", "C", "D", "E", "F", "G", "H"]
df = pd.DataFrame(cla, columns=["Class"])
df['Student'] = s
def create_combos(list_students):
    # All unordered pairs of students, joined as "X,Y"
    combos = itertools.combinations(list_students, 2)
    str_students = []
    for i in combos:
        str_students.append(str(i[0])+","+str(i[1]))
    return str_students
def iterate_df(class_id):
    # Pairs for one class, plus a matching list of class ids
    df_temp = df.loc[df['Class'] == class_id]
    list_student = list(df_temp['Student'])
    list_combos = create_combos(list_student)
    list_id = [class_id for i in list_combos]
    return list_id, list_combos
# Sorted so the classes are processed in a deterministic order
list_classes = sorted(set(df['Class']))
new_id = []
new_combos = []
for idx in list_classes:
    tmp_id, tmp_combo = iterate_df(idx)
    new_id += tmp_id
    new_combos += tmp_combo
new_df = pd.DataFrame(new_id, columns=["Class"])
new_df["Student"] = new_combos
print(new_df)
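A more compact variant of the same idea, sketched with `groupby` and `itertools.combinations`. The column is named `Students` here to match the question, and `rows` and `pairs` are names introduced only for this example.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'Class': [1, 1, 1, 1, 1, 2, 2, 2],
    'Students': list('ABCDEFGH'),
})

# One row per unordered pair, built class by class.
rows = [
    (class_id, f"{a},{b}")
    for class_id, group in df.groupby('Class')['Students']
    for a, b in combinations(group, 2)
]
pairs = pd.DataFrame(rows, columns=['Class', 'Students'])
print(pairs)
```

Because `groupby` sorts the group keys by default, the classes come out in ascending order, and `combinations` keeps the students in their original row order within each class.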

How to do the formulas without splitting the dataframe which had different conditions

I have the following dataframe
import pandas as pd
d = {
    'ID':[1,2,3,4,5],
    'Price1':[5,9,4,3,9],
    'Price2':[9,10,13,14,18],
    'Type':['A','A','B','C','D'],
}
df = pd.DataFrame(data = d)
df
To apply a formula without any condition, I use the following code:
df = df.eval(
    'Price = (Price1*Price1)/2'
)
df
How can I apply a different formula for each type without splitting the dataframe?
I need a new column called Price_on_type.
The formula differs for each type:
For type A, Price_on_type = Price1+Price2
For type B, Price_on_type = (Price1+Price2)/2
For type C, Price_on_type = Price1
For type D, Price_on_type = Price2
Expected Output:
import pandas as pd
d = {
    'ID':[1,2,3,4,5],
    'Price1':[5,9,4,3,9],
    'Price2':[9,10,13,14,18],
    'Price':[12.5,40.5, 8.0, 4.5, 40.5],
    'Price_on_type':[14,19,8.5,3,18],
}
df = pd.DataFrame(data = d)
df

You can use numpy.select:
import numpy as np
masks = [df['Type'] == 'A',
         df['Type'] == 'B',
         df['Type'] == 'C',
         df['Type'] == 'D']
vals = [df.eval('Price1 + Price2'),
        df.eval('(Price1 + Price2) / 2'),
        df['Price1'],
        df['Price2']]
Or:
vals = [df['Price1'] + df['Price2'],
        (df['Price1'] + df['Price2']) / 2,
        df['Price1'],
        df['Price2']]
df['Price_on_type'] = np.select(masks, vals)
print (df)
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
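One detail worth knowing about `np.select`: rows that match none of the masks receive its `default` value, which is 0 unless you say otherwise. The sketch below passes `default=np.nan` so that an unhandled type stands out; the type `'E'` in the last row is a hypothetical value added purely to show this and is not part of the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Price1': [5, 9, 4, 3, 9],
    'Price2': [9, 10, 13, 14, 18],
    'Type': ['A', 'A', 'B', 'C', 'E'],  # 'E' is hypothetical: no formula is defined for it
})

masks = [df['Type'] == t for t in ['A', 'B', 'C', 'D']]
vals = [
    df['Price1'] + df['Price2'],
    (df['Price1'] + df['Price2']) / 2,
    df['Price1'],
    df['Price2'],
]

# Rows matching no mask get NaN instead of the silent default of 0.
df['Price_on_type'] = np.select(masks, vals, default=np.nan)
print(df)
```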

If your data is not too big, you can use apply with a custom function on axis=1:
def Prices(x):
    dict_sw = {
        'A': x.Price1 + x.Price2,
        'B': (x.Price1 + x.Price2)/2,
        'C': x.Price1,
        'D': x.Price2,
    }
    return dict_sw[x.Type]
In [239]: df['Price_on_type'] = df.apply(Prices, axis=1)
In [240]: df
Out[240]:
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
Or use the trick that True converts to 1 and False to 0 in arithmetic:
df['Price_on_type'] = \
    (df.Type == 'A') * (df.Price1 + df.Price2) + \
    (df.Type == 'B') * (df.Price1 + df.Price2)/2 + \
    (df.Type == 'C') * df.Price1 + \
    (df.Type == 'D') * df.Price2
Out[308]:
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0

Getting count of rows from breakpoints of different column

Consider a dataframe with two columns, A and B. How can I split column A into deciles and use those decile breakpoints to count the rows of column B that fall in each range?
import pandas as pd
import numpy as np
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df=pd.read_excel(r"E:\Sai\Development\UCG\qcut.xlsx")
df['Range']=pd.qcut(df['a'],10)
df_gb=df.groupby('Range',as_index=False).agg({'a':[min,max,np.size]})
df_gb.columns = df_gb.columns.droplevel()
df_gb=df_gb.rename(columns={'':'Range','size':'count_A'})
df['Range_B']=0
# Assign through df.loc[mask, column] rather than chained df[column].loc[mask]
df.loc[df['b']<=df_gb['max'][0], 'Range_B']=1
df.loc[(df['b']>df_gb['max'][0]) & (df['b']<=df_gb['max'][1]), 'Range_B']=2
df.loc[(df['b']>df_gb['max'][1]) & (df['b']<=df_gb['max'][2]), 'Range_B']=3
df.loc[(df['b']>df_gb['max'][2]) & (df['b']<=df_gb['max'][3]), 'Range_B']=4
df.loc[(df['b']>df_gb['max'][3]) & (df['b']<=df_gb['max'][4]), 'Range_B']=5
df.loc[(df['b']>df_gb['max'][4]) & (df['b']<=df_gb['max'][5]), 'Range_B']=6
df.loc[(df['b']>df_gb['max'][5]) & (df['b']<=df_gb['max'][6]), 'Range_B']=7
df.loc[(df['b']>df_gb['max'][6]) & (df['b']<=df_gb['max'][7]), 'Range_B']=8
df.loc[(df['b']>df_gb['max'][7]) & (df['b']<=df_gb['max'][8]), 'Range_B']=9
df.loc[df['b']>df_gb['max'][8], 'Range_B']=10
df_gb_b=df.groupby('Range_B',as_index=False).agg({'b':np.size})
df_gb_b=df_gb_b.rename(columns={'b':'count_B'})
df_final = pd.concat([df_gb, df_gb_b], axis=1)
df_final=df_final[['Range','count_A','count_B']]
Is there a simpler solution? I intend to do this for many columns.

I hope this helps:
df['Range'] = pd.qcut(df['a'], 10)
df2 = df.groupby(['Range'])['a'].count().reset_index().rename(columns = {'a':'count_A'})
for item in df2['Range'].values:
    # item is a pandas Interval, so "x in item" tests whether x falls in that decile
    df2.loc[df2['Range'] == item, 'count_B'] = df['b'].apply(lambda x: x in item).sum()
df2 = df2.sort_values('Range', ascending = True)
If you additionally want to count values of b that fall outside the range of a:
min_border = df2['Range'].values[0].left
max_border = df2['Range'].values[-1].right
df2.loc[0, 'count_B'] += df.loc[df['b'] <= min_border, 'b'].count()
df2.iloc[-1, 2] += df.loc[df['b'] > max_border, 'b'].count()

One way -
df = pd.DataFrame({'A': np.random.randint(0, 100, 20), 'B': np.random.randint(0, 10, 20)})
bins = [0, 1, 4, 8, 16, 32, 60, 100, 200, 500, 5999]
labels = ["{0} - {1}".format(i, j) for i, j in zip(bins, bins[1:])]
df['group_A'] = pd.cut(df['A'], bins, right=False, labels=labels)
df['group_B'] = pd.cut(df.B, bins, right=False, labels=labels)
df1 = df.groupby(['group_A'])['A'].count().reset_index()
df2 = df.groupby(['group_B'])['B'].count().reset_index()
df_final = pd.merge(df1, df2, left_on =['group_A'], right_on =['group_B']).drop(['group_B'], axis=1).rename(columns={'group_A': 'group'})
print(df_final)
Output
        group  A  B
0       0 - 1  0  1
1       1 - 4  1  3
2       4 - 8  1  9
3      8 - 16  2  7
4     16 - 32  3  0
5     32 - 60  7  0
6    60 - 100  6  0
7   100 - 200  0  0
8   200 - 500  0  0
9  500 - 5999  0  0
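Another way to reuse the decile breakpoints of one column on another is to ask `pd.qcut` for the bin edges with `retbins=True` and pass them to `pd.cut`. The following is a sketch on randomly generated data, not the asker's spreadsheet; `edges` and `counts` are names chosen for this example, and widening the outer edges to infinity is an assumption about how out-of-range values of `b` should be treated.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the spreadsheet (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'a': rng.normal(50, 10, 200),
    'b': rng.normal(50, 15, 200),
})

# Decile edges of 'a'; retbins=True also returns the 11 breakpoints.
_, edges = pd.qcut(df['a'], 10, retbins=True)

# Widen the outer edges so values of 'b' outside a's range are still counted.
edges[0], edges[-1] = -np.inf, np.inf

# Bin both columns with the same edges and count rows per bin, in bin order.
counts = pd.DataFrame({
    'count_A': pd.cut(df['a'], edges).value_counts(sort=False),
    'count_B': pd.cut(df['b'], edges).value_counts(sort=False),
})
print(counts)
```

To repeat this for many columns, the last step can be run in a loop over column names, since `edges` is computed only once.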

Apply function on a two dataframe rows

Given a pandas dataframe like this:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
   col1  col2
0     1     4
1     2     5
2     3     6
I would like to do something equivalent to this using a function but without passing "by value" or as a global variable the whole dataframe (it could be huge and then it would give me a memory error):
i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
    # df.ix was removed in pandas 1.0; .iloc works here because the index is the default 0..n-1
    c1 = df.iloc[i, 0] + df.iloc[index, 0]
    c2 = df.iloc[i, 1] + df.iloc[index, 1]
    df.iloc[index, 0] = c1
    df.iloc[index, 1] = c2
    i = index
   col1  col2
0     1     4
1     3     9
2     6    15
i.e., I would like to have a function which will give me the previous output:
def my_function(two_rows):
    row1 = two_rows[0]
    row2 = two_rows[1]
    c1 = row1[0] + row2[0]
    c2 = row1[1] + row2[1]
    row2[0] = c1
    row2[1] = c2
    return row2
df.apply(my_function, axis=1)
df
   col1  col2
0     1     4
1     3     9
2     6    15
Is there a way of doing this?

What you've demonstrated is a cumsum:
df.cumsum()
   col1  col2
0     1     4
1     3     9
2     6    15
To define a function as a loop that does this in place (slow, cell by cell):
def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            df[j].values[i] += df[j].values[i - 1]
    return df
f(df)
   col1  col2
0     1     4
1     3     9
2     6    15
A compromise between memory and efficiency (one column at a time):
def f(df):
    for j in df.columns:
        df[j].values[:] = df[j].values.cumsum()
    return df
f(df)
   col1  col2
0     1     4
1     3     9
2     6    15
Note that you don't need to return df. I chose to for convenience.
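To make the distinction concrete: the loop in the question adds the already updated previous row, which is why it amounts to a running total. If only the original previous row were wanted instead, a shifted copy would do. The sketch below contrasts the two on the question's data; `running` and `pairwise` are names introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Running total: each row adds the already-updated previous row.
running = df.cumsum()

# Pairwise sum: each row adds only the original previous row.
# fill_value=0 leaves the first row unchanged and keeps integer dtypes.
pairwise = df + df.shift(1, fill_value=0)

print(running)
print(pairwise)
```

Both operations work column by column on the whole frame, so no row-level Python function needs to receive the dataframe.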

How to use r kmeans cluster vector to repaint plot?

km = kmeans(FourA,3)
km$cluster
[1] 1 1 1 2 1 1 2 2 2 2 3 2 ...
How do I use the km$cluster vector to create 3 new arrays so that I can plot the graph with the three clusters using a different character/color?

For your reference
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 3, nstart = 25)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:3, pch = 8)
