Select Rows That Do Not Contain Any Negative or Missing Values - sql

Assume a database table has a few hundred columns. In SQL statements, how would you select rows/records that do not contain any negative or missing value? Can you do it using the sqldf package for R users?
Here is an example data frame with 6 rows and 2 columns:
D = data.frame(X = c(23, -24, 35, 12, 34, 41),
               Y = c(100, 98, 89, NA, 56, 90))
The SQL statement(s) should return a table containing only rows 1, 3, 5, and 6.

text = "X Y
23 100
-24 98
35 89
12 NA
34 56
41 90"
df = read.table(text=text, header = T)
# install.packages("sqldf")
library(sqldf)
conditions = c(">=0","NOT NULL")
columns = colnames(df)
applyConditions <- function(columns, conditions){
  grid = expand.grid(columns, conditions)
  apply(grid, 1, function(x) paste(x, collapse = " "))
}
select <- "SELECT * FROM df where "
where <- paste(applyConditions(columns,conditions),collapse = " AND ")
sqldf(paste(select,where))
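For reference, with this two-column example the constructed query (the value of paste(select, where)) is:
SELECT * FROM df where X >=0 AND Y >=0 AND X NOT NULL AND Y NOT NULL
sqldf runs this against SQLite by default, where the postfix "NOT NULL" operator is shorthand for "IS NOT NULL", so the query keeps exactly rows 1, 3, 5, and 6.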


Remove identical values and keep only the differing ones

I would like to know if there is a better solution that keeps only the differing values (so they are easy to spot) and removes the identical values in certain columns.
merged = pd.merge(us_df, gb_df, how='outer', indicator=True)
res = pd.merge(merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1),
               merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1),
               on=us_df.columns.tolist()[0:col_range],
               how='outer',
               suffixes=('_US', '_GB')).fillna(' ')
cols = [col for col in res.columns.tolist() if '_US' in col or '_GB' in col]
sorted_cols = [col for col in res.columns.tolist() if '_US' not in col and '_GB' not in col] + sorted(cols)
I get this table (res):
Id  ages_GB  ages_US  salary_GB  salary_US
 6       45       45         34         67
43       12       11         65         65
So far, I used this iteration:
cols = ['ages_US', 'salary_US', 'ages_GB', 'salary_GB']
for i, row in res.iterrows():
    for us, gb in zip(cols[:len(cols) // 2], cols[len(cols) // 2:]):
        if row[us] == row[gb]:
            res.at[i, us] = res.at[i, gb] = ' '
to get this result, where identical values in the cols columns are replaced with " " (a space):
Id  ages_GB  ages_US  salary_GB  salary_US
 6                           34         67
43       12       11
Is there another method to get a similar result?
Given your example, I think loc offers a simpler solution, assuming you want to compare two sets of columns.
I will first recreate a reproducible example of your dataset (I would recommend you create this in future questions, as it makes it easier to understand and answer your question: How to create a Minimal, Reproducible Example).
import pandas as pd

d = {
    'ages_GB': [45, 12],
    'ages_US': [45, 11],
    'salary_GB': [34, 65],
    'salary_US': [67, 65]
}
df = pd.DataFrame(data=d)
print(df)
Initial DataFrame
   ages_GB  ages_US  salary_GB  salary_US
0       45       45         34         67
1       12       11         65         65
The simplest solution I can think of is to use loc to reassign records to "" or NaN where ages_GB == ages_US, and likewise where salary_GB == salary_US.
df.loc[df.ages_GB == df.ages_US, ['ages_GB', 'ages_US']] = ["", ""]
df.loc[df.salary_GB == df.salary_US, ['salary_GB', 'salary_US']] = ["", ""]
Output
  ages_GB ages_US salary_GB salary_US
0                        34        67
1      12      11
For a generic method, you can groupby on axis=1 using the column prefixes, and get the duplicated values to use with mask:
prefix = df.columns.str.extract('^([^_]+)', expand=False)
# ['Id', 'ages', 'ages', 'salary', 'salary']
m = df.groupby(prefix, axis=1).transform(lambda s: s.duplicated(keep=False))
out = df.mask(m, '')
Output:
   Id ages_GB ages_US salary_GB salary_US
0   6                        34        67
1  43      12      11
Intermediate m:
      Id  ages_GB  ages_US  salary_GB  salary_US
0  False     True     True      False      False
1  False    False    False       True       True
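Note that groupby(..., axis=1) is deprecated in recent pandas (2.1+). A possible equivalent of the same idea (a sketch, assuming the df and prefix defined above) is to transpose, group the rows (the original columns) by prefix, then transpose back:
# Same mask as above, without groupby(axis=1): group the transposed frame's rows
m = df.T.groupby(prefix.to_numpy()).transform(lambda s: s.duplicated(keep=False)).T.astype(bool)
out = df.mask(m, '')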

I want to use values from dataframeA as upper and lower bounds to filter dataframeB

I have two dataframes A and B.
Dataframe A has 4 columns with 2 sets of maximums and minimums that I want to use as upper and lower bounds for 2 columns in dataframe B.
latitude = data['y']
longitude = data['x']
upper_lat = coords['lat_max']
lower_lat = coords['lat_min']
upper_lon = coords['long_max']
lower_lon = coords['long_min']
def filter_data_2(filter, upper_lat, lower_lat, upper_lon, lower_lon, lat, lon):
    v = filter[(lower_lat <= lat <= upper_lat) & (lower_lon <= lon <= upper_lon)]
    return v
newdata = filter_data_2(data, upper_lat, lower_lat, upper_lon, lower_lon, latitude, longitude)
ValueError: Can only compare identically-labeled Series objects
MWE:
import pandas as pd
a = {'lower_lon': [2,4,6], 'upper_lon': [4,6,10], 'lower_lat': [1,3,5], 'upper_lat': [3,5,7]}
constraints = pd.DataFrame(data=a)
constraints
lower_lon upper_lon lower_lat upper_lat
0 2 4 1 3
1 4 6 3 5
2 6 10 5 7
b = {'lon' : [3, 5, 7, 9, 11, 13, 15], 'lat': [2, 4, 6, 8, 10, 12, 14]}
to_filter = pd.DataFrame(data=b)
to_filter
lon lat
0 3 2
1 5 4
2 7 6
3 9 8
4 11 10
5 13 12
6 15 14
lat = to_filter['lat']
lon = to_filter['lon']
lower_lon = constraints['lower_lon']
upper_lon = constraints['upper_lon']
lower_lat = constraints['lower_lat']
upper_lat = constraints['upper_lat']
v = to_filter[(lower_lat <= lat) & (lat <= upper_lat) & (lower_lon <= lon) & (lon <= upper_lon)]
Expected Results
v
lon lat
0 3 2
1 5 4
2 7 6
The ValueError occurs because pandas aligns Series comparisons by index: lat and lon come from to_filter (7 rows) while the bounds come from constraints (3 rows), so the labels do not match. The global filter is the union of the sets matched by each constraint; in pandas you could:
v = pd.DataFrame()
for i in constraints.index:
    # Current constraints
    min_lon, max_lon, min_lat, max_lat = constraints.loc[i, :]
    # Apply filter (the parentheses are required: & binds tighter than the comparisons)
    df = to_filter[(to_filter.lon >= min_lon) & (to_filter.lon <= max_lon) &
                   (to_filter.lat >= min_lat) & (to_filter.lat <= max_lat)]
    # Join previous and current filter outcomes into a single df
    v = pd.concat([v, df])
# Remove duplicates, if any
v = v.drop_duplicates()
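A vectorized alternative (a sketch, assuming the constraints and to_filter frames above): compare every point against every constraint box at once via NumPy broadcasting and keep the rows that fall inside any box:
import numpy as np
lon = to_filter['lon'].to_numpy()[:, None]  # shape (n_points, 1)
lat = to_filter['lat'].to_numpy()[:, None]
inside = ((lon >= constraints['lower_lon'].to_numpy()) & (lon <= constraints['upper_lon'].to_numpy()) &
          (lat >= constraints['lower_lat'].to_numpy()) & (lat <= constraints['upper_lat'].to_numpy()))
v = to_filter[inside.any(axis=1)]  # inside has shape (n_points, n_constraints); any() takes the union
On the MWE this keeps rows 0, 1, and 2, matching the expected result.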

geom_violin using the weight aesthetic unexpectedly drops levels

library(tidyverse)
set.seed(12345)
dat <- data.frame(year = c(rep(1990, 100), rep(1991, 100), rep(1992, 100)),
                  fish_length = sample(x = seq(from = 10, 131, by = 0.1), 300, replace = F),
                  nb_caught = sample(x = seq(from = 1, 200, by = 0.1), 300, replace = T),
                  stringsAsFactors = F) %>%
  mutate(age = ifelse(fish_length < 20, 1,
              ifelse(fish_length >= 20 & fish_length < 100, 2,
              ifelse(fish_length >= 100 & fish_length < 130, 3, 4)))) %>%
  arrange(year, fish_length)
head(dat)
year fish_length nb_caught age
1 1990 10.1 45.2 1
2 1990 10.7 170.0 1
3 1990 10.9 62.0 1
4 1990 12.1 136.0 1
5 1990 14.1 80.8 1
6 1990 15.0 188.9 1
dat %>% group_by(year) %>% summarise(ages = n_distinct(age)) # Only 1992 has age 4 fish
# A tibble: 3 x 2
year ages
<dbl> <int>
1 1990 3
2 1991 3
3 1992 4
dat %>% filter(age == 4) # only 1 row for age 4
year fish_length nb_caught age
1 1992 130.8 89.2 4
Here:
year = year of sampling
fish_length = length of the fish in cm
nb_caught = number of fish caught following the use of an age-length key, hence explaining the presence of decimals
age = age of the fish
graph1: geom_violin not using the weight aesthetic.
Here, I replicate each line of dat according to the value found in nb_caught.
dim(dat) # 300 rows
dat_graph1 <- dat[rep(1:nrow(dat), floor(dat$nb_caught)), ]
dim(dat_graph1) # 30932 rows
dat_graph1$nb_caught <- NULL # useless now
sum(dat$nb_caught) - nrow(dat_graph1) # 128.2 rows lost here
Since nb_caught has decimal values, I took the integer part to create dat_graph1. I lost 128.2 "rows" in the process.
Now for the graph:
dat_tile <- data.frame(year = sort(unique(dat$year))[sort(unique(dat$year)) %% 2 == 0])
# for the figure's background
graph1 <- ggplot(data = dat_graph1,
                 aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
                     color = as.factor(age), .drop = F)) +
  geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
            fill = "grey80", inherit.aes = F) +
  geom_violin(draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
              scale = "width", position = "dodge") +
  scale_x_discrete(expand = c(0,0)) +
  labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph1") +
  scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_y_continuous(expand = expand_scale(mult = 0.01)) +
  theme_bw()
graph1
Note here that I have a flat bar for age 4 in year 1992.
dat_graph1 %>% filter(year == 1992, age == 4) %>% pull(fish_length) %>% unique
[1] 130.8
That is because I only have one length for that particular year-age combination.
graph2: geom_violin using the weight aesthetic.
Now, instead of copying each row of dat by the value of number_caught, let's use the weight aesthetic.
Let's calculate the weight wt that each line of dat will have in the computation of the density curve of each year-age combination.
dat_graph2 <- dat %>%
  group_by(year, age) %>%
  mutate(wt = nb_caught / sum(nb_caught)) %>%
  as.data.frame()
head(dat_graph2)
year fish_length nb_caught age wt
1 1990 10.1 45.2 1 0.03573123
2 1990 10.7 170.0 1 0.13438735
3 1990 10.9 62.0 1 0.04901186
4 1990 12.1 136.0 1 0.10750988
5 1990 14.1 80.8 1 0.06387352
6 1990 15.0 188.9 1 0.14932806
graph2 <- ggplot(data = dat_graph2,
                 aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
                     color = as.factor(age), .drop = F)) +
  geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
            fill = "grey80", inherit.aes = F) +
  geom_violin(aes(weight = wt), draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
              scale = "width", position = "dodge") +
  scale_x_discrete(expand = c(0,0)) +
  labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph2") +
  scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_y_continuous(expand = expand_scale(mult = 0.01)) +
  theme_bw()
graph2
dat_graph2 %>% filter(year == 1992, age == 4)
year fish_length nb_caught age wt
1 1992 130.8 89.2 4 1
Note here that the flat bar for age 4 in year 1992 seen in graph1 has been dropped, even though the row exists in dat_graph2.
My questions
Why is the age-4 level in 1992 dropped when using the weight aesthetic, and how can I overcome this?
Why are the two graphs not visually alike even though they used the same data?
Thanks in advance for your help!
1.
Problem 1 is not related to using the weight aesthetic. You can check this by dropping the weight aesthetic in the code for your second graph. The problem is that the algorithm for computing the density fails when there are too few observations.
That is why group 4 shows up in graph 1 with the expanded dataset: there you increase the number of observations by replicating rows.
Unfortunately, geom_violin gives no warning in your specific case. However, if you filter dat_graph2 for age == 4, geom_violin gives you the warning:
Warning message:
Computation failed in `stat_ydensity()`:
replacement has 1 row, data has 0
geom_density is much clearer on this issue, giving a warning that groups with fewer than two observations have been dropped.
Unfortunately, I have no solution to overcome this other than working with the expanded dataset.
2.
Concerning problem 2, I have no convincing answer, except that I suspect it is related to the details of the kernel density estimator used by geom_violin, geom_density, etc., and perhaps also to the number of data points.

insert value into random row

I have a dataframe as below.
D1 = pd.DataFrame({'a': [15, 22, 107, 120],
                   'b': [25, 21, 95, 110]})
I am trying to randomly insert two zeroes into column 'b' to get the effect below; each inserted 0 shifts the following rows down by one.
D1 = pd.DataFrame({'a': [15, 22, 107, 120, 0, 0],
                   'b': [0, 25, 21, 0, 95, 110]})
Everything I have seen is about inserting into the whole column as opposed to individual rows.
Here is one potential way to achieve this using numpy.random.randint and numpy.insert:
import numpy as np
import pandas as pd

n = 2
rand_idx = np.random.randint(0, len(D1), size=n)
# Append 'n' rows of zeroes to D1
D2 = D1.append(pd.DataFrame(np.zeros((n, D1.shape[1])), columns=D1.columns, dtype=int),
               ignore_index=True)
# Insert n zeroes at the random indices and assign back to column 'b'
D2['b'] = np.insert(D1['b'].values, rand_idx, 0)
print(D2)
     a    b
0   15   25
1   22    0
2  107    0
3  120   21
4    0   95
5    0  110
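Note that DataFrame.append was removed in pandas 2.0; the append step can be written with pd.concat instead (a sketch, reusing n and rand_idx from above):
# Append 'n' all-zero rows with pd.concat instead of the removed DataFrame.append
zeros = pd.DataFrame(np.zeros((n, D1.shape[1]), dtype=int), columns=D1.columns)
D2 = pd.concat([D1, zeros], ignore_index=True)
D2['b'] = np.insert(D1['b'].to_numpy(), rand_idx, 0)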
Use numpy.insert with set positions - at the end (len(D1)) for a, and at random positions for b:
n = 2
new = np.zeros(n, dtype=int)
a = np.insert(D1['a'].values, len(D1), new)
b = np.insert(D1['b'].values, np.random.randint(0, len(D1), size=n), new)
# pandas 0.24+
# a = np.insert(D1['a'].to_numpy(), len(D1), new)
# b = np.insert(D1['b'].to_numpy(), np.random.randint(0, len(D1), size=n), new)
df = pd.DataFrame({'a': a, 'b': b})
print(df)
     a    b
0   15    0
1   22   25
2  107   21
3  120    0
4    0   95
5    0  110
(one possible result; the positions of the zeroes in b are random)

Getting count of rows from breakpoints of different column

Consider two columns A and B in a dataframe. How can I decile column A and use those decile breakpoints to count the rows of column B falling in each bin?
import pandas as pd
import numpy as np
df = pd.read_excel(r"E:\Sai\Development\UCG\qcut.xlsx")  # raw string so the backslashes are not escapes
df['Range']=pd.qcut(df['a'],10)
df_gb=df.groupby('Range',as_index=False).agg({'a':[min,max,np.size]})
df_gb.columns = df_gb.columns.droplevel()
df_gb=df_gb.rename(columns={'':'Range','size':'count_A'})
df['Range_B']=0
df['Range_B'].loc[df['b']<=df_gb['max'][0]]=1
df['Range_B'].loc[(df['b']>df_gb['max'][0]) & (df['b']<=df_gb['max'][1])]=2
df['Range_B'].loc[(df['b']>df_gb['max'][1]) & (df['b']<=df_gb['max'][2])]=3
df['Range_B'].loc[(df['b']>df_gb['max'][2]) & (df['b']<=df_gb['max'][3])]=4
df['Range_B'].loc[(df['b']>df_gb['max'][3]) & (df['b']<=df_gb['max'][4])]=5
df['Range_B'].loc[(df['b']>df_gb['max'][4]) & (df['b']<=df_gb['max'][5])]=6
df['Range_B'].loc[(df['b']>df_gb['max'][5]) & (df['b']<=df_gb['max'][6])]=7
df['Range_B'].loc[(df['b']>df_gb['max'][6]) & (df['b']<=df_gb['max'][7])]=8
df['Range_B'].loc[(df['b']>df_gb['max'][7]) & (df['b']<=df_gb['max'][8])]=9
df['Range_B'].loc[df['b']>df_gb['max'][8]]=10
df_gb_b=df.groupby('Range_B',as_index=False).agg({'b':np.size})
df_gb_b=df_gb_b.rename(columns={'b':'count_B'})
df_final = pd.concat([df_gb, df_gb_b], axis=1)
df_final=df_final[['Range','count_A','count_B']]
Is there any simpler solution, as I intend to do this for many columns?
I hope this helps:
df['Range'] = pd.qcut(df['a'], 10)
df2 = df.groupby(['Range'])['a'].count().reset_index().rename(columns = {'a':'count_A'})
for item in df2['Range'].values:
    df2.loc[df2['Range'] == item, 'count_B'] = df['b'].apply(lambda x: x in item).sum()
df2 = df2.sort_values('Range', ascending = True)
If you additionally want to count values of b that fall outside the range of a:
min_border = df2['Range'].values[0].left
max_border = df2['Range'].values[-1].right
df2.loc[0, 'count_B'] += df.loc[df['b'] <= min_border, 'b'].count()
df2.iloc[-1, 2] += df.loc[df['b'] > max_border, 'b'].count()
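A more compact variant of the same idea (a sketch; it assumes the decile edges of a should also bin b, and values of b outside a's range come out as NaN and are dropped) is to capture the edges with retbins=True and reuse them with pd.cut:
# Reuse the decile edges of 'a' (returned by retbins=True) to bin 'b'
df['Range'], edges = pd.qcut(df['a'], 10, retbins=True)
count_A = df['Range'].value_counts(sort=False)
count_B = pd.cut(df['b'], bins=edges, include_lowest=True).value_counts(sort=False)
df_final = pd.DataFrame({'Range': count_A.index.astype(str),
                         'count_A': count_A.to_numpy(),
                         'count_B': count_B.to_numpy()})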
One way -
df = pd.DataFrame({'A': np.random.randint(0, 100, 20), 'B': np.random.randint(0, 10, 20)})
bins = [0, 1, 4, 8, 16, 32, 60, 100, 200, 500, 5999]
labels = ["{0} - {1}".format(i, j) for i, j in zip(bins, bins[1:])]
df['group_A'] = pd.cut(df['A'], bins, right=False, labels=labels)
df['group_B'] = pd.cut(df.B, bins, right=False, labels=labels)
df1 = df.groupby(['group_A'])['A'].count().reset_index()
df2 = df.groupby(['group_B'])['B'].count().reset_index()
df_final = (pd.merge(df1, df2, left_on=['group_A'], right_on=['group_B'])
              .drop(['group_B'], axis=1)
              .rename(columns={'group_A': 'group'}))
print(df_final)
Output
        group  A  B
0       0 - 1  0  1
1       1 - 4  1  3
2       4 - 8  1  9
3      8 - 16  2  7
4     16 - 32  3  0
5     32 - 60  7  0
6    60 - 100  6  0
7   100 - 200  0  0
8   200 - 500  0  0
9  500 - 5999  0  0