How can I set conditions for dataframes? - pandas

Here is my xlsx file: /we.tl/t-ghXIOjPznq
https://imgur.com/b8kTbNV
I have a dataframe like the one shown. I want to select only the rows where the LITHOLOGY column is 1. To do that:
import pandas as pd

df2 = pd.read_excel('V131BLOG.xlsx')
LITHOLOGY = [1]
df2[df2.LITHOLOGY.isin(LITHOLOGY)]
There hasn't been a problem so far. I was able to filter as I wanted.
https://imgur.com/wcSvokM
In addition, I want to keep the LITHOLOGY == 1 rows only if the layer they belong to is thick enough. What I mean is that the cumulative difference of consecutive DEPTH_MD values within such a group should be bigger than 10 cm. I have made no progress on this. What path should I follow?
As you can see in this figure (https://imgur.com/a/02nlUUl), there are consecutive runs of LITHOLOGY equal to 1. But when you check the DEPTH_MD values, the upper run spans 10 cm while the lower run spans only 5 cm. I want to create a dataframe that only contains the runs spanning at least 10 cm of DEPTH_MD.
Input:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1980 329.00 26.8964 25.47160 2 2.99103 2.62130
1981 329.05 26.8574 32.54390 2 2.94772 2.58945
1982 329.10 27.1297 28.83750 1 2.90123 2.55601
1983 329.15 26.9742 17.91150 2 2.80383 2.52327
1984 329.20 28.3946 31.94310 2 2.76041 2.49050
1985 329.25 30.9402 17.63760 1 2.71992 2.46051
1986 329.30 35.2419 17.69170 1 2.67355 2.42852
1987 329.35 37.9206 17.74620 1 2.61838 2.33619
1988 329.40 39.9189 24.84460 2 2.56200 2.28671
1989 329.45 41.4947 7.03354 2 2.50669 2.23887
1990 329.50 41.5473 7.03354 2 2.42167 2.19944
1991 329.55 41.0158 10.58260 2 2.40039 2.17235
Expected output:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1985 329.25 30.9402 17.6376 1 2.71992 2.46051
1986 329.30 35.2419 17.6917 1 2.67355 2.42852
1987 329.35 37.9206 17.7462 1 2.61838 2.33619

Group the consecutive 'LITHOLOGY' rows, then compute each group's thickness, and finally broadcast it back to all rows:
df['THICKNESS'] = (
    df.groupby(df['LITHOLOGY'].ne(df['LITHOLOGY'].shift()).cumsum())['DEPTH_MD']
      .transform(lambda x: x.diff().sum())
)
out = df[(df['LITHOLOGY'] == 1) & (df['THICKNESS'] >= 0.1)]
Output:
>>> out
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP THICKNESS
1985 329.25 30.9402 17.6376 1 2.71992 2.46051 0.1
1986 329.30 35.2419 17.6917 1 2.67355 2.42852 0.1
1987 329.35 37.9206 17.7462 1 2.61838 2.33619 0.1
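DEPTH_MD appears to be in meters here, so the 10 cm threshold becomes 0.1. For reference, a minimal sketch (made-up values) of how the ne/shift/cumsum idiom assigns a new group id every time the value changes:

import pandas as pd

# Each position whose value differs from its predecessor starts a new group.
s = pd.Series([2, 2, 1, 2, 2, 1, 1, 1])
group_id = s.ne(s.shift()).cumsum()
print(group_id.tolist())  # [1, 1, 2, 3, 3, 4, 4, 4]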

Related

Pandas Groupby Multiple Conditions KeyError

I have a df called df_out with column names such as in the following snippet, but for some reason I cannot use the 'groupby' function with the column headers since it keeps giving me KeyError: 'year'. I've researched and tried stripping whitespace, resetting the index, allowing whitespace before my groupby setting, etc., and I cannot get past this KeyError. The df_out looks like this:
df_out.columns
Out[185]:
Index(['year', 'month', 'BARTON CHAPEL', 'BARTON I', 'BIG HORN I',
'BLUE CREEK', 'BUFFALO RIDGE I', 'CAYUGA RIDGE', 'COLORADO GREEN',
'DESERT WIND', 'DRY LAKE I', 'EL CABO', 'GROTON', 'NEW HARVEST',
'PENASCAL I', 'RUGBY', 'TULE'],
dtype='object', name='plant_name')
But when I use df_out.head(), I get a different picture, with a leading column level named 'plant_name', so maybe this is where the error is coming from. Here is the output from:
df_out.head()
Out[187]:
plant_name year month BARTON CHAPEL BARTON I BIG HORN I BLUE CREEK \
0 1991 1 6.432285 7.324126 5.170067 6.736384
1 1991 2 7.121324 6.973586 4.922693 7.473527
2 1991 3 8.125793 8.681317 5.796599 8.401855
3 1991 4 7.454972 8.037764 7.272292 7.961625
4 1991 5 7.012809 6.530013 6.626949 6.009825
plant_name BUFFALO RIDGE I CAYUGA RIDGE COLORADO GREEN DESERT WIND \
0 7.163790 7.145323 5.783629 5.682003
1 7.595744 7.724717 6.245952 6.269524
2 8.111411 9.626075 7.918871 6.657648
3 8.807458 8.618806 7.011444 5.848736
4 7.734852 6.267097 7.410013 5.099610
plant_name DRY LAKE I EL CABO GROTON NEW HARVEST PENASCAL I \
0 4.721089 10.747285 7.456640 6.921801 6.296425
1 5.095923 8.891057 7.239762 7.449122 6.484241
2 8.409637 12.238508 8.274046 8.824758 8.444960
3 7.893694 10.837139 6.381736 8.840431 7.282444
4 8.496976 8.636882 6.856747 7.469825 7.999530
plant_name RUGBY TULE
0 7.028360 4.110605
1 6.394687 5.257128
2 6.859462 10.789516
3 7.590153 7.425153
4 7.556546 8.085255
My groupby statement that raises the KeyError looks like this; I'm trying to calculate the average by year and month over the subset of df_out columns found in the list 'west':
west=['BIG HORN I','DRY LAKE I', 'TULE']
westavg = df_out[df_out.columns[df_out.columns.isin(west)]].groupby(['year','month']).mean()
Thank you very much.
Your code can be broken down as:
westavg = (df_out[df_out.columns[df_out.columns.isin(west)]]
           .groupby(['year','month']).mean()
           )
which does not work because 'year' and 'month' are not columns of df_out[df_out.columns[df_out.columns.isin(west)]].
Try:
west_cols = [c for c in df_out if c in west]
westavg = df_out.groupby(['year','month'])[west_cols].mean()
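A minimal end-to-end sketch with toy data (hypothetical values, and only two of the 'west' plants) showing why the grouping keys must survive the column selection:

import pandas as pd

df_out = pd.DataFrame({
    'year': [1991, 1991, 1991, 1991],
    'month': [1, 1, 2, 2],
    'BIG HORN I': [5.1, 5.2, 4.9, 5.0],
    'TULE': [4.1, 4.2, 5.2, 5.3],
})
west = ['BIG HORN I', 'DRY LAKE I', 'TULE']

# Selecting only the 'west' columns first would drop 'year' and 'month',
# which is exactly what made groupby(['year', 'month']) raise the KeyError.
# Grouping the full frame and selecting columns afterwards works:
west_cols = [c for c in df_out if c in west]
westavg = df_out.groupby(['year', 'month'])[west_cols].mean()
print(westavg)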
OK, with the help of Quang Hoang below, I understood the problem and came up with this working answer, which I am able to understand a bit better using .intersection:
westavg = df_out[df_out.columns.intersection(west)].mean(axis=1)
# gives me the average of each row from the subset of columns defined by the list 'west'.
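Note that this computes a per-row mean, not the per-(year, month) mean of the original groupby. If the grouped version is still wanted, the intersection trick combines with the answer above (a sketch on the same toy data):

west_cols = list(df_out.columns.intersection(west))
westavg = df_out.groupby(['year', 'month'])[west_cols].mean()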

Pandas: Make a column with (1,2,3) if the string in another column's row starts with ("A","B","C")

I have a dataframe with filenames and classifications; these are predictions from a network. I want to map them to integers to evaluate the network's predictions.
My dataframe is :
Filename Class
GHT347 Europe
GHT568 lONDON
GHT78 Europe
HJU US
HJI lONDON
HJK US
KLO Europe
KLU lONDON
KLP lONDON
KLY1 lONDON
KL34 US
The true prediction should be:
GHT -- Europe
HJ -- US
KL -- London
I want to map GHT/Europe to 1, HJ/US to 0, and KL/London to 2 by adding two additional columns, Prediction and Actual:
Actual Prediction
1 1
1 2
The pandas str.startswith method returns True or False; here I want three values. Can anyone guide me?
I cannot fully understand what you want, but I can give you some tips. Use regular expressions:
import numpy as np

df['Actual'] = np.nan
df.loc[(df.Filename.str.contains('^GHT.*')) & (df.Class == 'Europe'), 'Actual'] = 1
df.loc[(df.Filename.str.contains('^HJ.*')) & (df.Class == 'US'), 'Actual'] = 0
and so on
You can set column values to anything you like, based on the values of one or more other columns. This toy example shows one way to do it:
import pandas as pd

row1list = ['GHT347', 'Europe']
row2list = ['GHT568', 'lONDON']
row3list = ['KLU', 'lONDON']
df = pd.DataFrame([row1list, row2list, row3list],
                  columns=['Filename', 'Class'])

df['Actual'] = -1  # start with a value you will ignore
df['Prediction'] = -1
df.loc[df['Filename'].str.startswith('GHT') & (df['Class'] == 'Europe'), 'Actual'] = 1
df.loc[df['Filename'].str.startswith('KL') & (df['Class'] == 'lONDON'), 'Prediction'] = 2
print(df)
# Filename Class Actual Prediction
# 0 GHT347 Europe 1 -1
# 1 GHT568 lONDON -1 -1
# 2 KLU lONDON -1 2
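If you prefer to build such a column in one pass, numpy.select maps several boolean masks to several values at once; this is a sketch using the question's prefixes and class labels, with -1 marking rows that match no rule:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Filename': ['GHT347', 'HJU', 'KLU'],
                   'Class': ['Europe', 'US', 'lONDON']})

conditions = [
    df['Filename'].str.startswith('GHT') & (df['Class'] == 'Europe'),
    df['Filename'].str.startswith('HJ') & (df['Class'] == 'US'),
    df['Filename'].str.startswith('KL') & (df['Class'] == 'lONDON'),
]
df['Actual'] = np.select(conditions, [1, 0, 2], default=-1)
print(df)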

Averaging dataframes with many string columns and displaying all the columns

I have struggled with this even after looking at various past answers, to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display the data on the GUI together with the information in the non-numeric columns. The non-numeric columns have info such as names, rollno, and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe, but when I combine two or more dataframes it returns only the average of the numeric columns, leaving the non-numeric columns undisplayed. Below is one of the codes I've tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(df.index, level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
Working code should return the averages together with the non-numeric columns, like this:
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear; however, I am guessing the intent from its content. I have modified your dataframes, which were not well formed, by adding a third stream called 'CENTRAL':
Df1 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23]})
I have assumed you want to merge the two dataframes and find the average:
df3 = pd.concat([Df2, Df1])
df3.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
Outcome:
    STREAM  ADM   NAME   KCPE   ENG   KIS   MAT
0  CENTRAL  439  PETER  349.0  67.5  61.0  26.0
1    NORTH  437  JAMES  233.0  55.0  35.0  44.5
2    SOUTH  238   MARK  168.0  20.0  62.5  48.0

Pandas Dataframe: Divide column entries by number of occurrences

My problem:
I have this DF:
df_problem = pd.DataFrame({"Share":['5%','6%','9%','9%', '9%'],"level_1":[0,0,1,2,3], 'BO':['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
The problem is that the 9% is actually divided among the three shareholders. So I want to give each of them their 3% share and attach it to their names. It should then look like this:
df_solution = pd.DataFrame({"Share":['5%','6%','3%','3%', '3%'],"level_1":[0,0,0,1,2], 'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
How do I do this in a simple way?
You could try something like this:
df_problem['Share'] = (df_problem['Share'].str.replace('%', '').astype(float) /
                       df_problem.groupby('Share')['BO'].transform('count')
                       ).astype(str) + '%'
>>> df_problem
Share level_1 BO
0 5.0% 0 Nestle
1 6.0% 0 Procter
2 3.0% 1 Nestle
3 3.0% 2 Tesla
4 3.0% 3 Jeff
Please note that I have assumed the values of the 'Share' column to be floats, as you can see above.
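A variant of the same idea that stays numeric until the final formatting step (a sketch; like the answer above, it assumes that equal 'Share' values always denote a single holding split across the matching rows):

import pandas as pd

df_problem = pd.DataFrame({"Share": ['5%', '6%', '9%', '9%', '9%'],
                           "level_1": [0, 0, 1, 2, 3],
                           'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})

# Divide each percentage by the number of rows carrying the same value:
# the three 9% rows each become 3%.
share = df_problem['Share'].str.rstrip('%').astype(float)
df_problem['Share'] = (share / share.groupby(share).transform('size')).astype(str) + '%'
print(df_problem)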

find closest match within a vector to fill missing values using dplyr

A dummy dataset is:
data <- data.frame(
  group = c(1, 1, 1, 1, 1, 2),
  dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01", "2004-08-01",
                    "2005-03-01", "2010-02-01")),
  value = c(10, 20, NA, 40, NA, 5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within the same group. In case of a tie, pick either. I am using dplyr. which.closest from the birk package looks relevant, but it needs a vector and a value. How do I look up within a vector without writing loops? Even an SQL solution would do. Any pointers to the solution?
Maybe something like: value = value[match(which.closest(dates, THISdate) & !is.na(value))], but I am not sure how to specify THISdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Using knn1 (nearest neighbour) from the class package (which ships with R, so there is nothing to install) and dplyr, define an na.knn1 function that replaces each NA value in x with the non-NA x value having the closest time:
library(class)
na.knn1 <- function(x, time) {
  is_na <- is.na(x)
  if (sum(is_na) == 0 || all(is_na)) return(x)
  train <- matrix(time[!is_na])
  test <- matrix(time[is_na])
  cl <- x[!is_na]
  x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
  x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
You can try using sapply to find the closest values, since the x argument of which.closest only takes a single value. First create a vector in which the dates with missing values are replaced by NA, then use it within the which.closest function:
library(birk)
vect = replace(data$dates, which(is.na(data$value)), NA)
transform(data, value = value[sapply(dates, which.closest, vec = vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
If which.closest accepted a vector, there would be no need for sapply, but this is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data %>% mutate(vect = `is.na<-`(dates, is.na(value)),
                value = value[sapply(dates, which.closest, vect)]) %>%
  select(-vect)