Pandas: Make a column with (1, 2, 3) if the string in another column starts with ("A", "B", "C")

I have a dataframe with filenames and classifications; these are predictions from a network. I want to map them to integers so I can evaluate the network's predictions.
My dataframe is:
Filename Class
GHT347 Europe
GHT568 lONDON
GHT78 Europe
HJU US
HJI lONDON
HJK US
KLO Europe
KLU lONDON
KLP lONDON
KLY1 lONDON
KL34 US
The true mapping should be:
GHT -- Europe
HJ -- US
KL -- London
I want to map GHT and Europe to 1, HJ and US to 0, and KL and London to 2, by adding two additional columns, Prediction and Actual:
Actual Prediction
1 1
1 2
pandas' str.startswith method returns only True or False, but here I need three values. Can anyone guide me?

I cannot fully understand what you want, but I can give you some tips.
Use regular expressions:
import numpy as np

df['Actual'] = np.nan
df.loc[df.Filename.str.contains('^GHT.*') & (df.Class == 'Europe'), 'Actual'] = 1
df.loc[df.Filename.str.contains('^HJ.*') & (df.Class == 'US'), 'Actual'] = 0
and so on
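For completeness (this line is my illustration, not part of the original answer), the remaining mapping from the question, KL plus lONDON to 2, would follow the same pattern:
df.loc[df.Filename.str.contains('^KL.*') & (df.Class == 'lONDON'), 'Actual'] = 2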

You can set column values to anything you like, based on the values of one or more other columns. This toy example shows one way to do it:
import pandas as pd

row1list = ['GHT347', 'Europe']
row2list = ['GHT568', 'lONDON']
row3list = ['KLU', 'lONDON']
df = pd.DataFrame([row1list, row2list, row3list],
                  columns=['Filename', 'Class'])
df['Actual'] = -1  # start with a value you will ignore
df['Prediction'] = -1
df.loc[df['Filename'].str.startswith('GHT') & (df['Class'] == 'Europe'), 'Actual'] = 1
df.loc[df['Filename'].str.startswith('KL') & (df['Class'] == 'lONDON'), 'Prediction'] = 2
print(df)
#   Filename   Class  Actual  Prediction
# 0   GHT347  Europe       1          -1
# 1   GHT568  lONDON      -1          -1
# 2      KLU  lONDON      -1           2
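If you want to fill both columns for the whole frame in one pass, a vectorized sketch with numpy.select could look like the following. This is an illustration built on the toy df above, assuming the prefixes (GHT, HJ, KL) and class labels (Europe, US, lONDON) are exactly as in the question; anything that matches nothing keeps the -1 marker.
import numpy as np

conditions_actual = [
    df['Class'] == 'Europe',
    df['Class'] == 'US',
    df['Class'] == 'lONDON',
]
conditions_pred = [
    df['Filename'].str.startswith('GHT'),
    df['Filename'].str.startswith('HJ'),
    df['Filename'].str.startswith('KL'),
]
# Europe/GHT -> 1, US/HJ -> 0, lONDON/KL -> 2; -1 marks anything unmatched
df['Actual'] = np.select(conditions_actual, [1, 0, 2], default=-1)
df['Prediction'] = np.select(conditions_pred, [1, 0, 2], default=-1)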

Pandas bringing in data from another dataframe

I am trying to bring data from a dataframe that is a mapping table into another dataframe using the code below; however, I get the error 'x' is not defined. What am I doing wrong?
Note: for values not in the mapping table (China/CN) I would just like the value to be blank or NaN. If there are values in the mapping table that are not in my data, I don't want to include them.
import pandas as pd
languages = {'Language': ["English", "German", "French", "Spanish"],
             'countryCode': ["EN", "DE", "FR", "ES"]
             }
countries = {'Country': ["Australia", "Argentina", "Mexico", "Algeria", "China"],
             'countryCode': ["EN", "ES", "ES", "FR", "CN"]
             }
language_map = pd.DataFrame(languages)
data = pd.DataFrame(countries)
def language_converter(x):
    return language_map.query(f"countryCode=='{x}'")['Language'].values[0]
data['Language'] = data['countryCode'].apply(language_converter(x))
Use pandas.DataFrame.merge:
data.merge(language_map, how='left')
Output:
Country countryCode Language
0 Australia EN English
1 Argentina ES Spanish
2 Mexico ES Spanish
3 Algeria FR French
4 China CN NaN
.apply accepts a callable object, but you've passed language_converter(x), which is already a function call with an undefined variable x, since apply has not been applied yet.
A valid usage is: .apply(language_converter).
But then you'll hit another error, IndexError: index 0 is out of bounds for axis 0 with size 0, because some country codes may not be found (which breaks the indexing .values[0]).
If you proceed with your initial approach, a valid version would look like this:
import numpy as np

def language_converter(x):
    lang = language_map[language_map["countryCode"] == x]['Language'].values
    return lang[0] if lang.size > 0 else np.nan

data['Language'] = data['countryCode'].apply(language_converter)
print(data)
Country countryCode Language
0 Australia EN English
1 Argentina ES Spanish
2 Mexico ES Spanish
3 Algeria FR French
4 China CN NaN
But instead of defining and applying language_converter, it is much simpler and more straightforward to map the country codes explicitly with just:
data['Language'] = data['countryCode'].map(language_map.set_index("countryCode")['Language'])
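If you prefer a blank string rather than NaN for the unmapped codes (as the question mentions), a small follow-up sketch of the same map approach:
data['Language'] = data['countryCode'].map(language_map.set_index("countryCode")['Language']).fillna('')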

How can I set conditions for dataframes?

Here is my xlsx file: /we.tl/t-ghXIOjPznq
https://imgur.com/b8kTbNV
I have such a dataframe. I want to select only the rows where the LITHOLOGY column is 1. In order to do that:
df2 = pd.read_excel('V131BLOG.xlsx')
LITHOLOGY = [1]
df2[df2.LITHOLOGY.isin(LITHOLOGY)]
There hasn't been a problem so far. I was able to filter as I wanted.
https://imgur.com/wcSvokM
In addition to this, I want to keep rows with LITHOLOGY equal to 1 only if the layer is thick enough, i.e. the cumulative difference of consecutive DEPTH_MD values within the run should be bigger than 10 cm. I have not made any progress on this. What path should I follow?
As you can see in this figure (https://imgur.com/a/02nlUUl), there are consecutive runs of rows with LITHOLOGY equal to 1. But when you check the DEPTH_MD values, the upper run spans 10 cm while the lower run spans only 5 cm. I want to create a dataframe that only contains runs spanning at least 10 cm of DEPTH_MD.
Input:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1980 329.00 26.8964 25.47160 2 2.99103 2.62130
1981 329.05 26.8574 32.54390 2 2.94772 2.58945
1982 329.10 27.1297 28.83750 1 2.90123 2.55601
1983 329.15 26.9742 17.91150 2 2.80383 2.52327
1984 329.20 28.3946 31.94310 2 2.76041 2.49050
1985 329.25 30.9402 17.63760 1 2.71992 2.46051
1986 329.30 35.2419 17.69170 1 2.67355 2.42852
1987 329.35 37.9206 17.74620 1 2.61838 2.33619
1988 329.40 39.9189 24.84460 2 2.56200 2.28671
1989 329.45 41.4947 7.03354 2 2.50669 2.23887
1990 329.50 41.5473 7.03354 2 2.42167 2.19944
1991 329.55 41.0158 10.58260 2 2.40039 2.17235
Expected output:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1985 329.25 30.9402 17.6376 1 2.71992 2.46051
1986 329.30 35.2419 17.6917 1 2.67355 2.42852
1987 329.35 37.9206 17.7462 1 2.61838 2.33619
Group the consecutive 'LITHOLOGY' rows, compute each run's thickness, and broadcast it back to all rows of the run:
df['THICKNESS'] = (
    df.groupby(df['LITHOLOGY'].ne(df['LITHOLOGY'].shift()).cumsum())['DEPTH_MD']
      .transform(lambda x: x.diff().sum())
)
out = df[(df['LITHOLOGY'] == 1) & (df['THICKNESS'] >= 0.1)]
Output:
>>> out
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP THICKNESS
1985 329.25 30.9402 17.6376 1 2.71992 2.46051 0.1
1986 329.30 35.2419 17.6917 1 2.67355 2.42852 0.1
1987 329.35 37.9206 17.7462 1 2.61838 2.33619 0.1
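The key step is the ne/shift/cumsum pattern, which labels each run of consecutive equal LITHOLOGY values with an increasing group id before the thickness is computed per run. A minimal sketch of what that grouper produces, on a small made-up series rather than the question's data:
import pandas as pd

litho = pd.Series([2, 2, 1, 2, 1, 1, 1, 2])
# True whenever the value changes from the previous row; cumsum turns
# those change points into run labels
runs = litho.ne(litho.shift()).cumsum()
print(runs.tolist())  # [1, 1, 2, 3, 4, 4, 4, 5]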

Need help to transform table. Used pivot but unable to get desired output [duplicate]

I'm having trouble rearranging the following data frame:
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
dat1
name numbers value
1 firstName 1 0.3407997
2 firstName 2 -0.7033403
3 firstName 3 -0.3795377
4 firstName 4 -0.7460474
5 secondName 1 -0.8981073
6 secondName 2 -0.3347941
7 secondName 3 -0.5013782
8 secondName 4 -0.1745357
I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:
name 1 2 3 4
1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
I've looked at melt and cast and a few other things, but none seem to do the job.
Using reshape function:
reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
The new (in 2014) tidyr package also does this simply, with gather()/spread() being the terms for melt/cast.
Edit: Now, in 2019, tidyr v 1.0 has launched and set spread and gather on a deprecation path, preferring instead pivot_wider and pivot_longer, which you can find described in this answer. Read on if you want a brief glimpse into the brief life of spread/gather.
library(tidyr)
spread(dat1, key = numbers, value = value)
From GitHub:
tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.
Just as reshape2 did less than reshape, tidyr does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.
You can do this with the reshape() function, or with the melt() / cast() functions in the reshape package. For the second option, example code is
library(reshape)
cast(dat1, name ~ numbers)
Or using reshape2
library(reshape2)
dcast(dat1, name ~ numbers)
Another option if performance is a concern is to use data.table's extension of reshape2's melt & dcast functions
(Reference: Efficient reshaping using data.tables)
library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
# name 1 2 3 4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814
And, as of data.table v1.9.6 we can cast on multiple columns
## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
# name value_1 value_2 value_3 value_4 value2_1 value2_2 value2_3 value2_4
# 1: firstName 0.1836433 -0.8356286 1.5952808 0.3295078 0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684 0.4874291 0.7383247 0.5757814 -1.6409368 0.9748581 1.476649 1.1515627
With tidyr, there is pivot_wider() and pivot_longer() which are generalized to do reshaping from long -> wide or wide -> long, respectively. Using the OP's data:
single column long -> wide
library(tidyr)
dat1 %>%
pivot_wider(names_from = numbers, values_from = value)
# # A tibble: 2 x 5
# name `1` `2` `3` `4`
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 firstName 0.341 -0.703 -0.380 -0.746
# 2 secondName -0.898 -0.335 -0.501 -0.175
multiple columns long -> wide
pivot_wider() is also capable of more complex pivot operations. For example, you can pivot multiple columns simultaneously:
# create another column for showing the functionality
dat2 <- dat1 %>%
dplyr::rename(valA = value) %>%
dplyr::mutate(valB = valA * 2)
dat2 %>%
pivot_wider(names_from = numbers, values_from = c(valA, valB))
# # A tibble: 2 × 9
# name valA_1 valA_2 valA_3 valA_4 valB_1 valB_2 valB_3 valB_4
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 firstName 0.341 -0.703 -0.380 -0.746 0.682 -1.41 -0.759 -1.49
# 2 secondName -0.898 -0.335 -0.501 -0.175 -1.80 -0.670 -1.00 -0.349
There is much more functionality to be found in the docs.
Using your example dataframe, we could:
xtabs(value ~ name + numbers, data = dat1)
Other two options:
Base package:
df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df
sqldf package:
library(sqldf)
sqldf('SELECT name,
MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
FROM dat1
GROUP BY name')
Using base R aggregate function:
aggregate(value ~ name, dat1, I)
# name value.1 value.2 value.3 value.4
#1 firstName 0.4145 -0.4747 0.0659 -0.5024
#2 secondName -0.8259 0.1669 -0.8962 0.1681
The base reshape function works perfectly fine:
df <- data.frame(
year = c(rep(2000, 12), rep(2001, 12)),
month = rep(1:12, 2),
values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Where
idvar is the column of classes that separates rows
timevar is the column of classes to cast wide
v.names is the column containing numeric values
direction specifies wide or long format
the optional sep argument is the separator used in between timevar class names and v.names in the output data.frame.
If no idvar exists, create one before using the reshape() function:
df$id <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide
Just remember that idvar is required! The timevar and v.names part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
There's a very powerful new package from the genius data scientists at Win-Vector (the folks who made vtreat, seplyr and replyr) called cdata. It implements the "coordinated data" principles described in this document and also in this blog post. The idea is that regardless of how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's an excerpt from a recent blog post by John Mount:
The whole system is based on two primitives or operators
cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These
operators have pivot, un-pivot, one-hot encode, transpose, moving
multiple rows and columns, and many other transforms as simple special
cases.
It is easy to write many different operations in terms of the
cdata primitives. These operators can work-in memory or at big data
scale (with databases and Apache Spark; for big data use the
cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN()
variants). The transforms are controlled by a control table that
itself is a diagram of (or picture of) the transform.
We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.
library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1,                      # reference to dataset
                                             columnToTakeKeysFrom = 'numbers',  # this will become column headers
                                             columnToTakeValuesFrom = 'value',  # this contains data
                                             sep = "_")                         # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable = dat1,                # reference to dataset
                                 keyColumns = c('name'),          # this(these) column(s) should stay untouched
                                 controlTable = pivotControlTable # control table built above
)
dat_wide
#> name numbers_1 numbers_2 numbers_3 numbers_4
#> 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
A much easier way!
devtools::install_github("yikeshu0611/onetree") #install onetree package
library(onetree)
widedata=reshape_toWide(data = dat1,id = "name",j = "numbers",value.var.prefix = "value")
widedata
name value1 value2 value3 value4
firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
If you want to go back from wide to long, just change Wide to Long; nothing else in the call needs to change.
reshape_toLong(data = widedata,id = "name",j = "numbers",value.var.prefix = "value")
name numbers value
firstName 1 0.3407997
secondName 1 -0.8981073
firstName 2 -0.7033403
secondName 2 -0.3347941
firstName 3 -0.3795377
secondName 3 -0.5013782
firstName 4 -0.7460474
secondName 4 -0.1745357
This works even if you have missing pairs and it doesn't require sorting (as.matrix(dat1)[,1:2] can be replaced with cbind(dat1[,1],dat1[,2])):
> set.seed(45);dat1=data.frame(name=rep(c("firstName","secondName"),each=4),numbers=rep(1:4,2),value=rnorm(8))
> u1=unique(dat1[,1]);u2=unique(dat1[,2])
> m=matrix(nrow=length(u1),ncol=length(u2),dimnames=list(u1,u2))
> m[as.matrix(dat1)[,1:2]]=dat1[,3]
> m
1 2 3 4
firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
This doesn't work if you have missing pairs and it requires sorting, but it's a bit shorter in case the pairs are already sorted:
> u1=unique(dat1[,1]);u2=unique(dat1[,2])
> dat1=dat1[order(dat1[,1],dat1[,2]),] # not actually needed in this case
> matrix(dat1[,3],length(u1),,T,list(u1,u2))
1 2 3 4
firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
Here's a function version of the first approach (add as.data.frame to make it work with tibbles):
l2w=function(x,row=1,col=2,val=3,sort=F){
u1=unique(x[,row])
u2=unique(x[,col])
if(sort){u1=sort(u1);u2=sort(u2)}
out=matrix(nrow=length(u1),ncol=length(u2),dimnames=list(u1,u2))
out[cbind(x[,row],x[,col])]=x[,val]
out
}
Or if you only have the values of the lower triangle, you can do this:
> euro=as.matrix(eurodist)[1:3,1:3]
> lower=data.frame(V1=rownames(euro)[row(euro)[lower.tri(euro)]],V2=colnames(euro)[col(euro)[lower.tri(euro)]],V3=euro[lower.tri(euro)])
> lower
V1 V2 V3
1 Barcelona Athens 3313
2 Brussels Athens 2963
3 Brussels Barcelona 1318
> n=unique(c(lower[,1],lower[,2]))
> full=rbind(lower,setNames(lower[,c(2,1,3)],names(lower)),data.frame(V1=n,V2=n,V3=0))
> full
V1 V2 V3
1 Barcelona Athens 3313
2 Brussels Athens 2963
3 Brussels Barcelona 1318
4 Athens Barcelona 3313
5 Athens Brussels 2963
6 Barcelona Brussels 1318
7 Athens Athens 0
8 Barcelona Barcelona 0
9 Brussels Brussels 0
> l2w(full,sort=T)
Athens Barcelona Brussels
Athens 0 3313 2963
Barcelona 3313 0 1318
Brussels 2963 1318 0
Or here's another approach:
> rc=as.matrix(lower[-3])
> n=sort(unique(c(rc)))
> m=matrix(0,length(n),length(n),,list(n,n))
> m[rc]=lower[,3]
> m[rc[,2:1]]=lower[,3]
> m
Athens Barcelona Brussels
Athens 0 3313 2963
Barcelona 3313 0 1318
Brussels 2963 1318 0
Another simple method in base R is to use xtabs. The result of xtabs is basically just a matrix with a fancy class name, but you can make it look like a regular matrix with class(x)=NULL;attr(x,"call")=NULL;dimnames(x)=unname(dimnames(x)):
> x=xtabs(value~name+numbers,dat1);x
numbers
name 1 2 3 4
firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
> str(x)
'xtabs' num [1:2, 1:4] 0.341 -0.898 -0.703 -0.335 -0.38 ...
- attr(*, "dimnames")=List of 2
..$ name : chr [1:2] "firstName" "secondName"
..$ numbers: chr [1:4] "1" "2" "3" "4"
- attr(*, "call")= language xtabs(formula = value ~ name + numbers, data = dat1)
> class(x)
[1] "xtabs" "table"
> class(as.matrix(x)) # `as.matrix` has no effect because `x` is already a matrix
[1] "xtabs" "table"
> class(x)=NULL;class(x)
[1] "matrix" "array"
> attr(x,"call")=NULL;dimnames(x)=unname(dimnames(x))
> x # now it looks like a regular matrix
1 2 3 4
firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
> str(x)
num [1:2, 1:4] 0.341 -0.898 -0.703 -0.335 -0.38 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "firstName" "secondName"
..$ : chr [1:4] "1" "2" "3" "4"
Normally as.data.frame(x) converts the result of xtabs back to long format, but you can avoid it with class(x)=NULL:
> x=xtabs(value~name+numbers,dat1);as.data.frame(x)
name numbers Freq
1 firstName 1 0.3407997
2 secondName 1 -0.8981073
3 firstName 2 -0.7033403
4 secondName 2 -0.3347941
5 firstName 3 -0.3795377
6 secondName 3 -0.5013782
7 firstName 4 -0.7460474
8 secondName 4 -0.1745357
> class(x)=NULL;as.data.frame(x)
1 2 3 4
firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357
This converts data in wide format to long format (unlist converts a dataframe to a vector and c converts a matrix to a vector):
w2l=function(x)data.frame(V1=rownames(x)[row(x)],V2=colnames(x)[col(x)],V3=unname(c(unlist(x))))
I came here via a linked question, Reshape three column data frame to matrix ("long" to "wide" format). That question is closed, so I am writing an alternative solution here.
I found an alternative solution, perhaps useful for someone looking to convert three columns to a matrix. I am referring to the decoupleR (2.3.2) package. Below is copied from their site:
Generates a kind of table where the rows come from id_cols, the columns from names_from and the values from values_from.
Usage
pivot_wider_profile(
data,
id_cols,
names_from,
values_from,
values_fill = NA,
to_matrix = FALSE,
to_sparse = FALSE,
...
)
Using only dplyr and purrr's map functions:
library(dplyr)
library(purrr)
set.seed(45)
dat1 <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2), value = rnorm(8)
)
longer_to_wider <- function(data, name_from, value_from){
  group <- colnames(data)[!(colnames(data) %in% c(name_from, value_from))]
  data %>% group_by(.data[[group]]) %>%
    summarise(name = list(.data[[name_from]]),
              value = list(.data[[value_from]])) %>%
    {
      d <- data.frame(
        name = .[[name_from]] %>% unlist() %>% unique()
      )
      e <- map_dfc(.[[group]], function(x){
        y <- data_frame(
          x = data %>% filter(.data[[group]] == x) %>% pull(value_from)
        )
        colnames(y) <- x
        y
      })
      cbind(d, e)
    }
}
longer_to_wider(dat1, "name", "value")
# name 1 2 3 4
# 1 firstName 0.3407997 -0.7033403 -0.3795377 -0.7460474
# 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357

Groupby and A) Concatenate matching strings (and/or substrings) B) Sum the values

I have df:
row_numbers ID code amount
1 med a 1
2 med a, b 1
3 med b, c 1
4 med c 1
5 med d 10
6 cad a, b 1
7 cad a, b, d 0
8 cad e 2
I want to group by column ID and A) combine the strings in column code when a string/substring matches, and B) sum the values of column amount.
Expected results: (posted as an image in the original question)
Explanation:
The column row_numbers has no role here; I included it only to explain the output.
A) Grouping on column ID and looking at column code: row1's string a matches a substring of row2; row2's substring b matches a substring of row3; row3's substring c matches the string of row4, hence rows 1 to 4 are combined. Row5's string d does not match any other string/substring, so it forms a separate group. B) Based on this grouping, the amounts of rows 1 to 4 are added, and row5 remains a separate group.
Thanks in advance for your time and thoughts :).
EDIT - 1
Pasting the real data (posted as an image).
Expected output (also posted as an image):
Explanation:
I have to group on column id, concatenate the values of column code, and sum the values of columns units and vol. The matching (to-be-concatenated) values of column code are colour coded in the image. Row1 links to row5 and row9; row9 in turn links to row3, hence rows 1, 5, 9 and 3 are combined. Similarly rows 2 and 7, and so on. Row8 has no link to any other value within group med (column id) and hence stays as a separate row.
Thanks!.
Update: Given your latest sample data, this is not simple data munging. There is no vectorized solution; it relates to graph theory. You need to find the connected components within each group of ID and do the calculation on each connected component.
Consider each string a node of a graph. If two strings overlap, they are connected nodes. For every node, you need to traverse all paths connected to it and do the calculation on all nodes reached through these paths. This traversal can be done with depth-first search logic.
However, before running the depth-first search, you need to preprocess each string into a set so you can check for overlaps.
Method 1: Recursive
Do the following:
Define a function dfs to recursively run depth-first search.
Define a function gfunc to use with groupby apply. This function traverses the elements of each group of ID and returns the desired dataframe.
Strip any blank spaces from each string, split it on commas, and convert it to a set using replace, split and map, assigning the result to a new column new_code of df.
Call groupby on ID and apply using gfunc. Call droplevel and reset_index to get the desired output.
Code as follows:
import numpy as np
import pandas as pd

def dfs(node, index, glist, g_checked_rows):
    ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
    g_checked_rows.add(index)
    for j, s in glist:
        if j not in g_checked_rows and not node.isdisjoint(s):
            t_arr = dfs(s, j, glist, g_checked_rows)
            ret_arr[0] += ', ' + t_arr[0]
            ret_arr[1:] += t_arr[1:]
    return ret_arr

def gfunc(x):
    checked_rows = set()
    final = []
    code_list = list(x.new_code.items())
    for i, row in code_list:
        if i not in checked_rows:
            final.append(dfs(row, i, code_list, checked_rows))
    return pd.DataFrame(final, columns=['code', 'units', 'vol'])

df['new_code'] = df.code.str.replace(' ', '').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
Out[16]:
ID code units vol
0 med CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18 4 4
1 med CO-16, CO-B20, CO-16 3 3
2 med CO-252, CO-252, CO-45 3 3
3 med OA-258 1 1
4 cad PR-96, PR-96, CO-243 4 4
5 cad PR-87, OA-258, PR-87 3 3
Note: I assume your pandas version is 0.24+. If it is < 0.24, the last step you need to use reset_index and drop instead of droplevel and reset_index as follows
df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)
Method 2: Iterative
To make this complete, I implemented a version of gfunc using an iterative process instead of recursion. The iterative process requires only one function; however, that function is more complicated. The logic of the iterative process is as follows:
Push the first node onto a deque. While the deque is not empty, pop the top node off.
If the node has not been marked as checked, process it and mark it as checked.
Find all of its unchecked neighbours, in the reverse order of the list of nodes, and push them onto the deque.
If the deque is still not empty, pop the next node from the top and continue from step 2.
Code as follows:
from collections import deque

def gfunc_iter(x):
    checked_rows = set()
    final = []
    q = deque()
    code_list = list(x.new_code.items())
    code_list_rev = code_list[::-1]
    for i, row in code_list:
        if i not in checked_rows:
            q.append((i, row))
            ret_arr = np.array(['', 0, 0], dtype='O')
            while (q):
                n, node = q.pop()
                if n in checked_rows:
                    continue
                ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                if not ret_arr[0]:
                    ret_arr = ret_arr_child.copy()
                else:
                    ret_arr[0] += ', ' + ret_arr_child[0]
                    ret_arr[1:] += ret_arr_child[1:]
                checked_rows.add(n)
                # push to `q` all unchecked neighbours, in the reversed list of nodes
                for j, s in code_list_rev:
                    if j not in checked_rows and not node.isdisjoint(s):
                        q.append((j, s))
            final.append(ret_arr)
    return pd.DataFrame(final, columns=['code', 'units', 'vol'])
df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()
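As a further illustration (not part of the original answer), the same connected-components idea can be sketched with the networkx library instead of a hand-rolled traversal. This sketch assumes the frame has the code, amount and volume columns used above:
import pandas as pd
import networkx as nx

def group_components(g):
    # build a graph whose nodes are this group's row labels;
    # connect rows whose code sets overlap
    G = nx.Graph()
    G.add_nodes_from(g.index)
    rows = list(g['new_code'].items())
    for a, (i, s1) in enumerate(rows):
        for j, s2 in rows[a + 1:]:
            if not s1.isdisjoint(s2):
                G.add_edge(i, j)
    out = []
    for comp in nx.connected_components(G):
        sub = g.loc[sorted(comp)]
        out.append({'code': ', '.join(sub['code']),
                    'units': sub['amount'].sum(),
                    'vol': sub['volume'].sum()})
    return pd.DataFrame(out)

df['new_code'] = df.code.str.replace(' ', '').str.split(',').map(set)
df_final = (df.groupby('ID', sort=False)
              .apply(group_components)
              .droplevel(1)
              .reset_index())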
I believe the three main ideas for doing what you want are:
create an accumulator data structure (a DataFrame in this case)
iterate over pairs of rows, so that in each iteration you have (currentRow, nextRow)
pattern-match the current row against the next row, and against the rows already accumulated
It's not totally clear exactly which pattern match you're looking for, so I assumed that if any letter of the current row's code appears in the next one, they should be concatenated.
Using a data.csv (with space separators) as an example:
row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2
import pandas as pd
from itertools import zip_longest

def generate_pairs(group):
    ''' generate pairs (currentRow, nextRow) '''
    group_curriterrows = group.iterrows()
    group_nextiterrows = group.iterrows()
    group_nextiterrows.__next__()
    zip_list = zip_longest(group_curriterrows, group_nextiterrows)
    return zip_list

def generate_lists_to_check(currRow, nextRow, accumulated_rows):
    ''' generate list if any next letters are in current ones and
    another list if any next letters are in the accumulated codes '''
    currLetters = str(currRow["code"]).split(",")
    nextLetters = str(nextRow["code"]).split(",")
    letter_inNext = [letter in nextLetters for letter in currLetters]
    unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
    letter_inHistory = [any(letter in unq for letter in nextLetters)
                        for unq in unique_acc_codes]
    return letter_inNext, letter_inHistory

def create_newRow(accumulated_rows, nextRow):
    nextRow["row_numbers"] = str(nextRow["row_numbers"])
    accumulated_rows = accumulated_rows.append(nextRow, ignore_index=True)
    return accumulated_rows

def update_existingRow(accumulated_rows, match_idx, Row):
    accumulated_rows.loc[match_idx]["code"] += "," + Row["code"]
    accumulated_rows.loc[match_idx]["amount"] += Row["amount"]
    accumulated_rows.loc[match_idx]["volume"] += Row["volume"]
    accumulated_rows.loc[match_idx]["row_numbers"] += ',' + str(Row["row_numbers"])
    return accumulated_rows

if __name__ == "__main__":
    df = pd.read_csv("extended.tsv", sep=" ")
    groups = pd.DataFrame(columns=df.columns)
    for ID, group in df.groupby(["ID"], sort=False):
        accumulated_rows = pd.DataFrame(columns=df.columns)
        group_firstRow = group.iloc[0]
        accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
        row_numbers = str(group_firstRow.values[0])
        accumulated_rows.set_value(0, 'row_numbers', row_numbers)
        zip_list = generate_pairs(group)
        for (currRow_idx, currRow), Next in zip_list:
            if not (Next is None):
                (nextRow_idx, nextRow) = Next
                letter_inNext, letter_inHistory = \
                    generate_lists_to_check(currRow, nextRow, accumulated_rows)
                if any(letter_inNext):
                    accumulated_rows = update_existingRow(accumulated_rows, (len(accumulated_rows) - 1), nextRow)
                elif any(letter_inHistory):
                    matches = [idx for (idx, bool_val) in enumerate(letter_inHistory) if bool_val == True]
                    first_match_idx = matches[0]
                    accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, nextRow)
                    for match_idx in matches[1:]:
                        accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, accumulated_rows.loc[match_idx])
                        accumulated_rows = accumulated_rows.drop(match_idx)
                elif not any(letter_inNext):
                    accumulated_rows = create_newRow(accumulated_rows, nextRow)
        groups = groups.append(accumulated_rows)
    groups.reset_index(inplace=True, drop=True)
    print(groups)
Output for the first example, in normal row order (the lines using column volume were removed from the code above because the first example has no volume column):
row_numbers ID code amount
0 1 med a,a,b,b,c,c 4
1 5 med d 10
2 6 cad a,b,a,b,d 1
3 8 cad e 2
Output for the new example:
row_numbers ID code amount volume
0 1,5,9,3 med CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18 4 4
1 2,7 med CO-16,CO-B20,CO-16 3 3
2 4,6 med CO-252,CO-252,CO-45 3 3
3 8 med OA-258 1 1
4 10,13 cad PR-96,PR-96,CO-243 4 4
5 11,12 cad PR-87,OA-258,PR-87 3 3

Pandas: Unable to change value of a cell using while loop

I am trying to use a while loop to read through all the rows of my file and edit the value of a particular cell when a condition is met.
My logic works just fine when I read data from an Excel file, but the same logic does not work when I read from a CSV file.
Here is my logic for reading from the Excel file:
df = pd.read_excel('Energy Indicators.xls', 'Energy', index_col=None, na_values=['NA'], skiprows = 15, skipfooter = 38, header = 1, parse_cols ='C:F')
df = df.rename(columns = {'Unnamed: 0' : 'Country', 'Renewable Electricity Production': '% Renewable'})
df = df.drop(0, axis=0)
i = 0
while (i != len(df)):
    if df.iloc[i]['Country'] == "Ukraine18":
        print(df.iloc[i]['Country'])
        df.iloc[i]['Country'] = 'Ukraine'
        print(df.iloc[i]['Country'])
    i += 1
df
The result I get is:
Ukraine18
Ukraine
But when I read a CSV file:
df = pd.read_csv('world_bank.csv', skiprows = 4)
df = df.rename(columns = {'Country Name' : 'Country'})
i = 0
while (i != len(df)):
    if df.iloc[i]['Country'] == "Aruba":
        print(df.iloc[i]['Country'])
        df.iloc[i]['Country'] = "Arb"
        print(df.iloc[i]['Country'])
    i += 1
df
The result I get is:
Aruba
Aruba
Can someone please help? What am I doing wrong with my CSV file?
@Anna Iliukovich-Strakovskaia, @msr_003, you guys are right! I changed my code to df['ColumnName'][i], and it worked with the CSV file. But it is not working with the Excel file now.
So it seems that with data read from the CSV file, df['ColumnName'][i] works correctly,
but with data read from the Excel file, df.iloc[i]['ColumnName'] works correctly.
At this point, I have no clue why there should be a difference, because I am not working with the data 'within' the files; I am working on data that was read from these files into a dataframe. Once the data is in the dataframe, the source shouldn't have any influence, I think.
Anyway, thank you for your help!!
Generally I modify it as below.
testdf = pd.read_csv("sam.csv")
testdf
ExportIndex AgentName Country
0 1 Prince United Kingdom
1 2 Nit United Kingdom
2 3 Akhil United Kingdom
3 4 Ruiva United Kingdom
4 5 Niraj United Kingdom
5 6 Nitin United States
i = 0
while(i != len(testdf)):
    if(testdf['AgentName'][i] == 'Nit'):
        testdf['AgentName'][i] = 'Nitesh'
    i += 1
testdf
ExportIndex AgentName Country
0 1 Prince United Kingdom
1 2 Nitesh United Kingdom
2 3 Akhil United Kingdom
3 4 Ruiva United Kingdom
4 5 Niraj United Kingdom
5 6 Nitin United States
But I'm not sure what's wrong with your approach.
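As a side note (my addition, not from the original answers): both df.iloc[i]['Country'] = ... and df['ColumnName'][i] = ... are chained assignments, and pandas does not guarantee that they write back to the original frame, which is why the behaviour can differ between the two files. A safer sketch, assuming column names like those in the question, uses a single .loc call:
import pandas as pd

# hypothetical frame standing in for the CSV data from the question
df = pd.DataFrame({'Country': ['Aruba', 'Afghanistan'], 'Value': [1, 2]})

# single-call label-based assignment avoids chained indexing entirely
df.loc[df['Country'] == 'Aruba', 'Country'] = 'Arb'
print(df)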