How to insert a variable number of lines inside a DataFrame? - pandas

Here is the structure of my dataframe:
 plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3        25.0         4.0       25.0         7.0
I would like to add ADO_incr_y - ADO_incr_x lines, which means in this case the result would be:
 plan   ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
3ABP3        25.0         4.0       25.0         5.0
3ABP3        25.0         5.0       25.0         6.0
3ABP3        25.0         6.0       25.0         7.0
Is there a pandas/Pythonic way to do that?
I was thinking of something like:
reps = [val2 - val1 for val2, val1 in zip(df_insert["ADO_incr_y"], df_insert["ADO_incr_x"])]
df_insert.loc[np.repeat(df_insert.index.values, reps)]
But I don't get the incremental progression:
4 -> 5, 5 -> 6, 6 -> 7
How can I get the index inside the list comprehension?

You can repeat the data, then modify with groupby.cumcount():
# number of rows to emit per original row
repeats = df['ADO_incr_y'].sub(df['ADO_incr_x']).astype(int)
# duplicate each row that many times
out = df.reindex(df.index.repeat(repeats))
# make ADO_incr_x increase by one within each duplicated block
out['ADO_incr_x'] += out.groupby(level=0).cumcount()
out['ADO_incr_y'] = out['ADO_incr_x'] + 1
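For reference, a minimal end-to-end sketch with the example row from the question:
import pandas as pd

df = pd.DataFrame({
    'plan': ['3ABP3'],
    'ADO_ver_x': [25.0], 'ADO_incr_x': [4.0],
    'ADO_ver_y': [25.0], 'ADO_incr_y': [7.0],
})

repeats = df['ADO_incr_y'].sub(df['ADO_incr_x']).astype(int)
out = df.reindex(df.index.repeat(repeats))
out['ADO_incr_x'] += out.groupby(level=0).cumcount()
out['ADO_incr_y'] = out['ADO_incr_x'] + 1
print(out.reset_index(drop=True))
#     plan  ADO_ver_x  ADO_incr_x  ADO_ver_y  ADO_incr_y
# 0  3ABP3       25.0         4.0       25.0         5.0
# 1  3ABP3       25.0         5.0       25.0         6.0
# 2  3ABP3       25.0         6.0       25.0         7.0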

Related

Selecting Data Using Conditions Stored in a Variable

Pretend I have this table on a server:
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
iris$id = 1:nrow(iris)
dbWriteTable(con, "iris", iris)
I want to select some random rows from this dataset - suppose I create an R variable that contains the random rows that I want to select:
rows_to_select = sample.int(10, 5, replace = TRUE)
[1] 1 1 8 8 7
I then tried to select these rows from my table - but this "rows_to_select" variable is not being recognized for some reason:
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (rows_to_select) limit 100;")
Error: no such column: rows_to_select
This code works fine if I manually specify which rows I want (e.g. I want the first row, and the fifth row selected twice):
#works - but does not return the 5th row twice
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (1,5,5) limit 100;")
Does anyone know how to fix this?
Thank you!
In general, merely including rows_to_select in the query string is not going to make SQLite reach out of its own environment and "invade" the R environment (a completely different one!) to look for a variable. (For that matter, why doesn't select a.* ... find dplyr::select?) This is the case both for pragmatic reasons and for security (though mostly pragmatic).
You may want to consider parameterized queries instead of constructing query strings manually. In addition to security concerns about malicious SQL injection (e.g., XKCD's Exploits of a Mom aka "Little Bobby Tables"), this also guards against malformed strings or Unicode-vs-ANSI mistakes, even if it's a single data analyst running the query. DBI supports parameterized queries.
Long story short, try this:
set.seed(42)
rows_to_select = sample.int(10, 5, replace = TRUE)
rows_to_select
# [1] 1 5 1 9 10
qmarks <- paste(rep("?", length(rows_to_select)), collapse = ",")
qmarks
# [1] "?,?,?,?,?"
DBI::dbGetQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.rnum in (", qmarks, ") limit 100;"),
  params = as.list(rows_to_select))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id rnum
# 1 5.1 3.5 1.4 0.2 setosa 1 1
# 2 5.0 3.6 1.4 0.2 setosa 5 5
# 3 4.4 2.9 1.4 0.2 setosa 9 9
# 4 4.9 3.1 1.5 0.1 setosa 10 10
In this case it is rather trivial, but if you have a more complicated query where you use question marks ("bindings") at different places in the query, the order must align perfectly with the elements of the list assigned to the params= argument of dbGetQuery.
Alternative: insert a temp table with your candidate values, then left-join against it.
dbWriteTable(con, "mytemp", data.frame(rnum = rows_to_select), temporary = TRUE)
DBI::dbGetQuery(con,
"select i.* from mytemp m left join iris i on i.id=m.rnum")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
# 1 5.1 3.5 1.4 0.2 setosa 1
# 2 5.0 3.6 1.4 0.2 setosa 5
# 3 5.1 3.5 1.4 0.2 setosa 1
# 4 4.4 2.9 1.4 0.2 setosa 9
# 5 4.9 3.1 1.5 0.1 setosa 10
DBI::dbExecute(con, "drop table mytemp")
# [1] 0
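As an aside, the same placeholder-binding idea carries over to other clients. For example, a minimal sketch of the equivalent in Python's sqlite3 DB-API (illustrative only, with a made-up table):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table iris_demo (id integer, val real)")
con.executemany("insert into iris_demo values (?, ?)", [(1, 5.1), (5, 5.0), (9, 4.4)])

rows_to_select = [1, 5, 9]
qmarks = ",".join("?" * len(rows_to_select))  # "?,?,?"
cur = con.execute(
    f"select * from iris_demo where id in ({qmarks})",  # SQL text and values travel separately
    rows_to_select,
)
print(cur.fetchall())  # [(1, 5.1), (5, 5.0), (9, 4.4)]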

Which is the correct tree in XGBClassifier?

I have trained the following XGBClassifier on a pandas DataFrame:
from xgboost import XGBClassifier

model = XGBClassifier(
    objective='binary:logistic',
    base_score=0.5,
    booster='gbtree',
    colsample_bylevel=1,
    colsample_bynode=1,
    colsample_bytree=1,
    enable_categorical=False,
    gamma=2,
    gpu_id=-1,
    importance_type=None,
    interaction_constraints='',
    learning_rate=0.1,
    max_delta_step=0,
    max_depth=3,
    min_child_weight=7,
    monotone_constraints='(1,1,1,1,1)',
    n_estimators=3,
    n_jobs=1,
    nthread=1,
    num_parallel_tree=1,
    predictor='auto',
    random_state=0,
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    silent=True,
    subsample=0.8,
    tree_method='exact',
    validate_parameters=1,
    pred_contribs=True,
    verbosity=None)
model.fit(X, Y)
The resulting tree looks like this:
As you can see there are 4 leaves:
Leaf3 -> Log odds = -0.13381
Leaf4 -> Log odds = -0.05526
Leaf5 -> Log odds = -0.04303
Leaf6 -> Log odds = 0.00275
My assumption is that the number that we see in the picture are log odds.
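One way to sanity-check that assumption (a minimal sketch, assuming binary:logistic with base_score=0.5, so the raw margin is just the sum of the leaf values a sample lands in across all trees):
import numpy as np
import xgboost as xgb

booster = model.get_booster()
margin = booster.predict(xgb.DMatrix(X), output_margin=True)  # raw log odds per sample
proba = 1.0 / (1.0 + np.exp(-margin))                         # sigmoid of the summed leaf values
# proba should match model.predict_proba(X)[:, 1] up to floating-point error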
Then I use the apply method to check the predicted leaf for every tree for each sample:
model.fit(X, Y)
x = pd.DataFrame((model.apply(X)))
x.to_csv('x.csv')
print(x)
The printout looks like this:
0 1 2
0 6.0 6.0 6.0
1 3.0 3.0 6.0
2 3.0 4.0 3.0
3 6.0 6.0 6.0
4 5.0 5.0 4.0
.. ... ... ...
457 4.0 4.0 6.0
458 6.0 6.0 6.0
459 5.0 5.0 4.0
460 6.0 6.0 5.0
461 3.0 4.0 5.0
The total number of trees is 3 (numbered 0 to 2) because I set n_estimators=3. Is my understanding correct?
Question: which one of these three trees above corresponds to the tree (plot) displayed in the picture above?
I have dumped the tree:
df = model.get_booster().trees_to_dataframe()
print(df)
Which looks like this:
I have highlighted in orange the leaves and gains that correspond to the plot shown above.
Therefore I assume that Tree 0 is the one chosen by the algo to segment the dataset.
Now, when I merge the dataframe I used to train the XGBClassifier with the apply results for tree 0 only, I obtain a dataframe that contains the probabilities predicted by model.predict_proba(X) and a column called Leaf holding the leaves 3, 4, 5, 6, as expected. The problem is that the probability column contains only TWO distinct values, whereas I was expecting FOUR (one for each leaf).
Why is that? I expected exactly one probability to be assigned to each leaf.
How can I figure out which segmentation the algorithm chose to assign a leaf to each record? Where can I find the actual segmentation, and how can I create a column in the training dataframe that contains the correct leaf?

Python - Looping through dataframe using methods other than .iterrows()

Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns=['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)
x0 and x1 represent the starting and ending position of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string; if the difference is larger than 0.1, the character starts a new string, and so on. I need to produce a dataframe of strings and their respective x0 and x1, which I currently do by looping through the dataframe with .iterrows():
string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0'] - x1[-1], 1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns=['String', 'x0', 'x1'])
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there a faster way to achieve this?
You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)
Output
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a boolean column in which each True marks the start of a new group. The cumsum then assigns the same id to every row within a group.
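A tiny illustration of that boolean-to-group-id step (a sketch with made-up values):
import pandas as pd

# True marks the start of a new group; cumsum turns the flags into group ids
flags = pd.Series([False, False, True, False, True, False])
print(flags.cumsum().tolist())  # [0, 0, 1, 1, 2, 2]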
As suggested by @ShubhamSharma, you could do:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column, had to use this because of problems with floating point
grouper = same.cumsum()
The other part remains the same.

Finding significant values from a series

I have a series with an index, where the counts can range from 0 to 1000.
I can select all the entries where the value is greater than 3:
s[s > 3].dropna()
-PB-[variable][variable] 8.0
-[variable] 15.0
-[variable][variable] 6.0
A-[variable][variable] 5.0
B 5.0
B-[variable][variable] 5.0
Book 4.0
Bus 8.0
Date 5.0
Dear 1609.0
MR 4.0
Man[variable] 4.0
Number[variable] 5.0
PM[variable] 4.0
Pickup 12.0
Pump[variable] 5.0
RJ 9.0
RJ-[variable]-PB-[variable][variable] 6.0
Time[variable] 6.0
[variable] 103.0
[variable][variable] 15.0
But after looking at the data, I decided to select all the entries where the value is more than 10 instead, because some values are significantly higher than the others. So I refined my query to something like this:
s[s > 10].dropna()
-[variable] 15.0
Dear 1609.0
Pickup 12.0
[variable] 103.0
[variable][variable] 15.0
Is there any function in pandas to return the significant entries? I can sort in descending order and select the first 5 or 10, but there is no guarantee that those entries will be much higher than the average; in that case I would prefer to select all of the entries.
In other words, I decided on the threshold of 10 in this case after looking at the data. Is there any method to select that value programmatically?
Selecting a threshold value with the quantile method might be a better solution, but it is still not the exact answer.
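For example, something along these lines (just a sketch of that idea, using made-up numbers):
import pandas as pd

# hypothetical counts standing in for the series in the question
s = pd.Series([8, 15, 6, 5, 5, 4, 8, 1609, 12, 103])

# option 1: keep everything above the 90th percentile
print(s[s > s.quantile(0.90)])

# option 2: keep everything more than 2 standard deviations above the mean
print(s[s > s.mean() + 2 * s.std()])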
You can use the .head function, which selects the top 5 rows by default, together with .sort_values to sort first. If you want the top 10, pass 10 to head.
Since s is a Series, simply call:
s[s > 10].sort_values(ascending=False, kind='quicksort').head(10)

Filter a file in .csv format with pandas [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 4 years ago.
I tried to filter my data with pandas but have not succeeded. I converted my data to a .csv file and did the following:
import pandas as pd
data = pd.read_csv("test3.csv")
print (type(data))
print(data)
By doing this I get my table:
<class 'pandas.core.frame.DataFrame'>
    C1   C2   C3   C4   C5
0  1.0  2.0  3.0  4.0  5.0
1  2.0  3.0  4.0  5.0  6.0
2  3.0  4.0  5.0  6.0  7.0
3  4.0  5.0  6.0  7.0  8.0
Now I need Python to print the rows in which the columns meet a condition, for example the rows for which all the columns are < 4.0; the idea is that I have a condition for each column. I tried this, but it does not work:
for item in data:
    fil_C1 = (data["C1"]) == 4.0
    print(fil_C1)
please help me!!!
If you need to retain the rows where the value of column C1 is less than 4, try the following:
less_than_4 = data[data['C1'] < 4]
print(less_than_4)
If you have multiple conditions, say C1 less than 4 and C5 greater than 5, try this:
mul_conditions = data[(data['C1'] < 4) & (data['C5'] > 5)]
print(mul_conditions)
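If you literally want the rows for which every column is below 4.0 (the example mentioned in the question), one possible approach (a sketch, assuming all columns are numeric) is:
# keep only rows where the elementwise comparison is True in every column
all_below_4 = data[(data < 4.0).all(axis=1)]
print(all_below_4)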
Let me know how it goes.