Python altair - facet line plot with multiple variables

I have the following kind of DataFrame
Marque Annee Modele PVFP PM
0 A 1 Python 70783.066836 2.067821e+07
1 A 2 Python 75504.270716 1.957717e+07
2 A 3 Python 66383.237169 1.848982e+07
3 A 4 Python 61966.851675 1.755261e+07
4 A 5 Python 54516.367597 1.671907e+07
5 A 1 Sol 66400.686091 2.067821e+07
6 A 2 Sol 74953.770294 1.955218e+07
7 A 3 Sol 66500.916446 1.844078e+07
8 A 4 Sol 62016.941237 1.748098e+07
9 A 5 Sol 54356.008414 1.662684e+07
10 B 1 Python 43152.461787 1.340989e+07
11 B 2 Python 62397.794144 1.494418e+07
12 B 3 Python 1871.135251 2.178552e+06
I tried to build a facet chart but without really succeeding: I'm only able to concatenate the two generated charts vertically. I would be grateful for any idea on how to do it properly in one operation.
My current code:
import altair as alt

chart = alt.Chart(euro).mark_line().encode(
    x='Annee',
    y='PVFP',
    color='Modele'
).properties(
    width=150,
    height=150
).facet(
    facet='Marque',
    columns=3
)

chart2 = alt.Chart(euro).mark_line().encode(
    x='Annee',
    y='PM',
    color='Modele'
).properties(
    width=150,
    height=150
).facet(
    facet='Marque',
    columns=3
)

chart & chart2

One good way to do this is to use a Fold Transform to fold your two columns into one, and then you can use row and column facets to facet by both variables at once. For example:
alt.Chart(euro).transform_fold(
    ['PVFP', 'PM'], as_=['key', 'value']
).mark_line().encode(
    x='Annee:Q',
    y='value:Q',
    color='Modele:N'
).properties(
    width=150,
    height=150
).facet(
    column='Marque:N',
    row='key:N'
)
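One caveat worth checking, not part of the original answer: in the sample data PVFP is around 7e4 while PM is around 2e7, so a shared y scale would flatten the PVFP row of facets. If that happens, resolve_scale can give each facet its own y domain:

alt.Chart(euro).transform_fold(
    ['PVFP', 'PM'], as_=['key', 'value']
).mark_line().encode(
    x='Annee:Q',
    y='value:Q',
    color='Modele:N'
).properties(
    width=150,
    height=150
).facet(
    column='Marque:N',
    row='key:N'
).resolve_scale(
    y='independent'  # each facet picks its own y domain
)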

Filling previous value by field - Pandas apply function filling None

I am trying to fill each row of a new column (Previous time) with a value from the previous row of a specific subset (when a condition is met). The thing is, if I interrupt the kernel and check the values, it is OK. But if it runs to the end, then all rows in the new column are filled with None. If a previous row doesn't exist, then I fill it with the first value.
Name First round Previous time
Runner 1 2 2
Runner 2 5 5
Runner 3 5 5
Runner 1 6 2
Runner 2 8 5
Runner 3 4 5
Runner 1 2 6
Runner 2 5 8
Runner 3 5 4
What I tried:
df.insert(loc=2, column="Previous time", value=999)

def fce(arg):
    runner = arg[0]
    stat = arg[1]
    if stat == 999:
        # I used this to avoid filling all rows in a new column again for the same runner
        first = df.loc[df['Name'] == runner, "First round"].iloc[0]
        df.loc[df['Name'] == runner, "Previous time"] = df.loc[df['Name'] == runner]["First round"].shift(1, fill_value=first)

df["Previous time"] = df[['Name', "Previous time"]].apply(fce, axis=1)
Conduct a groupby shift for each Name and fill the missing values with the original series.
df['Previous time'] = (df.groupby('Name')['First round']
.shift()
.fillna(df['First round'], downcast='infer'))
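Broken into steps, the same logic reads:

# previous First round per runner; the first row of each runner becomes NaN
prev = df.groupby('Name')['First round'].shift()

# where there is no previous row, fall back to that row's own First round value
df['Previous time'] = prev.fillna(df['First round'], downcast='infer')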
The problem is that your function fce returns None for every row, so the Series produced by df[['Name', "Previous time"]].apply(fce, axis=1) is a Series of None values.
That is, instead of overwriting the DataFrame with df.loc inside the function, you would need to return the value to fill for that position. Unfortunately, that is impossible here, since you would then need to know which indices you have already calculated.
A better way to do it is to use groupby. This is more natural, since you want to perform an action on each group. If you use apply after a groupby and return a Series, you in fact define a value for each row. Just remember to remove the extra "Name" index level that groupby adds.
def fce(g):
    first = g["First round"].iloc[0]
    return g["First round"].shift(1, fill_value=first)

df["Previous time"] = df.groupby("Name").apply(fce).reset_index("Name", drop=True)
Thank you very much. Can you please answer one more question? How does this work with a groupby on multiple columns, if I want to return the mean of all rounds for a specific runner and sleeping time before the race?
Expected output:
Name First round Sleep before race Mean
Runner 1 2 8 4
Runner 2 5 7 6
Runner 3 5 8 5
Runner 1 6 8 4
Runner 2 8 7 6
Runner 3 4 9 4.5
Runner 1 2 9 2
Runner 2 5 7 6
Runner 3 5 9 4.5
This does not work for me.
def last_season(g):
    aa = g["First round"].mean()

df["Mean"] = df.groupby(["Name", "Sleep before race"]).apply(g).reset_index(["Name", "Sleep before race"], drop=True)

pandas create Cross-Validation based on specific columns

I have a dataframe of few hundreds rows , that can be grouped to ids as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV , but with a custom CV that will assure that all the rows from the same ID will always be on the same set.
So either all the rows of a are in the test set, or all of them are in the train set - and so on for all the different IDs.
I want to have 5 folds - so 80% of the ids will go to the train and 20% to the test.
I understand that it can't guarantee that all folds will have exactly the same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator, and GroupShuffleSplit() can produce one. For example, once you use it to split your dataset, you can pass the result to GridSearchCV() as the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation first in future.
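A sketch of how that could look; the estimator, its parameter grid, and the target column y are placeholders, not from the question:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

X = df[['Val1', 'Val2', 'Val3']]
y = df['target']   # hypothetical target column
groups = df['Id']

# 5 splits, each holding out 20% of the Ids as the test set
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(),                 # placeholder estimator
    param_grid={'n_estimators': [50, 100]},   # placeholder grid
    cv=gss.split(X, y, groups=groups),        # iterable of (train, test) index arrays
)
grid.fit(X, y)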
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. across multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
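A sketch under the same placeholder assumptions as above; when a splitter object is passed as cv, the group labels go to fit() instead:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

gkf = GroupKFold(n_splits=5)   # every Id lands in exactly one test fold

grid = GridSearchCV(
    RandomForestClassifier(),                 # placeholder estimator
    param_grid={'n_estimators': [50, 100]},   # placeholder grid
    cv=gkf,
)
grid.fit(X, y, groups=df['Id'])   # groups are forwarded to GroupKFold.split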

Fill Empty Pandas Dataframe Using Loop Method

I am currently working with some telematics data where the trip id is missing. A trip id is unique, and one trip id covers multiple rows of data (e.g. gps coordinates, temp, voltage, rpm, timestamp, engine status on or off). The data pattern indicates the times when the engine status goes on and off, which can be clustered into a unique trip id. However, I am having difficulty translating this logic into code to generate these tripIds.
I tried a few pandas loop methods but keep failing.
import pandas as pd
inp = [{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''}]
test = pd.DataFrame(inp)
print (test)
Approach Taken
import numpy as np

n = 1
for index, row in test.iterrows():
    test['tripID'] = np.where(test['Ignition_Status'] == 'ON', n, n)
    n = n + 1
Expected Result
Use series.eq() to check for OFF and series.shift() with series.cumsum():
test = test.assign(tripID=test.Ignition_Status.eq('OFF')
                              .shift(fill_value=False).cumsum().add(1))
Ignition_Status tripID
0 ON 1
1 ON 1
2 ON 1
3 OFF 1
4 ON 2
5 ON 2
6 ON 2
7 ON 2
8 ON 2
9 OFF 2
10 ON 3
11 OFF 3
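For intuition, here is the same pipeline split into steps (a sketch over the test frame above):

ends = test.Ignition_Status.eq('OFF')    # True on the row that closes a trip
starts = ends.shift(fill_value=False)    # shifted down: True on the first row after an OFF
test['tripID'] = starts.cumsum() + 1     # running count of completed trips, 1-based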

Python Pandas groupby and join

I am fairly new to python pandas and cannot find the answer to my problem in any older posts.
I have a simple dataframe that looks something like that:
dfA = {'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915, ...],
       'seq': ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A', ...]}
Now I want to merge the 'seq' values within each group where the difference between the next and previous value of 'stop' is equal to 1. When the difference is large, like between 5 and 1610, that is where the next cluster begins, and so on.
What I need is to write all values from each cluster into separate rows:
0 BBDAC #join 'stop' cluster 1-5
1 CABAC #join 'stop' cluster 1610-1614
2 A.... #join 'stop' cluster 2915-...
etc...
What I am getting with my current code is like:
True BDACABAC...
False BCA...
for the entire huge dataframe.
I understand the logic behind the way it merges them, which is meeting the condition I specified (not perfect, losing the cluster edges), but I am running out of ideas on how to get it joined and split properly into clusters, rather than over all rows of the dataframe.
Please see my code below:
dfB = dfA.groupby((dfA.stop - dfA.stop.shift(1) == 1))['seq'].apply(lambda x: ''.join(x)).reset_index()
Please help.
P.S. I have also tried various combinations with diff() but that didn't help either. I am not sure if groupby is any good for this solution as well. Please advise!
dfC = dfA.groupby((dfA['stop'].diff(periods=1)))['seq'].apply(lambda x: ''.join(x)).reset_index()
This somehow split the dataframe into smaller, cluster-like chunks, but I do not understand the logic behind the way it did it, and I know the result makes no sense and is not what I intended to get.
I think you need to create a helper Series for grouping:
g = dfA['stop'].diff().ne(1).cumsum()
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print (dfC)
stop seq
0 1 BBDAC
1 2 CABAC
2 3 A
Details:
First get differences by diff:
print (dfA['stop'].diff())
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 1605.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1301.0
Name: stop, dtype: float64
Compare by ne (!=) for first values of groups:
print (dfA['stop'].diff().ne(1))
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
Name: stop, dtype: bool
And last, create the groups by cumsum:
print (dfA['stop'].diff().ne(1).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
Name: stop, dtype: int32
I just figured it out.
I managed to floor the values of 'stop' down to the nearest 100 and assigned the result as a new column.
Then my previous code is working....
Thank you so much for quick answer though.
dfA['new_val'] = (dfA['stop'] / 100).astype(int) *100

OverflowError: Python int too large to convert to C long - Matplotlib

I have a pretty simple dataframe, as seen below. I am trying to manipulate the x-axis (dates) so it starts at 1996-31-12 and ends at 2016-31-12 in increments of 365 days.
Dataframe:
Date A B
1996-31-12 10 3
1997-31-03 5 6
1997-31-07 7 5
1997-30-11 3 12
1997-31-12 4 10
1998-31-03 5 8
.
.
.
2016-31-12 3 9
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# change date string to datetime variable
df12.Date = pd.to_datetime(df12.Date)

fig, ax = plt.subplots()
ax.plot_date(df12.Date, df12.A)
ax.plot_date(df12.Date, df12.B)
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.grid(True, which="major")
ax.yaxis.grid()
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
plt.tight_layout()
plt.show()
I am getting an error message when I try to run the above code, and I am not sure what it means: OverflowError: Python int too large to convert to C long. Anyone know what this means? If not, is there another way to do what I want to do?
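One guess, offered without a full traceback to confirm it: this error usually means a tick locator is being asked to generate ticks over an absurdly large range, which can happen when the date column is misparsed. Dates like 1996-31-12 are in year-day-month order, which pd.to_datetime will not infer reliably, so parsing with an explicit format is worth trying first:

# assumes the year-day-month layout shown in the sample data
df12['Date'] = pd.to_datetime(df12['Date'], format='%Y-%d-%m')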