I have this set of data:
[1] 0013-021 0013-022 0013-031 0013-032 0013-033 0013-034
Levels: 0013-021 0013-022 0013-031 0013-032 0013-033 0013-034
[1] 61.42608 64.95802 31.51387 45.11971 43.66110 63.68363
[1] 28.92506 32.58015 11.86372 16.22164 36.23264 40.54977
Factor w/ 6 levels "0013-021","0013-022",..: 1 2 3 4 5 6
num [1:6] 61.4 65 31.5 45.1 43.7 ...
num [1:6] 28.9 32.6 11.9 16.2 36.2 ...
I would like to have a plot as below:
The plot is not from R, but other software. I would like the same plot to be from R. Appreciate any help.
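I tried the code below and it works:
library(ggplot2); library(reshape2)
df <- data.frame(TestSystems, Availability, Utilization)
# assumed step: df.long is not defined above; reshape2::melt yields the variable/value columns used below
df.long <- melt(df, id.vars = "TestSystems")
p <- ggplot(df.long, aes(TestSystems, value, fill = variable)) + geom_bar(stat = "identity", position = "dodge")
How do I display those 12 values at the top of each bar?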
I have a pandas dataframe like this
id f1 f2 f3 f4
1 8.327 9.905 8.133 0.785
2 3.549 0.452 7.798 5.797
3 0.011 0.238 1.291 7.593
4 0.325 0.792 4.643 4.3
5 7.093 7.312 3.641 9.88
6 2.88 7.834 5.727 6.984
7 5.554 1.649 4.018 0.623
8 2.501 2.941 9.323 0.565
9 1.032 6.961 3.905 8.116
10 9.68 7.922 7.015 7.542
11 8.096 4.344 1.153 5.244
I would like to filter it using conditions stored in a second dataframe, keeping only the records that satisfy all of the following conditions.
variable interval
1 f1 (0,4)
2 f2 [1,3]
3 f3 (5,+np.inf)
4 f4 [0,10]
I know I can achieve this with the following code.
df.query('f1>0 and f1<4 and f2>=1 and f2<=3 and f3>5 and f4>=0 and f4<=10')
# or
The downside is that I need to modify the code if the conditions change. Is there a pythonic way to handle this issue?
You can dynamically construct a query using:
left_op = lambda y: '>=' if y.closed_left else '>'
right_op = lambda y: '<=' if y.closed_right else '<'
construct_query = lambda x, y: f"({x}{left_op(y)}{y.left} and {x}{right_op(y)}{y.right})"
qry = " and ".join(
    df2.apply(lambda x: construct_query(x.variable, x.interval),
              axis=1).tolist()
)
where df2 is your second dataframe with variable and interval columns.
For your example data qry looks like:
'(f1>0 and f1<4) and (f2>=1 and f2<=3) and (f3>5 and f3<inf) and (f4>=0 and f4<=10)'
Now if you do df.query(qry) it will give:
id f1 f2 f3 f4
7 8 2.501 2.941 9.323 0.565
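If you would rather not build a query string at all, boolean masks can do a similar job. The sketch below is an alternative to the query approach above, not the same technique. It assumes the interval column holds pd.Interval objects (the answer does not show how df2 is built), and interval_mask is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 12),
    "f1": [8.327, 3.549, 0.011, 0.325, 7.093, 2.88, 5.554, 2.501, 1.032, 9.68, 8.096],
    "f2": [9.905, 0.452, 0.238, 0.792, 7.312, 7.834, 1.649, 2.941, 6.961, 7.922, 4.344],
    "f3": [8.133, 7.798, 1.291, 4.643, 3.641, 5.727, 4.018, 9.323, 3.905, 7.015, 1.153],
    "f4": [0.785, 5.797, 7.593, 4.3, 9.88, 6.984, 0.623, 0.565, 8.116, 7.542, 5.244],
})

# Assumed construction of the condition dataframe: one pd.Interval per variable.
df2 = pd.DataFrame({
    "variable": ["f1", "f2", "f3", "f4"],
    "interval": [
        pd.Interval(0, 4, closed="neither"),       # (0,4)
        pd.Interval(1, 3, closed="both"),          # [1,3]
        pd.Interval(5, np.inf, closed="neither"),  # (5,+inf)
        pd.Interval(0, 10, closed="both"),         # [0,10]
    ],
})


def interval_mask(series, interval):
    """Boolean mask of the values in `series` that fall inside `interval`."""
    lower = series >= interval.left if interval.closed_left else series > interval.left
    upper = series <= interval.right if interval.closed_right else series < interval.right
    return lower & upper


# One mask per condition row, combined with a logical AND.
masks = [interval_mask(df[var], iv) for var, iv in zip(df2["variable"], df2["interval"])]
result = df[np.logical_and.reduce(masks)]
print(result)
```

Changing the conditions then only means editing df2; the filtering code stays the same.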
I have a dataframe with several variables:
Depth Temperature ... Ay Az
Time ...
2017-09-25 21:46:05 23.0 7.70 ... 0.054688 -0.691406
2017-09-25 21:46:10 24.5 6.15 ... 0.148438 -0.742188
2017-09-25 21:46:15 27.5 4.10 ... -0.078125 -0.875000
2017-09-25 21:46:20 29.0 2.55 ... 0.144531 -0.664062
2017-09-25 21:46:25 30.0 2.45 ... 0.343750 -0.886719
[5 rows x 6 columns]
I want to resample every 24H, select 1) the maximum Depth within 24H, 2) the value of temperature that corresponds to that maximum depth 3) the 24H mean for the last two columns, Ay and Az.
So far I have used the code below, which works, but I would like to combine the last two lines into one if possible.
tagdata_dailydepthmax = tagdata.resample('24H').apply(lambda tagdata: tagdata.loc[tagdata.Depth.idxmax()])
tagdata_dailydepthmax.Ay = tagdata['Ay'].resample('24H').mean()
tagdata_dailydepthmax.Az = tagdata['Az'].resample('24H').mean()
You can try this; it computes the mean for multiple columns at once:
tagdata_dailydepthmax[['Ay','Az']] = tagdata[['Ay','Az']].resample('24H').mean()
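For reference, here is a small self-contained sketch of the whole task on made-up data. It swaps resample(...).apply(...) for groupby with pd.Grouper plus idxmax, so it is a variation on the code above rather than the same calls. The values and the daily frequency "D" are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data: three readings on each of two days.
tagdata = pd.DataFrame(
    {
        "Depth": [23.0, 30.0, 27.5, 10.0, 45.0, 12.0],
        "Temperature": [7.70, 2.45, 4.10, 9.00, 1.50, 8.00],
        "Ay": [0.1, 0.3, 0.2, -0.1, 0.5, 0.2],
        "Az": [-0.7, -0.9, -0.8, -0.6, -0.4, -0.2],
    },
    index=pd.to_datetime([
        "2017-09-25 06:00", "2017-09-25 12:00", "2017-09-25 18:00",
        "2017-09-26 06:00", "2017-09-26 12:00", "2017-09-26 18:00",
    ]),
)
tagdata.index.name = "Time"

by_day = tagdata.groupby(pd.Grouper(freq="D"))

# Timestamp of the deepest reading in each day.
deepest_time = by_day["Depth"].idxmax()

# Depth and Temperature taken from those rows, re-labelled by day.
deepest = tagdata.loc[deepest_time.to_numpy(), ["Depth", "Temperature"]]
deepest.index = deepest_time.index

# Daily means of Ay and Az, joined on the day.
daily = deepest.join(by_day[["Ay", "Az"]].mean())
print(daily)
```

Note that idxmax needs at least one row per day; empty days would have to be handled separately.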
Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
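import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    # assumed loop body (missing above): each character is 1.0 wide, with a 0.1 gap
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns = ['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)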
x0 and x1 represent the starting and ending position of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between the x0 of a character and the x1 of the previous character is 0.1, the two characters belong to the same string. If the difference is larger than 0.1, the character starts a new string. I need to produce a dataframe of strings and their respective x0 and x1, which I currently do by looping through the dataframe using .iterrows():
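string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        # first character: start the first string
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    elif round(row['x0']-x1[-1],1) == 0.1:
        # gap of 0.1: extend the current string
        string[-1] += row['Character']
        x1[-1] = row['x1']
    else:
        # assumed branch (missing above): a larger gap starts a new string
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns = ['String', 'x0', 'x1'])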
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there any other faster way to achieve this?
You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a True or False column, where every time a True appears it means a new group needs to be created. The cumsum will take care of assigning the same id to each group.
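To see the boolean-then-cumsum idea end to end, here is a minimal runnable sketch. The way the frame is built (text, rows) is an assumption made to reproduce the simplified dataset, and the rounding variant of the comparison is used instead of the 0.00001 tolerance.

```python
import pandas as pd

# Rebuild the simplified dataset: each character is 1.0 wide with a 0.1 gap,
# and spaces take up a slot but are dropped from the frame.
text = "This is a sample."
rows = [
    (c, round(i * 1.1, 1), round(i * 1.1 + 1, 1))
    for i, c in enumerate(text)
    if c != " "
]
df = pd.DataFrame(rows, columns=["Character", "x0", "x1"])

# Gap between this character's start and the previous character's end
# (NaN for the first row, which compares as False below).
gap = df["x0"] - df["x1"].shift()

# True wherever a new string starts; rounding absorbs floating-point noise.
new_group = gap.round(3).gt(0.1)

# Running count of True values: every row of one string shares the same id.
grouper = new_group.cumsum()

res = df.groupby(grouper).agg({"Character": "".join, "x0": "first", "x1": "last"})
print(grouper.tolist())
print(res)
```

Printing grouper next to the characters makes it easy to check that each True bumps the group id by one.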
As suggested by @ShubhamSharma, you could do:
# flag rows whose gap to the previous character exceeds 0.1; rounding avoids the floating point problem
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column
grouper = same.cumsum()
The other part remains the same.
I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
a b c d e
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
a b c d e
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
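In more recent pandas releases, grouped objects expose a shift method directly, which avoids both the Python-level lambda and the reshape. This is a different technique from the unstack approach above. The tiny frame below is made-up data, and no timing is claimed for the 718k-row case.

```python
import pandas as pd

# Made-up stacked series: pool 1 has three months, pool 2 has two.
index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1)],
    names=["pool", "month"],
)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 10.0, 20.0]}, index=index)

# Lag every column by one row within each pool; the first row of each
# pool has no predecessor and becomes NaN.
lagged = df.groupby(level="pool").shift(1)
print(lagged)
```

Use shift(-1) instead if you want the lead (next month's value) that the original lambda computed.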
I have two-column data:
9 17.52
11 29.77
7 62.75
11 36.15
7 30.46
7 52.5
9 65.26
9 90.05
14 101.87
12 86.88
15 74.78
I want the first column plotted as a histogram against the y2 axis, and the second column plotted as a line against the y1 axis. Does anyone have any ideas?
Maybe I didn't understand your question correctly, but are you possibly looking for something like this:
set style fill solid border -1
set boxwidth 0.4
plot "Data.dat" u 2 w boxes t "boxes", "" u (column(0)):1 t "lines" w l