Divide the time series data into groups using SQL

We have battery usage data, mainly voltage readings.
The voltage declines as the days go by; after a few days the battery is changed and the voltage jumps back to its maximum value.
In this dataset, I would like to flag the 3 cycles during which the battery is used, from the maximum voltage down to the minimum voltage, with a SQL clause if possible.
voltage    date
4.2        2022-1-1
4.1        2022-1-10
4.0        2022-1-20
3.8        2022-1-23
3.6        2022-2-3
4.1        2022-2-5
4.0        2022-2-7
3.9        2022-2-25
3.8        2022-3-12
4.2        2022-3-15
4.1        2022-3-20
4.0        2022-3-23
3.5        2022-3-30
After the operation, the dataset should be like this:
voltage    date         type
4.2        2022-1-1     1
4.1        2022-1-10    1
4.0        2022-1-20    1
3.8        2022-1-23    1
3.6        2022-2-3     1
4.1        2022-2-5     2
4.0        2022-2-7     2
3.9        2022-2-25    2
3.8        2022-3-12    2
4.2        2022-3-15    3
4.1        2022-3-20    3
4.0        2022-3-23    3
3.5        2022-3-30    3
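One way to do this with window functions (a minimal sketch; the table name battery_log is a placeholder, and it assumes a database with window-function support such as PostgreSQL, MySQL 8, or SQLite 3.25+): flag every row where the voltage rises compared with the previous reading as the start of a new cycle, then take a running count of those starts.

WITH flagged AS (
    SELECT voltage,
           date,
           -- 1 marks the start of a new cycle: voltage rose versus the previous reading
           CASE WHEN voltage > LAG(voltage) OVER (ORDER BY date)
                THEN 1 ELSE 0 END AS new_cycle
    FROM battery_log
)
SELECT voltage,
       date,
       -- running count of cycle starts; add 1 so numbering starts at 1
       1 + SUM(new_cycle) OVER (ORDER BY date) AS type
FROM flagged
ORDER BY date;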

Related

Bootstrapping each column in a DF and replacing the column values with the bootstrap samples

I'm looking to bootstrap each column in a dataframe 1000 times and then replace the few values in each column with the 1000 bootstrapped samples, so that each column then has 1000 rows. Does anyone have an idea how I could code that, so that from the data frame on top I end up with the data frame on the bottom, with all 1000 sampled values? Thank you!
Col 1.   Col 2.
1.       3.
4.       5.
7.       1.
1.       9.

Col 1.   Col 2.
1.       3.
4.       5.
1.       5.
7.       1.
1.       9.
1.       1.
1.       5.
7.       1.
...n=1000   ...n=1000
Assuming you want to randomly sample Col1 and Col2 independently, with replacement:
import numpy as np

n = 1000
out = (
    df[['Col 1.']]
    # bootstrap Col 1.: draw n rows with replacement and reset the index
    .sample(n=n, replace=True, ignore_index=True)
    # bootstrap Col 2. independently and attach it as a new column
    .assign(**{'Col 2.': np.random.choice(df['Col 2.'], size=n, replace=True)})
)
print(out)
Output:
Col 1. Col 2.
0 7.0 1.0
1 1.0 5.0
2 1.0 9.0
3 1.0 1.0
4 7.0 9.0
.. ... ...
995 1.0 5.0
996 4.0 9.0
997 4.0 9.0
998 1.0 1.0
999 4.0 9.0
[1000 rows x 2 columns]

Average of Moving Averages using multiple partitions

I would like to create an average of the individual moving averages per team. Each player's moving average is their own and does not depend on which team they were on that day (see the example).
I have a good understanding of how to do a moving average for a single player, but not how to combine several that occur in different rows.
One idea I had was to first merge every row of each team into one row, but that does not seem ideal. Can I partition over two columns to accomplish this?
Bonus question: is it possible to weight the players differently in their individual moving averages depending on stat B (Example 2)?
For example:
Team A average = AVG(MA_statA_player1, MA_statA_player2, & MA_statA_player3)
Example 2:
Team A average = AVG(MA_player1*stat_b, MA_player2*stat_b, & MA_player3*stat_b)
I have data like below:
Team   ID        date       stat A   stat B
1      player1   5-31-2022  2.5      0.1
1      player2   5-31-2022  2.9      0.5
1      player3   5-31-2022  5        0.3
2      player10  5-31-2022  6        0.75
2      player12  5-31-2022  2.5      0.2
3      player10  6-01-2022  2.5      0.12
3      player2   6-01-2022  2.5      0.85
Example of expected data: each row is a team on a date with the team's moving average. The individual moving averages do not need to be in the output; they are shown to illustrate how the team average is generated.
No weight: Average_team = (ma_playerX + ma_playerY)/2
Team   date       ma_playerX   ma_playerY   Average_team
1      5-31-2022  3.2          2.5          2.85
2      5-31-2022  5.6          2.9          4.25
3      6-01-2022  2.5          5            2.25
Each player's own moving average over the last 7 games can be computed with a window function partitioned by player:
AVG(stat_A) OVER (PARTITION BY player ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7games
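To combine that with the per-team average, the per-player moving averages can be computed first and then aggregated by team and date (a minimal sketch; the table name player_stats and the underscored column names stat_A/stat_B are placeholders for the question's schema):

WITH player_ma AS (
    SELECT Team,
           ID,
           date,
           stat_B,
           -- each player's own 7-game moving average, independent of team
           AVG(stat_A) OVER (PARTITION BY ID ORDER BY date
                             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_stat_A
    FROM player_stats
)
SELECT Team,
       date,
       -- unweighted average of the individual moving averages for the players on that team that day
       AVG(ma_stat_A) AS Average_team
       -- for the weighted bonus question, AVG(ma_stat_A * stat_B) could be used instead
FROM player_ma
GROUP BY Team, date
ORDER BY date, Team;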

How to read all tables from a SQLite database and store as datasets/variables in R?

I have a large SQLite database with many tables. I have established a connection to this database in RStudio using RSQLite and DBI packages. (I have named this database db)
library(RSQLite)
library(DBI)
At the moment I have to read in all the tables and assign them a name manually. For example:
country <- dbReadTable(db, "country")
date <- dbReadTable(db, "date")
#...and so on
You can see this can be a very time-consuming process if you have many tables.
So I was wondering whether it is possible to write a new function, or use existing functions (e.g. lapply()?), to do this more efficiently and essentially speed up the process.
Any suggestions are much appreciated :)
Two mindsets:
All tables/data into one named-list:
alldat <- lapply(setNames(nm = dbListTables(db)), dbReadTable, conn = db)
The benefit of this is that if the tables have similar meaning, then you can use lapply to apply the same function to each. Another benefit is that all data from one database are stored together.
See How do I make a list of data frames? for working on a list-of-frames.
If you want them as actual variables in the global (or enclosing) environment, then take the previous alldat, and
ign <- list2env(alldat, envir = .GlobalEnv)
The return value from list2env is the environment that we passed to list2env, so it's not incredibly useful in this context (though it is useful other times). The only reason I capture it into ign is to reduce the clutter on the console ... which is minor. list2env works primarily in side-effect, so the return value in this case is not critical.
You can use dbListTables() to generate a character vector of all the table names in your SQLite database and use lapply() to import them into R efficiently. I would first check that you are able to fit all the tables in your database into memory.
Below is a reproducible example:
library(RSQLite)
library(DBI)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(db, "mtcars", mtcars)
dbWriteTable(db, "iris", iris)
db_tbls <- dbListTables(db)
tbl_list <- lapply(db_tbls, dbReadTable, conn = db)
tbl_list <- setNames(tbl_list, db_tbls)
dbDisconnect(db)
> lapply(tbl_list, head)
$iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
$mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Resampling a dataframe based on depth column

I have two dataframes whose key is depth. One has more than 2k values, the other only 100, but the min and max depths are the same. I would like to upsample the small dataframe (which has only one other column) to the same size as the bigger one and repeat the value of that column between two depths.
I've tried using concatenation and resampling, but I'm stuck when trying to match depths, since the two dataframes' depth values are not exactly the same.
I have this:
df_small:
depth Litholog
0 38.076 2.0
1 39.546 2.0
2 41.034 4.0
3 55.133 3.0
4 69.928 2.0
and this:
df_big:
depth
0 21.3360
1 35.2044
2 37.6428
3 41.7576
4 41.9100
5 48.7680
6 53.1876
7 56.0832
8 58.3692
9 62.1792
I would like this:
df_result:
depth Litholog
0 21.3360 2
1 35.2044 2
2 37.6428 2
3 41.7576 4
4 41.9100 4
5 48.7680 4
6 53.1876 4
7 56.0832 3
8 58.3692 3
9 62.1792 2
I tried several approaches but without success. Many thanks to all.
If the sample data can be changed so that both frames share the same min and max depth, you can use merge_asof:
# sample data changed so df_small spans the same min/max depth as df_big
print (df_small)
depth Litholog
0 21.3360 2.0
1 39.5460 2.0
2 41.0340 4.0
3 55.1330 3.0
4 62.1792 2.0
import pandas as pd

df = pd.merge_asof(df_big, df_small, on='depth')
print (df)
depth Litholog
0 21.3360 2.0
1 35.2044 2.0
2 37.6428 2.0
3 41.7576 4.0
4 41.9100 4.0
5 48.7680 4.0
6 53.1876 4.0
7 56.0832 3.0
8 58.3692 3.0
9 62.1792 2.0

pandas dataframe transformation partial sums

I have a pandas dataframe
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
from this I would like to create a dataframe B
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than brute-force?
Cheers, Mike
This will give you a series with the partial sums (integer division puts rows 1-5 in bin 0, rows 6-10 in bin 1, and so on):
df['bin'] = (df.index - 1) // 5
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%s-%s' % (5*i + 1, 5*(i + 1)) for i in bin_sums.index]