I have two dataframes: the first (df1) contains only dates, and the second (df2) contains dates and values (see the example below).
I want to create a third dataframe (df3) with the dates from df1, where the values from df2 are averaged over the period between consecutive dates from df1.
Unfortunately, the dates do not follow any regular pattern.
example:
df1 <- data.frame(
  Date = c(
    "1986-01-15",
    "1986-01-20",
    "1986-01-24",
    "1986-01-28"
  )
)
View(df1)
df2 <- data.frame(
  Date = c(
    "1986-01-16",
    "1986-01-17",
    "1986-01-20",
    "1986-01-21",
    "1986-01-22",
    "1986-01-23",
    "1986-01-24",
    "1986-01-27",
    "1986-01-28"
  ),
  Value = c(6, 8, 1, 7, 3, 9, 1, 4, 5)
)
View(df2)
I hope someone can help. Thank you!
The result should be:
1986-01-15 NA
1986-01-20 (6+8+1)/3 = 5
1986-01-24 (7+3+9+1)/4 = 5
1986-01-28 (4+5)/2 = 4.5
How can I convert the table below to a table with columns ["ID", "PC1_0.1", "PC1_0.2", "PC1_0.3", ..., "PC10_111.2"] and only 24 rows? Each row may have the same wafer ID (meaning the same wafer is used repeatedly), and some wafers have no recorded data.
I hope this code works for you :)
import numpy as np
import pandas as pd

d = {
    "ID": ["W-01"]*4 + ["W-02"]*2,
    "Time": ["t1", "t2"]*3,
    "PC1": ["00", "10", "20", "30", "40", "50"],
    "PC2": ["01", "11", "21", "31", "41", "51"],
}
df = pd.DataFrame(d)

# Melt so we can group on Time/PC1/PC2, then pivot
melt = df.melt(id_vars=["ID", "Time"], value_vars=["PC1", "PC2"])
melt["no"] = np.arange(melt.shape[0])
pivot = melt.pivot(index=["no", "ID"], columns=["Time", "variable"], values="value")

# Combine the non-NaN columns, because the pivot introduces NaN entries
# for the combinations that are not present.
con = pd.DataFrame()
for col in range(pivot.columns.size):
    part = pivot.iloc[:, [col]].dropna()
    part = part.reset_index().drop("no", axis=1).set_index("ID")
    con = pd.concat([con, part], axis=1)
I'm trying to parse HTML tables from the page ukwtv.de into pandas DataFrames.
The challenge is that a single table actually combines 2 or even 3 tables:
the TV program name and SID as df1,
Kanal, Standort, etc. as df2,
Technische Details as df3.
Here is what I have managed to achieve so far:
import pandas as pd

table_MN = pd.read_html('https://www.ukwtv.de/cms/deutschland-tv/schleswig-holstein-tv.html',
                        thousands='.', decimal=',')

df1 = table_MN[1]
df1.columns = df1.columns.str.replace(" ", "_")
df1.columns = df1.columns.str.replace("\n", "_")
df1 = df1.iloc[:7, :]
for col in df1.columns:
    print(col)
    if '.' in col:
        df1.drop(col, axis=1, inplace=True)
df1.dropna(subset=["TV-_und_Radio-Programme_des_Bouquets"], axis=0, inplace=True)
df1.head(15)

df2 = table_MN[1]
df2.columns = df2.iloc[7]
df2 = df2.iloc[8:, :]
df2 = df2.reset_index(drop=True)
df2.head(20)
Issues I still have a problem solving:
Row 7 is hardcoded; how do I recognize the blank line so the data can be split into two dataframes? (See the sketch after this list.)
The Technische Details column in df1 needs to be converted into a separate dataframe where Modulation, Guardintervall, ... are the Series names.
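Not part of the original attempt, but a minimal sketch of one way to find the separator row instead of hardcoding 7, assuming the break shows up as an all-NaN row after read_html (the small raw frame below is a made-up stand-in for table_MN[1]):
import pandas as pd

# Made-up stand-in for table_MN[1]: two blocks separated by an all-NaN row.
raw = pd.DataFrame({
    0: ["Programm A", "Programm B", None, "Kanal", "K 23"],
    1: ["1", "2", None, "Standort", "Ort X"],
})

# Boolean mask of rows where every cell is NaN; idxmax() gives the first True.
blank = raw.isna().all(axis=1)
split_at = blank.idxmax() if blank.any() else len(raw)

df1 = raw.iloc[:split_at]                              # block above the blank row
df2 = raw.iloc[split_at + 1:].reset_index(drop=True)   # block below the blank row
df2.columns = df2.iloc[0]                              # its first row becomes the header
df2 = df2.iloc[1:].reset_index(drop=True)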
I have two dataframes. One holds the base values (df) and the other holds an offset (df2).
How do I create a third dataframe that is the first dataframe offset by matching values (the ID) in the second dataframe?
This post doesn't seem to do the offset... Update only some values in a dataframe using another dataframe
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
ID Price
0 1092 10.02
1 18723754 15.76 # no offset to affect it
2 28635 147.87
>>> print (df2)
ID Offset
0 1092 100.00
1 28635 1000.00
2 88273 10.00 # < no match
This is what I want to produce: the Price has been offset where the IDs match.
ID Price
0 1092 110.02
1 18723754 15.76
2 28635 1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don't want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I have come is df_add = df.reindex_like(df2) + df2, but the problem is that it sums all columns, even the ID column.
Try this:
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)
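Not from the original answer, but an equivalent sketch that avoids the intermediate merge, using the same df and df2 built in the question above:
# Offsets indexed by ID, so map() can look them up; unmatched IDs give NaN, treated as 0.
offset_by_id = df2.set_index('ID')['Offset']
df['Price'] = df['Price'] + df['ID'].map(offset_by_id).fillna(0)
print(df)
#          ID    Price
# 0      1092   110.02
# 1  18723754    15.76
# 2     28635  1147.87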
So I have a table (~2000 rows, call it df1) of when a particular subject received a medication on a particular date, and I have a large Excel file (>1 million rows) of weight data for subjects on different dates (call it df2).
AIM: I want to group by subject and find the weight in df2 that was recorded closest to the medication admin time in df1, using sqldf (because the tables are too big to load into R). Alternatively, I can set up a time frame of interest (e.g. +/- 1 week of the medication being given) and find a row that falls within that timeframe.
Example:
df1 <- data.frame(
  PtID = rep(c(1:5), each = 2),
  Dose = rep(seq(100, 200, 25), 2),
  ADMIN_TIME = seq.Date(as.Date("2016/01/01"), by = "month", length.out = 10)
)
df2 <- data.frame(
  PtID = rep(c(1:5), each = 10),
  Weight = rnorm(50, 50, 10),
  Wt_time = seq.Date(as.Date("2016/01/01"), as.Date("2016/10/31"), length.out = 50)
)
So I think I want to left_join df1 and df2, group by PtID, and set up some condition that identifies either the df2$Weight closest to df1$ADMIN_TIME, or a df2$Weight within an acceptable range around df1$ADMIN_TIME, using SQL formatting.
So I tried creating a range and then querying the following:
library(dplyr)
library(lubridate)
library(sqldf)

df1 <- df1 %>%
  mutate(ADMIN_START = ADMIN_TIME - ddays(30),
         ADMIN_END = ADMIN_TIME + ddays(30))

# df2.csv is the large spreadsheet saved in my working directory
result <- read.csv.sql("df2.csv", sql = "select Weight from file
                                          left join df1
                                          on file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
This will run, but it never returns anything and I have to escape out of it. Any thoughts are appreciated.
Thanks!
Let df1, df2, and df3 be pandas DataFrames having the same structure but different numerical values. I want to perform, element-wise:
res = (df2-df3)/(df1-1) where df1 > 1.0, else df3
res should have the same structure as df1, df2, and df3.
numpy.where() returns the result as a plain array.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["paramter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["paramter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(4, 2), columns=list('xy'))
df2 = pd.DataFrame(np.random.randn(4, 2), columns=list('xy'))
df3 = pd.DataFrame(np.random.randn(4, 2), columns=list('xy'))

res = df3.copy()
res[:] = np.where(df1 > 1, (df2 - df3) / (df1 - 1), df3)
x y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2 0.324390 1.222632
3 -0.138606 0.955993
Note that this should work on both Series and DataFrames. The [:] is slicing syntax that preserves the index and columns. Without it, res would come out as an array rather than a Series or DataFrame.
Alternatively, for a series you could write as #Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
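For what it's worth, a DataFrame-only sketch (not from the thread) that keeps the index and columns without going through numpy, assuming df1, df2, and df3 as defined above, uses pandas' mask:
# Where df1 > 1, replace df3's values with (df2 - df3) / (df1 - 1); elsewhere keep df3.
res = df3.mask(df1 > 1, (df2 - df3) / (df1 - 1))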
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust the values for your condition, using .loc so the assignment happens on the original dataframe rather than on a temporary copy:
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)