How to access dask dataframe index value in map_partitions? - pandas

I am trying to use dask dataframe map_partitions to apply a function that accesses the value in the dataframe index row-wise and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index = ["row0" , "row1","row2","row3","row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy=str(df.index)), meta={'index_copy': 'U'})
res.compute()
I am expecting df.index to be the value in the row index, not the entire partition index, which is what it seems to refer to. From the doc here, this works well for columns but not for the index.

What you want to do is this:
df.index = ['row' + str(x) for x in df.index]
For that, first create your pandas dataframe and then run this code; after that you will have your expected result.
Let me know if this works for you.
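If the goal is to give every row its own index value in a new column inside map_partitions, one possible sketch (reusing the toy frame from the question, and not necessarily the only approach) is to convert the partition index element-wise with .astype(str) instead of wrapping the whole index in str():
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
ddf = dd.from_pandas(df, npartitions=2)
# .astype(str) converts the index label by label, so assign puts one value per row,
# whereas str(df.index) stringifies the entire partition index at once
res = ddf.map_partitions(lambda part: part.assign(index_copy=part.index.astype(str)),
                         meta={'index_copy': 'object'})
print(res.compute())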

Related

Concatenate row values in Pandas DataFrame

I have a problem with Pandas' DataFrame object.
I have read the first Excel file and I have a DataFrame like this:
First DataFrame
And read the second Excel file like this:
Second DataFrame
I need to concatenate the rows so that it looks like this:
Third DataFrame
I have code like this:
import pandas as pd
import numpy as np
x1 = pd.ExcelFile("x1.xlsx")
df1 = pd.read_excel(x1, "Sheet1")
x2 = pd.ExcelFile("x2.xlsx")
df2 = pd.read_excel(x2, "Sheet1")
result = pd.merge(df1, df2, how="outer")
The second df just follows the first df. How can I get a dataframe laid out like the third one?
merge does not concatenate the dfs the way you want; use append instead (note that DataFrame.append is deprecated in recent pandas versions, so the concat form below is the safer choice).
ndf = df1.append(df2).sort_values('name')
You can also use concat:
ndf = pd.concat([df1, df2]).sort_values('name')
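For illustration, a small self-contained sketch with made-up frames (the original Excel files are not shown, so the column names here are assumptions; both frames are assumed to share a 'name' column, as the sort_values call suggests):
import pandas as pd
# stand-ins for the two Excel sheets
df1 = pd.DataFrame({'name': ['a', 'c'], 'value_1': [1, 3]})
df2 = pd.DataFrame({'name': ['b', 'a'], 'value_2': [20, 10]})
# stack the rows of both frames on top of each other, then order them by name
ndf = pd.concat([df1, df2]).sort_values('name')
print(ndf)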

Set DateTime to index and then sum over a day

I would like to change the index of my dataframe to datetime so that I can sum the column "Heizung" over a day.
But it doesn't work.
After I set the new index, I would like to use resample to sum over a day.
Here is an extract from my dataframe.
Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
1;25.04.21 12:58:54;21.8;1
2;25.04.21 12:59:06;21.9;1
3;25.04.21 12:59:18;21.9;1
4;25.04.21 12:59:29;21.9;1
5;25.04.21 12:59:41;22.0;1
6;25.04.21 12:59:53;22.0;1
7;25.04.21 13:00:05;22.1;1
8;25.04.21 13:00:16;22.1;0
9;25.04.21 13:00:28;22.1;0
10;25.04.21 13:00:40;22.1;0
11;25.04.21 13:00:52;22.2;0
12;25.04.21 13:01:03;22.2;0
13;25.04.21 13:01:15;22.2;1
14;25.04.21 13:01:27;22.2;1
15;25.04.21 13:01:39;22.3;1
16;25.04.21 13:01:50;22.3;1
17;25.04.21 13:02:02;22.4;1
18;25.04.21 13:02:14;22.4;1
19;25.04.21 13:02:26;22.4;0
20;25.04.21 13:02:37;22.4;1
21;25.04.21 13:02:49;22.4;0
22;25.04.21 13:03:01;22.4;0
23;25.04.21 13:03:13;22.5;0
24;25.04.21 13:03:25;22.4;0
This is my code
import pandas as pd
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab1 = Tab[["DatumZeit","Erdtemp","Heizung"]].copy()
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'])
Tab1.plot(x='DatumZeit', figsize=(20, 5),subplots=True)
#Tab1.index.to_datetime()
#Tab1.index = pd.to_datetime(Tab1.index)
Tab1.set_index('DatumZeit')
Tab.info()
Tab1.resample('D').sum()
print(Tab1.head(10))
This is how we can set the index to a datetime column and then resample it by 'D' and sum a column over it.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1 = Tab1.set_index('DatumZeit')  ## this assignment was missing in your code
Tab1.resample('D').Heizung.sum()
If we don't want to set the index explicitly, another way to resample is pd.Grouper.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum()
If we want the output to be a dataframe, we can use the to_frame method.
Tab1 = Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum().to_frame()
Output:
            Heizung
DatumZeit
2021-04-25       15
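Putting the pieces above together, a rough end-to-end sketch for the sample data (assuming the file path from the question, and parsing the timestamps day-first since they look like 25.04.21):
import pandas as pd
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab1 = Tab[["DatumZeit", "Erdtemp", "Heizung"]].copy()
# the timestamps look like day.month.year, so parse them day-first
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'], dayfirst=True)
# set_index returns a new frame, so keep the result (or pass inplace=True)
Tab1 = Tab1.set_index('DatumZeit')
print(Tab1.resample('D').Heizung.sum())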
Pivot tables to the rescue:
import pandas as pd
import numpy as np
Tab1.pivot_table(index=["DatumZeit"], values=["Heizung"], aggfunc=np.sum)
If you need to do it by setting the index first, you need to use inplace=True on set_index:
Tab1.set_index("DatumZeit", inplace=True)
Just note that if you do it this way, you can't go back to a pivot table. In the end, it's whatever works best for you.

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
                  index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get only the first result back (based on the index), like this in pandas:
df.loc[df.col_1 >3].iloc[0]
col_1 col_2
2 4 d
I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result like in SQL?
Got it, but I'm not sure about the efficiency here:
tmp = df.loc[df.col_1 >3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()
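Another possibility, sketched under the assumption that the index is already sorted: dask's head can limit the result to a single row while scanning all partitions, which plays a role similar to LIMIT 1 in SQL:
tmp = df.loc[df.col_1 > 3]
# npartitions=-1 makes head look beyond the first partition, which may be empty after filtering;
# with a monotonically increasing index this returns the same row as iloc[0] would in pandas
first_row = tmp.head(1, npartitions=-1)
print(first_row)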

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np
#Build random dataframe
df = pd.DataFrame(np.random.randint(0,40,size=10),
                  columns=["Random"],
                  index=pd.date_range("20200101", freq='6h', periods=10))
df["Random2"] = np.random.randint(70,100,size=10)
df["Random3"] = 2
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)
#Setup groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index","Date"],axis=1,inplace = True)
#Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()
#random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]
#loop by unique group name
for date in df.index.get_level_values('Date').unique():
    #I can print the data I want
    print(any_func(df.loc[date])[0])
    #But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions if possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]
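For context, a minimal sketch of how that fix slots into the original loop (same names and frames as in the question):
#assign each group's value by label instead of overwriting the whole column every iteration
for date in df.index.get_level_values('Date').unique():
    df2.loc[date, "newValue"] = any_func(df.loc[date])[0]
df2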

How to extract the unique values and their counts from a column and store them in a dataframe with an index key

I am new to pandas. I have a simple question:
how do I extract the unique values and their counts from a column and store them in a dataframe with an index key?
I have tried to:
df = df1['Genre'].value_counts()
and I am getting a Series, but I don't know how to convert it to a DataFrame object.
A pandas Series has a .to_frame() method. Try it:
df = df1['Genre'].value_counts().to_frame()
And if you wanna "switch" the rows to columns:
df = df1['Genre'].value_counts().to_frame().T
Update: Full example if you want them as columns:
import pandas as pd
import numpy as np
np.random.seed(400) # To reproduce random variables
df1 = pd.DataFrame({
    'Genre': np.random.choice(['Comedy','Drama','Thriller'], size=10)
})
df = df1['Genre'].value_counts().to_frame().T
print(df)
Returns:
       Thriller  Comedy  Drama
Genre         5       3      2
try
df = pd.DataFrame(df1['Genre'].value_counts())
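If the goal is to end up with the genre labels and their counts as ordinary columns rather than as the index, one more possible sketch is to name the index and reset it (reusing the random df1 from the example above; the 'count' column name is just illustrative):
import pandas as pd
import numpy as np
np.random.seed(400)
df1 = pd.DataFrame({
    'Genre': np.random.choice(['Comedy', 'Drama', 'Thriller'], size=10)
})
# value_counts gives a Series indexed by the genre labels; rename_axis names that index,
# and reset_index turns it into a regular column next to the counts
df = df1['Genre'].value_counts().rename_axis('Genre').reset_index(name='count')
print(df)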