Replace pyspark column based on other columns - pandas

In my "data" dataframe, I have 2 columns, 'time_stamp' and 'hour'. I want to insert 'hour' column values where 'time_stamp' values is missing. I do not want to create a new column, instead fill missing values in 'time_stamp'
What I'm trying to do is replace this pandas code to pyspark code:
data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1)

Something like this should work:
from pyspark.sql import functions as f
df = df.withColumn('time_stamp',
                   f.expr('case when time_stamp is null then hour else time_stamp end'))
Alternatively, if you'd rather avoid the SQL expression:
df = df.withColumn('time_stamp',
                   f.when(f.col('time_stamp').isNull(), f.col('hour')).otherwise(f.col('time_stamp')))
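Since this is just a null fallback, f.coalesce gives the same result more compactly; a minimal sketch using the same column names:
df = df.withColumn('time_stamp', f.coalesce(f.col('time_stamp'), f.col('hour')))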

Related

Set DateTime to index and then sum over a day

I would like to change the index of my dataframe to datetime so that I can sum the column "Heizung" over a day, but it doesn't work.
After I set the new index, I'd like to use resample to sum over a day.
Here is an extract from my dataframe.
Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
1;25.04.21 12:58:54;21.8;1
2;25.04.21 12:59:06;21.9;1
3;25.04.21 12:59:18;21.9;1
4;25.04.21 12:59:29;21.9;1
5;25.04.21 12:59:41;22.0;1
6;25.04.21 12:59:53;22.0;1
7;25.04.21 13:00:05;22.1;1
8;25.04.21 13:00:16;22.1;0
9;25.04.21 13:00:28;22.1;0
10;25.04.21 13:00:40;22.1;0
11;25.04.21 13:00:52;22.2;0
12;25.04.21 13:01:03;22.2;0
13;25.04.21 13:01:15;22.2;1
14;25.04.21 13:01:27;22.2;1
15;25.04.21 13:01:39;22.3;1
16;25.04.21 13:01:50;22.3;1
17;25.04.21 13:02:02;22.4;1
18;25.04.21 13:02:14;22.4;1
19;25.04.21 13:02:26;22.4;0
20;25.04.21 13:02:37;22.4;1
21;25.04.21 13:02:49;22.4;0
22;25.04.21 13:03:01;22.4;0
23;25.04.21 13:03:13;22.5;0
24;25.04.21 13:03:25;22.4;0
This is my code
import pandas as pd
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab1 = Tab[["DatumZeit","Erdtemp","Heizung"]].copy()
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'])
Tab1.plot(x='DatumZeit', figsize=(20, 5),subplots=True)
#Tab1.index.to_datetime()
#Tab1.index = pd.to_datetime(Tab1.index)
Tab1.set_index('DatumZeit')
Tab.info()
Tab1.resample('D').sum()
print(Tab1.head(10))
This is how we can set the index to a DatetimeIndex and then resample by day ('D') to sum a column over it.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1 = Tab1.set_index('DatumZeit')  # set_index is not in place, so assign it back; this was the missing step
Tab1.resample('D').Heizung.sum()
If we don't want to set the index explicitly, another way to resample is pd.Grouper.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum()
If we want the output to be a DataFrame, we can use the to_frame method.
Tab1 = Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum().to_frame()
Output
            Heizung
DatumZeit
2021-04-25       15
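One caveat, since only a sample of the file is shown: the timestamps look like DD.MM.YY, so it can be safer to pass the format explicitly rather than rely on pandas' inference:
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'], format='%d.%m.%y %H:%M:%S')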
Pivot tables to the rescue:
import pandas as pd
import numpy as np
Tab1.pivot_table(index=["DatumZeit"], values=["Heizung"], aggfunc=np.sum)
If you need to do it by setting the index first, you need to use inplace=True on set_index:
Tab1.set_index("DatumZeit", inplace=True)
Just note that if you do it this way, you can't go back to a pivot table. In the end, it's whatever works best for you.
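For completeness, a minimal end-to-end sketch built only from the sample rows shown in the question (assuming the real file keeps the same ';'-separated layout):
import io
import pandas as pd

raw = """Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
1;25.04.21 12:58:54;21.8;1
8;25.04.21 13:00:16;22.1;0"""
Tab1 = pd.read_csv(io.StringIO(raw), delimiter=';', usecols=['DatumZeit', 'Erdtemp', 'Heizung'])
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'], dayfirst=True)  # day-first dates
Tab1.set_index('DatumZeit', inplace=True)   # in place, as noted above
print(Tab1.resample('D')['Heizung'].sum())  # daily sum of Heizung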

drop records based on multiple column values using pyspark

I have a pyspark dataframe like the one below:
I want to keep only one record when the two columns uniq_id and date_time have the same values.
Expected output:
I want to achieve this using pyspark.
Thank you
You can group by uniq_id and date_time and use first()
from pyspark.sql import functions as F
df.groupBy("uniq_id", "date_time").agg(F.first("col_1"), F.first("col_2"), F.first("col_3")).show()
I don't quite see how you compare an int column with a timestamp one (though it can be done by casting the timestamp to a numeric type), but such filtering can be done via:
from pyspark.sql import functions as F
# assume you already have your DataFrame
df = df.filter(F.col('first_column_name') == F.col('second_column_name'))
or just
df = df.filter('first_column_name = second_column_name')
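If one of the columns really is a timestamp, a sketch of the cast mentioned above (int_col and ts_col are hypothetical column names, not from the question):
df = df.filter(F.col('int_col') == F.col('ts_col').cast('long'))  # casting a timestamp to long gives Unix epoch seconds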

How to filter in rows where any column is null in pyspark dataframe

It has to be somewhere on stackoverflow already but I'm only finding ways to filter the rows of a pyspark dataframe where 1 specific column is null, not where any column is null.
import pandas as pd
import pyspark.sql.functions as F
my_dict = {"column1":list(range(100)),"column2":["a","b","c",None]*25,"column3":["a","b","c","d",None]*20}
my_pandas_df = pd.DataFrame(my_dict)
sparkDf = spark.createDataFrame(my_pandas_df)
sparkDf.show(5)
I'm trying to include any row with null values on any column of my dataframe, basically the opposite of this:
sparkDf.na.drop()
To keep only the rows that have a null in any column:
sparkDf.filter(F.greatest(*[F.col(i).isNull() for i in sparkDf.columns])).show(5)
To exclude them instead:
sparkDf.na.drop(how='any').show(5)
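An equivalent way to express "any column is null" (a sketch, assuming the same sparkDf) is to OR the per-column null checks together with functools.reduce:
from functools import reduce
any_null = reduce(lambda a, b: a | b, [F.col(c).isNull() for c in sparkDf.columns])
sparkDf.filter(any_null).show(5)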

Adding additional rows of missing combination in pandas dataframe

I have a column ColD which holds the names of the other columns [ColA, ColB, ColC], and I want to add additional rows for the missing combinations. My dataframe looks like below:
Original Data
import pandas as pd
data={'colA':[0,0,0],'ColB':[0,0,0] ,'ColC':[0,0,0],'ColD':['ColA','ColA','ColB'],'Target':[1,1,1]}
df=pd.DataFrame(data)
print(df)
I need the resulting df to be:
data={'colA':[0,0,0,0,0,0,0,0,0],'ColB':[0,0,0,0,0,0,0,0,0] ,'ColC':[0,0,0,0,0,0,0,0,0],'ColD':['ColA','ColB','ColC','ColA','ColB','ColC','ColB','ColA','ColC'],'Target':[1,0,0,1,0,0,1,0,0]}
df=pd.DataFrame(data)
print(df)
Resulting Data needed
Given that the contents of ColA, ColB, and ColC are irrelevant and you just want to repeat the values in ColD and Target, it becomes a plain dict comprehension; nothing pandas-specific about it.
data={'colA':[0,0,0],'ColB':[0,0,0] ,'ColC':[0,0,0],'ColD':['ColA','ColA','ColB'],'Target':[1,1,1]}
df=pd.DataFrame(data)
pd.DataFrame({
    k: v * 3 if k not in ["Target", "ColD"]
       else [1, 0, 0] * 3 if k == "Target"
       else ["ColA", "ColB", "ColC"] * 3
    for k, v in data.items()
})
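If you'd rather stay in pandas, a sketch under the assumption that Target should be 1 exactly where ColD matches the candidate column name: cross-merge each original row with the list of candidate columns and recompute Target (how='cross' needs pandas 1.2+).
import pandas as pd

data = {'colA': [0, 0, 0], 'ColB': [0, 0, 0], 'ColC': [0, 0, 0], 'ColD': ['ColA', 'ColA', 'ColB'], 'Target': [1, 1, 1]}
df = pd.DataFrame(data)
candidates = pd.DataFrame({'ColD_new': ['ColA', 'ColB', 'ColC']})   # one row per candidate column
out = df.drop(columns='Target').merge(candidates, how='cross')      # 3 candidate rows per original row
out['Target'] = (out['ColD'] == out['ColD_new']).astype(int)        # 1 only for the original combination
out = out.drop(columns='ColD').rename(columns={'ColD_new': 'ColD'})
print(out)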

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np
#Build random dataframe
df = pd.DataFrame(np.random.randint(0, 40, size=10),
                  columns=["Random"],
                  index=pd.date_range("20200101", freq='6h', periods=10))
df["Random2"] = np.random.randint(70,100,size=10)
df["Random3"] = 2
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)
#Setup groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index","Date"],axis=1,inplace = True)
#Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()
#random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]
#Loop by unique group name
for date in df.index.get_level_values('Date').unique():
    #I can print the data I want
    print(any_func(df.loc[date])[0])
    #But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions where possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]
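If the goal is just "the first row's computed value per date", a fully vectorized sketch with no loop may be simpler; this is a self-contained example that mimics the column names from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Random": np.random.randint(0, 40, size=10),
    "Random2": np.random.randint(70, 100, size=10),
    "Random3": 2,
}, index=pd.date_range("20200101", freq='6h', periods=10).strftime('%Y-%m-%d'))
df.index.name = "Date"

df["Value"] = df["Random"] * df["Random2"] / df["Random3"]      # vectorized version of any_func
df2 = df.groupby("Date")["Value"].first().to_frame("newValue")  # first value per date group
print(df2)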