How to add return values from a function into a dataframe column? - pandas

I want to add multiple return values from a function into one column. For example, here I am splitting the combined hours and minutes into the separate variables "hour" and "minutes", and now I want to add these values into one data frame column.
def issuetimesplit(x):
    hour = str(x)[:-2]
    minutes = str(x)[-2:]
    time = print(f"{hour} : {minutes}")
    return time

for i, v in timeData.items():
    issuetimesplit(v)

Let me sum the comments up.
You can't use print to define a string variable; print returns None.
In the function, the new string can be returned immediately, so the time variable is not needed (though defining it is not a mistake).
You can use the apply() function; that could be a solution to your problem.
The code is as follows:
# import pandas
import pandas as pd

# define function
def make_new_time_feature(row):
    hour = row['old_time_feature'][:-2]
    minutes = row['old_time_feature'][-2:]
    return f'{hour} : {minutes}'

# call apply
your_df['new_time_feature'] = your_df.apply(make_new_time_feature, axis=1)
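If old_time_feature is stored as strings, a vectorized alternative avoids the row-wise apply entirely (a sketch under that assumption):
# sketch: assumes 'old_time_feature' holds strings like "930" or "1245"
s = your_df['old_time_feature'].astype(str)
your_df['new_time_feature'] = s.str[:-2] + ' : ' + s.str[-2:]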


Interpolate values based in date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
For each group of values on the first df, X-Axis Value, Y-Axis Value, where the first one is the date and the second one is a value, I would like to create rows with the same date. For instance, for df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, in the following columns of the same row there may be other dates with different Y-Axis Values associated, the first one in that specific row being X-Axis Value NCVE-064 HPNDE with a timestamp 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values over a time interval, maybe 10 minutes, and then merge those results so each row has the same date.
For df2 it is more or less the same: interpolate the values over a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed with respect to some reference time.
Without further context, the hardest part is deciding at what times you want the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d

df = pd.read_excel(
    "https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
    sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]

# What times do we want to align the columns to?
# You can use anything else here, e.g. equally spaced time points.
target_times = df[x_columns].min(axis=1)

def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is to
    # compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # Repeat for our target times.
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)

output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
    target_times,
    df["X-Axis Value NCVE-063 VPNDE"],
    df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop
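That final step can indeed be done in a loop, for example (a sketch assuming every "X-Axis ..." column has a matching "Y-Axis ..." column with the same suffix):
for x_col in x_columns:
    y_col = x_col.replace("X-Axis", "Y-Axis")
    output_df[y_col] = interpolate_column(target_times, df[x_col], df[y_col])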

python - if-else in a for loop processing one column

I am interested in looping through a column to convert it into a processed series.
Below is an example of a two-row, four-column data frame:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils

data = [
    ['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome', 1, 'ac nephritis nephrotic syndrome'],
    ['sternocleidomastoid contracture', 'sternocleidomastoid contracture', 0, "NA"],
]

# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns=['diagnosis_name', 'diagnosis_name_edited', 'is_spell_corrected', 'spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1; otherwise, use diagnosis_name_edited.
At the moment, I have the following code that directly uses the diagnosis_name_edited column. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str))  # generator of processed strings
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)
Thank you.
If I understand you correctly, try out this fast solution using numpy.where:
import numpy as np

df_diagnosis['new_column'] = np.where(
    df_diagnosis['is_spell_corrected'] > 1,
    df_diagnosis['spell_corrected_value'],
    df_diagnosis['diagnosis_name_edited'],
)
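You can then feed the combined column into your existing processing step, e.g. (a sketch reusing rapid_utils.default_process from your snippet):
unmapped_processed_diagnosis = pd.Series(
    rapid_utils.default_process(d) for d in df_diagnosis['new_column'].astype(str)
)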

How to define a round function, like pandas round, that executes in one line of code

Goal
Only one line to execute.
I am referring to the round function from this post, but I want to use it like df.round(2), which changes the affected columns and keeps the order of the data, without requiring me to select float or int columns first.
df.applymap(myfunction) raises TypeError: must be real number, not str, which means I have to select the type first.
Try
I read the round source code but could not understand how to change my function.
Firstly, get the columns whose values are floats:
cols = df.select_dtypes('float').columns
Finally:
df[cols] = df[cols].agg(round, ndigits=2)
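If you literally want one line, the two steps can be combined (a sketch):
df[df.select_dtypes('float').columns] = df.select_dtypes('float').agg(round, ndigits=2)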
If you want to make the changes in the function itself, add an if/else condition:
from numpy import ceil, floor

def float_round(num, places=2, direction=ceil):
    if isinstance(num, float):
        return direction(num * (10 ** places)) / float(10 ** places)
    else:
        return num

out = df.applymap(float_round)
With the error message you mention, it's likely the column is already a string and needs to be converted to a numeric type.
Now assuming the column is numeric, there are a few ways you could implement custom rounding functions that don't require reimplementing the .round() method of a dataframe object.
With the requirements you laid out above, we want a way to round a data frame that:
fits on one line
doesn't require selecting numeric types
There are two functionally equivalent ways to do this. One is to treat the dataframe as an argument to a function that is safe for numpy arrays.
Another is to use the apply method (explanation here), which applies a function to each row or column.
import pandas as pd
import numpy as np
from numpy import ceil

# generate a 100x10 dataframe with a null value
data = np.random.random(1000) * 10
data = data.reshape(100, 10)
data[0, 0] = np.nan
df = pd.DataFrame(data)

# change the data type of the second column
df[1] = df[1].astype(int)

# verify the dtypes are different
print(df.dtypes)

# taken from the other stack post
def float_round(num, places=2, direction=ceil):
    return direction(num * (10 ** places)) / float(10 ** places)

# method 1 - use the dataframe as an argument
result1 = float_round(df)
print(result1.head())

# method 2 - apply
result2 = df.apply(float_round)
print(result2)
Because apply works row- or column-wise, you can add logic to your rounding function to ignore non-numeric columns. For instance:
# taken from the other stack post
def float_round(num, places=2, direction=ceil):
    # skip object (e.g. string) columns
    if num.dtype == 'O':
        return num
    return direction(num * (10 ** places)) / float(10 ** places)

# this will work even with object columns, where method 1 would fail
result2 = df.apply(float_round)
print(result2)
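To see the difference, add a hypothetical string column to the example frame above: the apply version passes it through unchanged, while calling float_round(df) directly would raise a TypeError when ceil hits the strings.
df['label'] = 'some text'  # hypothetical object column
print(df.apply(float_round).dtypes)  # 'label' passes through unchanged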

Empty cells when using an apply function

So I am trying to calculate a value into a new column from one column or another, based on which one has data available. This is the code I have right now. It doesn't seem to notice when there is no data present and always goes to the "else" statement. My dataframe is an imported Excel file. Thanks for any advice!
def create_sulfide_col(row):
    if row["Sulphate-S(HCL Leachable)_%S"] is None:
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = row["Total-S_%S"] - row["Sulphate-S_%S"]
    return val

df["Sulphide-S(calc)-C_%S"] = df.apply(lambda row: create_sulfide_col(row), axis='columns')
This can be done by using numpy.where:
import numpy as np

df['newcol'] = np.where(
    df["Sulphate-S(HCL Leachable)_%S"].isna(),
    df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"],
    df["Total-S_%S"] - df["Sulphate-S_%S"],
)
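Note: if the intent is to fall back to the other sulphate column when the HCL-leachable one is missing, the two value branches likely need to be swapped. A sketch of that assumption:
df["Sulphide-S(calc)-C_%S"] = np.where(
    df["Sulphate-S(HCL Leachable)_%S"].isna(),
    df["Total-S_%S"] - df["Sulphate-S_%S"],                 # fallback when the HCL column is NaN
    df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"],  # preferred calculation
)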

Generate programmed data & create a data frame out of it (generated data into a single column)

The initial context: I was using a for loop to generate some random data (using the logic shown below) and then writing that data to a key ('server_avg_response_time') in a dictionary ('data_dict'). Finally, the whole list ('data_rows') of dictionaries is written to CSV.
Code snippet for generating random data:
server_avg_response_time_alert = "low"
for i in range(0, no_of_rows):
    if random.randint(0, 10) != 7 and server_avg_response_time_alert != "high":
        data_dict['server_avg_response_time'] = random.randint(1, 800000)
    else:
        if server_avg_response_time_alert == "low":
            print("***ALERT***")
            server_avg_response_time_alert = "high"
        data_dict['server_avg_response_time'] = random.randint(600000, 800000)
        server_avg_response_time_period = random.randint(1, 1000)
        if server_avg_response_time_period > 980:
            print("***ALERT OFF***")
            server_avg_response_time_alert = "low"
    data_rows.insert(i, data_dict.copy())
This takes a lot of time (to generate some 300,000 rows of data), and hence I was asked to look at pandas (to generate the data quickly). Now I am trying to apply the same logic to a pandas dataframe.
Question: If I put the above code in a function, can't I use that function to generate the data for a column of the dataframe? What is the best way to write this data into a column of a dataframe? I believe I don't need a dictionary (key) either if I put the data directly into the dataframe after generating it randomly, but I don't know the syntax to do it.
Try wrapping your logic (everything inside the for loop) in a function, then apply it to an empty pandas df with one column called 'server_avg_response_time' (with 300,000 rows), like this, using the apply method:
def randomLogic(value):
    random_value = 0  # logic goes here
    return random_value

df = pd.DataFrame(np.zeros(300000), columns=['server_avg_response_time'])
df['server_avg_response_time'] = df.server_avg_response_time.apply(randomLogic)
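If raw speed is the goal, the simple random case can also be generated fully vectorized with numpy (a sketch; the alert state machine above is sequential, so it is not covered here):
import numpy as np
import pandas as pd

n = 300000
# generate all response times at once instead of one per loop iteration
df = pd.DataFrame({'server_avg_response_time': np.random.randint(1, 800001, size=n)})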