Defining a function in Pandas

I am new to Pandas and I am taking an online course. I know there is a way to define a function to make this code cleaner, but I'm not sure how to go about it.
noshow = len(df[
    (df['Gender'] == 'M')
    & (df['No_show'] == 'Yes')
    & (df['Persons_age'] == 'Child')
])
noshow
There are multiple genders, multiple No_show answers, and multiple person's ages, and I don't want to have to write out the code for each one of those.
I've gotten the code working for a single case but not for the multiple iterations.
def print_noshow_percentage(column_name, value, percentage_text):
    total = (df[column_name] == value).sum()
    noshow = len(df[(df[column_name] == value) & (df['No_show'] == 'Yes')])
    print(int((noshow / total) * 100), percentage_text)
I hope this makes sense. Thanks for any help!

Welcome to Stack Exchange. You are not too clear about your desired output, but I think what you are trying to do is get a summary of every possible combination of age, gender, and No_show in your df. To accomplish this you can use pandas' built-in groupby method (see the pandas documentation).
As mentioned by @ALollz, the following code will get you everything you need to know about your counts in terms of percentages.
counts = df.groupby(['Gender', 'Persons_age'])['No_show'].value_counts(normalize=True)
Now you need to decide what to do with it. You can either iterate through the result printing each line, look up specific combinations, or print out the whole thing.
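For example, here is a minimal sketch of both options, using made-up sample data with the question's column names (counts is a Series with a three-level index):
import pandas as pd

# Made-up sample data using the question's column names
df = pd.DataFrame({
    'Gender':      ['M', 'M', 'F', 'F', 'M'],
    'Persons_age': ['Child', 'Adult', 'Child', 'Child', 'Child'],
    'No_show':     ['Yes', 'No', 'Yes', 'No', 'No'],
})

counts = df.groupby(['Gender', 'Persons_age'])['No_show'].value_counts(normalize=True)

# Look up one specific combination ...
print(counts.loc[('M', 'Child', 'Yes')])

# ... or print every combination as a percentage.
for (gender, age, answer), fraction in counts.items():
    print(gender, age, answer, int(fraction * 100), '%')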
In general, it is better to look for a built in method than to try to build a function outside of pandas. There are a lot of different ways to do things and checking the documentation is a good place to start.

Related

Can anyone tell me what's wrong with my code? (I am a newbie in programming, please cooperate)

I am trying to write code that calculates the HCF of two numbers, but I am either getting an error or an empty list as my answer.
I was expecting the HCF. My idea was to get the factors of the two given numbers, find the common ones, and then take the max of those.
For future reference, do not attach screenshots. Instead, copy your code and put it into a code block, because Stack Overflow supports code blocks. To start a code block, write three backticks (```), and write three more backticks to close it. If you add a language name like python or javascript after the first three backticks, syntax highlighting will be enabled. I would also use a more descriptive title that more accurately describes the problem at hand. It would look like so:
Title: How to print from 1-99 in python?
for i in range(1, 100):
    print(i)
To answer your question, it seems that your HCF list is empty, and Python's max function expects its argument (here, the HCF list) not to be empty. From inspection of your code, this is because the two if conditions that need to be satisfied before anything is added to HCF are never satisfied.
So it could be that hcf2[x] is never in hcf and hcf[x] is never in hcf2.
What I would do is extract the logic for finding the factors of each number into a function, then use built-in Python functions to find the common elements between the lists. Like so:
num1 = int(input("Num 1:"))  # inputs
num2 = int(input("Num 2:"))  # inputs

# defining a function that finds the factors of a number and returns them as a list
def findFactors(number):
    temp = []
    for i in range(1, number + 1):
        if number % i == 0:
            temp.append(i)
    return temp

numberOneFactors = findFactors(num1)  # populating factors 1 list
numberTwoFactors = findFactors(num2)  # populating factors 2 list

# to find common factors we can use the built-in Python set functions;
# the intersection method finds the common elements of two sets.
commonFactors = list(set(numberOneFactors).intersection(numberTwoFactors))
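The code above stops at the common factors; since the original goal was the HCF, the remaining step (my addition, not part of the answer above) is to take the maximum of that list, which Python's math.gcd also gives directly:
import math

hcf = max(commonFactors)            # highest common factor from the list above
assert hcf == math.gcd(num1, num2)  # math.gcd computes the same thing directly
print("HCF:", hcf)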

Pyspark: Filter DF based on columns, then run every subset DF through a function

I am new to Pyspark and am a bit confused about how to think about this problem.
I have a large dataframe and I would like to filter down every subset of that dataframe based on two columns and run it through the same algorithm.
Here is an example of how I run it (extremely inefficiently) now:
for letter in ['a', 'b', 'c']:
    for number in [1, 2, 3]:
        filtered_DF_1, filtered_DF_2 = filter_func(DF_1, DF_2, letter, number)
        process_function(filtered_DF_1, filtered_DF_2)
Basic filter function:
def filter_func(DF_1, DF_2, letter, number):
    DF_1 = DF_1.filter(
        (F.col("Letter") == letter) &
        (F.col("Number") == number)
    )
    DF_2 = DF_2.filter(
        (F.col("Letter") == letter) &
        (F.col("Number") == number)
    )
    return DF_1, DF_2
Since this is Pyspark, I would like to parallelize it, as each iteration of the function can run independently.
Do I need to do some sort of mapping to get all my data subsets?
And then do I need to do anything to the process_function to make it available to all nodes as well to run and return an answer?
What is the best way to do this?
EDIT:
The process_function takes the filtered datasets and runs them through about 7 different functions that are already written in 300 lines of pyspark; the end goal is to return a list of timestamps that are overbooked based on a bunch of complicated logic.
I think my plan is to build a dictionary of letter --> [number], then explode that list to get every permutation and create a dataset from that. Then map over that, and hopefully I can create a UDF for my process_function.
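For reference, a rough sketch of that plan using Spark's cogrouped map (applyInPandas, Spark 3.0+) rather than a hand-rolled loop; this assumes process_function can be rewritten to take the two already-filtered groups as pandas DataFrames and return a DataFrame of timestamps (the function name, column name, and schema below are made up):
# DF_1 and DF_2 are the question's DataFrames
def process_one_group(pdf_1, pdf_2):
    # the existing process_function logic would go here, working on pandas DataFrames
    return pdf_1[['Timestamp']]  # placeholder return; the column name is hypothetical

result = (
    DF_1.groupby('Letter', 'Number')
        .cogroup(DF_2.groupby('Letter', 'Number'))
        .applyInPandas(process_one_group, schema='Timestamp timestamp')
)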
I don't think you need to worry a lot about parallelizing or the execution plan, because the Spark Catalyst optimizer does that in the background for you. It is also better to avoid UDFs; you can do most of this with built-in functions.
Are you doing a transformation or an aggregation inside your process_function?
Please provide some test data and a suitable example of the expected output. That would help in answering.

I cannot understand why "in" doesn't work correctly

sp01 is a dataframe which contains the S&P 500 index, and I have another dataframe, interest, which contains daily interest rates. The two datasets start from the same date, but their sizes are not the same, which causes an error.
I want to match the dates exactly, so I tried to check every date using the "in" operator. But "in" doesn't work as expected. This is the code:
print(sp01.Date[0], type(sp01.Date[0]))
# -> 1976-06-01, str
print(interest.DATE[0], type(interest.DATE[0]))
# -> 1976-06-01, str
print(sp01.Date[0] in interest.DATE)
# -> False
I can't understand why the result is False. The first date of sp01 and interest is exactly the same; I checked that in code as well. So True should come out, but False came out. It's driving me mad! Please help me.
I solved it! The problem is that the "in" operator does not check the values of a pandas Series (it checks the index labels instead). Those two columns are Series, so I had to change one of them to a list.
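A small illustration of that behaviour, with made-up data:
import pandas as pd

dates = pd.Series(['1976-06-01', '1976-06-02'])

print('1976-06-01' in dates)          # False: "in" looks at the index labels (0, 1)
print('1976-06-01' in dates.values)   # True: checks the underlying values
print('1976-06-01' in list(dates))    # True: converting to a list also works
print(dates.isin(['1976-06-01']))     # element-wise check, returns a boolean Series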

Processing pandas data in declarative style

I have a pandas dataframe of vehicle co-ordinates (from multiple vehicles on multiple days). For each vehicle and for each day, I do one of two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby('vehicle_id', 'day') and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written in a declarative style, as opposed to imperatively looping through the groups, with the goal of having the whole thing look something like:
df.group_by('vehicle_id', 'day').apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect since .apply() and .filter() return new dataframes, and this is exactly my problem: they return all the data back in a single dataframe, and I find that I have to apply .groupby('vehicle_id', 'day') continuously.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = dfg.do_stuff1()  # Perform all needed operations
    dfg = do_stuff2(dfg)   #
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then map a single groupby/apply to it:
def all_operations(dfg):
    # Do stuff
    return result_df

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
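Here is a minimal self-contained sketch of the second option (the column names and stand-in functions below are made up), where returning None from the per-group function plays the role of the filter, since groups for which apply's function returns None are dropped:
import pandas as pd

# Stand-ins for the question's algorithm/condition functions
def algorithm1(dfg):
    return dfg.assign(speed=dfg['distance'] / dfg['seconds'])

def condition1(dfg):
    return len(dfg) >= 2  # keep only groups with at least two points

def all_operations(dfg):
    if not condition1(dfg):
        return None       # returning None drops the group, like .filter()
    return algorithm1(dfg)

df = pd.DataFrame({
    'vehicle_id': [1, 1, 2],
    'day': ['d1', 'd1', 'd1'],
    'distance': [10.0, 12.0, 9.0],
    'seconds': [5.0, 6.0, 3.0],
})

result = df.groupby(['vehicle_id', 'day'], group_keys=False).apply(all_operations)
print(result)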

Reading .txt into Julia DataFrame as Date Type

Is there a way to read date ("2000-01") variables from text files into a Julia DataFrame directly, as a date? There's no documentation on this from what I have seen.
df = readtable("path/dates.txt", eltypes = [Date, Date])
This doesn't work, even though it seems like it should. My usual process is to read the dates in as strings and then loop over each row to create a new date variable. This has become a bottleneck in some of my processes now, due to the size of the DataFrames.
My usual flow is to do something like this:
full_df[:real_date] = Date(full_df[:temp_dte_string], "m/d/y")
Thank you
I don't think there's currently any way to do the loading in a single step like your first suggested code. However you can speed up the second method somewhat by making a DateFormat object and calling Date with that instead of with a string.
(This is mentioned briefly here.)
dfmt = Dates.DateFormat("m/d/y")
full_df[:real_date] = Date(full_df[:temp_dte_string], dfmt)
(For some reason I thought Date was not vectorized and have been doing this inside a for loop in all my code. Whoops.)
By delete a variable, do you mean delete a column or a row? If you mean the former, then there are a few other ways to do this, including things like:
function vectorin(a, b)  # IMHO this should be in Base
    bset = Set(b)
    [i in bset for i in a]
end

df = DataFrame(A1="", A2="", A3="", D="", E="", F="")  # some long list of columns
badCols = [:D, :F]  # some long list of columns you want to remove
df = df[names(df)[!vectorin(names(df), badCols)]]
Sometimes I read in csv files with a lot of extra columns, then just do something like
df = readtable("data.csv")
df = df[[:Only, :the, :cols, :I, :want]]