Numpy, Pandas Exercise - pandas

"[it just needs to be done using numpy and pandas.]"
Your task:
You are asked to write a function that applies the "slack time remaining" (STR) sequencing rule to a given collection of jobs. Although this rule has not been covered in class, its application is very similar to the critical ratio rule. You need to calculate the STR value for all jobs and schedule the one with the lowest STR, then repeat until all jobs are scheduled. The STR values are calculated as follows:
STR = [Time Until Due Date] − [Processing Time]
If more than one job has the lowest STR, break the tie with the Earliest Due Date
(EDD) rule. If the due dates are also the same, schedule the job that arrived earlier (that is, the one in the upper rows of the table).
Your function will accept a single parameter, a pandas DataFrame:
Function Parameter:
df_jobs: A pandas DataFrame whose index contains the names of the jobs. Jobs
are assumed to have arrived on the same day, in the same order given in the DataFrame. There
will be two data columns in the DataFrame:
- "Processing Time": Processing time required for the job
- "Due Date": Time between the arrival of the job and the due date of the job
Output: Your function should return a list containing the correct sequence according to the STR rule.
Example inputs and expected outputs:
Example Input Data:
Job   Processing Time   Due Date
A     2                 7
B     8                 16
C     4                 4
D     10                17
E     5                 15
F     12                18
Expected Output: ['C', 'A', 'F', 'D', 'B', 'E'] (the STR values are A: 5, B: 8, C: 0, D: 7, E: 10, F: 6)

Assuming your input is a DataFrame with the job names in a 'Job' column, your function would be:
def str_list(df):
    df = df.set_index('Job')  # drop this line if the job names are already the index
    # STR = Due Date - Processing Time; ties broken by Due Date; the multi-key sort is
    # stable, so the original (arrival) order is kept when both values tie
    df = df.assign(STR=df['Due Date'] - df['Processing Time'])
    return df.sort_values(['STR', 'Due Date']).index.tolist()
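For the example above, a quick check (assuming the job names are supplied in a 'Job' column, as in the table):
import pandas as pd

df_jobs = pd.DataFrame({
    'Job': list('ABCDEF'),
    'Processing Time': [2, 8, 4, 10, 5, 12],
    'Due Date': [7, 16, 4, 17, 15, 18],
})
print(str_list(df_jobs))  # ['C', 'A', 'F', 'D', 'B', 'E']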

Pandas efficiently concat DataFrames returned from apply function

I have a pandas.Series of business dates called s_dates. I want to pass each of these dates (together with some other hyper-parameters) to a function called func_sql_to_df which formats an SQL-query and then returns a pandas.DataFrame. Finally, all of the DataFrames should be concatenated (appended) into a single pandas.DataFrame called df_summary where the business date is the identifier.
From here I need to do two things:
export df_summary to an Excel sheet or csv-file.
group df_summary by the dates and then apply another function called func_analysis to each column.
My attempt is something like this
df_summary = pd.concat(list(
    s_dates.apply(func_sql_to_df, args=hyper_param)
))
df_summary.groupby('dates').apply(func_analysis)
# Export data
...
However, the first statement, where df_summary is defined, takes quite a long time. There are a total of 250 dates; the first couple of iterations take approximately 3 seconds each, but this increases to over 3 minutes after about 100 iterations (and continues to grow). All of the SQL-queries take more or less the same time to execute individually, and the resulting dataframes all have the same number of observations.
I want to increase the performance of this setup, but I am already not using any loops (only apply-functions) and the SQL-query has already been optimized a lot. Any suggestions?
Update: If I am not mistaken then my attempt is actually the suggested solution as stated in the accepted answer to this post.
Update 2: My SQL-query looks something like this. I do not know if all the dates can be passed at once, as the conditions specified in the WHERE-statement must hold for each passed value in dates.
select /*+ parallel(auto) */
    MY_DATE as EOD_DATE -- These are all the elements in 'DATES' passed
    , Var2
    , Var3
    , ColA
    , ColB
    , ...
    , ColN
from Database1
where
    Var2 in (select Var2 from Datebase2 where update_time < MY_DATE) -- Cond1
    and Var3 in (select Var3 from DataBase3 where EOD_DATE = MY_DATE) -- Cond2
    and cond3
    and cond4
    ...
Running the query for any single date in dates on its own seems to take around 2-8 seconds. However, as mentioned, some of the iterations in the apply-function take more than 3 minutes.
It turns out that using pandas.concat(...) with a pandas.DataFrame.apply(...) as the argument, as in my setup above, is really slow. I compared the results against a plain for-loop, which gives roughly 10x faster performance.
# ~x10 times faster
dfs = []
for d in dates:
    dfs.append(func_sql_to_df(d, hyper_param))
df_summary = pd.concat(dfs)  # It is very important that the concat is outside the for-loop
This can even be run in parallel to get much better results
# ~x10 * (n_jobs) times faster
from joblib import Parallel, delayed

df_summary = pd.concat(
    Parallel(n_jobs=-1)(delayed(func_sql_to_df)(d, hyper_param) for d in dates)
)
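If the business date is not already a column in each returned frame, the keys argument of pd.concat can attach it as an index level so that the later groupby('dates') has something to group on. A sketch, assuming the dfs list from the loop above and that s_dates is aligned with it:
df_summary = pd.concat(dfs, keys=list(s_dates), names=['dates'])
# 'dates' is now the outer index level, so df_summary.groupby('dates') works directly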

PySpark Grouping and Aggregating based on A Different Column?

I'm working on a problem where I have a dataset in the following format (replaced real data for example purposes):
session   activity         timestamp
1         enter_store      2022-03-01 23:25:11
1         pay_at_cashier   2022-03-01 23:31:10
1         exit_store       2022-03-01 23:55:01
2         enter_store      2022-03-02 07:15:00
2         pay_at_cashier   2022-03-02 07:24:00
2         exit_store       2022-03-02 07:35:55
3         enter_store      2022-03-05 11:07:01
3         exit_store       2022-03-05 11:22:51
I would like to be able to compute counting statistics for these events based on the pattern observed within each session. For example, based on the table above, the count of each pattern observed would be as follows:
{
    'enter_store -> pay_at_cashier -> exit_store': 2,
    'enter_store -> exit_store': 1
}
I'm trying to do this in PySpark, but I'm having some trouble figuring out the most efficient way to do this kind of pattern matching where some steps are missing. The real problem involves a much larger dataset of ~15M+ events like this.
I've tried filtering the entire DF for unique sessions where 'enter_store' is observed, and then filtering that DF for unique sessions where 'pay_at_cashier' is observed. That works fine; the only issue is that I'm having trouble thinking of ways to count sessions like session 3, where there is only a starting step and a final step but no middle step.
Obviously one way to do this brute-force would be to iterate over each session and assign it a pattern and increment a counter, but I'm looking for more efficient and scalable ways to do this.
Would appreciate any suggestions or insights.
For Spark 2.4+, you could do
from pyspark.sql import functions as F

df = (df
    .withColumn("flow", F.expr("sort_array(collect_list(struct(timestamp, activity)) over (partition by session))"))
    .withColumn("flow", F.expr("concat_ws(' -> ', transform(flow, v -> v.activity))"))
    .groupBy("flow").agg(F.countDistinct("session").alias("total_session"))
)
df.show(truncate=False)
# +-------------------------------------------+-------------+
# |flow |total_session|
# +-------------------------------------------+-------------+
# |enter_store -> pay_at_cashier -> exit_store|2 |
# |enter_store -> exit_store |1 |
# +-------------------------------------------+-------------+
The first withColumn collects the (timestamp, activity) pairs of each session into an array over a window partitioned by session, and sort_array orders them by timestamp (be sure timestamp is an actual timestamp type). After that, keep only the activity values from the array using the transform function (and join them into a single string with concat_ws), then group by that activity sequence and count the distinct sessions.
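Since every row of a session ends up with the same flow value anyway, the window can also be replaced with a plain groupBy; a sketch using the same column names as above:
from pyspark.sql import functions as F

flows = (df
    .groupBy("session")
    .agg(F.sort_array(F.collect_list(F.struct("timestamp", "activity"))).alias("flow"))
    .withColumn("flow", F.concat_ws(" -> ", F.expr("transform(flow, v -> v.activity)")))
    .groupBy("flow").count()
)
flows.show(truncate=False)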

Pandas: Date difference loop between columns with similar names (ACD and ECD)

I'm working in Jupyter and have a large number of columns, many of them dates. I want to create a loop that will return a new column with the date difference between two similarly-named columns.
For example:
df['Site Visit ACD']
df['Site Visit ECD']
df['Sold ACD (Loc A)']
df['Sold ECD (Loc A)']
The new column df['Site Visit Cycle Time'] would hold the date difference between the ACD and ECD columns. Generally, it will always be the column that contains "ACD" minus the column that contains "ECD". How can I write this?
Any help appreciated!
The following code will:
- Find columns that are similar (over a 90 fuzz ratio, using the fuzzywuzzy package)
- Perform the date (or time) comparison
- Avoid performing the same computation in both directions
- Name the result 'Site Visit' if the column is called more or less like that
- Name the result 'difference between <column 1> and <column 2>' otherwise
I hope it helps.
import pandas as pd
from fuzzywuzzy import fuzz

name = pd.read_excel('Book1.xlsx', sheet_name='name')
unique = []
for i in name.columns:
    for j in name.columns:
        if i != j and fuzz.ratio(i, j) > 90 and i+j not in unique:
            if 'Site Visit' in i:
                name['Site Visit'] = name[i] - name[j]
            else:
                name['difference between '+i+' and '+j] = name[i] - name[j]
            unique.append(j+i)
            unique.append(i+j)
print(name)
Generally, it will always be the column that contains "ACD" minus the column that contains "ECD".
This answer assumes the column titles are not noisy, i.e. they only differ in "ACD" / "ECD" and are exactly the same apart from that (upper/lower case included). Also assuming that there always is a matching column. This code doesn't check if it overwrites the column it writes the date difference to.
This approach works in linear time, as we iterate over the set of columns once and directly access the matching column by name.
test.csv
Site Visit ECD,Site Visit ACD,Sold ECD (Loc A),Sold ACD (Loc A)
2018-06-01,2018-06-04,2018-07-05,2018-07-06
2017-02-22,2017-03-02,2017-02-27,2017-03-02
Code
import pandas as pd

df = pd.read_csv("test.csv", delimiter=",")
for col_name_acd in df.columns:
    # Skip columns that don't have "ACD" in their name
    if "ACD" not in col_name_acd:
        continue
    col_name_ecd = col_name_acd.replace("ACD", "ECD")
    # we assume there is always a matching "ECD" column
    assert col_name_ecd in df.columns
    col_name_diff = col_name_acd.replace("ACD", "Cycle Time")
    df[col_name_diff] = df[col_name_acd].astype('datetime64[ns]') - df[col_name_ecd].astype('datetime64[ns]')
print(df.head())
Output
Site Visit ECD Site Visit ACD Sold ECD (Loc A) Sold ACD (Loc A) \
0 2018-06-01 2018-06-04 2018-07-05 2018-07-06
1 2017-02-22 2017-03-02 2017-02-27 2017-03-02
Site Visit Cycle Time Sold Cycle Time (Loc A)
0 3 days 1 days
1 8 days 3 days
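If plain integer day counts are preferred over Timedelta values, the new columns can be converted afterwards; a small follow-up sketch on the df built above:
# Convert the Timedelta difference columns to integer numbers of days
for col in df.columns:
    if "Cycle Time" in col:
        df[col] = df[col].dt.days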

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index       A     B  C  D     ...  Z
Date/Time   1     0  0  0,35  ...  1
Date/Time   0,75  1  1  1     ...  1
The total number of rows is 8878
What I try to do is create a time-series dendrogram (Example: Whole A column will be compared to whole B column in whole time).
I am expecting an output like this: [example dendrogram image] (source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every individual time point with each other and plot that, but then the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
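A minimal sketch of the idea, under the assumption that the goal is to cluster the columns A..Z as whole series: transpose the frame so hierarchy.linkage sees one observation per column (placeholder random data stands in for the real 8878-row dataset):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Placeholder data: 8878 time points, 5 series (the real frame has columns A..Z)
df = pd.DataFrame(np.random.rand(8878, 5), columns=list('ABCDE'))

# Each row passed to linkage is one whole column/series of length 8878
Z = hierarchy.linkage(df.T.values, method='ward')

hierarchy.dendrogram(Z, labels=df.columns.tolist())
plt.show()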

Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in a 5-digit format: ddd + hm.
The ddd part starts from 2009 Jan 1. Since the data was collected over a two-year period from then, its [min, max] range is [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so each 24-hour day splits into at most 48 slots, giving hm a [min, max] range of [1, 48].
The daycode.csv file contains the ddd part of the daycode with its matching date, and the hm part of the daycode with its matching time.
I think I agreed not to show the dataset, which is from ISSDA, so I will just describe that a daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting this code together.. which of course won't work at this point.
import pandas as pd
import matplotlib.pyplot as plt

consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8", names=['meter', 'daycode', 'val'])
df1 = pd.read_csv("data/daycode.csv", encoding="cp1252", names=['code', 'print'])
test = consume[consume['meter'] == 1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (thousands of them) have been observed for the full length, but 730 x 48 is a large combination to lay out in Excel by hand. Tbh, not an elegant solution, but I tried it by dragging - it doesn't quite get there.
If I could read the first 3 digits of the column values and match them with another file's column, and the last 2 digits with another column, then combine them.. is there a way?
For the last 2 lines you can just do something like this
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
for joining 2 dataframes
df3 = df.merge(df2,left_on=['first_3_digits','last_2_digits'],right_on=['col1_df2','col2_df2'],how='left')
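Applied to the names in the question, that could look something like the sketch below. The daycode.csv header isn't shown, so 'ddd_code' and 'hm_code' are hypothetical column names for its day and half-hour lookup keys, and the merge assumes one row per (day, slot) combination:
import pandas as pd

consume = pd.read_csv("data/File1.txt", sep=' ', encoding="utf-8", names=['meter', 'daycode', 'val'])
daycode = pd.read_csv("data/daycode.csv", encoding="cp1252")  # hypothetical columns: ddd_code, date, hm_code, time

# Split the 5-digit daycode: first 3 digits = day number, last 2 digits = half-hour slot
# (cast to int so the dtypes match the assumed integer keys in daycode.csv)
consume['first_3_digits'] = consume['daycode'].map(lambda x: str(x)[:3]).astype(int)
consume['last_2_digits'] = consume['daycode'].map(lambda x: str(x)[-2:]).astype(int)

merged = consume.merge(daycode, left_on=['first_3_digits', 'last_2_digits'],
                       right_on=['ddd_code', 'hm_code'], how='left')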