awswrangler.s3.read_parquet very slow for multi-layer partitioning - pandas

I have transformed a CSV file into Parquet files partitioned on 2 layers (day, ID) with the following call:
awswrangler.s3.to_parquet(
    df=df,
    path=s3_path,
    dataset=True,
    use_threads=True,
    partition_cols=partition_key,
    boto3_session=session_s3
)
When I read the Parquet files back I use the following:
my_filter = lambda x: x['day'] == mydate and x['ID'] == myID
df = awswrangler.s3.read_parquet(path=s3_path, dataset=True, partition_filter=my_filter)
The reading procedure takes 30 seconds, which is acceptable.
However, when I partition the files into 3 layers (day, hour, ID), the same call (i.e. still filtering only by day and ID) takes over 3 minutes.
It seems that adding a layer significantly slowed down the reading procedure. I guess it is something related to parallel vs. sequential reading.
Has somebody had the same problem and can suggest a workaround?
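One workaround that may help (a sketch, untested; it assumes a Hive-style layout of day=.../hour=.../ID=... and reuses mydate, myID, s3_path and session_s3 from above): narrow the path argument to the day= prefix you want, so awswrangler only lists objects under that day instead of walking every hour/ID combination before applying the filter.
import awswrangler

# Point directly at the day partition; 'day' then no longer needs filtering.
day_path = f"{s3_path}/day={mydate}/"
df = awswrangler.s3.read_parquet(
    path=day_path,
    dataset=True,
    # Partition values arrive as strings, so compare against a string ID.
    partition_filter=lambda x: x['ID'] == str(myID),
    use_threads=True,
    boto3_session=session_s3,
)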

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality score over a specific period of time as an input feature in this model, using a pandas rolling window. The problem with this method is that pandas only allows you to create a window running from t = -x up to t = 0, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a 2-week period, but not the 2 weeks directly before the corresponding shipment: the window should start at t = -4 weeks and end at t = -2 weeks.
You would imagine that this could be solved by using the same line of code with a modified window, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other notation for this specific window, does not seem to work.
It seems that pandas does not offer a solution to this problem out of the box, so we built a workaround with the following solution:
import numpy as np
import pandas as pd

def time_shift_week(df):
    def _avg_score_interval_func(series):
        # Keep only the observations between t-4w and t-2w inside the window
        current_time = series.index[-1]
        result = series[
            (series.index > (current_time - pd.Timedelta(value=4, unit='w')))
            & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))
        ]
        return result.mean() if len(result) > 0 else 0.0

    temp_df = (
        df.groupby(by=["supplier", "timestamp"], as_index=False)
        .aggregate({"score": np.mean})
        .set_index("timestamp")
    )
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])["score"]
        .apply(lambda x:
               x
               .rolling(window='30D', closed='both')
               .apply(_avg_score_interval_func))
    )
    return temp_df.reset_index()
This results in a new dataframe in which we find the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification.
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()

# Rolling average on score with the custom window.
# closed="left" means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
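To attach the result as a feature, the (supplier, timestamp)-indexed avg_score series can be joined back onto the original frame; a quick sketch (the column name avg_score_2to4w is just an illustrative choice):
# avg_score is indexed by (supplier, timestamp), so join on those columns.
df = df.join(avg_score.rename("avg_score_2to4w"), on=["supplier", "timestamp"])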

Pandas efficiently concat DataFrames returned from apply function

I have a pandas.Series of business dates called s_dates. I want to pass each of these dates (together with some other hyper-parameters) to a function called func_sql_to_df which formats an SQL-query and then returns a pandas.DataFrame. Finally, all of the DataFrames should be concatenated (appended) into a single pandas.DataFrame called df_summary where the business date is the identifier.
From here I need to do two things:
export df_summary to an Excel sheet or CSV file.
group df_summary by the dates and then apply another function called func_analysis to each column.
My attempt is something like this:
df_summary = pd.concat(list(
    s_dates.apply(func_sql_to_df, args=hyper_param)
))
df_summary.groupby('dates').apply(func_analysis)
# Export data
...
However, the first statement, where df_summary is defined, takes quite a long time. There are 250 dates in total; the first couple of iterations take approximately 3 seconds each, but this increases to over 3 minutes after about 100 iterations (and keeps growing). All of the SQL queries take more or less the same time to execute individually, and the resulting dataframes all have the same number of observations.
I want to increase the performance of this setup, but I am already not using any explicit loops (only apply functions) and the SQL query has already been optimized a lot. Any suggestions?
Update: If I am not mistaken then my attempt is actually the suggested solution as stated in the accepted answer to this post.
Update 2: My SQL query looks something like this. I do not know whether all the dates can be passed at once, as the conditions specified in the WHERE clause must hold for each passed value in dates.
select /*+ parallel(auto) */
    MY_DATE as EOD_DATE -- These are all the elements in 'DATES' passed
    , Var2
    , Var3
    , ColA
    , ColB
    , ...
    , ColN
from Database1
where
    Var2 in (select Var2 from Database2 where update_time < MY_DATE) -- Cond1
    and Var3 in (select Var3 from Database3 where EOD_DATE = MY_DATE) -- Cond2
    and cond3
    and cond4
    ...
Running the query for any date in dates on its own seems to take around 2-8 seconds. However, as mentioned some of the iterations in the apply-function takes more than 3 minutes.
It turns out that using pandas.concat(...) with a pandas.Series.apply(...) result as its argument, as in my setup above, is really slow. I compared the results against a plain for-loop, which gives roughly 10x faster performance.
# ~10x faster
dfs = []
for d in dates:
    dfs.append(func_sql_to_df(d, hyper_param))
df_summary = pd.concat(dfs)  # Very important: the concat stays outside the for-loop
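Why the placement matters (a general pandas point, not specific to this query): concatenating inside the loop copies the accumulated frame on every iteration, which makes the total work quadratic in the number of rows, while collecting the frames in a list and concatenating once keeps it linear. The anti-pattern to avoid looks like this:
# Anti-pattern: re-copies everything accumulated so far on each iteration
df_summary = pd.DataFrame()
for d in dates:
    df_summary = pd.concat([df_summary, func_sql_to_df(d, hyper_param)])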
This can even be run in parallel for much better results:
# ~10x * n_jobs faster
from joblib import Parallel, delayed

df_summary = pd.concat(
    Parallel(n_jobs=-1)(delayed(func_sql_to_df)(d, hyper_param) for d in dates)
)

Writing pyspark dataframe to s3 location using partition

I have a relatively small dataframe of 3.7 million rows with a date column (01-01-2018 till date) and a partner column, along with other unique IDs. I want to write the dataframe to an s3 location, partitioning it first by date and then by partner (5 partners, for instance P1, P2, P3, P4 and P5). Below are my schema and code:
df schema is
id1: long
id2: long
id3: long
partner: string
dt: date
df = df1.select('dt', 'partner').distinct().groupBy('partner').agg(F.collect_set('dt').alias('dt'))
for i in df.collect():
    src = i.partner
    for dt1 in i.dt:
        (df1.filter(F.col('dt') == dt1)
            .filter(F.col('partner') == src)
            .write.mode("overwrite")
            .parquet("s3://test/parquet/dt={}/partner={}".format(dt1.strftime('%Y-%m-%d'), src)))
The above code runs successfully, but it was taking more than 4-5 hours (I cancelled it midway) to write the dataframe to the s3 location. Is there any way I can reduce the time significantly? Can anyone help me validate the code, or correct it if necessary, in order to achieve this faster? I am new to this; I appreciate any help.
Sample Data
id1|id2|id3|partner|dt
100|200|300|p1 |01-01-2018
101|200|30 |p2 |01-01-2020
102|202|311|p3 |01-01-2019
103|201|320|p4 |01-02-2019
104|210|305|p5 |01-03-2018
.
.
.
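For what it's worth, Spark can produce exactly this dt=/partner= layout natively with partitionBy, writing all partitions in one distributed job instead of running a filtered write per (date, partner) combination. A sketch, assuming df1 is the full dataframe from the question:
# One pass over the data: Spark creates the dt=.../partner=... folders itself.
(df1.write
    .mode("overwrite")
    .partitionBy("dt", "partner")
    .parquet("s3://test/parquet"))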

Apache Hbase - Fetching large row is extremely slow

I'm running an Apache HBase cluster on AWS EMR. I have a table with a single column family, 75,000 columns and 50,000 rows. I'm trying to get all the column values for a single row, and when the row is not sparse and has 75,000 values, the return time is extremely slow: it takes almost 2.5 seconds to fetch the data from the DB. I'm querying the table from a Lambda function running HappyBase.
import time
import happybase

# Connection setup elided in the original snippet
connection = happybase.Connection(...)

start = time.time()
row_key = 'mycol'
table = connection.table('mytable')
row = table.row(row_key)  # table.row() takes a row key
end = time.time() - start
print("Time taken to fetch row from database:")
print(end)
What can I do to make this faster? This seems incredibly slow - the return payload is 75,000 value pairs, and is only ~2MB. It should be much faster than 2 seconds. I'm looking for millisecond return time.
I have a BLOCKCACHE size of 8194kb, a BLOOMFILTER of type ROW, and SNAPPY compression enabled on this table.
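One thing worth trying (a sketch, not a verified fix for this table): if the caller only needs some of the qualifiers, pass columns= to table.row() so HBase does not have to materialise and ship all 75,000 cells through the Thrift server. The qualifier names below are hypothetical.
# Fetch only the cells actually needed instead of the whole 75,000-column row.
needed = [b'cf:colA', b'cf:colB']  # placeholder family:qualifier names
row = table.row(b'mycol', columns=needed)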

Apply function with pandas dataframe - POS tagger computation time

I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings, and I'm using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply as I have it set up. Basically, I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row))
which required an adjustment to my function:
from collections import Counter

def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flat = [y for z in list_of_lists for y in z]
    parts_of_speech = [row[1] for row in flat]
    c = Counter(parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger with the left-3-words model, but I have a big problem with computation time. I'm noticing that the .jar file is called again and again (Java keeps opening and closing in the task manager); maybe that's unavoidable, but it's really taking far too long to run. Is there any way I can speed it up?
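A likely culprit, given the Java processes described above (this is an assumption about the setup): NLTK's StanfordPOSTagger spawns a fresh JVM for every .tag() call, so tagging row by row pays the startup cost once per row. tag_sents tags every row in a single JVM invocation instead. A sketch, with the model and jar paths as placeholders:
from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths: point these at your local Stanford tagger install.
st = StanfordPOSTagger(
    'models/english-left3words-distsim.tagger',
    path_to_jar='stanford-postagger.jar',
)

# One JVM launch for the whole column instead of one per row.
tagged_rows = st.tag_sents([s.split() for s in df['string']])
df['num_nouns'] = [
    sum(1 for _, tag in row if tag in ('NN', 'NNS', 'NNP', 'NNPS'))
    for row in tagged_rows
]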
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2