Fetch data from several similarly named tables in DAE Databricks? - dataframe

I'm using PySpark in DAE Databricks to get HES data.
At the moment I do this:
df_test = sqlContext.sql("select * from db_name_2122")
ICD10_codes = ['X85','X87']
df_test = df_test.filter((df_test.field1 == "something") &
                         (df_test.field.rlike('|'.join(ICD10_codes))))

df_test2 = sqlContext.sql("select * from db_name_2021")
ICD10_codes = ['X85','X87']
df_test2 = df_test2.filter((df_test2.field1 == "something") &
                           (df_test2.field.rlike('|'.join(ICD10_codes))))
I have to do this for the financial years 1112, 1213, 1314, ..., 2122. This is a lot of copy-pasting of similar code, and I know this is bad - both from experience of finding copy-paste errors and from what I've read.
What I want to do:
Be able to select data where the same conditions are met in the same fields in 11 different financial year tables within a DB and pull it all into one table at the end.
Rather than what I'm doing now, which is 11 different but similar copy-and-paste chunks of code that are then appended together.

First, before the "appending" you can put all the dfs into the same list:
dfs = []
for i in range(11, 22):
    df = sqlContext.sql(f"select * from db_name_{i}{i+1}")
    ICD10_codes = ['X85','X87']
    df = df.filter((df.field1 == "something") &
                   (df.field.rlike('|'.join(ICD10_codes))))
    dfs.append(df)
And then do the "append". I'm not sure exactly what you mean by "append", but this is how you could do a unionByName:
df_final = dfs[0]
for df in dfs[1:]:
    df_final = df_final.unionByName(df)
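If you prefer to avoid the explicit loop, the same union can also be written with functools.reduce; this is just a sketch equivalent to the loop above:

from functools import reduce

# union all the filtered DataFrames in the list, pairwise, left to right
df_final = reduce(lambda a, b: a.unionByName(b), dfs)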
Everything can also be combined into a single loop:
ICD10_codes = ['X85','X87']
rng = range(11, 22)
for i in rng:
    df = sqlContext.sql(f"select * from db_name_{i}{i+1}")
    df = df.filter((df.field1 == "something") &
                   (df.field.rlike('|'.join(ICD10_codes))))
    df_final = df if i == rng[0] else df_final.unionByName(df)
If your table names are more complex, you can put them directly into the list:
tables = ['db_name_1112_text1', 'db_name_1213_text56']
ICD10_codes = ['X85','X87']
for x in tables:
    df = sqlContext.sql(f"select * from {x}")
    df = df.filter((df.field1 == "something") &
                   (df.field.rlike('|'.join(ICD10_codes))))
    df_final = df if x == tables[0] else df_final.unionByName(df)
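If you would rather not hard-code the list at all, the table names could also be discovered from the catalog. This is only a sketch, with an assumed database name (my_database) and table prefix (db_name_):

# list every table in the database and keep the ones matching the prefix
tables = [t.name for t in spark.catalog.listTables("my_database")
          if t.name.startswith("db_name_")]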

Related

Capping multiple columns

I found an interesting snippet (by vrana95) that caps multiple columns. However, this function modifies the original "df" as well, instead of working only on "final_df". Does anyone know why?
def cap_data(df):
    for col in df.columns:
        print("capping the ", col)
        if (((df[col].dtype) == 'float64') | ((df[col].dtype) == 'int64')):
            percentiles = df[col].quantile([0.01, 0.99]).values
            df[col][df[col] <= percentiles[0]] = percentiles[0]
            df[col][df[col] >= percentiles[1]] = percentiles[1]
        else:
            df[col] = df[col]
    return df

final_df = cap_data(df)
As I wanted to cap only a few columns, I changed the for loop of the original snippet. It works, but I would like to know why this function affects both dataframes.
cols = ['score_3', 'score_6', 'credit_limit', 'last_amount_borrowed', 'reported_income', 'income']
def cap_data(df):
    for col in cols:
        print("capping the column:", col)
        if (((df[col].dtype) == 'float64') | ((df[col].dtype) == 'int64')):
            percentiles = df[col].quantile([0.01, 0.99]).values
            df[col][df[col] <= percentiles[0]] = percentiles[0]
            df[col][df[col] >= percentiles[1]] = percentiles[1]
        else:
            df[col] = df[col]
    return df

final_df = cap_data(df)
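No answer was posted in this thread, but the likely cause is that pandas DataFrames are passed by reference: cap_data mutates the object it receives in place and then returns that same object, so df and final_df end up pointing at the same data. A minimal sketch of a fix (assuming the same cols list) is to work on a copy, which also avoids the chained-indexing assignment:

def cap_data(df):
    df = df.copy()  # work on an independent copy so the caller's df is untouched
    for col in cols:
        if df[col].dtype in ('float64', 'int64'):
            lower, upper = df[col].quantile([0.01, 0.99]).values
            df[col] = df[col].clip(lower, upper)  # cap at the 1st/99th percentiles
    return df

final_df = cap_data(df)  # df is now left unchanged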

Finding top 3 dominant topics for LDA topic model

I am creating a data table via this LDA modeling tutorial (https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/), and instead of just finding the single most dominant topic, I want to expand it to find the top 3 most dominant topics, along with each of their percent contributions and topic keywords.
To do that, is it best to create 2 additional functions to build 3 separate dataframes and append each of the results? Or is there a simpler way to modify the format_topics_sentences function to pull the top 3 topics from the enumerated bag-of-words corpus?
def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_ready)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)
[table output]
I had a similar requirement in a recent project; hopefully this helps you out. You will need to add topic keywords to the code below:
topics_df1 = pd.DataFrame()
topics_df2 = pd.DataFrame()
topics_df3 = pd.DataFrame()

for i, row_list in enumerate(lda_model[corpus]):
    row = row_list[0] if lda_model.per_word_topics else row_list
    row = sorted(row, key=lambda x: (x[1]), reverse=True)
    for j, (topic_num, prop_topic) in enumerate(row):
        if len(row) >= 3:
            if j == 0:
                topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j == 1:
                topics_df2 = topics_df2.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j == 2:
                topics_df3 = topics_df3.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            else:
                break
        elif len(row) == 2:
            if j == 0:
                topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j == 1:
                topics_df2 = topics_df2.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
                topics_df3 = topics_df3.append(pd.Series(['-', '-']), ignore_index=True)
        elif len(row) == 1:
            topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            topics_df2 = topics_df2.append(pd.Series(['-', '-']), ignore_index=True)
            topics_df3 = topics_df3.append(pd.Series(['-', '-']), ignore_index=True)

topics_df1.rename(columns={0: '1st Topic', 1: '1st Topic Contribution'}, inplace=True)
topics_df2.rename(columns={0: '2nd Topic', 1: '2nd Topic Contribution'}, inplace=True)
topics_df3.rename(columns={0: '3rd Topic', 1: '3rd Topic Contribution'}, inplace=True)

topics_comb = pd.concat([topics_df1, topics_df2, topics_df3], axis=1, sort=False)
# Join topics dataframe to original data
new_df = pd.concat([data_ready, topics_comb], axis=1, sort=False)
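The answer leaves adding the keywords to the reader. One possible sketch, reusing show_topic the same way the question's format_topics_sentences does, is to look up each topic's keyword string when appending and rename the extra column afterwards:

# inside the loop, when appending a topic to one of the dataframes
wp = lda_model.show_topic(topic_num)
topic_keywords = ", ".join(word for word, prop in wp)
topics_df1 = topics_df1.append(
    pd.Series([int(topic_num), prop_topic, topic_keywords]),
    ignore_index=True)

# and after the loop, rename the third column as well
topics_df1.rename(columns={0: '1st Topic',
                           1: '1st Topic Contribution',
                           2: '1st Topic Keywords'}, inplace=True)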

df.groupby('columns').apply(''.join()), join all the cells to a string

This is a task for a junior data processor. I've tried many approaches in the past.
import pandas as pd

data = {'key': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
        'profit': [12, 3, 4, 5, 6, 7, 9],
        'income': ['j', 'd', 'd', 'g', 'd', 't', 'd']}
df = pd.DataFrame(data)
df = df.set_index('key')

# df2 is the expected result
data2 = {'key': ['a', 'b', 'c'],
         'value': ['12j5g9d', '3d6d', '4d7t']}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('key')
Here's a simple solution: first convert the integers to strings and concatenate profit and income, then concatenate all the strings that share the same key:
data = {'key': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
        'profit': [12, 3, 4, 5, 6, 7, 9],
        'income': ['j', 'd', 'd', 'g', 'd', 't', 'd']}
df = pd.DataFrame(data)
df['profit_income'] = df['profit'].apply(str) + df['income']
res = df.groupby('key')['profit_income'].agg(''.join)
print(res)
output:
key
a 12j5g9d
b 3d6d
c 4d7t
Name: profit_income, dtype: object
This question can be solved in a couple of different ways:
First add an extra column by concatenating the profit and income columns.
import pandas as pd

data = {'key': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
        'profit': [12, 3, 4, 5, 6, 7, 9],
        'income': ['j', 'd', 'd', 'g', 'd', 't', 'd']}
df = pd.DataFrame(data)
df = df.set_index('key')
df['profinc'] = df['profit'].astype(str) + df['income']
1) Using sum
df2=df.groupby('key').profinc.sum()
2) Using apply and join
df2=df.groupby('key').profinc.apply(''.join)
Results from both of the above would be the same:
key
a 12j5g9d
b 3d6d
c 4d7t
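A third, equivalent option (just a sketch, using the same profinc column) is pandas' built-in string concatenation on each group:

# Series.str.cat() with no arguments joins all strings in the group
df2 = df.groupby('key')['profinc'].apply(lambda s: s.str.cat())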

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster; it struggles when expanding ~120K rows into ~1.7M.
Essentially, I am trying to expand each date-stamped entry into 14 rows, one for each day from PayPeriodEnding back to T-14.
Does anyone have a better suggestion than itertuples to do this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
    listX = []
    listX.append(row)
    df = pd.DataFrame(listX * 14)
    df = df.reset_index().drop('Index', axis=1)
    df['Hours'] = df['Hours'] / 14
    df['AmountPaid'] = df['AmountPaid'] / 14
    df['PayPeriodEnding'] = np.arange(df.loc[:, 'PayPeriodEnding'][0] - np.timedelta64(14, 'D'),
                                      df.loc[:, 'PayPeriodEnding'][0],
                                      dtype='datetime64[D]')
    frames = [df_Final, df]
    df_Final = pd.concat(frames, axis=0)
df_Final
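No answer was posted here, but a vectorised sketch (assuming the column names used above and a unique index on merge4) would avoid itertuples entirely by repeating every row 14 times and offsetting the dates in one pass:

import numpy as np
import pandas as pd

# repeat each row of merge4 fourteen times
expanded = merge4.loc[merge4.index.repeat(14)].reset_index(drop=True)
expanded['Hours'] = expanded['Hours'] / 14
expanded['AmountPaid'] = expanded['AmountPaid'] / 14

# give each repeated row an offset of -14 .. -1 days from its PayPeriodEnding
offsets = np.tile(np.arange(-14, 0), len(merge4))
expanded['PayPeriodEnding'] = expanded['PayPeriodEnding'] + pd.to_timedelta(offsets, unit='D')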

Why am I returned an object when using std() in Pandas?

The print of the average of the spreads comes out grouped and calculated correctly. Why do I get the following returned as the result for the std_deviation column, instead of the standard deviation of the spread grouped by ticker?
pandas.core.groupby.SeriesGroupBy object at 0x000000000484A588
df = pd.read_csv('C:\\Users\\William\\Desktop\\tickdata.csv',
                 dtype={'ticker': str, 'bidPrice': np.float64, 'askPrice': np.float64, 'afterHours': str},
                 usecols=['ticker', 'bidPrice', 'askPrice', 'afterHours'],
                 nrows=3000000)
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
df['std_deviation'] = df['spread'].std(ddof=0)
df = df.groupby(['ticker'])
print(df['std_deviation'])
print(df['spread'].mean())
UPDATE: I am no longer being returned an object, but I am now trying to figure out how to display the standard deviation by ticker.
df['spread'] = (df.askPrice - df.bidPrice)
df2 = df.groupby(['ticker'])
print(df2['spread'].mean())
df = df.set_index('ticker')
print(df['spread'].std(ddof=0))
UPDATE 2: I got the output I needed using:
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
print(df.groupby(['ticker'])['spread'].mean())
print(df.groupby(['ticker'])['spread'].std(ddof=0))
This line:
df = df.groupby(['ticker'])
rebinds df to a DataFrameGroupBy object, so
df['std_deviation']
is a SeriesGroupBy object (of that column) rather than a Series.
It's a good idea not to "shadow" / re-assign a variable to a completely different datatype; try using a different variable name for the groupby result.
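As a follow-up sketch (keeping the grouped object under a separate name so df is not shadowed, and using ddof=0 as in the question), both statistics can be computed per ticker in one pass:

grouped = df.groupby('ticker')['spread']
summary = grouped.agg(mean='mean', std_deviation=lambda s: s.std(ddof=0))
print(summary)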