How to properly iterate over a for loop using Dask? - pandas

When I run a loop like this (see below) using dask and pandas, only the last field in the list gets evaluated. Presumably this is because of "lazy evaluation".
import pandas as pd
import dask.dataframe as ddf

df_dask = ddf.from_pandas(df, npartitions=16)
for field in fields:
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(
        lambda _: [__ for __ in _ if (__ == field)], meta=list
    )
If I add .compute() to the last line:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list).compute()
it then works correctly, but is this the most efficient way of doing this operation? Is there a way for Dask to add all the items from the fields list at once, and then run them in one shot via compute()?

You will want to call .compute() at the end of your computation to trigger work. Warning: .compute() assumes that your result will fit in memory.
Also, watch out, lambdas late-bind in Python, so the field value may end up being the same for all of your columns.

Here's one way to do it, where string_check is just a sample function that returns True/False. The issue was the late binding of lambda functions.
from functools import partial

def string_check(string, search):
    return search in string

search_terms = ['foo', 'bar']
for s in search_terms:
    # partial binds the current value of s, so each column gets its own search term
    string_check_partial = partial(string_check, search=s)
    df[s] = df['YOUR_STRING_COL'].apply(string_check_partial)
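To address the one-shot part of the original Dask question, here is a minimal sketch that builds every derived column lazily and triggers a single compute() at the end. It assumes df has a list-valued column named "column" and that fields is the list of values to match; binding field as a default argument is just one way around the late-binding problem (functools.partial works equally well).
import pandas as pd
import dask.dataframe as ddf

# toy stand-in for the real data (assumption: "column" holds lists)
df = pd.DataFrame({"column": [["a", "b"], ["b", "c"], ["a"]]})
fields = ["a", "b"]

df_dask = ddf.from_pandas(df, npartitions=2)

for field in fields:
    # the default argument freezes the current value of `field` for this lambda
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(
        lambda lst, field=field: [x for x in lst if x == field], meta=list
    )

# nothing has been evaluated yet; one compute() materialises all new columns at once
result = df_dask.compute()
print(result)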

Related

Appending data into gsheet using google colab without giving a range

I have a simple example to test how to append data to a tab in a Google Sheet using Colab. For example, here is the first code snippet to write the data the first time:
from google.colab import auth
auth.authenticate_user()

import gspread
import pandas as pd
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)

df = pd.DataFrame({'a': ['apple', 'airplane', 'alligator'],
                   'b': ['banana', 'ball', 'butterfly'],
                   'c': ['cantaloupe', 'crane', 'cat']})
df2 = [df.columns.to_list()] + df.values.tolist()

wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
wsresults2.update(None, df2)
This works for me, as shown in the screenshot:
First screenshot
Since it is my work account, I am not able to share a link to the Google Sheet, apologies for that. Next, I need to check whether I can append new data to the existing data. To this end, I use the following code:
from gspread_dataframe import get_as_dataframe, set_with_dataframe
wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
set_with_dataframe(wsresults2, df)
Please note that we don't know the row at which the new data needs to be inserted; it can vary depending on the data size. But the output is still the same, please see the second screenshot. Can I please get help on how to append data to the Google Sheet using this approach? Thanks.
Second screenshot
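One possible approach (a sketch only, not a verified answer: it assumes the sheet already holds the first batch of data with its header row) is gspread's append_rows, which writes below the existing contents, so the target row does not need to be known in advance:
# append only the values (no header) below whatever is already in the sheet
wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
wsresults2.append_rows(df.values.tolist(), value_input_option='USER_ENTERED')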

loading input dataset for every iteration in for loop in palantir

I have a @transform_pandas transform which loads the input file for computing.
Inside the compute function I have a for loop which has to read the complete input data and filter it accordingly on every iteration.
@transform_pandas(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
)
I have the below code, where I'm trying to read the source_df dataset on every iteration of the for loop, filter the dataset to the specific year and family, and do the computation.
def compute(source_df):
    for entire_row in vhcl_df.itertuples():
        modyr = entire_row[1]
        fam = str(entire_row[2])
        # source_df should be read again here.
        source_df = source_df.loc[source_df['i_yr'] == modyr]
        source_df = source_df.loc[source_df['fam'] == fam]
        ...
Is there a way to achieve this? Thank you for your support.
As already suggested by @nicornk in the comments, you should create a new .copy() of your source_df right after you declare the transform.
The two filtering steps can also be merged into one, if you don't need to work with the "modyr-filtered" source_df on its own.
Provided that modyr and fam are actual column names of vhcl_df, it is sufficient to do the following:
@transform_pandas(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
    vhcl_df=Input(path),
)
def compute(source_df, vhcl_df):
    # iterate over the rows of vhcl_df (each row holds a modyr and a fam value)
    for modyr, fam in vhcl_df.itertuples(index=False):
        temp_df = source_df.copy()
        temp_df = temp_df.loc[temp_df['i_yr'] == modyr]
        temp_df = temp_df.loc[temp_df['fam'] == str(fam)]
which, in a more concise and cleaner way, can be written as
def compute(source_df, vhcl_df):
    for modyr, fam in vhcl_df.itertuples(index=False):
        temp_df = source_df.copy()
        filtered_temp_df = temp_df[(temp_df.i_yr == modyr) & (temp_df.fam == str(fam))]
PS: Remember that if source_df is big, you should proceed with PySpark (see the Foundry docs):
Note that transform_pandas should only be used on datasets that can fit into memory. If you have larger datasets that you wish to filter down first before converting to Pandas, you should write your transformation using the transform_df() decorator and the pyspark.sql.SparkSession.createDataFrame() method.
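As a rough sketch of that PySpark route (assuming the standard transforms.api imports, the same placeholder inputs sample and path, and that vhcl_df carries the modyr and fam columns; not tested against a real Foundry repository), the filtering can be expressed as a join in Spark before anything is pulled into pandas:
from transforms.api import transform_df, Input, Output
import pyspark.sql.functions as F

@transform_df(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
    vhcl_df=Input(path),
)
def compute(source_df, vhcl_df):
    # keep only the (i_yr, fam) combinations that appear in vhcl_df;
    # this replaces the per-row pandas loop with a single Spark join
    keys = vhcl_df.select(
        F.col("modyr").alias("i_yr"),
        F.col("fam").cast("string").alias("fam"),
    ).distinct()
    return source_df.join(keys, on=["i_yr", "fam"], how="inner")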

how to use for loop to split dataframe (list in list) and assign new name

This is the data I am working with:
df = [[1,2,3], [2,3,4], [5,6,7]]
I want to make sure I get a new name for each of df[1], df[2], etc. so that I can use them later in my code.
I tried using the following code:
for i in range(len(df)):
    df_"{}".format(i) = df[i]
Obviously I am getting an error for trying to declare a dataframe like that.
How best can I achieve this, i.e. splitting a list of lists into separate dataframes using a for loop?
Note: I am a newbie to Python, so if I missed anything obvious please help me point that out.
Use:
import pandas as pd

dfps = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
dic_of_dfs = {}
for i, row in enumerate(dfps):
    # one single-column DataFrame per inner list, keyed as df_0, df_1, ...
    dic_of_dfs.update({f"df_{i}": pd.DataFrame(row)})
dic_of_dfs["df_0"]
output:
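For df_0, built from the first inner list [1, 2, 3], the result is a one-column DataFrame:
   0
0  1
1  2
2  3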

from xbbg import blp works for equity but does not work for bonds

I use this pip library: https://pypi.org/project/xbbg/
I do the following imports.
import blpapi
from xbbg import blp
I then run the following test for an equity:
# this works
eqData = blp.bdh(
    tickers='SPX Index', flds=['high', 'low', 'last_price'],
    start_date='2018-10-10', end_date='2018-10-20',
)
print(eqData)
This works and produces the expected dataframe.
I do exactly the same for a corporate bond:
# this returns empty
bondData = blp.bdh(
    tickers='XS1152338072 Corp', flds=['px_bid', 'px_ask'],
    start_date='2019-10-10', end_date='2018-10-20',
)
print(bondData)
This fails (produces an empty DataFrame), even though the data exists.
Here is the result (an empty DataFrame):
getting bond data...
Empty DataFrame
Columns: []
Index: []
Also note that I can get the BDP function to work for bonds.
Why can I not get the BDH function to work?
It appears as though the start date (year) was after the end date; the start date needs to be earlier than the end date. So
start_date='2019-10-10', end_date='2018-10-20'
corrects to, for example,
start_date='2012-10-10', end_date='2018-10-20',
This produces the expected DataFrame.
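Putting that fix back into the original bond request (same ticker and fields, only the dates changed):
bondData = blp.bdh(
    tickers='XS1152338072 Corp', flds=['px_bid', 'px_ask'],
    start_date='2012-10-10', end_date='2018-10-20',
)
print(bondData)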

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first item into df1, the 2nd into df2, etc. But upon trying to do this I just realized that I have no idea how to change the variable being assigned to when doing the assignment via an iteration!
That's what prompts my question. I managed to find another way to get my original job done no problem, but this issue of doing variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib, which was added in Python 3.4+, instead of os:
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # creates a generator of csv paths
# change Path.cwd() to Path(your_path) if the script is in a different location

dfs = {}  # let's hold the csv's in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of DataFrames, with the key being the csv file name; .stem gives the file name without the extension.
much like
{
    'csv_1': dataframe,
    'csv_2': dataframe
}
If you want to concat these, then do
df = pd.concat(dfs)
The index will be the csv file name.
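Since the original goal was only the first row of each CSV, here is one way to finish (a sketch, assuming the dfs dictionary built above): concatenate and keep the first row per file.
import pandas as pd

combined = pd.concat(dfs)                       # outer index level is the csv file stem
first_rows = combined.groupby(level=0).head(1)  # keep just the first row from each file
print(first_rows)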