When I try to copy a pandas Panel object using the instructions provided in the online documentation, I do not get the expected behavior.
Maybe this will illustrate the problem:
import numpy as np
import pandas as pd
# make first panel with some bogus numbers
dates = pd.date_range('20130101',periods=6)
df1 = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('EFGH'))
pnl = {}
pnl['alpha'] = df1
pnl['beta'] = df2
# copy pnl into pnl2
# according to online docs the default is 'deep=True'
# but it chokes when I try to specify deep=True
pnl2 = pnl.copy()
# now delete column C from pnl2['alpha']
del pnl2['alpha']['C']
#Now when I try to find column C in the original panel (pnl) it's gone!
I figure there must be a slick solution to this, but I couldn't find it in the online docs, nor in Wes McKinney's book (my only book on pandas...).
Any tips/advice much appreciated!
You didn't make a Panel, just a dict of DataFrames. Add this line to convert it to a Panel object, and it should work as you expect.
pnl = pd.Panel(pnl)
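As an alternative, if you'd rather keep the plain dict of DataFrames instead of converting to a Panel, a minimal sketch is to deep-copy each frame yourself (DataFrame.copy defaults to deep=True), since dict.copy() only copies the references:
pnl2 = {name: frame.copy() for name, frame in pnl.items()}  # deep copy of each DataFrame
del pnl2['alpha']['C']
print('C' in pnl['alpha'].columns)  # True -- the original frame is untouched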
Related
I have a simple example to test how to append data to a tab in a gsheet using Colab. For example, here is the first code snippet to update the data the first time:
import pandas as pd
from google.colab import auth
auth.authenticate_user()
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
df = pd.DataFrame({'a': ['apple','airplane','alligator'], 'b': ['banana', 'ball', 'butterfly'], 'c': ['cantaloupe', 'crane', 'cat']})
df2 = [df.columns.to_list()] + df.values.tolist()
wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
wsresults2.update(None,df2)
This works for me as shown in the screenshot:
First screenshot
Since it is my work account, I am not able to share a link to the gsheet, apologies for that. Next I need to check whether we can append data to the existing data. To this end, I use the following code:
from gspread_dataframe import get_as_dataframe, set_with_dataframe
wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
set_with_dataframe(wsresults2, df)
Please note that we don't know the rows at which we need to insert the data; they can vary depending on the data size. But the output is still the same, please see the screenshot. Can I please get help on how to append data into the gsheet using this approach? Thanks
Second screenshot
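One approach that may work here (a sketch, not something verified against this sheet): gspread worksheets have an append_rows method that writes below the last row that already contains data, so you don't need to know the target row in advance.
# sketch: append the same DataFrame below the data already in Sheet2,
# reusing the wb / wsresults2 objects from the snippets above
wsresults2.append_rows(df.values.tolist())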
This is the data I am working with:
df = [[1,2,3], [2,3,4], [5,6,7]]
I want to make sure I get a new name for each of df[1], df[2], etc. so that I can use them later in my code.
I tried using the following code:
for i in range(len(df)):
df_"{}".format(i) = df[i]
Obviously I am getting an error for trying to declare a dataframe like that.
How best can I achieve this, i.e. splitting a list of lists into separate dataframes using a for loop?
Note: I am a newbie to Python, so if I missed anything obvious please help me point that out.
Use:
import pandas as pd

dfps = [[1,2,3], [2,3,4], [5,6,7]]
dic_of_dfs = {}
for i, row in enumerate(dfps):
    dic_of_dfs.update({f"df_{i}": pd.DataFrame(row)})
dic_of_dfs["df_0"]
output:
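For reference, dic_of_dfs["df_0"] is the one-column DataFrame built from the first inner list:
   0
0  1
1  2
2  3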
I have information in different dataframes related to several securities. I would like to be able to run a loop where I take the columns I need from different dataframes, create one dataframe per security with those columns, and then store the dataframes as elements of a dictionary, something along the following lines:
securities={}
for tick in tickers:
    df_'tick' = pd.DataFrame(data, columns=['a','b','c'])
    securities[tick] = df_tick
Thanks for your help!!
Try:
securities = {tick: pd.DataFrame(data, columns=['a','b','c'])
              for tick in tickers}
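Each security's frame is then available by key, for example (assuming 'AAPL' is one of the entries in tickers):
aapl = securities['AAPL']   # the DataFrame built for that ticker
print(aapl[['a', 'b']])     # pick out whichever columns you need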
So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index,file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object, and read the first item into df1, the second into df2, etc. But upon trying to do this I just realized I have no idea how to change the variable name being assigned when doing the assignment inside an iteration!
That's what prompts my question. I managed to find another way to get my original job done, no problem, but this issue of doing variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib instead of os; pathlib was added in Python 3.4+.
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # creates a generator of csv paths.
# use Path(your_path) instead of Path.cwd() if the script is in a different location
dfs = {}  # let's hold the csvs in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec.

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of dataframes with the key being the csv file name; .stem gives the name without the extension.
much like
{
'csv_1' : dataframe,
'csv_2' : dataframe
}
if you want to concat these then do
df = pd.concat(dfs)
the index will be the csv file name.
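If you'd rather have the file name as an ordinary column instead of an index level, one way is the following sketch ('source_file' is just a label picked for illustration):
df = pd.concat(dfs, names=['source_file'])  # the dict keys become the outer index level
df = df.reset_index(level=0)                # move the file name into a regular column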
When I run a loop like this (see below) using dask and pandas, only the last field in the list gets evaluated. Presumably this is because of "lazy-evaluation"
import pandas as pd
import dask.dataframe as ddf
df_dask = ddf.from_pandas(df, npartitions=16)
for field in fields:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list)
If I add .compute() to the last line:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list).compute()
it then works correctly, but is this the most efficient way of doing this operation? Is there a way for Dask to add all the items from the fields list at once, and then run them in one-shot via compute()?
edit ---------------
Please see screenshot below for a worked example
You will want to call .compute() at the end of your computation to trigger work. Warning: .compute assumes that your result will fit in memory.
Also, watch out, lambdas late-bind in Python, so the field value may end up being the same for all of your columns.
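A rough sketch of both points together, reusing df_dask and fields from the question: bind field as a default argument so each lambda keeps its own value, add every column lazily, then call .compute() once at the end.
for field in fields:
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(
        lambda cell, f=field: [x for x in cell if x == f], meta=list)
result = df_dask.compute()  # one-shot: all the new columns are materialized together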
Here's one way to do it, where string_check is just a sample function that returns True/False. The issue was the late binding of lambda functions.
from functools import partial

def string_check(string, search):
    return search in string

search_terms = ['foo', 'bar']

for s in search_terms:
    string_check_partial = partial(string_check, search=s)
    df[s] = df['YOUR_STRING_COL'].apply(string_check_partial)