Can I assign the exog column list to a variable? I get an error

forecast = model.get_forecast(50, exog = data1[[***]].iloc[length-60:-10])
Can I use a named variable for what I put in *** above? For example, like below.
eelement = 'open', 'high', 'low', 'volume'
forecast = model.get_forecast(50, exog = data1[[eelement]].iloc[length-60:-10])
But I get an error.
   1296         if missing == len(indexer):
   1297             axis_name = self.obj._get_axis_name(axis)
-> 1298             raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1299
   1300         # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Index([('open', 'high', 'low', 'volume')], dtype='object')] are in the [columns]"

Your eelement is a tuple (the commas create one), and data1[[eelement]] asks pandas for a single column literally named by that tuple. You need to pass a list to pandas; try:
eelement = ['open', 'high', 'low', 'volume']
forecast = model.get_forecast(50, exog = data1[eelement].iloc[length-60:-10])
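To see the difference concretely, here is a minimal sketch with a hypothetical data1 standing in for your frame:
import pandas as pd

# Hypothetical stand-in for data1
data1 = pd.DataFrame([[1.0, 2.0, 0.5, 100]],
                     columns=['open', 'high', 'low', 'volume'])

as_tuple = 'open', 'high', 'low', 'volume'   # the commas build a tuple
# data1[[as_tuple]] asks for ONE column literally named
# ('open', 'high', 'low', 'volume'), hence the KeyError

as_list = ['open', 'high', 'low', 'volume']
print(data1[as_list])                        # selects the four columns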

Generate combinations with specified order with itertools.combinations

I used itertools.combinations to generate combinations of a dataframe's index. I'd like the combinations in a specified order: High, Mid, Low.
Example
from itertools import combinations
import pandas as pd

d = {'levels': ['High', 'High', 'Mid', 'Low', 'Low', 'Low', 'Mid'],
     'converted': [True, True, True, False, False, True, False]}
df = pd.DataFrame(data=d)
df_ = pd.crosstab(df['levels'], df['converted'])
df_
converted  False  True
levels
High           0      2
Low            2      1
Mid            1      1
list(combinations(df_.index, 2)) returns [('High', 'Low'), ('High', 'Mid'), ('Low', 'Mid')].
I'd like the third pair to be ('Mid', 'Low') instead. How can I achieve this?
Use DataFrame.reindex first to put the index into the order you want; note that the first and second pairs in the resulting list are also swapped relative to your original output:
order = ['High','Mid','Low']
a = list(combinations(df_.reindex(order).index, 2))
print (a)
[('High', 'Mid'), ('High', 'Low'), ('Mid', 'Low')]
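The reordering works because combinations simply follows the order of the iterable it is given; a minimal sketch without any DataFrame:
from itertools import combinations

# combinations preserves the order of its input
print(list(combinations(['High', 'Mid', 'Low'], 2)))
# [('High', 'Mid'), ('High', 'Low'), ('Mid', 'Low')]

print(list(combinations(['High', 'Low', 'Mid'], 2)))
# [('High', 'Low'), ('High', 'Mid'), ('Low', 'Mid')]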

GroupBy Function Not Applying

I am trying to group the following specializations, but I am not getting the expected result (or any result, for that matter): the data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg. (In your code, '; '.join(str(x)) turns the whole Series into one string and joins its characters; you need to join the values instead.)
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:

           spec
id
john   eng;math
kevin      math
nick        sci
If you need to preserve the original number of rows, use transform; it returns a single column aligned to the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:

      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci
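Applied to your own pipeline, a minimal sketch (assuming agg_specials, roster, and end_report_specials are defined as in your code):
import pandas as pd

cols_specials = ['Enterprise ID', 'Specialization',
                 'Specialization Branches', 'Specialization Type']

specials = pd.read_csv(agg_specials, engine='python')  # agg_specials: your input path
specials = specials.merge(roster, on='Enterprise ID', how='left')
specials = specials[cols_specials]

# agg collapses each Enterprise ID to one row; join the values, not str(x)
grouped = (specials.groupby('Enterprise ID')['Specialization']
           .agg(lambda x: '; '.join(x.astype(str)))
           .reset_index())
grouped.to_csv(end_report_specials, index=False, encoding='utf-8-sig')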

Pandas groupby with multiple conditions and date difference calculation

I am stuck on which method to use. I have the following dataframe:
import pandas as pd

df = {'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
df
I need to:
groupby the same 'CODE',
check if the 'DESC' is not the same
check if the 'TYPE' is the same
calculate the month difference between dates that satisfy the previous 2 commands
The expected output is the below:
The following code uses .drop_duplicates() and .duplicated() to keep or discard rows of your dataframe that have duplicate values.
How would you calculate a difference in months? A month can be 28, 30, or 31 days long. You could divide the final result by 30 to get an indication of the number of months between dates, so I kept the difference in days for now.
import pandas as pd

df = {'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
     CODE       DATE TYPE DESC previous_date difference_in_dates
   AACCBD 2020-07-21  PUB   OK    2020-07-16              5 days
BBLGLC70M 2019-09-25  PRI   OK    2019-05-16            132 days
    BCCDN 2020-02-27  PUB   OK    2020-02-13             14 days
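If you do want an approximate month count, a sketch building on the df above (these columns are additions, not part of the answer's output):
# rough approximation: days divided by 30, as discussed above
df['months_approx'] = df['difference_in_dates'].dt.days / 30

# or count calendar-month boundaries crossed between the two dates
df['months_calendar'] = (df['DATE'].dt.to_period('M') -
                         df['previous_date'].dt.to_period('M')).apply(lambda p: p.n)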

Merging dataframes by file name

I have multiple files with the following naming convention.
ENCSR000EQO_0_0.txt
ENCSR000DIA_0_0.txt
ENCSR000DIA_1_1.txt
ENCSR000DIA_2_1.txt
ENCSR000DIM_0_0.txt
ENCSR000DIM_1_1.txt
ENCSR000AIB_0_0.txt
ENCSR000AIB_1_1.txt
ENCSR000AIB_2_1.txt
ENCSR000AIB_3_1.txt
I want to merge them as dataframes using pandas according to the file name, so I would end up with 4 resulting dataframes. Then, for each of those 4, I want to group by the gene (GeneName) column, since the same gene appears multiple times.
They all have the same columns in the same order. I can merge all 10 together at once, but I couldn't figure out how to merge by name.
import os
import numpy as np
import pandas as pd

path = '/renamed/'
print os.listdir(path)

df_merge = None
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        df = df.groupby('GeneName').agg(np.mean)
        print df
Thank you for any input.
I would do something more like this, where you can use glob to get the filenames, check each one, and then group the concatenated results.
import os
import glob
import numpy as np
import pandas as pd

path = 'renamed'
df_merge = None
for fid in ('EQO', 'DIA', 'DIM', 'AIB'):
    df_ = pd.DataFrame()
    for fname in glob.glob(os.path.join(path, '*.txt')):
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                          'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                          'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                          'GeneDescription', 'GeneType']
            df_ = pd.concat((df_, df))
    df_ = df_.groupby('GeneName').agg(np.mean)
    print df_
Edit: expanding the answer to be more automated.
Based on your filenames, you might be able to identify the file IDs as follows:
import numpy as np
files = glob.glob(os.path.join(path, '*.txt'))
fids = np.unique([file.split('_')[0] for file in files])
Putting it all together, the updated code would be:
import os
import glob
import numpy as np
import pandas as pd

path = 'renamed'
files = glob.glob(os.path.join(path, '*.txt'))
fids = np.unique([file.split('_')[0] for file in files])

df_merge = None
for fid in fids:
    df_ = pd.DataFrame()
    for fname in files:
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                          'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                          'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                          'GeneDescription', 'GeneType']
            df_ = pd.concat((df_, df))
    df_ = df_.groupby('GeneName').agg(np.mean)
    print df_
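One caveat: file.split('_')[0] keeps the directory prefix in each ID (for example 'renamed/ENCSR000DIA'). The membership check fid in fname still matches, but if you want clean IDs, a small sketch using os.path.basename:
import os

# 'renamed/ENCSR000DIA_1_1.txt' -> 'ENCSR000DIA'
fids = np.unique([os.path.basename(f).split('_')[0] for f in files])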
Try adding the file name as a column, append all dfs to a list, concat them, then group:
df_merge = []
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        df['fname'] = [fname.split('_')[0] for x in df.index]  # repeat the file ID once per row
        df_merge.append(df)

df_all = pd.concat(df_merge)
for fn in set(df_all['fname'].values):
    print df_all[df_all['fname'] == fn].groupby('GeneName').agg(np.mean)
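With the fname column in place, the per-file loop can also collapse into a single grouped aggregation; a sketch over the same df_all:
# one mean per (file ID, gene) pair, no explicit loop
summary = df_all.groupby(['fname', 'GeneName']).agg(np.mean)
print summary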

Where is DataFrame() after moving from pandas.io.data to pandas_datareader?

I installed Python/pandas on a new PC:
Successfully installed pandas-datareader-0.2.1 requests-file-1.4.1
But the old code is not working after replacing pandas.io with pandas_datareader.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
f = web.DataReader("F", 'yahoo', start, end)
columns = ['Open', 'High', 'Low', 'Close', 'DateIdx']
diDian = web.DataFrame(columns=columns)
I get this:
File "delme1.py", line 9, in <module>
    diDian = web.DataFrame(columns=columns)
AttributeError: 'module' object has no attribute 'DataFrame'
How do I fix this?
OK, this works:
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
f = web.DataReader("F", 'yahoo', start, end)
f['DateIdx'] = 0
columns = ['Open', 'High', 'Low', 'Close', 'DateIdx']
diDian = f[columns]
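The underlying point: DataFrame was never part of the data-reader module. With the old pandas.io.data module, web.DataFrame happened to resolve because that module imported DataFrame internally; pandas_datareader does not. The direct fix for the original line is to take DataFrame from pandas itself:
import pandas as pd

columns = ['Open', 'High', 'Low', 'Close', 'DateIdx']
diDian = pd.DataFrame(columns=columns)  # DataFrame comes from pandas, not pandas_datareader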