Can I assign the exog column list to a variable? I get an error

forecast = model.get_forecast(50, exog = data1[[***]].iloc[length-60:-10])
Can I use a named variable for what I put in *** above? For example, like below.
eelement = 'open', 'high', 'low', 'volume'
forecast = model.get_forecast(50, exog = data1[[eelement]].iloc[length-60:-10])
But I get an error.
   1296         if missing == len(indexer):
   1297             axis_name = self.obj._get_axis_name(axis)
-> 1298             raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1299
   1300         # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Index([('open', 'high', 'low', 'volume')], dtype='object')] are in the [columns]"

Your eelement is a tuple (the commas create one), and data1[[eelement]] asks pandas for a single column literally named by that tuple. You need to pass a list to pandas; try:
eelement = ['open', 'high', 'low', 'volume']
forecast = model.get_forecast(50, exog = data1[eelement].iloc[length-60:-10])
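To see the difference concretely, here is a minimal sketch with a hypothetical data1 standing in for your frame:
import pandas as pd

# Hypothetical stand-in for data1
data1 = pd.DataFrame([[1.0, 2.0, 0.5, 100]],
                     columns=['open', 'high', 'low', 'volume'])

as_tuple = 'open', 'high', 'low', 'volume'   # the commas build a tuple
# data1[[as_tuple]] asks for ONE column literally named
# ('open', 'high', 'low', 'volume'), hence the KeyError

as_list = ['open', 'high', 'low', 'volume']
print(data1[as_list])                        # selects the four columns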

Generate combinations with specified order with itertools.combinations

I used itertools.combinations to generate combinations of a dataframe's index. I'd like the combinations in a specified order: High, Mid, Low.
Example
from itertools import combinations
import pandas as pd

d = {'levels': ['High', 'High', 'Mid', 'Low', 'Low', 'Low', 'Mid'],
     'converted': [True, True, True, False, False, True, False]}
df = pd.DataFrame(data=d)
df_ = pd.crosstab(df['levels'], df['converted'])
df_
converted  False  True
levels
High           0      2
Low            2      1
Mid            1      1
list(combinations(df_.index, 2)) returns [('High', 'Low'), ('High', 'Mid'), ('Low', 'Mid')].
I'd like the third pair to be ('Mid', 'Low') instead. How can I achieve this?
Use DataFrame.reindex first to put the index into the order you want; note that the first and second pairs in the resulting list are also swapped relative to your original output:
order = ['High','Mid','Low']
a = list(combinations(df_.reindex(order).index, 2))
print (a)
[('High', 'Mid'), ('High', 'Low'), ('Mid', 'Low')]
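The reordering works because combinations simply follows the order of the iterable it is given; a minimal sketch without any DataFrame:
from itertools import combinations

# combinations preserves the order of its input
print(list(combinations(['High', 'Mid', 'Low'], 2)))
# [('High', 'Mid'), ('High', 'Low'), ('Mid', 'Low')]

print(list(combinations(['High', 'Low', 'Mid'], 2)))
# [('High', 'Low'), ('High', 'Mid'), ('Low', 'Mid')]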

GroupBy Function Not Applying

I am trying to group the following specializations, but I am not getting the expected result (or any result, for that matter): the data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg. (In your code, '; '.join(str(x)) turns the whole Series into one string and joins its characters; you need to join the values instead.)
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:

           spec
id
john   eng;math
kevin      math
nick        sci
If you need to preserve the original number of rows, use transform; it returns a single column aligned to the original index:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:

      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci
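Applied to your own pipeline, a minimal sketch (assuming agg_specials, roster, and end_report_specials are defined as in your code):
import pandas as pd

cols_specials = ['Enterprise ID', 'Specialization',
                 'Specialization Branches', 'Specialization Type']

specials = pd.read_csv(agg_specials, engine='python')  # agg_specials: your input path
specials = specials.merge(roster, on='Enterprise ID', how='left')
specials = specials[cols_specials]

# agg collapses each Enterprise ID to one row; join the values, not str(x)
grouped = (specials.groupby('Enterprise ID')['Specialization']
           .agg(lambda x: '; '.join(x.astype(str)))
           .reset_index())
grouped.to_csv(end_report_specials, index=False, encoding='utf-8-sig')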

Pandas groupby with multiple conditions and date difference calculation

I am stuck on which method to use. I have the following dataframe:
import pandas as pd

df = {'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
df
I need to:
groupby the same 'CODE',
check if the 'DESC' is not the same
check if the 'TYPE' is the same
calculate the month difference between dates that satisfy the previous 2 commands
The expected output is the below:
The following code uses .drop_duplicates() and .duplicated() to keep or discard rows of your dataframe that have duplicate values.
How would you calculate a difference in months? A month can be 28, 30, or 31 days long. You could divide the final result by 30 to get an indication of the number of months between dates, so I kept the difference in days for now.
import pandas as pd

df = {'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
     CODE       DATE TYPE DESC previous_date difference_in_dates
   AACCBD 2020-07-21  PUB   OK    2020-07-16              5 days
BBLGLC70M 2019-09-25  PRI   OK    2019-05-16            132 days
    BCCDN 2020-02-27  PUB   OK    2020-02-13             14 days
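If you do want an approximate month count, a sketch building on the df above (these columns are additions, not part of the answer's output):
# rough approximation: days divided by 30, as discussed above
df['months_approx'] = df['difference_in_dates'].dt.days / 30

# or count calendar-month boundaries crossed between the two dates
df['months_calendar'] = (df['DATE'].dt.to_period('M') -
                         df['previous_date'].dt.to_period('M')).apply(lambda p: p.n)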

Merging dataframes by file name

I have multiple files with the following naming convention.
ENCSR000EQO_0_0.txt
ENCSR000DIA_0_0.txt
ENCSR000DIA_1_1.txt
ENCSR000DIA_2_1.txt
ENCSR000DIM_0_0.txt
ENCSR000DIM_1_1.txt
ENCSR000AIB_0_0.txt
ENCSR000AIB_1_1.txt
ENCSR000AIB_2_1.txt
ENCSR000AIB_3_1.txt
I want to merge them as dataframes using pandas according to the file name, so I would end up with 4 resulting dataframes. Then, for each of those 4, I want to group by the gene (GeneName) column, since the same gene appears multiple times.
They all have the same columns in the same order. I can merge all 10 together at once, but I couldn't figure out how to merge by name.
import os
import numpy as np
import pandas as pd

path = '/renamed/'
print os.listdir(path)

df_merge = None
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        df = df.groupby('GeneName').agg(np.mean)
        print df
Thank you for any input.
I would do something more like this, where you can use glob to get the filenames, check each one, and then group the concatenated results.
import os
import glob
import numpy as np
import pandas as pd

path = 'renamed'
df_merge = None
for fid in ('EQO', 'DIA', 'DIM', 'AIB'):
    df_ = pd.DataFrame()
    for fname in glob.glob(os.path.join(path, '*.txt')):
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                          'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                          'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                          'GeneDescription', 'GeneType']
            df_ = pd.concat((df_, df))
    df_ = df_.groupby('GeneName').agg(np.mean)
    print df_
Edit: expanding the answer to be more automated.
Based on your filenames, you might be able to identify the file IDs as follows:
import numpy as np
files = glob.glob(os.path.join(path, '*.txt'))
fids = np.unique([file.split('_')[0] for file in files])
Putting it all together, the updated code would be:
import os
import glob
import numpy as np
import pandas as pd

path = 'renamed'
files = glob.glob(os.path.join(path, '*.txt'))
fids = np.unique([file.split('_')[0] for file in files])

df_merge = None
for fid in fids:
    df_ = pd.DataFrame()
    for fname in files:
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                          'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                          'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                          'GeneDescription', 'GeneType']
            df_ = pd.concat((df_, df))
    df_ = df_.groupby('GeneName').agg(np.mean)
    print df_
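One caveat: file.split('_')[0] keeps the directory prefix in each ID (for example 'renamed/ENCSR000DIA'). The membership check fid in fname still matches, but if you want clean IDs, a small sketch using os.path.basename:
import os

# 'renamed/ENCSR000DIA_1_1.txt' -> 'ENCSR000DIA'
fids = np.unique([os.path.basename(f).split('_')[0] for f in files])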
Try adding the file name as a column, append all dfs to a list, concat them, then group:
df_merge = []
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        df['fname'] = [fname.split('_')[0] for x in df.index]  # repeat the file ID once per row
        df_merge.append(df)

df_all = pd.concat(df_merge)
for fn in set(df_all['fname'].values):
    print df_all[df_all['fname'] == fn].groupby('GeneName').agg(np.mean)
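With the fname column in place, the per-file loop can also collapse into a single grouped aggregation; a sketch over the same df_all:
# one mean per (file ID, gene) pair, no explicit loop
summary = df_all.groupby(['fname', 'GeneName']).agg(np.mean)
print summary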

Where is DataFrame() after moving from pandas.io.data to pandas_datareader?

I installed Python/pandas on a new PC:
Successfully installed pandas-datareader-0.2.1 requests-file-1.4.1
But the old code is not working after replacing pandas.io with pandas_datareader.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
f = web.DataReader("F", 'yahoo', start, end)
columns = ['Open', 'High', 'Low', 'Close', 'DateIdx']
diDian = web.DataFrame(columns=columns)
I get this:
File "delme1.py", line 9, in <module>
    diDian = web.DataFrame(columns=columns)
AttributeError: 'module' object has no attribute 'DataFrame'
How do I fix this?
OK, this works:
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
f = web.DataReader("F", 'yahoo', start, end)
f['DateIdx'] = 0
columns = ['Open', 'High', 'Low', 'Close', 'DateIdx']
diDian = f[columns]
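The underlying point: DataFrame was never part of the data-reader module. With the old pandas.io.data module, web.DataFrame happened to resolve because that module imported DataFrame internally; pandas_datareader does not. The direct fix for the original line is to take DataFrame from pandas itself:
import pandas as pd

columns = ['Open', 'High', 'Low', 'Close', 'DateIdx']
diDian = pd.DataFrame(columns=columns)  # DataFrame comes from pandas, not pandas_datareader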