Python Pandas: Read an External Dataset Into a DataFrame Only If Conditions Are Met, Using a Function Call

Let's say I have an Excel file called "test.xlsx" on my local machine. I can read this data set in using traditional code.
df_test = pd.read_excel('test.xlsx')
However, I want to read that dataset in only if a condition is met; if another condition is met, I want to read in a different dataset.
Below is the code I tried using a function:
def conditional_run(x):
    if x == 'fleet':
        eval('''df_test = pd.read_excel('test.xlsx')''')
    elif x != 'fleet':
        eval('''df_test2 = pd.read_excel('test_2.xlsx')''')

conditional_run('fleet')
Below is the error I get:
File "<string>", line 1
df_test = pd.read_excel('0Day Work Items Raw Data.xlsx')
^
SyntaxError: invalid syntax

There probably isn't a reason to use eval in this case. It might be sufficient to conditionally read the file based on its name. For example:
def conditional_run(x):
    if x == 'fleet':
        file = "test.xlsx"
    else:
        file = "test_2.xlsx"
    df_test = pd.read_excel(file)
    return df_test

conditional_run('fleet')
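If more conditions get added later, a dictionary lookup scales better than a chain of if/elif. A minimal sketch, assuming the same placeholder filenames as above:

```python
import pandas as pd

# Map condition values to filenames up front; unknown values fall back
# to a default file (names here are just placeholders)
FILES = {'fleet': 'test.xlsx'}
DEFAULT_FILE = 'test_2.xlsx'

def pick_file(x):
    # Select the filename for a given condition
    return FILES.get(x, DEFAULT_FILE)

def conditional_run(x):
    # Read whichever file the condition maps to
    return pd.read_excel(pick_file(x))

print(pick_file('fleet'))   # test.xlsx
print(pick_file('other'))   # test_2.xlsx
```

Adding a new case is then a one-line change to the `FILES` dict rather than another branch.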

Related

Does Snakemake have states during a workflow

Does Snakemake support states in pipelines, meaning the current run can change according to, e.g., the last 10 runs?
For example: data is being processed, and if the current value is greater than X and at least 5 of the last 10 values were also greater than X, then I want the workflow to branch differently; otherwise it should continue normally.
You could potentially use some slightly hacky workaround with checkpoints and multiple snakefiles to achieve conditional execution of rules.
For example, a first snakemake file that includes a checkpoint, i.e. a rule that waits for the execution of previous rules and only gets evaluated then. Here you could check your conditions from the current pipeline and previous results. For the example code I'm just using a random number to determine what the checkpoint does.
rule all:
    input:
        "random_number.txt",
        "next_step.txt"

rule random_number:
    output: "random_number.txt"
    run:
        import numpy as np
        r = np.random.choice([0, 1])
        with open(output[0], 'w') as fh:
            fh.write(f"{r}")

checkpoint next_rule:
    output: "next_step.txt"
    run:
        # read the random number
        with open("random_number.txt", 'r') as rn:
            num = int(rn.read())
        print(num)
        if num == 0:
            with open(output[0], 'w') as fh:
                fh.write("case_a")
        elif num == 1:
            with open(output[0], 'w') as fh:
                fh.write("case_b")
        else:
            exit(1)
Then you could have a second snakefile with a conditional rule all, i.e. a list of output files that depends on the result of the first pipeline.
with open("next_step.txt", 'r') as fh:
    case = fh.read()

outputs = []
if case == "case_a":
    outputs = ["output_a_0.txt", "output_a_1.txt"]
if case == "case_b":
    outputs = ["output_b_0.txt", "output_b_1.txt"]

rule all:
    input:
        outputs

python DataFrame data export to stata using to_stata() raise ValueError

I use to_stata() to export my DataFrame:
AppliedTariff.to_stata('Applied%s.dta' % name, write_index=False)
but it raises:
ValueError: Writing general object arrays is not supported
I do not know how to continue.
Find out which columns are of the object type:
list(df.select_dtypes(include=['object']).columns)
Convert them to something else: df['col'] = df['col'].astype(str)
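Putting the two steps together, a minimal sketch with a toy frame (the column names here are made up, not from the original question) could look like this:

```python
import pandas as pd

# Toy frame standing in for AppliedTariff; the mixed-type "code" column
# becomes an object column that to_stata() cannot serialize as-is
df = pd.DataFrame({"code": [1, "A2"], "rate": [0.05, 0.10]})

# 1. Find the object-dtype columns
obj_cols = list(df.select_dtypes(include=["object"]).columns)

# 2. Convert them to plain strings
for col in obj_cols:
    df[col] = df[col].astype(str)

# Now the export succeeds
df.to_stata("applied.dta", write_index=False)
```

After the conversion every value in the offending columns is a `str`, which to_stata() can write as a Stata string variable.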

Pandas: Location of a row with error

I am pretty new to Pandas and trying to find out where my code breaks. Say, I am doing a type conversion:
df['x']=df['x'].astype('int')
...and I get an error: "ValueError: invalid literal for long() with base 10: '1.0692e+06'".
In general, if I have 1000 entries in the dataframe, how can I find out which entry causes the break? Is there anything in ipdb to output the current location (i.e. where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.
The error you are seeing might be due to the value(s) in the x column being strings:
In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
In [16]: df['x'].astype('int')
ValueError: invalid literal for long() with base 10: '1.0692e+06'
Ideally, the problem can be avoided by making sure the values stored in the
DataFrame are already ints not strings when the DataFrame is built.
How to do that depends of course on how you are building the DataFrame.
After the fact, the DataFrame could be fixed using applymap:
import ast
df = df.applymap(ast.literal_eval).astype('int')
but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
Usually you could drop to a debugger when an exception is raised to inspect the problematic value of row.
However, in this case the exception is happening inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code does the looping through the values in df['x'], so the Python debugger is not helpful here: it cannot show you which value inside the compiled loop raised the exception.
There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.
So instead I would revert to a low-brow solution: iterate through the values in a Python loop and use try...except to catch the first error:
df = pd.DataFrame({'x': ['1.0692e+06']})
for i, item in enumerate(df['x']):
    try:
        int(item)
    except ValueError:
        print('ERROR at index {}: {!r}'.format(i, item))
yields
ERROR at index 0: '1.0692e+06'
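A vectorized alternative (not from the original answers, just a sketch): since the values are strings, you can flag every entry that is not a plain integer literal with a regex before calling astype, which avoids the Python-level loop entirely:

```python
import pandas as pd

# Toy column mixing clean integer strings with a float-style string
df = pd.DataFrame({'x': ['1', '2', '1.0692e+06', '4']})

# True for strings that are NOT plain (optionally signed) integer
# literals; exactly these rows would make astype('int') blow up
mask = ~df['x'].str.fullmatch(r'-?\d+')
print(df.loc[mask, 'x'])
```

`Series.str.fullmatch` requires pandas >= 1.1; on older versions `str.match(r'-?\d+$')` gives the same effect.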
I hit the same problem, and as I have a big input file (3 million rows), enumerating all rows would take a long time. Therefore I wrote a binary search to locate the offending row.
import pandas as pd
import sys

def binarySearch(df, l, r, func):
    while l <= r:
        mid = l + (r - l) // 2
        result = func(df, mid, mid + 1)
        if result:
            # We hit the exception at mid
            return mid, result
        result = func(df, l, mid)
        if result is None:
            # No exception in the left half, so search the right half
            l = mid + 1
        else:
            r = mid - 1
    # If we reach here, no offending row was found
    return -1, None

def check(df, start, end):
    result = None
    try:
        # In my case, I want to find out which row causes this failure
        df.iloc[start:end].uid.astype(int)
    except Exception as e:
        result = str(e)
    return result

df = pd.read_csv(sys.argv[1])
index, result = binarySearch(df, 0, len(df), check)
print("index: {}".format(index))
print(result)
To report all rows that fail with any exception, iterate over the rows yourself and catch each error:
df.apply(my_function)  # throws various exceptions at unknown rows

# print the exception, index, and row content instead;
# df.iterrows() yields (index, row) pairs, unlike enumerate(df),
# which would only iterate over the column names
for i, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, row))
        print(e)

numpy returning record in a where query for an already deleted index in pandas df

I am trying to run this command:
ipums = ipums.drop(np.where(ipums['wkswork1'] == 0)[0])
but I am getting an error:
raise ValueError('labels %s not contained in axis' % labels[mask])
I check the ipums dataset for a value returned in the array:
ipums[207]
and I get:
File "index.pyx", line 128, in pandas.index.IndexEngine.get_loc (pandas/index.c:3542)
File "index.pyx", line 138, in pandas.index.IndexEngine.get_loc (pandas/index.c:3322)
KeyError: 207
I assume this means the record was deleted earlier (and it was, by a similar earlier command that addressed a different field).
Am I missing something here?
The usual way you would do this in pandas is to use a boolean mask:
ipums = ipums[ipums['wkswork1'] != 0]
You can also use a ~ to negate the mask.
The error is raised because numpy's where returns the integer locations of the rows rather than their index labels, which means you can't use drop (as drop works on labels).
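The position/label mismatch can be seen in a small sketch (a toy frame with made-up index labels standing in for the ipums data):

```python
import numpy as np
import pandas as pd

# Hypothetical data; the index labels deliberately differ from positions
df = pd.DataFrame({'wkswork1': [0, 12, 0, 40]}, index=[10, 11, 12, 13])

# np.where returns positions 0 and 2, not the labels 10 and 12, so
# df.drop(positions) would raise "labels not contained in axis"
positions = np.where(df['wkswork1'] == 0)[0]

# The boolean-mask approach keeps the wanted rows directly, no labels needed
filtered = df[df['wkswork1'] != 0]
print(filtered)
```

Here `positions` is `[0, 2]` while the rows to drop carry labels 10 and 12, which is exactly the mismatch behind the ValueError.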

Python - Change variable value on exit for next session

I want to change the value of a variable on exit so that on the next run, it remains what was last set. This is a short version of my current code:
def example():
    x = 1
    while True:
        x = x + 1
        print x
On 'KeyboardInterrupt', I want the last value set in the while loop to be a global variable. On running the code next time, that value should be the 'x' in line 2. Is it possible?
This is a bit hacky, but hopefully it gives you an idea that you can better implement in your current situation (pickle/cPickle is what you should use if you want to persist more robust data structures - this is just a simple case):
import sys

def example():
    x = 1
    # Wrap in a try/except to catch the interrupt
    try:
        while True:
            x = x + 1
            print x
    except KeyboardInterrupt:
        # On interrupt, write the value to a simple file and exit
        with open('myvar', 'w') as f:
            f.write(str(x))
        sys.exit(0)

# Not sure of your implementation (probably not this :) ), but
# prompt to run the function
resp = raw_input('Run example (y/n)? ')
if resp.lower() == 'y':
    example()
else:
    # If the function isn't to be run, read the variable back.
    # Note that this will fail if the file hasn't been written yet,
    # so you will have to make adjustments if necessary.
    with open('myvar', 'r') as f:
        myvar = f.read()
    print int(myvar)
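For state richer than a single integer, the pickle approach mentioned above fits better. A minimal Python 3 sketch (the filename and dict layout are just placeholders):

```python
import os
import pickle

STATE_FILE = 'state.pkl'  # hypothetical filename

# Load the previous state if a file exists, otherwise start fresh
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, 'rb') as f:
        state = pickle.load(f)
else:
    state = {'x': 1}

state['x'] += 1  # the work that updates the value

# Persist the state so the next run picks up where this one left off
with open(STATE_FILE, 'wb') as f:
    pickle.dump(state, f)
print(state['x'])
```

Each run of the script prints a value one higher than the last, because the dict survives between sessions in the pickle file.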
You could save any variables that you want to persist to a text file then read them back in to the script the next time it runs.
Here is a link for reading and writing to text files.
http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Hope it helps!