Create a dictionary using "if... else" in Julia?


I'm working with a sample CSV file that lists nursing home residents' DOBs and DODs. I used those fields to calculate their age at death, and now I'm trying to create a dictionary that "bins" their age at death into groups. I'd like the bins to be 1-25, 26-50, 51-75, and 76-100.
Is there a concise way to make a Dict(subject_id, age, age_bin) using "if... else" syntax?
For example: (John, 76, "76-100"), (Moira, 58, "51-75").
So far I have:
#import modules
using CSV
using DataFrames
using Dates
# Open, read, write desired files
input_file = open("../data/FILE.csv", "r")
output_file = open("FILE_output.txt", "w")
# Use to later skip header line
file_flag = 0
for line in readlines(input_file)
    if file_flag==0
        global file_flag = 1
        continue
    end
    # Define what each field in FILE corresponds to
    line_array = split(line, ",")
    subject_id = line_array[2]
    gender = line_array[3]
    date_of_birth = line_array[4]
    date_of_death = line_array[5]
    # Get yyyy-mm-dd only (first ten characters) from fields 4 and 5:
    date_birth = date_of_birth[1:10]
    date_death = date_of_death[1:10]
    # Create DateFormat; use to calculate age
    date_format = DateFormat("y-m-d")
    age_days = Date(date_death, date_format) - Date(date_birth, date_format)
    age_years = round(Dates.value(age_days)/365.25, digits=0)
    # Use "if else" statement to determine values
    keys = age_years
    function values()
        if age_years <= 25
            return "0-25"
        elseif age_years <= 50
            return "26-50"
        elseif age_years <= 75
            return "51-75"
        else age_years < 100
            return "76-100"
        end
    end
    values()
    # Create desired dictionary
    age_death_dict = Dict(zip(keys, values()))
end
Edit: or is there a better way to approach this using DataFrames?

To answer your question, "is there a concise way using if/else": probably not, given that you have 5 cases (four age ranges plus invalid ages) to account for. Suppose you have names and ages in two separate lists (which I assume you generate from your example code, although I can't see the input CSVs):
julia> name = ["John", "Mary", "Robert", "Cindy", "Beatrice"];
julia> ages = [24, 73, 75, 69, 90];
julia> function bin_age_ifelse(a)
           if a<1
               return "Invalid age"
           elseif 1<=a<=25
               return "1-25"
           elseif 25<a<=50
               return "26-50"
           elseif 50<a<=75
               return "51-75"
           else
               return "76-100"
           end
       end
bin_age_ifelse (generic function with 1 method)
julia> binned_ifelse = Dict([n=>[a, bin_age_ifelse(a)] for (n,a) in zip(name, ages)])
Dict{String, Vector{Any}} with 5 entries:
"John" => [24, "1-25"]
"Mary" => [73, "51-75"]
"Beatrice" => [90, "76-100"]
"Robert" => [75, "51-75"]
"Cindy" => [69, "51-75"]
Here's an option for the binning function to avoid if/else syntax, although there are probably yet more elegant ways to do it:
julia> function bin_age(a)
           bins = [1:25, 26:50, 51:75, 76:100]
           for b in bins
               if a in b
                   return "$(b[1])-$(b[end])"
               end
           end
       end
bin_age (generic function with 1 method)
julia> bin_age(84)
"76-100"
I've taken some liberties with the format of the answer, using the name as the key, since your original question describes a dict format that doesn't really make sense in Julia. If you'd like to have the keys be the age ranges, you could construct the dictionary above and then invert it as described here (with some modification since the values above have two entries).
If you don't care about name, age, or age range being a key, then I would suggest using DataFrames.jl:
julia> using DataFrames
julia> d = DataFrame(name=name, age=ages, age_range=[bin_age(a) for a in ages])
5×3 DataFrame
 Row │ name      age    age_range
     │ String    Int64  String
─────┼────────────────────────────
   1 │ John         24  1-25
   2 │ Mary         73  51-75
   3 │ Robert       75  51-75
   4 │ Cindy        69  51-75
   5 │ Beatrice     90  76-100

Related

Capping multiples columns

I found an interesting snippet (vrana95) that caps multiple columns. However, the function also modifies the original "df" instead of affecting only "final_df". Does anyone know why?
def cap_data(df):
    for col in df.columns:
        print("capping the ",col)
        if (((df[col].dtype)=='float64') | ((df[col].dtype)=='int64')):
            percentiles = df[col].quantile([0.01,0.99]).values
            df[col][df[col] <= percentiles[0]] = percentiles[0]
            df[col][df[col] >= percentiles[1]] = percentiles[1]
        else:
            df[col]=df[col]
    return df
final_df=cap_data(df)
As I wanted to cap only a few columns, I changed the for loop of the original snippet. It works, but I would like to know why the function changes both dataframes.
cols = ['score_3', 'score_6', 'credit_limit', 'last_amount_borrowed', 'reported_income', 'income']
def cap_data(df):
    for col in cols:
        print("capping the column:",col)
        if (((df[col].dtype)=='float64') | ((df[col].dtype)=='int64')):
            percentiles = df[col].quantile([0.01,0.99]).values
            df[col][df[col] <= percentiles[0]] = percentiles[0]
            df[col][df[col] >= percentiles[1]] = percentiles[1]
        else:
            df[col]=df[col]
    return df
final_df=cap_data(df)
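A likely explanation, sketched here without pandas: the function receives a reference to the caller's object and modifies it in place, so the returned value and the original are the same object. The dict of lists below is only a stand-in for a DataFrame, and cap_values with fixed bounds is a hypothetical simplification of the percentile logic above.

```python
import copy

def cap_values(table, low, high):
    # Mutates the lists stored in `table` in place and returns the same object.
    for column, values in table.items():
        for i, v in enumerate(values):
            values[i] = min(max(v, low), high)
    return table

original = {"income": [1, 50, 999]}
capped = cap_values(original, 10, 100)
same_object = capped is original  # True: the caller's data was changed too

# Passing a copy leaves the caller's data untouched.
fresh = {"income": [1, 50, 999]}
capped_copy = cap_values(copy.deepcopy(fresh), 10, 100)
```

With pandas, the analogous fix would be to call cap_data(df.copy()), so that the original frame is not modified.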

How to convert a string inside a function to a variable name that holds a Pandas datframe outside the function? [duplicate]

I know that some other languages, such as PHP, support a concept of "variable variable names" - that is, the contents of a string can be used as part of a variable name.
I heard that this is a bad idea in general, but I think it would solve some problems I have in my Python code.
Is it possible to do something like this in Python? What can go wrong?
If you are just trying to look up an existing variable by its name, see How can I select a variable by (string) name?. However, first consider whether you can reorganize the code to avoid that need, following the advice in this question.

You can use dictionaries to accomplish this. Dictionaries are stores of keys and values.
>>> dct = {'x': 1, 'y': 2, 'z': 3}
>>> dct
{'y': 2, 'x': 1, 'z': 3}
>>> dct["y"]
2
You can use variable key names to achieve the effect of variable variables without the security risk.
>>> x = "spam"
>>> z = {x: "eggs"}
>>> z["spam"]
'eggs'
For cases where you're thinking of doing something like
var1 = 'foo'
var2 = 'bar'
var3 = 'baz'
...
a list may be more appropriate than a dict. A list represents an ordered sequence of objects, with integer indices:
lst = ['foo', 'bar', 'baz']
print(lst[1]) # prints bar, because indices start at 0
lst.append('potatoes') # lst is now ['foo', 'bar', 'baz', 'potatoes']
For ordered sequences, lists are more convenient than dicts with integer keys, because lists support iteration in index order, slicing, append, and other operations that would require awkward key management with a dict.

Use the built-in getattr function to get an attribute on an object by name. Modify the name as needed.
obj.spam = 'eggs'
name = 'spam'
getattr(obj, name) # returns 'eggs'

It's not a good idea. If you are accessing a global variable you can use globals().
>>> a = 10
>>> globals()['a']
10
If you want to access a variable in the local scope you can use locals(), but you cannot assign values to the returned dict.
A better solution is to use getattr or store your variables in a dictionary and then access them by name.
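A small sketch of the point about locals(), assuming CPython: inside a function, writing to the dictionary returned by locals() does not create a real local variable, whereas an explicit dictionary behaves predictably. The function names are made up for illustration.

```python
def try_locals():
    # Writing to the dict returned by locals() inside a function
    # does not create an actual local variable.
    locals()["answer"] = 42
    try:
        return answer  # compiled as a global lookup, which fails
    except NameError:
        return None

def use_dict():
    # An explicit dictionary is the reliable alternative.
    variables = {}
    variables["answer"] = 42
    return variables["answer"]

result_locals = try_locals()
result_dict = use_dict()
```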

New coders sometimes write code like this:
my_calculator.button_0 = tkinter.Button(root, text=0)
my_calculator.button_1 = tkinter.Button(root, text=1)
my_calculator.button_2 = tkinter.Button(root, text=2)
...
The coder is then left with a pile of named variables, with a coding effort of O(m * n), where m is the number of named variables and n is the number of times that group of variables needs to be accessed (including creation). The more astute beginner observes that the only difference in each of those lines is a number that changes based on a rule, and decides to use a loop. However, they get stuck on how to dynamically create those variable names, and may try something like this:
for i in range(10):
    my_calculator.('button_%d' % i) = tkinter.Button(root, text=i)
They soon find that this does not work.
If the program requires arbitrary variable "names," a dictionary is the best choice, as explained in other answers. However, if you're simply trying to create many variables and you don't mind referring to them with a sequence of integers, you're probably looking for a list. This is particularly true if your data are homogeneous, such as daily temperature readings, weekly quiz scores, or a grid of graphical widgets.
This can be assembled as follows:
my_calculator.buttons = []
for i in range(10):
    my_calculator.buttons.append(tkinter.Button(root, text=i))
This list can also be created in one line with a comprehension:
my_calculator.buttons = [tkinter.Button(root, text=i) for i in range(10)]
The result in either case is a populated list, with the first element accessed with my_calculator.buttons[0], the next with my_calculator.buttons[1], and so on. The "base" variable name becomes the name of the list and the varying identifier is used to access it.
Finally, don't forget other data structures, such as the set - this is similar to a dictionary, except that each "name" doesn't have a value attached to it. If you simply need a "bag" of objects, this can be a great choice. Instead of something like this:
keyword_1 = 'apple'
keyword_2 = 'banana'
if query == keyword_1 or query == keyword_2:
    print('Match.')
You will have this:
keywords = {'apple', 'banana'}
if query in keywords:
    print('Match.')
Use a list for a sequence of similar objects, a set for an arbitrarily-ordered bag of objects, or a dict for a bag of names with associated values.

Whenever you want to use variable variables, it's probably better to use a dictionary. So instead of writing
$foo = "bar"
$$foo = "baz"
you write
mydict = {}
foo = "bar"
mydict[foo] = "baz"
This way you won't accidentally overwrite previously existing variables (which is the security aspect) and you can have different "namespaces".

Use globals() (disclaimer: this is bad practice, but it is the most straightforward answer to your question; prefer another data structure, as in the accepted answer).
You can assign variables to the global scope dynamically. For instance, if you want 10 variables i_0, i_1 ... i_9 that can be accessed in the global scope:
for i in range(10):
    globals()['i_{}'.format(i)] = 'a'
This assigns 'a' to all 10 variables; of course, you can change the value dynamically as well. All of these variables can now be accessed like any other globally declared variable:
>>> i_5
'a'

Instead of a dictionary you can also use namedtuple from the collections module, which makes access easier.
For example:
# using dictionary
variables = {}
variables["first"] = 34
variables["second"] = 45
print(variables["first"], variables["second"])
# using namedtuple
from collections import namedtuple
Variables = namedtuple('Variables', ['first', 'second'])
v = Variables(34, 45)
print(v.first, v.second)

The SimpleNamespace class could be used to create new attributes with setattr, or subclass SimpleNamespace and create your own function to add new attribute names (variables).
from types import SimpleNamespace
variables = {"b":"B","c":"C"}
a = SimpleNamespace(**variables)
setattr(a,"g","G")
a.g = "G+"
something = a.b  # a.a would raise AttributeError: no attribute named a was ever set

If you don't want to use any object, you can still use setattr() inside your current module:
import sys
current_module = module = sys.modules[__name__] # i.e the "file" where your code is written
setattr(current_module, 'variable_name', 15) # 15 is the value you assign to the var
print(variable_name) # >>> 15, created from a string

You have to use the built-in globals() function to achieve that behaviour:
def var_of_var(k, v):
    globals()[k] = v
# print(variable_name) at this point would raise NameError: name 'variable_name' is not defined
some_name = 'variable_name'
globals()[some_name] = 123
print(variable_name) # 123
some_name = 'variable_name2'
var_of_var(some_name, 456)
print(variable_name2) # 456

Variable variables in Python
"""
<?php
$a = 'hello';
$e = 'wow'
?>
<?php
$$a = 'world';
?>
<?php
echo "$a ${$a}\n";
echo "$a ${$a[1]}\n";
?>
<?php
echo "$a $hello";
?>
"""
a = 'hello' #<?php $a = 'hello'; ?>
e = 'wow' #<?php $e = 'wow'; ?>
vars()[a] = 'world' #<?php $$a = 'world'; ?>
print(a, vars()[a]) #<?php echo "$a ${$a}\n"; ?>
print(a, vars()[vars()['a'][1]]) #<?php echo "$a ${$a[1]}\n"; ?>
print(a, hello) #<?php echo "$a $hello"; ?>
Output:
hello world
hello wow
hello world
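At module level, globals(), locals(), and vars() return the same dictionary, so they produce the same results (inside a function, writing to locals() does not create a local variable):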
#<?php $a = 'hello'; ?>
#<?php $e = 'wow'; ?>
#<?php $$a = 'world'; ?>
#<?php echo "$a ${$a}\n"; ?>
#<?php echo "$a ${$a[1]}\n"; ?>
#<?php echo "$a $hello"; ?>
print('locals():\n')
a = 'hello'
e = 'wow'
locals()[a] = 'world'
print(a, locals()[a])
print(a, locals()[locals()['a'][1]])
print(a, hello)
print('\n\nglobals():\n')
a = 'hello'
e = 'wow'
globals()[a] = 'world'
print(a, globals()[a])
print(a, globals()[globals()['a'][1]])
print(a, hello)
Output:
locals():
hello world
hello wow
hello world
globals():
hello world
hello wow
hello world
Bonus (creating variables from strings)
# Python 2.7.16 (default, Jul 13 2019, 16:01:51)
# [GCC 8.3.0] on linux2
Creating variables and unpacking tuple:
g = globals()
listB = []
for i in range(10):
    g["num%s" % i] = i ** 10
    listB.append("num{0}".format(i))
def printNum():
    print "Printing num0 to num9:"
    for i in range(10):
        print "num%s = " % i,
        print g["num%s" % i]
printNum()
listA = []
for i in range(10):
    listA.append(i)
listA = tuple(listA)
print listA, '"Tuple to unpack"'
listB = str(str(listB).strip("[]").replace("'", "") + " = listA")
print listB
exec listB
printNum()
Output:
Printing num0 to num9:
num0 = 0
num1 = 1
num2 = 1024
num3 = 59049
num4 = 1048576
num5 = 9765625
num6 = 60466176
num7 = 282475249
num8 = 1073741824
num9 = 3486784401
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) "Tuple to unpack"
num0, num1, num2, num3, num4, num5, num6, num7, num8, num9 = listA
Printing num0 to num9:
num0 = 0
num1 = 1
num2 = 2
num3 = 3
num4 = 4
num5 = 5
num6 = 6
num7 = 7
num8 = 8
num9 = 9

I'm answering the question How to get the value of a variable given its name in a string?
which is closed as a duplicate with a link to this question. (Editor's note: It is now closed as a duplicate of How can I select a variable by (string) name?)
If the variables in question are part of an object (part of a class for example) then some useful functions to achieve exactly that are hasattr, getattr, and setattr.
So for example you can have:
class Variables(object):
    def __init__(self):
        self.foo = "initial_variable"
    def create_new_var(self, name, value):
        setattr(self, name, value)
    def get_var(self, name):
        if hasattr(self, name):
            return getattr(self, name)
        else:
            raise AttributeError("Class does not have a variable named: " + name)
Then you can do:
>>> v = Variables()
>>> v.get_var("foo")
'initial_variable'
>>> v.create_new_var(v.foo, "is actually not initial")
>>> v.initial_variable
'is actually not initial'

I have tried both in Python 3.7.3; at module level you can use either globals() or vars():
>>> food #Error
>>> milkshake #Error
>>> food="bread"
>>> drink="milkshake"
>>> globals()[food] = "strawberry flavor"
>>> vars()[drink] = "chocolate flavor"
>>> bread
'strawberry flavor'
>>> milkshake
'chocolate flavor'
>>> globals()[drink]
'chocolate flavor'
>>> vars()[food]
'strawberry flavor'
Reference:
https://www.daniweb.com/programming/software-development/threads/111526/setting-a-string-as-a-variable-name#post548936

The consensus is to use a dictionary for this - see the other answers. This is a good idea for most cases, however, there are many aspects arising from this:
you'll yourself be responsible for this dictionary, including garbage collection (of in-dict variables) etc.
there's either no locality or globality for variable variables, it depends on the globality of the dictionary
if you want to rename a variable name, you'll have to do it manually
however, you are much more flexible, e.g.
you can decide to overwrite existing variables or ...
... choose to implement const variables
to raise an exception on overwriting for different types
etc.
That said, I've implemented a variable variables manager-class which provides some of the above ideas. It works for python 2 and 3.
You'd use the class like this:
from variableVariablesManager import VariableVariablesManager
myVars = VariableVariablesManager()
myVars['test'] = 25
print(myVars['test'])
# define a const variable
myVars.defineConstVariable('myconst', 13)
try:
    myVars['myconst'] = 14 # <- this raises an error, since 'myconst' must not be changed
    print("not allowed")
except AttributeError as e:
    pass
# rename a variable
myVars.renameVariable('myconst', 'myconstOther')
# preserve locality
def testLocalVar():
    myVars = VariableVariablesManager()
    myVars['test'] = 13
    print("inside function myVars['test']:", myVars['test'])
testLocalVar()
print("outside function myVars['test']:", myVars['test'])
# define a global variable
myVars.defineGlobalVariable('globalVar', 12)
def testGlobalVar():
    myVars = VariableVariablesManager()
    print("inside function myVars['globalVar']:", myVars['globalVar'])
    myVars['globalVar'] = 13
    print("inside function myVars['globalVar'] (having been changed):", myVars['globalVar'])
testGlobalVar()
print("outside function myVars['globalVar']:", myVars['globalVar'])
If you wish to allow overwriting of variables with the same type only:
myVars = VariableVariablesManager(enforceSameTypeOnOverride = True)
myVars['test'] = 25
myVars['test'] = "Cat" # <- raises Exception (different type on overwriting)
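The manager class itself is not shown in this answer. As a rough illustration of the idea only, here is a minimal sketch of a dictionary wrapper with constant and renamable entries. ConstAwareVariables and its method names are hypothetical and are not the author's VariableVariablesManager API.

```python
class ConstAwareVariables:
    """Hypothetical stand-in: a dict wrapper with write-protected entries."""

    def __init__(self):
        self._values = {}
        self._consts = set()

    def define_const(self, name, value):
        # Store the value and remember that it must not be reassigned.
        self._values[name] = value
        self._consts.add(name)

    def __setitem__(self, name, value):
        if name in self._consts:
            raise AttributeError(f"{name!r} is constant")
        self._values[name] = value

    def __getitem__(self, name):
        return self._values[name]

    def rename(self, old, new):
        # Move the value, carrying over its constant status.
        self._values[new] = self._values.pop(old)
        if old in self._consts:
            self._consts.remove(old)
            self._consts.add(new)

store = ConstAwareVariables()
store["test"] = 25
store.define_const("myconst", 13)
try:
    store["myconst"] = 14
    blocked = False
except AttributeError:
    blocked = True
store.rename("myconst", "myconstOther")
```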

Any set of variables can also be wrapped up in a class.
"Variable" variables may be added to the class instance during runtime by directly accessing the built-in dictionary through __dict__ attribute.
The following code defines Variables class, which adds variables (in this case attributes) to its instance during the construction. Variable names are taken from a specified list (which, for example, could have been generated by program code):
# some list of variable names
L = ['a', 'b', 'c']
class Variables:
    def __init__(self, L):
        for item in L:
            self.__dict__[item] = 100
v = Variables(L)
print(v.a, v.b, v.c)
#will produce 100 100 100

This is extremely risky, but you can use exec():
a = 'b=5'
exec(a)
c = b*2
print (c)
Result:
10

The setattr() method sets the value of the specified attribute of the specified object.
Syntax goes like this –
setattr(object, name, value)
Example –
setattr(self, 'id', 123)
which is equivalent to self.id = 123
As you might have observed, setattr() expects an object to be passed along with the attribute name (a string) and the value in order to create or modify an attribute.
We can use setattr() with a workaround to be able to use it within modules. Here's how:
import sys
x = "pikachu"
value = 46
thismodule = sys.modules[__name__]
setattr(thismodule, x, value)
print(pikachu)

New column with word at nth position of string from other column pandas

import numpy as np
import pandas as pd
d = {'ABSTRACT_ID': [14145090,1900667, 8157202,6784974],
'TEXT': [
"velvet antlers vas are commonly used in tradit",
"we have taken a basic biologic RPA to elucidat4",
"ceftobiprole bpr is an investigational cephalo",
"lipoperoxidationderived aldehydes for example",],
'LOCATION': [1, 4, 2, 1]}
df = pd.DataFrame(data=d)
df
def word_at_pos(x,y):
    pos=x
    string= y
    count = 0
    res = ""
    for word in string:
        if word == ' ':
            count = count + 1
            if count == pos:
                break
            res = ""
        else :
            res = res + word
    print(res)
word_at_pos(df.iloc[0,2],df.iloc[0,1])
For this df I want to create a new column WORD that contains the word from TEXT at the position indicated by LOCATION. e.g. first line would be "velvet".
I can do this for a single row with the standalone function word_at_pos(x,y), but can't work out how to apply it to the whole column. I have created new columns with lambda functions before, but can't work out how to fit this function into a lambda.

Looping over TEXT and LOCATION could be the best idea because splitting creates a jagged array, so filtering using numpy advanced indexing won't be possible.
df["WORDS"] = [txt.split()[loc] for txt, loc in zip(df["TEXT"], df["LOCATION"]-1)]
print(df)
ABSTRACT_ID ... WORDS
0 14145090 ... velvet
1 1900667 ... a
2 8157202 ... bpr
3 6784974 ... lipoperoxidationderived
[4 rows x 4 columns]

Remove the first or last char so the values from a column should start with numbers

I'm new to Pandas and I'd like to ask your advice.
Let's take this dataframe:
df_test = pd.DataFrame({'Dimensions': ['22.67x23.5', '22x24.6', '45x56', 'x23x56.22','46x23x','34x45'],
'Other': [59, 29, 73, 56,48,22]})
I want to detect the rows that start with "x" (row 4) or end with "x" (row 5) and then remove that "x", so my dataframe should look like this
Dimensions Other
22.67x23.5 59
22x24.6 29
45x56 73
23x56.22 56
46x23 48
34x45 22
I wanted to create a function and apply it to a column
def remove_x(x):
    if (x.str.match('^[a-zA-Z]') == True):
        x = x[1:]
        return x
    if (x.str.match('.*[a-zA-Z]$') == True):
        x = x[:-1]
        return x
If I apply this function to the column
df_test['Dimensions'] = df_test['Dimensions'].apply(remove_x)
I got an error 'str' object has no attribute 'str'
I delete 'str' from the function and re-run all but no success.
What should I do?
Thank you for any suggestions or if there is another way to do it I'm interested in.

Just use str.strip:
df_test['Dimensions'] = df_test['Dimensions'].str.strip('x')
For general patterns, you can try str.replace:
df_test['Dimensions'].str.replace('(^x)|(x$)', '', regex=True)
Output:
Dimensions Other
0 22.67x23.5 59
1 22x24.6 29
2 45x56 73
3 23x56.22 56
4 46x23 48
5 34x45 22
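The two approaches above are not quite equivalent, which a plain-Python sketch can show. The value "xx23x" is made up for illustration, and re.sub stands in for the pandas str.replace call with the same kind of anchored pattern.

```python
import re

values = ["x23x56.22", "46x23x", "xx23x", "34x45"]

# str.strip removes every leading and trailing 'x'.
stripped = [v.strip("x") for v in values]

# The anchored pattern removes at most one 'x' from each end.
trimmed = [re.sub(r"^x|x$", "", v) for v in values]
```

For the sample data in the question both give the same result; they differ only when a value starts or ends with repeated "x" characters.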

@QuangHoang's answer is better (for simplicity and efficiency), but here's what went wrong in your approach. In your apply function, you are using the .str accessor, which exists on a Series, not on a plain string. But when you call df_test['Dimensions'].apply(remove_x), the values passed to remove_x are the elements of df_test['Dimensions'], i.e. the str values themselves. So you should write the function as if x is an incoming str.
Here's how you could implement that (avoiding any regex):
def remove_x(x):
    if x[0] == 'x':
        return x[1:]
    elif x[-1] == 'x':
        return x[:-1]
    else:
        return x
More idiomatically:
def remove_x(x):
    return x.strip('x')
Or even:
df_test['Dimensions'] = df_test['Dimensions'].apply(lambda x : x.strip('x'))
All that said, better to not use apply and follow the built-ins shown by Quang.

Adding multiple dictionaries into a single Dataframe pandas

I have a set of python dictionaries that I have obtained by means of a for loop. I am trying to have these added to Pandas Dataframe.
Output for a variable called output
{'name':'Kevin','age':21}
{'name':'Steve','age':31}
{'name':'Mark','age':11}
I am trying to append each of these dictionaries to a single DataFrame. I tried the code below, but it only added the first row.
df = pd.DataFrame(output)
Could anyone advise where I am going wrong and how to get all the dictionaries added to the DataFrame?
Update on the loop statement
The below code helps to read xml and convert it to a dataframe. Right now I see I am able to loop in through multiple xml files and created dictionaries for each xml file. I am trying to see how could I add each of these dictionaries to a single Dataframe:
def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
    return result
result = {}
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)
    print(result)

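You can append each dictionary to a list and call the DataFrame constructor once at the end. Pass a fresh dict to f for each file; otherwise every list entry refers to the same dictionary:
out = []
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, {})
    out.append(result)
df = pd.DataFrame(out)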
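One thing to watch when collecting dictionaries in a loop is that each list entry should be a distinct dict object. A minimal sketch with made-up records (the variable names are illustrative only):

```python
records = [("Kevin", 21), ("Steve", 31), ("Mark", 11)]

# Reusing one dict: the list ends up holding three references to it.
shared = {}
aliased = []
for name, age in records:
    shared["name"] = name
    shared["age"] = age
    aliased.append(shared)

# Creating a new dict per iteration keeps the rows independent.
distinct = []
for name, age in records:
    distinct.append({"name": name, "age": age})
```

In the aliased list every entry shows the last record, because all three entries are the same object.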

We can add these dicts to a list:
ds = []
for ...: # your loop
ds += [d] # where d is one of the dicts
When we have the list of dicts, we can simply use pd.DataFrame on that list:
ds = [
{'name':'Kevin','age':21},
{'name':'Steve','age':31},
{'name':'Mark','age':11}
]
pd.DataFrame(ds)
Output:
name age
0 Kevin 21
1 Steve 31
2 Mark 11
Update:
And it's not a problem if different dicts have different keys, e.g.:
ds = [
{'name':'Kevin','age':21},
{'name':'Steve','age':31,'location': 'NY'},
{'name':'Mark','age':11,'favorite_food': 'pizza'}
]
pd.DataFrame(ds)
Output:
age favorite_food location name
0 21 NaN NaN Kevin
1 31 NaN NY Steve
2 11 pizza NaN Mark
Update 2:
Building up on our previous discussion in Python - Converting xml to csv using Python pandas we can do:
results = []
for file in glob.glob('*.xml'):
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, {})
    result['filename'] = file # added filename to our results
    results += [result]
pd.DataFrame(results)
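The helper f from the question relies on Element.getchildren(), which was removed in Python 3.9. Iterating over the element directly is the equivalent. The sketch below assumes inline XML strings in place of files, and the names flatten and xml_docs are hypothetical.

```python
import xml.etree.ElementTree as ET

def flatten(elem, result):
    # Record this element's text under its tag, then recurse into children.
    result[elem.tag] = elem.text
    for child in elem:
        flatten(child, result)
    return result

# Made-up documents standing in for the XML files on disk.
xml_docs = {
    "a.xml": "<person><name>Kevin</name><age>21</age></person>",
    "b.xml": "<person><name>Steve</name><age>31</age></person>",
}

rows = []
for filename, text in xml_docs.items():
    row = flatten(ET.fromstring(text), {})  # fresh dict per document
    row["filename"] = filename
    rows.append(row)
```

The resulting list of dicts can then be passed to pd.DataFrame as shown above.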
