Select column of dataframe with name matching a string in Julia? - dataframe

I have a large DataFrame that I import from a spreadsheet. I have the names of several columns that I care about in an array of strings. How do I select a column of the DataFrame whose name matches the contents of a string? I would have thought that something like this would work:
using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"], C = 2:5)
colsICareAbout = [":B" ":C"]
df[:A] #This works
df[colsICareAbout[1]] #This doesn't work
Is there a way to do this?

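Strings are different from symbols, but they're easy to convert.
colsICareAbout = ["B","C"]
df[Symbol(colsICareAbout[1])]
Mind you, it might be better to make the entries in colsICareAbout symbols to begin with, but I don't know where your data is coming from. (On recent DataFrames.jl versions a row selector is required, e.g. df[!, Symbol(colsICareAbout[1])], and the string itself is also accepted: df[!, colsICareAbout[1]].)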

Related

replacing df.append with pd.concat when building a new dataframe from file read

...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
    header = header.append({'col1':data1[x].split(':')[0],
                            'col2':data1[x].split(':')[1][:-1],
                            'col3':data2[x].split(':')[1][:-1],
                            'col4':data2[x]==data1[x],
                            'col5':'---'},
                           ignore_index=True)
...
I have some Jupyter Notebook code which reads two text files into data1 and data2. Using a list of line indices, I pick out specific matching lines from both files and put them in a dataframe for easy display and comparison in the notebook.
Since df.append is now deprecated in favour of pd.concat, what's the tidiest way to do this?
Is it basically to replace the inner loop code with
...
header = pd.concat(header, {all the column code from above })
...
Additional input in response to the comment below:
Yes, sorry. For example, the next block of code does this:
for x in {4,2,5}:
    header = header.append({'col1':SOMENEWROWNAME,
                            'col2':data1[x].split(':')[1][:-1],
                            'col3':data2[x].split(':')[1][:-1],
                            'col4':data2[x]==data1[x],
                            'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])},
                           ignore_index=True)
This is repeated 5 times with different data indices in the loop and a different SOMENEWROWNAME each time.
I inherited this notebook, and I see now that it was done this way because they only wanted to compute a numerical float difference on the columns that contain numbers.
There are several such blocks, with different lines in the data, where that first parameter SOMENEWROWNAME is a different text field from the respective lines in the data.
So I was primarily just trying to fix these append-to-concat warnings, but of course if the code can be written better, then all good!
Use list comprehension and DataFrame constructor:
data = [{'col1':data1[x].split(':')[0],
'col2':data1[x].split(':')[1][:-1],
'col3':data2[x].split(':')[1][:-1],
'col4':data2[x]==data1[x],
'col5':'---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
EDIT:
out = []
# sample
for x in {1,7,30}:
    out.append({'col1':SOMENEWROWNAME,
                'col2':data1[x].split(':')[1][:-1],
                'col3':data2[x].split(':')[1][:-1],
                'col4':data2[x]==data1[x],
                'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
df1 = pd.DataFrame(out)
out1 = []
# sample
for x in {1,7,30}:
    out1.append({another dict})
df2 = pd.DataFrame(out1)
df = pd.concat([df1, df2])
Or:
final = []
for x in {4,2,5}:
    final.append({'col1':SOMENEWROWNAME,
                  'col2':data1[x].split(':')[1][:-1],
                  'col3':data2[x].split(':')[1][:-1],
                  'col4':data2[x]==data1[x],
                  'col5':float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
for x in {4,2,5}:
    final.append({another dict})
df = pd.DataFrame(final)
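The "collect dicts in a list, build the DataFrame once" idea above can be sketched end to end as follows. This is only an illustration: the contents of data1 and data2 are made-up stand-ins (each line assumed to look like "name:value" plus a trailing newline), and the helper value() is a hypothetical name introduced here, not something from the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the lines read from the two text files.
data1 = ["alpha:1.0\n", "beta:2.5\n", "gamma:text\n"]
data2 = ["alpha:1.5\n", "beta:2.5\n", "gamma:other\n"]


def value(line):
    """Return the text after the first ':' without the trailing newline."""
    return line.split(":")[1][:-1]


rows = []

# Text rows: no numeric difference is computed, col5 is a placeholder.
for x in [2]:
    rows.append({
        "col1": data1[x].split(":")[0],
        "col2": value(data1[x]),
        "col3": value(data2[x]),
        "col4": data2[x] == data1[x],
        "col5": "---",
    })

# Numeric rows: col5 holds the float difference between the two files.
for x in [0, 1]:
    rows.append({
        "col1": data1[x].split(":")[0],
        "col2": value(data1[x]),
        "col3": value(data2[x]),
        "col4": data2[x] == data1[x],
        "col5": float(value(data2[x])) - float(value(data1[x])),
    })

# Build the DataFrame once, instead of growing it row by row.
header = pd.DataFrame(rows)
print(header)
```

The sketch loops over lists rather than sets, because a list keeps the rows in the order the indices are written.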

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# Recreate the example column from the question.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# First: cast the column to str.
# Second: split the column on the separator \n.
# Third: pass expand=True so the split parts become separate columns.
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# Rename the new columns.
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error on my side. I think the issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using iloc[:,0].
Then another line splits the column header to get the new names:
df2=df[colNew].iloc[:,0].str.split('\n', n=2, expand=True) # n must be passed by keyword in pandas 2.x
df2.columns = colNew[0].split('\n')
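Putting the two answers together, a generalized version might look like the sketch below. It assumes every cell in a merged column contains as many newline-separated parts as the column label does; the extra "id" column and the variable names merged_cols, new_names and parts are illustrative only.

```python
import pandas as pd

# Example frame: one ordinary column and one merged column.
df = pd.DataFrame({"id": [10, 20, 30], "A\nB": ["1\n2", "2\n3", "3\n4"]})

# Find every column whose label contains a newline.
merged_cols = [c for c in df.columns if isinstance(c, str) and "\n" in c]

for col in merged_cols:
    # The new column names come from splitting the merged label itself.
    new_names = col.split("\n")
    # df[col] with a single label is a Series, so the .str accessor works.
    parts = df[col].astype(str).str.split("\n", n=len(new_names) - 1, expand=True)
    parts.columns = new_names
    # Replace the merged column with the split columns (appended at the end).
    df = pd.concat([df.drop(columns=col), parts], axis=1)

print(df)
```

Selecting one label at a time is what avoids the AttributeError, since the .str accessor exists on a Series but not on a DataFrame.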

Multiple column selection on a Julia DataFrame

Imagine I have the following DataFrame :
10 rows x 26 columns named A to Z
What I would like to do is to make a multiple subset of the columns by their name (not the index). For instance, assume that I want columns A to D and P to Z in a new DataFrame named df2.
I tried something like this but it doesn't seem to work :
df2=df[:,[:A,:D ; :P,:Z]]
syntax: unexpected semicolon in array expression
top-level scope at Slicing.jl:1
Any idea of the way to do it ?
Thanks for any help
df2 = select(df, Between(:A,:D), Between(:P,:Z))
or
df2 = df[:, Cols(Between(:A,:D), Between(:P,:Z))]
If you are sure your columns only run from :A to :Z, you can also write:
df2 = select(df, Not(Between(:E, :O)))
or
df2 = df[:, Not(Between(:E, :O))]
Finally, you can easily find the index of a column using the columnindex function, e.g.:
columnindex(df, :A)
and then use column numbers, if that is what you prefer.
In Julia you can also build ranges of Chars, so when your columns are named with single letters, yet another option is:
df[:, Symbol.(vcat('A':'D', 'P':'Z'))]

Looping through a dictionary of dataframes and counting a column

I am wondering if anyone can help. I have a number of dataframes stored in a dictionary. I simply want to access each of these dataframes and count the values in a column. The column holds 10 letters; in the first dataframe there are 5 b's and 5 a's, so the output I would expect from the count is a = 5 and b = 5. However, this count would be different for each dataframe, so I would like to store the output of these counts either in another dictionary or in a separate variable.
The dictionary is called Dict and the column name in all the dataframes is called letters. I have tried to do this by accessing the keys in the dictionary but can not get it to work. A section of what I have tried is shown below.
import pandas as pd
for key in Dict:
    Count=pd.value_counts(key['letters'])
Count here would ideally change with each new count output, so that each one is stored in a new variable.
A simplified example (the actual dataframes are at most 5000 x 63) of one of the 14 dataframes in the dictionary would be
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
df = pd.DataFrame(data=d)
The other dataframes are named df2, df3, df4, etc.
I hope that makes sense. Any help would be much appreciated.
Thanks
If you want to access both keys and values when iterating over a dictionary, use the items method.
You could use another dictionary to store the results:
letter_counts = {}
for key, value in Dict.items():
    letter_counts[key] = value["letters"].value_counts()
You could also use a dictionary comprehension to do this in one line:
letter_counts = {key: value["letters"].value_counts() for key, value in Dict.items()}
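The easiest thing is probably a dictionary comprehension. Note that count() below returns the number of non-null entries in the column, not a per-letter tally; use value_counts() for the latter: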
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
d2 = {'col1': [1, 2,3,4,5,6,7,8,9,10,11], 'letters': ['a','a','a','b','b','a','b','a','b','b','a']}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(d2)
df_dict = {'d': df, 'd2': df2}
new_dict = {k: v['letters'].count() for k,v in df_dict.items()}
# out
{'d': 10, 'd2': 11}
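For the per-letter tally the question asks for, a small self-contained sketch could look like this. The dictionary name frames and its keys are made up for illustration (the question's dictionary is called Dict), and to_dict() is used only to turn each result into a plain Python dict.

```python
import pandas as pd

# Two example frames; the first matches the 5 a's / 5 b's from the question.
df1 = pd.DataFrame({"letters": list("aaabbababb")})
df2 = pd.DataFrame({"letters": list("aaabbababba")})
frames = {"df1": df1, "df2": df2}

# One entry per dataframe, each mapping letter -> number of occurrences.
letter_counts = {
    name: frame["letters"].value_counts().to_dict()
    for name, frame in frames.items()
}
print(letter_counts)
```

Keeping the results in one dictionary keyed by the dataframe name avoids having to create a separate variable for each count.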

Is there a faster way through list comprehension to iterate through two dataframes?

I have two dataframes, one contains screen names/display names and another contains individuals, and I am trying to create a third dataframe that contains all the data from each dataframe in a new row for each time a last name appears in the screen name/display name. Functionally this will create a list of possible matching names. My current code, which works perfectly but very slowly, looks like this:
# Original Social Media Screen Names
# cols = 'userid','screen_name','real_name'
usernames = pd.read_csv('social_media_accounts.csv')
# List Of Individuals To Match To Accounts
# cols = 'first_name','last_name'
individuals = pd.read_csv('individuals_list.csv')
userid, screen_name, real_name, last_name, first_name = [],[],[],[],[]
for index1, row1 in individuals.iterrows():
    for index2, row2 in usernames.iterrows():
        if (row2['Screen_Name'].lower().find(row1['Last_Name'].lower()) != -1) | (row2['Real_Name'].lower().find(row1['Last_Name'].lower()) != -1):
            userid.append(row2['UserID'])
            screen_name.append(row2['Screen_Name'])
            real_name.append(row2['Real_Name'])
            last_name.append(row1['Last_Name'])
            first_name.append(row1['First_Name'])
cols = ['UserID', 'Screen_Name', 'Real_Name', 'Last_Name', 'First_Name']
index = range(0, len(userid))
match_list = pd.DataFrame(index=index, columns=cols)
match_list = match_list.fillna('')
match_list['UserID'] = userid
match_list['Screen_Name'] = screen_name
match_list['Real_Name'] = real_name
match_list['Last_Name'] = last_name
match_list['First_Name'] = first_name
Because I need the whole row from each column, the list comprehension methods I have tried do not seem to work.
What you want is to iterate through a dataframe faster. A list comprehension takes the data out of the pandas dataframe, handles it with plain Python operations, and then puts it back into a pandas dataframe. The fastest way (currently, with small data) would be to handle it using pandas' own methods.
The next thing you want to do is work with two dataframes. pandas has a tool for this called merge (a join). Because the two dataframes share no key column, passing on=['Screen_Name', 'Last_Name'] would raise a KeyError, so use a cross merge (available since pandas 1.2), which pairs every row of usernames with every row of individuals:
result = pd.merge(usernames, individuals, how='cross')
After the merge you can do your filtering.
Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
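As a rough sketch of the "merge, then filter" idea: since the two frames have no column in common, the example below assumes a cross join (how="cross", pandas 1.2 or later) followed by a substring test on each pair. The sample rows are invented, and the column names follow the ones used in the question's loop.

```python
import pandas as pd

# Invented sample data with the column names used in the question's loop.
usernames = pd.DataFrame({
    "UserID": [1, 2, 3],
    "Screen_Name": ["jsmith99", "coolcat", "the_real_jones"],
    "Real_Name": ["John Smith", "Ann Lee", "Tom J."],
})
individuals = pd.DataFrame({
    "First_Name": ["John", "Tom", "Ann"],
    "Last_Name": ["Smith", "Jones", "Lee"],
})

# Pair every account with every individual (len(usernames) * len(individuals) rows).
pairs = usernames.merge(individuals, how="cross")

# Keep a pair when the last name occurs in the screen name or the real name.
mask = [
    last in screen or last in real
    for last, screen, real in zip(
        pairs["Last_Name"].str.lower(),
        pairs["Screen_Name"].str.lower(),
        pairs["Real_Name"].str.lower(),
    )
]
match_list = pairs[mask].reset_index(drop=True)
print(match_list)
```

The cross join holds every pair in memory at once, so for very large inputs it may be worth processing individuals in chunks.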