Split Get from Python dictionary - pandas

I have a dictionary from which I can extract values with the get method, but I need to subset those values. For example:
dict_of_measures = {k: v for k, v in measures.groupby('Measure')}
And I am using get
BCS=dict_of_measures.get('BCS')
I have several measures and wanted to know if I could use a for loop to extract from the dictionary and subset into a separate dataframe per measure using the get method. Is this possible?
for measure_name in dict_of_measures:
    get measure_name()

You can use a dict comprehension:
result = []
keys_to_extract = ['key1', 'key2']
new_dict = {k: bigdict[k] for k in keys_to_extract}
result.append(new_dict)  # add dictionary to list; this can then be converted into a pandas DataFrame
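If the goal is one DataFrame per measure, you can also loop over the groupby dictionary directly instead of calling get once per key. A minimal sketch (the measure names and values below are invented):

```python
import pandas as pd

# Toy stand-in for `measures`; the measure names are invented
measures = pd.DataFrame({
    'Measure': ['BCS', 'BCS', 'HbA1c'],
    'Value': [1, 2, 3],
})

# One DataFrame per measure, keyed by the measure name
dict_of_measures = {k: v for k, v in measures.groupby('Measure')}

# Loop over the dictionary instead of calling .get() once per name
subsets = {}
for name, frame in dict_of_measures.items():
    subsets[name] = frame.reset_index(drop=True)

print(list(subsets))
```

Each value in `subsets` is the per-measure DataFrame you would otherwise have pulled out with `dict_of_measures.get(name)`.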

Selecting two sets of columns from a dataFrame with all rows

I have a dataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns from 0-12 and 16-27. Meaning that I don't want to select columns 12-15.
I wrote the following code, but it doesn't work and throws a syntax error at : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
Which seems to return an incorrect result, because I have already treated the NaN values of df, but when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices (note that each slice needs an explicit endpoint, so 16: becomes 16:28 for a 28-column frame):
X = df.iloc[:, np.r_[0:13, 16:28]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're writing in X = df.iloc[:,[0:12,16:]] is a list literal containing slices, but slice syntax like 0:12 is only valid directly inside subscript brackets, not inside a list literal, hence the SyntaxError. You need those positions as a single list of integers, and a convenient way to build it is numpy.r_, which translates slice notation into one concatenated index array. (As an aside, X = X + df.iloc[:,16:] does not concatenate: + performs element-wise addition aligned on row and column labels, which is where your unexpected NaNs come from.)
X = df.iloc[:, np.r_[0:13, 16:28]]
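To see what np.r_ actually builds, here is a small sketch using the 28-column shape from the question (the data itself is invented):

```python
import numpy as np
import pandas as pd

# np.r_ translates slice notation into one concatenated integer array
idx = np.r_[0:13, 16:28]
print(idx)  # positions 0..12 followed by 16..27

# Selecting those positions from a 28-column frame
df = pd.DataFrame(np.arange(2 * 28).reshape(2, 28))
X = df.iloc[:, idx]
print(X.shape)  # 25 of the 28 columns are kept
```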

Got TypeError: string indices must be integers with .apply [duplicate]

I have a dataframe, one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL, and creates an HTML link.
The column newsSource has the Link name, and url has the URL. For each row in the dataframe, I want to create a column that has:
<a href="[url]">[newsSource name]</a>
Trying the below throws the error
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in <module>
df['sourceURL'] = df['url'].apply(lambda x: '{1}'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers
df['sourceURL'] = df['url'].apply(lambda x: '{1}'.format(x, x['source']))
But I've used x[colName] before? The below line works fine, it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, will only have access to df['source'].
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '{1}'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['{1}'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '{1}'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)
  source                  url sourceURL
0    BBC  http://www.bbc.o.uk       BBC
With zip and old-school %-style string formatting:
df['sourceURL'] = ['%s' % y for x, y in zip(df['url'], df['source'])]
Or the same with an f-string:
df['sourceURL'] = [f'{y}' for x, y in zip(df['url'], df['source'])]
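Since the question's actual goal was an HTML link column, the same zip pattern can build one. The anchor-tag markup below is my assumption about the desired output, not something stated verbatim in the question:

```python
import pandas as pd

df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

# Assumed target markup: <a href="URL">source name</a>
df['sourceURL'] = [f'<a href="{url}">{src}</a>'
                   for url, src in zip(df['url'], df['source'])]
print(df['sourceURL'][0])
```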

Use same category labeling criteria on two different dataframes

I have a dataFrame that contains a categorical feature which I have encoded in the following way:
df['categorical_feature'] = df['categorical_feature'].astype('category')
df['labels'] = df['categorical_feature'].cat.codes
If I apply the same code as above on another dataFrame with the same category field, the mapping is shuffled, but I need it to be consistent with the first dataFrame.
Is there a way to successfully apply the same mapping category:label to another dataFrame that has the same categorical values?
I think you are looking for pd.Series.map(), which maps values from category to label using a dictionary of category: label pairs.
Create the mapping dictionary from the first DataFrame. You can do this with zip (there are other ways as well):
col = 'categorical_feature'
mapping_dict = dict(zip(df[col], df[col].cat.codes))
Now map that category: label mapping onto the other DataFrame (df2 here):
df2['labels'] = df2[col].map(mapping_dict)
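A minimal sketch of the whole round trip, with invented category values, showing the codes staying consistent across two frames:

```python
import pandas as pd

# Two frames with the same categories appearing in different orders
df1 = pd.DataFrame({'categorical_feature': ['a', 'b', 'c', 'a']})
df2 = pd.DataFrame({'categorical_feature': ['c', 'a', 'b']})

df1['categorical_feature'] = df1['categorical_feature'].astype('category')
df1['labels'] = df1['categorical_feature'].cat.codes

# Build the category -> label mapping from df1 ...
mapping = dict(zip(df1['categorical_feature'],
                   df1['categorical_feature'].cat.codes))

# ... and reuse it on df2 so the codes stay consistent
df2['labels'] = df2['categorical_feature'].map(mapping)
print(df2)
```

An alternative is pd.Categorical(df2[col], categories=df1[col].cat.categories).codes, which pins the category order explicitly instead of going through a dictionary.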

Get True/False boolean list of row in pandas dataframe out of condition

I am working with several Pandas DataFrames and I need the following filtering:
Suppose I get a list like
L=['EP6','EP3','EP2']
I need to get the following boolean vector for a row:
for row 'concept 1': True where the column index is in L, False where it is not.
I am trying:
# D being the DataFrame
L=['EP6', 'EP3','EP2']
[True for ind in D.columns if ind in L ]
But I only get [True, True, True].
I need the complete list, like:
desire_result = [0,0,0,0,1,0,0,1,1,0]
Note: the 1s in the desired result have nothing to do with the 1s the DataFrame is populated with.
Thanks
Pandas has isin for this:
D.columns.isin(L)
Your list comprehension acts as a filter: it yields True when ind is in L, and otherwise yields nothing at all, which is why you only get three Trues.
What you want is a mapping. You can still use a list comprehension, but the condition belongs in the yielded expression:
[ind in L for ind in D.columns]
or if you want integers:
[int(ind in L) for ind in D.columns]
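A quick sketch (the column names beyond those in the question are invented) showing that both forms agree:

```python
import pandas as pd

L = ['EP6', 'EP3', 'EP2']
D = pd.DataFrame(columns=['EP1', 'EP2', 'EP3', 'EP4', 'EP5', 'EP6'])

mask = D.columns.isin(L)                    # numpy boolean array
as_list = [ind in L for ind in D.columns]   # list of Python bools
as_ints = [int(ind in L) for ind in D.columns]

print(list(mask))
print(as_ints)
```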

Adding Columns in loop pandas

I have 2 dataframes, each with pairs of columns (named the same in both df's), and I want to add each pair together to make a third column.
df1['C']=df1[['A','B']].sum(axis=1)
df1['D']=df1[['E','G']].sum(axis=1)
df2['C']=df2[['A','B']].sum(axis=1)
df2['D']=df2[['E','G']].sum(axis=1)
However, in reality it's more complicated than this. So can I put these in a dictionary and loop?
I'm still figuring out how to structure dictionaries for this type of problem, so any advice would be great.
Here's what I'm trying to do:
all_dfs = [df1, df2]
for df in all_dfs:
    dict = {Out=['C'], in=['A','B]
            Out2=['D'], in2=['E','G]
            }
    for i in dict:
        df[i] = df[['i[1....
I'm a bit lost in how to build this last bit
First, change the dictionary name, because dict is a Python builtin. Then structure it so each key is an output column and each value is the list of input columns, and loop with the items() method:
d = {'C': ['A', 'B'], 'D': ['E', 'G']}
for k, v in d.items():
    # check each key and value of the dict
    print(k)
    print(v)
    df[k] = df[v].sum(axis=1)
EDIT:
Here is a simpler version working with a dictionary of DataFrames: use sum and finally create another dictionary of DataFrames:
all_dfs = {'first': df1, 'second': df2}
out = {}
for name, df in all_dfs.items():
    d = {'C': ['A', 'B'], 'D': ['E', 'G']}
    for k, v in d.items():
        df[k] = df[v].sum(axis=1)
    # fill the empty dict by name
    out[name] = df

print(out['first'])
print(out['second'])
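For completeness, a self-contained run of the pattern above with small invented frames:

```python
import pandas as pd

# Small invented inputs with the column names from the question
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'E': [5, 6], 'G': [7, 8]})
df2 = pd.DataFrame({'A': [9, 9], 'B': [1, 1], 'E': [2, 2], 'G': [3, 3]})

# Output column -> list of input columns
d = {'C': ['A', 'B'], 'D': ['E', 'G']}

all_dfs = {'first': df1, 'second': df2}
out = {}
for name, df in all_dfs.items():
    for k, v in d.items():
        df[k] = df[v].sum(axis=1)   # row-wise sum of the input columns
    out[name] = df

print(out['first'][['C', 'D']])
```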