I have a list l = ['x', 'y'].
I want to create two empty DataFrames called x and y by looping over the list l.
Something like this:
for v in l:
    v = pd.DataFrame()
You could try something like:
l = []
length = 2
for i in range(length):
    l.append(pd.DataFrame())
Or, if you really want to replace the strings in the initial list:
l = ['x', 'y']
for i in range(len(l)):
    l[i] = pd.DataFrame()
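If the aim is to reach each frame by its name, a dict keyed by the strings is a common alternative to creating variables dynamically. A minimal sketch; the names `names` and `frames` are assumptions for illustration, not part of the question:

```python
import pandas as pd

names = ['x', 'y']

# Map each name to its own empty DataFrame instead of creating variables.
frames = {name: pd.DataFrame() for name in names}

# Each frame is then looked up by its name, e.g. frames['x'].
print(frames['x'].empty)
```

Each key gets a separate DataFrame object, so filling frames['x'] later does not affect frames['y'].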
...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
    header = header.append({'col1': data1[x].split(':')[0],
                            'col2': data1[x].split(':')[1][:-1],
                            'col3': data2[x].split(':')[1][:-1],
                            'col4': data2[x] == data1[x],
                            'col5': '---'},
                           ignore_index=True)
...
I have some Jupyter Notebook code that reads two text files into data1 and data2 and, using a list of indices, picks out specific matching lines from both files into a dataframe for easy display and comparison in the notebook.
Since DataFrame.append has been deprecated in favour of pd.concat, what's the tidiest way to do this?
Is it basically to replace the inner loop code with
...
header = pd.concat(header, {all the column code from above })
...
Additional input in response to the comment below:
Yes, sorry. For example, the next block of code does this:
for x in {4, 2, 5}:
    header = header.append({'col1': 'SOMENEWROWNAME',
                            'col2': data1[x].split(':')[1][:-1],
                            'col3': data2[x].split(':')[1][:-1],
                            'col4': data2[x] == data1[x],
                            'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])},
                           ignore_index=True)
This is repeated 5 times with different data indices in the loop and a different SOMENEWROWNAME each time.
I inherited this notebook, and I see now that it was done this way because they only wanted to compute a numerical float difference for the lines that contain numbers.
There are several such blocks, each with different lines in the data, where the first parameter SOMENEWROWNAME is the text field from the respective line in the data.
I was primarily just trying to fix these append-to-concat warnings, but of course if the code can be written better, then all good!
Use a list comprehension and the DataFrame constructor:
data = [{'col1': data1[x].split(':')[0],
         'col2': data1[x].split(':')[1][:-1],
         'col3': data2[x].split(':')[1][:-1],
         'col4': data2[x] == data1[x],
         'col5': '---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
EDIT:
out = []
# sample
for x in {1, 7, 30}:
    out.append({'col1': 'SOMENEWROWNAME',
                'col2': data1[x].split(':')[1][:-1],
                'col3': data2[x].split(':')[1][:-1],
                'col4': data2[x] == data1[x],
                'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
df1 = pd.DataFrame(out)
out1 = []
# sample
for x in {1, 7, 30}:
    out1.append({another dict})  # placeholder for the next block's dict
df2 = pd.DataFrame(out1)
df = pd.concat([df1, df2])
Or:
final = []
for x in {4, 2, 5}:
    final.append({'col1': 'SOMENEWROWNAME',
                  'col2': data1[x].split(':')[1][:-1],
                  'col3': data2[x].split(':')[1][:-1],
                  'col4': data2[x] == data1[x],
                  'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
for x in {4, 2, 5}:
    final.append({another dict})  # placeholder for the next block's dict
df = pd.DataFrame(final)
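The repeated blocks could also be folded into one helper that builds a row dict, so the DataFrame is constructed once at the end. This is only a sketch under assumptions: the sample lines in data1 and data2, the helper names field and make_row, and the labels 'SIZE' and 'SPEED' are all hypothetical stand-ins for the notebook's real data and row names.

```python
import pandas as pd

# Hypothetical stand-ins for the lines read from the two files;
# every line ends with a newline, which [:-1] strips off.
data1 = ["name:alpha\n", "size:10\n", "speed:2.5\n"]
data2 = ["name:alpha\n", "size:12\n", "speed:2.5\n"]


def field(line):
    """Return the text after the first ':' minus the last character."""
    return line.split(':')[1][:-1]


def make_row(x, label=None, numeric=False):
    """Build one comparison row (a dict) for line x of both files."""
    v1, v2 = field(data1[x]), field(data2[x])
    return {
        'col1': label if label is not None else data1[x].split(':')[0],
        'col2': v1,
        'col3': v2,
        'col4': data2[x] == data1[x],
        # Float difference only for numeric lines, '---' otherwise.
        'col5': float(v2) - float(v1) if numeric else '---',
    }


rows = [make_row(x) for x in [0]]                                # text lines
rows += [make_row(x, label='SIZE', numeric=True) for x in [1]]   # numeric block
rows += [make_row(x, label='SPEED', numeric=True) for x in [2]]  # another block

# One constructor call at the end replaces every DataFrame.append call.
df = pd.DataFrame(rows)
print(df)
```

Lists are used for the indices here instead of sets so that the row order is explicit.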
I have written this code to get a list of the column names in a dataframe that contain 'a', 'b', 'c' or 'd'.
I then want to trim the first 3 characters of the column name for these columns.
However, it's showing an error. Is there something wrong with the code?
ind_cols= [x for x in df if df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]]
df[ind_cols].columns=df[ind_cols].columns.str[3:]
Use list comprehension with if-else:
L = df.columns[df.columns.str.contains('|'.join(['a','b','c','d']))]
df.columns = [x[3:] if x in L else x for x in df.columns]
Another solution uses numpy.where with a boolean mask:
m = df.columns.str.contains('|'.join(['a','b','c','d']))
df.columns = np.where(m, df.columns.str[3:], df.columns)
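To see the renaming end to end, here is a small self-contained sketch. The column names in_apple, in_bread and in_xyz are made up for illustration, and the mask is combined with a list comprehension rather than numpy.where, which is a variation on the two solutions above rather than a replacement for them.

```python
import pandas as pd

# Hypothetical frame: two column names contain one of the letters, one does not.
df = pd.DataFrame([[1, 2, 3]], columns=['in_apple', 'in_bread', 'in_xyz'])

# Boolean mask, True where the name contains 'a', 'b', 'c' or 'd'.
mask = df.columns.str.contains('|'.join(['a', 'b', 'c', 'd']))

# Drop the first 3 characters only where the mask is True.
df.columns = [name[3:] if hit else name for name, hit in zip(df.columns, mask)]

print(list(df.columns))
```

Assigning to df.columns replaces all names at once, which is why the original attempt of assigning to df[ind_cols].columns had no effect on df: it modified the columns of a temporary sub-frame.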
I tried following code to select columns from a dataframe. My dataframe has about 50 values. At the end, I want to create the sum of selected columns, create a new column with these sum values and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: As I don't want to write for the sum
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns;
you can then do df['sum_1'] = df[columns_selected].sum(axis=1)
To filter the df to just the columns of interest, pass a list of the columns: df = df[columns_selected]. Note that it's a common error to pass the column names without the enclosing list, e.g. df = df['a','b','c'], which will raise a KeyError.
Note that you had a typo in your original attempt. The corrected version is:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would have worked: firstly, you needed columns, not column; secondly, you can use the boolean mask against the columns by passing it to loc as the column selection argument (ix, which was also an option at the time, has since been removed from pandas):
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
          a         b         c         d         e
0 -0.778207  0.480142  0.537778 -1.889803 -0.851594
1  2.095032  1.121238  1.076626 -0.476918 -0.282883
2  0.974032  0.595543 -0.628023  0.491030  0.171819
3  0.983545 -0.870126  1.100803  0.139678  0.919193
4 -1.854717 -2.151808  1.124028  0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
          a         b         c
0 -0.778207  0.480142  0.537778
1  2.095032  1.121238  1.076626
2  0.974032  0.595543 -0.628023
3  0.983545 -0.870126  1.100803
4 -1.854717 -2.151808  1.124028
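The full task from the question (sum the selected columns, store the result in a new column, then delete the selected columns) can be sketched as follows. The sample values and the extra column F are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: five columns to sum plus one unrelated column.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6],
                   'D': [7, 8], 'E': [9, 10], 'F': [0, 0]})
columns_selected = ['A', 'B', 'C', 'D', 'E']

# Row-wise sum over the selected columns only.
df['sum_1'] = df[columns_selected].sum(axis=1)

# Remove the columns that went into the sum.
df = df.drop(columns=columns_selected)

print(df)
```

Computing the sum before dropping matters: once the columns are removed they are no longer available to df[columns_selected].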
I have a problem with assigning a Series-like object to a slice of a pandas DataFrame.
Maybe I'm not using the DataFrame the way it is intended, so some enlightenment would be greatly appreciated.
I've already read the following articles:
pandas: slice a MultiIndex by range of secondary index
Returning a view versus a copy
As far as I understand, invoking the slice with a single .loc call ensures that I'm not getting a copy of the data. The original dataframe does indeed get altered, but instead of the expected data I get NaN values.
See the appended code snippet.
Do I have to iterate over the desired section of the dataframe for each single value I want to change and use the .set_value(row_idx,col_idx,val) method?
kind regards and thanks in advance
Markus
In [1]: import pandas as pd
In [2]: mindex = pd.MultiIndex.from_product([['one','two'],['first','second']])
In [3]: dfmi = pd.DataFrame([list('abcd'),list('efgh'),list('ijkl'),list('mnop')],
...: index = mindex, columns=(['X','Y','Z','Q']))
In [4]: print(dfmi)
            X  Y  Z  Q
one first   a  b  c  d
    second  e  f  g  h
two first   i  j  k  l
    second  m  n  o  p
In [5]: dfmi.loc[('two',slice('first','second')),'X']
Out[5]:
two first i
second m
Name: X, dtype: object
In [6]: substitute = pd.Series(data=["ab","cd"], index= mindex.levels[1])
...: print(substitute)
first ab
second cd
dtype: object
In [7]: dfmi.loc[('two',slice('first','second')),'X'] = substitute
In [8]: print(dfmi)
              X  Y  Z  Q
one first     a  b  c  d
    second    e  f  g  h
two first   NaN  j  k  l
    second  NaN  n  o  p
What's happening is that substitute has an index, which determines the location of the values, and dfmi.loc[('two',slice('first','second')),'X'] also specifies such a location.
During the assignment, pandas tries to align both indexes, and since they do not match (they would if substitute also had a MultiIndex), the result of the alignment is all NaN values, which get inserted.
A solution could be to get rid of the index of substitute since the location of where you want to insert the values is already specified in the loc:
dfmi.loc[('two',slice('first','second')),'X'] = substitute.values
or even simpler, insert the values directly:
dfmi.loc[('two',slice('first','second')),'X'] = ["ab","cd"]
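For reference, the first fix can be run end to end as below. It is a sketch that rebuilds the question's dfmi and substitute; Series.to_numpy() is swapped in for .values as the way to drop the index, and the name target is an assumption for readability.

```python
import pandas as pd

mindex = pd.MultiIndex.from_product([['one', 'two'], ['first', 'second']])
dfmi = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
                    index=mindex, columns=['X', 'Y', 'Z', 'Q'])
substitute = pd.Series(data=['ab', 'cd'], index=mindex.levels[1])

# Row selector: both second-level entries under 'two'.
target = ('two', slice('first', 'second'))

# Strip the index so the values are assigned by position, not by label alignment.
dfmi.loc[target, 'X'] = substitute.to_numpy()

print(dfmi)
```

Because the right-hand side is a plain array, its length has to match the number of selected rows, here two.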
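Can you try this (note that it uses chained indexing, so it may assign to a temporary copy rather than to dfmi itself):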
dfmi.loc['two']['X']=substitute
I have a bunch of files that I want to open, read the first line of, parse into several expected pieces of information, and then put the filenames and those data as rows in a dataframe. My question concerns the recommended syntax to build the dataframe in a pandanic/pythonic way (the file-opening and parsing I already have figured out).
For a dumbed-down example, the following seems to be the recommended thing to do when you want to create one new column:
df = pd.DataFrame(files, columns=['filename'])
df['first_letter'] = df.apply(lambda x: x['filename'][:1], axis=1)
but I can't, say, do this:
df['first_letter'], df['second_letter'] = df.apply(lambda x: (x['filename'][:1], x['filename'][1:2]), axis=1)
as the apply function creates only one column with tuples in it.
Keep in mind that, in place of the lambda function I will place a function that will open the file and read and parse the first line.
You can put the two values in a Series, and then it will be returned as a dataframe from the apply (where each series is a row in that dataframe). With a dummy example:
In [29]: df = pd.DataFrame(['Aa', 'Bb', 'Cc'], columns=['filenames'])
In [30]: df
Out[30]:
  filenames
0        Aa
1        Bb
2        Cc
In [31]: df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
Out[31]:
   0  1
0  A  a
1  B  b
2  C  c
This you can then assign to two new columns:
In [33]: df[['first', 'second']] = df['filenames'].apply(lambda x : pd.Series([x[0], x[1]]))
In [34]: df
Out[34]:
  filenames first second
0        Aa     A      a
1        Bb     B      b
2        Cc     C      c
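Since the real function will open each file and parse its first line, another option is to skip apply and build the frame directly from a list of parsed records. A minimal sketch, assuming a hypothetical parse_name function that stands in for the file-reading and parsing step:

```python
import pandas as pd


def parse_name(filename):
    """Hypothetical stand-in for opening the file and parsing its first line."""
    return {
        'filename': filename,
        'first_letter': filename[:1],
        'second_letter': filename[1:2],
    }


files = ['Aa', 'Bb', 'Cc']

# One dict per file becomes one row; the dict keys become the columns.
df = pd.DataFrame([parse_name(f) for f in files])

print(df)
```

This keeps the parsing in ordinary Python and creates all columns in a single constructor call, while the apply-based approach above remains useful when the filenames already live in a DataFrame.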