Importing an excel column of lists with pandas - pandas

I have an excel file with a column that contains lists (see image). What is the correct way to use pandas.read_excel() in order to import the column?
I ultimately need to be able to create a dataframe with a column for each value of the list.
This is what I thought I would need to do but it is not correct.
import pandas as pd
# read in the file
df = pd.read_excel('fruit.xlsx')
df
# this isn't working...
# create new dataframe with column for each value of the "Fruit" column in df
fruits = df['Fruit'].apply(pd.Series)
fruits
The following does work though when I create the initial dataframe from a dictionary rather than reading in an excel file.
What am I doing wrong with read_excel()?
How do I specify that the excel column is lists?
# dictionary with the same data as the excel file
raw_data = {'ID': [1, 2, 3, 4],
            'Fruit': [['Apple', 'Banana', 'Pear'],
                      ['Pineapple'],
                      '',
                      ['Apple', 'Orange']]}
# dataframe from the dictionary
df = pd.DataFrame(raw_data, columns=['ID', 'Fruit'])
df
# new dataframe with a column for each value of the lists
fruits = df['Fruit'].apply(pd.Series)
fruits
Thanks.
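A possible explanation, as a minimal sketch: Excel has no list cell type, so a cell that looks like ['Apple', 'Banana', 'Pear'] is read back as a plain string, and apply(pd.Series) has nothing to split. Assuming the cells hold Python-style list literals (an assumption, since the file itself is not shown), ast.literal_eval can turn them back into lists. The small frame below is only a stand-in for the result of pd.read_excel('fruit.xlsx'), and parse_list is a hypothetical helper name.

```python
import ast

import pandas as pd

# Stand-in for pd.read_excel('fruit.xlsx'): list cells arrive as strings,
# and the empty cell arrives as NaN.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Fruit": ["['Apple', 'Banana', 'Pear']",
              "['Pineapple']",
              float("nan"),
              "['Apple', 'Orange']"],
})


def parse_list(cell):
    """Hypothetical helper: turn a list-literal string into a real list."""
    if isinstance(cell, str) and cell.strip():
        return ast.literal_eval(cell)
    return []  # empty or missing cell


parsed = df["Fruit"].apply(parse_list)

# One column per list position; shorter lists are padded with missing values.
fruits = pd.DataFrame(parsed.tolist(), index=df.index)
print(fruits)
```

If the cells use a different layout, for example comma-separated text without brackets, df["Fruit"].str.split(",", expand=True) may be a better fit.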

Related

Convert multiple downloaded time series share to pandas dataframe

I downloaded data for several shares over the last 10 days using the nsepy library, but could not store it in a pandas dataframe.
The code below downloads the data for multiple shares:
import datetime
from datetime import date
from nsepy import get_history
import pandas as pd
symbol = ['SBIN', 'GAIL', 'NATIONALUM']
data = {}
for s in symbol:
    data[s] = get_history(s, start=date(2022, 11, 29), end=date(2022, 12, 12))
The code below tries to convert the data to a pandas dataframe, but I get an error:
new = pd.DataFrame(data, index=[0])
new
error message:
ValueError: Shape of passed values is (14, 3), indices imply (1, 3)
The documentation of get_history says:
Returns:
pandas.DataFrame : A pandas dataframe object
Thus, data is a dict with the symbols as keys and pd.DataFrames as values. You are then trying to insert DataFrames inside another DataFrame, which does not work. If you want to create a new MultiIndex DataFrame from the 3 existing DataFrames, you can do something like this:
result = {}
for symbol, df in data.items():
    # use a separate name so the original dict `data` is not overwritten
    columns = df.to_dict()
    for key, value in columns.items():
        result[(symbol, key)] = value
df_multi = pd.DataFrame(result)
df_multi.columns
Result (showing just two columns per symbol to clarify the MultiIndex structure):
MultiIndex([(      'SBIN', 'Symbol'),
            (      'SBIN', 'Series'),
            (      'GAIL', 'Symbol'),
            (      'GAIL', 'Series'),
            ('NATIONALUM', 'Symbol'),
            ('NATIONALUM', 'Series')])
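As a shorter alternative sketch, pd.concat accepts a dict of DataFrames directly and uses the dict keys as the outer level of the resulting MultiIndex. The two small frames below are stand-ins with made-up placeholder values, not real nsepy output; in the question they would come from get_history.

```python
import pandas as pd

# Stand-in frames with placeholder values (not real market data).
idx = pd.to_datetime(["2022-11-29", "2022-11-30"])
data = {
    "SBIN": pd.DataFrame({"Symbol": ["SBIN", "SBIN"], "Close": [1.0, 2.0]}, index=idx),
    "GAIL": pd.DataFrame({"Symbol": ["GAIL", "GAIL"], "Close": [3.0, 4.0]}, index=idx),
}

# With axis=1 the dict keys become the outer level of the column MultiIndex.
df_multi = pd.concat(data, axis=1)
print(df_multi.columns.tolist())
```

Leaving out axis=1 would stack the frames vertically instead, with the symbol as the outer level of the row index.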
Edit
So if you just want a single-index DataFrame, like in your attached file with the symbols in a column, you can simply do this:
new_df = pd.DataFrame()
for symbol in data:
    # sequentially concat the DataFrames from your dict of DataFrames
    new_df = pd.concat([data[symbol], new_df], axis=0)
new_df
Then the output looks like the one in your file.

Add new columns to excel file from multiple datasets with Pandas in Google Colab

I'm trying to add columns to an Excel file next to existing data, but I keep overwriting what is already there. Some context: I'm reading a CSV and, for each column, using a for loop to call value_counts and build a frame from the result. Here is the code for just one column:
import pandas as pd
data= pd.read_csv('responses.csv')
datatoexcel = data['Music'].value_counts().to_frame()
datatoexcel.to_excel('savedataframetocolumns.xlsx') #Name of the file
This works like this ...
And with that code for only one column I have the format that I actually need for excel.
The problem comes when I try to do this for all the columns with a for loop and then "append" each following dataframe to the Excel file using this code:
for columnName in df:
    datasetstoexcel = df.value_counts(columnName).to_frame()
    print(datasetstoexcel)
    # Here is my problem with the following line the .to_excel
    x.to_excel('quickgraph.xlsx') #I tried more code lines but I'll leave this one as base
The result that I want to reach is this one:
I'm really close to finishing this code; some help here, please!
How about this?
Sample data
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, 6, 7, 8],
    "col3": [9, 9, 11, 12],
    "col4": [13, 14, 15, 16],
})
Find value counts and add to a list
li = []
# iterate over the columns (df.shape[1]), not over the rows (len(df))
for i in range(df.shape[1]):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    li.append(value_counts)
Concatenate all the dataframes inside li and write to Excel
pd.concat(li, axis=1).to_excel("result.xlsx")
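To preview the layout before writing the file, here is a minimal sketch that builds the same side-by-side frame in memory. The names frames and combined are just illustrative; the data is the sample data from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, 6, 7, 8],
    "col3": [9, 9, 11, 12],
    "col4": [13, 14, 15, 16],
})

# One two-column frame (value, count) per source column.
frames = [df[name].value_counts().to_frame().reset_index() for name in df.columns]

# Side by side: frames with fewer distinct values are padded with NaN.
combined = pd.concat(frames, axis=1)
print(combined.shape)
print(combined)
```

Calling combined.to_excel("result.xlsx", index=False) would then write it without the extra row-index column, provided an Excel writer engine such as openpyxl is installed.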
Sample output:

Get names of dummy variables created by get_dummies

I have a dataframe with a very large number of columns of different types. I want to encode the categorical variables in my dataframe using get_dummies(). The question is: is there a way to get the column headers of the encoded categorical columns created by get_dummies()?
The hard way to do this would be to extract a list of all categorical variables in the dataframe, then append the different text labels associated to each categorical variable to the corresponding column headers. I wonder if there is an easier way to achieve the same end.
I think the way that should work with all the different uses of get_dummies would be:
#example data
import pandas as pd
df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
'R': [2, 3, 4]})
dummies = pd.get_dummies(df)
#get column names that were not in the original dataframe
new_cols = dummies.columns[~dummies.columns.isin(df.columns)]
new_cols gives:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
In this example, R is the only column preserved, because get_dummies leaves non-categorical columns unchanged and places them first. So here you could also just take the column names after the first column:
dummies.columns[1:]
which on this test data gives the same result:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
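A quick check of which of the two approaches generalizes, as a sketch: the extra numeric column S is added purely for illustration. With more than one preserved column, positional slicing from 1 also picks up a preserved column, while the isin comparison still returns only the dummy columns.

```python
import pandas as pd

# Same data as above plus a second numeric column S (added for illustration).
df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
                   'R': [2, 3, 4], 'S': [0.5, 1.5, 2.5]})
dummies = pd.get_dummies(df)

# Robust: keep only the names that were not in the original dataframe.
new_cols = dummies.columns[~dummies.columns.isin(df.columns)]
print(list(new_cols))

# Positional slicing now also returns the preserved column S.
print(list(dummies.columns[1:]))
```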

Looping through a dictionary of dataframes and counting a column

I am wondering if anyone can help. I have a number of dataframes stored in a dictionary. I simply want to access each of these dataframes and count the values in one column, which holds 10 letters. In the first dataframe there are 5 b's and 5 a's, so the output I would expect from the count is a = 5 and b = 5. The count will be different for each dataframe, so I would like to store the results either in another dictionary or in separate variables.
The dictionary is called Dict and the column name in all the dataframes is called letters. I have tried to do this by accessing the keys in the dictionary but can not get it to work. A section of what I have tried is shown below.
import pandas as pd
for key in Dict:
    Count=pd.value_counts(key['letters'])
Ideally, Count would change with each new count output so that it can be stored in a new variable.
A simplified example (the actual dataframes are at most 5000 x 63) of one of the 14 dataframes in the dictionary would be:
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
df = pd.DataFrame(data=d)
The other dataframes are named df2, df3, df4, etc.
I hope that makes sense. Any help would be much appreciated.
Thanks
If you want to access both keys and values when iterating over a dictionary, you should use the items() method.
You could use another dictionary to store the results:
letter_counts = {}
for key, value in Dict.items():
    letter_counts[key] = value["letters"].value_counts()
You could also use dictionary comprehension to do this in 1 line:
letter_counts = {key: value["letters"].value_counts() for key, value in Dict.items()}
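Here is a small runnable sketch of the same idea on the sample data from the question. The keys 'df1' and 'df2' and the second, shorter frame are made up for illustration; to_dict() is only added so the result is a plain nested dictionary.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'letters': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b']}
d2 = {'col1': [1, 2, 3],
      'letters': ['a', 'b', 'b']}

# Hypothetical keys; the question's dictionary is called Dict.
Dict = {'df1': pd.DataFrame(d), 'df2': pd.DataFrame(d2)}

# value_counts() returns a Series; to_dict() turns it into {letter: count}.
letter_counts = {key: frame['letters'].value_counts().to_dict()
                 for key, frame in Dict.items()}

print(letter_counts['df1'])
print(letter_counts['df2'])
```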
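If you only need the number of entries in each dataframe rather than a per-letter breakdown, the easiest thing is probably a dictionary comprehension with count():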
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
d2 = {'col1': [1, 2,3,4,5,6,7,8,9,10,11], 'letters': ['a','a','a','b','b','a','b','a','b','b','a']}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(d2)
df_dict = {'d': df, 'd2': df2}
new_dict = {k: v['letters'].count() for k,v in df_dict.items()}
# out
{'d': 10, 'd2': 11}

iterating over a dictionary of empty pandas dataframes to append them with data from existing dataframe based on list of column names

I'm a biologist and very new to Python (I use v3.5) and pandas. I have a pandas dataframe (df), from which I need to make several dataframes (df1 ... dfn) to place in a dictionary (dictA), which currently holds the correct number (n) of empty dataframes. I also have a dictionary (dictB) of n individual lists of column names that were extracted from df. The keys in the 2 dictionaries match. I'm trying to fill the empty dataframes in dictA with parts of df, based on the column names in the lists in dictB.
import pandas as pd
listA=['A', 'B', 'C',...]
dictA={i:pd.DataFrame() for i in listA}
Let's say I have something like this:
dictA={'A': df1, 'B': df2}
dictB={'A': ['A1', 'A2', 'A3'],
       'B': ['B1', 'B2']}
df=pd.DataFrame({'A1': [0,2,4,5],
                 'A2': [2,5,6,7],
                 'A3': [5,6,7,8],
                 'B1': [2,5,6,7],
                 'B2': [1,3,5,6]})
listA=['A', 'B']
What I'm trying to get is for df1 and df2 to be filled with portions of df, so that the output for df1 looks like this:
   A1  A2  A3
0   0   2   5
1   2   5   6
2   4   6   7
3   5   7   8
df2 would have columns B1 and B2.
I tried the following loop and some alterations, but it doesn't yield populated dfs:
for key, values in dictA.items():
    values.append(df[dictB[key]])
Thanks and sorry if this was already addressed elsewhere but I couldn't find it.
You could create the dataframes you want like this instead:
# df is your original dataframe containing all the columns
df_A = df[[col for col in df if 'A' in col]]
df_B = df[[col for col in df if 'B' in col]]
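As for why the loop in the question leaves the dataframes empty: DataFrame.append returned a new dataframe instead of modifying the existing one, so its result was discarded on every iteration (the method was also removed in pandas 2.0). A minimal sketch that uses the column lists in dictB directly, with the data from the question, builds dictA in one step instead of filling empty dataframes:

```python
import pandas as pd

df = pd.DataFrame({'A1': [0, 2, 4, 5],
                   'A2': [2, 5, 6, 7],
                   'A3': [5, 6, 7, 8],
                   'B1': [2, 5, 6, 7],
                   'B2': [1, 3, 5, 6]})
dictB = {'A': ['A1', 'A2', 'A3'],
         'B': ['B1', 'B2']}

# Select each list of columns from df; .copy() keeps each piece independent of df.
dictA = {key: df[cols].copy() for key, cols in dictB.items()}

print(dictA['A'])
print(dictA['B'])
```

This relies on the keys of dictB being the keys wanted in dictA, which matches the description in the question.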