Unable to create a pandas DataFrame from itertools.product output

I want to create a DataFrame of all possible combinations:
import itertools
import pandas as pd

Primary_function=['Office','Hotel','Hospital(General Medical & Surgical)','Other - Education']
City=['Miami','Houston','Phoenix','Atlanta','Las Vegas','San Francisco','Baltimore','Chicago','Boulder','Minneapolis']
Gross_Floor_Area=[50,100,200]
Years_Built=[1950,1985,2021]
Floors_above_grade=[2,6,15]
Heat=['Electricity - Grid Purchase','Natural Gas','District Steam']
WWR=[30,50,70]
Buildings=[Primary_function,City,Gross_Floor_Area,Years_Built,Floors_above_grade,Heat,WWR]
a = list(itertools.product(*Buildings))
df = pd.DataFrame(a, columns=Buildings)
The error that I am getting is:
ValueError: Length of columns passed for MultiIndex columns is different

Pass a list of strings for the column names, i.e.
columns = ["Primary Function", "City", "Gross Floor Area", "Year Built", "Floors Above Grade", "Heat", "WWR"]
df = pd.DataFrame(a, columns = columns)
As Mr. T suggests, if you do this frequently you will be better off using a dict.
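A minimal sketch of that dict-based approach, reusing the lists defined in the question so the column names stay paired with their value lists:
import itertools
import pandas as pd

buildings = {
    'Primary Function': Primary_function,
    'City': City,
    'Gross Floor Area': Gross_Floor_Area,
    'Year Built': Years_Built,
    'Floors Above Grade': Floors_above_grade,
    'Heat': Heat,
    'WWR': WWR,
}
df = pd.DataFrame(list(itertools.product(*buildings.values())), columns=list(buildings.keys()))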

Related

Pandas splitting a column with new line separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an attribute error:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your DataFrame from above
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: make sure the column is of string type
# second: split the column on the separator '\n'
# third: pass expand=True so the split produces two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# rename the resulting columns
test.columns = ['A','B']
I hope this is helpful.
I reproduced the error on my side. The issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single label.
.str.split() only works on a Series, so taking your code as an example, I would convert the DataFrame to a Series using .iloc[:, 0].
Then another line to split the column headers:
df2 = df[colNew].iloc[:, 0].str.split('\n', n=2, expand=True)
df2.columns = 'A\nB'.split('\n')
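If the merged header differs between documents, a slightly more general sketch of the same idea (assuming exactly one column header contains a newline) takes the new names from the old header:
col = colNew[0]                       # the merged header, e.g. 'A\nB'
new_cols = df[col].str.split('\n', expand=True)
new_cols.columns = col.split('\n')    # ['A', 'B'] in this example
df = df.drop(columns=col).join(new_cols)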

How to build a loop for converting entries of categorical columns to numerical values in Pandas?

I have a Pandas data frame with several columns, some of which contain categorical entries. I am 'manually' converting these entries to numerical values. For example,
df['gender'] = pd.Series(df['gender'].factorize()[0])
df['race'] = pd.Series(df['race'].factorize()[0])
df['city'] = pd.Series(df['city'].factorize()[0])
df['state'] = pd.Series(df['state'].factorize()[0])
If the number of columns is huge, this method is obviously inefficient. Is there a way to do this by constructing a loop over all columns (only those columns with categorical entries)?
Use DataFrame.apply over the columns selected into the variable cols:
cols = df.select_dtypes(['category']).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])
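For example, a minimal sketch with a made-up frame (the column names and values are only illustrative):
import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M', 'M'], 'city': ['Austin', 'Boston', 'Austin']}).astype('category')
cols = df.select_dtypes(['category']).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])
# gender becomes 0, 1, 1 and city becomes 0, 1, 0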
EDIT:
Your solution can be simplified:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize()[0]
I tried the following, which seems to work fine:
for column in df.select_dtypes(['category']):
    df[column] = pd.Series(df[column].factorize()[0])
where 'category' could be 'bool', 'object', etc.
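If you also need to recover the original labels later, a sketch of the same loop that keeps one mapping per column (factorize returns both the codes and the unique values):
mappings = {}
for column in df.select_dtypes(['object', 'category']):
    codes, uniques = df[column].factorize()
    df[column] = codes
    mappings[column] = uniques  # e.g. mappings['gender'][0] gives back the first original label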

Python and Pandas: apply per multiple columns

I am new to Python, and I was able to use apply on a dataframe to create a new column inside the dataframe.
X['Geohash'] = X[['Lat','Long']].apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
This calls the geohash function with the latitude and longitude of each row.
Now I have two new dataframes, one for Latitude and one for Longitude.
Each dataframe has twenty columns, and I want the
.apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
to be called twenty times:
- the first time with the first Latitude column and the first Longitude column,
- the second time with the second Latitude column and the second Longitude column, and so on.
How can I do this iteration per column, calling
.apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
at each iteration?
What I want to get is a new dataframe with twenty columns, each column being the result of the geohash function.
Ideas will be appreciated.
You can do this by creating an "empty" dataframe with 20 columns, and then using df.columns[i] to loop through your other dataframes - something like this:
output = pd.DataFrame({i:[] for i in range(20)})
This creates an empty dataframe with all the columns you wanted (numbered).
Now, let's say the longitude and latitude dataframes are called 'lon' and 'lat'. We need to join them into one dataframe. Then:
lonlat = lat.join(lon)
for i in range(len(output.columns)):
    output[output.columns[i]] = lonlat.apply(lambda column: geohash.encode(column[lat.columns[i]],
                                                                           column[lon.columns[i]],
                                                                           precision=8), axis=1)
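A slightly different sketch that avoids joining the two frames, assuming lat and lon have the same number of columns aligned by position and that geohash is the same package used in the question:
import pandas as pd
import geohash

output = pd.DataFrame(index=lat.index)
for lat_col, lon_col in zip(lat.columns, lon.columns):
    output[lat_col] = [
        geohash.encode(la, lo, precision=8)
        for la, lo in zip(lat[lat_col], lon[lon_col])
    ]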

Merging DataFrames together on a specific column

I tried to run my self-written function in a for loop.
Some remarks in advance:
- ma_strategy is my function and requires three inputs.
- ticker_list is a list of strings.
- result is a pandas DataFrame with 7 columns, and I can access the column 'return_cum' with result['return_cum']. The rows of this column contain floating-point numbers.
My intention is the following:
The for loop should iterate over the items in my ticker_list and should save the 'return_cum' columns in a DataFrame. Then the different 'return_cum' columns should be stored together so that at the end I get a DataFrame with all the 'return_cum' columns of my ticker list.
How can I achieve that goal?
My approach is:
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    x = result['return_cum'].to_frame()
But at this stage I need some help.
If I understood you correctly, this should work:
result_df = pd.DataFrame()
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    result_df[i + '_return_cum'] = result['return_cum']
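An equivalent sketch using pd.concat, assuming each ma_strategy result has a 'return_cum' column; concat aligns the series on their index and names each resulting column after its dictionary key:
import pandas as pd

result_df = pd.concat(
    {ticker + '_return_cum': ma_strategy(ticker, 20, 5)['return_cum']
     for ticker in ticker_list},
    axis=1,
)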

Merge two unequal length data frames by factor matching

I'm new to R, and I have been searching all over for a way to merge two data frames and match them by a factor column. Some of the data contains whitespace. Here is a simple example of what I am trying to do:
df1 = data.frame(id=c(1,2,3,4,5), item=c("apple", " ", "coffee", "orange", "bread"))
df2 = data.frame(item=c("orange", "carrot", "peas", "coffee", "cheese", "apple", "bacon"),count=c(2,5,13,4,11,9,3))
When I use the merge() function to combine df2 into df1, matching by the 'item' column, I end up with an 'item' column of NAs.
ndf = merge(df1, df2, by="item")
How do I resolve this issue? Am I getting this because I have whitespace in my data? Any help would be great. Thanks.