Pandas: turn an entire DataFrame into unique categorical values

I am relatively new to Pandas and to Python, and I am trying to find out how to turn all the content of a Pandas DataFrame (all fields are strings) into categorical values.
All the values from all rows and columns have to be treated as one big data set before being turned into categorical numbers.
So far I have been able to write the following piece of code:
for col_name in X.columns:
    if X[col_name].dtype == 'object':
        X[col_name] = X[col_name].astype('category')
        X[col_name] = X[col_name].cat.codes
This works on a DataFrame X with multiple columns: it takes the strings and turns them into unique numbers.
What I am not sure about is that my for loop works per column, and I don't know whether the codes it assigns are unique per column or across the whole DataFrame (the latter is the desired behaviour).
Can you please advise on how I can change my code so the numbers are unique across all the values of the DataFrame?
I would like to thank you in advance for your help.
Regards
Alex

Use DataFrame.stack to collapse everything into one MultiIndex Series, encode that Series as a single categorical, then Series.unstack to restore the original shape. Because the codes are computed on the stacked Series, they are unique across the whole DataFrame:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
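A minimal sketch of why this gives frame-wide codes (the column names and string values below are made up): stacking collapses everything into one Series, so the category dictionary is built from all values at once and a given string gets the same code wherever it appears.

import pandas as pd

# 'y' appears in both columns on purpose
df = pd.DataFrame({'a': ['x', 'y'], 'b': ['y', 'z']})
cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
print(df)
#    a  b
# 0  0  1
# 1  1  2
# 'y' maps to 1 in both columns because the codes were assigned frame-wide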


Alternative to reset_index().apply() to create new column based off index values

I have a df with a multiindex with 2 levels. One of these levels, age, is used to generate another column, Numeric Age.
Currently, my idea is to reset_index, use apply with age_func which reads row["age"], and then re-set the index, something like...
df = df.reset_index("age")
df["Numeric Age"] = df.apply(age_func, axis=1)
df = df.set_index("age") # ValueError: cannot reindex from a duplicate axis
This strikes me as a bad idea. I'm having a hard time resetting the indices correctly, and I think this is probably a slow way to go.
What is the correct way to make a new column based on the values of one of your indices? Or, if this is the correct way to go, is there a way to re-set the indices such that the df is the exact same as when I started, with the new column added?
We can set a new column using .loc and modify the rows we need using masks. To use the correct column values, we also use a mask.
The first step is to make a mask for the rows to target:
mask_foo = df.index.get_level_values("age") == "foo"
Later we will use .apply(axis=1), so write a function to handle the rows selected by mask_foo:
def calc_foo_numeric_age(row):
    # The logic here isn't important; the point is we have access to the row
    return row["some_other_column"].split(" ")[0]
And now the .loc magic:
df.loc[mask_foo, "Numeric Age"] = df[mask_foo].apply(calc_foo_numeric_age, axis=1)
Repeat process for other target indices.
If your situation allows you to use reset_index().apply(axis=1), I recommend that over this approach; I am only doing it this way because I have other reasons for not wanting to reset_index.
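If all you need is a column derived from one index level, a simpler route is to map over that level directly, which avoids both reset_index and apply(axis=1). A sketch, assuming the 'age' level holds strings like "42 years" (parse_age is a made-up stand-in for age_func's logic):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("alice", "42 years"), ("bob", "7 years")], names=["name", "age"]
)
df = pd.DataFrame({"some_other_column": ["foo bar", "baz qux"]}, index=idx)

def parse_age(age):
    # stand-in for age_func: pull the number out of strings like "42 years"
    return int(age.split(" ")[0])

# Map over the 'age' index level directly; no reset_index, no apply(axis=1)
df["Numeric Age"] = df.index.get_level_values("age").map(parse_age)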

Can I compress a pandas DataFrame into one row?

I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and with the exception of the 'ts' column no column has more than one non-null value.
Is it possible to merge the four rows together to give one row which can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a single record from a weather station. I will have records at 5-minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect that there is a much more succinct way to do it, but I basically manipulated the DataFrame: replaced all NaNs with zero, replaced some strings with ints and added the columns together, as shown in the code below:
with open(fname, 'r') as d:
    ws = json.loads(next(d))
df = pd.json_normalize(ws['sensors'], record_path='data')
df3 = pd.concat([df.iloc[0], df.iloc[1], df.iloc[2], df.iloc[3]], axis=1)
df3.rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}, inplace=True)
df3 = df3.fillna(0)
df3.loc['ts', ['b', 'c', 'd']] = 0
df3.loc[['ip_v4_gateway', 'ip_v4_netmask', 'ip_v4_address'], 'c'] = int(0)
df3['comb'] = df3['a'] + df3['b'] + df3['c'] + df3['d']
df3.drop(columns=['a', 'b', 'c', 'd'], inplace=True)
df3 = df3.T
As has been said by quite a few people, the documentation on this is very patchy, so I hope this may help someone else who is struggling with the same problem!
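For the record, a much more succinct route to the same one-row result is possible (a sketch, assuming, as described above, that apart from the first 'ts' every column has at most one non-null value): back-filling pulls the first non-null value of every column up into row 0, so that row is the merged record.

# df is the frame produced by pd.json_normalize above
flat = df.bfill().iloc[[0]]       # one row: first non-null value per column
flat.to_csv('record.csv', index=False)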

Slice dataframe according to unique values into many smaller dataframes

I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written code which basically performs an analysis using the x and y values for a subset of this data (but the specifics are unimportant here).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on one subset of titles (i.e. one set of rows sharing the same title) at a time. For this I've been making a smaller DataFrame out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller data sets I need to run the analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller DataFrames automatically, by slicing the data on the unique 'title' values? In all the help I've found, it seems you need to specify the 'title' to make a subset; I want to subset all of it without having to list every title name.
I've searched quite a lot and haven't found anything; however, I am a beginner, so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use pandas groupby.
For example:
df_dict = {title: group for title, group in df.groupby('title', sort=False)}
This creates a dictionary of DataFrames, each containing all the columns and only the rows pertaining to one unique value of title.
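If you don't actually need to keep the subsets around, you can skip the dictionary and loop over the groups directly (a sketch; run_analysis stands in for the asker's existing analysis code):

for title, group in df.groupby('title', sort=False):
    # 'group' is the smaller DataFrame for this one title,
    # e.g. the ~80 rows for "Part number Y1-17"
    run_analysis(group)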

How to efficiently flatten the JSON structure returned by elasticsearch_dsl queries?

I'm using elasticsearch_dsl to make queries against and searches of an Elasticsearch DB.
One of the fields I'm querying is an address, which has a structure like so:
address.first_line
address.second_line
address.city
address.code
The returned documents hold this in JSON structures, such that the address is held in a dict with a field for each sub-field of address.
I would like to put this into a (pandas) dataframe, such that there is one column per sub-field of the address.
Directly putting address into the DataFrame gives me a column of address dicts, and iterating the rows to manually unpack each address dict (with pd.json_normalize()) takes a long time (4 days for ~200,000 rows).
From the docs I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
Searching for a way to solve this problem, I've come across my own answer and found it lacking, so I will update it with a better way.
Specifically: pd.json_normalize(df['json_column'])
In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)
Then drop the original column if required.
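Put together, a minimal sketch (the sample values are made up to match the address structure described in the question):

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"first_line": "1 Main St", "second_line": "", "city": "Leeds", "code": "LS1"},
        {"first_line": "2 High St", "second_line": "Flat 3", "city": "York", "code": "YO1"},
    ],
})

# One new column per sub-field of the address, original dict column dropped
flat = pd.concat([df.drop(columns="address"),
                  pd.json_normalize(df["address"])], axis=1)
# flat columns: id, first_line, second_line, city, code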
Original answer from last year, which does the same thing much more slowly:
df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.
pd.concat([df, new_df], axis=1) gets the new columns onto the old DataFrame.
Then delete the original column_of_dicts.
pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.

pandas dataframe for matrix of values

I have 3 things:
A 1D time-series array of a certain length.
A matrix of stellar flux values whose rows have the same length as the time series (each star in the field was observed according to the time array), and which is ~3000 rows deep as there are ~3000 observed stars in this field.
An array of ~3000 star IDs to go with the ~3000 time-series flux recordings mentioned above.
I'm trying to turn all of this into a pandas.DataFrame for extracting time-series features using the 'tsfresh' module.
Does anyone have an idea of how to do this? It should read somewhat like a table with a row of IDs as headers, a column of time values and ~3000 columns of flux values for the stars.
I've seen examples of this being done on the page I've linked, i.e. multiple 'value' columns (in this case they would be flux values), but no indication of how to construct them.
This data frame will then be used for machine learning if that makes any difference.
Many thanks for any help that can be offered!
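For what it's worth, a sketch of the table described in the question (one time column plus one flux column per star ID; the names and sizes below are assumptions):

import numpy as np
import pandas as pd

n_stars, n_times = 3000, 500                # assumed sizes
times = np.linspace(0.0, 1.0, n_times)      # the 1D time-series array
flux = np.random.rand(n_stars, n_times)     # ~3000 x n_times flux matrix
star_ids = [f"star_{i}" for i in range(n_stars)]  # the ~3000 star IDs

# One column per star (the IDs become the headers), plus a leading time column
df = pd.DataFrame(flux.T, columns=star_ids)
df.insert(0, "time", times)

If tsfresh wants one id/time/value row per observation instead, df.melt(id_vars='time', var_name='id', value_name='flux') will reshape this wide table into that long format.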