using regex in pivot_longer to unpivot multiple sets of columns with common grouping variable - pandas

Follow-up from my last question:
pyjanitor pivot_longer multiple sets of columns with common grouping variable and id column
In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).
Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.
The dataset
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate': [3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})
Desired output
The header of my desired output is the following (first row shown):

id   category   rating   estimate   type
1    first      1.0      1.2        'red'

I think the easiest way would be to align the columns you have with what was used in the previous question, something like:
def fix_col_header(s, d):
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            # the grouping word was a prefix: move it to the end as a suffix
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s

d = {"first": "_first", "second": "_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]
This will give the columns:
id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third
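Since the question title asks for regex, the same normalization can also be done with a single pattern instead of the replacement dict. A minimal sketch (the normalize helper is my own, illustrative name):
import re

def normalize(col):
    # capture the grouping word whether it is a prefix or a suffix,
    # and rewrite the column as stem_word, e.g. 'firstestimate' -> 'estimate_first'
    m = re.fullmatch(r'(first|second|third)(.+)|(.+?)(first|second|third)', col)
    if not m:
        return col  # e.g. 'id' stays untouched
    word = m.group(1) or m.group(4)
    stem = m.group(2) or m.group(3)
    return f"{stem}_{word}"

df.columns = [normalize(c) for c in df.columns]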
Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:
import janitor

(df
 .pivot_longer(
     column_names="*_*",
     names_to=(".value", "category"),
     names_sep="_")
)

Related

I need to return a value from a DataFrame cell as a variable, not a Series

I have the following issue:
when I use the .loc function it returns a Series, not a single value with no index.
I need to do some math operations with the selected cells. The code that I am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]
.values[0] will do what OP wants.
Assuming one wants to obtain the value 30, the following will do the work:
value = df.loc[df['State'] == 2, 'Ah-Step'].values[0]
print(value)
[Out]: 30.0
So, in OP's specific case, the operation 30 + 3.7 could be done as follows:
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
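As a side note (not from the original answer): when exactly one row should match, Series.item() is a stricter alternative to .values[0], since it raises an error unless the selection contains a single element:
# raises ValueError unless exactly one row matches each filter
df.loc[df['State'] == 2, 'Ah-Step'].item() + df.loc[df['State'] == 3, 'Ah-Step'].item()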

Unify dataframe columns like in Prolog to remove duplicates

I'm learning numpy and pandas, so please forgive me if I'm not using the right terminology or I ask something silly.
I've spent quite some time searching online for an answer to my question, with no luck...
I have a few .CSV files containing data in the same domain.
Each file has a column which works as a primary key for each row.
The name of the primary key column might be different in each file.
Each file might have or miss primary keys from other files.
It's safe to assume that each file contains most of the primary keys, so some np.NAN will have to be introduced.
I'm trying to load and combine together all the data in a single dataset, possibly removing "duplicated" columns.
This is how I do the first part (please let me know if there are better ways of doing it):
import pandas as pd
FILENAME_KEY_TABLE = {
    'file_001.csv': 'id',
    'file_002.csv': 'ident',
    'file_003.csv': 'cid',
    'file_004.csv': 'id',
}

dataset, previous = pd.DataFrame(), None
for filename, key in FILENAME_KEY_TABLE.items():
    resource = pd.read_csv(filename)
    dataset = pd.merge(dataset, resource, left_on=[previous], right_on=[key], how='outer')
    previous = key
The resulting dataset has duplicated columns for sure.
The code above keeps copies of all the primary keys, for instance.
Moreover, some columns have duplicates (with the same name or a different name) in the other files.
I'm using the following code to identify duplicated columns in the dataset (if there is a better way of doing this, please let me know). The output is a dictionary whose keys are the names of the duplicated columns and whose values are sets of names of the original columns they duplicate.
from typing import Dict, Set

def get_duplicates(df: pd.DataFrame) -> Dict[str, Set[str]]:
    result = {}
    for x in range(df.shape[1]):
        col = df.iloc[:, x]
        original = df.columns.values[x]
        for y in range(x + 1, df.shape[1]):
            other = df.iloc[:, y]
            if col.equals(other):
                duplicate = df.columns.values[y]
                result.setdefault(duplicate, set()).add(original)
    return result
Unfortunately, the code above compares columns as a whole and doesn't treat any np.NAN introduced by the merging as a potential match, as it should. I tried naively to replace:
if col.equals(other):
    ...
with:
if col.equals(other) or col.isnull() or other.isnull():
    ...
clearly with no success. How should I change the line above to work as intended?
For instance, columns 'A' and 'B' in the following example actually unify:
import numpy as np

df = pd.DataFrame({
    'A': ['x', 'x', np.NAN, np.NAN],
    'B': ['x', np.NAN, 'x', np.NAN],
}, index=[0, 1, 2, 3])
and the unified column would be:
pd.DataFrame({
    'Unified': ['x', 'x', 'x', np.NAN],
}, index=[0, 1, 2, 3])
Ultimately, I'd like to implement something like the unification in Prolog.
Two columns unify if each corresponding value matches or any of them is np.NAN (np.NAN works like a variable).
If two columns unify, I'd like to keep the column with the more descriptive title (the longer title), merge its content with the values from the other column (replacing any np.NAN with the corresponding value from the other column), and eventually remove the other column. Is there a function or package that does this already?
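For what it's worth, a sketch of that unification in plain pandas (my own code, not a library function): the element-wise check treats np.NAN as a wildcard, and Series.combine_first does the "fill NaN from the other column" merge:
from typing import Optional

def unify_columns(a: pd.Series, b: pd.Series) -> Optional[pd.Series]:
    # columns unify where values are equal or where either side is NaN
    if ((a == b) | a.isna() | b.isna()).all():
        # keep a's values, filling its NaNs from b
        return a.combine_first(b)
    return None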
Last but not least, I've noticed that some columns in my data are duplicates, but they are not identified as such because they contain floats with different levels of precision (e.g. 3 vs 6 decimal digits). Is there any way to tackle this or similar cases? Again, is there any package or function that might help?
Thanks in advance!
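On the float-precision point: for numeric columns, one cheap approach (my suggestion, not a dedicated package) is to round both sides to the coarser precision before comparing:
# treat 3.141593 and 3.142 as equal by comparing at 3 decimals
col.round(3).equals(other.round(3))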
I modified my code as follows; however, it is slow and doesn't find the same duplicates that I was finding with a former pure-Python implementation.
from typing import Any, List, Tuple

import numpy as np
import pandas as pd
import swifter

def get_duplicates(df: pd.DataFrame) -> List[Tuple[str, str]]:
    result = []
    for i, original in enumerate(df.columns, start=1):
        for duplicate in df.columns[i:]:
            if df.swifter.apply(lambda x: unifying(x[original], x[duplicate]), axis=1).all():
                result.append((original, duplicate))
    return result

def consolidate(df: pd.DataFrame, original: str, duplicate: str) -> None:
    if df.swifter.apply(lambda x: unifying(x[original], x[duplicate]), axis=1).all():
        if len(duplicate) > len(original):
            original, duplicate = duplicate, original
        df[original] = df.swifter.apply(lambda x: unification(x[original], x[duplicate]), axis=1)
        df.drop(duplicate, axis=1, inplace=True)

def unifying(src: Any, tgt: Any) -> bool:
    return src == tgt or src is np.NAN or tgt is np.NAN

def unification(src: Any, tgt: Any) -> Any:
    return src if src == tgt or tgt is np.NAN else tgt

df = pd.DataFrame({
    'A': ['x', 'x', np.NAN, np.NAN],
    'B': ['x', np.NAN, 'x', np.NAN],
}, index=[0, 1, 2, 3])

for original, duplicate in get_duplicates(df):
    consolidate(df, original, duplicate)
Maybe it's because the comparison in the unif* methods takes the type of the data into account? (i.e. '1.0' != 1.0)
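That is plausible: in Python, '1.0' == 1.0 is False, so a column read as strings never unifies with a numeric one. A possible workaround (my sketch; note that errors='coerce' turns non-numeric strings into NaN, which the wildcard semantics would then match) is to normalize dtypes first:
# coerce string-typed columns to numbers before comparing
col = pd.to_numeric(col, errors='coerce')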
After further research, I believe what I was after is (matrix) correlation.
def find_correlations(df: pd.DataFrame, threshold: float = 0.95):
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    result = []
    for column in upper.columns:
        key, value = max(upper[column].items(), key=lambda x: x[1])
        if value > threshold:
            result.append((column, key, value))
    return result

pandas sum particular rows conditional on values in different rows

I have a data frame like this.
The rows are grouped into sections of 5 rows at a time. The first row of each group tells me, according to FieldA, whether to include the whole of the following 4 rows. For example, the yellow section is included and the blue one is not, simply because the first row of each says so.
I want to sum FieldB if the section has FieldA true in its first row. In this example, I want to sum over the yellow section, because the first row of that section has TRUE in FieldA.
I can think of two approaches to do this, but don't know how to code either:
Update the rest of FieldA with TRUE if the first of the 5 rows is true. But I don't know how to do this.
Have a filter that is based not on the row itself, but on the first row of its group. Again, I don't know how to do this.
This is a solution based on your option 1 suggestion:
# Import pandas and numpy
import pandas as pd
import numpy as np

# Sample df
d = {'FieldA': [True, '', '', '', '', False, '', '', '', ''],
     'FieldB': [1, 2, 1, 4, 6, 5, 7, 9, 0, 1],
     'FieldC': [0.3, 0.2, 0.3, 0.2, 0.2, 0.3, 0.2, 0.3, 0.2, 0.2]}
df = pd.DataFrame(data=d)

# Create temporary column holding each row's index distance from the last True/False row
t_mod = []
for i in list(df.index.values):
    t_mod.append(i % 5)
df['t_mod_c'] = np.array(t_mod)

# Add missing True/False values to FieldA based on column t_mod_c
test = []
for i in df.index.values:
    test.append(df['FieldA'].loc[i - df['t_mod_c'].loc[i]])
df.drop(['t_mod_c'], axis=1, inplace=True)
df['FieldA'] = np.array(test)
df

# Sum FieldB based on FieldA value
df[df['FieldA'] == True]['FieldB'].sum()
Hope it helps!
If you have any questions, let me know.
Good luck!
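For reference, here is a more compact sketch of option 1 (my code, assuming sections are fixed blocks of five rows as in the sample):
# every 5 consecutive rows form one section
section = df.index // 5
# broadcast each section's first FieldA value to all of its rows
included = df.groupby(section)['FieldA'].transform('first') == True
total = df.loc[included, 'FieldB'].sum()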

How to group by multiple columns on a pandas series

The pandas.Series groupby method makes it possible to group by another series, for example:
import pandas as pd

data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])
grade.groupby(df['age']).mean()
However, this approach does not work for a groupby using two columns:
grade.groupby(df[['age','gender']])
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional.
In the example, it is easy to add the column to the dataframe and get the desired result as follows:
df['grade'] = grade
y = df.groupby(['gender','age']).mean()
y.to_dict()
{'grade': {('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}}
But that can get quite ugly in real life situations. Is there any way to do this groupby on multiple columns directly on the series?
Since I don't know of any direct way to solve the problem, I've made a function that creates a temporary table and performs the groupby on it.
def pd_groupby(series, group_obj):
    df = pd.DataFrame(group_obj).copy()
    groupby_columns = list(df.columns)
    df[series.name] = series
    return df.groupby(groupby_columns)[series.name]
Here, group_obj can be a pandas Series or a pandas DataFrame. Starting from the sample code, the desired result can be achieved by:
y = pd_groupby(grade,df[['gender','age']]).mean()
y.to_dict()
{('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}
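Note also that groupby accepts a list of Series directly (standard pandas behavior, not shown in the question), which avoids the temporary DataFrame altogether:
grade.groupby([df['gender'], df['age']]).mean()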

Seaborn groupby pandas Series

I want to visualize my data as box plots that are grouped by another variable, shown here in my terrible drawing:
So I use a pandas Series to tell pandas that I have grouped variables; this is what I do:
import pandas as pd
import seaborn as sns
# example data for reproducibility
a = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ])
# converting second column to Series
a.ix[:, 1] = pd.Series(a.ix[:, 1])
# Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:, 1])
And this is what I get:
However, what I expected was two boxplots, each describing only the first column, grouped by the corresponding value in the second column (the one converted to a Series); the plot above instead shows each column separately, which is not what I want.
A column in a DataFrame is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
# example data for reproducibility
df = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ], columns=['a', 'b'])
# Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit; giving the columns labels makes it a bit clearer in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your DataFrame looks like this:
   a   b  grouper
0   2   5        1
1   4   9        2
2   5   3        1
3  10   6        2
4   9   7        2
5   3  11        1
And you want boxplots for columns a and b while grouping by the column grouper. You should flatten the columns and change the groupby column to contain values like a1, a2, b1, etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame. The flattening of the hierarchy after pivoting in particular is hard to read; I don't like it.
This is a new answer for an old question, because seaborn and pandas have changed over several version updates. Because of these changes, Rutger's answer no longer works.
The most important changes came between seaborn==v0.5.x and seaborn==v0.6.0. I quote the changelog:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1], [4, 2], [5, 1],
                   [10, 2], [9, 2], [3, 1]
                  ], columns=['a', 'b'])
# Plotting by seaborn with x and y as parameters
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                  ], columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join the two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data: pd.DataFrame, col: str) -> pd.DataFrame:
    '''Take a DataFrame, group by one column, and return a new DataFrame
    where the old column names are extended by the group item.
    '''
    grouper = data.groupby(col)  # group the parameter, not a global df
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df

df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                  ], columns=['a', 'b', 'grouper'])

df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take a groupby argument.
You will probably see:
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
The best approach is to group the data first and pass the grouped DataFrame to the boxplot:
import seaborn as sns
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here, column B contains numeric values and the grouping is done on the basis of A. The values are summed per group and the boxplot is plotted. Hope this helps.