Is there a way to separate a column containing multiple data sets? - pandas

I'm new to this, and the dataframe I'm currently working with has four columns, all of object dtype. The last column contains multiple data points...
i.e. the first row, last column contains:
[{"year":"1901","a":"A","b":"B"}] #printed in this format
Is there a way to create a new column containing just the year, i.e. to isolate this data?
Thanks in advance

With pandas, you can add a new column the same way you add a key to a dictionary, so this should work for you:
df['year'] = [i[0]['year'] for i in df['last_column']]
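Note that since the column prints like JSON and the whole frame is object dtype, the cells may actually hold strings rather than Python lists. A hedged sketch that covers both cases (the column name last_column is assumed, as above):
import json

def extract_year(cell):
    # If the cell is a JSON string like '[{"year": "1901", ...}]', parse it first
    if isinstance(cell, str):
        cell = json.loads(cell)
    # The cell is now a list holding one dict; pull out the year
    return cell[0]['year']

df['year'] = df['last_column'].apply(extract_year)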

You can use df.apply() to extract the dictionary value and assign it to a new column.
import pandas as pd

df = pd.DataFrame({'col1': ['Jack', 'Jill', 'Moon', 'Wall', 'Hill'],
                   'col2': [100, 200, 300, 400, 500],
                   'col3': [{"year": "1901", "a": "A", "b": "B"},
                            {"year": "1902", "c": "C", "d": "D"},
                            {"year": "1903", "e": "E", "f": "F"},
                            {"year": "1904", "g": "G", "h": "H"},
                            {"year": "1905", "i": "I", "j": "J"}]})
print(df)

df['year'] = df['col3'].apply(lambda x: x['year'])
print(df)
Output for the above code:
Original DataFrame:
   col1  col2                                  col3
0  Jack   100  {'year': '1901', 'a': 'A', 'b': 'B'}
1  Jill   200  {'year': '1902', 'c': 'C', 'd': 'D'}
2  Moon   300  {'year': '1903', 'e': 'E', 'f': 'F'}
3  Wall   400  {'year': '1904', 'g': 'G', 'h': 'H'}
4  Hill   500  {'year': '1905', 'i': 'I', 'j': 'J'}
Updated DataFrame:
   col1  col2                                  col3  year
0  Jack   100  {'year': '1901', 'a': 'A', 'b': 'B'}  1901
1  Jill   200  {'year': '1902', 'c': 'C', 'd': 'D'}  1902
2  Moon   300  {'year': '1903', 'e': 'E', 'f': 'F'}  1903
3  Wall   400  {'year': '1904', 'g': 'G', 'h': 'H'}  1904
4  Hill   500  {'year': '1905', 'i': 'I', 'j': 'J'}  1905
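As a side note, Series.str.get also does key lookup on dicts held in an object column, so this one-liner should produce the same result (a sketch, not from the answer above):
# .str.get('year') pulls the 'year' key from each dict in col3
df['year'] = df['col3'].str.get('year')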

Related

Python: plotting a combination of observations taken on the same day

I have the following dataset:
id  test  date
1   A     2000-01-01
1   B     2000-01-01
1   C     2000-01-08
2   A     2000-01-01
2   A     2000-01-01
2   B     2000-01-08
3   A     2000-01-01
3   C     2000-01-01
3   B     2000-01-08
4   A     2000-01-01
4   B     2000-01-01
4   C     2000-01-01
5   A     2000-01-01
5   B     2000-01-01
5   C     2000-01-01
I would love to create a matrix figure showing the count of how many individuals had each combination of tests taken on the same day.
For example:
Here we can see that for one individual (id=1) tests A and B were taken on the same day; for one individual (id=3) tests A and C were taken on the same day; and for two individuals (id=4 and 5) all three tests were taken on the same day.
So far I am doing the following:
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
df_tests_unique = df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
df_tests_unique = df_tests_unique[['id', 'date', 'test']]
So the only thing left is to count the number of times the different tests occur within the same date.
Thanks for the fun exercise :) Below is a possible solution. I created a numpy array and plotted it using seaborn. Note that it's quite hardcoded for the case where the only tests are A, B, and C, but I'm sure you will be able to generalize it. Also, seaborn's default color scheme is the opposite of what you intended, but that's easily fixable as well. Hope I helped!
This is the script that produces the resulting plot:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    'test': ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'date': ['2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01', '2000-01-01',
             '2000-01-08', '2000-01-01', '2000-01-01', '2000-01-08', '2000-01-01',
             '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01']
})
df_tests = df.groupby(['id', 'date']).value_counts().reset_index(name='count')
df_test_with_patterns = (
    df_tests[df_tests.duplicated(subset=['id', 'date'], keep=False)]
    .groupby(['id', 'date'])
    .agg({'test': 'sum'})
    .reset_index().groupby('test').count().reset_index()
    .assign(pattern=lambda df: df.test.apply(
        lambda tst: [1 if x in tst else 0 for x in ['A', 'B', 'C']]))
)
pattern_mat = np.vstack(df_test_with_patterns.pattern.values.tolist())
ax = sns.heatmap(pattern_mat, xticklabels=['A', 'B', 'C'],
                 yticklabels=df_test_with_patterns.id.values)
ax.set(xlabel='Test Type', ylabel='# of individuals that took these tests in a single day')
plt.show()
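To avoid hardcoding A, B, C, one possible tweak (a sketch, assuming the script above has already run so that df and df_test_with_patterns are in scope) is to derive the labels from the data:
# Hypothetical generalization: derive the test labels instead of hardcoding them
tests = sorted(df['test'].unique())
pattern_mat = np.vstack(df_test_with_patterns.test
                        .apply(lambda tst: [1 if t in tst else 0 for t in tests])
                        .tolist())
ax = sns.heatmap(pattern_mat, xticklabels=tests,
                 yticklabels=df_test_with_patterns.id.values)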
Building on Erap's answer, this works too, and may be slightly faster:
out = pd.get_dummies(df.set_index(['date', 'id'], drop=True).sort_index()).groupby(level=[0,1]).sum()
and then iterate through the different dates to get the different charts:
for i in out.index.levels[0]:
    d = out.loc[i]
    plt.figure()
    plt.title(f'test for date {i}')
    sns.heatmap(d.gt(0))

Fastest way to get value frequencies stored in dictionary format with groupby in pandas

To calculate the frequency of each value by id, we can use value_counts and groupby.
>>> df = pd.DataFrame({"id":[1,1,1,2,2,2], "col":['a','a','b','a','b','b']})
>>> df
   id col
0   1   a
1   1   a
2   1   b
3   2   a
4   2   b
5   2   b
>>> df.groupby('id')['col'].value_counts()
id  col
1   a      2
    b      1
2   b      2
    a      1
But I would like to get the results stored in dictionary format, not as a Series. How can I achieve that, and will it also be fast on a large dataset?
The ideal format is:
id
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
You can unstack the groupby result to get a dict-of-dicts:
df.groupby('id')['col'].value_counts().unstack().to_dict(orient='index')
# {1: {'a': 2, 'b': 1}, 2: {'a': 1, 'b': 2}}
If you want a Series of dicts, use agg instead of to_dict:
df.groupby('id')['col'].value_counts().unstack().agg(pd.Series.to_dict)
col
a {1: 2, 2: 1}
b {1: 1, 2: 2}
dtype: object
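Note that this orientation is per column; to match the per-id format from the question as a Series, a hedged variant applies to_dict row-wise:
# Row-wise: each element is a {value: count} dict for one id (a sketch)
df.groupby('id')['col'].value_counts().unstack().agg(pd.Series.to_dict, axis=1)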
I don't recommend storing data in this format; objects are generally more troublesome to work with.
If unstacking generates NaNs, try an alternative with GroupBy.agg:
df.groupby('id')['col'].agg(lambda x: x.value_counts().to_dict())
id
1 {'a': 2, 'b': 1}
2 {'b': 2, 'a': 1}
Name: col, dtype: object
We can use pd.crosstab:
pd.Series(pd.crosstab(df.id, df.col).to_dict('index'))
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
dtype: object
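If raw speed on a large dataset is the main concern, one more alternative worth benchmarking is collections.Counter; Counter is a dict subclass, so each group's result already behaves like the requested dictionary format (a sketch, assuming the same df as above):
from collections import Counter

# One pass over each group; each element of the result is a Counter (a dict subclass)
df.groupby('id')['col'].agg(Counter)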

How to convert list of dictionaries to dataframe using pandas in python [duplicate]

This question already has answers here:
Convert list of dictionaries to a pandas DataFrame
(7 answers)
Closed 2 years ago.
I have a list of dictionaries:
data = [{'name': 'peter', 'id': 92, 'value': 6500},
        {'name': 'peter', 'id': 93, 'value': 6000},
        {'name': 'jack', 'id': 93, 'value': 9500}]
and I want it to be converted to a dataframe:
peter id jack
6500 92 0/NaN
6000 93 9500
How can I do that in Python?
I have tried this, but it is not working:
df1 = pd.DataFrame(data, columns=['name', 'value'])
print(df1)
What about just doing this?
pd.DataFrame(data)
It gives you:
    name  id  value
0  peter  92   6500
1  peter  93   6000
2   jack  93   9500
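If the pivoted shape sketched in the question (one column per name) is what's actually wanted, a possible follow-up is pivot; note that missing combinations come out as NaN rather than 0 (a sketch, not part of the original answer):
# Pivot so each name becomes a column, keyed by id
pd.DataFrame(data).pivot(index='id', columns='name', values='value')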

Best way to extract rows with non-matching column values for two python pandas dataframes

I would like to know the best approach for the following pandas dataframe comparison task:
Two dataframes, df_a and df_b, both having columns ['W', 'X', 'Y', 'Z']:
import pandas as pd

df_a = pd.DataFrame([
    ['a', 2, 2, 3],
    ['b', 5, 3, 5],
    ['b', 7, 6, 44],
    ['c', 3, 12, 19],
    ['c', 7, 13, 45],
    ['c', 3, 13, 45],
    ['d', 5, 11, 90],
    ['d', 9, 33, 44]
], columns=['W', 'X', 'Y', 'Z'])

df_b = pd.DataFrame([
    ['a', 2, 2, 3],
    ['a', 4, 3, 15],
    ['b', 5, 12, 24],
    ['b', 7, 6, 44],
    ['c', 3, 12, 19],
    ['d', 3, 23, 45],
    ['d', 6, 11, 91],
    ['d', 9, 33, 44]
], columns=['W', 'X', 'Y', 'Z'])
Extract those rows from df_a that do not have a match in columns ['W','X'] in df_b
Extract those rows from df_b that do not have a match in columns ['W','X'] in df_a
Since I am kind of a newbie to pandas (and could not find any other source covering this task), any help is very much appreciated.
Thanks in advance.
The basic approach is a left outer merge with indicator=True, then selecting the left_only rows with query:
cols = ['W', 'X']
df_a_only = (df_a.merge(df_b[cols], on=cols, indicator=True, how='left')
             .query('_merge == "left_only"')[df_a.columns])

Out[87]:
   W  X   Y   Z
4  c  7  13  45
6  d  5  11  90

df_b_only = (df_b.merge(df_a[cols], on=cols, indicator=True, how='left')
             .query('_merge == "left_only"')[df_b.columns])

Out[89]:
   W  X   Y   Z
1  a  4   3  15
6  d  3  23  45
7  d  6  11  91
Note: if your dataframes are huge, it is better to do a single full outer merge instead of the two left outer merges above, then pick the left_only and right_only rows accordingly. However, with a full outer merge you need to post-process the NaNs, convert floats back to integers, and rename columns.
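A hedged sketch of that single-merge variant (the _a/_b column names come from the suffixes argument chosen here for illustration):
cols = ['W', 'X']
merged = df_a.merge(df_b, on=cols, how='outer', indicator=True,
                    suffixes=('_a', '_b'))

# Rows only in df_a: recover the original columns and integer dtypes
df_a_only = (merged.query('_merge == "left_only"')
             .rename(columns={'Y_a': 'Y', 'Z_a': 'Z'})[df_a.columns]
             .astype({'Y': int, 'Z': int}))

# Rows only in df_b
df_b_only = (merged.query('_merge == "right_only"')
             .rename(columns={'Y_b': 'Y', 'Z_b': 'Z'})[df_b.columns]
             .astype({'Y': int, 'Z': int}))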

drop a single tuple from a multi tuple column

I have the following dataframe:
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
MultiIndex: 369416 entries, (datetime.datetime(2008, 1, 2, 16, 0), 'ABC') to (datetime.datetime(2010, 12, 31, 16, 0), 'XYZ')
Data columns:
b_val 369416 non-null values
dtypes: float64(1)>
From this, I want a dataframe that has the dates as the index and 'ABC' through 'XYZ' as column names, with the values taken from the 'b_val' column. I tried:
new_data = new_data.unstack()
But this gives me:
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Index: 757 entries, 2008-01-02 16:00:00 to 2010-12-31 16:00:00
Columns: 488 entries, ('b_val', 'ABC') to ('b_val', 'XYZ')
dtypes: float64(488)>
Is there a way to transform this another way or is there a way to drop 'b_val' from each of the column names?
I think unstack is the correct way to do what you've done.
You could drop the first level from the column names (a MultiIndex) using droplevel:
df.columns = df.columns.droplevel(0)
Here's an example:
df = pd.DataFrame([[1, 'a', 22], [1, 'b', 27], [2, 'a', 35], [2, 'b', 56]],
                  columns=['date', 'name', 'value']).set_index(['date', 'name'])
df1 = df.unstack()
In [3]: df1
Out[3]:
     value
name     a   b
date
1       22  27
2       35  56

In [4]: df1.columns = df1.columns.droplevel(0)

In [5]: df1
Out[5]:
name   a   b
date
1     22  27
2     35  56

However, a cleaner option is just to unstack the column (the Series):

In [6]: df.value.unstack()
Out[6]:
name   a   b
date
1     22  27
2     35  56
2 35 56