Selecting specific rows by date in a multi-dimensional df Python - pandas

[image of the DataFrame]
I would like to select a specific date, e.g. 2020-07-07, and get the Adj Cls and ExMA values for each of the symbols. I'm new to Python and I tried using df.loc['xy'] ('xy' being a specific date from the datetime level) and keep getting a KeyError. Any insight is greatly appreciated.
Info on the df:
MultiIndex: 30 entries, (SNAP, 2020-07-06 00:00:00) to (YUM, 2020-07-10 00:00:00)
Data columns (total 2 columns):
dtypes: float64(2)

You can use pandas.DataFrame.xs for this.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=[[0, 0, 1, 1], [2, 3, 2, 3]],
    columns=list("ab"),
)
print(df)
#      a  b
# 0 2  0  1
#   3  2  3
# 1 2  4  5
#   3  6  7
print(df.xs(3, level=1).filter(["a"]))
#    a
# 0  2
# 1  6
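Applied to your frame, the same idea selects every symbol's row for one date. A minimal sketch, assuming the date is the second index level and the column names match the screenshot ('Adj Cls', 'ExMA'):
import pandas as pd
# df is your symbol/date MultiIndex frame; level=1 (or its name, if it has one) is the date level
day = df.xs(pd.Timestamp('2020-07-07'), level=1)
print(day[['Adj Cls', 'ExMA']])   # one row per symbol for 2020-07-07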


Count how many non-zero entries at each month in a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column, like this:
I need to count how many non-zero entries I have in each month. For example, according to those images, in January I would have 2 entries, in February 1 entry and in March 2 entries. I have more months in the dataframe, but I think that illustrates the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I didn't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
        import numpy as np
        x_df = pd.DataFrame(
            {'values': np.random.randint(low=0, high=2, size=(120,))},
            index=pd.date_range("2022-01-01", periods=120, freq="D")
        )
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
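Equivalently, you can filter the non-zero rows first and group by the filtered index converted to monthly periods; this is just a variant of the same to_period idea:
nonzero = x_df[x_df['values'] != 0]
nonzero.groupby(nonzero.index.to_period("M")).count()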
You can also try this:
import numpy as np
# turn the zeros into NaN, drop them, then count what is left per month
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
dfx = dfx.resample('1M').count()
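A quick end-to-end sketch of that approach on toy data (the column name 'col1' and the dates are just placeholders):
import numpy as np
import pandas as pd

dfx = pd.DataFrame(
    {'col1': [3, 0, 5, 2, 0, 4, 1]},
    index=pd.to_datetime(['2022-01-05', '2022-01-12', '2022-01-20',
                          '2022-02-10', '2022-02-15',
                          '2022-03-01', '2022-03-15'])
)
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
print(dfx.resample('1M').count())   # 2022-01-31 -> 2, 2022-02-28 -> 1, 2022-03-31 -> 2
# note: recent pandas versions prefer the 'ME' (month end) alias over 'M'/'1M'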

How to make difference between two pandas dataframes?

I have two pandas dataframes:
import pandas as pd
df_1 = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                     'Instrument': ['temp', 'temp_sensor', 'temp_sensor',
                                    'sensor', 'sensor'],
                     'Value': [1000, 0, 1000, 0, 1000]})
print(df_1)
   ID   Instrument  Value
0   1         temp   1000
1   2  temp_sensor      0
2   4  temp_sensor   1000
3   7       sensor      0
4  30       sensor   1000
df_2 = pd.DataFrame({'ID': [1, 30],
                     'Instrument': ['temp', 'sensor'],
                     'Value': [1000, 1000]})
print(df_2)
   ID Instrument  Value
0   1       temp   1000
1  30     sensor   1000
I need to exclude from df_1 the lines that also exist in df_2. So I made the code:
combined = df_1.append(df_2)
combined[~combined.index.duplicated(keep=False)]
The (wrong) output is:
   ID   Instrument  Value
2   4  temp_sensor   1000
3   7       sensor      0
4  30       sensor   1000
I would like the output to be:
   ID   Instrument  Value
1   2  temp_sensor      0
2   4  temp_sensor   1000
3   7       sensor      0
I relied on what was explained in: How to remove a pandas dataframe from another dataframe
Use DataFrame.merge on all column names with a left join and the parameter indicator=True, then filter the rows with left_only values:
s = df_1.merge(df_2, on=list(df_1.columns), how='left', indicator=True)['_merge']
df = df_1.loc[s == 'left_only']
print(df)
   ID   Instrument  Value
1   2  temp_sensor      0
2   4  temp_sensor   1000
3   7       sensor      0
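To see why the filter works, it can help to print the intermediate frame with its indicator column (a quick sketch; only the _merge Series is kept in the answer above):
merged = df_1.merge(df_2, on=list(df_1.columns), how='left', indicator=True)
print(merged)
#    ID   Instrument  Value     _merge
# 0   1         temp   1000       both
# 1   2  temp_sensor      0  left_only
# 2   4  temp_sensor   1000  left_only
# 3   7       sensor      0  left_only
# 4  30       sensor   1000       both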

Seaborn Violin Plot from Pandas Dataframe, each column its own separate violin plot

I have Pandas Dataframe with structure:
A B
0 1 1
1 2 1
2 3 4
3 3 7
4 6 8
How do I generate a Seaborn Violin plot with each column as its own separate violin plot for side-by-side comparison?
seaborn (at least, version 0.8.1; not sure if this is new) supports what you want without messing around with your dataframe at all:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'A': [1, 2, 3, 3, 6], 'B': [1, 1, 4, 7, 8]})
sns.violinplot(data=df)
(Note that you do need to pass data=df; if you pass df as the first positional argument (equivalent to x=df), seaborn concatenates the columns together and makes a single violin plot of all the data.)
You can first reshape with melt to turn the columns into groups and then call seaborn.violinplot:
#old version of pandas
#df = pd.melt(df, var_name='groups', value_name='vals')
df = df.melt(var_name='groups', value_name='vals')
print (df)
groups vals
0 A 1
1 A 2
2 A 3
3 A 3
4 A 6
5 B 1
6 B 1
7 B 4
8 B 7
9 B 8
ax = sns.violinplot(x="groups", y="vals", data=df)
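Putting the two steps together as a self-contained sketch (matplotlib is only needed here to display the figure):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': [1, 2, 3, 3, 6], 'B': [1, 1, 4, 7, 8]})
long_df = df.melt(var_name='groups', value_name='vals')   # one row per (column, value) pair
ax = sns.violinplot(x='groups', y='vals', data=long_df)
plt.show()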

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether each column in a dataframe is an integer column or not, and if it is, multiply it by 10.
import numpy as np
import pandas as pd
df = pd.DataFrame(....)

# function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col] * 10
        else:
            return x[col]

# using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You can adapt the include=[...] list to the specific dtypes you want to check, e.g. include=[np.int64].
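A small sketch of that approach on a toy frame (the column names are made up; note that np.number in the include list would also pick up float columns, so narrow the list if you only want integers multiplied):
import numpy as np
import pandas as pd

df = pd.DataFrame({'school': ['GP', 'MS'], 'age': [15, 16], 'absences': [4, 2]})
df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
print(df)
#   school  age  absences
# 0     GP  150        40
# 1     MS  160        20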
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
   0  1  2    3  4  5
0  1  2  3  4.0  5  6
1  1  2  3  4.0  5  6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
    0   1   2    3  4  5
0  10  20  30  4.0  5  6
1  10  20  30  4.0  5  6

pyspark's flatMap in pandas

Is there an operation in pandas that does the same as flatMap in pyspark?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0    1.0
1    3.0
2    2.0
3    4.0
4    NaN
5    5.0
dtype: float64
The NaN shows up because the lists have different lengths, so the intermediate frame gets padded with missing values, but for a lot of things you can just drop that:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1.0
1    3.0
2    2.0
3    4.0
5    5.0
dtype: float64
This trick uses all pandas code, so I would expect it to be reasonably efficient, though it might not like things like very different sized lists.
There are three steps to solve this question.
import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
df_new[['level_1', 0]]
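For reference, on this df the result looks like the following (level_1 is the original row index, and the column named 0 holds the unnested values):
print(df_new[['level_1', 0]])
#    level_1    0
# 0        0  1.0
# 1        1  3.0
# 2        0  2.0
# 3        1  4.0
# 5        1  5.0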
Since pandas 0.25 (July 2019), pandas offers pd.Series.explode to unnest frames. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why map?
flatmap operations should be a subset of map, not apply. Check this thread for map/applymap/apply details: Difference between map, applymap and apply methods in Pandas.
import pandas as pd
from typing import Callable
def flatmap(
        self,
        func: Callable[[pd.Series], pd.Series],
        ignore_index: bool = False):
    return self.map(func).explode(ignore_index)

pd.Series.flatmap = flatmap
# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
# A B
# 0 1 6
# 1 2 7
# 2 3 8
# 3 4 9
# 4 5 10
print(df.A.flatmap(range,False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range,True))
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# 5 2
# 6 0
# 7 1
# 8 2
# 9 3
# 10 0
# 11 1
# 12 2
# 13 3
# 14 4
# Name: A, dtype: object
As you can see, the main issue is the indexing. You could ignore it and just reset, but then you're better off using NumPy or plain lists, as indexing is one of pandas' key selling points. If you do not care about the index at all, you can reuse the idea of the solution above, changing pd.Series.map to pd.DataFrame.applymap and pd.Series.explode to pd.DataFrame.explode, and forcing ignore_index=True.
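For the list-column example from the question itself, plain pd.Series.explode already does the job with no monkey patching at all:
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
print(df['x'].explode())
# 0    1
# 0    2
# 1    3
# 1    4
# 1    5
# Name: x, dtype: object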
I suspect that the answer is "no, not efficiently."
Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df
Out[3]:
x
0 [1, 2]
1 [3, 4, 5]
And that you want something like the following
x
0 1
0 2
1 3
1 4
1 5
It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.
Generally one does a bit of munging of data before one uses tabular computation.
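A minimal sketch of that kind of up-front munging, flattening the nested lists in plain Python while keeping track of the originating row (the names here are illustrative):
import pandas as pd

data = {'x': [[1, 2], [3, 4, 5]]}
rows = [(i, v) for i, xs in enumerate(data['x']) for v in xs]   # flatten before pandas sees it
flat = pd.DataFrame(rows, columns=['row', 'x']).set_index('row')
print(flat)
#      x
# row
# 0    1
# 0    2
# 1    3
# 1    4
# 1    5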