Fastest way to locate rows of a dataframe from two lists and concatenate them? - pandas

Apologies if this has already been asked; I haven't found anything specific enough, although this does seem like a general question. Anyway, I have two lists of values which correspond to values in a dataframe, and I need to pull the rows containing those values and make them into another dataframe. The code I have works, but it seems quite slow (14 seconds per 250 items). Is there a smart way to speed it up?
row_list = []
for i, x in enumerate(datetime_list):
    # one .loc lookup per (datetime, b) pair, then concatenate at the end
    row_list.append(df.loc[(df["datetimes"] == x) & (df["b"] == b_list[i])])
data = pd.concat(row_list)
Edit: Sorry for the vagueness #anky, here's an example dataframe
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'datetimes': [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
                   'b': [0, 1, 2],
                   'c': [500, 600, 700]})

IIUC, try this
dfi = df.set_index(['datetimes', 'b'])
data = dfi.loc[list(zip(datetime_list, b_list)), :].reset_index()
Without test data in the question it is hard to verify whether this is correct.
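A quick sketch of that approach with the sample frame above; the datetime_list and b_list values here are my own assumption, since the real lookup lists are not shown in the question:

import pandas as pd
from datetime import datetime

df = pd.DataFrame({'datetimes': [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
                   'b': [0, 1, 2],
                   'c': [500, 600, 700]})

# assumed lookup lists, purely for illustration
datetime_list = [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 4)]
b_list = [0, 2]

# one vectorised lookup on a MultiIndex instead of one .loc call per pair
dfi = df.set_index(['datetimes', 'b'])
data = dfi.loc[list(zip(datetime_list, b_list)), :].reset_index()
print(data)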

Related

Capturing the Timestamp values from resampled DataFrame

Using the .resample() method yields a DataFrame with a DatetimeIndex and a frequency.
Does anyone have an idea how to iterate through the values of that DatetimeIndex?
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.random.randint(0, 10, 100),
    index=pd.date_range('20220101', periods=100),
    columns=['a'],
)
df.resample('M').mean()
If you iterate, you get individual entries of the form Timestamp('2022-11-XX…', freq='M'), but I did not manage to get the date only.
df.resample('M').mean().index[0]
Timestamp('2022-01-31 00:00:00', freq='M')
I am aiming at feeding all the dates in a list for instance.
Thanks for your help !
You can convert each entry in the index into a datetime.date object using .date and to a list using .tolist(), as below
>>> df.resample('M').mean().index.date.tolist()
[datetime.date(2022, 1, 31), datetime.date(2022, 2, 28), datetime.date(2022, 3, 31), datetime.date(2022, 4, 30)]
You can also truncate the timestamp as follows (reference solution)
>>> df.resample('M').mean().index.values.astype('<M8[D]')
array(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
dtype='datetime64[D]')
This solution seems to work fine both for dates and periods:
I = [k.strftime('%Y-%m') for k in df.resample('M').groups]
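If plain ISO date strings are enough, DatetimeIndex.strftime is another option (a small sketch, reusing the df defined in the question above):

dates = df.resample('M').mean().index.strftime('%Y-%m-%d').tolist()
# ['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30']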

Add new columns to excel file from multiple datasets with Pandas in Google Colab

I'm trying to add some columns to an Excel file after some data, but I'm not having good results: I just overwrite what I already have. Let me give you some context: I'm reading a CSV and, for each column, using a for loop to call value_counts and then create a frame from it. Here is the code for just one column:
import pandas as pd

data = pd.read_csv('responses.csv')
datatoexcel = data['Music'].value_counts().to_frame()
datatoexcel.to_excel('savedataframetocolumns.xlsx')  # Name of the file
This works as expected ...
And with that code for only one column I have the format that I actually need for Excel.
But the problem comes when I try to do it with a for loop over all the columns and then "append" the resulting dataframes to the Excel file, using this code:
for columnName in df:
    datasetstoexcel = df.value_counts(columnName).to_frame()
    print(datasetstoexcel)
    # Here is my problem, with the following line, the .to_excel
    x.to_excel('quickgraph.xlsx')  # I tried more code lines but I'll leave this one as base
The result that I want to reach is this one:
I'm really close to finishing this code; some help here, please!
How about this?
Sample data
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, 6, 7, 8],
    "col3": [9, 9, 11, 12],
    "col4": [13, 14, 15, 16],
})
Find value counts and add to a list
li = []
for i in range(df.shape[1]):  # iterate over the columns, not the rows
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    li.append(value_counts)
Concat all the DataFrames inside li and write the result to Excel
pd.concat(li, axis=1).to_excel("result.xlsx")
Sample output:
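Applied to the responses.csv from the question, the same pattern would look roughly like this (untested against the real file, since it is not available here):

import pandas as pd

data = pd.read_csv('responses.csv')

frames = []
for column in data.columns:
    # one value_counts frame per column, index reset so the labels become a column
    frames.append(data[column].value_counts().to_frame().reset_index())

# side-by-side concatenation: each value_counts ends up in its own pair of columns
pd.concat(frames, axis=1).to_excel('quickgraph.xlsx', index=False)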

Removing selected features from dataset

I am following this program: https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance_multicollinear.html
since I have a problem with highly correlated features in my model (different from the one shown in the example). In this step
selected_features = [v[0] for v in cluster_id_to_feature_ids.values()]
I can get information on the features that I will need to remove from my classifier. They are given as numbers ([0, 3, 5, 6, 8, 9, 10, 17]). How can I get names of these features?
OK, I think there are two different elements to this problem.
First, you need to get a list of the column names. In the example code you linked, it looks like the list of feature names is stored like this:
data.feature_names
Once you have the feature names, you'd need a way to loop through them and grab only the ones you want. Something like this should work:
columns = ['a', 'b', 'c', 'd']
keep_index = [0, 3]
new_columns = [columns[i] for i in keep_index]
new_columns
['a', 'd']
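For the linked scikit-learn example specifically, the dataset is load_breast_cancer, so something along these lines should work (a sketch, assuming the selected_features indices from your run):

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
selected_features = [0, 3, 5, 6, 8, 9, 10, 17]  # indices from the clustering step
selected_names = [data.feature_names[i] for i in selected_features]
print(selected_names)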

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follow:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a 1-million-row DataFrame)
import pandas as pd

df = pd.DataFrame({"value": [1, 2], "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8]]})

def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all values are in accepted_values")
I think if you create a DataFrame from the list column, you can compare it with DataFrame.eq and test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
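To compare the variants yourself, a rough timing harness along these lines (my own sketch on a synthetic million-row frame, not the asker's data) should reproduce numbers of the kind quoted above:

import time

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({"value": np.random.randint(1, 5, n),
                   "accepted_values": [[1, 2, 3, 4]] * n})

start = time.perf_counter()
values = df["value"].values
accepted_values = df["accepted_values"].values
result = all(s in e for s, e in np.column_stack([values, accepted_values]))
print(result, time.perf_counter() - start)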

Splitting Dataframe with hierarchical index [duplicate]

This question already has answers here:
Splitting dataframe into multiple dataframes
(13 answers)
Closed 3 years ago.
I have a large dataframe with hierarchical indexing (a simplified example is given in the code below). I would like to set up a loop/automated way of splitting the dataframe into subsets per unique index value, i.e. dfa, dfb, dfc etc. in the example below, and store them in a list.
I have tried the following, but unfortunately without success. Any help appreciated!
import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 1, 2, 2, 3]])
split = []
for value in data.index.unique():
    split.append(data[data.index == value])
I am not exactly sure if this is what you are looking for, but have you checked the pandas groupby function? The crucial part is that you can apply it across a MultiIndex, specifying which level of indexing (or which subset of levels) to group by, e.g.
split = {}
for value, split_group in data.groupby(level=0):
    split[value] = split_group
print(split)
As #jezrael points out, a simpler way to do it is:
dict(tuple(data.groupby(level=0)))
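A quick sketch with the sample Series from the question, showing what the resulting dict looks like:

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 1, 2, 2, 3]])

split = dict(tuple(data.groupby(level=0)))
print(list(split))   # ['a', 'b', 'c', 'd']
print(split['a'])    # the sub-Series whose first index level is 'a'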