How to access results from extractall on a dataframe - pandas

I have a dataframe df in which the column df.Type has dimension information about physical objects. The numbers appear inside a text string which I have successfully extracted using this code:
dftemp = df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
But now the problem is that the results appear as:
                0
Unit match
5    0       0.02
     1       0.03
6    0       0.02
     1       0.02
7    0       0.02
...
How can I multiply these successive numbers (e.g. 0.02 * 0.03 = 0.0006) and insert the result into the original dataframe df as a new column, say df.Area, for each value of df.Type?
Thanks for your ideas!

I think you can do it with unstack and then prod along axis=1, like:
print(dftemp.unstack().prod(axis=1))
Then, if I'm not mistaken, Unit is the name of the index in df, so
df['Area'] = dftemp.unstack().prod(axis=1)
should create the column you are looking for.
With an example:
df = pd.DataFrame({'Type': ['bla 0.03 dddd 0.02 jjk', 'bli 0.02 kjhg 0.02 wait']},
                  index=pd.Index([5, 6], name='Unit'))
df['Area'] = (df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
                .unstack().prod(axis=1))
print(df)
                         Type    Area
Unit
5      bla 0.03 dddd 0.02 jjk  0.0006
6     bli 0.02 kjhg 0.02 wait  0.0004
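If you'd rather avoid unstack, a roughly equivalent sketch (same example frame as above; grouping on the Unit level of the MultiIndex that extractall returns) could be:
import pandas as pd

df = pd.DataFrame({'Type': ['bla 0.03 dddd 0.02 jjk', 'bli 0.02 kjhg 0.02 wait']},
                  index=pd.Index([5, 6], name='Unit'))

# extractall yields one row per match, indexed by (Unit, match);
# taking the product over the 'Unit' level collapses the matches of each row
nums = df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
df['Area'] = nums[0].groupby(level='Unit').prod()
print(df)
This produces the same Area column (0.0006 and 0.0004) as the unstack version.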

return list by dataframe linear interpolation

I have a dataframe with, let's say, a few entries:
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order [moment, stress, strain], based on linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method from pandas, but that is used when you have NaN entries that you want to fill in.
How do I accomplish a similar task in my case?
Thank you
One method is to add a new row with NaN values to your dataframe, sort it by strain, and interpolate on the index:
import numpy as np
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the extra row with pd.concat
new_row = pd.DataFrame({"moment": [np.nan], "stress": [np.nan], "strain": [0.45]})
df = pd.concat([df, new_row], ignore_index=True)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
        moment  stress
strain
0.11    0.1200   13.00
0.12    0.2300   14.00
0.45    0.4775   14.75
0.56    0.5600   15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
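If you only need the interpolated values rather than a modified dataframe, a minimal alternative sketch with np.interp (assuming strain is sorted in increasing order, as np.interp requires) would be:
import numpy as np
import pandas as pd

df = pd.DataFrame({"moment": [0.12, 0.23, 0.56],
                   "stress": [13, 14, 15],
                   "strain": [0.11, 0.12, 0.56]})

target = 0.45
# np.interp(x, xp, fp) linearly interpolates fp at x, given increasing xp
moment = np.interp(target, df["strain"], df["moment"])
stress = np.interp(target, df["strain"], df["stress"])
print([moment, stress, target])   # approximately [0.4775, 14.75, 0.45]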

how to regularize numeric/non-numeric entries in pandas dataframes

I want to control for non-numeric entries in my pandas dataframe.
Say I have the following:
>>> df
   col_1  col_2  col_3
0   0.01    NaN    0.1
1    NaN    0.9    0.2
2   0.01    NaN    0.3
3   0.01    0.9    0.4
I can take the row-means as follows, and pandas properly skips over the NaN values:
>>> df.mean(axis=1)
0 0.055000
1 0.550000
2 0.155000
3 0.436667
dtype: float64
Great! But now suppose one of the values from my imported table is a string:
>>> df.iloc[0, 1] = "str1"
>>> df
  col_1 col_2  col_3
0  0.01  str1    0.1
1   NaN   0.9    0.2
2  0.01   NaN    0.3
3  0.01   0.9    0.4
>>> df.mean(axis=1)
0 0.055
1 0.200
2 0.155
3 0.205
dtype: float64
DANGER: the output looks plausible, but is wrong, because once I changed the value in position [0,1] to a string, the values in positions [1,1] and [3,1] changed from the number 0.9 to the string "0.9", and all the strings are omitted from the averaging (I guess each column has to be of the same type? There's probably a reason, but boy, this is dangerously subtle.)
What I want to do now is force all the entries of the dataframe back into numeric type. Anything that can be sensibly coerced into a number should become that number, and anything that cannot should become nan (regardless of what string or type it might have been).
Pandas has a function pandas.to_numeric where you can set errors='coerce', but unfortunately the analogous method for DataFrames (DataFrame.astype()) doesn't allow this option.
Is there a function for "make every element of the dataFrame that looks like a number numeric, and make everything else nan"?
I think you can use to_numeric on a subset of columns with apply. This answer might help.
You can use apply, which by default operates column-wise:
df.apply(pd.to_numeric, errors='coerce').mean(axis=1)
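A small self-contained sketch of that approach, rebuilding the example frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_1": [0.01, np.nan, 0.01, 0.01],
                   "col_2": ["str1", "0.9", np.nan, "0.9"],
                   "col_3": [0.1, 0.2, 0.3, 0.4]})

# coerce everything to numbers; strings that don't parse become NaN
clean = df.apply(pd.to_numeric, errors="coerce")
print(clean.mean(axis=1))   # matches the all-numeric means: 0.055, 0.55, 0.155, 0.436667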

Merging 2 or more data frames and transposing the result

I have several DFs derived from a pandas binning (resampling) process using the code below:
df2 = df.resample(rule=timedelta(milliseconds=250))['diffA'].mean().dropna()
df3 = df.resample(rule=timedelta(milliseconds=250))['diffB'].mean().dropna()
# ... etc
Every DF will have a 'time' column in datetime format (example: 2019-11-22 13:18:00.000) and a second column containing a number (e.g. 0.06). Different DFs will have different 'time' bins. I am trying to concatenate all DFs into one, where certain elements of the resulting DF may contain NaN.
The datetime format of the DFs gives an error when using:
method 1) df4 = pd.merge(df2, df3, left_on='time', right_on='time')
method 2) pd.pivot_table(df2, values='diffA', index=['time'], columns='time').reset_index()
Once the DFs have been combined, I also want to transpose the resulting DF, where:
Rows: are 'DiffA', 'DiffB', etc.
Columns: are the time bins accordingly.
I have tried the transpose() method with individual DFs, just to try, but I get an error because my time index is in datetime format.
Once that is in place, I am looking for a method to extract rows from the resulting transposed DF as individual data series.
Please advise how I can achieve the above; I appreciate any feedback. Thank you so much for your help.
Data frames (2, for example):
time                     DiffA
2019-11-25 08:18:01.250   0.06
2019-11-25 08:18:01.500   0.05
2019-11-25 08:18:01.750   0.04
2019-11-25 08:18:02.000   0
2019-11-25 08:18:02.250   0.22
2019-11-25 08:18:02.500   0.06

time                     DiffB
2019-11-26 08:18:01.250   0.2
2019-11-27 08:18:01.500   0.05
2019-11-25 08:18:01.000   0.6
2019-11-25 08:18:02.000   0.01
2019-11-25 08:18:02.250   0.8
2019-11-25 08:18:02.500   0.5
The resulting merged DF should be as follows (text only):
time (first row):
  2019-11-25 08:18:01.000, 2019-11-25 08:18:01.250, 2019-11-25 08:18:01.500,
  2019-11-25 08:18:01.750, 2019-11-25 08:18:02.000, 2019-11-25 08:18:02.250,
  2019-11-25 08:18:02.500, 2019-11-26 08:18:01.250, 2019-11-27 08:18:01.500
diffA (second row):
  nan  0.06  0.05  0.04  0  0.22  0.06  nan  nan
diffB (third row):
  0.6  nan  nan  nan  0.01  0.8  0.5  0.2  0.05
Solution
The core logic: you need an outer join on the column 'time' to merge each of the sampled dataframes together. Finally, setting the index to the 'time' column and transposing completes the solution.
I will use the dummy data I created below to create a reproducible solution.
Note: I have used df as the final dataframe and df0 as the original dataframe. My df0 is your df.
# column_names is assumed to be an iterable of labels, e.g. column_names = list('ABCDE')
df = pd.DataFrame()
for i, column_name in zip(range(5), column_names):
    if i == 0:
        df = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
    else:
        df_other = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
        df = pd.merge(df, df_other, on='time', how='outer')
print(df.set_index('time').T)
Dummy Data
import numpy as np
import pandas as pd

# dummy data:
df0 = pd.DataFrame()
df0['time'] = pd.date_range(start='2020-02-01', periods=15, freq='D')
df0['data'] = np.random.randint(0, high=9, size=15)
print(df0)
Output:
         time  data
0  2020-02-01     6
1  2020-02-02     1
2  2020-02-03     7
3  2020-02-04     0
4  2020-02-05     8
5  2020-02-06     8
6  2020-02-07     1
7  2020-02-08     6
8  2020-02-09     2
9  2020-02-10     6
10 2020-02-11     8
11 2020-02-12     3
12 2020-02-13     0
13 2020-02-14     1
14 2020-02-15     0
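For completeness, here is a sketch of the same outer-merge idea applied directly to the two example frames from the question (the variable names dfA, dfB, merged and result are mine):
import pandas as pd

dfA = pd.DataFrame({'time': pd.to_datetime(['2019-11-25 08:18:01.250', '2019-11-25 08:18:01.500',
                                            '2019-11-25 08:18:01.750', '2019-11-25 08:18:02.000',
                                            '2019-11-25 08:18:02.250', '2019-11-25 08:18:02.500']),
                    'diffA': [0.06, 0.05, 0.04, 0, 0.22, 0.06]})
dfB = pd.DataFrame({'time': pd.to_datetime(['2019-11-26 08:18:01.250', '2019-11-27 08:18:01.500',
                                            '2019-11-25 08:18:01.000', '2019-11-25 08:18:02.000',
                                            '2019-11-25 08:18:02.250', '2019-11-25 08:18:02.500']),
                    'diffB': [0.2, 0.05, 0.6, 0.01, 0.8, 0.5]})

# outer join keeps every time bin from both frames, filling the gaps with NaN
merged = pd.merge(dfA, dfB, on='time', how='outer').sort_values('time')

# transposing after setting 'time' as the index puts diffA/diffB on the rows
result = merged.set_index('time').T
print(result)

# each row of the transposed frame is then available as an individual Series
print(result.loc['diffA'])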

pandas time-weighted average groupby in panel data

Hi, I have a panel data set that looks like:
stock   date    time    spread1   weight   spread2
VOD     01-01   9:05       0.01     0.03       ...
VOD     01-01   9.12       0.03     0.05       ...
VOD     01-01   10.04      0.02     0.30       ...
VOD     01-02   11.04      0.02     0.05
...     ...     ...        ...      ...
BAT     01-01   0.05       0.04     0.03
BAT     01-01   0.07       0.05     0.03
BAT     01-01   0.10       0.06     0.04
I want to calculate the weighted average of spread1 for each stock on each day. I can break the solution into several steps: apply groupby and agg to get the sum of spread1*weight for each stock and day in one dataframe, then calculate the sum of weight for each stock and day in a second dataframe, and finally merge the two and compute the weighted average of spread1.
My question is: is there any simpler way to calculate the weighted average of spread1 here? I also have spread2, spread3 and spread4, so I want to write as little code as possible. Thanks
IIUC, you need to transform the result back to the original, but using .transform with output that depends on two columns is tricky. We write our own function, to which we pass the spread Series s and the original DataFrame df so we can also use the weights:
import numpy as np

def weighted_avg(s, df):
    # look up the weights that belong to the rows in this group
    return np.average(s, weights=df.loc[df.index.isin(s.index), 'weight'])

df['spread1_avg'] = df.groupby(['stock', 'date']).spread1.transform(weighted_avg, df)
Output:
  stock   date   time  spread1  weight  spread1_avg
0   VOD  01-01   9:05     0.01    0.03     0.020526
1   VOD  01-01   9.12     0.03    0.05     0.020526
2   VOD  01-01  10.04     0.02    0.30     0.020526
3   VOD  01-02  11.04     0.02    0.05     0.020000
4   BAT  01-01   0.05     0.04    0.03     0.051000
5   BAT  01-01   0.07     0.05    0.03     0.051000
6   BAT  01-01   0.10     0.06    0.04     0.051000
If needed for multiple columns:
gp = df.groupby(['stock', 'date'])
for col in [f'spread{i}' for i in range(1, 5)]:
    df[f'{col}_avg'] = gp[col].transform(weighted_avg, df)
Alternatively, if you don't need to transform back and want one value per stock-date:
def my_avg2(gp):
    avg = np.average(gp.filter(like='spread'), weights=gp.weight, axis=0)
    return pd.Series(avg, index=[col for col in gp.columns if col.startswith('spread')])

### Create some dummy data
df['spread2'] = df.spread1 + 1
df['spread3'] = df.spread1 + 12.1
df['spread4'] = df.spread1 + 1.13

df.groupby(['stock', 'date'])[['weight'] + [f'spread{i}' for i in range(1, 5)]].apply(my_avg2)
# spread1 spread2 spread3 spread4
#stock date
#BAT 01-01 0.051000 1.051000 12.151000 1.181000
#VOD 01-01 0.020526 1.020526 12.120526 1.150526
# 01-02 0.020000 1.020000 12.120000 1.150000
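As a further option, here is a compact sketch without a custom function (assuming the dummy spread2, spread3 and spread4 columns created above exist): weight each spread column, sum per group, and divide by the summed weights.
spread_cols = [f'spread{i}' for i in range(1, 5)]

# numerator: sum of spread * weight per (stock, date); denominator: sum of weights
weighted_sums = df[spread_cols].multiply(df['weight'], axis=0).groupby([df['stock'], df['date']]).sum()
weight_sums = df.groupby(['stock', 'date'])['weight'].sum()

print(weighted_sums.div(weight_sums, axis=0))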

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read everything first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read everything first and then filter my data?
Something you could do is use the skiprows parameter of read_csv, which accepts a list-like argument of row numbers to discard (and thus, indirectly, to select). So you could create an np.arange with a length equal to the number of rows in the file, and remove every 500th element from it using np.delete; this way only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = 9_500_000   # approximate number of rows in the file
skip = np.arange(n_rows)
# remove every 500th row number from the skip list, so those rows
# (including the header at row 0) are the only ones read
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read everything first and then filter my data?
First get the length of the file with a custom function, remove every 500th row number with numpy.setdiff1d and pass the result to the skiprows parameter of read_csv:
# https://stackoverflow.com/q/845058
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, sort it, select every 500th index value, take the difference and pass it again to the skiprows parameter:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
              .iloc[np.arange(0, len(df1), 500)].index)

# the data row with label k sits at file row k + 1 (row 0 is the header),
# so keep the header plus the selected rows and skip everything else
keep = np.concatenate(([0], sorted_idx.to_numpy() + 1))
skipped = np.setdiff1d(np.arange(len_of_file), keep)
df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This skips every row whose number is not a multiple of N, i.e. it keeps the header (row 0) and every Nth row.
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
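A self-contained sketch of that approach (the file name and N are placeholders):
import pandas as pd

N = 500
# the callable receives each row number; a truthy result means "skip this row",
# so only the header (row 0) and every Nth row are actually read
df = pd.read_csv('my_file.csv', skiprows=lambda i: i % N)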
You can use the csv module, which returns an iterator, and itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# generator that yields True once every `cycle_size` rows
chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]
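If the picked rows should end up as a DataFrame, one possible follow-up (assuming the first picked row is the file's header row, which it is here because the chooser starts with True) is:
import pandas as pd

# csv.reader yields strings, so numeric columns may still need pd.to_numeric afterwards
df = pd.DataFrame(data[1:], columns=data[0])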