Select last row from each column of multi-index Pandas DataFrame based on time, when columns are unequal length - pandas

I have the following Pandas multi-index DataFrame with the top level index being a group ID and the second level index being when, in ISO 8601 time format (shown here without the time):
                                      value  weight
                         when
5e33c4bb4265514aab106a1a 2011-05-12    1.34    0.79
                         2011-05-07    1.22    0.83
                         2011-05-03    2.94    0.25
                         2011-04-28    1.78    0.89
                         2011-04-22    1.35    0.92
...                             ...     ...     ...
5e33c514392b77d517961f06 2009-01-31   30.75    0.12
                         2009-01-24   30.50    0.21
                         2009-01-23   29.50    0.96
                         2009-01-10   28.50    0.98
                         2008-12-08   28.50    0.65
when is currently defined as an index but this is not a requirement.
Assertions
when may be non-unique.
Columns may be of unequal length across groups.
Within groups, when, value and weight will always be of equal length (for each when there will always be a value and a weight).
Question
Using the parameter index_time, how do you retrieve:
The most recent past value and weight from each group relative to index_time along with the difference (in seconds) between index_time and when.
index_time may be a time in the past such that only entries where when <= index_time are selected.
The result should be indexed in some way so that the group id of each result can be deduced.
Example
From the above, if the index_time was 2011-05-10 then the result should be:
                          value  weight       age
5e33c4bb4265514aab106a1a   1.22    0.83    259200
5e33c514392b77d517961f06  30.75    0.12  72576000

Assuming the DataFrame given in the question is df and index_time is a pd.Timestamp:
import pandas as pd

df.sort_index(inplace=True)  # a sorted index is required for label slicing
# keep rows where when <= index_time, then take the most recent row per group
result = df.loc[pd.IndexSlice[:, :index_time], :].groupby(level=0).tail(1)
# age in seconds between index_time and each row's when
result['age'] = (index_time - result.index.get_level_values(level=1)).total_seconds()
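For a quick sanity check, here is a minimal, self-contained sketch in the same shape as the question; the group ids and values are taken from the example above, while the index level names id/when are assumed for illustration:
import pandas as pd

# toy frame: level 0 is the group id, level 1 is `when`
idx = pd.MultiIndex.from_tuples(
    [('5e33c4bb4265514aab106a1a', pd.Timestamp('2011-05-12')),
     ('5e33c4bb4265514aab106a1a', pd.Timestamp('2011-05-07')),
     ('5e33c514392b77d517961f06', pd.Timestamp('2009-01-31'))],
    names=['id', 'when'])
df = pd.DataFrame({'value': [1.34, 1.22, 30.75],
                   'weight': [0.79, 0.83, 0.12]}, index=idx)

index_time = pd.Timestamp('2011-05-10')

df.sort_index(inplace=True)
result = df.loc[pd.IndexSlice[:, :index_time], :].groupby(level=0).tail(1)
result['age'] = (index_time - result.index.get_level_values('when')).total_seconds()
print(result)
# the first group yields value 1.22 / weight 0.83 with age 259200 seconds (3 days)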

Related

Issues in creating hierarchical columns in pandas

I have just started working with hierarchical columns in pandas. The original dataframe (df) has 27 columns and looks like the following (Ticker is the index):
Report Date Shares Gross Profit ...
Ticker
AAPL 2010-07-31 347000000.0 543000000.0 ...
AAPL 2010-10-31 344000000.0 548000000.0 ...
AAPL 2011-01-31 347000000.0 556000000.0 ...
AAPL 2011-04-30 347000000.0 580000000.0 ...
AAPL 2011-07-31 348000000.0 591000000.0 ...
I'd like to modify the column structure so that the first level is Report Date and the second level contains the columns Shares and Gross Profit. I tried to create a new dataframe with this structure for just one ticker (AAPL); this is the code I used:
col = pd.MultiIndex.from_product([df['Report Date'], df[['Shares', 'Gross Profit']]])
df1 = pd.DataFrame(df.loc['AAPL'], columns=col)
It appears to work, but the result contains nothing but NaN:
Report Date 2010-07-31 2010-10-31 \
Shares Gross Profit Shares Gross Profit
Ticker
AAPL NaN NaN NaN NaN
AAPL NaN NaN NaN NaN
AAPL NaN NaN NaN NaN
AAPL NaN NaN NaN NaN
Moreover, the shape exploded to (78, 112668). Can anybody spot the error? I guess it's in MultiIndex.from_product, but I cannot understand where.
Solution
This can be solved with df.melt() if we aim to produce the transposed version of the desired output first: the two-level MultiIndex can then easily be set before calling df.transpose().
df_want = df.melt(id_vars="Report Date", value_vars=["Shares", "Gross Profit"])\
            .sort_values(["Report Date", "variable"])\
            .set_index(["Report Date", "variable"])\
            .transpose()
Result
print(df_want)
Report Date 2010-07-31 ... 2011-07-31
variable Gross Profit Shares ... Gross Profit Shares
value 543000000.0 347000000.0 ... 591000000.0 348000000.0
[1 rows x 10 columns]
N.B. The problem with the original attempt: pd.MultiIndex.from_product takes the Cartesian product of every value in df['Report Date'] (duplicates included) with the two column labels, which is why the column count exploded; and because none of those product labels match the original flat column names, the DataFrame constructor reindexes everything to NaN. More generally, a better data-wrangling strategy is to let the desired indexes/columns be produced naturally within the data-processing pipeline, or set them via standard Pandas APIs, especially when the names are already present in the source dataframe.
Edit: "Produce the desired index/columns naturally" means NOT computing them outside the df.f1(...).f2(...).f3(...) pipeline and then assigning the externally generated index/columns to the output DataFrame. Keeping this inside the pipeline generally yields less error-prone and more maintainable code.
In other words, manually generating indexes or column names is rarely the Pandastic way to go, except perhaps for pre-allocating empty dataframes.
Generalization to Multiple Tickers
Dealing with multiple tickers at once is likely to be a realistic use case, so here is a generalized version as well. It also works for single-ticker dataframes.
Data
Report Date Shares Gross Profit
Ticker
AAPL 2010-07-31 347000000.0 543000000.0
AAPL 2010-10-31 344000000.0 548000000.0
AAPL 2011-01-31 347000000.0 556000000.0
AAPL 2011-04-30 347000000.0 580000000.0
AAPL 2011-07-31 348000000.0 591000000.0
GOOG 2011-07-31 448000004.0 691000000.0
GOOG 2010-07-31 447000004.0 643000000.0
GOOG 2010-10-31 444000004.0 648000000.0
GOOG 2011-01-31 447000004.0 656000000.0
GOOG 2011-04-30 447000004.0 680000000.0
Code
df_want = df.reset_index()\
            .melt(id_vars=["Ticker", "Report Date"], value_vars=["Shares", "Gross Profit"])\
            .sort_values(["Ticker", "Report Date", "variable"])\
            .pivot(index="Ticker", columns=["Report Date", "variable"], values="value")
Result
print(df_want)
Report Date 2010-07-31 ... 2011-07-31
variable Gross Profit Shares ... Gross Profit Shares
Ticker ...
AAPL 543000000.0 347000000.0 ... 591000000.0 348000000.0
GOOG 643000000.0 447000004.0 ... 691000000.0 448000004.0
[2 rows x 10 columns]
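For comparison, a similar result can also be reached without melt/pivot, using set_index and unstack. This is just an alternative sketch, assuming the same multi-ticker df with unique (Ticker, Report Date) pairs:
df_alt = (df.set_index('Report Date', append=True)[['Shares', 'Gross Profit']]
            .unstack('Report Date')   # move Report Date into the column index
            .swaplevel(axis=1)        # put Report Date on the first column level
            .sort_index(axis=1))      # group the two value columns under each date
Up to the column level names, the resulting two-level structure (Report Date on top, Gross Profit / Shares below) is the same as the pivot-based version.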
I am using pandas v1.1.3 and Python 3.7 on a 64-bit Debian 10 laptop.

Merging 2 or more data frames and transposing the result

I have several DFs derived from a pandas binning (resampling) process using the code below:
df2 = df.resample(rule=timedelta(milliseconds=250))['diffA'].mean().dropna()
df3 = df.resample(rule=timedelta(milliseconds=250))['diffB'].mean().dropna()
... etc.
Every DF has a 'time' column in datetime format (e.g. 2019-11-22 13:18:00.000) and a second column containing a number (e.g. 0.06). Different DFs have different 'time' bins. I am trying to concatenate all DFs into one, where some elements of the resulting DF may contain NaN.
The datetime format of the DFs gives an error when using:
method 1) df4 = pd.merge(df2, df3, left_on='time', right_on='time')
method 2) pd.pivot_table(df2, values='diffA', index=['time'], columns='time').reset_index()
Once the DFs have been combined, I also want to transpose the resulting DF, where:
Rows: are 'DiffA', 'DiffB', etc.
Columns: are the time bins accordingly.
I have tried the transpose() method with individual DFs, just to try, but I get an error because my time index is in datetime format.
Once that is in place, I am looking for a method to extract rows from the resulting transposed DF as individual data series.
Please advise how I can achieve the above; I appreciate any feedback, thank you so much for your help.
Data frames (2, for example):
time DiffA
2019-11-25 08:18:01.250 0.06
2019-11-25 08:18:01.500 0.05
2019-11-25 08:18:01.750 0.04
2019-11-25 08:18:02.000 0
2019-11-25 08:18:02.250 0.22
2019-11-25 08:18:02.500 0.06
time DiffB
2019-11-26 08:18:01.250 0.2
2019-11-27 08:18:01.500 0.05
2019-11-25 08:18:01.000 0.6
2019-11-25 08:18:02.000 0.01
2019-11-25 08:18:02.250 0.8
2019-11-25 08:18:02.500 0.5
The resulting merged DF should be as follows (text only):
time ( first row )
2019-11-25 08:18:01.000,
2019-11-25 08:18:01.250,
2019-11-25 08:18:01.500,
2019-11-25 08:18:01.750,
2019-11-25 08:18:02.000,
2019-11-25 08:18:02.250,
2019-11-25 08:18:02.500,
2019-11-26 08:18:01.250,
2019-11-27 08:18:01.500
(second row)
diffA nan 0.06 0.05 0.04 0 0.22 0.06 nan nan
(third row)
diffB 0.6 nan nan nan 0.01 0.8 0.5 0.2 0.05
Solution
The core logic: you need an outer join on the 'time' column to merge each of the resampled dataframes together. Setting the index to the time column and transposing then completes the solution.
I will use the dummy data created below to make the solution reproducible.
Note: I have used df as the final dataframe and df0 as the original dataframe; my df0 is your df.
column_names = ['A', 'B', 'C', 'D', 'E']  # one label per sampled dataframe (not shown in the original answer)

df = pd.DataFrame()
for i, column_name in zip(range(5), column_names):
    # each random sample stands in for one resampled dataframe (df2, df3, ...)
    sampled = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
    if i == 0:
        df = sampled
    else:
        df = pd.merge(df, sampled, on='time', how='outer')

print(df.set_index('time').T)
Dummy Data
import pandas as pd
import numpy as np

# dummy data:
df0 = pd.DataFrame()
df0['time'] = pd.date_range(start='2020-02-01', periods=15, freq='D')
df0['data'] = np.random.randint(0, high=9, size=15)
print(df0)
Output:
time data
0 2020-02-01 6
1 2020-02-02 1
2 2020-02-03 7
3 2020-02-04 0
4 2020-02-05 8
5 2020-02-06 8
6 2020-02-07 1
7 2020-02-08 6
8 2020-02-09 2
9 2020-02-10 6
10 2020-02-11 8
11 2020-02-12 3
12 2020-02-13 0
13 2020-02-14 1
14 2020-02-15 0
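Applied directly to the two sample frames from the question, the same outer-merge pattern looks like this. It is a self-contained sketch: the frames are rebuilt here by hand rather than by resampling, and the names dfA, dfB, merged and wide are illustrative.
import pandas as pd

dfA = pd.DataFrame({
    'time': pd.to_datetime(['2019-11-25 08:18:01.250', '2019-11-25 08:18:01.500',
                            '2019-11-25 08:18:01.750', '2019-11-25 08:18:02.000',
                            '2019-11-25 08:18:02.250', '2019-11-25 08:18:02.500']),
    'diffA': [0.06, 0.05, 0.04, 0, 0.22, 0.06]})
dfB = pd.DataFrame({
    'time': pd.to_datetime(['2019-11-26 08:18:01.250', '2019-11-27 08:18:01.500',
                            '2019-11-25 08:18:01.000', '2019-11-25 08:18:02.000',
                            '2019-11-25 08:18:02.250', '2019-11-25 08:18:02.500']),
    'diffB': [0.2, 0.05, 0.6, 0.01, 0.8, 0.5]})

# outer join keeps every time bin from both frames, filling gaps with NaN
merged = pd.merge(dfA, dfB, on='time', how='outer').sort_values('time')

# transpose: rows become diffA/diffB, columns become the time bins
wide = merged.set_index('time').T

# each row can then be extracted as an individual Series
diffA_series = wide.loc['diffA']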

How to manipulate data in arrays using pandas

I have data in a DataFrame and need to compare the current value of one column with the prior value of another column. The current time is row 5 in this dataframe, and here's the desired output:
The target data is streamed and captured into a DataFrame, then that column is multiplied by a constant to generate another column. However, I am unable to generate the third column, comp, which should compare the current value of prod with the prior value of comp.
df['temp'] = self.temp
df['prod'] = df['temp'].multiply(other=const1)
Another user suggested the following logic, but it generates errors because the routine's array doesn't match the size of the DataFrame:
for i in range(2, len(df['temp'])):
    df['comp'].append(max(df['prod'][i], df['comp'][i - 1]))
Let's try this; I think it will capture your intended logic:
import pandas as pd

df = pd.DataFrame({'col0': [1, 2, 3, 4, 5],
                   'col1': [5, 4.9, 5.5, 3.5, 6.3],
                   'col2': [2.5, 2.45, 2.75, 1.75, 3.15]})
df['col3'] = df['col2'].shift(-1).cummax().shift()
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 6.3 3.15 3.15
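For clarity, the chained expression can be unpacked step by step (illustrative only; col3_check is a throwaway name):
step1 = df['col2'].shift(-1)   # each row holds the next row's col2 value
step2 = step1.cummax()         # running maximum of those next-row values
step3 = step2.shift()          # shift down one row: row i now sees max(col2[1..i])
df['col3_check'] = step3       # matches the col3 column produced above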

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the whole file first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is use the skiprows parameter of read_csv, which accepts a list-like argument of row numbers to discard (and thus, by exclusion, to select). You can create a np.arange with a length equal to the number of rows to read, then remove every 500th element from it with np.delete, so that only every 500th row is actually read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))  # keep only every 500th row
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:
# https://stackoverflow.com/q/845058
import numpy as np
import pandas as pd

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, sort it, select every 500th index value, take the set difference against all row numbers, and pass that to skiprows again:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
              .iloc[np.arange(0, len_of_file, 500)].index)
skipped = np.setdiff1d(np.arange(len_of_file), sorted_idx)
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This skips every row whose number is not a multiple of N, i.e. it keeps only every Nth row (including row 0, the header).
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
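A minimal runnable sketch of that approach (the file name and N = 500 are placeholders):
import pandas as pd

N = 500
# skiprows is called with each row number; a truthy return value skips that row,
# so i % N keeps only rows 0, N, 2N, ... (row 0 being the header line)
df = pd.read_csv('my_file.csv', skiprows=lambda i: i % N)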
You can use the csv module, which returns an iterator, together with itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# chooser yields True once every cycle_size rows
chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]
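Since the question asks for a Pandas DataFrame, the selected rows can then be wrapped up like this (a sketch; note that row 0, the header, is among the picked rows, and all values come back as strings):
import pandas as pd

# data[0] is the header row picked by the chooser; the rest are every 500th data row
df = pd.DataFrame(data[1:], columns=data[0])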

SQL linear interpolation based on lookup table

I need to build linear interpolation into an SQL query, using a joined table containing lookup values (more like lookup thresholds, in fact). As I am relatively new to SQL scripting, I have searched for an example code to point me in the right direction, but most of the SQL scripts I came across were for interpolating between dates and timestamps and I couldn't relate these to my situation.
Basically, I have a main data table with many rows of decimal values in a single column, for example:
Main_Value
0.33
0.12
0.56
0.42
0.1
Now, I need to yield interpolated data points for each of the rows above, based on a joined lookup table with 6 rows, containing non-linear threshold values and the associated linear normalized values:
Threshold_Level Normalized_Value
0 0
0.15 20
0.45 40
0.60 60
0.85 80
1 100
So for example, if the value in the Main_Value column is 0.45, the query will lookup its position in (or between) the nearest Threshold_Level, and interpolate this based on the adjacent value in the Normalized_Value column (which would yield a value of 40 in this example).
I really would be grateful for any insight into building a SQL query around this, especially as it has been hard to track down any SQL examples of linear interpolation using a joined table.
It has been pointed out that I could use some sort of rounding, so I have included a more detailed table below. I would like the SQL query to look up each Main_Value (from the first table above) where it falls between the Threshold_Min and Threshold_Max values in the table below, and return the 'Normalized_%' value:
Threshold_Min Threshold_Max Normalized_%
0.00 0.15 0
0.15 0.18 5
0.18 0.22 10
0.22 0.25 15
0.25 0.28 20
0.28 0.32 25
0.32 0.35 30
0.35 0.38 35
0.38 0.42 40
0.42 0.45 45
0.45 0.60 50
0.60 0.63 55
0.63 0.66 60
0.66 0.68 65
0.68 0.71 70
0.71 0.74 75
0.74 0.77 80
0.77 0.79 85
0.79 0.82 90
0.82 0.85 95
0.85 1.00 100
For example, if the value from the Main_Value table is 0.52, it falls between Threshold_Min 0.45 and Threshold_Max 0.60, so the Normalized_% returned is 50%. The problem is that the Threshold_Min and Max values are not linear. Could anyone point me in the direction of how to script this?
Assuming you want, for each Main_Value, the Normalized_Value of the nearest lower (or equal) Threshold_Level, you can do it like this:
select t1.Main_Value, max(t2.Normalized_Value) as Normalized_Value
from #t1 t1
inner join #t2 t2 on t1.Main_Value >= t2.Threshold_Level
group by t1.Main_Value
Replace #t1 and #t2 with the correct table names.