How to correctly format a pd-multiindex for sktime?

I have a pd.multiindex which looks like this:
However, when I run check_raise(df_train, mtype="pd-multiindex")
I get the following error:
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sktime/datatypes/_check.py:252, in check_raise(obj, mtype, scitype, var_name)
250 return True
251 else:
--> 252 raise TypeError(msg)
TypeError: input.loc[i] must be Series of mtype pd.DataFrame, not at i=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
I believe this means I am meant to convert each row into a pandas series, but I am unsure if this is correct?
Any help would be appreciated.

I had a similar issue. Try checking whether your index has duplicate keys; in your case:
df_train.reset_index(['sbj', 'system_time_stamp'])[['sbj', 'system_time_stamp']].duplicated(keep=False)
Removing the duplicated index entries worked for me.
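As a rough illustration of that check, here is a minimal pandas sketch. The small frame below is a hypothetical stand-in for df_train (only the level names sbj and system_time_stamp come from the question), and keeping the first row of each duplicated pair is an assumption; which row to keep depends on your data.

```python
import pandas as pd

# Hypothetical stand-in for df_train: a two-level (instance, time) index
# in which the pair (1, 0.1) occurs twice.
index = pd.MultiIndex.from_tuples(
    [(1, 0.0), (1, 0.1), (1, 0.1), (2, 0.0)],
    names=["sbj", "system_time_stamp"],
)
df_train = pd.DataFrame({"value": [10, 11, 12, 20]}, index=index)

# Flag every row whose (sbj, system_time_stamp) pair is not unique.
dup_mask = df_train.index.duplicated(keep=False)
print(df_train[dup_mask])

# Keep only the first row for each index pair (an assumption, see above).
df_clean = df_train[~df_train.index.duplicated(keep="first")]
print(df_clean.index.is_unique)
```

Once the index is unique, the check_raise call from the question can be re-run on the cleaned frame.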


np.array for variable matrix

import numpy as np
data = np.array([[10, 20, 30, 40, 50, 60, 70, 80, 90],
                 [2, 7, 8, 9, 10, 11],
                 [3, 12, 13, 14, 15, 16],
                 [4, 3, 4, 5, 6, 7, 10, 12]], dtype=object)
target = data[:,0]
It has this error.
IndexError Traceback (most recent call last)
Input In [82], in <cell line: 9>()
data = np.array([[10, 20, 30, 40, 50, 60, 70, 80, 90],
[2, 7, 8, 9, 10, 11],
[3, 12, 13, 14, 15, 16],
[4, 3, 4, 5, 6, 7, 10,12]],dtype=object)
# Define the target data
----> 9 target = data[:,0]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
May I know how to fix it without changing the elements in data? When I made all rows the same length, the error went away, but my real data has rows of variable length. Many thanks.
You have an array of objects (a 1-D array whose elements are lists), so you can't index along axis=1 because that axis doesn't exist (data.shape -> (4,)).
Use a list comprehension:
out = np.array([a[0] for a in data])
Output: array([10, 2, 3, 4])
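A self-contained sketch of the same idea, using the data from the question. The features variable is an illustrative assumption, in case the remaining elements of each row are needed as well.

```python
import numpy as np

rows = [[10, 20, 30, 40, 50, 60, 70, 80, 90],
        [2, 7, 8, 9, 10, 11],
        [3, 12, 13, 14, 15, 16],
        [4, 3, 4, 5, 6, 7, 10, 12]]

# Ragged rows with dtype=object give a 1-D array of Python lists.
data = np.array(rows, dtype=object)
print(data.shape)  # (4,)

# First element of every row, collected into a regular integer array.
target = np.array([row[0] for row in data])
print(target)

# Assumption: the rest of each row is kept as a plain list of lists,
# since the rows have different lengths.
features = [row[1:] for row in data]
```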

Error message More than one column has the same display name

I keep getting the error message
More than one column has the same display name
but I cannot find the root cause. Any help is greatly appreciated!
SELECT
gl_ap_details.ledger_name,
gl_ap_details.company_code,
gl_ap_details.location_code,
gl_ap_details.cost_center,
gl_ap_details.account_number,
gl_ap_details.account_name,
gl_ap_details.product_code,
gl_ap_details.channel_code,
gl_ap_details.journal_name,
gl_ap_details.line_description,
gl_ap_details.gl_posted_date,
gl_ap_details.currency,
gl_ap_details.je_source,
gl_ap_details.je_category,
gl_ap_details.effective_date,
gl_ap_details.created_by,
gl_ap_details.invoice_num,
gl_ap_details.invoice_id,
gl_ap_details.invoice_date,
gl_ap_details.vendor_name,
gl_ap_details.vendor_number,
gl_ap_details.invoice_image,
gl_ap_details.po_number,
gl_ap_details.po_requestor,
gl_ap_details.period_name,
gl_ap_details.amount,
gl_ap_details.gl_posted_date,
gl_ap_details.project_code
FROM
wbr_global.gl_ap_details
WHERE
wbr_global.gl_ap_details.ledger_name = 'Amazon.com, Inc.'
AND cost_center IN ('1172')
AND period_name = 'JUL-21'
AND wbr_global.gl_ap_details.account_number = '60820'
GROUP BY
1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28;
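Looking at the query as posted, gl_posted_date appears twice in the SELECT list (as the 11th and 27th item), which is a plausible source of the duplicate display name. Dropping or aliasing the second occurrence would likely resolve it; if it is dropped, the GROUP BY positions need adjusting to match. The short Python sketch below is only a helper for spotting such repeats, with the column names copied by hand from the query.

```python
from collections import Counter

# Column names copied from the SELECT list above (table prefix dropped).
select_columns = [
    "ledger_name", "company_code", "location_code", "cost_center",
    "account_number", "account_name", "product_code", "channel_code",
    "journal_name", "line_description", "gl_posted_date", "currency",
    "je_source", "je_category", "effective_date", "created_by",
    "invoice_num", "invoice_id", "invoice_date", "vendor_name",
    "vendor_number", "invoice_image", "po_number", "po_requestor",
    "period_name", "amount", "gl_posted_date", "project_code",
]

# Any name that occurs more than once would produce a repeated display name.
duplicates = [name for name, n in Counter(select_columns).items() if n > 1]
print(duplicates)  # ['gl_posted_date']
```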

This prime generating function using generateSequence in Kotlin is not easy to understand. :(

val primes = generateSequence(2 to generateSequence(3) {it + 2}) {
    val currSeq = it.second.iterator()
    val nextPrime = currSeq.next()
    nextPrime to currSeq.asSequence().filter { it % nextPrime != 0}
}.map {it.first}
println(primes.take(10).toList()) // prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
I tried to understand how this function works, but it is not easy for me.
Could someone explain how it works? Thanks.
It generates an infinite sequence of primes using the "Sieve of Eratosthenes" (see here: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes).
This implementation uses a sequence of pairs to do this. The first element of every pair is the current prime, and the second element is a sequence of the integers larger than that prime that are not divisible by it or by any previous prime.
It starts with the pair 2 to [3, 5, 7, 9, 11, 13, 15, 17, ...], which is given by 2 to generateSequence(3) { it + 2 }.
Using this pair, we create the next pair of the sequence by taking the first element of the sequence (which is now 3), and then removing all numbers divisible by 3 from the sequence (removing 9, 15, 21 and so on). This gives us this pair: 3 to [5, 7, 11, 13, 17, ...]. Repeating this pattern will give us all primes.
After creating a sequence of pairs like this, we are finally doing .map { it.first } to pick only the actual primes, and not the inner sequences.
The sequence of pairs will evolve like this:
2 to [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, ...]
3 to [5, 7, 11, 13, 17, 19, 23, 25, 29, ...]
5 to [7, 11, 13, 17, 19, 23, 29, ...]
7 to [11, 13, 17, 19, 23, 29, ...]
11 to [13, 17, 19, 23, 29, ...]
13 to [17, 19, 23, 29, ...]
// and so on
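For readers more comfortable with Python, the same idea can be sketched with generators. This is an analogue of the Kotlin code, not a translation of it: the function name primes and the use of itertools are choices made for this illustration. Each time a prime is taken from the front of the candidate stream, the rest of the stream is wrapped in one more filter.

```python
from itertools import count, islice

def primes():
    """Yield primes by repeatedly filtering a lazy stream of candidates."""
    yield 2
    candidates = count(3, 2)  # 3, 5, 7, 9, ... (the odd numbers)
    while True:
        prime = next(candidates)  # first survivor is the next prime
        yield prime
        # Drop multiples of this prime from everything that follows.
        # p=prime binds the current value; a bare closure would see later ones.
        candidates = filter(lambda n, p=prime: n % p != 0, candidates)

first_ten = list(islice(primes(), 10))
print(first_ten)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

As in the Kotlin version, every new prime adds another filter layer, so this is best read as an explanation of the mechanism rather than an efficient prime generator.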

Inserting new fields(columns) to mongoDB with pandas

I have existing data in MongoDB where the primary key is set on 'date', with a few fields in it.
I want to add a new pandas dataframe with new fields (columns) to the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is dataframe A that I have in MongoDB (I set the index to the 'date' field when loading the data from MongoDB)
And this is the new dataframe B I want to insert into MongoDB
And this is the final dataframe C with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window', added on the 'date' index), which is what I want to end up with in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)
The method you need is update_one() with upsert=True, called in a loop. insert_many() won't work for two reasons: first, you're not always inserting, sometimes you're updating; second, a bulk call such as update_many() applies one filter to all matching documents, whereas here each update needs its own filter because each row relates to a different date.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the manner you need. It uses iterrows() to get each row of the dataframe, filters on the date, and sets the values to those in the dataframe. The $set operator overwrites fields that already exist and creates those that don't. upsert=True performs an insert if there's no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd
# Sample data setup
db = MongoClient()['mydatabase']
data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])
data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])
# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
# Print the results
for record in db.mycollection.find():
    pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
'date': datetime.datetime(2017, 5, 19, 21, 20),
'std_500_1000window': 96,
'std_50_100window': 8,
'std_50_2000window': 98,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
'date': datetime.datetime(2017, 5, 19, 21, 21),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
'date': datetime.datetime(2017, 5, 19, 21, 22),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
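To preview what the combined documents should contain before writing anything to MongoDB, the two frames can also be joined on 'date' in pandas alone. This is only a sketch using the sample values from the worked example; it does not touch the database, and df_c and records are names introduced here for illustration.

```python
import datetime
import pandas as pd

dates = [datetime.datetime(2017, 5, 19, 21, minute) for minute in (20, 21, 22)]
df_a = pd.DataFrame({
    "date": dates,
    "std_500_1000window": [96, 95, 95],
    "std_50_100window": [8, 8, 8],
    "std_50_2000window": [98, 97, 97],
})
df_b = pd.DataFrame({
    "date": dates,
    "std_50_3000window": [98, 98, 98],
    "std_50_300window": [9, 9, 9],
    "std_50_500window": [10, 10, 10],
})

# An outer join keeps dates that appear in only one frame, similar in
# spirit to upsert=True (fields missing on one side become NaN here).
df_c = df_a.merge(df_b, on="date", how="outer")

# One dict per date, comparable to the documents printed above (minus _id).
records = df_c.to_dict("records")
print(records[0])
```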

Appending numpy arrays using numpy.insert

I have a numpy array (inputs) of shape (30,1). I want to insert a 31st value (e.g. x = 2). I am trying to use the np.insert function, but it gives me an out-of-bounds error.
np.insert(inputs,b+1,x)
IndexError: index 31 is out of bounds for axis 0 with size 30
Short answer: you need to insert it at index b, not b+1.
The index you pass to np.insert(..) is the position at which the element should be added. Note that indexes are zero-based, so in an array with 30 elements the last index is 29, and inserting at index 30 places the new element last:
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
>>> np.insert(a,30,42)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 42])
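One more detail for the (30,1) case in the question: without an axis argument, np.insert flattens the input first, so the result is one-dimensional. The sketch below uses np.arange as a stand-in for the real inputs array and shows both variants.

```python
import numpy as np

inputs = np.arange(30).reshape(30, 1)  # stand-in for the (30, 1) array
x = 2

# Without axis, the array is flattened before inserting: result is 1-D.
flat = np.insert(inputs, len(inputs), x)
print(flat.shape)  # (31,)

# With axis=0, a new row is added and the column dimension is kept.
kept = np.insert(inputs, len(inputs), x, axis=0)
print(kept.shape)  # (31, 1)
```

Using len(inputs) as the index avoids hard-coding 30 and always appends at the end.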