regex text parser - pandas

I have a dataframe like this:
ID Series
1102 [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]
1500 [('forgot data pages info', 0, 22, 'NP')]
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]
I am trying to parse the text in the column named Series into separate columns named Series1, Series2, etc., up to the highest number of texts parsed. My attempt so far:
df_parsed = df['Series'].str[1:-1].str.split(', ', expand=True)
The result I want looks like this:
ID Series Series1 Series2 Series3
1102 [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')] taxi instructions consistent basis the atc taxi clearance
1500 [('forgot data pages info', 0, 22, 'NP')] forgot data pages info
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')] hud correctly fotr approach

The format of your final result is not easy to understand, but you may be able to follow this concept to create your new columns:
def process(ls):
    return ' '.join([x[0] for x in ls])
df['Series_new'] = df['Series'].apply(process)
If you want to create N new columns, where N is the length of the longest list in Series, calculate N first. Then follow the concept above and pad the shorter rows with NaN to create the N new columns.
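One possible way to build the N columns is sketched below. It assumes the Series column may hold either real lists of tuples or their string representation (the question's use of .str suggests strings), so it parses strings with ast.literal_eval first. The helper name first_items and the sample frame are only illustrative.

```python
import ast
import pandas as pd

# Sample frame rebuilt from the question; here Series holds strings.
df = pd.DataFrame({
    "ID": [1102, 1500, 649],
    "Series": [
        "[('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), "
        "('the atc taxi clearance', 89, 111, 'NP')]",
        "[('forgot data pages info', 0, 22, 'NP')]",
        "[('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]",
    ],
})

def first_items(value):
    """Return the first element of every tuple (hypothetical helper)."""
    # Assumption: strings are valid Python literals; real lists pass through.
    items = ast.literal_eval(value) if isinstance(value, str) else value
    return [item[0] for item in items]

texts = df["Series"].apply(first_items)

# Building a DataFrame from ragged lists pads the shorter rows with missing values,
# so the number of columns equals the length of the longest list.
expanded = pd.DataFrame(texts.tolist(), index=df.index)
expanded.columns = [f"Series{i + 1}" for i in range(expanded.shape[1])]

result = df.join(expanded)
print(result.drop(columns="Series"))
```

Rows with fewer texts than the longest row end up with missing values in the trailing columns, which pd.isna detects.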

Radial Interpolation of a DataFrame

I have a dataframe of shape (120, 238), with 12 values spread across it. I am trying to use radial interpolation to fill in the remaining empty points. For that I created a list with the coordinates of the points, and another list with the values of each of these points.
col = []
for i in range(238):
    col.append('')
df_map = pd.DataFrame(columns = col, index = range(120))
x_rbf = [8, 227, 19, 116, 11, 223, 5, 231, 116, 116, 13, 222] #x represents the columns
y_rbf = [59, 59, 102, 111, 17, 17, 9, 9, 62, 17, 7, 7] #y represents the rows
z_rbf = [16.2,15.99,16.2,16.3,15.7,15,14.2,14.2,16.4,16.4,13,11]
y = x_rbf, y_rbf
f = scipy.interpolate.RBFInterpolator(y,z_rbf)
However, when I run this code, I get the following error:
ValueError: Expected the first axis of `d` to have length 2.
Does anyone know how to go around this?
After countless tries, I figured out the issue with using RBFInterpolator. It expects the coordinates as a single array of shape (n_points, 2), one row per point, whereas the tuple (x_rbf, y_rbf) has shape (2, 12). The x and y coordinates have to be flattened (using np.ravel()) and then stacked into one array:
import numpy as np
import pandas as pd
import scipy.interpolate
col = [''] * 238
df_map = pd.DataFrame(columns=col, index=range(120))
x_rbf = np.array([8, 227, 19, 116, 11, 223, 5, 231, 116, 116, 13, 222])  # x represents the columns
y_rbf = np.array([59, 59, 102, 111, 17, 17, 9, 9, 62, 17, 7, 7])  # y represents the rows
z_rbf = np.array([16.2, 15.99, 16.2, 16.3, 15.7, 15, 14.2, 14.2, 16.4, 16.4, 13, 11])
sp = np.stack([y_rbf.ravel(), x_rbf.ravel()], -1)  # shape (12, 2): one (row, column) pair per point
f = scipy.interpolate.RBFInterpolator(sp, z_rbf.ravel(), kernel='linear')
It should work this way.
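To actually fill the remaining points, the fitted interpolator still has to be evaluated on every (row, column) position of the grid. The following is a minimal sketch of that step, assuming the same 120 x 238 layout and the (row, column) ordering used when stacking the points. The names grid_points and grid are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import RBFInterpolator

x_rbf = np.array([8, 227, 19, 116, 11, 223, 5, 231, 116, 116, 13, 222])  # columns
y_rbf = np.array([59, 59, 102, 111, 17, 17, 9, 9, 62, 17, 7, 7])         # rows
z_rbf = np.array([16.2, 15.99, 16.2, 16.3, 15.7, 15, 14.2, 14.2, 16.4, 16.4, 13, 11])

# Known points as an (n_points, 2) array of (row, column) pairs.
points = np.stack([y_rbf, x_rbf], axis=-1)
f = RBFInterpolator(points, z_rbf, kernel='linear')

# Every (row, column) position of the 120 x 238 grid, in the same ordering.
rows, cols = np.mgrid[0:120, 0:238]
grid_points = np.stack([rows.ravel(), cols.ravel()], axis=-1)

# Evaluate the interpolator and reshape back to the grid layout.
grid = f(grid_points).reshape(120, 238)
df_map = pd.DataFrame(grid)
```

With the default smoothing of 0, the interpolated surface passes through the 12 known values, so df_map.iloc[59, 8] should come out very close to 16.2.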

How to keep some specific values in a row along with NaN values?

n_month = [12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132]
df1 = df.loc[df["nth month"].isin(n_month)]
Along with the values given in n_month, I also want to keep the rows where the nth month column is NaN. How can I include NaN as well? Please suggest.
The first idea is to chain Series.isna with | for a bitwise OR:
df = pd.DataFrame({ 'nth month':[12,3,108,7,np.nan,0]})
n_month = [12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132]
df1 = df.loc[df["nth month"].isin(n_month) | df["nth month"].isna()]
Or add missing values to your list:
df1 = df.loc[df["nth month"].isin(n_month + [np.nan])]
print(df1)
   nth month
0       12.0
2      108.0
4        NaN

Inserting new fields(columns) to mongoDB with pandas

I have existing data in MongoDB where the primary key is set on 'date', with a few fields in it.
I want to insert a new pandas dataframe with new fields (columns) into the existing data in MongoDB, joining on the 'date' field, which exists in both dataframes.
For example, let's say this is dataframe A, which I have in my MongoDB (I set the index to the 'date' field when loading the data from MongoDB).
And this is the new dataframe B I want to insert into MongoDB.
And this is the final dataframe C with the new fields ('std_50_3000window', 'std_50_300window', 'std_50_500window') added on the 'date' index, which is what I want to have in MongoDB.
Is there any way to do this? (Maybe with the insert_many method?)
The method you need is update_one() with upsert=True in a loop. You can't use insert_many() for two reasons: first, you're not always inserting, sometimes you are updating; second, update_many() applies one filter to every document it updates (and insert_many() takes no filter at all), whereas in your case each filter is different because each update relates to a different time.
This is a generic solution that will combine dataframes (df_a and df_b in this case; you can have as many as you like) in the manner that you need. It uses iterrows to get each row of the dataframe, filters on the date, and sets the values to those in the dataframe. The $set operator overrides values that are already there and sets them if they are not. upsert=True performs an insert if there's no match on the date.
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
Full worked example:
from pymongo import MongoClient
from pprint import pprint
import datetime
import pandas as pd
# Sample data setup
db = MongoClient()['mydatabase']
data_a = [[datetime.datetime(2017, 5, 19, 21, 20), 96, 8, 98],
          [datetime.datetime(2017, 5, 19, 21, 21), 95, 8, 97],
          [datetime.datetime(2017, 5, 19, 21, 22), 95, 8, 97]]
df_a = pd.DataFrame(data_a, columns=['date', 'std_500_1000window', 'std_50_100window', 'std_50_2000window'])
data_b = [[datetime.datetime(2017, 5, 19, 21, 20), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 21), 98, 9, 10],
          [datetime.datetime(2017, 5, 19, 21, 22), 98, 9, 10]]
df_b = pd.DataFrame(data_b, columns=['date', 'std_50_3000window', 'std_50_300window', 'std_50_500window'])
# Perform the upserts
for df in [df_a, df_b]:
    for _, row in df.iterrows():
        db.mycollection.update_one({'date': row.get('date')}, {'$set': row.to_dict()}, upsert=True)
# Print the results
for record in db.mycollection.find():
    pprint(record)
Result:
{'_id': ObjectId('5f0ae909df5531ac655ce528'),
'date': datetime.datetime(2017, 5, 19, 21, 20),
'std_500_1000window': 96,
'std_50_100window': 8,
'std_50_2000window': 98,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52a'),
'date': datetime.datetime(2017, 5, 19, 21, 21),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
{'_id': ObjectId('5f0ae909df5531ac655ce52c'),
'date': datetime.datetime(2017, 5, 19, 21, 22),
'std_500_1000window': 95,
'std_50_100window': 8,
'std_50_2000window': 97,
'std_50_3000window': 98,
'std_50_300window': 9,
'std_50_500window': 10}
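
To see why $set combined with upsert=True produces merged records like the ones above, the behaviour can be mimicked in plain Python with a dict keyed by date. This is only an illustration and not pymongo: fake_update_one and store are hypothetical names, and the rows are a shortened copy of the sample data.

```python
import datetime

def fake_update_one(store, date, fields):
    """Mimic update_one({'date': date}, {'$set': fields}, upsert=True) on a dict."""
    record = store.setdefault(date, {})  # upsert: create the record if nothing matches
    record.update(fields)                # $set: overwrite existing keys, add missing ones

t0 = datetime.datetime(2017, 5, 19, 21, 20)
t1 = datetime.datetime(2017, 5, 19, 21, 21)

rows_a = [
    {'date': t0, 'std_500_1000window': 96, 'std_50_100window': 8, 'std_50_2000window': 98},
    {'date': t1, 'std_500_1000window': 95, 'std_50_100window': 8, 'std_50_2000window': 97},
]
rows_b = [
    {'date': t0, 'std_50_3000window': 98, 'std_50_300window': 9, 'std_50_500window': 10},
    {'date': t1, 'std_50_3000window': 98, 'std_50_300window': 9, 'std_50_500window': 10},
]

store = {}  # stand-in for the collection, keyed by 'date'
for rows in (rows_a, rows_b):
    for row in rows:
        fake_update_one(store, row['date'], row)

for record in store.values():
    print(record)
```

The first pass inserts one record per date; the second pass finds each date already present and only adds the new fields, which is the same merge the real upserts perform.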

Why extreme large value to 0 frequency fft (numpy.fft.fft method)

I have a signal ts which has a mean of roughly 40, and I applied an FFT to it with this code:
from numpy import array
from numpy.fft import fft, fftfreq, fftshift
ts = array([25, 40, 30, 40, 29, 48, 36, 32, 34, 38, 15, 33, 40, 32, 41, 25, 37,49, 41, 35, 23, 22, 36, 44, 28, 36, 32, 37, 39, 51])
index = fftshift(fftfreq(len(ts)))
ft_ts = fftshift(fft(ts))
Output:
ft_ts = array([ -76.00000000 +8.34887715e-14j, -57.72501110 +1.17054586e+01j,
7.69492662 +9.79582336e+00j, -29.11145618 -7.22493645e+00j,
14.92140414 +4.58471353e+01j, -26.00000000 -4.67653718e+01j,
-39.61803399 -2.83601821e+01j, -11.34044003 +8.66215368e+00j,
23.68703939 +1.57391882e+01j, -64.88854382 -2.44499549e+01j,
50.00000000 -3.98371686e+01j, 4.09382150 -6.27663403e+00j,
-37.38196601 -3.06708342e+01j, 35.97162964 +1.31929223e+01j,
18.69662985 -2.20453671e+00j, 1048.00000000 +0.00000000e+00j,
18.69662985 +2.20453671e+00j, 35.97162964 -1.31929223e+01j,
-37.38196601 +3.06708342e+01j, 4.09382150 +6.27663403e+00j,
50.00000000 +3.98371686e+01j, -64.88854382 +2.44499549e+01j,
23.68703939 -1.57391882e+01j, -11.34044003 -8.66215368e+00j,
-39.61803399 +2.83601821e+01j, -26.00000000 +4.67653718e+01j,
14.92140414 -4.58471353e+01j, -29.11145618 +7.22493645e+00j,
7.69492662 -9.79582336e+00j, -57.72501110 -1.17054586e+01j])
At 0 frequency, ft_ts has a value of 1048. Shouldn't that be the mean of my original signal ts, which is about 40? What happened here?
Many thanks
The FFT is not normalized, so the zero-frequency term is the sum of the signal, not the mean.
With the DFT defined as X_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N), which is the convention numpy.fft.fft uses, you can see that when k=0 the exponential term is 1, and you just get the sum of x_n.
This is why the first item in fft(np.ones(10)) is 10, not 1: 1 is the mean (since it's an array of ones), and 10 is the sum.
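A quick check with the signal from the question, as a minimal sketch: the zero-frequency term matches the sum of ts, and dividing it by the number of samples recovers the mean. Note that the sum is 1048 over 30 samples, so the mean of this particular signal is about 34.9 rather than 40. The variable names dc and n are illustrative.

```python
import numpy as np

ts = np.array([25, 40, 30, 40, 29, 48, 36, 32, 34, 38, 15, 33, 40, 32, 41,
               25, 37, 49, 41, 35, 23, 22, 36, 44, 28, 36, 32, 37, 39, 51])

ft = np.fft.fft(ts)
n = len(ts)

# Without fftshift, the zero-frequency (DC) term is the first element.
dc = ft[0].real

print("DC term:      ", dc)
print("sum of signal:", ts.sum())
print("DC term / N:  ", dc / n)
print("mean of signal:", ts.mean())
```

After fftshift the same term sits in the middle of the array (index 15 for 30 samples), which is where the 1048 appears in the output above.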

Chart Axes in VB.NET

My requirement is to graph (scatter graph) data from 2 arrays. I can now connect the data from the array and use it on the chart. My question is, how do I set the graph's X- and Y- axes to show consistency in their intervals?
For example, I have points from X = {1, 3, 4, 6, 8, 9} and Y = {7, 10, 11, 15, 18, 19}. What I would like to see is that these points are graphed in a scatter manner, but, the intervals for x-axis should be (intervals of) 2 up to 10 (such that it will show 0, 2, 4, 6, 8, 10 on x-axis) and intervals of 5 for the y-axis (such that it will show 5, 10, 15, 20 on y-axis). What code/property should I use/manipulate?
ADDED PART:
I currently have this data:
x_column = {12, 24, 1, 7, 29, 28, 25, 24, 15, 19}
y_column = {3, 5, 8, 3, 3, 3, 3, 3, 19, 15}
Each y_column element is paired with the corresponding x_column element.
Now, I want MyChart to display a scatter graph of the x_column and y_column data in such a way that the x-axis will show 5, 10, 15, 20, 25, 30 and the y-axis will show 2, 4, 6, 8, 10, 12, 14, 16, 18, 20.
My current code is:
' add points
MyChart.Series("Scatter Plot").Points.DataBindXY(x_Column, y_Column)
The code above only adds points.
Try:
Chart1.ChartAreas("Default").AxisX.Interval = 2
Chart1.ChartAreas("Default").AxisY.Interval = 5