Convert monthly data to zoo type - zoo

I want to convert my monthly data to a zoo type. This is what my imported data looks like for 394 obs of 13 variables.
How can I properly do this with the zoo function? Please note I don't have any irregular time series in there.

Related

moving from tabular to graph representation of a given data

Suppose that I have the following data t:
activity  teacher  group  students  duration  subject
One       A        a      3         45        Math
One       B        b      2         45        Math
two       A        c      7         60        P.E
One       D        a      3         45        Math
two       C        c      7         60        P.E
I want to construct graph data from this tabular data. I am actually interested in predicting the teacher by applying some kind of graph ML. Is there a way to transform the tabular data into graph data, maybe using NetworkX?
I tried the following code
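import matplotlib.pyplot as plt
import networkx as nx
# df is the pandas DataFrame holding the table above
G = nx.from_pandas_edgelist(df, "subject", "teacher", edge_attr=True, create_using=nx.Graph())
nx.draw_networkx(G)
plt.show()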
The output looks like a graph, but I don't understand how it works, how I can get the new data, or what the best way to identify the nodes and the edges is.
Thank you in advance for any help.
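To make the question concrete, here is a minimal sketch of what that call builds, using the five rows from the table above. It only illustrates how from_pandas_edgelist maps rows to edges; it is not a recommendation for which columns should become nodes, and the variable names are illustrative.

```python
import networkx as nx
import pandas as pd

# The five rows from the question.
df = pd.DataFrame({
    "activity": ["One", "One", "two", "One", "two"],
    "teacher": ["A", "B", "A", "D", "C"],
    "group": ["a", "b", "c", "a", "c"],
    "students": [3, 2, 7, 3, 7],
    "duration": [45, 45, 60, 45, 60],
    "subject": ["Math", "Math", "P.E", "Math", "P.E"],
})

# Every row becomes one edge between its "subject" value and its "teacher" value.
# The unique values of those two columns become the nodes, and edge_attr=True
# stores all remaining columns as attributes on the edge.
G = nx.from_pandas_edgelist(
    df, "subject", "teacher", edge_attr=True, create_using=nx.Graph()
)

print(sorted(G.nodes()))      # node identifiers
print(G["Math"]["A"])         # attribute dict of the Math-A edge

# Convert the graph back to a table of edges and their attributes.
edges_df = nx.to_pandas_edgelist(G)
print(edges_df)
```

Because nx.Graph() allows only one edge per node pair, rows that repeat the same subject and teacher would be merged into a single edge whose attributes come from the last such row.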

Encode all data in one column and assign the same code if data has a same value

I have a dataframe with approximately 100 columns and 20000 rows. I want to encode one categorical column so that it holds numerical codes. Its value counts look something like this:
df['name'].value_counts()
aaa 650
baa 350
cad 50
dae 10
ef3 1
....
There are about 3300 unique values in total, so the codes might range from 1 to 3300. I will normalize the numerical codes before training on them. As the dataset already has many columns, I would prefer not to use one-hot encoding. How can I do this? Thank you!
You can enumerate each group using ngroup(). It would look something like:
df.assign(num_code=lambda x: x.groupby(['name']).ngroup())
I don't know what kind of information the column contains, but I am not sure it makes sense to assign an incremental numerical code to a column that seems to be categorical when training models.
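As a small illustration of the ngroup() suggestion, the sketch below uses a made-up five-row name column. It also shows pd.factorize as an alternative that numbers values in order of first appearance. Both start at 0, so add 1 if codes from 1 upward are needed.

```python
import pandas as pd

# Toy stand-in for the real column (values are made up for illustration).
df = pd.DataFrame({"name": ["baa", "aaa", "baa", "cad", "aaa"]})

# ngroup() numbers the groups in sorted key order: aaa -> 0, baa -> 1, cad -> 2.
df["num_code"] = df.groupby("name").ngroup()

# pd.factorize numbers values in order of first appearance: baa -> 0, aaa -> 1, cad -> 2.
codes, uniques = pd.factorize(df["name"])
df["factor_code"] = codes

print(df)
```

Identical names always receive the same code with either approach; only the numbering order differs.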

Multiple Object Tracking (MOT) benchmark data-set format for ground truth tracking

I am trying to evaluate the performance of my object detection and tracking on the 2DMOT Challenge 2015, the standard dataset used in the industry. I have downloaded the dataset, but I cannot make sense of the data fields in the labelled ground truth data.
I understand the first six columns, but not the remaining four. The following sample is from the directory <\2DMOT2015\train\ETH-Bahnhof\gt>:
frame no. object_id bb_left bb_top bb_width bb_height (?) (?) (?) (?)
1 1 212 204 20 57 0 -3.1784 16.34 0.45739
1 2 223 181 36 104 1 -1.407 9.0212 0.68774
Please let me know if you know what these columns mean.
The last three fields represent the 3D real-world coordinates of the objects. The same data structure is used for the ETH-Bahnhof, ETH-Sunnyday, PETS09-S2L1 and TUD-Stadtmitte videos in 2DMOT2015. For ground truth, score=1. Sometimes it varies between 0 and 1; in that case it acts as a flag value, and a zero means that the line should not be considered for evaluation. So the data fields are in the format:
frame no. , object_id , bb_left , bb_top , bb_width , bb_height , score, X, Y, Z
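A minimal sketch of reading lines in this format with pandas follows. It assumes the gt file is comma-separated, as the format line above suggests, and the column names are my own labels for the ten fields rather than names taken from the file.

```python
import io

import pandas as pd

# Column labels are illustrative; the file itself has no header row.
COLUMNS = [
    "frame", "object_id", "bb_left", "bb_top", "bb_width", "bb_height",
    "score", "x", "y", "z",
]

# The two sample rows from the question, written comma-separated (an assumption).
sample = """1,1,212,204,20,57,0,-3.1784,16.34,0.45739
1,2,223,181,36,104,1,-1.407,9.0212,0.68774
"""

gt = pd.read_csv(io.StringIO(sample), header=None, names=COLUMNS)

# Drop lines whose score/flag is 0, i.e. those not considered for evaluation.
evaluated = gt[gt["score"] != 0]
print(evaluated)
```

For a real file, replace io.StringIO(sample) with the path to the gt file.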

Querying a DataFrame values from a specific year

I have a pandas dataframe, created from weather data, that shows the high and low temperatures by day from 2005 to 2015. I want to query the dataframe so that it only shows the rows from 2015. Is there any way to do this without first converting the datetime values to show only the year (i.e. without applying strftime('%y') first)?
DataFrame Creation:
df=pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
df['Date']=pd.to_datetime(df.Date)
df['Date'] = df['Date'].dt.strftime('%m-%d-%y')
Attempt to Query:
daily_df=df[df['Date']==datetime.date(year=2015)]
Error: asks for a month and year to be specified.
Data:
An NOAA dataset has been stored in the file data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv. The data for this assignment comes from a subset of The National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). The GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.
Each row in the assignment datafile corresponds to a single observation.
The following variables are provided to you:
id : station identification code
date : date in YYYY-MM-DD format (e.g. 2012-01-24 = January 24, 2012)
element : indicator of element type
TMAX : Maximum temperature (tenths of degrees C)
TMIN : Minimum temperature (tenths of degrees C)
value : data value for element (tenths of degrees C)
Image of DataFrame:
I resolved this by adding a column with just the year and then querying on it, but is there a better way to do this?
df['Date']=pd.to_datetime(df['Date']).dt.strftime('%d-%m-%y')
df['Year']=pd.to_datetime(df['Date']).dt.strftime('%y')
daily_df = df[df['Year']=='15']
return daily_df
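One possible way to avoid the extra string column is to keep Date as a datetime and compare its .dt.year accessor directly. The sketch below uses a small made-up frame; the Value column and its contents are hypothetical and do not come from the assignment file.

```python
import pandas as pd

# Made-up stand-in for the weather data.
df = pd.DataFrame({
    "Date": ["2014-12-31", "2015-01-01", "2015-06-15", "2005-03-02"],
    "Value": [22, -5, 310, 17],
})

# Keep the column as datetime64 rather than formatting it back to strings.
df["Date"] = pd.to_datetime(df["Date"])

# .dt.year extracts the year as an integer, so no helper column is needed.
daily_df = df[df["Date"].dt.year == 2015]
print(daily_df)
```

This only works while Date is still a datetime column. After dt.strftime(...) the values are plain strings and the .dt accessor is no longer available.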

Dendrograms with SciPy

I have a dataset that I shaped according to my needs, the dataframe is as follows:
Index A B C D ..... Z
Date/Time 1 0 0 0,35 ... 1
Date/Time 0,75 1 1 1 1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example, the whole A column compared with the whole B column over the entire time range).
I am expecting an output like this:
(source: rsc.org)
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I print the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with each other and plot that, but then the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
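One thing that may help here: hierarchy.linkage treats each row of its input as one observation, so clustering the columns means passing the transposed data. The sketch below uses random numbers in place of the real dataframe, with hypothetical column names A to E, and assumes the values are already floats (values written with decimal commas such as 0,35 would need converting first).

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Random stand-in for the real data: 50 time points, 5 series.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 5)), columns=list("ABCDE"))

# Transpose so that each column's whole time series is one observation.
Z = hierarchy.linkage(df.T.to_numpy(), method="ward")

# no_plot=True returns the tree layout without drawing; drop it (and call
# matplotlib's plt.show()) to display the dendrogram.
tree = hierarchy.dendrogram(Z, labels=df.columns.tolist(), no_plot=True)
print(tree["ivl"])  # leaf labels in plotting order
```

With 8878 rows and 26 columns this produces a tree with 26 leaves, one per column, rather than one leaf per time point.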