How to merge several time series datasets into a data frame and then use cross-validation on it? - pandas

Hi, I have 50 time series datasets in 50 .CSV files, and each of them has 2 columns, data and label, exactly like below. Each of these .CSV files contains more than 800,000 rows of signal records, and none of them is the same length as another.
How can I merge these .CSV files into a data frame in order to select training and testing data with cross-validation? Because I am working with an RNN, the sequence of the data is important. (A sketch of one possible approach follows the sample data below.)
DATA OF CASE 1:
Data label
23 A
88 A
56 B
87 B
56 c
87 D
17 C
44 B
12 A
----------------
DATA OF CASE 2:
Data label
13 A
98 B
56 B
77 C
49 D
89 c
19 B
23 B
32 A
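A minimal sketch of one way to set this up, assuming the files are named case_1.csv through case_50.csv and have the two columns shown (the file pattern and column names are placeholders): tag each row with the case it came from, concatenate everything, and cross-validate over whole cases so every sequence stays intact.
import glob

import pandas as pd
from sklearn.model_selection import GroupKFold

# Read every case file and tag its rows with a case id so the record
# order inside each case survives the concatenation.
frames = []
for case_id, path in enumerate(sorted(glob.glob("case_*.csv"))):  # placeholder file pattern
    case_df = pd.read_csv(path)  # columns: Data, label
    case_df["case"] = case_id
    frames.append(case_df)

df = pd.concat(frames, ignore_index=True)

# An RNN needs unbroken sequences, so split by whole cases rather than
# by shuffled rows: GroupKFold keeps all rows of a case together.
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(df, groups=df["case"]):
    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]
    # ...fit the RNN on train_df, evaluate on test_df...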

Related

SQL: Subtracting certain rows with restrictions from a data table into a new table

I have a data table in PostgreSQL which has these columns and some rows, like this:
st  epochnum  satnum  l1  l2  c1  p1  p2
1   1         1       10  11  12  13  14
1   1         2       15  16  17  18  19
1   2         1       20  21  22  23  24
1   2         2       25  26  27  28  29
20  1         1       30  41  52  63  74
20  1         2       75  76  87  88  null
20  2         1       ...
I want to get pairs of rows that have the same values for epochnum and satnum but different values of "st". I also have a list that specifies which "st" pairs should be subtracted; it is just another table that looks like this:
st1  st2
1    20
The rows in the first table have to be subtracted in l1, l2, c1, p1 and p2 for matching epochnum and satnum according to this table, and then inserted into a new table like this:
epochnum  st1  st2  satnum  dl1  dl2  dc1  dp1  dp2
1         1    20   1       20   30   40   50   60
1         1    20   2       65   65   75   75   null
...
The actual data has more than 400,000 rows with matching epochnums and satnums like this. I have tried Java programming in NetBeans, using loops that issue a query for each row to build the new table.
But I think that is inefficient and takes an unnecessarily long time because of the number of queries that have to be issued from Java.
I wonder if there is a way this can be done with just a few queries, or by creating extra tables. I haven't come up with a good solution yet.
Are you looking for joins like this?
select t1.epochnum, p.st1, p.st2, t1.satnum,
       (t2.l1 - t1.l1) as dl1,
       (t2.l2 - t1.l2) as dl2,
       (t2.c1 - t1.c1) as dc1,
       (t2.p1 - t1.p1) as dp1,
       (t2.p2 - t1.p2) as dp2
from t t1 join
     t t2
     on t1.epochnum = t2.epochnum and
        t1.satnum = t2.satnum join
     pairs p
     on t1.st = p.st1 and t2.st = p.st2

Keep all columns after sum and groupby, including empty values

I have the following dataframe:
source name cost other_c other_b
a      a    7    dd      33
b      a    6    gg      44
c      c    3    ee      55
b      a    2
d      b    21   qw      21
e      a    16   aq
c      c    10           55
I am doing a sum of name and source with:
new_df = df.groupby(['source', 'name'], as_index=False)['cost'].sum()
but it is dropping the remaining columns in my dataframe. Is there a way to keep the rest of the columns? I'm not looking to add a new column, just to carry over the columns from the original dataframe.
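One answer-style sketch, assuming the frame shown above; keeping the first value per group for the extra columns is an assumption here, since the question does not say how they should be combined.
import pandas as pd

df = pd.DataFrame({
    "source":  ["a", "b", "c", "b", "d", "e", "c"],
    "name":    ["a", "a", "c", "a", "b", "a", "c"],
    "cost":    [7, 6, 3, 2, 21, 16, 10],
    "other_c": ["dd", "gg", "ee", None, "qw", "aq", None],
    "other_b": [33, 44, 55, None, 21, None, 55],
})

# Option 1: keep every original row (and every column) and attach the
# per-group total as a new column.
df["cost_sum"] = df.groupby(["source", "name"])["cost"].transform("sum")

# Option 2: collapse to one row per group, summing cost and carrying
# the first non-null value of each remaining column.
new_df = df.groupby(["source", "name"], as_index=False).agg(
    {"cost": "sum", "other_c": "first", "other_b": "first"}
)
print(new_df)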

Why do the main table and temporary table give different results?

I have one temporary external table and have put the data from HDFS into this table. Now I am inserting the same data into my partitioned main external table. The data gets inserted successfully, but when I query the main table by column I get different values for the columns.
I have loaded the data into my temporary table using a CSV file that contains four fields.
col1=id
col2=visitDate
col3=comment
col4=age
Below are the queries and their results:
Temporary table:
create external table IF NOT EXISTS incr_dummy1(id string, visitDate string, comment string, age string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;
Main table:
create external table IF NOT EXISTS dummy1(id string,comment string)
PARTITIONED BY (visitDate string, age string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
;
Result:
Temporary table:
select * from incr_dummy1;
1 11 a 20
2 12 b 3
1 13 c 34
4 14 d 23
5 15 e 45
6 16 f 65
7 17 g 78
8 18 h 9
9 19 i 12
10 20 j 34
select visitDate,age from incr_dummy1;
11 20
12 3
13 34
14 23
15 45
16 65
17 78
18 9
19 12
20 34
Main Table:
select * from dummy1;
1 11 a 20
2 12 b 3
1 13 c 34
4 14 d 23
5 15 e 45
6 16 f 65
7 17 g 78
8 18 h 9
9 19 i 12
10 20 j 34
select visitDate,age from dummy1;
a 20
b 3
c 34
d 23
e 45
f 65
g 78
h 9
i 12
j 34
So in the main external table above, the value of the "comment" column comes back when I query the "visitDate" column.
Please let me know what mistake I'm making here.
As I can see, the column orders are not the same in the temporary and final tables.
While inserting data from the temporary table into the final table, check that the columns are in the correct order in the select statement (partition columns need to be at the end of the select list).
hive> insert into dummy1 partition(visitDate,age) select id,comment,visitDate,age from incr_dummy1;
Just in case you are still having issues, it is better to check the following.
Since you have an external partitioned table (when we drop the table, the data will not be dropped on HDFS), check whether the HDFS directory has any extra files that have not been deleted.
Then drop the table, delete the HDFS directory, create the table again, and run your job again.
Update:
Option 1:
If it is possible to match the column order in the temporary table with the final table, then change the order of the columns.
Option 2:
Use a subquery with a quoted identifier to exclude the original columns and get only the alias columns into the final select query.
hive> set hive.support.quoted.identifiers=none;
hive> insert into dummy1 partition(visitDate,age)
select `(visitDate|age)?+.+` from -- exclude the original visitDate, age columns
(select *, visitDate vis_dat, age age_n from incr_dummy1) t;

Pandas pivot_table -- index values extended through rows

I'm trying to tidy some data, specifically by taking two columns, "measure" and "value", and making new columns for each unique value of measure.
So far I have some Python (3) code that reads in the data and pivots it into roughly the form that I want. This code looks like so:
import pandas as pd
#Load the data
df = pd.read_csv(r"C:\Users\User\Documents\example data.csv")
#Pivot the dataframe
df_pivot = df.pivot_table(index=['Geography Type', 'Geography Name',
                                 'Week Ending', 'Item Name'],
                          columns='Measure', values='Value')
print(df_pivot.head())
This outputs:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Item B 95 37 17
1/8/2018 Item A 92 8 32
Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
This is almost perfect, but for my work I need to load this file into other software, and for that software to read the data correctly it needs a value in every row. So I need the index values to extend down through the rows, like so:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Type 1 Total US 1/1/2018 Item B 95 37 17
Type 1 Total US 1/8/2018 Item A 92 8 32
Type 1 Total US 1/8/2018 Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
and so on.
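A minimal sketch of one way to get there, continuing from the code above (the output path is a placeholder): the blank cells are just pandas' sparse display of a MultiIndex, so converting the index levels back into ordinary columns makes every row carry its full labels.
# Turn the four index levels back into regular columns; each row then
# carries its full Geography Type / Geography Name / Week Ending /
# Item Name values.
df_flat = df_pivot.reset_index()

# Written out this way, the repeated values appear on every row.
df_flat.to_csv(r"C:\Users\User\Documents\example data tidy.csv", index=False)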

Repetition of columns when using a join between two tables

Using a select query in Postgres that joins 8 or 9 tables, I get output like this:
1. A 2 34
2. A 2 56
3. B 3 34
4. B 3 56
whereas I require the output in one of two forms, either
1. A 2 34
2. A 2 34
3. B 3 56
4. B 3 56
or
A 2 34
B 3 56
What can I do?
Have you tried using distinct?
select distinct * from table