I have a dataframe containing minute-level values that looks like the one below:
+---------------------+-------+
| Timestamp | Value |
+---------------------+-------+
| 2018-01-01 00:00:00 | 5 |
| 2018-01-01 00:01:00 | 7 |
| 2018-01-01 00:02:00 | 9 |
| 2018-01-01 00:03:00 | 0 |
| 2018-01-01 00:04:00 | 5 |
| 2018-01-01 00:05:00 | 8 |
| ... | ... |
| ... | ... |
| 2018-12-31 23:58:00 | 8 |
| 2018-12-31 23:59:00 | 7 |
+---------------------+-------+
I'd like to save it as a partitioned parquet file so that I can optimize file read.
Later I'll need to select the data for a given duration, for example 2018-01-05 00:00:00 to 2018-01-06 00:00:00. I was thinking I could partition this data on year, month, day, and hour values as below:
from pyspark.sql.functions import col, to_timestamp, date_format

df_final = df.withColumn("TimeStamp", to_timestamp(col('TimeStamp'), 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn("year", date_format(col("TimeStamp"), "yyyy")) \
    .withColumn("month", date_format(col("TimeStamp"), "MM")) \
    .withColumn("day", date_format(col("TimeStamp"), "dd")) \
    .withColumn("hour", date_format(col("TimeStamp"), "HH"))
This creates a folder structure like this in the resultant parquet
└── YYYY
└── MM
└── DD
└── HH
But does this partitioning help with read optimization? Also, I see that the resulting partitioned parquet output is 10x larger on disk than the unpartitioned file.
What is the best way to partition this file so that I can fetch data for a given duration faster?
Generally, when partitioning at multiple levels, check the number of records that ends up in the last level; if it is very small, it is better to stop at the previous partition level and use that for your requirements.
Splitting the files across many partition levels can be over-optimization, leading to a large number of small files and extra disk reads, losing Parquet's columnar benefits.
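As a sketch of one coarser alternative (paths are placeholders and a SparkSession named spark is assumed): partition on a single daily date column and rely on partition pruning at read time, so a one-day query only touches one or two directories.

from pyspark.sql.functions import col, to_date, to_timestamp

# Derive one daily partition column instead of year/month/day/hour.
df_final = (
    df.withColumn("TimeStamp", to_timestamp(col("TimeStamp"), "yyyy-MM-dd HH:mm:ss"))
      .withColumn("date", to_date(col("TimeStamp")))
)

# Roughly 365 partitions for one year of minute data, each still a reasonably sized file.
df_final.write.partitionBy("date").parquet("/data/minute_values_daily")

# Filtering on the partition column lets Spark prune directories, so only the
# requested days are read; the timestamp filter then trims to the exact range.
subset = (
    spark.read.parquet("/data/minute_values_daily")
         .filter(col("date").between("2018-01-05", "2018-01-06"))
         .filter((col("TimeStamp") >= "2018-01-05 00:00:00") &
                 (col("TimeStamp") <= "2018-01-06 00:00:00"))
)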
I have one dataset with production reported every week and another with production reported every hour for a number of sub-productions. I would now like to compare the sum of all this hourly sub-production with the value reported every week, in the most efficient way. How could I achieve this? I would like to avoid a for loop at all costs, as my dataset is really large.
So my dataset looks like this:
Weekly reported data:
Datetime_text | Total_Production_A
--------------------------|--------------------
2014-12-08 00:00:00.000 | 8277000
2014-12-15 00:00:00.000 | 8055000
2014-12-22 00:00:00.000 | 7774000
Hourly data:
Datetime_text | A_Prod_1 | A_Prod_2 | A_Prod_3 | ...... | A_Prod_N |
--------------------------|-----------|-----------|-----------|-----------|-----------|
2014-12-06 23:00:00.000 | 454 | 9 | 54 | 104 | 4 |
2014-12-07 00:00:00.000 | 0 | NaV | 0 | 23 | 3 |
2014-12-07 01:00:00.000 | 54 | 0 | 4 | NaV | 20 |
and so on. I would like a new table where the difference between the weekly reported data and the hourly reported data is calculated for all dates of the weekly reported data. So something like this:
Datetime_text | Diff_Production_A
--------------------------|------------------
2014-12-08 00:00:00.000 | 10
2014-12-15 00:00:00.000 | -100
2014-12-22 00:00:00.000 | 1350
where Diff_Production_A = Total_Production_A - sum(A_Prod_1, A_Prod_2, A_Prod_3, ..., A_Prod_N) over all datetimes of a week. How can I best achieve this?
Any help in this regard would be greatly appreciated :D
Best
fidu13
Store datetime as pd.Timestamp, then you can do all kinds of manipulation on the dates.
For your problem, the key is to group the hourly data by week (W-MON labels each group by the Monday that closes it), then merge the result with the weekly data and calculate the differences:
weekly["Datetime"] = pd.to_datetime(weekly["Datetime_Text"])
hourly["Datetime"] = pd.to_datetime(hourly["Datetime_Text"])
hourly["HourlyTotal"] = hourly.loc[:, "A_Prod_1":"A_Prod_N"].sum(axis=1)
result = (
hourly.groupby(pd.Grouper(key="Datetime", freq="W-MON"))["HourlyTotal"]
.sum()
.to_frame()
.merge(
weekly[["Datetime", "Total_Production_A"]],
how="outer",
left_index=True,
right_on="Datetime",
)
.assign(Diff=lambda x: x["Total_Production_A"] - x["HourlyTotal"])
)
I’m interested to know if someone here has ever come across a situation where the source is not always unique when dealing with snapshots in DBT.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the dbt solution runs, my source could have more than one row for the same unique id, as the data may have changed more than once since the last run.
Ideally, I'd like to update the respective dbt_valid_to columns in the snapshot table with the earliest updated_at from the source, and then add the new records to the snapshot table, making the row with the latest updated_at the current one.
I know how to achieve this using window functions but not sure how to handle such situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | null |
Source Table
|**id**|**some_attribute**| **updated_at** |
| 123 | ABCD | 2021-01-01 00:00:00 |-> already been loaded to snapshot
| 123 | ZABC | 2021-06-30 00:00:00 |-> already been loaded to snapshot
-------------------------------------------
| 123 | ZZAB | 2021-11-21 00:10:00 |
| 123 | FXAB | 2021-11-21 15:11:00 |
Snapshot Desired Result
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123 | ZZAB | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123 | FXAB | 2021-11-21 15:11:00 | null |
Standard snapshots operate under the assumption that the source table we are snapshotting is changed in place without storing history. That is the opposite of the behaviour we have here (the source table we are snapshotting is essentially an append-only log of events), which means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a staging view downstream of the source that applies the window function you describe; then you snapshot that view instead of the raw source. A sketch of what that might look like is below.
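A minimal sketch of such a staging model, assuming a declared source called raw.source_table with id and updated_at columns (the model and source names here are illustrative, not your actual schema):

-- models/staging/stg_source_deduped.sql (hypothetical model name)
-- Keep only the latest row per id so the snapshot always sees a unique key.
select *
from (
    select
        s.*,
        row_number() over (
            partition by id
            order by updated_at desc
        ) as row_num
    from {{ source('raw', 'source_table') }} as s
) ranked
where row_num = 1

Materialize it as a view and point your snapshot's select at ref('stg_source_deduped') rather than at the raw source.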
However, I do see potential for a new snapshot strategy that handles append-only sources. Perhaps you'd like to peruse the dbt snapshot docs and the source code of the existing strategies to see if you'd like to propose a new one!
Hello PostgreSQL experts (and maybe this is also a task for Perl's DBI, since I also happen to be working with it, but...). I might have misused some terminology here, so bear with me.
I have a set of 32 tables, each one exactly like the others. The first column of every table always contains a date, while the second column contains values (integers) that can change once every 24 hours; some samples get back-dated. In many cases, a table may never contain data for a particular date. So here's an example of two such tables:
 date_list  | sum         date_list  | sum
------------+-----       ------------+-----
 2020-03-12 |   4         2020-03-09 |   1
 2020-03-14 |   5         2020-03-11 |   3
                          2020-03-12 |   5
                          2020-03-13 |   9
                          2020-03-14 |  12
The idea is to merge the separate tables into one, sort of like a grid, but with the samples placed in the correct row in its own column and ensuring that the date column (always the first column) is not missing any dates, looking like this:
date_list | sum1 | sum2 | sum3 .... | sum32
---------------------------------------------------------
2020-03-08 | | |
2020-03-09 | | 1 |
2020-03-10 | | | 5
2020-03-11 | | 3 | 25
2020-03-12 | 4 | 5 | 35
2020-03-13 | | 9 | 37
2020-03-14 | 5 | 12 | 40
And so on: 33 columns in total, covering 2020-01-01 to date.
Now, I have tried doing a FULL OUTER JOIN and it succeeds. It's the subsequent attempts that get me into trouble, creating a long, cascading table with the values in the wrong place or accidentally clobbering data. I know this works if I use a table with a single column holding a date sequence and join the first data table, just as a test of my theory using baby steps:
SELECT date_table.date_list, sums_1.sum
FROM date_table
FULL OUTER JOIN sums_1 ON date_table.date_list = sums_1.date_list;
2020-03-07 | 1
2020-03-08 |
2020-03-09 |
2020-03-10 | 2
2020-03-11 |
2020-03-12 | 4
Encouraged, I thought I'd get a little more ambitious with my testing, but that places some rows out of sequence at the bottom of the table, and I'm not sure whether I'm losing data or not; this time I tried USING as an alternative:
SELECT * FROM sums_1 FULL OUTER JOIN sums_2 USING (date_list);
Result:
fecha_sintomas | sum | sum
----------------+-------+-------
2020-03-09 | | 1
2020-03-11 | | 3
2020-03-12 | 4 | 5
2020-03-13 | | 9
2020-03-14 | 5 | 12
2020-03-15 | 6 | 15
2020-03-16 | 8 | 20
: : :
2020-10-29 | 10053 | 22403
2020-10-30 | 10066 | 22407
2020-10-31 | 10074 | 22416
2020-11-01 | 10076 | 22432
2020-11-02 | 10077 | 22434
2020-03-07 | 1 |
2020-03-10 | 2 |
(240 rows)
I think I'm getting close. In any case, how do I get to what I want, namely the grid of data described above? Maybe this is an iterative process that could benefit from using DBI?
Thanks,
You can full join like so:
select date_list, s1.sum as sum1, s2.sum as sum2, s3.sum as sum3
from sums_1 s1
full join sums_2 s2 using (date_list)
full join sums_3 s3 using (date_list)
order by date_list;
The USING syntax makes the unqualified column date_list unambiguous in the SELECT and ORDER BY clauses. Then we need to enumerate the sum columns, providing aliases for each of them.
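If you also need every calendar date to appear even when none of the tables has a row for it (as in your desired grid), one option is to build the date spine with generate_series and left join each table to it. A sketch for the first three tables, assuming the range starts at 2020-01-01:

select d.date_list::date as date_list,
       s1.sum as sum1,
       s2.sum as sum2,
       s3.sum as sum3
from generate_series(date '2020-01-01', current_date, interval '1 day') as d(date_list)
left join sums_1 s1 on s1.date_list = d.date_list
left join sums_2 s2 on s2.date_list = d.date_list
left join sums_3 s3 on s3.date_list = d.date_list
order by d.date_list;

The same pattern extends to the remaining tables, and your existing date_table could be used in place of generate_series if you prefer.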
I am attempting to import a csv file which includes a number of time-series.
The challenges I am facing are:
a) the csv file is transposed so dates cannot be parsed from columns. Transposing the file using a read_csv().T command would generally work, but it is not appropriate given the datetime information.
b) since the datetime information sits in the header rows, repeated labels get a numeral appended on read (i.e. Jan becomes Jan, Jan.1, Jan.2, etc.), so stripping datetime values becomes difficult.
c) the first column headers (which do not include datetime information) are placed on the last row of datetime data (third row), which further complicates parsing headers.
Is there an easy way to go from the csv to a 'standard' dataframe structure, with a datetime index parsed from the csv and values in columns?
An example of the csv data structure is provided here:
empty | empty | Jan | Jan | Jan | ... | Dec |
empty | empty | 1 | 1 | 1 | ... | 31 |
head1 | head2 | 00:00 | 01:00 | 02:00 | ... | 23:00 |
---
value1 | value2 | 0.35 | 0.38 | 0.44 | ... | 0.20 |
...
Try:
import pandas as pd

# read the csv with no header, then transpose so each series becomes a row
df = pd.read_csv('untitled.txt', header=None).T

# build an index by joining the first three columns (month, day, hour)
df['idx'] = [' '.join((a, b, c)) for a, b, c in
             zip(df[0].fillna(''),
                 df[1].fillna(''),
                 df[2].fillna(''))]

# drop the columns that were merged into the index
df.drop([0, 1, 2], axis=1, inplace=True)

# transpose back so the joined labels become the column headers
df.set_index('idx').T.reset_index(drop=True)
Output:
+----+-----------+-----------+---------------+---------------+---------------+----------------+
| | head1 | head2 | Jan 1 00:00 | Jan 1 01:00 | Jan 1 02:00 | Dec 31 23:00 |
|----+-----------+-----------+---------------+---------------+---------------+----------------|
| 0 | value1 | value2 | 0.35 | 0.38 | 0.44 | 0.2 |
+----+-----------+-----------+---------------+---------------+---------------+----------------+
As shown above, the column labels are still text (str type); you'll need to convert them back into timestamps if need be.
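If you want a proper DatetimeIndex rather than text labels, one possible follow-up, assuming the two header columns really came out as head1 and head2 (as in the printed output) and picking an arbitrary year since the file does not contain one:

import pandas as pd

# result of the snippet above: one row per series, with columns
# head1, head2, 'Jan 1 00:00', ..., 'Dec 31 23:00'
out = df.set_index('idx').T.reset_index(drop=True)

# the columns that carry date/time information
ts_cols = out.columns.drop(['head1', 'head2'])

# the labels have no year, so one must be assumed (2019 here is arbitrary)
timestamps = pd.to_datetime('2019 ' + ts_cols, format='%Y %b %d %H:%M')

# transpose so timestamps become the index and each original row a column
tidy = out[ts_cols].T.set_axis(timestamps, axis=0)
tidy.columns = out['head1']          # label each column by its head1 value
tidy = tidy.apply(pd.to_numeric)     # values were read in as strings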
Python newbie here, this is my first question.
I tried to find a solution on similar SO questions, like this one, this one, and also this one, but I think my problem is different.
Here's my situation: I have quite a large dataset with two columns: Date (a datetime object) and session_id (an integer). The timestamps refer to the moment when a certain action occurred during an online session.
My problem is that I have all the dates, but I am missing some of the corresponding session_id values. What I would like to do is to fill these missing values using the date column:
If the action occurred between the first and last date of a certain session, I would like to fill the missing value with the id of that session.
I would mark the value as '0' where the action occurred outside the range of any session,
and mark it as '-99' if it is not possible to associate the event with a single session because it occurred during the time range of more than one session.
To give an example of my problem, let's consider the toy dataset below, where I have just three sessions: a, b, c. Sessions a and b registered three events each, session c two. Moreover, I have three missing id values.
| DATE |sess_id|
----------------------------------
0 | 2018-01-01 00:19:01 | a |
1 | 2018-01-01 00:19:05 | b |
2 | 2018-01-01 00:21:07 | a |
3 | 2018-01-01 00:22:07 | b |
4 | 2018-01-01 00:25:09 | c |
5 | 2018-01-01 00:25:11 | Nan |
6 | 2018-01-01 00:27:28 | c |
7 | 2018-01-01 00:29:29 | a |
8 | 2018-01-01 00:30:35 | Nan |
9 | 2018-01-01 00:31:16 | b |
10 | 2018-01-01 00:35:22 | Nan |
...
This is what I would like to obtain:
| DATE |sess_id|
----------------------------------
0 | 2018-01-01 00:19:01 | a |
1 | 2018-01-01 00:19:05 | b |
2 | 2018-01-01 00:21:07 | a |
3 | 2018-01-01 00:22:07 | b |
4 | 2018-01-01 00:25:09 | c |
5 | 2018-01-01 00:25:11 | -99 |
6 | 2018-01-01 00:27:28 | c |
7 | 2018-01-01 00:29:29 | a |
8 | 2018-01-01 00:30:35 | b |
9 | 2018-01-01 00:31:16 | b |
10 | 2018-01-01 00:35:22 | 0 |
...
In this way I will be able to recover at least some of the events without session code.
I think that maybe the first thing to do is to compute two new columns showing the first and last time value for each session, something like this:
foo['last'] = foo.groupby('sess_id')['DATE'].transform(max)
foo['first'] = foo.groupby('sess_id')['DATE'].transform(min)
And then use the first/last time values to check whether each event whose session id is unknown falls within that range.
Your intuition seems fine to me, but you can't apply it this way, since your dataframe foo doesn't have the same size as your groupby dataframe. What you could do is map the values like this:
foo['last'] = foo.sess_id.map(foo.groupby('sess_id').DATE.max())
foo['first'] = foo.sess_id.map(foo.groupby('sess_id').DATE.min())
But I don't think it's necessary; you can just use the grouped dataframe as such.
A way to solve your problem could be to look for the missing values in the sess_id column and apply a custom function to the corresponding dates:
# first and last timestamp of each known session
my_agg = foo.groupby('sess_id').DATE.agg([min, max])

def my_custom_function(time):
    # sessions whose time range strictly contains this timestamp
    current_sessions = my_agg.loc[(my_agg['min'] < time) & (my_agg['max'] > time)]
    count = len(current_sessions)
    if count == 0:
        return 0      # outside the range of every session
    if count > 1:
        return -99    # ambiguous: falls within several sessions
    return current_sessions.index[0]

# fill only the rows where sess_id is missing
foo.loc[foo.sess_id.isnull(), 'sess_id'] = (
    foo.loc[foo.sess_id.isnull(), 'DATE'].apply(my_custom_function)
)
Output:
DATE sess_id
0 2018-01-01 00:19:01 a
1 2018-01-01 00:19:05 b
2 2018-01-01 00:21:07 a
3 2018-01-01 00:22:07 b
4 2018-01-01 00:25:09 c
5 2018-01-01 00:25:11 -99
6 2018-01-01 00:27:28 c
7 2018-01-01 00:29:29 a
8 2018-01-01 00:30:35 b
9 2018-01-01 00:31:16 b
10 2018-01-01 00:35:22 0
I think it performs what you are looking for, though the output you posted in your question seems to contain typos.