Druid generate missing records - sql

I have a data table in Druid that has missing rows, and I want to fill them by generating the missing timestamps and carrying forward the value of the preceding row.
This is the table in Druid:
| __time | distance |
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
I want to add the missing minutes, either by forcing this on the Druid storage side or by querying it directly in Druid, without going through another module.
The final result I want looks like this:
| __time | distance |
| 2022-05-05T08:41:00.000Z | 1337 |
| 2022-05-05T08:42:00.000Z | 1350 |
| 2022-05-05T08:43:00.000Z | 1350 |
| 2022-05-05T08:44:00.000Z | 1360 |
| 2022-05-05T08:45:00.000Z | 1360 |
| 2022-05-05T08:46:00.000Z | 1360 |
| 2022-05-05T08:47:00.000Z | 1377 |
| 2022-05-05T08:48:00.000Z | 1400 |
Thank you in advance!

A Druid timeseries query will produce a densely populated timeline at a given time granularity, like the per-minute one you want. But its current functionality either skips empty time buckets or assigns them a value of zero.
Other gap-filling functions, like the LVCF (last value carried forward) you describe, seem like a great enhancement. You can join the Apache Druid community and create an issue that describes this request. That's a great way to start a conversation about requirements and how it might be achieved.
You could also add the functionality yourself and submit a PR. We're always looking for more members in the Apache Druid community.
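To make the LVCF idea concrete, here is a minimal sketch in plain Python, applied to the rows from the question after they have been fetched. It runs outside Druid, so it is only an illustration of the fill logic, not a Druid feature; the function name `fill_gaps_lvcf` and the one-minute step are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Rows from the question: (timestamp, distance), one row per reported minute.
rows = [
    (datetime(2022, 5, 5, 8, 41), 1337),
    (datetime(2022, 5, 5, 8, 42), 1350),
    (datetime(2022, 5, 5, 8, 44), 1360),
    (datetime(2022, 5, 5, 8, 47), 1377),
    (datetime(2022, 5, 5, 8, 48), 1400),
]


def fill_gaps_lvcf(rows, step=timedelta(minutes=1)):
    """Return one row per step, repeating the last seen value inside gaps."""
    rows = sorted(rows)
    filled = []
    for i, (ts, value) in enumerate(rows):
        filled.append((ts, value))
        # The last row has no successor, so there is no gap to fill after it.
        next_ts = rows[i + 1][0] if i + 1 < len(rows) else ts
        t = ts + step
        while t < next_ts:
            filled.append((t, value))  # carry the previous value forward
            t += step
    return filled


filled = fill_gaps_lvcf(rows)
for ts, value in filled:
    print(ts.isoformat(), value)
```

The same loop works for any fixed granularity by changing `step`.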


DBT Snapshots with not unique records in the source

I’m interested to know if someone here has ever come across a situation where the source is not always unique when dealing with snapshots in DBT.
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the respective table in the data lake.
By the time the DBT solution is run, my source could have more than one row with the same unique id, as the data may have changed more than once since the last run.
Ideally, I’d like to update the respective dbt_valid_to columns from the snapshot table with the earliest updated_at record from the source and subsequently add the new records to the snapshot table making the latest updated_at record the current one.
I know how to achieve this using window functions but not sure how to handle such situation with dbt.
I wonder if anybody has faced this same issue before.
Snapshot Table
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | null |
Source Table
|**id**|**some_attribute**| **updated_at** |
| 123 | ABCD | 2021-01-01 00:00:00 |-> already been loaded to snapshot
| 123 | ZABC | 2021-06-30 00:00:00 |-> already been loaded to snapshot
| 123 | ZZAB | 2021-11-21 00:10:00 |
| 123 | FXAB | 2021-11-21 15:11:00 |
Snapshot Desired Result
| **id** | **some_attribute** | **valid_from** | **valid_to** |
| 123 | ABCD | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123 | ZABC | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123 | ZZAB | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123 | FXAB | 2021-11-21 15:11:00 | null |
Standard snapshots operate under the assumption that the source table we are snapshotting is being changed without storing history. This is the opposite of the behaviour we have here (the source table we are snapshotting is essentially an append-only log of events), which means that we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.
I have some sample code here where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
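The core of deriving SCD2 validity ranges from an append-only log is that each version's valid_to is the updated_at of the next version for the same id, which is what a LEAD window function over updated_at would compute in SQL. The plain Python sketch below only illustrates that logic on the rows from the question; it is not dbt code and not the content of the gist, and the function name `to_scd2` is made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Append-only source rows from the question.
source = [
    {"id": 123, "some_attribute": "ABCD", "updated_at": "2021-01-01 00:00:00"},
    {"id": 123, "some_attribute": "ZABC", "updated_at": "2021-06-30 00:00:00"},
    {"id": 123, "some_attribute": "ZZAB", "updated_at": "2021-11-21 00:10:00"},
    {"id": 123, "some_attribute": "FXAB", "updated_at": "2021-11-21 15:11:00"},
]


def to_scd2(rows):
    """Turn an append-only log into rows with valid_from / valid_to."""
    # Timestamps in 'YYYY-MM-DD HH:MM:SS' form sort correctly as strings.
    ordered = sorted(rows, key=itemgetter("id", "updated_at"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("id")):
        versions = list(group)
        # Pair every version with the one that follows it (None for the last).
        for current, following in zip(versions, versions[1:] + [None]):
            result.append({
                "id": current["id"],
                "some_attribute": current["some_attribute"],
                "valid_from": current["updated_at"],
                "valid_to": following["updated_at"] if following else None,
            })
    return result


snapshot = to_scd2(source)
for row in snapshot:
    print(row)
```

The latest version of each id keeps valid_to as None, which matches the "current record" convention in the desired result.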
I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.
The easiest workaround would be a staging view downstream of the source that applies the window function you describe. Then you snapshot that view.
However, I do see potential for a new snapshot strategy that handles append only sources. Perhaps you’d like to peruse the dbt Snapshot docs and strategies source code on existing strategies to see if you’d like to make a new one!

How to use groupby and nth_value at the same time in pyspark?

So, I have a dataset with some repeated data, which I need to remove. For some reason, the data I need is always in the middle:
--> df_apps
2021-01-10 | FACEBOOK | 1000 | 5000
2021-01-10 | FACEBOOK | 20000 | 900000
2021-02-10 | FACEBOOK | 9000 | 72000
2021-01-11 | FACEBOOK | 4000 | 2000
2021-01-11 | FACEBOOK | 40000 | 85000
2021-02-11 | FACEBOOK | 1000 | 2000
In pandas, it'd be as simple as df_apps_grouped = df_apps.groupby('DATE').nth(1) and I'd get the result below:
--> df_apps_grouped
2021-01-10 | FACEBOOK | 20000 | 900000
2021-01-11 | FACEBOOK | 40000 | 85000
But for one specific project, I must use pyspark and I can't get this result on it.
Could you please help me with this?
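You'll want to do:
from pyspark.sql import Window, functions as F
# Number the rows within each date; row_number() starts at 1
w = Window.partitionBy('date').orderBy('date')
# row_n == 2 is the second row of each group, like pandas' 0-based nth(1)
df = df.withColumn('row_n', F.row_number().over(w)).filter('row_n == 2')
Because Spark is distributed, rows come back in no guaranteed order, so the row that gets a given number may differ from one run to the next. This is why the window needs an order by: it makes the numbering repeatable, provided the ordering column actually distinguishes the rows within each partition.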
What you are looking for is row_number applied over a window partitioned by DATE and ordered by DATE. However, due to the distributed nature of Spark, we can't guarantee that during ordering
2021-01-10 | FACEBOOK | 1000 | 5000
will always come before
2021-01-10 | FACEBOOK | 20000 | 900000
I would suggest including a line number if you are reading from a file, and ordering based on that line number. Refer here for achieving this in Spark.

How to import Excel table with double headers into oracle database

I have this excel table I am trying to transfer over to an oracle database. The thing is that the table has headers that overlap and I'm not sure if there is a way to import this nicely into an Oracle Database.
| | 2018-01-01| 2018-01-02|
|Item +-----+-----+-----+-----+
| | RMB | USD | RMB | USD |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
The top headers are just the dates for the month and then their respective data for that date. Is there a way to nicely transfer this to an oracle table?
EDIT: Date field is an actual date such as 02/19/2018.
If you pre-create a table (as I do), then you can start loading from the 3rd line (i.e. skip the first two), putting every Excel column into the appropriate Oracle table column.
Alternatively (and obviously), rename the column headers so that the file doesn't have two header levels.
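If renaming by hand is impractical, the two header rows can be merged into one programmatically before loading. The sketch below is plain Python and assumes the sheet has been exported so that each header row is a list of cell strings, with merged date cells appearing once followed by empty cells; the function name `flatten_headers` and the `date_currency` naming scheme are illustrative choices, not anything Oracle requires.

```python
def flatten_headers(top, bottom):
    """Combine two header rows into one list of single-level names."""
    names = []
    carried = ""
    for t, b in zip(top, bottom):
        t, b = t.strip(), b.strip()
        # A merged top cell is only filled in its first column, so reuse
        # the last non-empty value for the columns that follow it.
        carried = t or carried
        names.append(f"{carried}_{b}" if b else carried)
    return names


# Header rows shaped like the sheet in the question.
top_row = ["Item", "2018-01-01", "", "2018-01-02", ""]
bottom_row = ["", "RMB", "USD", "RMB", "USD"]

columns = flatten_headers(top_row, bottom_row)
print(columns)
```

The resulting names would still need to be mapped onto the columns of the pre-created table, for instance by position.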

Let pandas use 0-based row number as index when reading Excel files

I am trying to use pandas to process a series of XLS files. The code I am currently using looks like:
with pandas.ExcelFile(data_file) as xls:
    data_frame = pandas.read_excel(xls, header=[0, 1], skiprows=2, index_col=None)
And the format of the XLS file looks like
| Unit: 1000000 USD |
| | | | | Balance |
+ ID + Branch + Customer ID + Customer Name +--------------------------+
| | | | | Daily | Monthly | Yearly |
| 111111 | Branch1 | 1 | Company A | 10 | 5 | 2 |
| 222222 | Branch2 | 2 | Company B | 20 | 25 | 20 |
| 111111 | Branch1 | 3 | Company C | 30 | 35 | 40 |
Even though I explicitly gave index_col=None, pandas still takes the ID column as the index. I am wondering what the right way is to make the row numbers the index.
pandas currently doesn't support parsing MultiIndex columns without also parsing a row index. Related issue here - it probably could be supported, but this gets tricky to define in a non-ambiguous way.
It's a hack, but the easiest way to work around this right now is to add a blank column on the left side of data, then read it in like this.
pd.read_excel('file.xlsx', header=[0,1], skiprows=2).reset_index(drop=True)
If you can't / don't want to modify the files, a couple options are:
If the data has a known / common header, use pd.read_excel(..., skiprows=4, header=None) and assign the columns yourself, suggested by #ayhan.
If you need to parse the header, use pd.read_excel(..., skiprows=2, header=0), then munge the second level of labels into a MultiIndex. This will probably mess up dtypes, so you may also need to do some typecasting (pd.to_numeric) as well.
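The second option can be sketched roughly as follows. To keep the example self-contained, the frame `raw` is built by hand as a stand-in for what pd.read_excel(..., skiprows=2, header=0) might return for the sheet in the question (first header row as column names, second header row as the first data row); the exact labels such as "Unnamed: 5" are an assumption about how pandas names the empty header cells.

```python
import pandas as pd

# Stand-in for the result of read_excel(..., skiprows=2, header=0).
raw = pd.DataFrame(
    [
        [None, None, None, None, "Daily", "Monthly", "Yearly"],
        [111111, "Branch1", 1, "Company A", 10, 5, 2],
        [222222, "Branch2", 2, "Company B", 20, 25, 20],
        [111111, "Branch1", 3, "Company C", 30, 35, 40],
    ],
    columns=["ID", "Branch", "Customer ID", "Customer Name",
             "Balance", "Unnamed: 5", "Unnamed: 6"],
)

# First level: carry "Balance" over the unnamed columns it spans.
top = []
for name in raw.columns:
    top.append(top[-1] if str(name).startswith("Unnamed") else name)

# Second level: the first data row, with blanks where the cell was empty.
second = ["" if pd.isna(value) else value for value in raw.iloc[0]]

# Drop the header row from the data and renumber the rows from 0.
data = raw.iloc[1:].reset_index(drop=True)
data.columns = pd.MultiIndex.from_arrays([top, second])

# The header row forced mixed dtypes, so cast the numeric columns back.
numeric_columns = [("ID", ""), ("Customer ID", ""), ("Balance", "Daily"),
                   ("Balance", "Monthly"), ("Balance", "Yearly")]
for column in numeric_columns:
    data[column] = pd.to_numeric(data[column]).astype("int64")

print(data)
```

The row index of `data` is the plain 0-based range, and the ID column stays an ordinary column.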

Query to compare values across different tables?

I have a pair of models in my Rails app that I'm having trouble bridging.
These are the tables I'm working with:
| id | fips | name |
| 1 | 06 | California |
| 2 | 36 | New York |
| 3 | 48 | Texas |
| 4 | 12 | Florida |
| 5 | 17 | Illinois |
| … | … | … |
| id | place |
| 1 | Fl |
| 2 | Calif. |
| 3 | Texas |
| … | … |
Not all places are represented in the states model, but I'm trying to perform a query where I can compare a place's place value against all state names, find the closest match, and return the corresponding fips.
So if my input is Calif., I want my output to be 06
I'm still very new to writing SQL queries, so if there's a way to do this using Ruby within my Rails (4.1.5) app, that would be ideal.
My other plan of attack was to add a fips column to the "places" table, and write something that would run the above comparison and then populate fips so my app doesn't have to run this query every time the page loads. But I'm very much a beginner, so that sounds... ambitious.
This is not an easy query in SQL. Your best bet is one of the fuzzy string matching routines, which are documented here.
For instance, soundex() or levenshtein() may be sufficient for what you want. Here is an example:
select distinct on (p.place) p.place, s.name, s.fips, levenshtein(p.place, s.name) as dist
from places p cross join
states s
order by p.place, dist asc;
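The query above keeps, for each place, the state with the smallest edit distance. The same selection is sketched below in plain Python with a hand-written Levenshtein function, mainly to show what the distance measures; the helper names `levenshtein` and `closest_fips` are made up for this example, and the data is the sample from the question.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# (fips, name) pairs from the states table in the question.
states = [("06", "California"), ("36", "New York"), ("48", "Texas"),
          ("12", "Florida"), ("17", "Illinois")]


def closest_fips(place):
    """Return the fips of the state whose name is nearest to `place`."""
    fips, _ = min(states, key=lambda state: levenshtein(place, state[1]))
    return fips


print(closest_fips("Calif."))
print(closest_fips("Texas"))
```

One caveat worth knowing: under plain edit distance a very short input such as Fl is exactly as far from Texas as from Florida (5 edits each), so short abbreviations may need an explicit lookup table or a different scoring method.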