Spark/Hive - Group data into a "pivot-table" format

I have a very annoying set of files structured like so:
userId string,
eventType string,
source string,
errorCode string,
startDate timestamp,
endDate timestamp
Each file may contain an arbitrary number of records per userId, with varying eventTypes and sources, and a different errorCode and start/end date for each.
Is there a way in Hive or Spark to group all of these together on userId, sort of like a key-value store, where the value is the list of all fields associated with the userId? Specifically, I'd like it to be keyed by eventType and source. Basically I want to trade table length for width, sort of like a pivot table. My goal is for this to eventually be stored in Apache Parquet or Avro format for speedier analysis later.
Here's an example:
Source data:
userId, eventType, source, errorCode, startDate, endDate
552113, 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.432', '2017-09-01 12:01:45.452'
284723, 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.675', '2017-09-01 12:01:45.775'
552113, 'TRADE', 'MERCH', 0, '2017-09-01 12:01:47.221', '2017-09-01 12:01:46.229'
552113, 'CHARGE', 'MERCH', 0, '2017-09-01 12:01:48.123', '2017-09-01 12:01:48.976'
284723, 'REFUND', 'MERCH', 1, '2017-09-01 12:01:48.275', '2017-09-01 12:01:48.947'
552113, 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:49.908', '2017-09-01 12:01:50.623'
284723, 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:50.112', '2017-09-01 12:01:50.777'
Goal:
userId, eventTypeAckProvider, sourceAckProvider, errorCodeAckProvider, startDateAckProvider, endDateAckProvider, eventTypeTradeMerch, sourceTradeMerch, errorCodeTradeMerch, startDateTradeMerch, endDateTradeMerch, eventTypeChargeMerch, sourceChargeMerch, errorCodeChargeMerch, startDateChargeMerch, endDateChargeMerch, eventTypeCloseProvider, sourceCloseProvider, errorCodeCloseProvider, startDateCloseProvider, endDateCloseProvider, eventTypeRefundMerch, sourceRefundMerch, errorCodeRefundMerch, startDateRefundMerch, endDateRefundMerch
552113, 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.432', '2017-09-01 12:01:45.452', 'TRADE', 'MERCH', 0, '2017-09-01 12:01:47.221', '2017-09-01 12:01:46.229', 'CHARGE', 'MERCH', 0, '2017-09-01 12:01:48.123', '2017-09-01 12:01:48.976', 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:49.908', '2017-09-01 12:01:50.623', NULL, NULL, NULL, NULL, NULL
284723, 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.675', '2017-09-01 12:01:45.775', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:50.112', '2017-09-01 12:01:50.777', 'REFUND', 'MERCH', 1, '2017-09-01 12:01:48.275', '2017-09-01 12:01:48.947'
Neither the field names nor their order matters, as long as I can distinguish them.
I've tried two methods already to get this to work:
Manually select each combination from the table and join to a master dataset. This works just fine, and parallelizes well, but doesn't allow for an arbitrary number of values for the key fields, and requires the schema to be predefined.
Use Spark to create a dictionary of key:value records where each value is a dictionary. Basically loop through the dataset, add a new key to the dictionary if it doesn't exist, and for that entry, add a new field to the value-dictionary if it doesn't exist. This works beautifully, but is extremely slow and doesn't parallelize well, if it does at all. Also, I'm not sure whether that would be an Avro/Parquet-compatible format.
Are there any alternatives to those two methods? Or even a better structure than what my goal is?

Would you like to have something like this?
from pyspark.sql.functions import struct, col, create_map, collect_list
df = sc.parallelize([
['552113', 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.432', '2017-09-01 12:01:45.452'],
['284723', 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.675', '2017-09-01 12:01:45.775'],
['552113', 'TRADE', 'MERCH', 0, '2017-09-01 12:01:47.221', '2017-09-01 12:01:46.229'],
['552113', 'CHARGE', 'MERCH', 0, '2017-09-01 12:01:48.123', '2017-09-01 12:01:48.976'],
['284723', 'REFUND', 'MERCH', 1, '2017-09-01 12:01:48.275', '2017-09-01 12:01:48.947'],
['552113', 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:49.908', '2017-09-01 12:01:50.623'],
['284723', 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:50.112', '2017-09-01 12:01:50.777']
]).toDF(('userId', 'eventType', 'source', 'errorCode', 'startDate', 'endDate'))
df.show()
new_df = df \
    .withColumn("eventType_source", struct(col('eventType'), col('source'))) \
    .withColumn("errorCode_startEndDate", struct(col('errorCode'), col('startDate'), col('endDate')))
new_df = new_df.groupBy('userId') \
    .agg(collect_list(create_map(col('eventType_source'), col('errorCode_startEndDate'))).alias('event_detail'))
new_df.show()
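If the flat, fixed-column layout from the question is the real target, Spark's built-in pivot can also produce it without hand-written joins or UDFs. A minimal sketch, assuming each (userId, eventType, source) combination occurs at most once per user; the pivotKey column and the resulting names (e.g. AckProvider_errorCode) are illustrative:
from pyspark.sql import functions as F

# Build a single pivot key such as 'AckProvider' from eventType and source
keyed = df.withColumn(
    'pivotKey',
    F.concat(F.initcap(F.lower('eventType')), F.initcap(F.lower('source')))
)

# One output row per userId, one column group per distinct pivotKey
wide = (keyed.groupBy('userId')
             .pivot('pivotKey')
             .agg(F.first('errorCode').alias('errorCode'),
                  F.first('startDate').alias('startDate'),
                  F.first('endDate').alias('endDate')))
wide.show(truncate=False)
Missing combinations come out as NULL automatically, matching the goal table; the trade-off is that pivot first has to scan for the distinct keys unless you pass them explicitly.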

Can you try this and give your comments?
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> spark = SparkSession.builder.getOrCreate()
>>> l=[(552113, 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.432', '2017-09-01 12:01:45.452'),(284723, 'ACK', 'PROVIDER', 0, '2017-09-01 12:01:45.675', '2017-09-01 12:01:45.775'),(552113, 'TRADE', 'MERCH', 0, '2017-09-01 12:01:47.221', '2017-09-01 12:01:46.229'),(552113, 'CHARGE', 'MERCH', 0, '2017-09-01 12:01:48.123', '2017-09-01 12:01:48.976'),(284723, 'REFUND', 'MERCH', 1, '2017-09-01 12:01:48.275', '2017-09-01 12:01:48.947'),(552113, 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:49.908', '2017-09-01 12:01:50.623'),(284723, 'CLOSE', 'PROVIDER', 0, '2017-09-01 12:01:50.112', '2017-09-01 12:01:50.777')]
>>> df = spark.createDataFrame(l,['userId', 'eventType', 'source', 'errorCode', 'startDate','endDate'])
>>> df.show(10,False)
+------+---------+--------+---------+-----------------------+-----------------------+
|userId|eventType|source |errorCode|startDate |endDate |
+------+---------+--------+---------+-----------------------+-----------------------+
|552113|ACK |PROVIDER|0 |2017-09-01 12:01:45.432|2017-09-01 12:01:45.452|
|284723|ACK |PROVIDER|0 |2017-09-01 12:01:45.675|2017-09-01 12:01:45.775|
|552113|TRADE |MERCH |0 |2017-09-01 12:01:47.221|2017-09-01 12:01:46.229|
|552113|CHARGE |MERCH |0 |2017-09-01 12:01:48.123|2017-09-01 12:01:48.976|
|284723|REFUND |MERCH |1 |2017-09-01 12:01:48.275|2017-09-01 12:01:48.947|
|552113|CLOSE |PROVIDER|0 |2017-09-01 12:01:49.908|2017-09-01 12:01:50.623|
|284723|CLOSE |PROVIDER|0 |2017-09-01 12:01:50.112|2017-09-01 12:01:50.777|
+------+---------+--------+---------+-----------------------+-----------------------+
>>> myudf = F.udf(lambda *cols: cols, ArrayType(StringType())) # composition to create a row-wise list
>>> df1 = df.select('userId',myudf('eventType', 'source', 'errorCode','startDate', 'endDate').alias('val_list'))
>>> df2 = df1.groupby('userId').agg(F.collect_list('val_list').alias('agg_list')) # grouped on userId; aliased so myudf2 below can reference 'agg_list'
>>> eventtypes = ['ACK','TRADE','CHARGE','CLOSE','REFUND'] # eventtypes and the order required in output
>>> def f(Vals):
...     aggVals = [typ for x in eventtypes for typ in Vals if typ[0] == x] # to order the grouped data based on eventtypes above
...     if len(aggVals) == 5:
...         return aggVals
...     else:
...         present = [row[0] for row in aggVals] # eventtypes already present (Python 3 safe, unlike zip(*aggVals)[0])
...         missngval = [(idx,val) for idx,val in enumerate(eventtypes) if val not in present] # get missing eventtypes with their index to create null
...         for idx,val in missngval:
...             aggVals.insert(idx,[None]*5)
...         return aggVals
...
>>> myudf2 = F.udf(f,ArrayType(ArrayType(StringType())))
>>> df3 = df2.select('userId',myudf2('agg_list').alias('values'))
>>> df4 = df3.select(['userId']+[df3['values'][i][x] for i in range(5) for x in range(5)]) # to select from Array[Array]
>>> oldnames = df4.columns
>>> destnames = ['userId', 'eventTypeAckProvider', 'sourceAckProvider', 'errorCodeAckProvider', 'startDateAckProvider', 'endDateAckProvider', 'eventTypeTradeMerch', 'sourceTradeMerch', 'errorCodeTradeMerch', 'startDateTradeMerch', 'endDateTradeMerch', 'eventTypeChargeMerch', 'sourceChargeMerch', 'errorCodeChargeMerch', 'startDateChargeMerch', 'endDateChargeMerch', 'eventTypeCloseProvider', 'sourceCloseProvider', 'errorCodeCloseProvider', 'startDateCloseProvider', 'endDateCloseProvider', 'eventTypeRefundMerch', 'sourceRefundMerch', 'errorCodeRefundMerch', 'startDateRefundMerch', 'endDateRefundMerch']
>>> from functools import reduce # needed on Python 3
>>> finalDF = reduce(lambda d,idx : d.withColumnRenamed(oldnames[idx],destnames[idx]),range(len(oldnames)),df4) # Renaming the columns
>>> finalDF.show()
+------+--------------------+-----------------+--------------------+-----------------------+-----------------------+-------------------+----------------+-------------------+-----------------------+-----------------------+--------------------+-----------------+--------------------+-----------------------+-----------------------+----------------------+-------------------+----------------------+-----------------------+-----------------------+--------------------+-----------------+--------------------+-----------------------+-----------------------+
|userId|eventTypeAckProvider|sourceAckProvider|errorCodeAckProvider|startDateAckProvider |endDateAckProvider |eventTypeTradeMerch|sourceTradeMerch|errorCodeTradeMerch|startDateTradeMerch |endDateTradeMerch |eventTypeChargeMerch|sourceChargeMerch|errorCodeChargeMerch|startDateChargeMerch |endDateChargeMerch |eventTypeCloseProvider|sourceCloseProvider|errorCodeCloseProvider|startDateCloseProvider |endDateCloseProvider |eventTypeRefundMerch|sourceRefundMerch|errorCodeRefundMerch|startDateRefundMerch |endDateRefundMerch |
+------+--------------------+-----------------+--------------------+-----------------------+-----------------------+-------------------+----------------+-------------------+-----------------------+-----------------------+--------------------+-----------------+--------------------+-----------------------+-----------------------+----------------------+-------------------+----------------------+-----------------------+-----------------------+--------------------+-----------------+--------------------+-----------------------+-----------------------+
|284723|ACK |PROVIDER |0 |2017-09-01 12:01:45.675|2017-09-01 12:01:45.775|null |null |null |null |null |null |null |null |null |null |CLOSE |PROVIDER |0 |2017-09-01 12:01:50.112|2017-09-01 12:01:50.777|REFUND |MERCH |1 |2017-09-01 12:01:48.275|2017-09-01 12:01:48.947|
|552113|ACK |PROVIDER |0 |2017-09-01 12:01:45.432|2017-09-01 12:01:45.452|TRADE |MERCH |0 |2017-09-01 12:01:47.221|2017-09-01 12:01:46.229|CHARGE |MERCH |0 |2017-09-01 12:01:48.123|2017-09-01 12:01:48.976|CLOSE |PROVIDER |0 |2017-09-01 12:01:49.908|2017-09-01 12:01:50.623|null |null |null |null |null |
+------+--------------------+-----------------+--------------------+-----------------------+-----------------------+-------------------+----------------+-------------------+-----------------------+-----------------------+--------------------+-----------------+--------------------+-----------------------+-----------------------+----------------------+-------------------+----------------------+-----------------------+-----------------------+--------------------+-----------------+--------------------+-----------------------+-----------------------+
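Since the stated goal is to persist the wide table for faster analysis later, the result can be written straight to Parquet; the path below is illustrative, and Avro additionally needs the external spark-avro package on the classpath:
>>> finalDF.write.mode('overwrite').parquet('/tmp/events_wide.parquet')  # illustrative output path
>>> # finalDF.write.format('avro').save('/tmp/events_wide.avro')  # only if spark-avro is available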

Related

Shift rows dynamically based on column value

Below is my input dataframe:
+---+----------+--------+
|ID |date |shift_by|
+---+----------+--------+
|1 |2021-01-01|2 |
|1 |2021-02-05|2 |
|1 |2021-03-27|2 |
|2 |2022-02-28|1 |
|2 |2022-04-30|1 |
+---+----------+--------+
I need to groupBy "ID" and shift based on the "shift_by" column. In the end, the result should look like below:
+---+----------+----------+
|ID |date1 |date2 |
+---+----------+----------+
|1 |2021-01-01|2021-03-27|
|2 |2022-02-28|2022-04-30|
+---+----------+----------+
I have implemented the logic using UDF, but it makes my code slow. I would like to understand if this logic can be implemented without using UDF.
Below is the expected output built as a dataframe:
import datetime
from pyspark.sql.types import *
data2 = [(1, datetime.date(2021, 1, 1), datetime.date(2021, 3, 27)),
(2, datetime.date(2022, 2, 28), datetime.date(2022, 4, 30))
]
schema = StructType([
StructField("ID", IntegerType(), True),
StructField("date1", DateType(), True),
StructField("date2", DateType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
Based on the comments and chat, you can calculate the first and last values of the lat/lon fields of concern.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd
import sys
data_sdf. \
withColumn('foo_first', func.first('foo').over(wd.partitionBy('id').orderBy('date').rowsBetween(-sys.maxsize, sys.maxsize))). \
withColumn('foo_last', func.last('foo').over(wd.partitionBy('id').orderBy('date').rowsBetween(-sys.maxsize, sys.maxsize))). \
select('id', 'foo_first', 'foo_last'). \
dropDuplicates()
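As a side note, the Window class exposes unbounded frame markers, which read a little more clearly than sys.maxsize. A sketch using the same data_sdf, id, date and foo names assumed above:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# same unbounded frame, spelled with Spark's own markers instead of sys.maxsize
w = wd.partitionBy('id').orderBy('date') \
      .rowsBetween(wd.unboundedPreceding, wd.unboundedFollowing)

data_sdf. \
    withColumn('foo_first', func.first('foo').over(w)). \
    withColumn('foo_last', func.last('foo').over(w)). \
    select('id', 'foo_first', 'foo_last'). \
    dropDuplicates()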
Or, you can create structs and take the min/max. Min/max on a struct compares fields left to right, so putting date first in the struct picks the earliest and latest rows per group:
data_sdf = spark.createDataFrame(
[(1, '2021-01-01', 2, 2),
(1, '2021-02-05', 3, 2),
(1, '2021-03-27', 4, 2),
(2, '2022-02-28', 1, 5),
(2, '2022-04-30', 5, 1)],
['ID', 'date', 'lat', 'lon'])
data_sdf. \
withColumn('dt_lat_lon_struct', func.struct('date', 'lat', 'lon')). \
groupBy('id'). \
agg(func.min('dt_lat_lon_struct').alias('min_dt_lat_lon_struct'),
func.max('dt_lat_lon_struct').alias('max_dt_lat_lon_struct')
). \
selectExpr('id',
'min_dt_lat_lon_struct.lat as lat_first', 'min_dt_lat_lon_struct.lon as lon_first',
'max_dt_lat_lon_struct.lat as lat_last', 'max_dt_lat_lon_struct.lon as lon_last'
)
# +---+---------+---------+--------+--------+
# | id|lat_first|lon_first|lat_last|lon_last|
# +---+---------+---------+--------+--------+
# | 1| 2| 2| 4| 2|
# | 2| 1| 5| 5| 1|
# +---+---------+---------+--------+--------+
Aggregation using min and max seems like it could work in your case.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, '2021-01-01', 2),
(1, '2021-02-05', 2),
(1, '2021-03-27', 2),
(2, '2022-02-28', 1),
(2, '2022-04-30', 1)],
['ID', 'date', 'shift_by'])
df = df.groupBy('ID').agg(
F.min('date').alias('date1'),
F.max('date').alias('date2'),
)
df.show()
# +---+----------+----------+
# | ID| date1| date2|
# +---+----------+----------+
# | 1|2021-01-01|2021-03-27|
# | 2|2022-02-28|2022-04-30|
# +---+----------+----------+
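If "shift by" is meant literally, i.e. pair the first row of each ID with the row that sits shift_by positions later (rather than simply taking the max date), a row_number plus a self-join also avoids the UDF. A sketch, assuming shift_by is constant within each ID as it is in the sample data:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('date')
ranked = df.withColumn('rn', F.row_number().over(w))

result = (
    ranked.filter(F.col('rn') == 1)                                  # first row per ID
          .select('ID', F.col('date').alias('date1'), 'shift_by')
          .join(ranked.select('ID', F.col('date').alias('date2'), 'rn'), on='ID')
          .filter(F.col('rn') == F.col('shift_by') + 1)              # row shift_by positions later
          .select('ID', 'date1', 'date2')
)
result.show()  # produces the same date1/date2 pairs as the expected result above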

How to convert PySpark Row datetime.datetime output back to a dataframe column with timestamps in DD-MM-YYYY format

I'm working on the NYC Green Taxi data for Jan 2017. I converted a string column to a timestamp, but now I can only see the values as Row objects under an alias. How do I keep it as a dataframe so that show() displays a table with a pickup_datetime column in DD-MM-YYYY format?
-- lpep_pickup_datetime: string (nullable = true)
df2 = df.select(to_timestamp(df.lpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss').alias('pickup_datetime')).collect()
df2
[Row(pickup_datetime=datetime.datetime(2017, 1, 1, 0, 1, 15)), Row(pickup_datetime=datetime.datetime(2017, 1, 1, 0, 3, 34)), Row(pickup_datetime=datetime.datetime(2017, 1, 1, 0, 4, 2)), Row(pickup_datetime=datetime.datetime(2017, 1, 1, 0, 1, 40)), Row(pickup_datetime=datetime.datetime(2017, 1, 1, 0, 0, 51)), Row(pickup_datetime=datetime.datetime(2017, 1, 1, 0, 0, 28))
The output should be a pickup_datetime column in DD-MM-YYYY format, like 1-1-2017 00:01:15.
Don't call collect(), which converts your dataframe into a list of Row objects. Just do the select, which returns a new dataframe df2, and call show() on that new dataframe.
df2 = df.select(to_timestamp(df.lpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss').alias('pickup_datetime'))
df2.show()
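If the displayed values themselves should come out in DD-MM-YYYY form, date_format on the parsed timestamp returns a formatted string column. A small sketch along the same lines:
from pyspark.sql.functions import to_timestamp, date_format

df2 = df.select(
    date_format(
        to_timestamp(df.lpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'),
        'dd-MM-yyyy HH:mm:ss'  # e.g. 01-01-2017 00:01:15
    ).alias('pickup_datetime')
)
df2.show(truncate=False)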

Using pandas query method to search for a datetime object

I'm trying to match a datetime object in a Pandas DataFrame with the query method. Given this code
import datetime
import pandas as pd
search_time = datetime.datetime(2019, 10, 27, 0, 0, 6)
df = pd.DataFrame([[0, 0, datetime.datetime(2019, 10, 27, 0, 0, 0)],
[1, 0, search_time]],
columns=(['0', '1', 'datetime']))
df1 = df[df.datetime == search_time]
print(df1)
df2 = df.query('datetime == @search_time')
I want df1 and df2 to be equal. While df1 returns what I expect,
0 1 datetime
1 1 0 2019-10-27 00:00:06
df2 raises KeyError: False. How can I correct the query syntax?
The problem is that the column name datetime collides with the datetime name already in scope for query/eval, so the comparison resolves against the object instead of the column. The solution is to rename the column, e.g. to datetime1:
import datetime
import pandas as pd
search_time = datetime.datetime(2019, 10, 27, 0, 0, 6)
df = pd.DataFrame([[0, 0, datetime.datetime(2019, 10, 27, 0, 0, 0)],
[1, 0, search_time]],
columns=(['0', '1', 'datetime1']))
df1 = df[df.datetime1 == search_time]
print(df1)
0 1 datetime1
1 1 0 2019-10-27 00:00:06
df2 = df.query('datetime1 == #search_time')
print (df2)
0 1 datetime1
1 1 0 2019-10-27 00:00:06
It is also possible to do the rename with pandas:
df = pd.DataFrame([[0, 0, datetime.datetime(2019, 10, 27, 0, 0, 0)],
[1, 0, search_time]],
columns=(['0', '1', 'datetime']))
df2 = df.rename(columns={'datetime':'datetime1'}).query('datetime1 == #search_time')
print (df2)
0 1 datetime1
1 1 0 2019-10-27 00:00:06

Most efficient way to find overlap between two tables

Given two tables with the column "title" that is not sorted or unique:
Book
|id|title |
|1 |book_1|
|2 |book_2|
|3 |book_3|
|4 |book_4|
|5 |book_5|
|6 |book_5|
|7 |book_5|
|8 |book_6|
|9 |book_7|
UserBook
|user_id|book_id|state |title |
|1 |2 |"in progress"|book_2 |
|1 |4 |"completed" |book_4 |
|1 |6 |"completed" |book_5 |
|2 |3 |"completed" |book_3 |
|2 |6 |"completed" |book_5 |
|3 |1 |"completed" |book_1 |
|3 |2 |"completed" |book_2 |
|3 |4 |"completed" |book_4 |
|3 |7 |"in progress"|book_5 |
|3 |8 |"completed" |book_6 |
|3 |9 |"completed" |book_7 |
I'd like to create a binary matrix of users and book titles with state "completed".
[0, 0, 0, 1, 1, 0, 0]
[0, 0, 1, 0, 1, 0, 0]
[1, 1, 0, 1, 0, 1, 1]
The Ruby below gets the results I'd like, but has very high algorithmic complexity; I am hoping to get the same results with SQL.
How much simpler could it be if state were boolean and titles were unique?
matrix = []
User.all.each do |user|
books = Book.distinct.order(title: :asc).pluck(:title).uniq
user_books = UserBook.where(user: user, state: "completed").order(title: :asc).pluck(:title)
matrix << books.map{|v| user_books.include?(v) ? 1 : 0}
end
SQL is not very good at matrices. But you can store the values as (x,y) pairs. You want to include 0 values as well as 1, so the idea is to generate the rows using a cross join and then bring in the existing data:
select b.id as book_id, u.user_id,
(case when ub.id is not null then 1 else 0 end) as is_completed
from books b cross join
users u left join
user_books ub
on ub.user_id = u.id and
ub.book_id = b.id and
ub.state = 'completed';
You could group UserBook by user_id and use aggregate functions to select the list of books in each group. The entire code snippet is as follows:
books = Book.order(title: :asc).pluck(:title).uniq
matrix = []
UserBook.where(state: "completed")
.select("string_agg(title, ',') as grouped_name")
.group(:user_id)
.each do |group|
user_books = group.grouped_name.split(',')
matrix << books.map { |title| user_books.include?(title) ? 1 : 0 }
end
In MySQL, you need to replace string_agg(title, ',') with GROUP_CONCAT(title).
Should you consider producing the desired array using Ruby, rather than SQL, first read data from the table Book into an array book:
book = [
[1, "book_1"], [2, "book_2"], [3, "book_3"], [4, "book_4"],
[5, "book_5"], [6, "book_5"], [7, "book_5"], [8, "book_6"],
[9, "book_7"]
]
and data from the table UserBook into an array user_book:
user_book = [
[1, 2, :in_progress], [1, 4, :completed], [1, 6, :completed],
[2, 3, :completed], [2, 6, :completed],
[3, 1, :completed], [3, 2, :completed], [3, 4, :completed], [3, 7, :in_progress],
[3, 8, :completed], [3, 9, :completed]
]
Note the first element of each element of book, an integer, is the book_id, and the first two elements of each element of user_book, integers, are respectively the user_id and book_id.
You could then construct the desired array as follows:
h = book.map { |book_id,title| [book_id, title[/\d+\z/].to_i-1] }.to_h
#=> {1=>0, 2=>1, 3=>2, 4=>3, 5=>4, 6=>4, 7=>4, 8=>5, 9=>6}
cols = h.values.max + 1
#=> 7
arr = Array.new(3) { Array.new(cols, 0) }
#=> [[0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0]]
user_book.each do |user_id, book_id, status|
arr[user_id-1][h[book_id]] = 1 if status == :completed
end
arr
#=> [[0, 0, 0, 1, 1, 0, 0],
# [0, 0, 1, 0, 1, 0, 0],
# [1, 1, 0, 1, 0, 1, 1]]
In straight SQL:
select * from books join user_books on (books.id = user_books.book_id)
where user_books.state = 'completed';
In Ruby ActiveRecord:
Book.joins(:user_books).where(user_books: { state: 'completed' })

Pandas timestamp and python datetime interpret timezone differently

I don't understand why a isn't the same as b:
import pandas as pd
from datetime import datetime
import pytz
here = pytz.timezone('Europe/Amsterdam')
a = pd.Timestamp('2018-4-9', tz=here).to_pydatetime()
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo'Europe/Amsterdam' CEST+2:00:00 DST>)
b = datetime(2018, 4, 9, 0, tzinfo=here)
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>)
print(b-a)
# returns 01:40:00
From this Stack Overflow post I learned that tzinfo doesn't work well with some timezones, and that could be the reason for the wrong result.
The pytz docs say:
Unfortunately using the tzinfo argument of the standard datetime constructors "does not work" with pytz for many timezones.
The solution is to use localize or astimezone:
import pandas as pd
from datetime import datetime
import pytz
here = pytz.timezone('Europe/Amsterdam')
a = pd.Timestamp('2018-4-9', tz=here).to_pydatetime()
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo'Europe/Amsterdam' CEST+2:00:00 DST>)
b = here.localize(datetime(2018, 4, 9))
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CEST+2:00:00 DST>)
print(b-a)
# returns 00:00:00
If you look at a and b,
a
datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CEST+2:00:00 DST>)
versus
b
datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>)
CEST is Central European Summer Time (+2:00), versus LMT, Local Mean Time (+0:20), which is exactly the 1:40 difference you see.
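As an aside, on Python 3.9+ the standard-library zoneinfo timezones don't have this pitfall, so the plain constructor form works; a minimal sketch, assuming Python 3.9+:
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

import pandas as pd

here = ZoneInfo('Europe/Amsterdam')
a = pd.Timestamp('2018-4-9', tz='Europe/Amsterdam').to_pydatetime()
b = datetime(2018, 4, 9, tzinfo=here)  # no localize() step needed with zoneinfo
print(b - a)
# 0:00:00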