Merging dataframe based on different conditions - dataframe

I'm doing the transition from SAS to python.
In SAS
proc SQL;
create table ABC as
select a.*, b.*
from table_1 as a inner join
table_2 as b
on a.ID = on b.ID and
a. week date \>= b. start date and a. week date \<= b. end date;
quit;
When I tried the above code in SAS, the observations between table a and table ABC are matched. But when I tried in python, I'm getting less number of observations compared to SAS. The week date, start date, end date are date variables in the format '2019-05-21'. whenever I used >=/<= in the date variables it shows an error like this.
TypeError: '>=' not supported between instances of 'Timestamp' and 'str'.
ABC =a. merge (b, left_on='ID', right_on='ID', how='left')
ABC [(ABC ['week date']>= (ABC ['start date '])) & (ABC ['week date'] \<= (ABC ['end date']))]

In python, if a & b are DataFrame compare type of variables (a.dtypes and b.dtypes)

Related

SQLite - Output count of all records per day including days with 0 records

I have a sqlite3 database maintained on an AWS exchange that is regularly updated by a Python script. One of the things it tracks is when any team generates a new post for a given topic. The entries look something like this:
id
client
team
date
industry
city
895
acme industries
blueteam
2022-06-30
construction
springfield
I'm trying to create a table that shows me how many entries for construction occur each day. Right now, the entries with data populate, but they exclude dates with no entries. For example, if I search for just
SELECT date, count(id) as num_records
from mytable
WHERE industry = "construction"
group by date
order by date asc
I'll get results that looks like this:
date
num_records
2022-04-01
3
2022-04-04
1
How can I make sqlite output like this:
date
num_records
2022-04-02
3
2022-04-02
0
2022-04-03
0
2022-04-04
1
I'm trying to generate some graphs from this data and need to be able to include all dates for the target timeframe.
EDIT/UPDATE:
The table does not already include every date; it only includes dates relevant to an entry. If no team posts work on a day, the date column will jump from day 1 (e.g. 2022-04-01) to day 3 (2022-04-03).
Given that your "mytable" table contains all dates you need as an assumption, you can first select all of your dates, then apply a LEFT JOIN to your own query, and map all resulting NULL values for the "num_records" field to "0" using the COALESCE function.
WITH cte AS (
SELECT date,
COUNT(id) AS num_records
FROM mytable
WHERE industry = "construction"
GROUP BY date
ORDER BY date
)
SELECT dates.date,
COALESCE(cte.num_records, 0) AS num_records
FROM (SELECT date FROM mytable) dates
LEFT JOIN cte
ON dates.date = cte.date

Compare 'YYYYMM' to two columns with differents data type Teradata

I'm using Teradata SQL Assistant, and I have an SQL issue similar to this one below.
Table_a (contains more than 500 million rows):
id| b | c |d
1|2021-05-03|2 020| 6
2|2019-04-22|2 019| 4
3|2018-11-15|2 018| 10
In my query I want to select only the rows where year/month of b is equal or greater than c/d, so in this example i will get only the row id = 2.
The problem here is that the column c is a smallint, d is byteint, and b is a date. I did not find a way to do to_char(b,'yyyymm') with c||d because c and d contains spaces and c column may have only one number (for example 6 and not 06 if the month is between January and September), so I did this:
select * from table_a
where cast(extract(year from b) as smallint) > c
or ( cast(extract(year from b) as smallint) = c and cast(extract(month from b) as byteint) >= d)
My query takes so much time and sometimes it fails with no more spool space (even if I put sample 10 for example). I guess the problem is in the data type and in the converting. I already tested it without converting, but I got the same performance issue.

calculate days difference between two days in hive

I am trying to display the records where difference in days between "empdate" column and current date is lesser than equal to 365.
The column empdate is of varchar datatype.I have written the below query but not able to achieve the result. Where i am getting all the records which are greater than 365 between the current date and empdate.
Can anybody please help me on this.
select * from table where
cast(datediff(from_unixtime(unix_timestamp(current_date 'yyyy-MM-dd'),'yy-MM-dd'),
from_unixtime(unix_timestamp(cast(empdate as string)'yyMMdd'),'yy-MM-dd') as int)<=365;
the below query might be helpful to you,
select datediff(to_date(from_unixtime(unix_timestamp(current_date),'yyyy-MM-dd')),
to_date(from_unixtime(unix_timestamp(empdate, 'yyMMdd')))) <=365 as daydiff
from time_table;
Test for the query from hive:
create table if not exists time_table(empdate string, empvalue string) row format
delimited fields terminated by ',' stored as textfile;
insert into table time_table values ('101001','A'),('200101','B'),
('100619','C'),('110707','D');
hive> select *, datediff(to_date(from_unixtime(unix_timestamp(current_date),'yyyy-MM-dd')),
to_date(from_unixtime(unix_timestamp(empdate, 'yyMMdd')))) <=365 as daydiff
from time_table;
OK
101001 A false
200101 B true
100619 C false
110707 D false

Visual Foxpro - Specific date subtracting two columns of numeric date and compare result

I am new to visual foxpro. I am trying to write the sql statements.
There are two columns of dates, data type is in numeric.
Column A date is in the YYYYMMDD format.
Column B date is in the YYYYMM format. DD is not available, thus I am only comparing the YYYYMM.
I need to subtract or find the difference between a specific date e.g. 31 August 2015 and the dates in column A and B. Once I have the difference, I need to compare and see if the difference in Column B is greater than Column A.
What I have thought is using substr and split the dates to YYYY and MM. Then I subtract it from the specific date, and then compare the YYYY portion to see if it column B is greater than column A.
Your description sounds as if columnA / 100 would give a comparable format.
So if you've got test data like these
CREATE CURSOR test (columnA Num(8), columnB Num(6))
INSERT INTO test VALUES (20150802, 201508)
INSERT INTO test VALUES (20150712, 201506)
... you can get all rows where colmumnB equals converted(columnA):
SELECT * FROM test WHERE INT(columnA / 100) = columnB
... or get the difference between A and B for all rows:
SELECT INT(columnA/100) - columnB FROM test
Or if you've got a date-type parameter, you can for example get all rows where columnB would match the parameter:
d = DATE(2015,8,31)
SELECT * FROM test WHERE columnB = YEAR(d) * 100 + MONTH(d)
If you want to do something different, I'd suggest to edit the question and add more details

Can SQL view have infinite number of rows? (Repeating schedule, each row a day?)

Can I have a view with an infinite number of rows? I don't want to
select all the rows at once, but is it possible to have a view that
represents a repeating weekly schedule, with rows for any date?
I have a database with information about businesses, their hours on
different days of the week. Their names:
# SELECT company_name FROM company;
company_name
--------------------
Acme, Inc.
Amalgamated
...
(47 rows)
Their weekly schedules:
# SELECT days, open_time, close_time
FROM hours JOIN company USING(company_id)
WHERE company_name='Acme, Inc.';
days | open_time | close_time
---------+-----------+-----------
1111100 | 08:30:00 | 17:00:00
0000010 | 09:00:00 | 12:30:00
Another table, not shown, has holidays they're closed.
So I can trivially create a user-defined function in the form of a
stored procedure that takes a particular date as an argument and
returns the business hours of each company:
SELECT company_name,open_time,close_time FROM schedule_for(current_date);
But I want to do it as a table query, in order that any
SQL-compatible host-language library will have no problem
interfacing with it, like this:
SELECT company_name, open_time, close_time
FROM schedule_view
WHERE business_date=current_date;
Relational database theory tells me that tables (relations) are
functions in the sense of being a unique mapping from each
primary key to a row (tuple). Obviously if the WHERE clause on
the above query were omitted it would result in a table (view)
having an infinite number of rows, which would be a practical issue. But
I'm willing to agree never to query such a view without a WHERE
clause that restricts the number of rows.
How can such a view be created (in PostgreSQL)? Or is a view even the way to do what I want?
Update
Here are some more details about my tables. The days of the week are saved as bits, and I select the appropriate row using a bit mask that has a single bit shifted once for each day of the requested week. To wit:
The company table:
# \d company
Table "company"
Column | Type | Modifiers
----------------+------------------------+-----------
company_id | smallint | not null
company_name | character varying(128) | not null
timezone | timezone | not null
The hours table:
# \d hours
Table "hours"
Column | Type | Modifiers
------------+------------------------+-----------
company_id | smallint | not null
days | bit(7) | not null
open_time | time without time zone | not null
close_time | time without time zone | not null
The holiday table:
# \d holiday
Table "holiday"
Column | Type | Modifiers
---------------+----------+-----------
company_id | smallint | not null
month_of_year | smallint | not null
day_of_month | smallint | not null
The function I currently have that does what I want (besides invocation) is defined as:
CREATE FUNCTION schedule_for(requested_date date)
RETURNS table(company_name text, open_time timestamptz, close_time timestamptz)
AS $$
WITH field AS (
/* shift the mask as many bits as the requested day of the week */
SELECT B'1000000' >> (to_char(requested_date,'ID')::int -1) AS day_of_week,
to_char(requested_date, 'MM')::int AS month_of_year,
to_char(requested_date, 'DD')::int AS day_of_month
)
SELECT company_name,
(requested_date+open_time) AT TIME ZONE timezone AS open_time,
(requested_date+close_time) AT TIME ZONE timezone AS close_time
FROM hours INNER JOIN company USING (company_id)
CROSS JOIN field
CROSS JOIN holiday
/* if the bit-mask anded with the DOW is the DOW */
WHERE (hours.days & field.day_of_week) = field.day_of_week
AND NOT EXISTS (SELECT 1
FROM holiday h
WHERE h.company_id = hours.company_id
AND field.month_of_year = h.month_of_year
AND field.day_of_month = h.day_of_month);
$$
LANGUAGE SQL;
So again, my goal is to be able to get today's schedule by doing this:
SELECT open_time, close_time FROM schedule_view
wHERE company='Acme,Inc.' AND requested_date=CURRENT_DATE;
and also be able to get the schedule for any arbitrary date by doing this:
SELECT open_time, close_time FROM schedule_view
WHERE company='Acme, Inc.' AND requested_date=CAST ('2013-11-01' AS date);
I'm assuming this would require creating the view here referred to as schedule_view but maybe I'm mistaken about that. In any event I want to keep any messy SQL code hidden from usage at the command-line-interface and client-language database libraries, as it currently is in the user-defined function I have.
In other words, I just want to invoke the function I already have by passing the argument in a WHERE clause instead of inside parentheses.
You could create a view with infinite rows by using a recursive CTE. But even that needs a starting point and a terminating condition or it will error out.
A more practical approach with set returning functions (SRF):
WITH x AS (SELECT '2013-10-09'::date AS day) -- supply your date
SELECT company_id, x.day + open_time AS open_ts
, x.day + close_time AS close_ts
FROM (
SELECT *, unnest(arr)::bool AS open, generate_subscripts(arr, 1) AS dow
FROM (SELECT *, string_to_array(days::text, NULL) AS arr FROM hours) sub
) sub2
CROSS JOIN x
WHERE open
AND dow = EXTRACT(ISODOW FROM x.day);
-- AND NOT EXISTS (SELECT 1 FROM holiday WHERE holiday = x.day)
-> SQLfiddle demo. (with constant day)
Expanding SRFs side-by-side is generally frowned upon (and for good reason, it's not in the SQL standard and show surprising behavior if the number of elements is not the same). The new feature WITH ORDINALITY in the upcoming Postgres 9.4 will allow cleaner syntax. Consider this related answer on dba.SE or similarly:
PostgreSQL unnest() with element number
I am assuming bit(7) as most effective data type for days. To work with it, I am converting it to an array in the first subquery sub.
Note the difference between ISODOW and DOW as field pattern for EXTRACT().
Updated question
Your function looks good, except for this line:
CROSS JOIN holiday
Otherwise, if I take the bit-shifting route, I end up with a similar query:
WITH x AS (SELECT '2013-10-09'::date AS day) -- supply your date
,y AS (SELECT day, B'1000000' >> (EXTRACT(ISODOW FROM day)::int - 1) AS dow
FROM x)
SELECT c.company_name, y.day + open_time AT TIME ZONE c.timezone AS open_ts
, y.day + close_time AT TIME ZONE c.timezone AS close_ts
FROM hours h
JOIN company c USING (company_id)
CROSS JOIN y
WHERE h.days & y.dow = y.dow;
AND NOT EXISTS ...
EXTRACT(ISODOW FROM requested_date)::int is just a faster equivalent of to_char(requested_date,'ID')::int
"Pass" day in WHERE clause?
To make that work you would have to generate a huge temporary table covering all possible days before selecting rows for the day in the WHERE clause. Possible (I would employ generate_series()), but very expensive.
My answer to your first draft is a smaller version of this: I expand all rows only for a pattern week before selecting the day matching the date in the WHERE clause. The tricky part is to display timestamps built from the input in the WHERE clause. Not possible. You are back to the huge table covering all days. Unless you have only few companies and a decently small date range, I would not go there.
This is built off the previous answers.
The sample data:
CREATE temp TABLE company (company_id int, company text);
INSERT INTO company VALUES
(1, 'Acme, Inc.')
,(2, 'Amalgamated');
CREATE temp TABLE hours(company_id int, days bit(7), open_time time, close_time time);
INSERT INTO hours VALUES
(1, '1111100', '08:30:00', '17:00:00')
,(2, '0000010', '09:00:00', '12:30:00');
create temp table holidays(company_id int, month_of_year int, day_of_month int);
insert into holidays values
(1, 1, 1),
(2, 1, 1),
(2, 1, 12) -- this was a saturday in 2013
;
First, just matching a date's day of week against the hours table's day of week, using the logic you provided:
select *
from company a
left join hours b
on a.company_id = b.company_id
left join holidays c
on b.company_id = c.company_id
where (b.days & (B'1000000' >> (to_char(current_date,'ID')::int -1)))
= (B'1000000' >> (to_char(current_date,'ID')::int -1))
;
Postgres lets you create custom operators to simplify expressions like in that where clause, so you might want an operator that matches the day of week between a bit string and a date. First the function that performs the test:
CREATE FUNCTION match_day_of_week(bit, date)
RETURNS boolean
AS $$
select ($1 & (B'1000000' >> (to_char($2,'ID')::int -1))) = (B'1000000' >> (to_char($2,'ID')::int -1))
$$
LANGUAGE sql IMMUTABLE STRICT;
You could stop there make in your where clause look something like "where match_day_of_week(days, some-date)". The custom operator makes this look a little prettier:
CREATE OPERATOR == (
leftarg = bit,
rightarg = date,
procedure = match_day_of_week
);
Now you've got syntax sugar to simplify that predicate. Here I've also added in the next test (that the month_of_year and day_of_month of a holiday don't correspond with the supplied date):
select *
from company a
left join hours b
on a.company_id = b.company_id
left join holidays c
on b.company_id = c.company_id
where b.days == current_date
and extract(month from current_date) != month_of_year
and extract(day from current_date) != day_of_month
;
For simplicity I start by adding an extra type (another awesome postgres feature) to encapsulate the month and day of the holiday.
create type month_day as (month_of_year int, day_of_month int);
Now repeat the process above to make another custom operator.
CREATE FUNCTION match_day_of_month(month_day, date)
RETURNS boolean
AS $$
select extract(month from $2) = $1.month_of_year
and extract(day from $2) = $1.day_of_month
$$
LANGUAGE sql IMMUTABLE STRICT;
CREATE OPERATOR == (
leftarg = month_day,
rightarg = date,
procedure = match_day_of_month
);
Finally, the original query is reduced to this:
select *
from company a
left join hours b
on a.company_id = b.company_id
left join holidays c
on b.company_id = c.company_id
where b.days == current_date
and not ((c.month_of_year, c.day_of_month)::month_day == current_date)
;
Reducing that down to a view looks like this:
create view x
as
select b.days,
(c.month_of_year, c.day_of_month)::month_day as holiday,
a.company_id,
b.open_time,
b.close_time
from company a
left join hours b
on a.company_id = b.company_id
left join holidays c
on b.company_id = c.company_id
;
And you could use that like this:
select company_id, open_time, close_time
from x
where days == current_date
and not (holiday == current_date)
;
Edit: You'll need to work on this logic a bit, by the way - this was more about showing the idea of how to do it with custom operators. For starters, if a company has multiple holidays defined you'll likely get multiple results back for that company.
I posted a similar response on PostgreSQL mailing list. Basically, avoiding the use of a function-invocation API in this situation is likely a foolish decision. The function call is the best API for this use-case. If you have a concrete scenario that you need to support where a function will not work then please provide that and maybe that scenario can be solved without having to compromise the PostgreSQL API. All your comments so far are about planning for an unknown future that very well may never come to be.