I have time periods spent in different units per user in a table. The time periods overlap and I would like to fix that. I have:
user|unit|start_time|end_time
1| 1|2015-01-01|2015-01-31
1| 2|2015-01-07|2015-01-14
2| 1|2015-01-09|2015-01-13
2| 2|2015-01-10|2015-01-15
i.e. user 1 started at unit 1 on 2015-01-01, transferred to unit 2 on 2015-01-07, returned to unit 1 on 2015-01-14 and left unit 1 on 2015-01-31. The user can't be in two places at once, so the table should look more like this:
user|unit|start_time|end_time
1| 1|2015-01-01|2015-01-07 --fixed end_time
1| 2|2015-01-07|2015-01-14
1| 1|2015-01-14|2015-01-31 --newly created line
2| 1|2015-01-09|2015-01-10 --fixed end_time
2| 2|2015-01-10|2015-01-15
Here is some SQL to create the test table with some entries.
CREATE TABLE users_n_units
(
users character varying (100),
units character varying (100),
start_time date,
end_time date
);
INSERT INTO users_n_units (users,units,start_time,end_time)
VALUES ('1','1','2015-01-01','2015-01-31'),
('1','2','2015-01-07','2015-01-14'),
('2','1','2015-01-09','2015-01-13'),
('2','2','2015-01-10','2015-01-15');
You don’t really give enough information to fully answer this, and as others have pointed out you may end up with special cases so you should analyze what your data looks like carefully before running updates.
But in your test environment you can try something like this. The trick is to join your table to itself with clauses that restrict you to the rows matching your business logic, and then update it.
This statement works on your tiny sample set and just runs through and mechanically sets end times to the following time period’s start times. I have used something very similar to this on similar problems before so I know the mechanism should work for you.
CAUTION: not tested on anything other than this small set. Don’t run on production data!
UPDATE a SET a.end_time = b.start_time
FROM users_n_units a
INNER JOIN users_n_units b ON a.users = b.users AND a.units < b.units
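For what it's worth, here is a minimal sketch of the same self-join idea that also creates the "returned to unit 1" row from the desired output. It is written in Python against an in-memory SQLite database purely so that it can be run as-is. The function name fix_overlaps and the splitting rule (a row gets a follow-up row only when another row for the same user lies strictly inside it) are my assumptions, and it has only been checked against the four sample rows.

```python
import sqlite3

# Step 1: for every row b lying strictly inside a longer row a of the same
# user, add the "return" row that runs from b's end to a's original end.
ADD_RETURN_ROWS = """
INSERT INTO users_n_units (users, units, start_time, end_time)
SELECT a.users, a.units, b.end_time, a.end_time
FROM users_n_units a
JOIN users_n_units b
  ON a.users = b.users
 AND b.start_time > a.start_time
 AND b.end_time < a.end_time
"""

# Step 2: cut every row off at the earliest start_time that falls inside it.
TRIM_END_TIMES = """
UPDATE users_n_units
SET end_time = (
    SELECT MIN(b.start_time)
    FROM users_n_units b
    WHERE b.users = users_n_units.users
      AND b.start_time > users_n_units.start_time
      AND b.start_time < users_n_units.end_time
)
WHERE EXISTS (
    SELECT 1
    FROM users_n_units b
    WHERE b.users = users_n_units.users
      AND b.start_time > users_n_units.start_time
      AND b.start_time < users_n_units.end_time
)
"""

SAMPLE_ROWS = [
    ("1", "1", "2015-01-01", "2015-01-31"),
    ("1", "2", "2015-01-07", "2015-01-14"),
    ("2", "1", "2015-01-09", "2015-01-13"),
    ("2", "2", "2015-01-10", "2015-01-15"),
]


def fix_overlaps(rows):
    """Return the de-overlapped rows, ordered by user and start_time."""
    conn = sqlite3.connect(":memory:")
    # Dates are stored as ISO-8601 text, which compares correctly as strings.
    conn.execute(
        "CREATE TABLE users_n_units "
        "(users TEXT, units TEXT, start_time TEXT, end_time TEXT)"
    )
    conn.executemany("INSERT INTO users_n_units VALUES (?, ?, ?, ?)", rows)
    conn.execute(ADD_RETURN_ROWS)
    conn.execute(TRIM_END_TIMES)
    result = conn.execute(
        "SELECT users, units, start_time, end_time "
        "FROM users_n_units ORDER BY users, start_time"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in fix_overlaps(SAMPLE_ROWS):
        print(row)
```

The insert has to run before the update, because the update overwrites the original end_time that the new row needs. Data with several levels of nesting or with identical start times would need extra rules.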
I am working on a big (not real) task to manage the expenses of several countries. I have already calculated the capacities of every town in investments; now I need to calculate the budget to build these spaceships. The task is as follows:
We have the tables below (there are also tables Town and Spaceship, but the task is clear without them here). We need to calculate how much money is needed to complete each type of ship available for production. We have different types of spaceships, and each type needs different types of parts (see table Spaceship_required_part). Each town produces several types of parts (see table Spaceship_part_in_town). We need to calculate the cost (see cost in Spaceship_part, stage in Spaceship_part_in_town, and amount in Spaceship_required_part) of building one unit of every available type of spaceship. By available we mean that the parts needed can be found in the given city. We calculate the budget for a given city (I can do it for the rest of them by myself).
create table Spaceship_part(
id int PRIMARY KEY,
name text,
cost int
);
create table Spaceship_part_in_town(
id int PRIMARY KEY,
spaceship_part_id int references Spaceship_part,
city_id int references Town,
stage float -- the percentage of completion of the part
);
create table Spaceship_required_part(
id int PRIMARY KEY,
spaceship_part int references Spaceship_part,
spaceship int references Spaceship,
amount int -- amount of a particular part needed for the given spaceship
);
I understand how I would solve this task using a programming language, but my SQL skills are not that good. I understand that first I need to check which spaceships we can build using the parts available in the town. This can be done by comparing the number of parts needed (amount) with the number of parts available in the town (count(spaceship_part_id)). Then I need to calculate the sum needed to build every spaceship using the formula (100-stage)*cost/100.
However, I have no idea how to compose this in SQL code. I am writing in PostgreSQL.
The data model is like:
To build a spaceship with least build cost, we can:
Step 1. Calculate a part's build_cost = (100 - stage) * cost / 100; for each part, rank the build cost based on stage so we minimize total cost for a spaceship.
Step 2. Based on build_cost, we calculate the running total_cost of a part by required quantity (in order to compare with spaceship_required_part.amount) and note where the parts come from in part_sources, which is in CSV format: (city_id, stage, build_cost),...
Step 3. Once we have the available parts with total quantity and cost calculated, we join them with spaceship_required_part to get a result like this:
spaceship_id|spaceship_part_id|amount|total_cost|part_sources |
------------+-----------------+------+----------+---------------------+
1| 1| 2| 50.0|(4,80,20),(3,70,30) |
1| 2| 1| 120.0|(1,40,120) |
2| 2| 2| 260.0|(1,40,120),(2,30,140)|
2| 3| 1| 180.0|(2,40,180) |
3| 3| 2| 360.0|(2,40,180),(4,40,180)|
The above tells us that to build:
spaceship#1, we need part#1 x 2 sourced from city#4 and city#3; part#2 x 1 from city 1; total cost = 50 + 120 = 170, or
spaceship#2, we need part#2 x 2 sourced from city#1 and city#2; part#3 x 1 from city#2; total cost = 260 + 180 = 440, or
spaceship#3, we need part#3 x 2 from city#2 and city#4; total cost = 360.
After 1st iteration, we can update spaceship_part_in_town and remove the 1st spaceship from spaceship_required_part, then run the query again to get the 2nd spaceship to build and its part sources.
with cte_part_sources as (
select spt.spaceship_part_id,
spt.city_id,
sp.cost,
spt.stage,
(100.0-spt.stage)*sp.cost/100.0 as build_cost,
row_number() over (partition by spt.spaceship_part_id order by spt.stage desc) as cost_rank
from spaceship_part_in_town spt
join spaceship_part sp
on spt.spaceship_part_id = sp.id),
cte_parts as (
select spaceship_part_id,
city_id,
cost_rank,
cost,
stage,
build_cost,
cost_rank as total_qty,
sum(build_cost) over (partition by spaceship_part_id order by cost_rank) as total_cost,
string_agg('(' || city_id || ',' || stage || ',' || build_cost || ')',',') over (partition by spaceship_part_id order by cost_rank) as part_sources
from cte_part_sources)
select srp.spaceship_id,
srp.spaceship_part_id,
srp.amount,
p.total_cost,
p.part_sources
from spaceship_required_part srp
left join cte_parts p
on srp.spaceship_part_id = p.spaceship_part_id
and srp.amount = p.total_qty;
EDIT:
added db fiddle
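To make the three steps easier to check by hand, here is a rough Python sketch of the same calculation. The stage and cost values are reconstructed from the part_sources column of the sample output above (for example, a build cost of 20 at stage 80 implies a cost of 100). The function names are hypothetical, and unlike the query it ranks candidates by build_cost directly rather than by stage.

```python
from collections import defaultdict

# (spaceship_part_id, city_id, stage, cost), reconstructed from the sample output
PARTS_IN_TOWN = [
    (1, 4, 80, 100), (1, 3, 70, 100),
    (2, 1, 40, 200), (2, 2, 30, 200),
    (3, 2, 40, 300), (3, 4, 40, 300),
]

# (spaceship_id, spaceship_part_id, amount)
REQUIRED = [(1, 1, 2), (1, 2, 1), (2, 2, 2), (2, 3, 1), (3, 3, 2)]


def build_cost(stage, cost):
    """Step 1: money still needed to finish a part that is `stage` percent done."""
    return (100.0 - stage) * cost / 100.0


def cheapest_sources(part_id, amount):
    """Step 2: pick the `amount` cheapest copies of a part.

    Returns (total_cost, [city_id, ...]) or None if too few copies exist.
    """
    candidates = sorted(
        (build_cost(stage, cost), city)
        for pid, city, stage, cost in PARTS_IN_TOWN
        if pid == part_id
    )
    if len(candidates) < amount:
        return None
    chosen = candidates[:amount]
    return sum(c for c, _ in chosen), [city for _, city in chosen]


def spaceship_totals():
    """Step 3: total cost per spaceship, pricing each ship independently."""
    totals = defaultdict(float)
    unbuildable = set()
    for ship, part, amount in REQUIRED:
        picked = cheapest_sources(part, amount)
        if picked is None:
            unbuildable.add(ship)
        else:
            totals[ship] += picked[0]
    return {ship: total for ship, total in totals.items() if ship not in unbuildable}


if __name__ == "__main__":
    print(cheapest_sources(1, 2))
    print(spaceship_totals())
```

As in the query, every spaceship is priced as if it were the only one being built, so two ships may claim the same physical part. That is why the parts table has to be updated between iterations.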
I want to join or update the following two tables and also add up df for existing words. So if the word endeavor does not exist in the first table, it should be added with its df value, and if the word hallo exists in both tables, its df values should be summed.
FYI I'm using MariaDB and PySpark to do word counts on documents and calculate tf, df, and tfidf values.
Table name: df
+--------+----+
| word| df|
+--------+----+
|vicinity| 5|
| hallo| 2|
| admire| 3|
| settled| 1|
+--------+----+
Table name: word_list
+----------+---+
| word| df|
+----------+---+
| hallo| 1|
| settled| 1|
| endeavor| 1|
+----------+---+
So in the end the updated/combined table should look like this:
+----------+---+
| word| df|
+----------+---+
| vicinity| 5|
| hallo| 3|
| admire| 3|
| settled| 2|
| endeavor| 1|
+----------+---+
What I've tried to do so far is the following:
SELECT df.word, df.df + word_list.df FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df JOIN word_list ON df.word=word_list.word
SELECT df.word FROM df FULL OUTER JOIN word_list ON df.word=word_list.word
None of them worked: I get either a table with just null values, some null values, or an exception. I'm sure there must be an easy SQL statement to achieve this, but I've been stuck on it for hours and haven't found anything relevant on Stack Overflow.
You just need to UNION the two tables first, then aggregate on the word. Since the tables are identically structured it's very easy. Look at this fiddle. I have used MariaDB 10.3 since you didn't specify a version, but these queries should work in just about any DBMS.
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=c6d86af77f19fc1f337ad1140ef07cd2
select word, sum(df) as df
from (
select * from df
UNION ALL
select * from word_list
) z
group by word
order by sum(df) desc;
UNION is the vertical cousin of JOIN: UNION stacks two datasets vertically (row-wise), while JOIN combines them horizontally by adding columns to the output. Both datasets need to have the same number of columns for the UNION to work, and you need UNION ALL here so that the union returns all rows, because the default behavior is to return only unique rows. In this dataset, since settled has a value of 1 in both tables, it would have only one entry in the UNION without the ALL keyword, so the summed df would be 1 instead of the 2 you are expecting.
The ORDER BY isn't necessary if you are just transferring to a new table. I just added it to get my results in the same order as your sample output.
Let me know if this worked for you.
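If you want to see the UNION versus UNION ALL difference without opening the fiddle, here is a small sketch that runs the same query on the sample data. It uses Python's built-in sqlite3 module as a stand-in for MariaDB so that it is self-contained, and the helper name combined_df is just for illustration.

```python
import sqlite3


def combined_df(union_keyword="UNION ALL"):
    """Stack df and word_list with the given keyword, then sum df per word."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE df (word TEXT, df INTEGER);
        CREATE TABLE word_list (word TEXT, df INTEGER);
        INSERT INTO df VALUES
            ('vicinity', 5), ('hallo', 2), ('admire', 3), ('settled', 1);
        INSERT INTO word_list VALUES
            ('hallo', 1), ('settled', 1), ('endeavor', 1);
    """)
    rows = conn.execute(f"""
        SELECT word, SUM(df) AS df
        FROM (
            SELECT * FROM df
            {union_keyword}
            SELECT * FROM word_list
        ) z
        GROUP BY word
    """).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(combined_df("UNION ALL"))  # keeps both ('settled', 1) rows
    print(combined_df("UNION"))      # collapses them into one row first
```

With plain UNION the two identical ('settled', 1) rows are merged before the sum, which is exactly the undercount described above.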
I have a problem that I know how to solve (more or less) using a regular programming language in a non-optimal but good enough way.
I want to get a list of groups of points such that the points within each group are within a certain range of each other, and the groups do not overlap.
For example, the points in group A are at a distance of 1 or less from each other, and the same holds for the points in group B, but all points in A are at least 1.1 away from all points in group B.
The way I would do this in a programming language (in a non-optimal way, as I said) would be to pick any point and find all points within a range of 1 or less (call it group A), then pick one point that is not in that group and find all points that are not in group A and are within a distance of 1 or less (call it group B). Then loop again, now taking groups A and B into account.
It's also worth mentioning that some points will have a flag marking them as processed (previously grouped, with that group saved); they should be ignored. Hopefully this may speed up the query when several groups already exist.
I'm not sure if this is a task that can be accomplished within SQL in a single query or whether it would be better to extract the data from the database and make another query with the new parameters.
My points are multi-dimensional (vectors of 128). But for simplicity, these are some example ones:
id |x |y |z |
---|---------------------|----------------------|----------------------|
1| -0.03909766674041748| 0.03122374415397644| 0.02698654681444168|
2| -0.09763473272323608| 0.04069424420595169| 0.11512677371501923|
3|-0.040237002074718475| 0.0678766518831253| 0.03919816389679909|
4| -0.10432711988687515| 0.07187126576900482| 0.10971983522176743|
5| -0.1513511687517166| 0.07631294429302216| 0.05949840694665909|
6| -0.1276567280292511| 0.11543292552232742| 0.06757785379886627|
A query that I often use to find the closest points (very simplified) is this:
SELECT id, sqrt(
power(X - -0.10434776544570923, 2) +
power(Y - 0.08688679337501526, 2)
) AS distance
FROM points
HAVING distance < 0.5
ORDER BY distance ASC
I'm not sure how the output data would look in tabular form, but ideally I want something like this:
ref_point | point_ids
----------|-------------
2 | 12, 15, 16, 255, 85
6 | 8, 12, 55, 44
Where ref_point is a point id that is at a distance of less than 1 from all the points in its group.
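For reference, the loop described above can be sketched outside the database roughly as follows. This is only the greedy "pick a point, collect its neighbours" procedure, not an optimal clustering: the name greedy_groups and the sample coordinates are made up for illustration, and it assumes Python 3.8+ for math.dist, which works for vectors of any length.

```python
import math


def greedy_groups(points, radius, processed=frozenset()):
    """Greedily group points around reference points.

    points:    dict mapping id -> coordinate tuple (any dimension)
    radius:    maximum distance between a reference point and its members
    processed: ids flagged as already grouped; they are ignored entirely
    Returns a dict mapping ref_point id -> list of member ids.
    """
    unassigned = [pid for pid in sorted(points) if pid not in processed]
    groups = {}
    while unassigned:
        ref = unassigned.pop(0)  # pick any remaining point as the reference
        members = [
            pid for pid in unassigned
            if math.dist(points[ref], points[pid]) <= radius
        ]
        groups[ref] = members
        # Points that joined this group cannot join a later one.
        unassigned = [pid for pid in unassigned if pid not in members]
    return groups


if __name__ == "__main__":
    sample = {
        1: (0.0, 0.0), 2: (0.5, 0.0),
        3: (5.0, 5.0), 4: (5.0, 5.2),
        5: (0.1, 0.1),  # flagged as processed below, so it is skipped
        6: (9.0, 9.0),
    }
    print(greedy_groups(sample, radius=1.0, processed={5}))
```

Note that this only guarantees that each member is within the radius of its reference point. Two members of the same group can be up to twice the radius apart, and nothing enforces the 1.1 minimum separation between different groups, so those constraints would need an additional check.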
I'm just learning how to manipulate strings within SQL tables and am now trying to combine string manipulation with column value calculations. My problem states that I limit a serial number, denoted by "xx-yyyyyyy", to its first two values (without the hyphen) and then add cost values together (that relate to these serial values) after creation of these new serial numbers. However, when I add the cost values together, I am getting an incorrect result due to serial values not adding together (duplicate serial values within my output table). My question is, how do I go about entering my code so that I have no duplicate serial values in my output and all values (excluding NULLs) are added together?
Example table that I am working with is like so:
____Serial____|____Cost____
1| xx-yyyyyy | $aaa.bb
2| xx-yyyyyy | $aaa.bb
3| ... | ...
Here is my code that I have currently tried:
SELECT left(Serial, CHARINDEX('-', Serial)-1) AS NewSerial, sum(cost) AS TotalCost
FROM table
WHERE CHARINDEX('-', serial) > 0
GROUP BY Serial
ORDER BY TotalCost DESC
The results did add together cost values, but it did leave duplicate NewSerial values (which I assume is due to the GROUP BY clause).
Output (From my code):
_|___NewSerial____|____TotalCost____
1| ab | $abc.de
2| cd | $abc.de
3| ab | $abc.de
4| ef | $abc.de
5| cd | $abc.de
How can I go about fixing/solving this issue within this area so that the NewSerial values all add together rather than stay separate like in my output?
You need to repeat the expression in the GROUP BY:
SELECT left(Serial, CHARINDEX('-', Serial)-1) AS NewSerial, sum(cost) AS TotalCost
FROM table
WHERE CHARINDEX('-', serial) > 0
GROUP BY left(Serial, CHARINDEX('-', Serial)-1)
ORDER BY TotalCost DESC
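Here is a small runnable sketch of why the GROUP BY expression matters. It uses Python's built-in sqlite3 module, so substr and instr stand in for SQL Server's LEFT and CHARINDEX. The table name serials, the helper totals and the sample rows are made up for illustration.

```python
import sqlite3

# SQLite equivalent of left(Serial, CHARINDEX('-', Serial) - 1)
PREFIX = "substr(Serial, 1, instr(Serial, '-') - 1)"

SAMPLE = [
    ("ab-0000001", 10.0),
    ("ab-0000002", 5.0),
    ("cd-0000001", 7.5),
    ("nohyphen", 1.0),  # filtered out by the WHERE clause
]


def totals(group_by):
    """Run the query with the given GROUP BY expression and return all rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE serials (Serial TEXT, Cost REAL)")
    conn.executemany("INSERT INTO serials VALUES (?, ?)", SAMPLE)
    rows = conn.execute(f"""
        SELECT {PREFIX} AS NewSerial, SUM(Cost) AS TotalCost
        FROM serials
        WHERE instr(Serial, '-') > 0
        GROUP BY {group_by}
        ORDER BY TotalCost DESC
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(totals("Serial"))  # one row per full serial, so 'ab' appears twice
    print(totals(PREFIX))    # one row per two-letter prefix
```

Grouping by the full Serial makes one group per original serial number, and the prefix is only computed afterwards for display, which is where the duplicate NewSerial rows come from.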
I am working through a group by problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: For each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: How can I sum employment (or sales) by county by year as in the example table above, given that county as a grouping variable changes sometimes by the year and specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question, this sounds relatively straightforward. While I normally prefer normalized data to work with, I don't see that normalizing things beforehand will buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate the list of columns to sum up. The rest is a straight group by and join process as far as I can tell.
If the variable changes by year, the best thing to do in my experience is to put together a location view based on a union and join against it. This lets you hide the complexity from your main queries and, as long as you don't also join the underlying tables, should perform quite well.
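A rough sketch of that union-view idea, in case it helps. It runs on an in-memory SQLite database from Python so it can be tried as-is, but the generated SQL is plain enough that it should carry over to Postgres. The county table layout (county1990, county1991, ...) is my assumption by analogy with the emp table, the county assignments are invented, only two years are shown, and the view names emp_long and county_long are hypothetical.

```python
import sqlite3

YEARS = [1990, 1991]  # extend to the full range of year columns


def long_form(table, prefix):
    """Build a UNION ALL query turning <prefix><year> columns into rows."""
    return " UNION ALL ".join(
        f"SELECT dunsnumber, {year} AS year, {prefix}{year} AS val FROM {table}"
        for year in YEARS
    )


def employment_by_county_year():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE emp (dunsnumber TEXT PRIMARY KEY, emp1990 INTEGER, emp1991 INTEGER);
        CREATE TABLE county (dunsnumber TEXT PRIMARY KEY, county1990 TEXT, county1991 TEXT);
        INSERT INTO emp VALUES ('a', 12, 32), ('b', NULL, 2), ('d', 40, 86);
        -- business 'd' moves from county A to county B in 1991
        INSERT INTO county VALUES ('a', 'A', 'A'), ('b', 'B', 'B'), ('d', 'A', 'B');
    """)
    # The views hide the wide layout from the main query.
    conn.execute(f"CREATE VIEW emp_long AS {long_form('emp', 'emp')}")
    conn.execute(f"CREATE VIEW county_long AS {long_form('county', 'county')}")
    rows = conn.execute("""
        SELECT c.val AS county, e.year, SUM(e.val) AS emp
        FROM emp_long e
        JOIN county_long c
          ON c.dunsnumber = e.dunsnumber AND c.year = e.year
        WHERE e.val IS NOT NULL
        GROUP BY c.val, e.year
        ORDER BY c.val, e.year
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in employment_by_county_year():
        print(row)
```

Because the join is on both dunsnumber and year, each year's employment is counted in the county the business was in that year. The result is in long form (county, year, emp); pivoting it back to one column per year can be done afterwards on the much smaller aggregated result.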