I am trying to get a count of the number of transactions that occur in the 7 days following each transaction in a dataset (including the transaction itself). For example, in the following set of transactions I want the count to be as follows.
┌────────────┬───────┐
│ Txn_Date   │ Count │
├────────────┼───────┤
│ 2020-01-01 │ 3     │
│ 2020-01-04 │ 3     │
│ 2020-01-05 │ 2     │
│ 2020-01-10 │ 1     │
│ 2020-01-18 │ 3     │
│ 2020-01-20 │ 2     │
│ 2020-01-24 │ 2     │
│ 2020-01-28 │ 1     │
└────────────┴───────┘
For each one it needs to be a rolling week, so row 1 counts all the transactions between 2020-01-01 and 2020-01-08, the second all transactions between 2020-01-04 and 2020-01-11 etc.
The code I have is:
select Txn_Date,
count(Txn_Date) over (partition by Txn_Date(column) where Txn_Date(rows) between Txn_Date and date_add('day', 14, Txn_Date) as Count
This code will not work in its current state, but hopefully it gives an idea of what I am trying to achieve. The database I am working in is Hive.
A good way to provide demo data is to put it into a table variable.
DECLARE @table TABLE (Txn_Date DATE, Count INT)
INSERT INTO @table (Txn_Date, Count) VALUES
('2020-01-01', 3),
('2020-01-04', 3),
('2020-01-05', 2),
('2020-01-10', 1),
('2020-01-18', 3),
('2020-01-20', 2),
('2020-01-24', 2),
('2020-01-28', 1)
If you're using T-SQL you can do this using the window function LAG after grouping the data by week.
SELECT DATEPART(WEEK, Txn_Date) AS Week,
       SUM(Count) AS Count,
       LAG(SUM(Count), 1) OVER (ORDER BY DATEPART(WEEK, Txn_Date)) AS LastWeekCount
FROM @table
GROUP BY DATEPART(WEEK, Txn_Date)
Week  Count  LastWeekCount
---------------------------
1     6      NULL
2     3      6
3     3      3
4     4      3
5     1      4
LAG lets you look back n rows for a column in a given order; here we go back 1 row in week-number order. To look in the other direction you can use LEAD in the same way.
We're also using the T-SQL function DATEPART to get the week number for each date, and grouping by that.
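If you want the rolling seven-day count directly in Hive, one option is a self-join; this is only a sketch, assuming a hypothetical table named transactions that holds the Txn_Date column:
-- For each transaction date, count the transactions that fall within
-- the 7 days that follow it (inclusive of the date itself).
SELECT t1.Txn_Date,
       COUNT(*) AS cnt
FROM transactions t1
CROSS JOIN transactions t2
WHERE t2.Txn_Date BETWEEN t1.Txn_Date AND DATE_ADD(t1.Txn_Date, 7)
GROUP BY t1.Txn_Date;
Depending on your Hive version, DATE_ADD may return a string rather than a date, so a cast on one side of the BETWEEN may be needed.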
I'm looking for a query that will allow me to do the following. Imagine this is the table, "News_Articles", with two columns: "ID" and "Headline".
NEWS_ARTICLES
ID   │ Headline
═════╪═════════════════════════════════════
0001 │ Today's News: Local Election Today!
0002 │ COVID-19 Rates Drop Today
0003 │ Today's the day to shop local
How can I make a result from this showing:
- One word per row (taken from the Headline column)
- A count of how many unique IDs it appears in
- A count of how many times the word appears in total across the whole dataset
DESIRED RESULT
Word     │ Unique_Count │ Total_Count
═════════╪══════════════╪═════════════
Today    │ 3            │ 4
Local    │ 2            │ 2
Election │ 1            │ 1
Ideally, we'd like to strip possessives and contractions from the words as well (note how "Today's" above is counted as "Today").
I'd also like to be able to remove filler words such as "the" or "a". Ideally this would be through some existing library, but if not, I can always remove the ones I see manually with a WHERE clause.
I would also change all characters to lowercase if needed.
Thank you!
You can use full-text search and unnest to extract the lexemes, then aggregate. The 'english' configuration also stems each word and drops stop words such as "the" and "a", which covers the filler-word requirement:
SELECT parts.lexeme AS word,
count(*) AS unique_count,
sum(cardinality(parts.positions)) AS total_count
FROM news_articles
CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
GROUP BY parts.lexeme;
word │ unique_count │ total_count
═══════╪══════════════╪═════════════
-19 │ 1 │ 1
covid │ 1 │ 1
day │ 1 │ 1
drop │ 1 │ 1
elect │ 1 │ 1
local │ 2 │ 2
news │ 1 │ 1
rate │ 1 │ 1
shop │ 1 │ 1
today │ 3 │ 4
(10 rows)
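If you would rather keep the literal, lowercased words instead of stemmed lexemes, here is a rough sketch using regexp_split_to_table, assuming the id and headline columns shown above; the word list in the WHERE clause is just an illustrative stand-in for a real stop-word list:
-- Lowercase each headline, split it on runs of non-alphanumeric characters,
-- then count distinct articles and total occurrences per word.
SELECT w.word,
       count(DISTINCT news_articles.id) AS unique_count,
       count(*) AS total_count
FROM news_articles
CROSS JOIN LATERAL regexp_split_to_table(lower(news_articles.headline), '[^a-z0-9]+') AS w(word)
WHERE w.word NOT IN ('', 's', 'the', 'a', 'to')
GROUP BY w.word;
This keeps "today" as a single word (the possessive 's is split off and filtered out), at the cost of doing the stop-word filtering by hand.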
I have a DataFrame that is 659 × 2 in size and is sorted by its Low column. Its first 20 rows can be seen below:
julia> size(dfl)
(659, 2)
julia> first(dfl, 20)
20×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-08-25 0.783125
6 │ 2010-05-25 0.808333
7 │ 2010-06-08 0.820938
8 │ 2010-07-20 0.82375
9 │ 2010-05-21 0.824792
10 │ 2010-08-16 0.842188
11 │ 2010-08-12 0.849688
12 │ 2010-02-25 0.871979
13 │ 2010-02-23 0.879896
14 │ 2010-07-30 0.890729
15 │ 2010-06-01 0.916667
16 │ 2010-08-06 0.949271
17 │ 2010-09-10 0.949792
18 │ 2010-03-04 0.969375
19 │ 2010-05-17 0.9875
20 │ 2010-03-09 1.0349
What I'd like to do is to filter out all rows in this dataframe such that only rows with monotonically increasing dates remain. So if applied to the first 20 rows above, I'd like the output to be the following:
julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-09-10 0.949792
Is there some high-level way to achieve this using Julia and DataFrames.jl?
I should also note that I originally prototyped the solution in Python using Pandas, and because it was just a proof of concept I didn't bother figuring out how to achieve this in Pandas either (assuming it's even possible). Instead, I just used a Python for loop to iterate over each row of the dataframe and only appended the rows whose dates were greater than the last date in the growing list.
I'm now trying to write this better in Julia, and looked into the filter and subset methods in DataFrames.jl. Intuitively, filter doesn't seem like it would work, since the user-supplied filter function can only access the contents of each row it is passed; subset might be feasible, since it has access to the entire column of data. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If not, then I guess I'll just have to stick with a for loop here too.
You need to use a for loop for this task in the end (you have to visit every value).
In Julia, loops are fast, so writing your own for loop does not hurt performance (see the sketch after the one-liner below).
If you are looking for something that is relatively short to type (but slower than a custom for loop, as it performs the operation in several passes), you can use e.g.:
dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> Day(0), true), :]
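For completeness, here is a minimal sketch of that explicit loop (the helper name monotone_rows is made up for illustration; it assumes the Date column shown above):
using Dates

# Keep a row only when its Date is strictly greater than every Date seen so far.
function monotone_rows(dates)
    keep = falses(length(dates))
    last_seen = typemin(Date)   # sentinel earlier than any real date
    for (i, d) in enumerate(dates)
        if d > last_seen
            keep[i] = true
            last_seen = d
        end
    end
    return keep
end

dfl[monotone_rows(dfl.Date), :]
This does a single pass over the column and allocates only the Bool mask.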
Using PostgreSQL, I would like to track the presence of each distinct id from one day to the next in the following table. How many were added / removed since the previous date? How many were present on both dates?
Date       │ Id
═══════════╪════
2021-06-28 │ 1
2021-06-28 │ 2
2021-06-28 │ 3
2021-06-29 │ 3
2021-06-29 │ 4
2021-06-29 │ 5
2021-06-30 │ 4
2021-06-30 │ 5
2021-06-30 │ 6
I am thus looking for a SQL query that returns this kind of result:
Date       │ Added │ Idle │ Removed
═══════════╪═══════╪══════╪═════════
2021-06-28 │ 3     │ 0    │ 0
2021-06-29 │ 2     │ 1    │ 2
2021-06-30 │ 1     │ 2    │ 1
Do you have any idea?
First, for each day and id, figure out the date on which that id last occurred.
The number of retained ids is, per day, the count of rows whose id also occurred on the previous day.
New ids are all values that were not retained, and removed ids are the previous day's rows that were not retained.
In SQL:
SELECT d,
       total - retained AS added,
       retained AS idle,
       lag(total, 1, 0::bigint) OVER (ORDER BY d) - retained AS removed
FROM (SELECT d,
             count(prev_d) FILTER (WHERE d - 1 = prev_d) AS retained,
             count(*) AS total
      FROM (SELECT d,
                   lag(d, 1, '-infinity') OVER (PARTITION BY id ORDER BY d) AS prev_d
            FROM mytable) AS q
      GROUP BY d) AS p;
d │ added │ idle │ removed
════════════╪═══════╪══════╪═════════
2021-06-28 │ 3 │ 0 │ 0
2021-06-29 │ 2 │ 1 │ 2
2021-06-30 │ 1 │ 2 │ 1
(3 rows)
I am looking for an elegant way to plot lines via Plots.jl, sorted by time stamp.
I want each line to represent the closeAsk (Float64) data field for a different year, judging by the time stamp's year, i.e. 2002, 2003, etc.
So, if we have time stamps from 2002 through 2019, as in the example below, we should get 18 lines on the graph.
julia> df2 = df[[:closeAsk, :time]]
5000×2 DataFrame
│ Row │ closeAsk │ time │
│ │ Float64 │ String │
├──────┼──────────┼─────────────────────────────┤
│ 1 │ 0.9949 │ 2002-11-28T22:00:00.000000Z │
│ 2 │ 0.995 │ 2002-11-30T22:00:00.000000Z │
⋮
│ 4998 │ 1.13414 │ 2019-02-06T22:00:00.000000Z │
│ 4999 │ 1.13244 │ 2019-02-07T22:00:00.000000Z │
│ 5000 │ 1.13251 │ 2019-02-10T22:00:00.000000Z │
The way I am thinking of is to use a comprehension to create a set of DataFrames representing the closeAsk field for each year, and feed them to plot(x, y), where y is an array of those butchered DataFrames.
Thanks in advance.
It should be simplest to use StatsPlots.jl like this:
using StatsPlots, Dates
d = Date.(first.(df2.time, 10))
df2.year = year.(d)
df2.day = @. dayofyear(d) + ((!isleapyear(d)) & (month(d) > 2))
@df df2 plot(:day, :closeAsk, group=:year)
Note that I create :day in a way that aligns month-day combinations on the x-axis correctly (controlling for the fact that different years may have a different collection of trading days, and making a leap-year correction where needed).
EDIT
Explanation of @. dayofyear(d) + ((!isleapyear(d)) & (month(d) > 2)):
@.: broadcasts all the function calls and operators that follow it
dayofyear(d): returns the day number within the given year; note that leap years have 366 days and other years have 365 days
((!isleapyear(d)) & (month(d) > 2)): adds 1 to the day number if the year is not a leap year and the date is past February; in this way we normalize all years to have 366 days (so that the same Month-Day combination gets the same day number in every year - note that we have to shift dates from March 1 through December 31 by +1)
Here is a short example (note that 2020 is leap year and 2021 is not):
julia> d = Date.(["2020-02-28", "2020-02-29", "2020-03-01", "2021-02-28", "2021-03-01"])
5-element Array{Date,1}:
2020-02-28
2020-02-29
2020-03-01
2021-02-28
2021-03-01
julia> @. dayofyear(d) + ((!isleapyear(d)) & (month(d) > 2))
5-element Array{Int64,1}:
59
60
61
59
61
In this way, the same value on the x-axis of your plot always corresponds to the same day of the year across all years.
I have the following table One:
id │ value
────┼───────
1 │ a
2 │ b
And Two:
id │ value
─────┼───────
10 │ a
20 │ a
30 │ b
40 │ a
50 │ b
One.value has a unique constraint, but Two.value does not (a one-to-many relationship).
Which SQL (Postgres) query will retrieve, as arrays, the ids of Two whose value matches One.value? The result I am looking for is:
id │ value
─────────────┼───────
{10,20,40} │ a
{30,50} │ b
SELECT array_agg(id) AS id, "value"
FROM Two
GROUP BY "value";
Using value as identifier (column name here) is a bad practice, as it is a reserved keyword.
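If Two.value could contain values that have no match in One, joining the two tables restricts the aggregation to matching values (a sketch building on the query above):
SELECT array_agg(Two.id) AS id, One."value"
FROM One
JOIN Two ON Two."value" = One."value"
GROUP BY One."value";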