Frequency of Words in Column of Strings - sql

I'm looking for a query that will let me count how often each word appears across a column of strings.
Imagine this is the table, "News_Articles", with two columns: "ID" & "Headline":
NEWS_ARTICLES
 ID   │ Headline
──────┼──────────────────────────────────────
 0001 │ Today's News: Local Election Today!
 0002 │ COVID-19 Rates Drop Today
 0003 │ Today's the day to shop local
How can I make a result from this showing:
One word per row (from the headline column)
A count of how many unique IDs it appears in
A count of how many total times the word appears in the whole dataset
DESIRED RESULT
 Word     │ Unique_Count │ Total_Count
──────────┼──────────────┼────────────
 Today    │            3 │           4
 Local    │            2 │           2
 Election │            1 │           1
Ideally, we'd also like to strip contractions and possessive endings from the words (note that "Today's" above is counted as "Today").
I'd also like to be able to remove filler words such as "the" or "a". Ideally this would be through some existing library, but if not, I can always manually remove the ones I see with a WHERE clause.
I would also change all characters to lowercase if needed.
Thank you!

You can use full text search and unnest to extract the lexemes, then aggregate:
SELECT parts.lexeme AS word,
count(*) AS unique_count,
sum(cardinality(parts.positions)) AS total_count
FROM news_articles
CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
GROUP BY parts.lexeme;
word │ unique_count │ total_count
═══════╪══════════════╪═════════════
-19 │ 1 │ 1
covid │ 1 │ 1
day │ 1 │ 1
drop │ 1 │ 1
elect │ 1 │ 1
local │ 2 │ 2
news │ 1 │ 1
rate │ 1 │ 1
shop │ 1 │ 1
today │ 3 │ 4
(10 rows)
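Note that the 'english' configuration used in to_tsvector already lowercases, stems, and drops common stop words such as "the" and "a", which is why "Today's" ends up counted under "today" above. If you also want to discard leftover tokens such as "-19", you can filter the lexemes yourself. Here is a rough sketch of that (assuming the same news_articles table, and assuming that keeping only purely alphabetic lexemes of three or more characters is acceptable):
SELECT parts.lexeme AS word,
       count(*) AS unique_count,
       sum(cardinality(parts.positions)) AS total_count
FROM news_articles
  CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
WHERE parts.lexeme ~ '^[a-z]{3,}$'   -- assumption: treat only alphabetic lexemes as "words"
GROUP BY parts.lexeme
ORDER BY total_count DESC;
Adjust the regular expression to whatever definition of a "word" you need.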

Counting number of transactions in the following week SQL

I am trying to get a count of the number of transactions that occur in the 7 days following each transaction in a dataset (including the transaction itself). For example, in the following set of transactions I want the count to be as follows.
┌─────────────┬──────────────┐
│ Txn_Date │ Count │
├─────────────┼──────────────┤
│ 2020-01-01 │ 3 │
│ 2020-01-04 │ 3 │
│ 2020-01-05 │ 2 │
│ 2020-01-10 │ 1 │
│ 2020-01-18 │ 3 │
│ 2020-01-20 │ 2 │
│ 2020-01-24 │ 2 │
│ 2020-01-28 │ 1 │
└─────────────┴──────────────┘
For each one it needs to be a rolling week, so row 1 counts all the transactions between 2020-01-01 and 2020-01-08, the second all transactions between 2020-01-04 and 2020-01-11 etc.
The code I have is:
select Txn_Date,
count(Txn_Date) over (partition by Txn_Date(column) where Txn_Date(rows) between Txn_Date and date_add('day', 14, Txn_Date) as Count
This code will not work in its current state, but hopefully gives an idea of what I am trying to achieve. The database I am working in is Hive.
A good way to provide demo data is to put it into a table variable.
DECLARE @table TABLE (Txn_Date DATE, [Count] INT)
INSERT INTO @table (Txn_Date, [Count]) VALUES
('2020-01-01', 3),
('2020-01-04', 3),
('2020-01-05', 2),
('2020-01-10', 1),
('2020-01-18', 3),
('2020-01-20', 2),
('2020-01-24', 2),
('2020-01-28', 1)
If you're using TSQL you can do this using the windowed function LAG after grouping the data by week.
SELECT DATEPART(WEEK, Txn_Date) AS Week,
       SUM([Count]) AS [Count],
       LAG(SUM([Count]), 1) OVER (ORDER BY DATEPART(WEEK, Txn_Date)) AS LastWeekCount
FROM @table
GROUP BY DATEPART(WEEK, Txn_Date)
Week Count LastWeekCount
-----------------------------
1 6 NULL
2 3 6
3 3 3
4 4 3
5 1 4
LAG literally lets you go back n rows for a column in a specific order. For this we wanted to go back 1 row in week-number order. To move in the opposite direction you can use LEAD in the same way.
We're also using the TSQL function DATEPART to get the week number for the date, and grouping by that.
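One caveat: grouping by DATEPART(WEEK, ...) gives counts per calendar week, not the rolling 7-day count per transaction that the question asks for. For a true rolling window, a self-join is one option. The following is an untested sketch that assumes a Hive table named transactions with a Txn_Date column (Hive's DATE_ADD takes a date and an integer number of days):
SELECT t1.Txn_Date,
       COUNT(*) AS txn_count
FROM transactions t1
CROSS JOIN transactions t2
WHERE t2.Txn_Date >= t1.Txn_Date
  AND t2.Txn_Date < DATE_ADD(t1.Txn_Date, 7)   -- 7-day window including the transaction itself
GROUP BY t1.Txn_Date
ORDER BY t1.Txn_Date;
On large tables this cross join can be expensive, so you may want to restrict it to a bounded date range first.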

Extracting Data from .csv File in Julia

I'm quite new to Julia and I have a .csv file, stored inside a gzip archive, from which I want to extract some information for educational purposes and to get to know the language better.
In Python there are many helpful functions from Pandas to help with that, but I can't seem to get the problem straight...
This is my code (I KNOW, VERY WEAK!!!):
import Pkg
#Pkg.add("CSV")
#Pkg.add("DataFrames")
#Pkg.add("CSVFiles")
#Pkg.add("CodecZlib")
#Pkg.add("GZip")
using CSVFiles
using Pkg
using CSV
using DataFrames
using CodecZlib
using GZip
df = CSV.read("Path//to//file//file.csv.gzip", DataFrame)
print(df)
I added a screenshot to show what the columns inside the .csv file look like.
I would like to extract the dates and make some sort of Top 10 of the users with the most comments, Top 10 days with the most threads, etc.
I would like to point out that this is not an exercise given to me, but training I would like to do for myself.
I know the Pandas version of this looks like this:
df['threadcreateddate'] = pd.to_datetime(df['thread_created_utc']).dt.date
or
df['commentcreateddate'] = pd.to_datetime(df['comment_created_utc']).dt.date
And to sort it:
df_number_of_threads = df.groupby('threadcreateddate')['thread_id'].nunique()
If I were to plot it:
df_number_of_threads.plot(kind='line')
plt.show()
To print:
head = df.head()
print(df_number_of_threads.sort_values(ascending=False).head(10))
Can someone help? The df.select() function didn't work for me.
1. Packages
We obviously need DataFrames.jl. And since we're dealing with dates in the data, and doing a plot later, we'll include Dates and Plots as well.
As this example in CSV.jl's documentation shows, no additional packages are needed for gzipped data. CSV.jl can decompress automatically. So, you can remove the other using statements from your list.
julia> using CSV, DataFrames, Dates, Plots
2. Preparing the Data Frame
You can use CSV.read to load the data into the Data Frame, as in the question. Here, I'll use some sample (simplified) data for illustration, with just 4 columns:
julia> df
6×4 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc
│ Int64 String Int64 String
─────┼─────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00
3. Converting from String to DateTime
To extract the thread dates from the string columns we have, we'll use the Dates standard library.
Depending on the exact format your dates are in, you might have to pass a date format argument for conversion to Dates data types (see the Constructors section of Dates in the Julia manual). Here in the sample data, the dates are in ISO standard format, so we don't need to specify the date format explicitly.
In Julia, we can get the date directly without intermediate conversion to a date-time type, but since it's a good idea to have the columns be in the proper type anyway, we'll first convert the existing columns from strings to DateTime:
julia> transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(DateTime), renamecols = false)
6×4 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc
│ Int64 DateTime Int64 DateTime
─────┼─────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00
Though it looks similar, this data frame doesn't use Strings for the date-time columns; instead it has proper DateTime values.
(For an explanation of how this transform! works, see the DataFrames manual: Selecting and transforming columns.)
Edit: Based on the screenshot added to the question now, in your case you'd use transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(s -> DateTime(s, dateformat"yyyy-mm-dd HH:MM:SS.s")), renamecols = false).
4. Creating Date columns
Now, creating the date columns is as easy as:
julia> df.threadcreateddate = Date.(df.thread_created_utc);
julia> df.commentcreateddate = Date.(df.comment_created_utc);
julia> df
6×6 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc commentcreateddate threadcreateddate
│ Int64 DateTime Int64 DateTime Date Date
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00 2022-08-13 2022-08-13
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00 2022-08-14 2022-08-13
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00 2022-08-15 2022-08-13
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00 2022-08-16 2022-08-16
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00 2022-08-17 2022-08-16
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00 2022-08-18 2022-08-16
These could also be written as a transform! call, and in fact the transform! call in the previous code segment could instead have been replaced with df.thread_created_utc = DateTime.(df.thread_created_utc) and df.comment_created_utc = DateTime.(df.comment_created_utc). However, transform! offers a very powerful and flexible syntax that can do a lot more, so it's useful to familiarize yourself with it if you're going to work with DataFrames.
5. Getting the number of threads per day
julia> gdf = combine(groupby(df, :threadcreateddate), :thread_id => length ∘ unique => :number_of_threads)
2×2 DataFrame
Row │ threadcreateddate number_of_threads
│ Date Int64
─────┼──────────────────────────────────────
1 │ 2022-08-13 1
2 │ 2022-08-16 1
Note that df.groupby('threadcreateddate') becomes groupby(df, :threadcreateddate), which is a common pattern in Python-to-Julia conversions. Julia doesn't use the . based object-oriented syntax, and instead the data frame is one of the arguments to the function.
length ∘ unique uses the function composition operator ∘; the result is a function that applies unique and then length. Here we take the unique values of the thread_id column in each group, apply length to them (so, the equivalent of nunique), and store the result in the number_of_threads column of a new data frame called gdf.
6. Plotting
julia> plot(gdf.threadcreateddate, gdf.number_of_threads)
Since our grouped data frame conveniently contains both the date and the number of threads, we can plot the number_of_threads against the dates, making for a nice and informative visualization.
As Sundar R commented, it is hard to give a precise answer for your data, as there might be relevant details missing. But here is a general pattern you can follow:
julia> using DataFrames
julia> df = DataFrame(id = [1, 1, 2, 2, 2, 3])
6×1 DataFrame
Row │ id
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 2
4 │ 2
5 │ 2
6 │ 3
julia> first(sort(combine(groupby(df, :id), nrow), :nrow, rev=true), 10)
3×2 DataFrame
Row │ id nrow
│ Int64 Int64
─────┼──────────────
1 │ 2 3
2 │ 1 2
3 │ 3 1
What this code does:
groupby groups the data by the column you want to aggregate
combine with the nrow argument counts the number of rows in each group and stores it in the :nrow column (this is the default; you could choose another column name)
sort sorts the data frame by :nrow, and rev=true makes the order descending
first picks the first 10 rows from this data frame
If you want something more similar to dplyr in R with piping, you can use @chain, which is exported by DataFramesMeta.jl:
julia> using DataFramesMeta
julia> @chain df begin
           groupby(:id)
           combine(nrow)
           sort(:nrow, rev=true)
           first(10)
       end
3×2 DataFrame
Row │ id nrow
│ Int64 Int64
─────┼──────────────
1 │ 2 3
2 │ 1 2
3 │ 3 1

Filter/select rows by comparing to previous rows when using DataFrames.jl?

I have a DataFrame that's 659 x 2 in its size, and is sorted according to its Low column. Its first 20 rows can be seen below:
julia> size(dfl)
(659, 2)
julia> first(dfl, 20)
20×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-08-25 0.783125
6 │ 2010-05-25 0.808333
7 │ 2010-06-08 0.820938
8 │ 2010-07-20 0.82375
9 │ 2010-05-21 0.824792
10 │ 2010-08-16 0.842188
11 │ 2010-08-12 0.849688
12 │ 2010-02-25 0.871979
13 │ 2010-02-23 0.879896
14 │ 2010-07-30 0.890729
15 │ 2010-06-01 0.916667
16 │ 2010-08-06 0.949271
17 │ 2010-09-10 0.949792
18 │ 2010-03-04 0.969375
19 │ 2010-05-17 0.9875
20 │ 2010-03-09 1.0349
What I'd like to do is to filter out all rows in this dataframe such that only rows with monotonically increasing dates remain. So if applied to the first 20 rows above, I'd like the output to be the following:
julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-09-10 0.949792
Is there some high-level way to achieve this using Julia and DataFrames.jl?
I should also note that I originally prototyped the solution in Python using Pandas, and because it was just a PoC I didn't bother trying to figure out how to achieve this using Pandas either (assuming it's even possible). Instead, I just used a Python for loop to iterate over each row of the dataframe and only appended the rows whose dates were greater than the last date of the growing list.
I'm now trying to write this better in Julia, and looked into the filter and subset methods in DataFrames.jl. Intuitively filter doesn't seem like it'd work, since the user-supplied filter function can only access contents from each passed row; subset might be feasible since it has access to the entire column of data. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If not, then I guess I'll just have to stick with using a for loop here too.
You need to use a for loop for this task in the end (you have to loop over all the values).
In Julia loops are fast so using your own for loop does not hinder performance.
If you are looking for something that is relatively short to type (but it will be slower than a custom for loop as it will perform the operation in several passes) you can use e.g.:
# keep rows whose Date exceeds the running maximum of all earlier rows; pushfirst! keeps row 1
dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> 0, true), :]

SQL Server - Join to avoid nulls

I'm sorry I couldn't explain the issue clearly. The actual problem is that I have a transaction table containing item transactions such as purchases and sales across various locations. I need to find the unit purchase cost of all items across all branches. Now, in a given location, not all items may have been purchased, while all items are purchased at the central warehouse. That is, some items are transferred from the warehouse to the locations instead of being purchased at the location. In such cases, the unit cost should be picked from the central warehouse purchase data.
Now, I can get the items and purchase cost for each location from the transaction table, given that the item was purchased at the location. My question was how to fetch the central warehouse price for items that have no purchase history in the transaction table and list it along with all the other location purchase costs. Why it's difficult: if there is no purchase history, I have no item number to search for in the central warehouse.
Frankly, I do not know how to do this through a SQL query in a single go. Hence, as a first step, I made a master view containing all branches and items. This is not ideal because the data is huge: I have around 50 locations and 200K items, resulting in 50 x 200K rows. However, it served the purpose of acting as a location-item master.
Second, I made a central warehouse master with the item and purchase cost at the warehouse.
Third, I queried the transaction table to fetch items that have no purchase at specific locations. These item IDs were linked to the location-item master, and a CASE statement was used so that if the purchase cost is NULL, the cost is taken from the warehouse.
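For reference, the whole fallback can be expressed in one query using COALESCE instead of a CASE expression. This is only a sketch: every table and column name below (location_item_master, location_purchases, warehouse_master, item_id, location_id, unit_cost) is an assumed placeholder, not the real schema:
SELECT m.location_id,
       m.item_id,
       COALESCE(p.unit_cost, w.unit_cost) AS unit_cost   -- fall back to warehouse cost when no local purchase
FROM location_item_master AS m                           -- one row per (location, item)
LEFT JOIN location_purchases AS p                        -- purchase cost at the location, where it exists
       ON p.location_id = m.location_id
      AND p.item_id = m.item_id
LEFT JOIN warehouse_master AS w                          -- purchase cost at the central warehouse
       ON w.item_id = m.item_id;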
Thank you for pointing out the mistakes and for introducing COALESCE.
Table (Tab1) is as below:
┌─────────┐
│ TabCol1 │
├─────────┤
│ 01 │
│ 02 │
│ 03 │
│ 04 │
│ 05 │
└─────────┘
I have a table (Tab2 ) with two columns:
┌──────┬──────┐
│ Col1 │ Col2 │
├──────┼──────┤
│ 1111 │ 01 │
│ 1111 │ 02 │
│ 1111 │ 03 │
└──────┴──────┘
If we LEFT JOIN Tab1 to Tab2 we get:
┌─────────┬──────┬──────┐
│ TabCol1 │ Col1 │ Col2 │
├─────────┼──────┼──────┤
│ 01 │ 1111 │ 01 │
│ 02 │ 1111 │ 02 │
│ 03 │ 1111 │ 03 │
│ 04 │ NULL │ NULL │
│ 05 │ NULL │ NULL │
└─────────┴──────┴──────┘
What I need is, instead of NULL, to get 1111:
┌─────────┬──────┬──────┐
│ TabCol1 │ Col1 │ Col2 │
├─────────┼──────┼──────┤
│ 01 │ 1111 │ 01 │
│ 02 │ 1111 │ 02 │
│ 03 │ 1111 │ 03 │
│ 04 │ 1111 │ 04 │
│ 05 │ 1111 │ 05 │
└─────────┴──────┴──────┘
In other words, I need to make a master table, with all COL1 filled to avoid NULL.
What you are trying to achieve makes no sense to me, but there's one way to get that result:
select T1.TabCol1,
coalesce(T2.Col1, '1111'),
coalesce(T2.Col2, T1.TabCol1)
from Tab1 T1 left join Tab2 T2 on T1.TabCol1 = T2.Col2
You can replace NULLs with whatever you want:
SELECT TabCol1, ISNULL(Col1, '1111') AS Col1, ISNULL(Col2, TabCol1) AS Col2
FROM Tab1
LEFT JOIN Tab2 ON Tab1.TabCol1 = Tab2.Col2
Note that ISNULL only works in SQL Server. Alternatively, you can use COALESCE which is supported by most databases:
SELECT TabCol1, COALESCE(Col1, '1111') AS Col1, COALESCE(Col2, TabCol1) AS Col2
FROM Tab1
LEFT JOIN Tab2 ON Tab1.TabCol1 = Tab2.Col2

How to convert several duplicate rows into an array in SQL (Postgres)?

I have the following table One:
id │ value
────┼───────
1 │ a
2 │ b
And Two:
id │ value
─────┼───────
10 │ a
20 │ a
30 │ b
40 │ a
50 │ b
One.value has a unique constraint but not Two.value (one-to-many relationship).
Which SQL (Postgres) query will retrieve, as an array, the ids of Two whose value matches One.value? The result I am looking for is:
id │ value
─────────────┼───────
{10,20,40} │ a
{30,50} │ b
Check on SQL Fiddle
SELECT array_agg(id) AS id, "value"
FROM Two
GROUP BY "value";
Using value as an identifier (a column name here) is bad practice, as it is a reserved keyword.
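If Two could ever contain values that are absent from One, join to One so the result stays restricted to matching values; a variant of the same query (same table and column names as above):
SELECT array_agg(t.id) AS id, o."value"
FROM Two AS t
JOIN One AS o ON o."value" = t."value"
GROUP BY o."value";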