How to convert several duplicate rows into an array in SQL (Postgres)?

I have the following table One:
id │ value
────┼───────
1 │ a
2 │ b
And Two:
id │ value
─────┼───────
10 │ a
20 │ a
30 │ b
40 │ a
50 │ b
One.value has a unique constraint, but Two.value does not (a one-to-many relationship).
Which SQL (Postgres) query will retrieve, as an array, the ids of Two whose value matches One.value? The result I am looking for is:
id │ value
─────────────┼───────
{10,20,40} │ a
{30,50} │ b

Check on SQL Fiddle
SELECT array_agg(id) AS id, "value"
FROM Two
GROUP BY "value";
Using value as an identifier (a column name here) is bad practice, as it is a reserved word in standard SQL.
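Note that without an ORDER BY inside the aggregate, the order of elements in each array is unspecified. If the ids should come out sorted, PostgreSQL lets you order within array_agg itself:

```sql
-- Same query as above, but with a deterministic order inside each array
SELECT array_agg(id ORDER BY id) AS id, "value"
FROM Two
GROUP BY "value";
```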

Related

Counting number of transactions in the following week SQL

I am trying to get a count of the number of transactions that occur in the 7 days following each transaction in a dataset (including the transaction itself). For example, in the following set of transactions I want the count to be as follows.
┌─────────────┬──────────────┐
│ Txn_Date │ Count │
├─────────────┼──────────────┤
│ 2020-01-01 │ 3 │
│ 2020-01-04 │ 3 │
│ 2020-01-05 │ 2 │
│ 2020-01-10 │ 1 │
│ 2020-01-18 │ 3 │
│ 2020-01-20 │ 2 │
│ 2020-01-24 │ 2 │
│ 2020-01-28 │ 1 │
└─────────────┴──────────────┘
For each one it needs to be a rolling week, so row 1 counts all the transactions between 2020-01-01 and 2020-01-08, the second all transactions between 2020-01-04 and 2020-01-11 etc.
The code I have is:
select Txn_Date,
count(Txn_Date) over (partition by Txn_Date(column) where Txn_Date(rows) between Txn_Date and date_add('day', 14, Txn_Date) as Count
This code will not work in its current state, but hopefully it gives an idea of what I am trying to achieve. The database I am working in is Hive.
A good way to provide demo data is to put it into a table variable.
DECLARE @table TABLE (Txn_Date DATE, [Count] INT)
INSERT INTO @table (Txn_Date, [Count]) VALUES
('2020-01-01', 3),
('2020-01-04', 3),
('2020-01-05', 2),
('2020-01-10', 1),
('2020-01-18', 3),
('2020-01-20', 2),
('2020-01-24', 2),
('2020-01-28', 1)
If you're using T-SQL you can do this using the window function LAG after grouping the data by week.
SELECT DATEPART(WEEK, Txn_Date) AS Week, SUM([Count]) AS [Count], LAG(SUM([Count]), 1) OVER (ORDER BY DATEPART(WEEK, Txn_Date)) AS LastWeekCount
FROM @table
GROUP BY DATEPART(WEEK, Txn_Date)
Week Count LastWeekCount
-----------------------------
1 6 NULL
2 3 6
3 3 3
4 4 3
5 1 4
LAG lets you go back n rows for a column in a specific order. Here we go back 1 row in week-number order. To move in the opposite direction you can use LEAD the same way.
We're also using the T-SQL function DATEPART to get the week number for the date, and grouping by that. (Note that DATEPART(WEEK, ...) is not the ISO week number; use DATEPART(ISO_WEEK, ...) for that.)
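The grouped-by-week query above only approximates the goal, though: the question asks for a rolling window anchored at each transaction. In Hive, one way to sketch that is a windowed COUNT with a RANGE frame over the epoch timestamp (untested; assumes a table txns with a Txn_Date column in yyyy-MM-dd format, and that your Hive release supports RANGE frames with FOLLOWING bounds):

```sql
-- Count transactions within 7 days of each transaction, inclusive.
-- 604800 = 7 * 86400 seconds.
SELECT Txn_Date,
       COUNT(*) OVER (
           ORDER BY unix_timestamp(Txn_Date, 'yyyy-MM-dd')
           RANGE BETWEEN CURRENT ROW AND 604800 FOLLOWING
       ) AS cnt
FROM txns;
```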

Frequency of Words in Column of Strings

I'm looking for a query over the table "News_Articles" below, which has two columns, "ID" and "Headline":
NEWS_ARTICLES
┌──────┬─────────────────────────────────────┐
│  ID  │ Headline                            │
├──────┼─────────────────────────────────────┤
│ 0001 │ Today's News: Local Election Today! │
│ 0002 │ COVID-19 Rates Drop Today           │
│ 0003 │ Today's the day to shop local       │
└──────┴─────────────────────────────────────┘
The result should contain:
One word per row (from the headline column)
A count of how many unique IDs it appears in
A count of how many total times the word appears in the whole dataset
DESIRED RESULT
┌──────────┬──────────────┬─────────────┐
│   Word   │ Unique_Count │ Total_Count │
├──────────┼──────────────┼─────────────┤
│ Today    │      3       │      4      │
│ Local    │      2       │      2      │
│ Election │      1       │      1      │
└──────────┴──────────────┴─────────────┘
Ideally, we'd also like to strip possessives and contractions from the words (note how "Today's" above is counted as "Today").
I'd also like to be able to remove filler words such as "the" or "a". Ideally this would be through some existing library but if not, I can always manually remove the ones I see with a where clause.
I would also change all characters to lowercase if needed.
Thank you!
You can use full text search and unnest to extract the lexemes, then aggregate:
SELECT parts.lexeme AS word,
count(*) AS unique_count,
sum(cardinality(parts.positions)) AS total_count
FROM news_articles
CROSS JOIN LATERAL unnest(to_tsvector('english', news_articles.headline)) AS parts
GROUP BY parts.lexeme;
word │ unique_count │ total_count
═══════╪══════════════╪═════════════
-19 │ 1 │ 1
covid │ 1 │ 1
day │ 1 │ 1
drop │ 1 │ 1
elect │ 1 │ 1
local │ 2 │ 2
news │ 1 │ 1
rate │ 1 │ 1
shop │ 1 │ 1
today │ 3 │ 4
(10 rows)
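Postgres can also compute these counts directly with ts_stat, which takes a query (passed as text) returning tsvector values and reports, per lexeme, the number of documents it occurs in (ndoc) and the total number of occurrences (nentry):

```sql
-- ndoc = headlines containing the word, nentry = total occurrences
SELECT word, ndoc AS unique_count, nentry AS total_count
FROM ts_stat($$SELECT to_tsvector('english', headline) FROM news_articles$$)
ORDER BY word;
```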

Is it possible to set a chosen column as index in a julia dataframe?

Dataframes in pandas are indexed by one or more numerical and/or string columns. In particular, after a groupby operation, the output is a dataframe where the new index is given by the groups.
Similarly, Julia dataframes always show a column named Row, which I think is equivalent to the index in pandas. However, after groupby operations, Julia dataframes don't use the groups as the new index. Here is a working example:
using RDatasets;
using DataFrames;
using StatsBase;
df = dataset("Ecdat","Cigarette");
gdf = groupby(df, "Year");
combine(gdf, "Income" => mean)
Output:
11×2 DataFrame
│ Row │ Year │ Income_mean │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────────┤
│ 1 │ 1985 │ 7.20845e7 │
│ 2 │ 1986 │ 7.61923e7 │
│ 3 │ 1987 │ 8.13253e7 │
│ 4 │ 1988 │ 8.77016e7 │
│ 5 │ 1989 │ 9.44374e7 │
│ 6 │ 1990 │ 1.00666e8 │
│ 7 │ 1991 │ 1.04361e8 │
│ 8 │ 1992 │ 1.10775e8 │
│ 9 │ 1993 │ 1.1534e8 │
│ 10 │ 1994 │ 1.21145e8 │
│ 11 │ 1995 │ 1.27673e8 │
Even if the creation of the new index isn't done automatically, I wonder if there is a way to manually set a chosen column as the index. I discovered the method setindex! while reading the documentation. However, I wasn't able to use it. I tried:
#create new df
income = combine(gdf, "Income" => mean)
#set index
setindex!(income, "Year")
which gives the error:
ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::String)
I think that I have misused the command. What am I doing wrong here? Is it possible to manually set an index in a julia dataframe using one or more chosen columns?
DataFrames.jl does not currently allow specifying an index for a data frame. The Row column is just there for printing; it's not actually part of the data frame.
However, DataFrames.jl provides all the usual table operations, such as joins, transformations, filters, aggregations, and pivots. Support for these operations does not require having a table index. A table index is a structure used by databases (and by Pandas) to speed up certain table operations, at the cost of additional memory usage and the cost of creating the index.
The setindex! function you discovered is actually a method from Base Julia that is used to customize the indexing behavior for custom types. For example, x[1] = 42 is equivalent to setindex!(x, 42, 1). Overloading this method allows you to customize the indexing behavior for types that you create.
The docstrings for Base.setindex! can be found here and here.
If you really need a table with an index, you could try IndexedTables.jl.

SQL Server - Join to avoid NULLs

I'm sorry I couldn't explain the issue clearly. The actual problem is that I have a transaction table containing item transactions, such as purchases and sales, across various locations. I need to find the unit purchase cost of all items across all branches. In a given location, not every item may have been purchased, while every item is purchased at the central warehouse: some items are transferred from the warehouse to the locations instead of being purchased locally. In such cases, the unit cost should be picked from the central warehouse purchase data.
Now, I can get the items and purchase costs for each location from the transaction table, provided the item was purchased at that location. My question was how to fetch the central warehouse price for items that have no purchase history in the transaction table and list it alongside the other locations' purchase costs. What makes it difficult is that if there is no purchase history, I have no item number to search for in the central warehouse.
Frankly, I did not know how to do this in a single SQL query. Hence, as a first step, I made a master view containing all branches and items. This is not ideal, because the data is huge: around 50 locations and 200K items, resulting in 50 x 200K rows. However, it served the purpose of acting as a location-item master.
As a second step, I made a central warehouse master with each item and its purchase cost at the warehouse.
Thirdly, I queried the transaction table to fetch items that have no purchases at specific locations. These item ids were joined to the location-item master, and a CASE statement fetched the cost from the warehouse when the purchase cost was NULL.
Thank you for pointing out the mistakes and for introducing COALESCE.
Table (Tab1) is as below:
┌─────────┐
│ TabCol1 │
├─────────┤
│ 01 │
│ 02 │
│ 03 │
│ 04 │
│ 05 │
└─────────┘
I have a table (Tab2 ) with two columns:
┌──────┬──────┐
│ Col1 │ Col2 │
├──────┼──────┤
│ 1111 │ 01 │
│ 1111 │ 02 │
│ 1111 │ 03 │
└──────┴──────┘
If we join the above tables we get:
┌─────────┬──────┬──────┐
│ TabCol1 │ Col1 │ Col2 │
├─────────┼──────┼──────┤
│ 01 │ 1111 │ 01 │
│ 02 │ 1111 │ 02 │
│ 03 │ 1111 │ 03 │
│ 04 │ NULL │ NULL │
│ 05 │ NULL │ NULL │
└─────────┴──────┴──────┘
What I need is, instead of NULL, I must get 1111:
┌─────────┬──────┬──────┐
│ TabCol1 │ Col1 │ Col2 │
├─────────┼──────┼──────┤
│ 01 │ 1111 │ 01 │
│ 02 │ 1111 │ 02 │
│ 03 │ 1111 │ 03 │
│ 04 │ 1111 │ 04 │
│ 05 │ 1111 │ 05 │
└─────────┴──────┴──────┘
In other words, I need to make a master table, with all COL1 filled to avoid NULL.
What you are trying to achieve makes no sense to me, but here's one way to get that result:
select T1.TabCol1,
coalesce(T2.Col1, '1111'),
coalesce(T2.Col2, T1.TabCol1)
from Tab1 T1 left join Tab2 T2 on T1.TabCol1 = T2.Col2
You can replace NULLs with whatever you want:
SELECT TabCol1, ISNULL(Col1, '1111') AS Col1, ISNULL(Col2, TabCol1) AS Col2
FROM Tab1
LEFT JOIN Tab2 ON Tab1.TabCol1 = Tab2.Col2
Note that ISNULL only works in SQL Server. Alternatively, you can use COALESCE, which is supported by most databases:
SELECT TabCol1, COALESCE(Col1, '1111') AS Col1, COALESCE(Col2, TabCol1) AS Col2
FROM Tab1
LEFT JOIN Tab2 ON Tab1.TabCol1 = Tab2.Col2
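Applied to the purchasing scenario described at the top of the question, the same LEFT JOIN plus COALESCE pattern might look like this (a sketch only; the table and column names loc_items, txn, and wh_cost are assumptions, not the asker's actual schema):

```sql
-- loc_items: one row per (location, item) pair (the location-item master)
-- txn:       purchase transactions with a unit cost per location and item
-- wh_cost:   purchase cost per item at the central warehouse
SELECT li.location_id,
       li.item_id,
       COALESCE(t.unit_cost, w.unit_cost) AS unit_cost
FROM loc_items li
LEFT JOIN txn     t ON t.location_id = li.location_id
                   AND t.item_id     = li.item_id
LEFT JOIN wh_cost w ON w.item_id     = li.item_id;
```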

SQL query for converting column breaks in a single column

I have a database in Postgres where one of the columns contains text data with embedded line breaks.
So, when I export the data into a CSV file, the columns get jumbled!
I need a query that ignores the line breaks within a single column, so that the data stays in its own column and does not spill over into the next one.
This example table exhibits the problem you are talking about:
test=> SELECT * FROM breaks;
┌────┬───────────┐
│ id │ val │
├────┼───────────┤
│ 1 │ text with↵│
│ │ three ↵│
│ │ lines │
│ 2 │ text with↵│
│ │ two lines │
└────┴───────────┘
(2 rows)
Then you can use the replace function to replace the line breaks with spaces:
test=> SELECT id, replace(val, E'\n', ' ') FROM breaks;
┌────┬───────────────────────┐
│ id │ replace │
├────┼───────────────────────┤
│ 1 │ text with three lines │
│ 2 │ text with two lines │
└────┴───────────────────────┘
(2 rows)
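If the text was produced on Windows, the breaks may be \r\n sequences rather than bare \n. A regexp_replace with the global flag handles both in one pass:

```sql
-- Collapse any run of CR/LF characters into a single space
SELECT id, regexp_replace(val, E'[\n\r]+', ' ', 'g') FROM breaks;
```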