Beginner trying to learn how Aggregation works - mongodb-query

The last thing in our SQL beginners' course was to dip our toes into a few other databases, and I chose MongoDB. The last and "hardest" thing I can do as a bonus round is to turn this SQLite command into a MongoDB collection query.
sqlite> SELECT ore, COUNT(*), MAX(price) FROM Database GROUP BY ore;
I created the DB with these values:
db.Database.insertMany( [
{ biome: "Desert", ore: "Silver", price: 8000 },
{ biome: "Forest", ore: "Gold", price: 5000 },
{ biome: "Meadow" , ore: "Silver", price: 7000 },
{ biome: "Swamp", ore: "Bronze", price: 6000 },
{ biome: "Mountains", ore: "Gold", price: 9000 },
{ biome: "Arctic" , ore: "Gold", price: 6500 }
] )
So yeah... I have been reading about pipelines and aggregation operations, but it is going over my head xP. This is not vital for the course, but I would love to learn how it works. My school has a very bad habit of teaching everything in our native language, even when no one in their right mind would ever use that terminology in real life. This sometimes makes it extra hard for me to learn these things while studying on my own. If anyone wants to give any examples, I would be grateful!
End result should look something like this:
sqlite> SELECT ore, COUNT(*), MAX(price) FROM Database GROUP BY ore;
ore         COUNT(*)    MAX(price)
----------  ----------  ----------
Silver      2           8000
Bronze      1           6000
Gold        3           9000

What you are looking for is the $group stage.
The $group stage groups the documents in a collection by one or more keys. You can learn more about this pipeline stage in the MongoDB documentation.
You specify the key(s) you want to group by in the _id field of the $group stage.
All the other fields are user-defined, and their values are computed with MongoDB's built-in accumulator operators.
In your case, you can use the $sum operator and pass in the value 1, which adds 1 to the user-defined count field for each document in the group.
To find the max price, use the $max operator and pass in $price (note the $ prefix, since you are referencing a field in the source document) to get the highest value within each group.
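db.Database.aggregate([
  {
    "$group": {
      "_id": "$ore",               // group key: one output document per ore
      "count": { "$sum": 1 },      // adds 1 per document, like COUNT(*)
      "max": { "$max": "$price" }  // highest price in the group, like MAX(price)
    }
  }
])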
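If it helps to see what the $group stage is doing without a MongoDB server, here is a minimal plain-Python sketch of the same computation over your sample documents. The function name group_by_ore is just something I made up for illustration; it is not part of any MongoDB driver.

```python
# Plain-Python sketch of what the $group stage computes (no MongoDB involved).
docs = [
    {"biome": "Desert", "ore": "Silver", "price": 8000},
    {"biome": "Forest", "ore": "Gold", "price": 5000},
    {"biome": "Meadow", "ore": "Silver", "price": 7000},
    {"biome": "Swamp", "ore": "Bronze", "price": 6000},
    {"biome": "Mountains", "ore": "Gold", "price": 9000},
    {"biome": "Arctic", "ore": "Gold", "price": 6500},
]


def group_by_ore(documents):
    """Mimic {$group: {_id: "$ore", count: {$sum: 1}, max: {$max: "$price"}}}."""
    groups = {}
    for doc in documents:
        key = doc["ore"]  # plays the role of _id: "$ore"
        if key not in groups:
            groups[key] = {"_id": key, "count": 0, "max": doc["price"]}
        groups[key]["count"] += 1  # like $sum: 1
        groups[key]["max"] = max(groups[key]["max"], doc["price"])  # like $max
    return list(groups.values())


result = group_by_ore(docs)
for row in result:
    print(row)
```

Each output dictionary corresponds to one row of your SQLite result: _id is the ore, count is COUNT(*) and max is MAX(price).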

Related

Flattening a nested and repeated structure in BigQuery (standard SQL)

There are a lot of posts on unnesting repeated fields in BigQuery. Being new to this environment, I have tried almost every code variation I found to flatten a data file, but I cannot seem to produce a result without blanks in the id field. It seems like I need to unflatten a nested variable?
I'm using a COVID Dimensions data set that is part of the public collection. Here is some minimal code that produces my problem:
SELECT
id,
authors
FROM
`covid-19-dimensions-ai.data.publications`
CROSS JOIN
UNNEST(authors)
LIMIT 1000
And, here is the JSON structure after running this query. Everything is flattened with the structure I want, but I don't know how to fill in / avoid the blank id variables.
{
"id": "pub.1130234899",
"authors": {
"first_name": "Eric M",
"last_name": "Yoshida",
"initials": null,
"researcher_id": "ur.01071531321.03",
"grid_ids": [
"grid.17091.3e"
],
"corresponding": false,
"raw_affiliations": [
"Division of Gastroenterology, University of British Columbia, Vancouver, British Columbia, Canada"
],
"affiliations_address": [
{
"grid_id": "grid.17091.3e",
"city_id": "6173331",
"state_code": "CA-BC",
"country_code": "CA",
"raw_affiliation": "Division of Gastroenterology, University of British Columbia, Vancouver, British Columbia, Canada"
}
]
}
}
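See this small correction to your original query: give the unnested element an alias (author) and select that alias instead of the original authors array.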
SELECT
id,
author
FROM
`covid-19-dimensions-ai.data.publications`
CROSS JOIN
UNNEST(authors) author
LIMIT 1000
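To picture what the aliased CROSS JOIN UNNEST produces, here is a rough plain-Python sketch. The rows are made up and heavily trimmed (only the first id and last name come from your sample), and cross_join_unnest is a hypothetical helper name, not a BigQuery API.

```python
# Made-up, trimmed-down rows standing in for the publications table.
publications = [
    {"id": "pub.1130234899", "authors": [{"last_name": "Yoshida"}, {"last_name": "Smith"}]},
    {"id": "pub.2", "authors": [{"last_name": "Lee"}]},
    {"id": "pub.3", "authors": []},  # no authors: CROSS JOIN drops this row
]


def cross_join_unnest(rows):
    """Pair every row with each element of its authors array."""
    flat = []
    for row in rows:
        for author in row["authors"]:  # 'author' plays the role of the alias
            flat.append({"id": row["id"], "author": author})
    return flat


flat_rows = cross_join_unnest(publications)
for r in flat_rows:
    print(r["id"], r["author"]["last_name"])
```

Every output row carries its own id plus exactly one author, which is why the id column no longer shows blanks.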

N1QL query count for each document of specific type

I am new to Couchbase and to non-relational databases.
I have a bucket with players and teams (2 types of documents).
Each player has a type, a name and playedFor (an array of all the teams he has played for), for example:
{
  "type": "player",
  "name": "player1",
  "playedFor": [
    "England/Manchester/United",
    "England/Manchester/City"
  ]
}
Each team has a type, a name and a category, for example:
{
  "type": "team",
  "name": "England/Manchester/City",
  "category": "FC"
}
I want to know how many players played for each team of category FC.
I wrote this query to calculate it for a specific team:
SELECT COUNT(1) AS total
FROM bucket AS a
WHERE a.type='player'
AND (any r in a.playedFor satisfies r in ["England/Manchester/United"] end)
But how can I write this query for all teams?
The wrinkle in the way you've modeled this data is that a player can play for one or more teams (hence the array).
One way to approach this is to use Couchbase's UNNEST clause to "flatten" these arrays (it's basically joining the document to each of the items in the array).
At that point, it becomes as easy as a standard GROUP BY. Here's an example:
SELECT team, count(1) AS totalPlayers
FROM `bucket` AS a
UNNEST a.playedFor team
WHERE a.type='player'
GROUP BY team
This query would generate output like:
[
{
"team": "Pittsburgh/Pirates",
"totalPlayers": 8
},
{
"team": "England/Manchester/United",
"totalPlayers": 10
},
{
"team": "England/Manchester/City",
"totalPlayers": 15
},
{
"team": "Cincinnati/Reds",
"totalPlayers": 21
}
]
(Sorry, I used MLB teams to augment your sample, since I don't know much about soccer teams).
Notice that the separate team documents don't figure into this query, but you could also JOIN to them if you need information from them for your quer(ies).
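If it helps, here is a small plain-Python sketch of what UNNEST plus GROUP BY is doing, with an optional filter on team category standing in for the JOIN to the team documents. The documents and the players_per_team helper are made up for illustration; this is not Couchbase SDK code.

```python
from collections import Counter

# Made-up documents in the shape described in the question.
documents = [
    {"type": "player", "name": "player1",
     "playedFor": ["England/Manchester/United", "England/Manchester/City"]},
    {"type": "player", "name": "player2", "playedFor": ["England/Manchester/City"]},
    {"type": "player", "name": "player3", "playedFor": ["Pittsburgh/Pirates"]},
    {"type": "team", "name": "England/Manchester/City", "category": "FC"},
    {"type": "team", "name": "England/Manchester/United", "category": "FC"},
    {"type": "team", "name": "Pittsburgh/Pirates", "category": "MLB"},
]


def players_per_team(docs, category=None):
    """Flatten playedFor (like UNNEST) and count players per team (like GROUP BY)."""
    wanted = None
    if category is not None:
        # Stand-in for joining to the team documents and filtering on category.
        wanted = {d["name"] for d in docs
                  if d["type"] == "team" and d.get("category") == category}
    counts = Counter()
    for d in docs:
        if d["type"] != "player":
            continue
        for team in d["playedFor"]:  # one "row" per (player, team) pair
            if wanted is None or team in wanted:
                counts[team] += 1
    return dict(counts)


print(players_per_team(documents))
print(players_per_team(documents, category="FC"))
```

The second call keeps only teams whose team document has category "FC", which is the part of the original question that needs the team documents.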

Distinct Pairing of users of different groups using Sql

I have 50 users whose records have the format (Id, Name, Group_Id), where Group_Id is 1, 2 or 3. They are to be inserted into a pairs table with the format (Id, Pair_1, Pair_2).
Note:
Users belong to different groups.
Users from group 2 cannot pair with each other, and users from group 3 cannot pair with each other either. Duplicates must also be avoided.
How do I go about this in SQL? I am a novice.
Here is some sample data in JavaScript:
[
{
Id:1,
Name:"James",
Group_Id:3
},
{
Id:2,
Name:"Daniel",
Group_Id:3
},
{
Id:3,
Name:"Jonathan",
Group_Id:2
},
{
Id:4,
Name:"Esther",
Group_Id:1
},
{
Id:5,
Name:"Leo",
Group_Id:1
}
]
Pair_1 & Pair_2 are two paired users to be added to a pairs table based on the condition explained earlier.
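One possible reading of the rule can be sketched with a self-join, shown here through Python's built-in sqlite3 module so it can be run as-is. This is only a sketch under assumptions: "pair" is taken to mean every allowed combination of two users listed once, users from group 1 are assumed to be allowed to pair with each other (the question only restricts groups 2 and 3), and the table name users is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id INTEGER PRIMARY KEY, Name TEXT, Group_Id INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "James", 3), (2, "Daniel", 3), (3, "Jonathan", 2), (4, "Esther", 1), (5, "Leo", 1)],
)
conn.execute(
    "CREATE TABLE pairs (Id INTEGER PRIMARY KEY AUTOINCREMENT, Pair_1 INTEGER, Pair_2 INTEGER)"
)

# a.Id < b.Id avoids self-pairs and mirrored duplicates such as (1, 3) and (3, 1).
# The WHERE clause rejects pairs in which both users are in group 2 or both in group 3.
conn.execute(
    """
    INSERT INTO pairs (Pair_1, Pair_2)
    SELECT a.Id, b.Id
    FROM users AS a
    JOIN users AS b ON a.Id < b.Id
    WHERE NOT (a.Group_Id = b.Group_Id AND a.Group_Id IN (2, 3))
    ORDER BY a.Id, b.Id
    """
)

pairs = conn.execute("SELECT Pair_1, Pair_2 FROM pairs ORDER BY Pair_1, Pair_2").fetchall()
print(pairs)
```

If each user should instead end up in exactly one pair, that is a different (matching) problem and this query would not be enough on its own.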

complex couchbase query using metadata & group by

I am new to Couchbase and kind of stuck with the following problem.
This query works just fine in the Couchbase Query Editor:
SELECT
p.countryCode,
SUM(c.total) AS total
FROM bucket p
USE KEYS (
SELECT RAW "p::" || ca.token
FROM bucket ca USE INDEX (idx_cr)
WHERE ca._class = 'backend.db.p.ContactsDo'
AND ca.total IS NOT MISSING
AND ca.date IS NOT MISSING
AND ca.token IS NOT MISSING
AND ca.id = 288
ORDER BY ca.total DESC, ca.date ASC
LIMIT 20 OFFSET 0
)
LEFT OUTER JOIN bucket finished_contacts
ON KEYS ["finishedContacts::" || p.token]
GROUP BY p.countryCode ORDER BY total DESC
I get this:
[
{
"countryCode": "en",
"total": 145
},
{
"countryCode": "at",
"total": 133
},
{
"countryCode": "de",
"total": 53
},
{
"countryCode": "fr",
"total": 6
}
]
Now, using this query in a Spring Boot application, I end up with this error:
Unable to retrieve enough metadata for N1QL to entity mapping, have you selected _ID and _CAS?
Adding the metadata,
SELECT
meta(p).id AS _ID,
meta(p).cas AS _CAS,
p.countryCode,
SUM(c.total) AS total
FROM bucket p
trying to map it to the following object:
data class CountryIntermediateRankDo(
    @Id
    @Field
    val id: String,
    @Field
    @NotNull
    val countryCode: String,
    @Field
    @NotNull
    val total: Long
)
results in:
Unable to execute query due to the following n1ql errors:
{"msg":"Expression must be a group key or aggregate: (meta(p).id)","code":4210}
Using Map as return value results in:
org.springframework.data.couchbase.core.CouchbaseQueryExecutionException: Query returning a primitive type are expected to return exactly 1 result, got 0
Clearly I missed something important here in terms of how to write proper Couchbase queries. I am stuck between needing the metadata and getting this key/aggregate error that relates to the GROUP BY clause. I'd be very thankful for any help.
When you have a GROUP BY query, everything in the SELECT clause should be either a field used for grouping or a group aggregate. You need to add the new fields to the GROUP BY clause, sort of like this:
SELECT
_ID,
_CAS,
p.countryCode,
SUM(p.c.total) AS total
FROM testBucket p
USE KEYS ["foo", "bar"]
LEFT OUTER JOIN testBucket finished_contacts
ON KEYS ["finishedContacts::" || p.token]
GROUP BY p.countryCode, meta(p).id AS _ID, meta(p).cas AS _CAS
ORDER BY total DESC
(I had to make some changes to your query to work with it effectively. You'll need to retrofit the advice to your specific case.)
If you need more detailed advice, let me suggest the N1QL forum https://forums.couchbase.com/c/n1ql . StackOverflow is great for one-and-done questions, but the forum is better for extended interactions.
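One thing to keep in mind with this approach: a document id is unique, so grouping by it as well as by countryCode gives one group per document rather than one per country. The plain-Python sketch below uses made-up rows and a hypothetical sum_total helper to show the difference; it is not Couchbase code.

```python
from collections import defaultdict

# Made-up rows: (document id, countryCode, total).
rows = [
    ("p::a", "en", 100),
    ("p::b", "en", 45),
    ("p::c", "at", 133),
]


def sum_total(data, key):
    """SUM(total) grouped by whatever key(row) returns."""
    totals = defaultdict(int)
    for row in data:
        totals[key(row)] += row[2]
    return dict(totals)


# GROUP BY countryCode: one result per country.
per_country = sum_total(rows, key=lambda r: r[1])
# GROUP BY countryCode, document id: one result per document.
per_document = sum_total(rows, key=lambda r: (r[1], r[0]))

print(per_country)
print(per_document)
```

So if you need per-country totals, the extra group keys change the result, and you may need to map the query to something other than a full entity.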

How to query and iterate over array of structures in Athena (Presto)?

I have an S3 bucket with 500,000+ JSON records, e.g.
{
"userId": "00000000001",
"profile": {
"created": 1539469486,
"userId": "00000000001",
"primaryApplicant": {
"totalSavings": 65000,
"incomes": [
{ "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
]
}
}
}
I created a new table in Athena
CREATE EXTERNAL TABLE profiles (
userId string,
profile struct<
created:int,
userId:string,
primaryApplicant:struct<
totalSavings:int,
incomes:array<struct<amount:int,incomeType:string,frequency:string>>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'
I am interested in the incomeTypes, e.g. "SALARY", "PENSIONS", "OTHER", etc., and ran this query, changing the jsonData.incometype value each time:
SELECT jsonData
FROM "sampledb"."profiles"
CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)
WHERE jsonData.incometype='SALARY'
This worked fine with CROSS JOIN UNNEST which flattened the incomes array so that the data example above would span across 2 rows. The only idiosyncratic thing was that CROSS JOIN UNNEST made all the field names lowercase, eg. a row looked like this:
{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}
Now I have been asked how many users have two or more "SALARY" entries, eg.
"incomes": [
{ "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
],
I'm not sure how to go about this.
How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?
Do I have to iterate over the array?
What should the result look like?
UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's lambda functions is more straightforward:
SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1
This solution uses FILTER on the profile.primaryApplicant.incomes array to get only those with an incomeType of SALARY, and then CARDINALITY to extract the length of that result.
Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.
You can combine filter with cardinality to find the arrays that contain more than one element with incomeType = 'SALARY'.
This can be further improved by using reduce, so that the intermediate array is not materialized (see the examples in the docs; I'm not quoting them here, since they do not directly answer your question).
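To make the two approaches concrete, here is a rough plain-Python analogue of both: one that builds the filtered list and takes its length (like FILTER plus CARDINALITY), and one that folds over the array with a running count (like reduce). The second profile and both function names are made up for illustration; this is not Presto code.

```python
from functools import reduce

# First profile follows the question's sample; the second is made up.
profiles = [
    {"userId": "00000000001", "incomes": [
        {"amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY"},
        {"amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY"},
    ]},
    {"userId": "00000000002", "incomes": [
        {"amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY"},
        {"amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY"},
    ]},
]


def salary_count_filter(incomes):
    """Like CARDINALITY(FILTER(...)): build the filtered list, then measure it."""
    return len([i for i in incomes if i["incomeType"] == "SALARY"])


def salary_count_reduce(incomes):
    """Like reduce: keep a running count, no intermediate list."""
    return reduce(
        lambda acc, i: acc + (1 if i["incomeType"] == "SALARY" else 0), incomes, 0
    )


# Equivalent of SELECT COUNT(*) ... WHERE <salary count> > 1
users_with_multiple = sum(1 for p in profiles if salary_count_filter(p["incomes"]) > 1)
print(users_with_multiple)
```

Both helpers return the same count for any array; the reduce version just avoids creating the filtered copy.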