Improve mongoDB lookup performance in rails

Improve mongoDB lookup performance in rails - ruby-on-rails-3

I have a collection called survey_data which has following fields
topic_id
indicator_id
population_id
The survey_data has over 2.8M records. I want to fetch the populations from populations collection for a given set of indicator_id and topic_id. But the query below is taking 20 seconds even after adding a compound index for all the fields.
db.survey_data.find({"topic_id":60,"indicator_id":16)
How can i improve the performance? May be a single query using "mongoid" for rails3 would be preferred.
Explain
{
"cursor" : "BtreeCursor data_source_index",
"isMultiKey" : false,
"n" : 2261852,
"nscannedObjects" : 2261852,
"nscanned" : 2261852,
"nscannedObjectsAllPlans" : 2261852,
"nscannedAllPlans" : 2261852,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 21,
"nChunkSkips" : 0,
"millis" : 19952,
"indexBounds" : {
"data_source_id" : [
[
60,
60
]
]
}
SurveyData Index:
index({data_source_id: 1,
data_source_year_id: 1,
indicator_id: 1,
indicator_option_id: 1,
country_id: 1,
provinces_state_id: 1,
health_regions_id: 1,
health_regions_type_id: 1,
other_administrative_boundary_id: 1,
sub_population_options_id: 1,
reportability_id: 1},
{name:"survey_data_index",
background: true
})

Three things to look at:
topic_id doesn't appear to be in your index.
Try creating an index with just the fields that you are querying on, in the same order as your query.
Do you need to grab 100,000 records all at once? If you pull the first 100 records using limit does it speed things up?
There are some really great resources on query tuning. Here are a couple:
Optimize Query Performance with Indexes and Projections
Automated Slow Query Analysis: Dex the Index Robot

Related

How to group by the amount of values in an array in postgresql

I have a posts table with few columns including a liked_by column which's type is an int array.
As I can't post the table here I'll post a single post's JSON structure which comes as below
"post": {
"ID": 1,
"CreatedAt": "2022-08-15T11:06:44.386954+05:30",
"UpdatedAt": "2022-08-15T11:06:44.386954+05:30",
"DeletedAt": null,
"title": "Pofst1131",
"postText": "yyhfgwegfewgewwegwegwegweg",
"img": "fegjegwegwg.com",
"userName": "AthfanFasee",
"likedBy": [
3,
1,
4
],
"createdBy": 1,
}
I'm trying to send posts in the order they are liked (Most Liked Posts). Which should order the posts according to the number of values inside the liked_by array. How can I achieve this in Postgres?
For a side note, I'm using Go lang with GORM ORM but I'm using raw SQL builder instead of ORM tools. I'll be fine with solving this problem using go lang as well. The way I achieved this in MongoDB and NodeJS is to group by the size of liked by array and add a total like count field and sort using that field as below
if(sort === 'likesCount') {
data = Post.aggregate([
{
$addFields: {
totalLikesCount: { $size: "$likedBy" }
}
}
])
data = data.sort('-totalLikesCount');
} else {
data = data.sort('-createdAt') ;
}

Use a native query.
Provided that the table column that contains the sample data is called post, then
select <list of expressions> from the_table
order by json_array_length(post->'likedBy') desc;
Unrelated but why don't you try a normalized data design?
Edit
Now that I know your table structure here is the updated query. Use array_length.
select <list of expressions> from public.posts
order by array_length(liked_by, 1) desc nulls last;
You may also wish to add a where clause too.

Beginner trying to learn how Aggregation works

Last thing in our SQL Beginners course was to tip our toes on few other DB:s and I chose MongoDB. The last and the "Hardest" thing I can do as bonus round is to turn this sqlite command to MongoDB collection line.
sqlite> SELECT ore, COUNT(*), MAX(price) FROM Database GROUP BY ore;
I created the DB with these values:
db.Database.insertMany( [
{ biome: "Desert", ore: "Silver", price: 8000 },
{ biome: "Forest", ore: "Gold", price: 5000 },
{ biome: "Meadow" , ore: "Silver", price: 7000 },
{ biome: "Swamp", ore: "Bronze", price: 6000 },
{ biome: "Mountains", ore: "Gold", price: 9000 },
{ biome: "Arctic" , ore: "Gold", price: 6500 }
] )
So yeah... I have been reading about pipelines and aggregation operations, but it is flying over my head xP. This is not vital for this course but I would love to learn how this goes. My school has very bad habit of teaching everything in our native language, even if no one in their right minds would ever use that terminology in real life. This makes it sometimes extra hard for me to learn these things while trying to study on my own. If anyone want to give any examples I would be grateful!
End result should look something like this:
sqlite> SELECT ore, COUNT(*), MAX(price) FROM Database GROUP BY ore;
ore COUNT(*) MAX(price)
---------- ---------- -----------
Silver 2 8000
Bronze 1 6000
Gold 3 9000

What you are looking for is the $group stage.
The $group stage is used to group multiple documents in a collection based on one or many keys. You can learn more about this pipeline stage here.
You will mention the keys you want to group by in the _id key of the Group stage.
all the rest of the keys are user-defined with can be accumulated with MongoDB's built-in operators.
In your case, you can make use of the $sum operator and pass in the value 1 to add one value to the user-defined count key for each document grouped.
And to find the max price, make use of the $max key and pass in $price (note $ prefix since you are self-referencing a key in the source document) to get the max value of a single group.
db.collection.aggregate([
{
"$group": {
"_id": "$ore",
"count": {
"$sum": 1
},
"max": {
"$max": "$price"
}
},
},
])
Mongo Playground Sample Execution

How to query and iterate over array of structures in Athena (Presto)?

I have a S3 bucket with 500,000+ json records, eg.
{
"userId": "00000000001",
"profile": {
"created": 1539469486,
"userId": "00000000001",
"primaryApplicant": {
"totalSavings": 65000,
"incomes": [
{ "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
]
}
}
}
I created a new table in Athena
CREATE EXTERNAL TABLE profiles (
userId string,
profile struct<
created:int,
userId:string,
primaryApplicant:struct<
totalSavings:int,
incomes:array<struct<amount:int,incomeType:string,frequency:string>>,
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'
I am interested in the incomeTypes, eg. "SALARY", "PENSIONS", "OTHER", etc.. and ran this query changing jsonData.incometype each time:
SELECT jsonData
FROM "sampledb"."profiles"
CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)
WHERE jsonData.incometype='SALARY'
This worked fine with CROSS JOIN UNNEST which flattened the incomes array so that the data example above would span across 2 rows. The only idiosyncratic thing was that CROSS JOIN UNNEST made all the field names lowercase, eg. a row looked like this:
{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}
Now I have been asked how many users have two or more "SALARY" entries, eg.
"incomes": [
{ "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
],
I'm not sure how to go about this.
How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?
Do I have to iterate over the array?
What should the result look like?

UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's Lambda functions is more straight forward:
SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1
This solution uses FILTER on the profile.primaryApplicant.incomes array to get only those with an incomeType of SALARY, and then CARDINALITY to extract the length of that result.
Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.

You can combine filter with cardinality to filter array elements having incomeType = 'SALARY' more than once.
This can be further improve so that intermediate array is not materialized by using reduce (see examples in the docs; I'm not quoting them here, since they do not directly answer your question).

Map-Reduce to combine data (MongoDb)

I have two collections.
LogData
[{
"SId": 10,
"NoOfDaya" : 9,
"Status" : 4
}
{
"SId": 11,
"NoOfDaya" : 8,
"Status" : 2
}]
OptData
[ {
"SId": 10,
"CId": 12,
"CreatedDate": ISO(24-10-2014)
}
{
"SId": 10,
"CId": 13,
"CreatedDate": ISO(24-10-2014)
}]
Now using mongoDB I need to find the data in form
select a.SPID,a.CreatedDate,CID=(MAX(a.CID)) from OptData a
Join LogData c on a.SID=c.SID where Status>2
group by a.SPID,a.CreatedDate
LogData have 600 records whereas OPTData have 90 millions records in production. I need to update LogData frequently, that's why its in separate collection.
Please don't suggest to keep data in one collection.
This is same query, I asked with different approach Creating file in GridFs (MongoDb)
Please don't suggest Joins can't be applied in mongoDB.

Because MongoDB does not support JOINs, you will have to perform two separate queries and do the JOIN on the application layer. With just 600 documents the collection LogData is very small, so it should be no problem to completely load it into your applications memory and use it to enrich the results returned from OptData.
Another option would be to denormalize the data from LogData by mirroring the fields you need from LogData in the respective documents in OptData. So your OptData documents would look something like this:
{
"SId": 10,
"CId": 12,
"CreatedDate": ISO(24-10-2014),
"LogStatus": 2
}

What is the best way to use lookup tables in QlikView?

In QlikView, I have a table Data and one database table A. Table A should be used twice (A_Left, A_Right). (Table A can have thousands of entries.)
My load script is:
A_Left:
Load a_id_left,
a_name_left
inline [
a_id_left, a_name_left
1, 'nwsnd'
2, 'dcsdcws'
3, 'fsdf' ];
A_Rigtht:
Load a_id_right,
a_name_right
inline [
a_id_right, a_name_right
1, 'nwsnd'
2, 'dcsdcws'
3, 'fsdf' ];
Data:
Load id,
a_id_left,
a_name_left as 'Name_Left',
a_id_right,
a_name_right as 'Name_Right',
data
inline [
id, a_id_left, a_right_id, data
1, 1, 2, 37
1, 1, 3, 18
1, 2, 3, 62
];
So my question is: What is the best way to use lookup tables in QlikView?
(Should I use MAPPING and/or ApplyMap? Why? Is that faster?)
One other part of the question is: Would it help change the data structure from star to table?
(I know that would cost more memory.) And, by the way: How could I put all data in one table
so that I can store it completely in one QVD file?
Thanks for help an ideas.

For simple lookups where you wish to look up a single value from another value you can use a MAPPING load and then use the ApplyMap() function. For example, say I have the following table:
LOAD
*
INLINE [
UserID, System
1, Windows
2, Linux
3, Windows
];
I have another table that contains UserID and UserName as follows:
LOAD
*
INLINE [
UserID, UserName
1, Alice
2, Bob
3, Carol
];
I can then combine the above tables with ApplyMap as follows:
UserNameMap:
MAPPING LOAD
*
INLINE [
UserID, UserName
1, Alice
2, Bob
3, Carol
];
SystemData:
LOAD
UserID,
ApplyMap('UserNameMap', UserID, 'MISSING') as UserName,
System
INLINE [
UserID, System
1, Windows
2, Linux
3, Windows
];
ApplyMap is very fast and should not significantly slow down your load time (although it will not be as fast as a direct QVD load). However, as mentioned ApplyMap can only be used if you wish to map a single value into your table. For more fields, you will need to use a join (which is similar to a SQL JOIN) if you wish to combine your results into a single table.
If you do not wish to join them into a single table (but keep it as a "star" scheme), just make sure that the fields that you wish to link are named the same. For example:
A_Left:
Load a_id_left,
a_name_left as [Name_Left]
inline [
a_id_left, a_name_left
1, 'nwsnd'
2, 'dcsdcws'
3, 'fsdf' ];
A_Rigtht:
Load a_id_right,
a_name_right as [Name_Right]
inline [
a_id_right, a_name_right
1, 'nwsnd'
2, 'dcsdcws'
3, 'fsdf' ];
Data:
Load id,
a_id_left,
a_id_right,
data
inline [
id, a_id_left, a_right_id, data
1, 1, 2, 37
1, 1, 3, 18
1, 2, 3, 62
];
(I have removed your "name" fields from "Data" as it would fail to load).
This will then work in your QlikView document due to QlikView's automatic field associativity.
However, if you wish to have the data in a single table (e.g. for output to QVD) then in your case you will need to JOIN your two tables into Data. We can rearrange some of the tables to make our life a bit easier, if we put your Data table first, we can then join your other two tables on:
Data:
Load id,
a_id_left,
a_id_right,
data
inline [
id, a_id_left, a_id_right, data
1, 1, 2, 37
1, 1, 3, 18
1, 2, 3, 62
];
LEFT JOIN (Data)
Load a_id_left,
a_name_left as [Name_Left]
inline [
a_id_left, a_name_left
1, 'nwsnd'
2, 'dcsdcws'
3, 'fsdf' ];
LEFT JOIN (Data)
Load a_id_right,
a_name_right as [Name_Right]
inline [
a_id_right, a_name_right
1, 'nwsnd'
2, 'dcsdcws'
3, 'fsdf' ];
This will then resort in a single table named "Data" which you can then output to QVD etc.
You may wish to think about optimising your "Table A" extract since it is almost being loaded twice, this may take some time (e.g. from long distance server etc.) so it may be better to grab your data in one go and then slice it once it's in memory (much faster). A quick example could be like the below:
TableA:
LOAD
a_id_left,
a_id_right,
a_name_left,
a_name_right
FROM ...;
Data:
Load id,
a_id_left,
a_id_right,
data
inline [
id, a_id_left, a_id_right, data
1, 1, 2, 37
1, 1, 3, 18
1, 2, 3, 62
];
LEFT JOIN (Data)
LOAD DISTINCT
a_id_left,
a_name_left as [Name_Left]
RESIDENT TableA;
LEFT JOIN (Data)
LOAD DISTINCT
a_id_right,
a_name_right as [Name_Right]
RESIDENT TableA;
DROP TABLE TableA;

To answer the last part, you can use the CONCATENATE function to append the contents of one table to anoher, i.e.
Final:
Load *
from
Table1
CONCATENATE
Load *
from
Table2
will give you one table, called Final with the merged contents of Table1 and Table2.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Improve mongoDB lookup performance in rails - ruby-on-rails-3

Related

How to group by the amount of values in an array in postgresql

Beginner trying to learn how Aggregation works

How to query and iterate over array of structures in Athena (Presto)?

Map-Reduce to combine data (MongoDb)

What is the best way to use lookup tables in QlikView?

Categories

Resources