How to optimize the Azure Cosmos DB index for this query

I have a Cosmos DB with approx. 10 GB of data that is used to store analytics data.
The model is as below:
{
    "publisherID": "",
    "managerID": "",
    "mediaID": "",
    "type": "",
    "ip": "",
    "userAgent": "",
    "playerRes": "",
    "title": "",
    "playerName": "",
    "videoTimeCode": 0,
    "geo": {
        "country": "",
        "region": "",
        "city": "",
        "ll": []
    },
    "date": "",
    "uuid": "",
    "id": ""
}
I sometimes have very heavy queries that get throttled because my RU limit is reached. Before considering increasing my RU limit, I want to make sure my queries are optimized.
All my queries follow the below pattern:
SELECT c.id, c.date, c.uuid, c.type FROM c WHERE c.mediaID = "{ID}" AND (c.type = "Load" OR c.type = "Progress" OR c.type = "Play") AND (c.date BETWEEN "2021-06-30T22:00:00.000Z" AND "2021-07-31T21:59:59.999Z")
So after doing some research, I reached the conclusion that the best index I could have is:
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/type/?"
        },
        {
            "path": "/mediaID/?"
        },
        {
            "path": "/date/?"
        }
    ],
    "excludedPaths": [
        {
            "path": "/*"
        }
    ]
}
I got the below stats for this query:
Request Charge: 324.51000000000005 RUs
Showing Results: 1 - 100
Retrieved document count: 10024
Retrieved document size: 8597509 bytes
Output document count: 200
Output document size: 28324 bytes
Index hit document count: 199.24
Index lookup time: 2.41 ms
Document load time: 62.93 ms
Query engine execution time: 15.709800000000001 ms
System function execution time: 0 ms
User defined function execution time: 0 ms
Document write time: 0.47000000000000003 ms
Round Trips: 1
What worries me is the difference between the Retrieved document count and the Output document count. I'm guessing this is why I need 324 RUs just to get the first 100 results...
I'm not sure how to set the index to optimize the performance of the query (always the same pattern: WHERE mediaID = {ID} AND type = {TYPE} AND date BETWEEN two dates).
Any help would be welcome here.

Thanks for your feedback, that helped a lot!
I added a composite index as below:
"compositeIndexes": [
    [
        {
            "path": "/mediaID",
            "order": "ascending"
        },
        {
            "path": "/type",
            "order": "ascending"
        },
        {
            "path": "/date",
            "order": "ascending"
        }
    ]
]
The RU for the initial request in my first post is now 7 RUs (it was 320 before!). Thanks @404
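For anyone doing the same, the policy update can be applied with the Python SDK roughly like this (a sketch; the account, database and container names, and the partition key path, are placeholders for your own setup):
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("<database>")
container = database.get_container_client("<container>")

# Read the current indexing policy and attach the composite index.
policy = container.read()["indexingPolicy"]
policy["compositeIndexes"] = [[
    {"path": "/mediaID", "order": "ascending"},
    {"path": "/type", "order": "ascending"},
    {"path": "/date", "order": "ascending"},
]]

# Replacing the container definition triggers an online re-index.
database.replace_container(
    "<container>",
    partition_key=PartitionKey(path="/mediaID"),  # placeholder: use your real partition key
    indexing_policy=policy,
)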
If you don't mind, I would like to push a bit further... :)
For another request (same structure), where I know there is a lot of data that needs to be retrieved, I need 42.02 RUs just to get the first 100 results. Does that make sense?
Request Charge: 42.02 RUs
Showing Results: 1 - 100
Retrieved document count: 200
Retrieved document size: 164847 bytes
Output document count: 200
Output document size: 28544 bytes
Index hit document count: 200
Index lookup time: 6.98 ms
Document load time: 1.3399 ms
Query engine execution time: 0.6601 ms
System function execution time: 0 ms
User defined function execution time: 0 ms
Document write time: 0.47000000000000003 ms
Round Trips: 1
Is there more optimization that can be done here, other than increasing the RU limit?


Add computed field to Query in Grafana using JSON API as data source

What am I trying to achieve:
I would like to have a time series chart showing the total number of members in my club at any time. This member count should be calculated by using the field "Eintrittsdatum" (joining-date) and "Austrittsdatum" (leaving-date). I’m thinking of it as a running sum - every filled field with a joining-date means +1 on the member count, every leaving-date entry is a -1.
Data structure
I’m calling the API of webling.ch with a secret key. This is my data structure with sample data per member:
[
    {
        "type": "member",
        "meta": {
            "created": "2020-03-02 11:33:00",
            "createuser": {
                "label": "Joana Doe",
                "type": "user"
            },
            "lastmodified": "2022-12-06 16:32:56",
            "lastmodifieduser": {
                "label": "Joana Doe",
                "type": "user"
            }
        },
        "readonly": true,
        "properties": {
            "Mitglieder ID": 99,
            "Anrede": "Dear",
            "Vorname": "Jon",
            "Name": "Doe",
            "Strasse": "Doeington Street",
            "Adresszusatz": null,
            "PLZ": "9999",
            "Ort": "Doetown",
            "E-Mail": "jon.doe@doenet.net",
            "Telefon Privat": null,
            "Telefon Geschäft": null,
            "Mobile": "099 877 54 54",
            "Geschlecht": "m",
            "Geburtstag": "1966-03-10",
            "Mitgliedschaftstyp": "Aktivmitgliedschaft",
            "Eintrittsdatum": "2020-03-01",
            "Austrittsdatum": null,
            "Passfoto": null,
            "Wordpress Benutzername": null,
            "Wohnhaft im Glarnerland": false,
            "Lat": "43.1563379",
            "Long": "6.0474622"
        },
        "parents": [
            240
        ],
        "children": {},
        "links": {
            "debitor": [
                2124,
                3056,
                3897
            ],
            "attendee": [
                2576
            ]
        },
        "id": 1815
    }
]
Grafana data source
I am using the “JSON API” by Marcus Olsson: GitHub - grafana/grafana-json-datasource: A data source plugin for loading JSON APIs into Grafana.
Grafana v9.3.1 (89b365f8b1) on Linux
My current approach
Queries:
Query C - uses a filter on the source-API to only show entries with "Eintrittsdatum" IS NOT EMPTY
Field 1 (alias "datum") has a JSONata-Query of:
properties.Eintrittsdatum
Field 2 (alias "names") should return the full name and has a query of:
$map($.properties, function($v) {(
    ($v.Vorname & " " & $v.Name);
)})
Field 3 (alias "value") should return "1" for every entry and has a query of:
$map($.properties, function($v) {(
    (1);
)})
Query D - uses a filter on the source-API to only show entries with "Austrittsdatum" IS NOT EMPTY
Field 1 (alias "datum") has a JSONata-Query of:
properties.Austrittsdatum
Field 2 (alias "names") should return the full name and has a query of:
$map($.properties, function($v) {(
    ($v.Vorname & " " & $v.Name);
)})
Field 3 (alias "value") should return "1" for every entry and has a query of:
$map($.properties, function($v) {(
    (1);
)})
Here's a screenshot to clarify things
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-1.png)
Transformations:
My applied transformations
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-2.png)
What's working
I can correctly gather the number of members added/subtracted per day.
What's not working
I can't get the graph to display the way I want: I'd like to have a running sum of these numbers (sketched below) instead of the following two graphs.
Time series graph with merged queries
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-3.png)
Time series graph with unmerged queries
(https://zigerschlitzmakers.ch/wp-content/uploads/2023/01/ScreenshotGrafana-4.png)
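To make the goal concrete, the running sum I have in mind works like this (a minimal Python sketch with made-up dates, just to illustrate the logic):
# Running member count from join/leave events; the dates below are made up.
joins = ["2020-03-01", "2020-04-15", "2020-06-01"]   # Eintrittsdatum values, +1 each
leaves = ["2021-01-31"]                              # Austrittsdatum values, -1 each

# ISO dates sort chronologically, so a plain sort orders the events.
events = sorted([(d, +1) for d in joins] + [(d, -1) for d in leaves])
count = 0
for date, delta in events:
    count += delta
    print(date, count)  # 2020-03-01 1, 2020-04-15 2, 2020-06-01 3, 2021-01-31 2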
I can't get the names to display within the tooltip of the data points (really not THAT necessary).

How to load a jsonl file into BigQuery when the file has mixed data fields as columns

During my workflow, after extracting the data from the API, the JSON has the following structure:
[
    {
        "fields": [
            {
                "meta": {
                    "app_type": "ios"
                },
                "name": "app_id",
                "value": 100
            },
            {
                "meta": {},
                "name": "country",
                "value": "AE"
            },
            {
                "meta": {
                    "name": "Top"
                },
                "name": "position",
                "value": 1
            }
        ],
        "metrics": {
            "click": 1,
            "price": 1,
            "count": 1
        }
    }
]
Then it is stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, the automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:

app_type | app_id | country | position | click | price | count
---------|--------|---------|----------|-------|-------|------
ios      | 100    | AE      | Top      | 1     | 1     | 1
Is there a way to define a manual schema on BigQuery to achieve this result? Or do I have to preprocess the jsonl file before putting it into BigQuery?
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you process the conversion of the JSON files to jsonl files and store them on GCS, you will have to do some preprocessing.
You probably have two options:
precreate the target table with an app_id field as an INTEGER
preprocess the json file and enclose 100 in quotes, like "100" (a sketch of this preprocessing follows below)
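Assuming the input structure from the question, the preprocessing could look like this (a Python sketch; file names are placeholders, and the meta handling mirrors the desired table, where position takes the label "Top" and app_type becomes its own column):
import json

def flatten(record):
    # Flatten one API record into a single JSONL row.
    row = {}
    for field in record["fields"]:
        meta = field.get("meta", {})
        # Prefer a meta label when present (e.g. position -> "Top"),
        # otherwise keep the raw value (e.g. app_id -> 100).
        row[field["name"]] = meta.get("name", field["value"])
        # Promote the remaining meta keys (e.g. app_type) to their own columns.
        row.update({k: v for k, v in meta.items() if k != "name"})
    row.update(record["metrics"])  # click, price, count
    return row

with open("input.json") as src, open("output.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(flatten(record)) + "\n")
The resulting file can then be loaded with schema autodetection, or against a precreated table if you want app_id typed as an INTEGER.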

SQL query as string in JSON file

Is it possible to store a SQL query in a JSON array of objects? Because when I have something like this:
[
    {
        "id": "1",
        "query": "SELECT ID FROM table"
    },
    {
        "id": "2",
        "query": "SELECT ID FROM table"
    },
    {
        "id": "3",
        "query": "SELECT USER FROM table"
    }
]
The JSON file in VSCode is OK, no errors. It gets nasty when I want to store complex queries with joins etc.
For example, this query, even if I format it correctly, will generate a formatting error in the JSON file saying it can't be stored in a string (just an example, I know it is not valid):
SELECT user, id, , count(price) as numrev
FROM price
where id = 1 and user = 0
group by user, id, price
It is easy to do, but requires one extra step.
Simply convert/encode your raw SQL queries to base64 text.
Decode the text before you execute the queries in your code.
This also works if the JSON file is created automatically by a program/code:
almost all programming languages provide base64 encode/decode functions as part of the core; if not, download a compatible package/library to achieve this automation.
var queries = [{
    "id": "1",
    "query": "U0VMRUNUIElEIEZST00gdGFibGU="
},
{
    "id": "2",
    "query": "U0VMRUNUIElEIEZST00gdGFibGU="
},
{
    "id": "3",
    "query": "U0VMRUNUIFVTRVIgRlJPTSB0YWJsZQ=="
},
{
    "id": "4",
    "query": "U0VMRUNUIHVzZXIsIGlkLCAsIGNvdW50KHByaWNlKSBhcyBudW1yZXYKICBGUk9NIHByaWNlCiAgd2hlcmUgaWQgPSAxIGFuZCB1c2VyID0gMCAKICBncm91cCBieSB1c2VyLCBpZCwgcHJpY2U="
}
];

// atob() decodes a base64 string back into the original SQL text.
for (var i = 0; i < queries.length; i++) {
    console.log("id = " + queries[i].id + ", query = " + atob(queries[i].query));
}
When you parse your JSON array, make sure to decode before you execute the SQL queries.
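If the JSON file is generated by a script, the encoding side might look like this (a Python sketch; any language with base64 support works the same way):
import base64
import json

queries = [
    {"id": "1", "query": "SELECT ID FROM table"},
    {"id": "3", "query": "SELECT USER FROM table"},
]

# Encode each query so the JSON value stays a single-line, valid string.
for q in queries:
    q["query"] = base64.b64encode(q["query"].encode("utf-8")).decode("ascii")

print(json.dumps(queries, indent=2))
# To round-trip: base64.b64decode(q["query"]).decode("utf-8")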
Let me know if this one helped you.. ☺☻☺
FYI, refer to http://www.utilities-online.info/base64/

Filter parameters to POST verify and place order request for Performance storage

I am trying to do BPM and SoftLayer integration using a Java REST client. On my initial analysis (as well as help from Stack Overflow), I found:
Step 1) we need to get the getItemPrices list to have all IDs for the next request.
https://username:api_key@api.softlayer.com/rest/v3/SoftLayer_Product_Package/2/getItemPrices?objectMask=mask[id,item[keyName,description],pricingLocationGroup[locations[id, name, longName]]]
and then do the verify and place order POST calls using the respective APIs.
I am stuck on Step 1, as the filtering here seems to be a bit tricky. I am getting a JSON response of over 20,000 lines.
I wanted to show similar data (just like the SL Performance storage UI) on my custom BPM UI (one drop-down to select the type of storage, a 2nd to show the location, a 3rd to show the size, and a 4th for IOPS), where the user can select the items and place the request.
Here I found that SL does something similar to this for populating the drop-downs:
https://control.softlayer.com/sales/productpackage/getregions?_dc=1456386930027&categoryCode=performance_storage_iscsi&packageId=222&page=1&start=0&limit=25
Can't we have an implementation where we use control.softlayer.com, just like SL does, instead of api.softlayer.com? In that case we could use similar logic to display the data on the UI.
Thanks
Anupam
Here are the steps for performance storage using the API. For Endurance storage the steps are similar; you just need to review the value for categoryCode and modify it if needed.
You can get the locations using this method:
http://sldn.softlayer.com/reference/services/SoftLayer_Product_Package/getRegions
You just need to know the package of the storage, e.g.
GET https://api.softlayer.com/rest/v3.1/SoftLayer_Product_Package/222/getRegions
Then you can get the storage sizes; for that you can use the SoftLayer_Product_Package::getItems or SoftLayer_Product_Package::getItemPrices methods and a filter, e.g.
GET https://api.softlayer.com/rest/v3.1/SoftLayer_Product_Package/222/getItemPrices?objectFilter={"itemPrices": {"categories": {"categoryCode": {"operation": "performance_storage_space"}},"locationGroupId": { "operation": "is null"}}}
Note: We are filtering the data to get the prices whose category code is "performance_storage_space", and we want the standard prices (locationGroupId = null).
Then you can get the IOPS. You can use the same approach as above, but there is a dependency between the IOPS and the storage space, e.g.
GET https://api.softlayer.com/rest/v3.1/SoftLayer_Product_Package/222/getItemPrices?objectFilter={"itemPrices": { "attributes": { "value": { "operation": 20 } }, "categories": { "categoryCode": { "operation": "performance_storage_iops" } }, "locationGroupId": { "operation": "is null" } } }
Note: In the example we assume that the selected storage space was "20". The prices for IOPS have a record called attributes; this record tells us the valid storage spaces for the IOPS. Then we have other filters to get only the IOPS prices (categoryCode = performance_storage_iops), and we want only the standard prices (locationGroupId = null).
For selecting the storage type, I do not think there is a method; the only way I see is to call the SoftLayer_Product_Package::getAllObjects method and filter the data to get the packages for Endurance, Performance and Portable storage.
Just in case, here is an example using the SoftLayer Python client to place an order:
"""
Order a block storage (performance ISCSI).
Important manual pages:
http://sldn.softlayer.com/reference/services/SoftLayer_Product_Order
http://sldn.softlayer.com/reference/services/SoftLayer_Product_Order/verifyOrder
http://sldn.softlayer.com/reference/services/SoftLayer_Product_Order/placeOrder
http://sldn.softlayer.com/reference/services/SoftLayer_Product_Package
http://sldn.softlayer.com/reference/services/SoftLayer_Product_Package/getItems
http://sldn.softlayer.com/reference/services/SoftLayer_Location
http://sldn.softlayer.com/reference/services/SoftLayer_Location/getDatacenters
http://sldn.softlayer.com/reference/services/SoftLayer_Network_Storage_Iscsi_OS_Type
http://sldn.softlayer.com/reference/services/SoftLayer_Network_Storage_Iscsi_OS_Type/getAllObjects
http://sldn.softlayer.com/reference/datatypes/SoftLayer_Location
http://sldn.softlayer.com/reference/datatypes/SoftLayer_Container_Product_Order_Network_Storage_Enterprise
http://sldn.softlayer.com/reference/datatypes/SoftLayer_Product_Item_Price
http://sldn.softlayer.com/blog/cmporter/Location-based-Pricing-and-You
http://sldn.softlayer.com/blog/bpotter/Going-Further-SoftLayer-API-Python-Client-Part-3
http://sldn.softlayer.com/article/Object-Filters
http://sldn.softlayer.com/article/Python
http://sldn.softlayer.com/article/Object-Masks
License: http://sldn.softlayer.com/article/License
Author: SoftLayer Technologies, Inc. <sldn#softlayer.com>
"""
import SoftLayer
import json
# Values "AMS01", "AMS03", "CHE01", "DAL05", "DAL06" "FRA02", "HKG02", "LON02", etc.
location = "AMS01"
# Values "20", "40", "80", "100", etc.
storageSize = "40"
# Values between "100" and "6000" by intervals of 100.
iops = "100"
# Values "Hyper-V", "Linux", "VMWare", "Windows 2008+", "Windows GPT", "Windows 2003", "Xen"
os = "Linux"
PACKAGE_ID = 222
client = SoftLayer.Client()
productOrderService = client['SoftLayer_Product_Order']
packageService = client['SoftLayer_Product_Package']
locationService = client['SoftLayer_Location']
osService = client['SoftLayer_Network_Storage_Iscsi_OS_Type']
objectFilterDatacenter = {"name": {"operation": location.lower()}}
objectFilterStorageNfs = {"items": {"categories": {"categoryCode": {"operation": "performance_storage_iscsi"}}}}
objectFilterOsType = {"name": {"operation": os}}
try:
# Getting the datacenter.
datacenter = locationService.getDatacenters(filter=objectFilterDatacenter)
# Getting the performance storage NFS prices.
itemsStorageNfs = packageService.getItems(id=PACKAGE_ID, filter=objectFilterStorageNfs)
# Getting the storage space prices
objectFilter = {
"itemPrices": {
"item": {
"capacity": {
"operation": storageSize
}
},
"categories": {
"categoryCode": {
"operation": "performance_storage_space"
}
},
"locationGroupId": {
"operation": "is null"
}
}
}
pricesStorageSpace = packageService.getItemPrices(id=PACKAGE_ID, filter=objectFilter)
# If the prices list is empty that means that the storage space value is invalid.
if len(pricesStorageSpace) == 0:
raise ValueError('The storage space value: ' + storageSize + ' GB, is not valid.')
# Getting the IOPS prices
objectFilter = {
"itemPrices": {
"item": {
"capacity": {
"operation": iops
}
},
"attributes": {
"value": {
"operation": storageSize
}
},
"categories": {
"categoryCode": {
"operation": "performance_storage_iops"
}
},
"locationGroupId": {
"operation": "is null"
}
}
}
pricesIops = packageService.getItemPrices(id=PACKAGE_ID, filter=objectFilter)
# If the prices list is empty that means that the IOPS value is invalid for the configured storage space.
if len(pricesIops) == 0:
raise ValueError('The IOPS value: ' + iops + ', is not valid for the storage space: ' + storageSize + ' GB.')
# Getting the OS.
os = osService.getAllObjects(filter=objectFilterOsType)
# Building the order template.
orderData = {
"complexType": "SoftLayer_Container_Product_Order_Network_PerformanceStorage_Iscsi",
"packageId": PACKAGE_ID,
"location": datacenter[0]['id'],
"quantity": 1,
"prices": [
{
"id": itemsStorageNfs[0]['prices'][0]['id']
},
{
"id": pricesStorageSpace[0]['id']
},
{
"id": pricesIops[0]['id']
}
],
"osFormatType": os[0]
}
# verifyOrder() will check your order for errors. Replace this with a call to
# placeOrder() when you're ready to order. Both calls return a receipt object
# that you can use for your records.
response = productOrderService.verifyOrder(orderData)
print(json.dumps(response, sort_keys=True, indent=2, separators=(',', ': ')))
except SoftLayer.SoftLayerAPIError as e:
print("Unable to place the order. faultCode=%s, faultString=%s" % (e.faultCode, e.faultString))

RavenDB facet query takes too long

I am new to RavenDB and trying it out to see if it can do the job for the company I work for.
I uploaded 10K records to the server.
Each record looks like this:
{
    "ModelID": 371300,
    "Name": "6310I",
    "Image": "0/7/4/9/28599470c",
    "MinPrice": 200.0,
    "MaxPrice": 400.0,
    "StoreAmounts": 4,
    "AuctionAmounts": 0,
    "Popolarity": 16,
    "ViewScore": 0.0,
    "ReviewAmount": 4,
    "ReviewScore": 40,
    "Cat": "E-Cellphone",
    "CatID": 1100,
    "IsModel": true,
    "ParamsList": [
        {
            "ModelID": 371300,
            "Pid": 188396,
            "IntVal": 188402,
            "Param": "Nokia",
            "Name": "Manufacturer",
            "Unit": "",
            "UnitDir": "left",
            "PrOrder": 0,
            "IsModelPage": true
        },
        {
            "ModelID": 371305,
            "Pid": 398331,
            "IntVal": 1559552,
            "Param": "1.6",
            "Name": "Screen size",
            "Unit": "inch",
            "UnitDir": "left",
            "PrOrder": 1,
            "IsModelPage": false
        },.....
where ParamsList is an array of all the attributes of a single product.
After building an index of:
from doc in docs.FastModels
from docParamsListItem in ((IEnumerable<dynamic>)doc.ParamsList).DefaultIfEmpty()
select new { Param = docParamsListItem.IntVal, Cat = doc.Cat }
and a facet of
var facetSetupDoc = new FacetSetup
{
    Id = "facets/Params2Facets",
    Facets = new List<Facet> { new Facet { Name = "Param" } }
};
and search like this
var facetResults = session.Query<FastModel>("ParamNewIndex")
    .Where(x => x.Cat == "e-cellphone")
    .ToFacets("facets/Params2Facets");
it takes more than a second to query, and that is with only 10K records, whereas our company has more than 1M products in the DB.
Am I doing something wrong?
In order to generate facets, you have to check each and every individual value of docParamsListItem.IntVal. If you have a lot of them, that can take some time.
In general, you shouldn't have a lot of facets, since that makes no sense; it doesn't help the user.
For integers, you usually use ranges, instead of the actual values.
For example, price within a certain range.
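For example, a price range facet could look something like this (a sketch using the typed Facet API; the exact syntax depends on your RavenDB build, and the ranges here are made up):
var priceFacetSetup = new FacetSetup
{
    Id = "facets/PriceFacets",
    Facets = new List<Facet>
    {
        new Facet<FastModel>
        {
            // Facet on MinPrice, bucketed into ranges instead of raw values.
            Name = x => x.MinPrice,
            Ranges =
            {
                x => x.MinPrice < 200,
                x => x.MinPrice >= 200 && x.MinPrice < 400,
                x => x.MinPrice >= 400
            }
        }
    }
};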
You use just the plain field for things like the manufacturer, or the MegaPixels count, where you have a low number of items (about a dozen or two).
You didn't mention which build you are using, but we made some major improvements there recently.