Disabling built-in indexes in Google Cloud Datastore - indexing

I'm currently doing a benchmark to see if Google Cloud Datastore could suit our needs but I've got a problem with how indexes are handled.
I know that I will never have to filter on anything except the key field, and thus I would like to be able to disable the built-in indexing of all the other fields. I just want to use it as a key/value store.
I'm currently looking at potentially multiple TB of indexes if I cannot disable them (~50 fields, billions of rows) and that would kill our budget.
Is there any way to remove these indexes ? It seems the index.yaml file this link talks about is only about composite indexes.
Thanks for your help !

Found it ! You can explicitly tell Datastore not to index your field by doing it like this (excluded properties)

I have researched in Datastore github issues about this same question, about (2015), the last inquiry was on 2019. But there is no response. You can ask there if it has been any
I have also researched in the Public Issue Tracker PIT of Google Cloud Platform for an existing Feature Request (FR) or Issue related with this, but not found any.
I think the best way to proceed is to file a FR with the proper components. In this way the Engineering team will have visibility about this. The PIT uses the number of "stars" (people who have indicated interest in an issue) to prioritize work on the platform. Given that there is no FR opened, you should open a new one.

Related

Cosmos on Wirecloud

Taking as a reference public documentation (https://wirecloud.conwet.etsiinf.upm.es/slides/1.2_Integration%20with%20other%20GEs.html#slide16) I wonder if at this point there is any progress on connecting Wirecloud & Cosmos in order to retrieve historical data and visualised it over mashups setups.
If not, could you give any direction so I can give a try implementing something around this?
Note: I have already checked some of the available documentation, and it looks to me that my desired feature could be tackled by a simple python implementation to retrieve HDFS files to the appropriated NGSI format, Is it right?
Nevertheless, I believe it will be a dirty mechanism. What should be the recommended way?
I honestly hope not to be cheating by answering my own questions and marking them as correct, but I would like to leave a record of a solution for those folks that might be experiencing same troubles as me.
I have developed a quick and dirty mechanism to retrieve HDFS files into NGSI formats so we can retrieve historical data like we do with Orion widgets.
https://github.com/netzahdzc/cloudCos
Please note, that this is a quite working progress, so there are some hardcode that I hope eventually fix.
Official Cosmos-WireCloud integration is currently not available, although there are third-party widgets using cosmos out there.
In my opinion, the best option for accessing the HDFS filesystem, is using WebHDFS (you will need adding a FIWARE token into the request for authentication).
It should also be possible to connect to Hive (see this ticket for more info)

Crm2013/15 Online and queries on huge data volumes

I'm working on a couple of million records, as soon as I try to run an advanced find, and put as a criteria a linked entity, the advanced find goes in timeout.
Create custom views on this allows me to filter properly? Anyone knows the proper way of using the advanced find this way? Are there limitations on the out of the box CRM that i should be aware of?
In CRM 2013 - it is possible to add indexes for specific fields by adding the columns to the quick find view for the entity.
You will need to wait for the Indexing Management Job to run (which is run every 24 hours by default) - see http://blogs.msdn.com/b/darrenliu/archive/2014/04/02/crm-2013-maintenance-jobs.aspx.
In previous version of CRM, it was necessary to add the indexes directly to the database - this may be necessary for more complex queries.
was too early to post an answer. The problem that I encountered was related to the OOB advanced find. Looking for example for an account with some related contacts (a really plain search with a linked entity) i had a SQL timeout. Everything was OOB so I was a little bit clueless and I opened a case to Microsoft. They found a bug, if i was changing the sorting the advanced find started to work again. They are still investigating. So wasn't a setting problem but a crm bug.

Deleting rows in datastore by time range

I have a CKAN datastore with a column named "recvTime" of type timestamp (i.e. using "timestamp" as type at datastore_create time, as shown in this link). Example value for this column is "2014-06-12T16:08:39.542000".
I have a large numbers of records in the datastore (thousands) and I would like to delete the rows before a given date in "recvTime". My first thought was doing it using the REST API with the datastore_delete operation using a range filter, but it is not possible as described in the following Q&A.
Is there any other way of solving the issue, please?
Given that I have access to the host where CKAN server is running, I wonder if this could be achieved executing a regular SQL sentence on the Postgresql engine where the datastore is persisted. However, I haven't found information about manipulating the CKAN underlying datamodel in the CKAN documentation, so don't know if this a good idea or if it is risky...
Any workaround or information pointer is highly welcome. Thanks!
You could definitely do this directly on the underlying database if you were willing to dig in there (the structure is pretty simple with tables named after the corresponding resource id). You could even turn this into an API of your own using an extension (though you'd want to be careful about permissions).
You might also be interested in the new support (master only atm) for extending the DataStore API via a plugin in an extension - see https://github.com/ckan/ckan/pull/1725

Automating WebTrends analysis

Every week I access server logs processed by WebTrends (for about 7 profiles) and copy ad clickthrough and visitor information into Excel spreadsheets. A lot of it is just accessing certain sections and finding the right title and then copying the unique visitor information.
I tried using WebTrends' built-in query tool but that is really poorly done (only uses a drag-and-drop system instead of text-based) and it has a maximum number of parameters and maximum length of queries to query with. As far as I know, the tools in WebTrends are not suitable to my purpose of automating the entire web metrics gathering process.
I've gotten access to the raw server logs, but it seems redundant to parse that given that they are already being processed by WebTrends.
To me it seems very scriptable, but how would I go about doing that? Is screen-scraping an option?
I use ODBC for querying metrics and numbers out of webtrends. We even fill a scorecard with all key performance metrics..
Its in German, but maybe the idea helps you: http://www.web-scorecard.net/
Michael
Which version of WebTrends are you using? Unless this is a very old install, there should be options to schedule these reports to be emailed to you, and also to bookmark queries. Let me know which version it is and I can make some recommendations.

Using SQL for cleaning up JIRA database

Has anyone had luck with removing large amount of issues from a jira database instead of using the frontend? Deleting 60000 issues with the bulktools is not really feasible.
Last time I tried it, the jira went nuts because of its own way of doing indexes.
How about doing a backup to xml, editing the xml, and reimporting?
We got gutsy and did a truncate on the jiraissues table and then use the rebuild index feature on the frontend. It looks like it's working!
This is old, but I see that this question was just edited recently, so to chime in:
Writing directly to the JIRA database is problematic. The reindex feature suggested in the Oct 14 08 answer just rebuilds the Lucene index, so it is unlikely to clean up everything that needs to be cleaned up from the database on a modern JIRA instance. Off the top of my head, this will probably leave data lying around in the following tables, among others:
custom field data (customfieldvalue table)
issue links (issuelink table)
versions and components (nodeassociation table, which contains other stuff too, so be careful!)
remote issue links or wiki mentions (remotelink table)
If one has already done such a manual delete on production, it's always a good idea to run the database integrity checker (YOURJIRAURL/secure/admin/IntegrityChecker!default.jspa) to make sure that nothing got seriously broken.
Fast forwarding to 2014, the best solution is to write a quick shell script that uses the REST API to delete all of the required issues. (The JIRA CLI plugin is usually a good option for automating certain types of tasks too, but as far as I can tell, it does not currently support the deletion of issues, so the REST API is your best bet.)