Hi, I'm not sure if this is the best way to do this. I am familiar with Views, but I'm interested in outputting to a specific variable, not a table, list, etc.
Here is what I'm trying to do:
For each taxonomy term: echo the term, count how many users have a node with that term, and echo that count.
This is not possible with the default Views module, as it doesn't offer this kind of aggregation function. However, with the Views Aggregator Plus module you can add the desired behaviour. Where standard Views would give you a (grouped) table, this module can count rows for each group (and do much more, like sum and average), so you get a single row per taxonomy term and the number of nodes/users associated with that term.
I have a Postgres question/challenge.
There is a table 'products' with a lot of rows (millions).
Each product is of a certain Class.
Each product has a number of features of different type:
A Feature could be 'color' and the value is a picklist of all colors.
A Feature could be Voltages with a numerical value of (low) 220 to (high) 240.
There can be up to 100 features for each product.
What is done is to put all features of a product in a many-side table (with the products table as the one-side).
So, this table is even bigger (much bigger).
Standard query (no Feature-filters)
A query comes along for all products of that Class. This can result in a lot of products, so pagination is implemented in the SQL query.
I solved this by querying the products table first, then running a separate query on the feature table, gathering all features for the products in the first batch and adding them to the result (in the NodeJS API application).
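As a sketch (table and column names here are made up):

-- first query: one page of products for the requested class
SELECT id, name
FROM products
WHERE class = 'some-class'
ORDER BY id
LIMIT 50 OFFSET 0;

-- second query: all features for exactly those products; the results
-- are merged into the response in the NodeJS API application
SELECT product_id, feature_name, feature_value
FROM product_features
WHERE product_id = ANY($1);  -- $1 = array of ids from the first query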
Problem with using a Feature-filter
But now a new request comes along, asking for products of a certain Class that also match the value of a certain feature.
It is not possible to use the same method as before and just filter out all products not matching the value of the specific feature mentioned in the request,
because post-processing the database result and removing products (those not matching the feature value) would mess up the pagination (which comes from the database).
Possible Solutions
The following solutions I have already thought of:
Go the MongoDB way
Just put everything about a product in one record, and use arrays in Postgres for the features.
The downside is that arrays can become quite large, and I don't know how Postgres performs on very large records.
(Maybe I should go with MongoDB, which is filled by Postgres, just to handle requests)
Any tips here?
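To make the idea concrete, here is a minimal sketch of such a one-record-per-product design; I show jsonb here instead of a plain array, since the features are key/value pairs (all names are hypothetical):

-- one row per product, features denormalized into a single jsonb column
CREATE TABLE products_flat (
    id       bigint PRIMARY KEY,
    class    text NOT NULL,
    features jsonb  -- e.g. {"color": "red", "voltage": 230}
);
CREATE INDEX ON products_flat USING gin (features);

-- class + feature filter, with pagination still done by the database
SELECT id
FROM products_flat
WHERE class = 'some-class'
  AND features @> '{"color": "red"}'
ORDER BY id
LIMIT 50 OFFSET 0;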
Forget pagination from the database
Just do not do the pagination in the database and handle it in NodeJS instead. Then I can do the post-processing in JavaScript.
But then only the WHERE clause runs in the database for filtering (no LIMIT/OFFSET), which makes it quite complex and costs a lot of memory in the NodeJS application.
This is not the best solution.
Use another technique?
I'm not familiar with Data Warehousing techniques, but is there a solution lurking in that area?
Current stack is Python, Postgres, and NodeJS for the API. Are there any other tools that could help me?
I expect this is a common enough use case, but I'm unsure of the best way to leverage database features to do it. Hopefully the community can help.
Given a business domain where a number of attributes make up a record; we can just call these a, b, c.
Each of these belongs to a parent record, of which there can be many.
Given an external data source that posts updates to those attributes at arbitrary times, and typically only a subset, you get instructions like
z:{a:3}
or
y:{b:2,c:100}
What are good ways to query Postgres for the 'current state', i.e. a single-row result that represents the most recent value of all of a, b, c for each of the parent records?
The current state overall looks like
x:{a:0, b:0, c:1}
y:{a:1, b:2, c:3}
z:{a:2, b:65, c:6}
If it matters, the difference in time between updates of a single value could be arbitrarily long.
I am deliberately avoiding a constantly updated table that writes an individual row for the state, because the write contention could be a problem, and I think there must be a better overall pattern.
Your question is a bit theoretical - but in essence you are describing a top-1-per-group problem. In Postgres, you can use distinct on for this.
Assuming that your table is called mytable, where the parent key is stored in column parent_id, the attribute in column attribute, and column ordering_id defines the ordering of the rows (that could be a timestamp or a serial, for example), you would phrase the query as:
select distinct on (parent_id, attribute) t.*
from mytable t
order by parent_id, attribute, ordering_id desc
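If you then want the single-row-per-parent shape from your example, a possible follow-up (a sketch assuming a value column and attributes named a, b, c; filter requires Postgres 9.4+) is to pivot that result with conditional aggregation:

select parent_id,
       max(value) filter (where attribute = 'a') as a,
       max(value) filter (where attribute = 'b') as b,
       max(value) filter (where attribute = 'c') as c
from (
    -- most recent row per (parent, attribute)
    select distinct on (parent_id, attribute) parent_id, attribute, value
    from mytable
    order by parent_id, attribute, ordering_id desc
) latest
group by parent_id;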
I'm currently trying to determine how to build out a keyword dimension table. We're tracking visits to our website, and we'd like to be able to find the most-used keywords from search-engine searches for the site, as well as any search terms used during the visit on the site itself (price > $100, review > 4 stars, etc.). Since the keywords are completely dynamic and can be used in an infinite number of combinations, I'm having a hard time determining how to store them.

I have a pageview fact table that includes a record every time a page is viewed. The source I'm pulling from includes all the search terms in a delimited list I am able to parse with a regular expression; I just don't know how to store it in the database since the number of keywords can vary so widely from pageview to pageview. I'm thinking this may be more suited to a NoSQL solution than trying to cram it into an MSSQL table, but I don't know. Any help is greatly appreciated!
Depending on how you want to analyze the data, there's a few solutions.
But for the amount of data that you are probably analyzing, I'd just create a table that uses the PK of the fact to store each keyword.
FACT_PAGEVIEW_ID bigint -- Surrogate key of fact table. Or natural key if you don't have a surrogate.
KEYWORD varchar(255) -- or whatever max len the keywords are
VALUE varchar(255)
The granularity of this table is one row per ID/keyword combination. You may have to add VALUE to the grain as well if you allow the same keyword multiple times in a query string.
This allows you to group the keywords by pageview, or start with the pageview fact, filter it, then join to this to identify keywords.
The other option would be a keyword dimension and a many-to-many bridge table with a "keyword group", but since any number of combinations can be used, the first approach is probably quicker and will likely get you 90% of the way there. Most questions, like "what combination of keywords is used most frequently" and "what keywords are most used by the top 10% of the user base", can be answered with this structure.
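For example, the most-used keywords overall (a sketch; the table name FACT_PAGEVIEW_KEYWORD is made up) could be found with:

-- count in how many distinct pageviews each keyword was used
SELECT KEYWORD, COUNT(DISTINCT FACT_PAGEVIEW_ID) AS PAGEVIEWS
FROM FACT_PAGEVIEW_KEYWORD
GROUP BY KEYWORD
ORDER BY PAGEVIEWS DESC;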
Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).
It has a nice way of presenting search results and discovering semantically linked entities.
Here is a screenshot taken from one of the demos.
On the left side you have the word you typed and related words.
Clicking them refines your results.
Now, as an example project, I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.
Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".
Then I would somehow like to calculate related keywords and rank them.
I'm thinking about an SQL query like this. Let's assume "world war 2" has id 3:
select keywordId, count(keywordId) as total
from keywordRelations
where movieId in (
    select movieId
    from keywordRelations
    join movies using (movieId)
    where keywordId = 3
)
group by keywordId
order by total desc
which basically should select all movies that also have the keyword world-war-2, and then look up the keywords those films have as well, selecting the ones that occur the most.
I think with these keywords I can select the movies that match best and build a nice tag cloud containing similar movies and related keywords.
I think this should work, but it's very, very inefficient.
And it's also only one level of relation.
There must be a better way to do this, but how?
I basically have a collection of entities. They could be entities of different kinds (movies, actors, subjects, plot keywords), etc.
I also have relations between them.
It must somehow be possible to efficiently calculate a "semantic distance" between entities.
I also would like to implement more levels of relation.
But I am totally stuck. Well, I have tried different approaches, but everything ends up in algorithms that take ages to compute, with runtime growing exponentially.
Are there any database systems available that are optimized for this?
Can someone point me in the right direction?
You probably want an RDF triplestore. Redland is a pretty commonly used one, but it really depends on your needs. Queries are done in SPARQL, not SQL. Also... you have to drink the semantic web Kool-Aid.
From your tags I see you're more familiar with SQL, and I think it's still possible to use it effectively for your task.
I have an application with a custom-made full-text search implemented using SQLite as the database. In the search field I can enter terms, and a popup list shows suggestions for the word; for each next word, only those that appear in the articles where the previously entered words appeared are shown. So it's similar to the task you described.
To make things more simple let's assume we have only three tables. I suppose you have a different schema and even details can be different but my explanation is just to give an idea.
Words
[Id, Word] The table contains words (keywords)
Index
[Id, WordId, ArticleId]
This table (also indexed by WordId) lists the articles where the term appeared.
ArticleRanges
[ArticleId, IndexIdFrom, IndexIdTo]
This table lists ranges of Index.Id for any given article (obviously also indexed by ArticleId). It requires that for any new or updated article the Index table contains entries with a known from-to range. I suppose this can be achieved in any RDBMS with a little help from the autoincrement feature.
So for any given string of words you
Intersect all articles where all the previously entered words appeared; this narrows the search: SELECT ArticleId FROM "Index" WHERE WordId = ... INTERSECT ...
For the list of articles you can get ranges of records from ArticleRanges table
For that range you can efficiently query the WordId lists from Index, grouping the results to get a count and finally sorting by it.
Although I listed them as separate actions, the final query can be just one big SQL statement built from the parsed query string.
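For example, with two words already entered (say their WordIds are 1 and 2), the combined query could look roughly like this (a sketch against the hypothetical schema above; Index is quoted because it is a reserved word):

SELECT i.WordId, COUNT(*) AS total
FROM "Index" i
JOIN ArticleRanges r
  ON i.Id BETWEEN r.IndexIdFrom AND r.IndexIdTo  -- scan only the matched articles' ranges
WHERE r.ArticleId IN (
    SELECT ArticleId FROM "Index" WHERE WordId = 1
    INTERSECT
    SELECT ArticleId FROM "Index" WHERE WordId = 2
)
GROUP BY i.WordId
ORDER BY total DESC;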
Having dutifully normalised all my data, I'm having a problem combining 3NF rows into a single row for output.
Up until now I've been doing this with server-side code, but for various reasons I now need to select all rows related to another row and combine them into a single row, all in MySQL...
So to try and explain:
I have three tables.
Categories
Articles
CategoryArticles_3NF
A category contains a CategoryID plus titles, descriptions, etc. It can contain any number of articles from the Articles table, each consisting of an ArticleID plus a text field to house the content.
The CategoryArticles table is used to link the two, so contains both the CategoryID and the ArticleID.
Now, if I select a Category record, and I JOIN the Articles table via the linking CategoryArticles_3NF table, the result is a separate row for each article contained within that category.
The issue is that I want to output one single row for each category, containing content from all articles within.
If that sounds like a ridiculous request, it's because it is. I'm just using articles as a good way to describe the problem. My data is actually somewhat different.
Anyway - the only way I can see to achieve this is to use GROUP_CONCAT() to group the content fields together. The problem with this is that there is a limit to how much data it can return, and I need it to handle significantly more.
Can anyone tell me how to do this?
Thanks.
Without more information, this sounds like something that should be done in the front end.
If you need to, you can increase the size limit of GROUP_CONCAT by setting the system variable group_concat_max_len. It has a limit based on max_allowed_packet, which you can also increase. I think that the max size for a packet is 1GB. If you need to go higher than that then there are some serious flaws in your design.
EDIT: So that this is in the answer and not just buried in the comments...
If you don't want to change the group_concat_max_len globally then you can change it for just your session with:
SET SESSION group_concat_max_len = <your value here>
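For example (a sketch; the Content column name is an assumption based on your description):

SET SESSION group_concat_max_len = 1000000;

-- one row per category, with the content of all its articles concatenated
SELECT c.CategoryID,
       GROUP_CONCAT(a.Content SEPARATOR '\n') AS AllContent
FROM Categories c
JOIN CategoryArticles_3NF ca ON ca.CategoryID = c.CategoryID
JOIN Articles a ON a.ArticleID = ca.ArticleID
GROUP BY c.CategoryID;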