Data Modeling: Bank Churn dataset - dataframe

To integrate the data in Talend I must first model the data warehouse, including the dimension and fact tables, which is something I cannot manage to do for the attached dataset. There is also a Business Requirement Document for this dataset.
https://drive.google.com/drive/folders/1e94lj4c3N6cTmyYaPkHj-6ogpX7WZ-L4
For my dimension tables, should I have:
- ActiveCustomer
- Bank_Churn
- CreditCard
- CustomerInfo
- ExitCustomer
- Gender
- Geography
The fact table will be the customer churn? How can this table be structured?
My dataset contains .xlsx files, and when I want to load the metadata into Talend I cannot manage to do so because they are not .csv files.

Based on what you provided, I can say that you need a transaction fact table: you want to capture the occurrence of an event, so you record a transaction in your data warehouse.
I am assuming that the customer accumulates points that build up the credit score.
In a typical Kimball-style star schema, the fact table at the centre of the schema consists of transaction data (for example, orders) with numeric measures that can be aggregated across the dimensions.
<div class="mxgraph" style="max-width:100%;border:1px solid transparent;" data-mxgraph="{"highlight":"#0000ff","nav":true,"resize":true,"toolbar":"zoom layers tags lightbox","edit":"_blank","xml":"<mxfile host=\"app.diagrams.net\" modified=\"2023-01-05T22:05:04.539Z\" agent=\"5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.54\" etag=\"ziXNxJ3eew2bmR6wmLdf\" version=\"20.7.4\"><diagram id=\"9iJC2scZwhvFUXkt-TRG\" name=\"Page-1\">7ZxRc9o4EMc/DY/XwTYG8pgY0vaaXDJHOr0+ZRRb2LrIEiPLAfrpb2VkDFFoyFVUeVCGmaC1ZFn6ac1/rYVelJSrjwItimueYdoL+9mqF016YRj0+wP4pyzrjSUeRRtDLkimK3WGGfmB25baWpMMV3sVJedUksW+MeWM4VTu2ZAQfLlfbc7pfq8LlGPDMEsRNa3fSCaLjXUcjjr7J0zyou05GJ5tjpSoraxHUhUo48sdUzTtRYngXG7elasEUzV57bxs2l0eOLq9MIGZPKbB012J7+nia4zG4e2cTbP69usf+ixPiNZ6wBNS3id1JXmJhb5wuW5no1qSkiIGpYs5Z3Kmj/ShnBaEZldozWt1NZVE6WNbuii4ID+gPqJwKAADHBZSww6H6myE0oRTLsDAeNNB12imTqa7EbiCZrftqINnpmu02qt4hSrZXiCnFC0q8tBcsmpYIpETdsElDFZX0tOBhcSrg/McbOnBsscwUVKsoYpuMNa82wU/1OVlt3yCgbYVO0snisZ62eolm29P3VGFNxrsGyCHBuQW8OyLgRgGLhtCgj/iZ0heoIQoyRkUKZ6rZmrmCHjPuTZLvlAnW6CUsPyqqTMZdJa/9fCViUPbOW08pCBZhpmCyCWS6GG7yBacMNlMT3wBL5jFpP8h7sVw4QmUg64ML1VdyIQzGAsiDTwMy2GJ1ZI4jvRhnzHxa96wnI/C3dazTjs6SPvzxNO2TDsOHdMeGLRntWCoxB61ZdSjsWPUgenZHzHLsEiQxDnXPXjk9pAHrXp1x9z073MQrKq/KOgEi2duj3l0pF47HfPYYP4JVQkSmYdtG3bsWq0FQwP25+o8leQJX+Py4YUYzDP/ReZj15otGBnM/6rLm/mt4FmdwvR55JZjsr5z7TY2kE8rSUpQbtkMUeTFm33okXPxdmZAv8OsFj42s//QxbVoC00HTwTOiFS67W698MytMx+71m5tpLDD/AKxx8nNnz5AOx33KHCt38z4TG2awJXmag/M/CT3uyZb08Fdk2j4bNskOPKOvo3XrWM2I7MtYr9v8j/cOn7lI9yxV5tBWcJrJuDUft/EOu2j901OtSdqyjW4X0rsUVtHPXYuzU2Z9gH+PGjLoIP+sbfwwYlIm/F2k82CKGYZMp+kel32ui4bDvZ12TAOjtRl8ckeq5j+PPtyP4G7t3fpN7v02fsWZcELCWoetH3QzvNY2qB+B/Q13C4LT9oyaedpLKHp0pdEVPJm7oGfArj7JJbQ9G0vwE9B2nnqSmimqF2iVN4nRS2Y198WnouOjpVk2zQH+5DNnLQ2w/jSPxd9u1NvnOb9avDQ3O/YPgj3wO0Dd67FQ3PnQwVdnrV91u7V+AvbHk3ewizlPlHFPnD3ajx64SsjSkBtnqv4vIXToXcuzyPzk3zKMg/+1OCdZ5lHgUEVZzluIy6YlILnnCE67aww1TXLcKYnuqtzxRXCJub5F0u51pEXqiVXgZYs27gMJk2s/9Htm8J3VQAcujhZ7R6crLel7Fx9dbtbU2C5JGrIO2GVGsBBTO14eS1S/JOJab9rATfAHP8U8YG9L4EpUsn6e93a5xe65dch+75HzDm/dm/3dX4HXPQ38Yuc8gver/+Fjv0Pit1PMDTHdn7IIpr+Bw==</diagram></mxfile>"}"></div>
<script type="text/javascript" src="https://viewer.diagrams.net/js/viewer-static.min.js"></script>
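As a rough, hedged sketch (not part of the original answer): one way to prototype the model is to derive the dimension and fact tables from the workbook with pandas and write them out as .csv files that can be registered as file metadata in Talend. The file name and the column names (CustomerId, Geography, Gender, CreditScore, Balance, Exited, and so on) are assumptions based on the public Bank Churn dataset, so adjust them to the actual .xlsx files.

# Hypothetical sketch: derive dimension and fact tables from the Bank Churn
# workbook and export them as CSV for Talend. File and column names are
# assumptions -- adapt them to the real .xlsx files.
import pandas as pd

churn = pd.read_excel("Bank_Churn.xlsx")  # needs openpyxl installed for .xlsx

# Dimension tables: one row per distinct value, with a surrogate key
dim_geography = churn[["Geography"]].drop_duplicates().reset_index(drop=True)
dim_geography["GeographyKey"] = dim_geography.index + 1

dim_gender = churn[["Gender"]].drop_duplicates().reset_index(drop=True)
dim_gender["GenderKey"] = dim_gender.index + 1

dim_customer = churn[["CustomerId", "Surname", "Age", "Tenure"]].drop_duplicates()

# Fact table: one row per customer, with foreign keys to the dimensions
# plus the numeric measures and the churn flag
fact_churn = (churn
              .merge(dim_geography, on="Geography")
              .merge(dim_gender, on="Gender")
              [["CustomerId", "GeographyKey", "GenderKey", "CreditScore",
                "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
                "Exited"]])

# Write everything as CSV so it can be registered as file metadata in Talend
for name, df in {"dim_geography": dim_geography, "dim_gender": dim_gender,
                 "dim_customer": dim_customer, "fact_churn": fact_churn}.items():
    df.to_csv(f"{name}.csv", index=False)

For what it's worth, Talend also has Excel support (the tFileInputExcel component and an Excel file metadata wizard), so converting to .csv may not even be necessary; the sketch above is only one possible workaround.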

Related

Laravel specification search

I'm building an ecommerce website and I want to ask what the best way is to search product specifications. I have a field in the database for the specification that can be stored as JSON or as a PHP array. Is there any tutorial I can use to output a unique specification category, for example CPU, and to put the different CPU brands under that category so the user can decide which specification to add to the search filter? To give you some perspective, I want to make something like this:
Amazon.com search field image
You could search using the whereJsonContains() method.
For example, assume there are some filters in your form as below:
brands
- intel
- amd
Number of core
- 2
- 4
- 8
Now let's build a query to get the filtered results:
Cpu::whereJsonContains("preferences->brand", $request->brand)
    ->whereJsonContains("preferences->noc", $request->noc)
    ->get();
But I'm not sure that storing such JSON data in your database is good practice, because this is not a simple store-and-display case: the search results will depend on those specifications.

Google-BigQuery and StackDriver Data

I am currently sending StackDriver log files for my app to a BigQuery table. I would like to strip down the dataset and place it into a new BigQuery table to be queried later and render those results in a view on my app. I will be using python as my main language as I do not know Java, and creating a CRON job to run this script every 15 minutes to populate the new log dataset from StackDriver.
Stripping down the dataset involves two processes:
1.) Only write some of the columns from the original BigQuery table to the new one
2.) Create a subset of the data in certain columns to be written into new columns in the new BigQuery table. For example:
A row in the original BigQuery table will contain the string
Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3
I would like to strip out iPad and place this into a devices column, AppleWebKit and place this into a browsers column, etc, in the new BigQuery table.
I know I can load the BigQuery libraries into Python to query the original BigQuery table, but how do I strip out what I want and write that to a new table? Would this be a good use case for pandas? Is there an easier way to accomplish this task than my current idea?
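As a hedged sketch of one possible approach (the source/destination table names, the column names timestamp and user_agent, and the regular expressions below are all assumptions): query only the columns of interest with the google-cloud-bigquery client, parse the user-agent string, and stream the reduced rows into a second table.

# Hypothetical sketch: copy selected columns from the StackDriver export table
# into a slimmed-down table, deriving device/browser columns from the user agent.
# Table and column names are placeholders -- adjust to the real export schema.
import re
from google.cloud import bigquery

client = bigquery.Client()
SOURCE = "my_project.logs.stackdriver_export"   # assumed source table
TARGET = "my_project.logs.slim_logs"            # assumed destination table (must already exist)

rows = client.query(f"SELECT timestamp, user_agent FROM `{SOURCE}`").result()

def parse_agent(ua):
    # Very rough parsing; a dedicated user-agent library would be more robust
    device = re.search(r"\((\w+)[;)]", ua or "")
    browser = re.search(r"(AppleWebKit|Gecko|Trident)", ua or "")
    return (device.group(1) if device else None,
            browser.group(1) if browser else None)

slim_rows = []
for row in rows:
    device, browser = parse_agent(row["user_agent"])
    slim_rows.append({"timestamp": row["timestamp"].isoformat(),
                      "device": device,
                      "browser": browser})

# Stream the reduced rows into the destination table
errors = client.insert_rows_json(client.get_table(TARGET), slim_rows)
if errors:
    print("Insert errors:", errors)

pandas would also work here (client.query(...).to_dataframe() plus pandas string methods), but for a 15-minute cron job the row-by-row approach keeps memory use small; alternatively, the heavy lifting could be pushed into a single SELECT with REGEXP_EXTRACT that writes straight to a destination table.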

BigQuery - A Metric Shared in All Table Schemas is Not Being Recognized when using BQ Wild Card Function to loop through those Tables

I am trying to run a BQ wildcard query over a set of BQ tables that are named a certain way. Let's call them ProjectId.DatasetId.p_Table_##########. The suffix after p_Table_ represents UNIQUETABLEID. I'm using the wildcard function to pull the same data from each individual p_Table_ table.
#standardSQL
SELECT
_TABLE_SUFFIX as UNIQUETABLEID,
...
total.A,
total.B
FROM `ProjectId.DatasetId.p_Table_*` as total
NOTE that Variables A & B are both STRINGS
All the individual p_Table_ tables have the same schema, and the wildcard function was working just fine last week. However, for some reason, the query is not recognizing two variables, A and B, in those tables even though they are still in the p_Table_ schemas. It keeps popping up with this error:
Error: A not found inside total at [5:20]
The same type of error comes up for B.
I tested pulling the same metrics for individual tables of ProjectId.DatasetId.p_Table_########## and it recognized all the variables just fine.
QUESTIONS:
Could someone please explain why the BQ wildcard function call will not recognize A and B when looking through all suffixes of p_Table_, even though those metrics still appear in the schemas?
OR, does someone have a better solution to loop through all these BQ tables to gather those metrics WITHOUT the brute-force approach of pulling from ALL the individual tables and using UNION ALL? (There are 35 tables associated with this p_Table_ prefix and the number is expected to grow, so we want this to be automated.)
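One documented BigQuery behaviour worth checking: a wildcard query takes its schema from the most recently created table that matches the wildcard, so a single newly created p_Table_ table that lacks A and B would produce exactly this error even though the older tables still have those columns. As a hedged diagnostic sketch (project and dataset IDs are placeholders), the google-cloud-bigquery client can list the matching tables and report any that are missing the two columns:

# Hypothetical diagnostic: list p_Table_ tables whose schema is missing A or B.
# Project and dataset IDs are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = bigquery.DatasetReference("ProjectId", "DatasetId")

for item in client.list_tables(dataset_ref):
    if not item.table_id.startswith("p_Table_"):
        continue
    table = client.get_table(item.reference)
    columns = {field.name for field in table.schema}
    missing = {"A", "B"} - columns
    if missing:
        print(f"{item.table_id} (created {table.created}) is missing {missing}")

If a mismatched table does turn up, adding the missing columns to it (or renaming it out of the wildcard's scope) should make the wildcard query recognize A and B again.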

CouchDb 2.0: update_seq is not a number

According to the official documentation for CouchDB 2.0:
http://docs.couchdb.org/en/2.0.0/api/database/common.html
GET /{db}
Gets information about the specified database.
Parameters:
db – Database name
Request Headers:
Accept – application/json or text/plain
Response Headers:
Content-Type – application/json or text/plain; charset=utf-8
Response JSON Object:
committed_update_seq (number) – The number of committed update.
compact_running (boolean) – Set to true if the database compaction routine is operating on this database.
db_name (string) – The name of the database.
disk_format_version (number) – The version of the physical format used for the data when it is stored on disk.
data_size (number) – The number of bytes of live data inside the database file.
disk_size (number) – The length of the database file on disk. Views indexes are not included in the calculation.
doc_count (number) – A count of the documents in the specified database.
doc_del_count (number) – Number of deleted documents
instance_start_time (string) – Timestamp of when the database was opened, expressed in microseconds since the epoch.
purge_seq (number) – The number of purge operations on the database.
**update_seq (number) – The current number of updates to the database.**
Status Codes:
200 OK – Request completed successfully
404 Not Found – Requested database not found
The update_seq must be returned as a number, but when we run the request
http://192.168.1.48:5984/testing (CouchDB 2.0), the response is:
{"db_name":"testing","update_seq":"0-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUoxJTIkyf___z8rkQGPoiQFIJlkT1idA0hdPGF1CSB19QTV5bEASYYGIAVUOp8YtQsgavcTo_YARO19YtQ-gKgFuTcLANRjby4","sizes":{"file":33952,"external":0,"active":0},"purge_seq":0,"other":{"data_size":0},"doc_del_count":0,"doc_count":0,"disk_size":33952,"disk_format_version":6,"data_size":0,"compact_running":false,"instance_start_time":"0"}
Previously, in CouchDB 1.6.1, when we ran the request
http://192.168.1.80:5984/learners (CouchDB 1.6.1), the response was:
{"db_name":"learners","doc_count":0,"doc_del_count":3,**"update_seq":6**,"purge_seq":0,"compact_running":false,"disk_size":12386,"data_size":657,"instance_start_time":"1487830025605920","disk_format_version":6,"committed_update_seq":6}
So please explain: is this an exception in CouchDB 2.0, or something else?
The CouchDB docs are not up to date on this one. CouchDB 2.0 introduced clustering, and with clustering the update_seq had to be changed to a unique string.
You should treat the update_seq as an opaque identifier, not as something with an inherent meaning. If the update_seq has changed, the database itself has changed.
That said, the first part of the update_seq is a number, so if you really need the numeric sequence, you can parse it. But I would strongly advise to not rely on it, because the update_seq format might change in a future version of CouchDB.
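As a small illustration of that caveat (not part of the original answer), the numeric prefix can be split off like this, assuming the current "N-opaque" format holds:

# Hedged example: extract the numeric prefix of a CouchDB 2.x update_seq.
# This relies on the current "N-<opaque>" format, which is not guaranteed to be stable.
def update_seq_number(update_seq):
    if isinstance(update_seq, int):   # CouchDB 1.x already returns a plain number
        return update_seq
    return int(str(update_seq).split("-", 1)[0])

print(update_seq_number("0-g1AAAAFTeJzLYWBg..."))  # -> 0
print(update_seq_number(6))                        # -> 6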

Splunk returns some queries much faster than others

I am working with access_combined_wcookie data (essentially Nginx log files) in Splunk. An example of a record is below:
5/25/14 2:44:08.000 AM xxx.xxx.xxx.xxx - - [25/May/2014:02:44:08 -0500] "GET /somepath/ HTTP/1.1" 200 9696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
I'm specifically interested in being able to run queries on the GET URI, so that I can issue the following, for instance:
/somepath/ | chart count by date_mday
I have now loaded over 60M records into Splunk and keep adding more each day.
I see that sometimes Splunk automatically indexes the "/somepath/" value, and when I enter the query I get an immediate answer. But when I enter the following query:
/some-path/ | chart count by date_mday
(essentially the same as above, but with a hyphen or dash), I have to wait a while for Splunk to generate the results. It seems that the outcome is volume-related: on smaller sets I get the result immediately, almost as if it were cached, whereas the larger result sets take a long time (disproportionately so).
Is there any way for me to control that behavior? Do the smaller-volume results get cached in memory, so that adding more RAM to the machine (a VM in this case) would help get faster results?
I have also posted this question to Splunk answers (an SO clone site for Splunk) and received a great reply which I'm linking to here
The gist is that a search term some-path is interpreted by Splunk as some AND path, so that's why the query takes so much longer. Instead, using the special search command TERM() works a lot better.
TERM(some-path) | chart count by date_mday
The search can of course be refined further as well. As a side note, a member pointed me to a field in this type of file called uri_stem, which actually speeds up the search even more.