Using SQL & REGEX to Clean String and Cast to Date in Chartio

I don't know too much about SQL and REGEX, especially how they work together. But I've become responsible for using Chartio to visualise data at work and need some help.
In Google Analytics, under Search Terms, we capture a date range. When I pull that into Chartio it arrives as an uncleaned string, which makes it almost unusable.
A few examples of how it appears in Google Analytics:
2018-01-08T12:00:00.000Z
2018-01-28T00:00:00.000Z
12-31-2018
Auckland
Christchurch
In Chartio I can create a Data Store where I take the data from Google Analytics and can manipulate it.
I can create a custom column in the schema to convert the string into a date using this command, as suggested by Chartio:
CAST("Dates"."ga:searchKeyword" as date)
But I need to clean the data first so that I only keep valid dates. My poor attempt at creating a command looks like this:
CASE WHEN REGEXP_SUBSTR("(19|20)\d\d[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])") THEN CAST("Dates"."ga:searchKeyword") AS DATE
I know my attempt is wrong, because it doesn't work and also I don't know what I am doing.
Please help!
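
To make the intended clean-then-cast logic concrete, here is a minimal sketch in Python rather than Chartio SQL. It is an illustration under assumptions, not a Chartio command: the helper name clean_to_date is hypothetical, only values that start with YYYY-MM-DD are treated as valid, and everything else (including 12-31-2018 and the city names) becomes empty. The pattern is the one from the attempt above, narrowed to the "-" separator and anchored to the start of the string.

```python
import re
from datetime import date, datetime
from typing import Optional

# Accept only values that begin with YYYY-MM-DD; a trailing time part is ignored.
ISO_PREFIX = re.compile(r"^(19|20)\d\d-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])")


def clean_to_date(raw: str) -> Optional[date]:
    """Return a date when the value starts with a valid YYYY-MM-DD, else None."""
    match = ISO_PREFIX.match(raw.strip())
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%Y-%m-%d").date()
    except ValueError:
        # The regex alone accepts impossible dates such as 2018-02-31.
        return None


samples = [
    "2018-01-08T12:00:00.000Z",
    "2018-01-28T00:00:00.000Z",
    "12-31-2018",
    "Auckland",
    "Christchurch",
]
cleaned = [clean_to_date(s) for s in samples]
```

In SQL the same shape would be a CASE expression that tests the column against the pattern and casts only the matching part, returning NULL otherwise. The exact regex function depends on the database behind the Data Store.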

Related

Amazon Marketing Cloud - CAST Function and SQL Documentation

I am using Amazon Marketing Cloud (AMC) for work and I am having trouble applying a WHERE [column] <> ''.
The CSV file that is output is large and it contains many records with nothing in the main ID column. I am able to filter out the nulls, but not the ''.
This is the error message I get when I CAST: "No match found for function signature type(<RecordType(BIGINT order, VARCHAR campaign)>)"
The field is compiled in a CTE using NAMED_ROW('order', ROW_NUMBER() OVER(PARTITION BY imp_user_id ORDER BY impression_timestamp),'campaign', campaign) AS campaign_order.
Then, the next CTE turns it into an array using ARRAY_SORT(COLLECT(distinct a.campaign_order)) AS path.
An example of the output is [[1, <Name of Campaign 1>],[2, <Name of Campaign 2>],...[N, <Name of Campaign N>]]
I know that AMC is based on Presto Database Engine, but when looking for documentation I am not sure whether to look at Presto, Hive, or Apache. Whenever I search for something one of those 3 sources usually comes up. I have luck sometimes and other times I do not. It would help if I knew exactly what form of SQL AMC was using so I can narrow down the documentation, syntax, etc.
This platform is still in beta I believe and is relatively exclusive in terms of access. So, I am not sure if many people will be able to help.
In short, I want to filter out records = '', but due to the data type of the field it won't let me.
How do I cast "type(<RecordType(BIGINT order, VARCHAR campaign)>)" so that I can filter out ''? Also, what documentation should I be using for AMC?
I am considering using LEN() so that I can filter out any records with length = 0.
Anyways, any and all help is appreciated!
If you need more information, then please let me know.
Thank you!

AMC documentation is available at https://advertising.amazon.com/marketing-cloud/documentation (Amazon Advertising account with AMC access is required).
In general, AMC SQL is closer to PostgreSQL than to Presto syntax.
It's a little hard to see why you are trying to cast a record to a string. I think it might be easier to filter the records before the window function is applied. I might be able to help more if you share the relevant part of your query. Alternatively, feel free to contact AMC support by email or via your sales rep.

How to view stats on Snowflake?

I am looking for a way to visualize the stats of a table in Snowflake.
The long way is to pull a meaningful sample of the data with Python and profile it with Pandas, but pulling the data out of Snowflake is somewhat inefficient and unsafe.
Snowflake's new interface shows these stats graphically and I would like to know if there is a way to obtain this data with query or by consulting metadata.
I need something like pandas-profiling but without an external server. Maybe Snowflake stores metadata or statistics about its columns, both numeric and categorical?
https://github.com/pandas-profiling/pandas-profiling
Thank you for your advice.

You can find a lot of metadata in the INFORMATION_SCHEMA.
All the views and table functions in the Snowflake INFORMATION_SCHEMA can be found here: https://docs.snowflake.com/en/sql-reference/info-schema.html

Not sure if you're talking about viewing the information schema as mentioned, but if you need documentation on the new interface, it's called Snowsight.
You can learn more here:
https://docs.snowflake.com/en/user-guide/ui-snowsight.html
Cheers!

The highlight in your screenshot isn't statistics about the data in the table, but merely about the query result (which looks like a DESCRIBE TABLE query). For example, if you look at type, it simply tells you that this table has 6 VARCHAR columns, 2 timestamps, and 1 number.
What you're looking for is something that is provided by most BI tools or data catalogs. I suggest you take a look at those instead.
You could also use an independent tool, like Soda, which is open source.
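
If the goal is only basic per-column stats without moving data out of the warehouse, a plain aggregate query run inside Snowflake may be enough. The sketch below is one possible approach, not a Snowflake feature: a small Python helper (profile_query is a hypothetical name, and the table and column names are made up) that builds a query using only standard aggregates (COUNT, COUNT(DISTINCT ...), MIN, MAX).

```python
from typing import List


def profile_query(table: str, columns: List[str]) -> str:
    """Build one SELECT that computes simple stats for each listed column.

    Identifiers are inserted as-is, so pass trusted names only.
    """
    parts = []
    for col in columns:
        parts += [
            f"COUNT({col}) AS {col}_non_null",
            f"COUNT(DISTINCT {col}) AS {col}_distinct",
            f"MIN({col}) AS {col}_min",
            f"MAX({col}) AS {col}_max",
        ]
    select_list = ",\n  ".join(["COUNT(*) AS row_count"] + parts)
    return f"SELECT\n  {select_list}\nFROM {table}"


# Hypothetical table and column, for illustration only.
sql = profile_query("my_db.my_schema.orders", ["amount"])
```

Only the aggregated row leaves Snowflake, which addresses the concern about pulling raw data out.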

Lucene date range search

I'm using Umbraco v7.2 for a site, and have run into a highly entertaining issue trying to search by a range of dates using the External Searcher.
If I perform a Lucene search using the examine management search tools in the backoffice, I get results using this query:
{(+__NodeTypeAlias:bookingperiod)} AND startDate:2016-03-01T00\:00\:00
Subsequently, I KNOW that I can get results that include this date in a range. However, what's highly entertaining, quite puzzling and really rather frustrating, is that if I use a range query, I get no results. Here's the syntax:
{(+__NodeTypeAlias:bookingperiod)} AND +(startDate:[2016-02-28T00:00:00 TO 2016-03-20T00:00:00])
Now, in the interests of clarity, I've tried escaping the colon characters in the dates, the dashes in the dates and both, but it makes no difference at all. Can anyone explain to me where I'm going wrong?
Thanks!

I ran into this issue a while back. I'm not sure why it happens, but changing the dates to the format "yyyyMMddHHmmss" helped; it might be something with the parser.
So the query becomes:
+__NodeTypeAlias:bookingperiod AND +startDate:[20160228000000 TO 20160320000000]
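
A small sketch of building that range clause programmatically, written in Python purely for illustration (the helper name lucene_date_range is hypothetical, and it assumes the index stores dates in the yyyyMMddHHmmss form described above):

```python
from datetime import datetime


def lucene_date_range(field: str, start: datetime, end: datetime) -> str:
    """Format an inclusive Lucene range clause with yyyyMMddHHmmss bounds."""
    fmt = "%Y%m%d%H%M%S"  # Python equivalent of yyyyMMddHHmmss
    return f"+{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"


query = "+__NodeTypeAlias:bookingperiod AND " + lucene_date_range(
    "startDate", datetime(2016, 2, 28), datetime(2016, 3, 20)
)
```

Because the bounds contain only digits, there are no colons or dashes left to escape.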

YEARFRAC function on SQL Server not working

I was happy to find you can use an Excel-like YEARFRAC function in MS SQL server (http://technet.microsoft.com/en-us/library/ee634405.aspx), but for some reason I get an error that states:
'yearfrac' is not a recognized built-in function name
when I try to run my query. Here is my code:
SELECT CUSTOMER_ID, PRODUCT_SKU, SUB_START_DATE, SUB_EXP_DATE,
YEARFRAC(sub_start_date, sub_exp_date) AS START_TO_END FROM ...
For the record, I have double-checked the dates are in proper datetime format, and I tried both using no basis (as shown above), and using the available 1-4 bases. I also tried removing the column alias (START_TO_END). None of these worked. Any ideas?

No, that is in Analysis Services (DAX specifically), not in T-SQL. The header on the page does not make it clear which section of the documentation you're in, but look at the table of contents on the left...
Sorry the screen shot is double size, it's because of my Retina screen giving 144dpi instead of 72dpi.
Anyway, you should be able to replicate this functionality with your own UDF or, if you always calculate a specific way, it might even be simple enough to do inline. A calendar table may help.
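
As a rough idea of what such a calculation involves, here is a minimal Python sketch of the simplest day-count conventions: actual days divided by a fixed year length (365 or 360). This is an assumption about which basis you need; the 30/360 and actual/actual conventions take more logic and are not covered. The function name year_frac is hypothetical.

```python
from datetime import date


def year_frac(start: date, end: date, days_in_year: int = 365) -> float:
    """Fraction of a year between two dates: actual days / fixed year length."""
    return (end - start).days / days_in_year


full_year = year_frac(date(2021, 1, 1), date(2022, 1, 1))        # 365 days / 365
half_year = year_frac(date(2021, 1, 1), date(2021, 6, 30), 360)  # 180 days / 360
```

The same arithmetic can be written inline in T-SQL as DATEDIFF(day, sub_start_date, sub_exp_date) / 365.0, where the decimal divisor avoids integer division.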

How to retrieve results by date range and sort using SOLR with ColdFusion 9.0.1?

I'm using ColdFusion 9.0.1 and the integrated SOLR full text search engine.
I have dates stored in my SQL Server database as datetime fields for upcoming events. I took these records and inserted them into a SOLR collection with the custom3 and custom4 fields being the dateStart and dateEnd dates respectively. Users want to query the collection against a date range and sort by closest date to now.
First question: How do we set the datatype for the custom1-4 fields? Or, can we? Based on this post, Optimizing Solr for Sorting, the field should be set to either tdate or date rather than string for best performance. Or does SOLR automatically make the field have the correct datatype based on this post, Sort by date in Solr/Lucene performance problems?
Second question: How would the search criteria be structured to pull records? How about between May 1, 2011 and July 31, 2011, for example?

I don't tell too many people this, but for you, I believe it's time to ditch CFINDEX/CFSEARCH, and start using Solr directly.
CF's implementation is built for indexing a large block of text with some attributes, not a query. If you start using Solr directly, you can create your own schema, and have far more granular control of how your search works. Yes, it's going to take longer to implement, but you will love the results. Filtering by date is just the beginning.
Here's a quick overview of the steps:
Create a new index using the CFAdmin. This is the easy way to create all the files you need.
Modify the schema. The schema is in [cfroot]/solr/multicore/[your index name]/conf/
The top half of the schema is <types>. This defines all the datatypes you could use. The bottom half is the <fields>, and this is where you're going to be making most of your changes. It's pretty straightforward, just like a table. Create a field for each "column" you want to include. "indexed" means that you want to make that field searchable. "stored" means that you want the exact data stored, so that you can use it to display results. Because I'm using CF9's ORM, I don't store much beyond the primary key, and I use loadEntityByPK() on my results page.
After modifying the schema, you need to restart the solr service/daemon.
Use http://cfsolrlib.riaforge.org/ to index your data (the add method is an 'insert or modify' style method), and to perform the search.
To do a search, check out this example. It shows how to sort and filter by date. I didn't test it, so the format of the dates might be wrong, but you'll get the idea. http://pastebin.com/eBBYkvCW
Sorry this answer is so general, I hope I can get you going down the right path here :)
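
For the date range part of the question, here is a hedged sketch of what the request parameters could look like once you talk to Solr directly. It is written in Python only to show the parameter shapes, not as CFML. It assumes the schema defines a date-typed field, here given the hypothetical name dateStart; Solr date fields expect UTC timestamps of the form YYYY-MM-DDThh:mm:ssZ, and an inclusive range is written field:[from TO to].

```python
from datetime import datetime
from typing import Dict
from urllib.parse import urlencode


def solr_date(value: datetime) -> str:
    """Format a datetime (assumed to already be UTC) the way Solr date fields expect."""
    return value.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_params(field: str, start: datetime, end: datetime) -> Dict[str, str]:
    """Match everything, filter to the date range, and sort earliest first."""
    return {
        "q": "*:*",
        "fq": f"{field}:[{solr_date(start)} TO {solr_date(end)}]",
        "sort": f"{field} asc",
    }


# May 1, 2011 through July 31, 2011, inclusive.
params = build_params(
    "dateStart", datetime(2011, 5, 1), datetime(2011, 7, 31, 23, 59, 59)
)
query_string = urlencode(params)  # ready to append to the core's select URL
```

For upcoming events, starting the range at the current time and sorting ascending puts the closest date first.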
