I have an example product sales table which consists of around 15 columns and a couple thousand rows. The columns I'm most interested in look like this:
product_type  currency  amount  order_time           delivered_time
hoodie        GBP       60.0    2021-03-10 14:32:07  2021-03-13 16:05:52
shirt         EUR       30.0    2021-03-20 19:22:32  2021-03-24 11:18:46
...
There is currently a unique identifier, but it isn't useful for broad analysis: an order can contain multiple products, yet each product gets a different identifier, so they can't be matched up.
What I want to do is create a new identifier column using a hash function. I've used the code below, with an example of the output I get:
SELECT *, Md5(product_type||currency||amount)
FROM sales
product_type  currency  amount  identifier
Coat          GBP       100.0   825be52c31f1d92584720466d743e2cf
Coat          GBP       100.0   825be52c31f1d92584720466d743e2cf
This code works for the three columns I've included in the hash function, but I also want to include the two DATETIME columns, and that doesn't work. I've used the code below to try to include them; it runs, but the hash values I get are completely different from each other even when all the values in the columns match up:
SELECT *, Md5(product_type||currency||amount||TRUNC(order_time)||TRUNC(delivered_time))
I've used the TRUNC function on the two date columns because I'm not too concerned about the exact minutes or seconds; I'm mainly interested in the date itself. How can I include the two datetime columns without it breaking the hash?
Use to_char(date, 'YYYY-MM-DD') instead of TRUNC()
Md5(product_type||currency||amount||to_char(order_time,'YYYY-MM-DD')||to_char(delivered_time,'YYYY-MM-DD'))
Or if the datatype is string/varchar, use SUBSTRING(date,1,10)
Md5(product_type||currency||amount||SUBSTRING(order_time,1,10)||SUBSTRING(delivered_time,1,10))
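The underlying issue is that hashing operates on a single concatenated string, so the input must render identically for identical values; relying on an implicit timestamp-to-text cast (as with TRUNC) can leave the time portion in, or depend on session settings, while an explicit format string is deterministic. A full-query sketch (table and column names follow the question):

SELECT *,
       MD5(product_type || '|' || currency || '|' || amount
           || '|' || TO_CHAR(order_time, 'YYYY-MM-DD')
           || '|' || TO_CHAR(delivered_time, 'YYYY-MM-DD')) AS identifier
FROM sales;

The '|' separators are optional but worth considering: without a delimiter, different value combinations can concatenate to the same string (e.g. 'ab'||'c' and 'a'||'bc') and therefore produce the same hash.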
I have a BigQuery dataset that is updated at irregular times (once or twice a week, or less). The data is structured as follows:
id  Column1     Column2      data_date(timestamp)
0   Datapoint0  Datapoint00  2022-01-01
1   Datapoint1  Datapoint01  2022-01-01
2   Datapoint2  Datapoint02  2022-01-03
3   Datapoint3  Datapoint03  2022-01-03
4   Datapoint4  Datapoint04  2022-02-01
5   Datapoint5  Datapoint05  2022-02-01
6   Datapoint6  Datapoint06  2022-02-15
7   Datapoint7  Datapoint07  2022-02-15
Timestamp is a string in 'YYYY-MM-DD' format.
I want to make a chart and a pivot table in Google DataStudio that automatically filters by the latest datapoints ('2022-02-15' in the example). All the solutions I tried are either sub-optimal or just don't work:
- Creating a support column doesn't work, because I would need to mix aggregated and non-aggregated fields (data_date and the latest data_date).
- Adding a filter to the charts only lets me specify a fixed day, so I would need to edit the chart every time the underlying data is updated.
- A dropdown filter lets me dynamically pick whatever date I need, but I consider it suboptimal because it can't automatically select the latest date. A date-range filter is dynamic, but since the update schedule is irregular it may select a range with multiple timestamps or none at all, so it's also a sub-optimal solution.
Honestly, I'm out of ideas. I naively thought it was possible to add a column computed as data_date = (select max(data_date) from dataset), but it seems that's not possible since MAX needs to work on aggregated data.
One possible solution is to create a view that contains only the latest data points, and reference that view from Data Studio.
CREATE OR REPLACE VIEW `project_id.dataset_id.table_name` AS
SELECT *
FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide`
WHERE date = (SELECT MAX(date) FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide`)  # ORDER BY date DESC LIMIT 1 would keep a single row; this keeps every row with the latest date
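Applied to the table from the question, a sketch would look like the one below (the project, dataset and table names are placeholders). Since data_date is a 'YYYY-MM-DD' string, MAX() compares lexicographically, which for this format coincides with chronological order:

CREATE OR REPLACE VIEW `project_id.dataset_id.latest_rows` AS
SELECT *
FROM `project_id.dataset_id.source_table`
WHERE data_date = (SELECT MAX(data_date) FROM `project_id.dataset_id.source_table`);

Pointing the Data Studio data source at this view makes every chart built on it follow the newest upload automatically.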
I have a data frame containing records of sold products, with revenues, prices and dates:
DATA        CLIENT   PRODUCT  PRICE
2020-08-28  xxxxxxx  RIOT     20.0
I want to group my information by year/month and product. I am running a groupby with to_period, which extracts exactly the information I need:
dfgmv = dfgift[['PRODUCT','PRICE']].groupby([dfgift.DATA.dt.to_period("M"), 'PRODUCT']).agg(['count','sum'])
This is the output :
PRICE
count sum
DATA PRODUCT
2020-08 RIOT 2 40.00
The problem is that, when I export to Excel, the date column is not interpreted as a date (yyyy-mm). I am trying to convert the yyyy-mm to something like yyyy-mm-dd so Excel understands it.
I've read several questions about MultiIndex, but my knowledge wasn't enough to apply that information here. I tried to change my index to datetime but, as I ran it, I lost the second index column (product):
dfgmv.index = pd.to_datetime(dfgmv.index.get_level_values(0).astype('datetime64[ns]'))
The result:
VALOR
count sum
DATA
2020-08-01 2 40.00
So, how can I change the date format without losing my index?
Index.set_levels is designed to allow for the setting of specific index level(s).
dfgmv.index = dfgmv.index.set_levels(
    dfgmv.index.levels[0].astype('datetime64[ns]'), level=0
)
Result
PRICE
count sum
DATA PRODUCT
2020-08-01 RIOT 1 20.0
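For reference, here is a self-contained sketch of the whole flow (the sample rows are illustrative; the column names follow the question):

import pandas as pd

# Data shaped like the question's frame
dfgift = pd.DataFrame({
    'DATA': pd.to_datetime(['2020-08-28', '2020-08-29']),
    'CLIENT': ['xxxxxxx', 'yyyyyyy'],
    'PRODUCT': ['RIOT', 'RIOT'],
    'PRICE': [20.0, 20.0],
})

dfgmv = (
    dfgift[['PRODUCT', 'PRICE']]
    .groupby([dfgift.DATA.dt.to_period('M'), 'PRODUCT'])
    .agg(['count', 'sum'])
)

# Convert only level 0 (the monthly period) to a timestamp;
# level 1 (PRODUCT) is left untouched.
dfgmv.index = dfgmv.index.set_levels(
    dfgmv.index.levels[0].astype('datetime64[ns]'), level=0
)

dfgmv.to_excel('sales_by_month.xlsx')  # Excel now sees 2020-08-01 as a date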
You can change your groupby to include start of month for each date you have:
dfgmv = dfgift[['PRODUCT','PRICE']].groupby([dfgift.DATA.astype('datetime64[M]'), 'PRODUCT']).agg(['count','sum'])
dfgmv
PRICE
count sum
DATA PRODUCT
2020-08-01 RIOT 1 20.0
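One hedged caveat: recent pandas releases reject Series.astype('datetime64[M]') (only s/ms/us/ns units are allowed in datetime64 casts). A version-proof equivalent is to round-trip through a monthly period:

dfgmv = (
    dfgift[['PRODUCT', 'PRICE']]
    # to_period('M') buckets by month; to_timestamp() places each
    # bucket at its start, i.e. the first day of the month
    .groupby([dfgift.DATA.dt.to_period('M').dt.to_timestamp(), 'PRODUCT'])
    .agg(['count', 'sum'])
)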
I am pretty new to SQL, but I need to use it for my new job because the project requires it. As I am not an IT guy, this is more difficult for me; it's the first time I've worked professionally with SQL. Hopefully you can help me with it (sorry for my English, I am a non-native speaker).
I need to write a query that returns the unequal IDs from 2 different reference dates.
So I have one table with the following data:
DATES   ID        AMOUNT  SID
201910  122424    99999   1
201911  41241242  99999   2
201912  12412424  -22222  3
...
GOAL:
The IDs from date 201911 shall be compared with those from 201910, and the query should show only the unequal (unmatched) IDs. From that result, the AMOUNT should be summed up, grouped by SID.
If you have two dates and you want sids that are only on one of them, then:
select sid
from t
where date in (201911, 201910)
group by sid
having count(distinct date) = 1;
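To also get the summed amounts per SID that the question asks for, the same having-clause trick can feed a subquery. A sketch, assuming the table name t and the column names from the question (dates, id, amount, sid):

select sid, sum(amount) as total_amount
from t
where dates in (201910, 201911)
  and id in (select id
             from t
             where dates in (201910, 201911)
             group by id
             having count(distinct dates) = 1)
group by sid;

The inner query keeps only the IDs that appear on exactly one of the two dates; the outer query then totals their amounts per SID.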
I am trying to run a select on a table whose data ranges across multiple days, so it does not conform to the daily data that the documentation alludes to.
Applying an xbar selection across multiple days obviously results in data that is not ordered, i.e. select last size, last price by 1 xbar time.second on data that includes 2 days would result in:
second | size price
====================
00:00:01 | 400 555.5
00:00:01 | 600 606.0
00:00:02 | 400 555.5
00:00:02 | 600 606.0
How can one include the current date in the selection so that, as in pandas, the result stays ordered across multiple days, e.g. 2019-09-26 16:34:40?
Furthermore, how does one achieve this while maintaining a date format that is compatible with pandas once stored in CSV?
NB: It is easiest for us to assist you if you provide code that can replicate a sample of the kind of table that you are working with. Otherwise we need to make assumptions about your data.
Assuming that your time column is of timestamp type (e.g. 2019.09.03D23:11:54.711811000), a simple solution is to xbar by one as a timespan, rather than using the time.second syntax:
select last size, last price by 0D00:00:01 xbar time from data
Using xbar keeps the time column as a timestamp rather than casting it to second type.
If your time column is of some other temporal type then you can still use this method if you have a date column in your table that you can use to cast time to a timestamp. This would look something like:
select last size, last price by 0D00:00:01 xbar date+time from data
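On the CSV part of the question, one hedged approach (column names follow the example; the output file name is a placeholder) is to render the timestamps as pandas-friendly strings before saving:

res: 0! select last size, last price by 0D00:00:01 xbar time from data
/ "2019.09.26D16:34:40.000000000" -> "2019-09-26 16:34:40.000" (cast to millisecond time)
res: update time: {ssr[string `date$x;".";"-"]," ",string `time$x} each time from res
`:out.csv 0: csv 0: res

pandas can then read the file directly with pd.read_csv('out.csv', parse_dates=['time']).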
I would suggest grouping by both date and second, and then summing them:
update time: date+time from
select last size, last price
by date: `date$time, time: 1 xbar `second$time from data
Or, a shorter and more efficient option is to sum date and second right in the group clause:
select last size, last price by time: (`date$time) + 1 xbar `second$time from data
I'm trying to create what seems like it should be a pretty simple matrix report, and I'm hoping someone can help. I have a dataset that returns sales region, date, and sales amount. The requirement is to compare sales for various time periods to the current date. I'm looking to get my matrix to look something like this:
Region  CurrentSales  Date2Sales  CurrentVSDate2  Date3Sales  CurrentVSDate3
1       1000          1500        -500            800         200
2       1200          1000        200             900         300
3       1500          1100        400             1400        100
I can get the difference from one column to the next, but I need all columns to reference the CurrentSales column. Any help would be greatly appreciated.
Currently my dataset pulls in a date, region, product and sales amount. I have three parameters: CurrentDate, PreviousMonth and PreviousQuarter. The regions and products are my row groups, and the dates are the column groups. I then added a column inside the group with the following expression: =Sum(Fields!SalesAmount.Value)-Previous(Sum(Fields!SalesAmount.Value),"BookingDate"). I know this isn't correct, because it compares each value to the previous date in the column group, and I need the comparison to be against the first date in the column group.
Example: using expressions, you can build the comparison with IIf, whose general shape is =IIf(<condition>, <value if true>, <value if false>). For instance: =IIf(Sum(Fields!EndBalance.Value)=0, Nothing, Sum(Fields!EndBalance.Value)). You can also use Switch.
The easiest way to get this result would probably be in your query. Add a field to every row returned maybe called "Current Sales." Use a correlated subquery there to get the right value for comparison. Then your comparison can be as simple as =Fields!Sales.Value - Fields!CurrentSales.Value or similar.
There are some ways to do this at the report level, but they are more of a pain: my current favorite of those is to use custom code embedded in the report. Another approach is to use Aggregates of aggregates.
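A hedged sketch of that query-side approach (the table, column and parameter names are assumptions based on the question):

SELECT s.Region,
       s.BookingDate,
       s.SalesAmount,
       (SELECT SUM(c.SalesAmount)
          FROM Sales c
         WHERE c.Region = s.Region
           AND c.BookingDate = @CurrentDate) AS CurrentSales  -- current-date total per region
FROM Sales s;

Every row then carries the current-date total for its region, so the matrix comparison reduces to =Fields!Sales.Value - Fields!CurrentSales.Value (or their summed equivalents), as described above.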