We have many CSV files, each with millions of rows. Twelve files are merged into one large CSV, the merged file is loaded into a Hive external table, and the data is then fed to the machine learning team.
Each CSV file is raw data with the columns Phonenumber, Col1, Col2, ..., Created_date.
The fields we want to fetch are Phonenumber and Created_date (a timestamp).
The timestamp should be written as a time slot/slab based on hh:mm (excluding the date and the seconds).
For example, if hh:mm falls between 00:00 and 00:15 it should write 1, if it falls between 00:15 and 00:30 it should write 2, and so on up to 23:45 to 00:00, which should write 96.
So the final result should look like:
PhoneNo | TimeSlot/Slab
9999999 | 1
8888888 | 23
...
Thanks in advance, friends
Venkat
with t as (select timestamp '2017-03-23 22:47:01' as Created_date)
select (hour(Created_date)*60 + minute(Created_date)) div 15 + 1
from t
92
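Applied to the table itself, the same expression gives one slot per phone number. A minimal sketch, assuming the external table is called phone_data (the table name is a placeholder; the column names come from the question):

select Phonenumber,
       (hour(Created_date)*60 + minute(Created_date)) div 15 + 1 as TimeSlot
from phone_data;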
Related
I have a database of log files. Their dates span a few months, and the query I run returns a few hundred rows per date.
Saving everything into one CSV file makes it too big (about 60,000 rows) to use in Excel the way I want to.
So I want to create a CSV file for each day: 0401.csv, 0402.csv, 0403.csv, and so on.
Below is the code I use to count the number of entries for each date:
SELECT CAST(Created_at AS DATE), COUNT(*) AS Count FROM tbTickets
WHERE Created_at BETWEEN '2020/04/01 00:00:00' AND '2020/08/01 23:59:59'
AND (lots of other conditions here)
GROUP BY CAST(Created_at AS DATE)
ORDER BY 1;
The above returns something like this
2020-04-01 150
2020-04-02 165
...
So what I want to do is save those 150 entries (not the number 150, but the 150 rows with all their columns) into their respective files.
The above code uses GROUP BY, and I am not advanced enough at SQL to know whether I can use it to export files. In general programming terms I would loop through all the rows, count the dates, order them by date, set the start date to the earliest, and write each row out to its respective file. Does anyone know how to do this in SQL? Thanks!
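SQL itself only returns result sets; writing files is done by a client tool. A minimal sketch, assuming SQL Server with the bcp utility and xp_cmdshell available (the server name, database name, and output path below are placeholders): loop over the distinct dates and export each day's rows with bcp.

DECLARE @d DATE, @cmd VARCHAR(4000);

DECLARE dates CURSOR FOR
    SELECT DISTINCT CAST(Created_at AS DATE)
    FROM tbTickets
    WHERE Created_at BETWEEN '2020/04/01 00:00:00' AND '2020/08/01 23:59:59';

OPEN dates;
FETCH NEXT FROM dates INTO @d;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Export all columns for this one day to MMdd.csv (e.g. 0401.csv)
    SET @cmd = 'bcp "SELECT * FROM MyDb.dbo.tbTickets WHERE CAST(Created_at AS DATE) = '''
             + CONVERT(VARCHAR(10), @d, 120)
             + '''" queryout "C:\export\' + FORMAT(@d, 'MMdd') + '.csv" -c -t, -T -S MyServer';
    EXEC xp_cmdshell @cmd;
    FETCH NEXT FROM dates INTO @d;
END
CLOSE dates;
DEALLOCATE dates;

The same loop can be driven from PowerShell or another scripting language instead of xp_cmdshell if enabling it is not an option.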
I am trying to run a select on a table whose data spans multiple days, so it does not conform to the daily data that the documentation alludes to.
Applying xbar across multiple days obviously results in data that is not ordered, i.e. select last size, last price by 1 xbar time.second on data that includes 2 days would result in:
second | size price
====================
00:00:01 | 400 555.5
00:00:01 | 600 606.0
00:00:02 | 400 555.5
00:00:02 | 600 606.0
How can one include the date in the selection so that, as in pandas, the result stays ordered across multiple days, e.g. 2019-09-26 16:34:40?
Furthermore, how does one achieve this while keeping a date format that is compatible with pandas once stored in CSV?
NB: It is easiest for us to assist you if you provide code that can replicate a sample of the kind of table that you are working with. Otherwise we need to make assumptions about your data.
Assuming that your time column is of timestamp type (e.g. 2019.09.03D23:11:54.711811000), a simple solution is to xbar by a one-second timespan rather than using the time.second syntax:
select last size, last price by 0D00:00:01 xbar time from data
Using xbar keeps the time column as a timestamp rather than casting it to second type.
If your time column is of some other temporal type then you can still use this method if you have a date column in your table that you can use to cast time to a timestamp. This would look something like:
select last size, last price by 0D00:00:01 xbar date+time from data
I would suggest grouping by both date and second, and then summing them:
update time: date+time from
select last size, last price
by date: `date$time, time: 1 xbar `second$time from data
Or, a shorter and more efficient option is to sum the date and second right in the by clause:
select last size, last price by time: (`date$time) + 1 xbar `second$time from data
I have a table whose records I want to copy to a database on a remote server after a specific interval of time. The number of records in the table is very high (in the range of millions), and there can be 40-50 columns.
I thought about using pg_dump after each interval, but that sounds inefficient, as the whole table would be dumped again and again.
Assume the interval is 4 hours and the database's life cycle starts at 10:00.
No. of Records at 10:00 - 0
No. of Records at 14:00 - n
No. of Records at 18:00 - n+m
No. of Records at 22:00 - n+m+l
The script (shell) that I want to write should select 0 rows at 10:00,
n rows at 14:00, m rows at 18:00, and l rows at 22:00.
Is there a way to copy only the rows that have been added since the previous interval, to avoid the redundant rows that result from taking a pg_dump every 4 hours?
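A minimal sketch of the incremental approach, assuming the table (called events here as a placeholder) has a created_at timestamp column, or some other monotonically increasing column, recording when each row was inserted:

-- Export only the rows added in the last 4 hours as CSV
COPY (
    SELECT *
    FROM events
    WHERE created_at >= now() - interval '4 hours'
      AND created_at <  now()
) TO STDOUT WITH (FORMAT csv, HEADER);

Run this from cron every 4 hours via psql and pipe the output into a COPY ... FROM STDIN on the remote database. A more robust variant stores the last exported timestamp (or maximum id) in a small watermark table and filters on that instead of now(), so nothing is missed or duplicated if a run is late.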
I have a list of times in QlikView. For example:
1:45 am
2:34 am
3:55 am
etc.
How do I split it into groups like this:
1 - 2 am
2 - 3 am
4 - 5 am
etc.
I used the class function, but something is wrong: it runs, but instead of time buckets it creates what look like converted decimal buckets.
You have a couple of options; by far the simplest is to create a new field that reformats your time field. For example, I created TimeBucket, which formats the time field into hours and appends the same time with an hour added as the upper bound:
LOAD
TimeField,
Time(TimeField,'h tt') & ' - ' & Time(TimeField + maketime(1,0,0),'h tt') as TimeBucket;
LOAD
*
INLINE [
TimeField
1:45
2:34
3:55
16:45
17:56
];
This then produces TimeBucket values such as 1 AM - 2 AM, 2 AM - 3 AM, and so on for each TimeField.
However, depending on your exact requirements, this solution may have problems because Time is a dual function.
Another alternative is to use intervalmatch, as follows. One point to remember is that intervalmatch includes the end points of an interval. This means that, for time, we have to make the "end" times one second before the start of the next interval; otherwise we will generate two records instead of one whenever the source data has a time that sits exactly on an interval boundary.
TimeBuckets:
LOAD
maketime(RecNo()-1,0,0) as Start,
maketime(RecNo()-1,59,59) as End,
trim(mid(time(maketime(RecNo()-1),'h tt'),1,2)) & ' - ' & trim(time(maketime(mod(RecNo(),24)),'h tt')) as Bucket
AUTOGENERATE(24);
SourceData:
LOAD
*
INLINE [
TimeField
1:45
2:34
3:55
16:45
17:56
];
BucketedSourceData:
INTERVALMATCH (TimeField)
LOAD
Start,
End
RESIDENT TimeBuckets;
LEFT JOIN (BucketedSourceData)
LOAD
*
RESIDENT TimeBuckets;
DROP TABLES SourceData, TimeBuckets;
This then joins each TimeField onto its Bucket, e.g. 1:45 falls into the 1 AM - 2 AM bucket.
More information on intervalmatch may be found in both the QlikView installed help as well as the QlikView Reference manual.
Write a nested if statement in your script:
// Check the higher threshold first so that both buckets are reachable
If(TIME >= maketime(2,45,0), 'bucket 2',
If(TIME >= maketime(1,45,0), 'bucket 1',
'Others'))
Not the most elegant, but if you can't get the 1:45 to work with the date() function, you can always convert to military time and just add the hours and minutes, then make buckets out of that.
I need to copy the content of a text file (not CSV) into a PostgreSQL/PostGIS database with a simple SQL script.
The text file looks like:
#date value
2007 11 29 9 29 14.830 4.1
2007 12 7 21 46 25.560 5.3
So the date is composed of 6 values (year, month, day, hour, minute, second). I have one timestamp field in my database where the date should be stored (as one single value).
I know I can use the COPY command to copy each column value of the file to a column in the database,
COPY myDb FROM 'myValues.txt';
but I need to combine several values from the text file into one value in the database.
Sorry if the question has already been asked; I don't really know how to search for this, as I just started with SQL.
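A sketch of one way to do this, assuming PostgreSQL 9.4+ (for make_timestamp), that the header line (#date value) has been removed from the file, and that the fields are separated by single spaces; the staging, measurements, and obs_time names below are placeholders:

CREATE TEMP TABLE staging (
    yr int, mon int, dy int,
    hr int, mi int, sec double precision,
    value numeric
);

-- Load the raw columns; run from psql so the file path is resolved client-side
\copy staging FROM 'myValues.txt' WITH (FORMAT text, DELIMITER ' ')

-- Combine the six date/time columns into one timestamp while inserting
INSERT INTO measurements (obs_time, value)
SELECT make_timestamp(yr, mon, dy, hr, mi, sec), value
FROM staging;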