I am loading in strings that represent time frames with the format DD days, HH:MM:SS, and I want to create a graph in which these time frames are the expression. That means that I need to convert them into some number-like format. Interval() doesn't seem to be working. Any ideas? Should I reformat the string prior to loading to something other than DD days, HH:MM:SS so that it is more easily usable?
There are lots of ways you could go about solving this. Your idea of reformatting prior to loading sounds like a good option, since QlikView doesn't support regular expressions.
I think the best way to solve your problem is to write a simple Python script that replaces the spaces, comma, and letters with a colon. Then, you'll have a much more workable string in DD:HH:MM:SS format. Of course, this assumes you have a flat file that you can easily tweak:
import re

myreg = re.compile(r'(\s(day|days),\s)')  # we'll look for " day, " and " days, "

newfilestr = ""
with open('myfile.txt', 'r') as myfile:
    for line in myfile:
        # "DD days, HH:MM:SS" becomes "DD:HH:MM:SS"
        newfilestr += re.sub(myreg, ':', line)

with open('fixedtimeformat.txt', 'w') as outputf:
    outputf.write(newfilestr)
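If you would rather end up with a plain number instead of a reformatted string, here is a minimal sketch of the same preprocessing idea that converts each time frame straight to total seconds (assuming every value follows the DD days, HH:MM:SS pattern; to_seconds is just an illustrative name):

import re

def to_seconds(timeframe):
    # Parse "DD days, HH:MM:SS" (the day part is optional) into total seconds.
    m = re.match(r'(?:(\d+)\s+days?,\s*)?(\d+):(\d+):(\d+)', timeframe)
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(to_seconds('3 days, 04:05:06'))  # 273906

A plain seconds column loads as an ordinary number, so no interval parsing is needed on the QlikView side.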
I'm dealing with files sent to us by a client, so we can only get changes to those files with a lot of effort. Sometimes, in a free-text field, we get a mention of length that uses the double-quote character to mean inches. For example, a file might look like this:
"count","desc","start_date","end_date"
"3","it is tall","3/18/2019","4/20/2020"
"10","height: 108" is nice,","04/11/2016","09/22/2015"
"8","it is short","7/20/2019","8/22/2020"
We are using python/pandas. When I load it using:
import pandas as pd
df = pd.read_csv("sample.csv", dtype=str)
I get:
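Based on the sample file, the mis-parsed frame looks roughly like this (note the stray comma and quote in start_date):

  count                 desc    start_date    end_date
0     3           it is tall     3/18/2019   4/20/2020
1    10  height: 108 is nice  ,04/11/2016"  09/22/2015
2     8          it is short     7/20/2019   8/22/2020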
There are two issues I am hoping to solve:
More important issue: I'd like the second value of start_date to be 04/11/2016 (without the comma at the start and the double quote at the end).
Less important issue: I'd like the second value of desc to be height: 108" is nice, (with the inches indicator).
I know that the right thing to do is to get the file to escape the quote using \", but, like I said, that will be a hard change to get.
You can exploit the pattern that the values are separated by "," and remove the first and last " on each line. This solution will break if a free-text field contains ",".
import pandas as pd
import io
with open('sample.csv') as f:
    t = f.read()
print(t)
Out:
"count","desc","start_date","end_date"
"3","it is tall","3/18/2019","4/20/2020"
"10","height: 108" is nice,","04/11/2016","09/22/2015"
"8","it is short","7/20/2019","8/22/2020"
Remove the first and last " in every row, then read_csv with the delimiter ",":
t = '\n'.join([i.strip('"') for i in t.split('\n')])
pd.read_csv(io.StringIO(t), sep='","', engine='python')
Out:
  count                   desc  start_date    end_date
0     3             it is tall   3/18/2019   4/20/2020
1    10  height: 108" is nice,  04/11/2016  09/22/2015
2     8            it is short   7/20/2019   8/22/2020
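For convenience, the same approach can be wrapped into a small helper that reads straight from the file (a sketch; read_quoted_csv is just an illustrative name, and it inherits the same limitation around "," inside free text):

import io
import pandas as pd

def read_quoted_csv(path):
    # Strip the outermost quotes on each line, then split the fields on the
    # remaining "," separators.
    with open(path) as f:
        t = '\n'.join(line.rstrip('\n').strip('"') for line in f)
    return pd.read_csv(io.StringIO(t), sep='","', engine='python')

df = read_quoted_csv('sample.csv')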
I'm currently doing this to generate a dataframe:
dataframe = pd.read_sql("select date_trunc('minute', log_time) as date, .....
My output is a time that looks like this:
"date":"2020-06-01 00:08:00.000"
What I want to do is have a time output that looks like this in the json file that it is outputted to:
"date":"2020-06-08T23:01:00.000Z
I found documents that show how to remove it, but not how to add it. Do I have to do this after the dataframe is made, or is there something in my date_trunc() command that should put it in this format?
Based on our conversation in the comments section, I have edited your question and added "in the JSON file that it is outputted to" to the line "What I want to do is have a time output that looks like this in the JSON file that it is outputted to:". At the end of the day, the only thing that matters is that the raw value is accurate in your JSON file. Don't worry about what it looks like in your Jupyter Notebook. I think this is a common mistake that people make, and one that I have made in the past as well.
I would suggest not worrying about the datetime format in pandas. Just go with pandas default date/time until the very end.
THEN, as a final step, just before exporting to JSON, change the format of the field to:
df['TIME'] = pd.to_datetime(df['TIME']).dt.strftime('%Y-%m-%dT%H:%M:%S.%f').str[:-3] + 'Z'
That will change it to the format 2020-06-08T23:01:00.000Z.
Note that .str[:-3] is required because, according to the documentation, strftime doesn't support milliseconds (3 decimals), only microseconds (6 decimals). As such, you need to truncate the last 3 digits to get the millisecond format.
That specific format, with the T and Z, is not directly supported, so I did a little bit of string manipulation.
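To see the whole thing end to end, here is a minimal sketch (the column name TIME and the sample value are made up):

import pandas as pd

# A hypothetical one-column frame standing in for your query result.
df = pd.DataFrame({'TIME': ['2020-06-08 23:01:00']})

# Keep the default pandas datetime until now; format only at export time.
df['TIME'] = pd.to_datetime(df['TIME']).dt.strftime('%Y-%m-%dT%H:%M:%S.%f').str[:-3] + 'Z'

print(df.to_json(orient='records'))  # [{"TIME":"2020-06-08T23:01:00.000Z"}]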
I have a large text file that uses commas instead of periods to indicate decimals.
Is there a way to get the rxTextToXdf function in the RevoScaleR package to treat commas as decimal points?
I suspect I'm going to get a lot of flak for this post, as it seems really simple.
Edit:
I am currently using a workaround that involves importing the numeric columns as character type, then stripping the comma, replacing it with a period, and converting to numeric:
library(dplyrXdf)
imported_data %>%                            # dataset with character types
    mutate_if(is.character,
              funs(gsub(",", ".", .))) %>%   # replace commas with periods
    mutate_if(is.character, as.numeric) %>%  # convert character to numeric
    persist(cleaned_file)                    # cleaned_file being a file path
It feels like there are much cleaner ways of doing this
RxTextData has a decimalPoint argument for just this purpose.
Assuming your text file is European csv (columns are ; separated, , is the decimal point):
txt <- RxTextData("your/file.txt", decimalPoint=",", delimiter=";")
xdf <- rxDataStep(txt, "imported.xdf")
# do stuff with xdf
In general, it's a good idea to use data source objects to refer to files, rather than filenames. You can also use rxDataStep for just about everything.
In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using a colon as the delimiter and I have a column with the value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*):(.*)'));
The regex might have to be adapted.
It seems Pig takes the input as a plain string; it is not intelligent enough to identify what is data and what is a delimiter.
PigStorage works as a simple string tokenizer. So if you do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
it doesn't solve your problem. But if we write our own PigStorage() method, we could possibly come up with a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit) to split the column based on the delimiter.
I have a big string, precisely an XSLT code, that I would like to hardcode in my VB.NET program. I tried putting an extra " before every quotation mark to double it, but it still didn't work out, and it's pretty tedious to place it 100 times. Using Chr(34) is also not the best solution.
Is there some way, like putting # (or another character) before the string itself, that will handle all the characters in the string that need to be escaped?
If it is a large string, why not save it to a file and then read the file into memory when you want to use it? That way you don't have to do any escaping, and it will be easy to modify if you decide to change it.