Error with dateutil.parser when iterating through list - text-mining

Converted large text file to list of strings (each row = one element in list) ['...','...','...']
sample_data = ['2017-May-15 13:56:49.578 Event Dispense Sc 06mm Beschichtungsbreite ist: 5.99 mm', '2017-May-15 14:12:11.062 Event Runtime SC 09mm neuer Druck: 27.560PSI']
Trying to extract dates from each list element (each row contains one date with standardized format)
my code:
import dateutil.parser as dparser
dparser.parse(sample_data[0], fuzzy=True)
returns the desired date.
However, when trying to iterate through the list as shown below
for elements in sample_data:
    dparser.parse(elements, fuzzy=True)
I get an error message: ValueError: Unknown string format

I can't see the actual data, but according to the documentation (http://dateutil.readthedocs.io/en/stable/parser.html), parse raises ValueError when the string has an invalid or unknown format.
Example: if the date is written as 15thMarch 2018 rather than 15th March 2018, it would raise a ValueError. Try inspecting the list to see whether one of the elements is malformed in that way.

Solved with regex functions and some wrangling.
Still can't tell why iteration with dparser.parse doesn't work.
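For reference, the regex approach mentioned above might look like the following minimal sketch. It assumes every row starts with a timestamp in exactly the format shown in sample_data; the names TIMESTAMP and extract_timestamp are made up for illustration. Rows that do not match come back as None, which also makes it easy to locate the lines that a parser would choke on.

```python
import re
from datetime import datetime

# Assumed layout, taken from the sample rows: 2017-May-15 13:56:49.578
TIMESTAMP = re.compile(r"\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")

def extract_timestamp(line):
    """Return the leading timestamp of a row, or None if the row has none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    # %b matches abbreviated month names in the current locale (English by default)
    return datetime.strptime(match.group(0), "%Y-%b-%d %H:%M:%S.%f")

sample_data = [
    '2017-May-15 13:56:49.578 Event Dispense Sc 06mm Beschichtungsbreite ist: 5.99 mm',
    '2017-May-15 14:12:11.062 Event Runtime SC 09mm neuer Druck: 27.560PSI',
]

parsed = [extract_timestamp(line) for line in sample_data]
# Indices of rows without a recognizable timestamp (candidates for manual inspection)
bad_rows = [i for i, ts in enumerate(parsed) if ts is None]
```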

Related

Understanding the "Not found: Dataset ### was not found in location US" error

I know this topic has come up many times but still here I am. Data processing location seems consistent (dataset, US; query: US) and I am using backticks & long format in the FROM clause
Below are two sequences of code. The first one works perfectly:
SELECT station_id
FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations`
Whereas the following returns an error message:
SELECT bikeshare_stations.station_id
FROM `bigquery-public-data.austin_bikeshare`
Not found: Dataset glassy-droplet-347618:bigquery-public-data was not found in location US
My question, thus, is why does the first query work while the second doesn't?
You need to understand the different parts of the name inside the backticks:
bigquery-public-data is the name of the project;
austin_bikeshare is the name of the schema (aka dataset in BQ); and
bikeshare_stations is the name of the table/view.
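Therefore, the shorter format is dataset.table, i.e. austin_bikeshare.bikeshare_stations (instead of bigquery-public-data.austin_bikeshare). Note that a two-part name is resolved against the project the query runs in, so the short form only works when that project is bigquery-public-data; from your own project, keep the full three-part name.
A two-part name such as bigquery-public-data.austin_bikeshare is read as dataset.table: a dataset called bigquery-public-data (looked up in your own project, glassy-droplet-347618, as the error message shows) that contains a table called austin_bikeshare, which does not exist.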
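The resolution rule described above can be modelled with a small sketch. This is a simplified illustration and not BigQuery's actual implementation; resolve_table is a hypothetical helper, and it assumes that a two-part name is completed with the default project of the query.

```python
def resolve_table(path, default_project):
    """Split a backtick-quoted table path into (project, dataset, table).

    Simplified model: a three-part path is taken literally, a two-part path
    is read as dataset.table inside the default project.
    """
    parts = path.split(".")
    if len(parts) == 3:
        project, dataset, table = parts
    elif len(parts) == 2:
        project = default_project
        dataset, table = parts
    else:
        raise ValueError(f"expected 2 or 3 dot-separated parts, got {len(parts)}")
    return project, dataset, table

# Project name taken from the error message in the question
working = resolve_table(
    "bigquery-public-data.austin_bikeshare.bikeshare_stations",
    default_project="glassy-droplet-347618",
)
failing = resolve_table(
    "bigquery-public-data.austin_bikeshare",
    default_project="glassy-droplet-347618",
)
```

In this model the failing query ends up looking for a dataset named bigquery-public-data inside glassy-droplet-347618, which is the lookup reported in the error message.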

how to iterate the required part of string in python

As part of my Python project I have to iterate over roll numbers, each of which has 10 characters.
Our roll numbers look like this:
188w1a0501,188w1a0502--------188w1a0599,188w1a05a1--188w1a05a9,188w1a05b1--188w1a05b9----
up to the last member.
I want to get these numbers one by one into a variable so that I can pass them to my function.
How can I do this?
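You can use a regular expression to extract the roll numbers.
import re
data='188w1a0501, 188w1a0502--------188w1a0599, 188w1a05a1--188w1a05a9, 188w1a05b1--188w1a05b9----'
# Split on commas and dashes, then drop the empty pieces left by runs of dashes
tokens = re.split(r"[,-]", data)
formatted = [token.strip() for token in tokens if token.strip()]
print(formatted)
Note that this returns only the roll numbers that literally appear in the string; the ones implied by the dashes are not generated.
The output is: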
['188w1a0501', '188w1a0502', '188w1a0599', '188w1a05a1', '188w1a05a9', '188w1a05b1', '188w1a05b9']
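If the goal is to walk through every roll number rather than only the ones written out, a generator is one option. The sketch below rests on assumptions read off the question: a fixed prefix 188w1a05, numeric suffixes 01 to 99, then letter blocks a1 to a9, b1 to b9 and so on. The function name roll_numbers and the last_letter parameter are hypothetical; adjust them to the real numbering scheme.

```python
from string import ascii_lowercase

def roll_numbers(prefix="188w1a05", last_letter="b"):
    """Yield roll numbers in order: 01..99, then a1..a9, b1..b9, up to last_letter."""
    # Numeric block: 01 through 99 (assumed from 188w1a0501 ... 188w1a0599)
    for n in range(1, 100):
        yield f"{prefix}{n:02d}"
    # Letter blocks: a1..a9, b1..b9, ... (assumed from 188w1a05a1 ... 188w1a05a9)
    for letter in ascii_lowercase:
        for digit in range(1, 10):
            yield f"{prefix}{letter}{digit}"
        if letter == last_letter:
            break

all_rolls = list(roll_numbers())
```

Each value can then be passed to your own function in a plain for loop over roll_numbers().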

Need to parse a date range location in a field to two SQLStatements

I have a custom field that contains a date range in the following format:
3/16/20 - 3/22/20
I need to split this one value into two separate fields, one for the start of the range and one for the end. For the example above, that means 3/16/20 in one field and 3/22/20 in the other.
Currently I have the following, but something is causing an intermittent error and I want to make sure it is not the SQL statement.
For the first selection, I use the following:
TO_DATE(LTRIM(SUBSTR({custbody_shipwindow}, 1,(INSTR({custbody_shipwindow}, '-')-1))),'mm/dd/yy')
For the second selection I use the following:
TO_DATE(LTRIM(SUBSTR({custbody_shipwindow},(INSTR({custbody_shipwindow}, '-')+1), LENGTH({custbody_shipwindow}))),'mm/dd/yy')
Try:
TO_DATE(REGEXP_SUBSTR(TRIM({custbody_shipwindow}),'^[^ -]+'),'MM/DD/YY')
TO_DATE(REGEXP_SUBSTR(TRIM({custbody_shipwindow}),'[^ -]+$'),'MM/DD/YY')
or, to be stricter (these return NULL instead of a date when the value does not match the expected pattern):
TO_DATE(REGEXP_SUBSTR(TRIM({custbody_shipwindow}),'^[0-9]{1,2}/[0-9]{1,2}/[0-9]{1,2}'),'MM/DD/YY')
TO_DATE(REGEXP_SUBSTR(TRIM({custbody_shipwindow}),'[0-9]{1,2}/[0-9]{1,2}/[0-9]{1,2}$'),'MM/DD/YY')
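To check what the first pair of patterns captures without touching the database, the same two regular expressions can be tried in Python. This is only a sketch for experimenting with sample values: split_ship_window is a hypothetical helper, and Python's re module is used here as a stand-in for the database's REGEXP_SUBSTR.

```python
import re
from datetime import datetime

def split_ship_window(value):
    """Return (start, end) dates from a 'm/d/yy - m/d/yy' string, or None."""
    value = value.strip()
    # Leading run of characters that are neither space nor dash
    first = re.search(r"^[^ -]+", value)
    # Trailing run of characters that are neither space nor dash
    last = re.search(r"[^ -]+$", value)
    if first is None or last is None:
        return None
    fmt = "%m/%d/%y"
    start = datetime.strptime(first.group(0), fmt).date()
    end = datetime.strptime(last.group(0), fmt).date()
    return start, end

# Example value in the format described in the question
start, end = split_ship_window(" 3/16/20 - 3/22/20 ")
```

Feeding in the values that trigger the intermittent error should show whether the pattern or the date conversion is the part that fails.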

How do I use values within a list to specify changing selection conditions and export paths?

I'm trying to split a large csv data using a condition. To automate this process, I'm pulling a list of unique conditions from a column in the data set and wanting to use this list within a loop to specify condition and also rename the export file.
I've converted the array of values into a list and have tried fitting my function into a loop, however, I believe syntax is the main error.
# df1718 is my df
# znlist is my list of values (e.g. 0 1 2 3 4)
# serial is specified at the top e.g. '4'
for x in znlist:
    dftemps = df1718[(df1718.varname == 'RoomTemperature') & (df1718.zone == x)]
    dftemps.to_csv('E:\\path\\test%d_zone(x).csv', serial)
So in theory, I would like each iteration to export the data relevant to the next zone in the list and the export file to be named test33_zone0.csv (for example). Thanks for any help!
EDIT:
The error I am getting is: "delimiter" must be string, not int
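The error comes from the to_csv call: serial is passed as the second positional argument, which to_csv interprets as sep (the delimiter), and the %d and (x) placeholders in the path are never filled in. Build the file name first and pass it as the only argument: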
dftemps.to_csv('E:\\path\\test{}_zone{}.csv'.format(str(serial),str(x)))
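One way to keep the naming logic separate from the export is a small helper that builds the path. The sketch below is only an illustration: export_path is a hypothetical name, the folder is the placeholder path from the question, and PureWindowsPath is used so that the result looks the same on any operating system.

```python
from pathlib import PureWindowsPath

def export_path(serial, zone, folder="E:\\path"):
    """Build the CSV path for one zone, e.g. E:\\path\\test33_zone0.csv."""
    return str(PureWindowsPath(folder) / f"test{serial}_zone{zone}.csv")

# Example values: serial '33' and zones 0..2
paths = [export_path("33", x) for x in [0, 1, 2]]
```

Inside the loop, the call would then read dftemps.to_csv(export_path(serial, x)).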

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file that lists more than 2000 organisms, each with an associated BioProject ID (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using biopython? At least the part with the bioproject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can find the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get back in the efetch step, playing with the rettype: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()