How to search for a specific word within a text? - pandas

I have a file, of type txt, with the following text:
The dataset is available at: https://archive.ics.uci.edu/ml/datasets.php
The file name is Cancer_Data.xml
This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.
I need to search within this text for the word that contains "xml". I tried the following implementation:
import pandas as pd

with open(local_arquivo, "r") as file_read:
    for line in file_read:
        var_split = line.split()
        for i in range(0, len(var_split)):
            if(var_split[i].str.contains('xml')):
                archive_name = var_split.iloc[i]
The idea was to separate the text using the split function and then look for the part that contains 'xml'. However, when I run it, the following error appears:
AttributeError: 'str' object has no attribute 'str'
I would like the output to be:
archive_name = Cancer_Data.xml

Try
if('xml' in var_split[i]):
source: https://docs.python.org/3/reference/expressions.html#in
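Putting it together, a minimal working version of the loop (note that pandas is not needed here, since var_split is a plain Python list of strings):

with open(local_arquivo, "r") as file_read:
    for line in file_read:
        var_split = line.split()
        for word in var_split:
            if 'xml' in word:
                archive_name = word  # -> 'Cancer_Data.xml'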


Failed exporting df.to_csv using a variable name in the path

I am using a function MyFunction(DataName) that creates a pd.DataFrame(). After certain modifications to the data, I am able to export the dataframe to csv with this code:
df.to_csv (r'\\kant\kjemi-u1\izarc\pc\Desktop\out.csv', index = True, header=True)
This creates an 'out.csv' file which is overwritten every time the code is run. However, when I try to give the file a specific name (for instance the name of the data used to fill in the dataframe), for multiple exports like this:
df.to_csv (fr'\\kant\kjemi-u1\izarc\pc\Desktop\{DataName}.csv', index = True, header=True)
I have this error:
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 MyFunction(DataName)
I am new to the programming world, so any ideas on how I can overcome this problem are very welcome. Thank you very much!
If I understand you right (and given that the fr in your code should be simply r), you want your to_csv statement to be dynamic, with what goes inside the braces changing. So, assume your dataframe is df. Then do this:
DataName = "df"
NewFinger.to_csv(r'\\kant\kjemi-u1\izarc\pc\Desktop\{}.csv'.format(DataName), index=True, header=True)
Thanks for your help. In the beginning I was confused by 'NewFinger'; I thought it was some sort of module I needed to install and could not find information about it on Google. However, I solved the issue based on your suggestion, with the following code:
DataName = "whichever name"
df.to_csv(r'\\kant\kjemi-u1\izarc\pc\Desktop\{}.csv'.format(DataName), index=True, header=True)
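For what it's worth, the fr prefix from the original attempt is itself valid Python (3.6+): it combines a raw string with an f-string, so {DataName} is interpolated directly. The FileNotFoundError may therefore have come from the value of DataName (for instance if it is not a plain string, or contains characters not allowed in file names) rather than from the prefix. A minimal sketch with a hypothetical DataName:

DataName = "my_data"  # hypothetical value; must be a string without characters like \ / : that are invalid in file names
df.to_csv(fr'\\kant\kjemi-u1\izarc\pc\Desktop\{DataName}.csv', index=True, header=True)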

Correct use of append and Alignat from pylatex while pdf creating (python)

I want to save some LaTeX formulas in a pdf:
from pylatex import Document, Section, Subsection, Command,Package, Alignat
doc = Document(default_filepath='basic.tex', documentclass='article')
doc.append('Solve the equation:')
doc.append(r'$$\frac{x}{10} = 0 \\$$',Alignat(numbering=False, escape=False))
doc.generate_pdf("test", clean_tex=True)
But I get an error:
doc.append(r'$$\frac{x}{10} = 0 \\$$',Alignat(numbering=False, escape=False))
TypeError: append() takes 2 positional arguments but 3 were given
How should I solve my problem?
This answer comes late, but I guess there is no harm: the Alignat environment cannot be passed to append like that; instead, the appended formula is enclosed in it. Also, it is a math environment, so the $$ are not necessary.
from pylatex import Document, Section, Subsection, Command, Package, Alignat

doc = Document(default_filepath='basic.tex', documentclass='article')
doc.append('Solve the equation:')
with doc.create(Alignat(numbering=False, escape=False)) as agn:
    agn.append(r'\frac{x}{10} = 0')
Output: the generated pdf shows the text followed by the unnumbered formula x/10 = 0.
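If you need several aligned formulas in one environment, you can keep appending lines inside the same with block; a small sketch reusing the constructor arguments from above (the second equation is just an illustration):

from pylatex import Document, Alignat

doc = Document(default_filepath='basic.tex', documentclass='article')
doc.append('Solve the equation:')
with doc.create(Alignat(numbering=False, escape=False)) as agn:
    agn.append(r'\frac{x}{10} &= 0 \\')  # \\ ends the first alignment row
    agn.append(r'x &= 0')
doc.generate_pdf('test', clean_tex=True)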

Prepare a csv file for process mining

Hope you are doing well!
I was following tutorials for process mining using PM4Py, but I ran into difficulties with the csv file.
My csv file has these columns: 'id', 'status', 'mailID', 'date'... ('status' is the same as 'activity' and contains some specific choices).
My csv file contains a lot of data.
To follow the process mining tutorial I must have columns like 'case:concept:name', but I don't know how to create them.
In your case, I assume 'id' would be the same as the Case ID in normal process mining terminology. Similarly, 'status' corresponds to Activity ID and 'date' would correspond to the timestamp.
The best option is to first read into a pandas dataframe before feeding into PM4Py.
For a detailed understanding of how to do this, here is an example below. As you have not mentioned all the columns that you have in your csv file, let us assume that currently you only have ['id', 'status', 'date'] as your column list. The following code can be adapted to any number of columns you have (by adding them to the list named cols):
import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter
path = '' # Enter path to the csv file
data = pd.read_csv(path)
cols = ['case:concept:name', 'concept:name', 'time:timestamp']
data.columns = cols  # renames by position: 'id' -> case, 'status' -> activity, 'date' -> timestamp
data['time:timestamp'] = pd.to_datetime(data['time:timestamp'])
data['concept:name'] = data['concept:name'].astype(str)
log = log_converter.apply(data, variant=log_converter.Variants.TO_EVENT_LOG)
Here we have changed the column names and their datatypes as required by the PM4Py package and converted the dataframe into an event log using the log_converter. Now you can perform your regular process mining tasks on this event log object. For instance, if you wish to create a Directly-Follows Graph from the event log, you can use the following lines of code:
from pm4py.algo.discovery.dfg import algorithm as dfg_algorithm
dfg = dfg_algorithm.apply(log)
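One caveat on the snippet above: data.columns = cols assigns names purely by position, so it will silently mislabel columns if the csv order differs or if extra columns such as 'mailID' are present. An explicit rename avoids that assumption; a small sketch:

data = data.rename(columns={
    'id': 'case:concept:name',   # case identifier
    'status': 'concept:name',    # activity
    'date': 'time:timestamp',    # timestamp
})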
First you need to import your csv file using pandas, then convert it to an event log object; finally, you can use it in PM4Py.
reference:
https://pm4py.fit.fraunhofer.de/documentation

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object, and read the first item to df1, 2nd to df2, etc. But upon trying to do this I just realized - I have no idea how to change the variable being assigned when doing the assigning via an iteration!!!
That's what prompts my question. I managed to find another way to get my original job done no problemo, but this issue of doing variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib (added in Python 3.4) instead of os:
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # generator over the csv paths in the working directory
# change Path.cwd() to Path(your_path) if the script is in a different location

dfs = {}  # let's hold the csv's in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of dataframes with the key being the csv file name; .stem gives the file name without the extension.
much like
{
    'csv_1': dataframe,
    'csv_2': dataframe
}
If you want to concat these into a single dataframe, then do
df = pd.concat(dfs)
and the outer index level will be the csv file name.
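For completeness, individual frames can then be pulled out of the dictionary by file name; a small usage sketch (file1.csv is a hypothetical file name here):

df_one = dfs['file1']      # the dataframe read from file1.csv
combined = pd.concat(dfs)  # outer index level is the file name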

Pandas HDF5 Select with Where on non natural-named columns

in my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-natural named columns (nb: because of a good reason, with negative numbers being "system" ids etc), which normally doesn't give an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
however, my select statement does fall over it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my columns from the likes of "a-1" to "a_1", but that means reloading all of the data in my system, which is rather much! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
   a-6
0    1
1    2
2    3
3  NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
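For contrast, the same where-based select works once the column has a natural name; a minimal sketch using the modern pandas spelling (format='table' instead of the long-deprecated table=True):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a_6': [1, 2, 3, np.nan]})
df.to_hdf('test.h5', key='df', mode='w', format='table', data_columns=True)
pd.read_hdf('test.h5', key='df', where='a_6 = 2')  # no NameError: 'a_6' is a valid identifier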
There is a way, but you would have to build this into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master):
def select(self):
    """
    generate the selection
    """
    if self.condition is not None:
        return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
    elif self.coordinates is not None:
        return self.table.table.readCoordinates(self.coordinates)
    return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
          condvars={'x': getattr(self.table.table.cols, 'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by substituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.
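If you do go the renaming route, the rewrite can at least be scripted; a hedged sketch (the file name fact.h5 is hypothetical, the key 'store_0_0' is taken from the question):

import pandas as pd

with pd.HDFStore('fact.h5') as store:  # hypothetical file name
    df = store['store_0_0']
    df.columns = df.columns.str.replace('-', '_')  # 'a-6' -> 'a_6', 'm-13' -> 'm_13'
    store.put('store_0_0', df, format='table', data_columns=True)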