Importing a TermDocumentMatrix into R - text-mining

I am working on a qualitative analysis project in the tm package of R. I have built a corpus and created a term document matrix and long story short I need to edit my term document matrix and conflate some of its rows. To do this I have exported it out of R using
write.csv()
I then have imported the csv file back into R but am struggling to figure out how to get R to read it as a TermDocumentMatrix or DocumentTermMatrix.
I tried using the suggestions of the following example code with no avail.
It seems to keep reading my matrix as if it was a corpus and each cell as a single document.
# change this file location to suit your machine
file_loc <- "C:\\Documents and Settings\\Administrator\\Desktop\\Book1.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
dtm <- DocumentTermMatrix(corp)
Is there any way to import in a csv matrix that will be read as a termdocumentmatrix or documenttermmatrix without having R read the csv as if each cell is a document?

You're not reading documents, so skip the Corpus() step. This should work directly:
myDTM <- as.DocumentTermMatrix(x, weighting = weightTf)
For next time, consider saving the TDM object as .RData as this will not require conversion, and is also much more efficient.

If you want to keep the format of any data, I would recommend to use the save() function.
You can save any R objects into a .RData file. And when you want to retrieve the data, you can use the load() function.

Related

loading input dataset for every iteration in for loop in palantir

I have a #transform_pandas code which loads the input file for computing.
Inside the compute function I have a for loop which has to read the complete input data and filter accordingly for every iteration.
#transform_pandas(
Output("/FCA_Foundry/dataset1"),
source_df=Input(sample),
)
I have the below code where I'm trying to read source_df dataset for every iteration in for loop and filter the dataset specifically to the year and family and do the computation.
def compute(source_df):
for entire_row in vhcl_df.itertuples():
modyr = entire_row[1]
fam = str(entire_row[2])
/* source_df should be read again here.
source_df = source_df.loc[source_df['i_yr']==modyr]
source_df = source_df.loc[source_df['fam']==fam]
...
Is there a way to achieve this. Thank you for your support.
As already suggested by #nicornk in the comments, you should create a new .copy() item of your source_df right after you declare the transform.
The two filtering steps (that can ben also merged in one, if you don't need to work just on the "modyr filtered" source_df.
Please note that modyr, fam are actual colnames of vhcl_df, it is actually sufficient to
#transform_pandas(
Output("/FCA_Foundry/dataset1"),
source_df=Input(sample),
vhcl_df=Input(path)
)
def compute(source_df, vhcl_df):
for modyr, fam in vhcl_df.items():
temp_df = source_df.copy()
temp_df = source_df.loc[source_df['i_yr']==modyr]
temp_df = source_df.loc[source_df['fam']==str(fam)]
which, in a more concise and clean way is actually writable as
def compute(source_df, vhcl_df):
for modyr, fam in vhcl_df.items():
temp_df = source_df.copy()
filtered_temp_df = temp_df[(temp_df.i_yr==modyr) & (temp_df.fam==str(fam))]
PS: Remember that if source_df is big, you should proceed with PySpark (see foundry docs)
Note that transform_pandas should only be used on datasets that can fit into memory. If you have larger datasets that you wish to filter down first before converting to Pandas, you should write your transformation using the transform_df() decorator and the pyspark.sql.SparkSession.createDataFrame() method.

How should I merge monthly datasets into on dataset for cleaning?

I am working on a case study for a ride share. The data is broken up into monthly datasets, and in order to analyze the data over the last year I would need to merge the data. I uploaded all the data to both BigQuery and Rstudio but am unsure of the best way to make one large dataset.
I may not even have to do this, but I believe that to find trends I should have all the data in one datatable. If this is not the case then I will clean the data one month at a time.
Maybe use purrr::map_dfr()? It's like lapply() and rbind() rolled into one.
library(tidyverse)
all_the_tables <-
map_dfr( # union as it loops over the function
.x = list.files(pattern = ".csv"), # input for the function
.f = read_csv # the function
)
If it's more complicated and you need something to vary the source by you can use something like.
map_dfr(
.x = list.files(pattern = ".csv"),
.f = # the tilde let's you make a more complex sequence of steps
~read_csv(path = .x) |>
mutate(source = .x)
)
If it's a lot of files consider using vroom::vroom()

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index,file in enumerate(list_of_csvs):
df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object, and read the first item to df1, 2nd to df2, etc. But upon trying to do this I just realized - I have no idea how to change the variable being assigned when doing the assigning via an iteration!!!
That's what prompts my question. I managed to find another way to get my original job done no problemo, but this issue of doing variable assignment over an interation is something I haven't been able to find clear answers on!
If i understand your requirement correctly, we can do this quite simply, lets use Pathlib instead of os which was added in python 3.4+
from pathlib import Path
csvs = Path.cwd().glob('*.csv') # creates a generator expression.
#change Path(your_path) with Path.cwd() if script is in dif location
dfs = {} # lets hold the csv's in this dictionary
for file in csvs:
dfs[file.stem] = pd.read_csv(file,nrows=3) # change nrows [number of rows] to your spec.
#or with a dict comprhension
dfs = {file.stem : pd.read_csv(file) for file in Path('location\of\your\files').glob('*.csv')}
this will return a dictionary of dataframes with the key being the csv file name .stem adds this without the extension name.
much like
{
'csv_1' : dataframe,
'csv_2' : dataframe
}
if you want to concat these then do
df = pd.concat(dfs)
the index will be the csv file name.

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
Use a loop to read the CSV files with np.genfromtxt into a np array.
After reading the first CSV file, write the data with .create_table() method, referencing the np array created in Step 1.
For additional CSV files, write the data with .append() method, referencing the np array created in Step 1
End of loop
Updated on 6/2/2019 to read a date field (mm/dd/YYY) and convert to datetime object. Note changes to genfromtxt() arguments! Data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime
csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv' ]
my_dtype= np.dtype([ ('a',int),('b','S20'),('c',float),('d',float),('e','S20') ])
with tb.open_file('SO_56387241.h5', mode='w') as h5f:
for PATH_csv in csv_list:
csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
# modify date in fifth field 'e'
for row in csv_data :
datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y' )
row['my_date'] = datetime_object
if h5f.__contains__('/CSV_Data') :
dset = h5f.root.CSV_Data
dset.append(csv_data)
else:
dset = h5f.create_table('/','CSV_Data', obj=csv_data)
dset.flush()
h5f.close()
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999

How to generate a fits file from the beginning

In this post, they explain how to generate a fits file from ascii file. However, I also would like to know how to define header and data into fits file. (Converting ASCII Table to FITS image)
For example, when I call a spectral fits file with astropy (which is downloaded from a telescope), I can call data and header separately.
I.E
In [1]:hdu = fits.open('observation.fits', memmap=True)
In [2]:header = hdu[0].header
In [3]:header
Out [3]:
SIMPLE = T / conforms to FITS standard
BITPIX = 8
NAXIS = 1
NAXIS1 = 47356
EXTEND = T
DATE = 'date' / file creation date (YYYY-MM-DDThh:mm:ss UT)
ORIGIN = 'XXX ' / European Southern Observatory
TELESCOP= 'XXX' / ESO Telescope Name
INSTRUME= 'Instrument' / Instrument used.
OBJECT = 'ABC ' / Original target.
RA = 30.4993 / xx:xx:xx.x RA (J2000) pointing
DEC = -20.0009 / xx:xx:xx.x DEC (J2000) pointing
CTYPE1 = 'WAVE ' / wavelength axis in nm
CRPIX1 = 0. / Reference pixel in z
CRVAL1 = 298.903594970703 / central wavelength
CDELT1 = 0.0199999995529652 / nm per pixel
CUNIT1 = 'nm ' / spectral unit
..
bla bla
..
END
In [3]:data = hdu[0].data
In [4]:data
Out [4]:array([ 1000, 1001, 1002, ...,
5.18091546e-13, 4.99434453e-13, 4.91280864e-13])
Lets assume, I have data like below
WAVE FLUX
1000 2.02e-12
1001 3.03e-12
1002 4.04e-12
..
bla bla
..
So, I'd like to generate a spectral fits file with my own data (with its own header).
Mini question : Now lets assume, I generate spectral fits file correctly, but I realised that I forgot to take logarithm of WAVE values in X axis (1000, 1001, 1002, ....) . How can I do that without touching FLUX values of Y-axis (2.02e-12, 3.03e-13, 4.04e-13) ?
FITS files are organized as one or more HDUs (Header Data Units) consisting, as the name suggests, as one data object (generally, a single array for an observation, though sometimes something else like a table), and the header of metadata that goes with that data.
To create a file from scratch, especially an image, the simplest way is to directly create an ImageHDU object:
>>> from astropy.io import fits
>>> hdu = fits.ImageHDU()
Just as with an HDU read from an existing file, this HDU has a (mostly empty) header, and an empty data attribute that you can then assign to:
>>> hdu.data = np.array(<some numpy array>)
>>> hdu.header['TELESCOP'] = 'Gemini'
When you're satisfied you can write the HDU out to a file with:
>>> hdu.writeto('filename.fits')
(Note: A lot of the documentation you'll see demonstrates a more complex process of creating an HDUList object, appending the HDU to the HDU list, and then writing the full HDU list. This is only necessary if you're creating a multi-extension FITS file. For a single HDU, you can use hdu.writeto directly and the framework will handle the other structural details.)
In general you don't need to manipulate the headers that describe the format of the data itself--that is automatic and should not be touched by hand (FITS has the unfortunate misfeature of mixing information about data structure with actual metadata). You can see more examples on how to manipulate FITS data here: http://docs.astropy.org/en/stable/generated/examples/index.html#astropy-io
Your other question pertains to manipulating the WCS (World Coordinate System) of the image, and in particular for spectral data this can be non-trivial. I would ask a separate question about that with more details about what you hope to accomplish.