I have a text file that looks like this:
Version:23
Developer: Ali
NAME AGE IN
- Carol 22 no
- Kyle 31 yes
...
I am reading it using a Dask dataframe (which should be similar to Pandas). The resulting dataframe should look like this:
NAME AGE IN
Carol 22 no
Kyle 31 yes
I am having trouble getting rid of the dash ('-') at the start of each data row below the column names. I tried
dd.read_csv(filepath, header = 3, sep="\s+")
which results in a dataframe with mismatched row sizes and brings more problems,
and I also tried using multiple delimiters, but that still gives errors:
dd.read_csv(filepath, header = 3, sep="\s-\s+")
dask.dataframe assumes your data is already in a tabular format. If you insist on using dask, then you will get further with dask.bag, which will load the file line by line. You can then filter out the lines that do not start with a dash, and process the ones that do, encoding them as a json object/dict, which you later convert to dataframe with .to_dataframe().
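For example, a minimal sketch of that dask.bag approach, reusing filepath and the column names from the question (the parsing logic is an assumption based on the sample rows shown above):
import dask.bag as db
def parse_line(line):
    # "- Carol 22 no" -> {"NAME": "Carol", "AGE": 22, "IN": "no"}
    _, name, age, in_flag = line.split()
    return {"NAME": name, "AGE": int(age), "IN": in_flag}
bag = db.read_text(filepath)
records = bag.map(str.strip).filter(lambda line: line.startswith("-")).map(parse_line)
df = records.to_dataframe()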
I am relatively new to Python and NLTK and have got hold of Flickr data stored in a CSV, and I want to remove non-English words from the tags column. I keep getting errors saying "expected a String or a byte-like object". I have a feeling it's to do with the fact that the tags column is currently a Pandas Series and not a String. However, none of the related solutions I've seen on Stack have worked when it comes to converting to string.
I have this code:
#converting pandas df to string
filtered_new = df_filtered_english_only.applymap(str)
#check it's converted to string
from pandas.api.types import is_string_dtype
is_string_dtype(filtered_new['tags'])
filtered_new['tags'].dropna(inplace=True)
tokens = filtered_new['tags'].apply(word_tokenize)
#print(tokens)
#remove non-English tags
#initialise corpus of English words from nltk
words = set(nltk.corpus.words.words())
" ".join(w for w in nltk.word_tokenize(df_filtered_english_only["tags"]) \
if w.lower() in words or not w.isalpha())
Any ideas how to resolve this?
Generally: You should give an example of your dataset.
What is the previous content of the column "tags"? How are tags separated? How is "no tags" expressed, and is there a difference between "empty list" and "NaN"?
I assume tags can contain multiple words, so that is important, also when it comes to removing non-English words.
But for simplicity's sake let's assume there are only one-word tags and they are separated by whitespace, so that each row's content is a string. Also let's assume that empty rows (no tags) have the default NA value for pandas (numpy.NaN). And since you probably read the file with pandas, some values might have been auto-converted to numbers.
Setup:
import numpy
import pandas
import nltk
df = pandas.DataFrame({"tags": ["bird dog cat xxxyyy", numpy.NaN, "Vogel Hund Katze xxxyyy", 123]})
> tags
0 bird dog cat xxxyyy
1 NaN
2 Vogel Hund Katze xxxyyy
3 123
Drop NA rows and tokenize:
df.dropna(inplace=True)
tokens = df["tags"].astype(str).apply(nltk.word_tokenize)
> 0 [bird, dog, cat, xxxyyy]
2 [Vogel, Hund, Katze, xxxyyy]
3 [123]
Name: tags, dtype: object
Filter by known words, always allow non-alpha:
words = set(nltk.corpus.words.words())
filtered = [" ".join(w for w in row if w.lower() in words or not w.isalpha()) for row in tokens]
> ['bird dog cat', '', '123']
The main problem in your code probably results from doing a flat iteration over a nested list (you already tokenized, so each row in the pandas Series is now a list). If you make the iteration nested as well, as in the example above, the code should run.
Also, you should never do the string conversion (be it .astype(str) or any other way) BEFORE removing NAs, because then the NAs become something like 'nan' and will not be removed. First drop NAs to handle empty cells, then convert to handle other things like numbers.
I have a large dataframe that basically looks like this:
Name  km    Min  Max
test  24.6  43   555
test  63.9  31   666
which I would like to turn into a dictionary like:
{24.6: ["test",43,555],
63.9: ["test",31,666]}
What I found so far was https://stackoverflow.com/a/67496211/9218349, which would result in:
dict(zip(zip(df.km),zip(df.Name, df.Min,df.Max)))
This way I receive a dictionary of tuples, but I want the keys to be floats and the values to be strings and floats. The floats should generally have 2 decimal places.
How would I do that?
Use to_dict('list') on the transposed DataFrame:
df.set_index('km').T.to_dict('list')
output:
{24.6: ['test', 43, 555], 63.9: ['test', 31, 666]}
NB: in case you have duplicated values in "km", only the last row will be kept, since a dictionary can only have unique keys.
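If the floats should be limited to 2 decimal places, as mentioned in the question, a minimal sketch (assuming the DataFrame is named df as above) is to round before building the dictionary:
df.round(2).set_index('km').T.to_dict('list')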
I am trying to perform a sentiment analysis using a Bayesian Classifier and I have a CSV file consisting of rows with the following structure:
Column 1: Either 1 or 0
Column 2: String
Example: 1 | This is a great movie
I am using Pandas when reading the CSV file (read_csv).
After reading, each row from the CSV file has the following structure:
1;This is a great movie
0;This is a bad movie
I would like to tokenize each string in column 2. However, I have not managed to do this. How do I tackle this problem?
Assuming the df looks like this (just replace the column name 0 with whatever column name you have as the header):
0
0 1;This is a great movie
1 0;This is a bad movie
pd.DataFrame(df[0].apply(lambda x: x.split(";")).values.tolist(),columns=['A','B'])
A B
0 1 This is a great movie
1 0 This is a bad movie
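If you then want to tokenize the text column, a minimal sketch (assuming nltk is installed, its punkt data is downloaded, and the split DataFrame from above is stored as df2) could be:
import nltk
df2 = pd.DataFrame(df[0].apply(lambda x: x.split(";")).values.tolist(), columns=['A', 'B'])
tokens = df2['B'].apply(nltk.word_tokenize)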
I have a pyspark dataframe as shown in the picture:
I.e. I have four columns: year, word, count, frequency. The year ranges from 2000 to 2015.
I would like to apply some operation to the (pyspark) dataframe so that I get the result in a format like the following picture:
The new dataframe columns should be: word, frequency_2000, frequency_2001, frequency_2002, ..., frequency_2015,
with the frequency of each word in each year coming from the previous dataframe.
Any advice on how I could write efficient code?
Also, please rename the title if you can come up with something more informative.
After some research, I found a solution:
The crosstab function can get the output directly:
topw_ys.crosstab("word", "year").toPandas()
Results:
word_year 2000 2015
0 mining 10 6
1 system 11 12
...
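An alternative way to get the frequency_<year> layout described in the question, filling the pivoted cells with the existing frequency values, is pyspark's groupBy/pivot; a minimal sketch, assuming the dataframe and column names from the question:
from pyspark.sql import functions as F
topw_ys.groupBy("word").pivot("year").agg(F.first("frequency")).toPandas()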
I was experimenting with HDF and it seems pretty great because my data is not normalized and it contains a lot of text. I love being able to query when I read data into pandas.
loc2 = r'C:\\Users\Documents\\'
(my dataframe with data is called 'export')
from pandas import HDFStore  # import needed for HDFStore

hdf = HDFStore(loc2 + 'consolidated.h5')
hdf.put('raw', export, format='table', complib='blosc', complevel=9, data_columns=True, append=True)
21 columns and about 12 million rows so far and I will add about 1 million rows per month.
1 Date column [I convert this to datetime64]
2 Datetime columns (one of them for each row and the other one is null about 70% of the time) [I convert this to datetime64]
9 text columns [I convert these to categorical which saves a ton of room]
1 float column
8 integer columns, 3 of these can reach a max of maybe a couple of hundred and the other 5 can only be 1 or 0 values
I made a nice small h5 table and it was perfect until I tried to append more data to it (literally just one day of data, since I receive daily raw .csv files). I received errors showing that the dtypes were not matching up for each column, even though I used the exact same ipython notebook.
Is my hdf.put code correct? If I have append = True does that mean it will create the file if it does not exist, but append the data if it does exist? I will be appending to this file everyday basically.
For columns which only contain 1 or 0, should I specify a dtype like int8 or int16 - will this save space, or should I keep it at int64? It looks like some of my columns end up as float64 (although they have no decimals) and others as int64. I guess I need to specify the dtype for each column individually. Any tips?
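A minimal sketch of pinning down the dtypes before each append, which should also avoid the dtype-mismatch errors (the column names here are hypothetical placeholders, not from the question):
# hypothetical column names; replace with the real ones
flag_cols = ['flag_1', 'flag_2', 'flag_3', 'flag_4', 'flag_5']   # only 0/1 values
small_int_cols = ['count_1', 'count_2', 'count_3']               # values up to a few hundred
export[flag_cols] = export[flag_cols].astype('int8')
export[small_int_cols] = export[small_int_cols].astype('int16')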
I have no idea what blosc compression is. Is that the most efficient one to use? Any recommendations here? This file is mainly used to quickly read data into a dataframe to join to other .csv files which Tableau is connected to.