How to decode a .csv.gzip file containing tweets? - pandas

I'm trying to do a Twitter sentiment analysis, and my dataset is a couple of .csv.gzip files.
This is what I did to convert them all into one dataframe.
(I'm using Google Colab, in case that has anything to do with the error or the filenames.)
apr_files = [file[9:] for file in csv_collection if re.search(r"04+", file)]
apr_files
Output:
['0428_UkraineCombinedTweetsDeduped.csv.gzip',
'0430_UkraineCombinedTweetsDeduped.csv.gzip',
'0401_UkraineCombinedTweetsDeduped.csv.gzip']
temp_list = []
for file in apr_files:
    print(f"Reading in {file}")
    # unzip and read in the csv file as a dataframe
    temp = pd.read_csv(file, compression="gzip", header=0, index_col=0)
    # append dataframe to temp list
    temp_list.append(temp)
Error:
Reading in 0428_UkraineCombinedTweetsDeduped.csv.gzip
Reading in 0430_UkraineCombinedTweetsDeduped.csv.gzip
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (15) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Reading in 0401_UkraineCombinedTweetsDeduped.csv.gzip
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-26-5cba3ca01b1e> in <module>()
3 print(f"Reading in {file}")
4 # unzip and read in the csv file as a dataframe
----> 5 tmp_df = pd.read_csv(file, compression="gzip", header=0, index_col=0)
6 # append dataframe to temp list
7 tmp_df_list.append(tmp_df)
8 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 8048: invalid start byte
I assumed this error might be because the tweets contain special characters (emoji, non-English characters, etc.).

I just switched to Jupyter Notebook, and it worked fine there.
As of now, I still don't know what the issue with Google Colab was, though.
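If anyone hits the same decode error, a minimal sketch of a workaround is to be explicit about the encoding and replace undecodable bytes instead of failing; the encoding choice and replacement policy below are assumptions, not something confirmed by the original dataset.
import pandas as pd

temp_list = []
for file in apr_files:
    print(f"Reading in {file}")
    # Read the gzipped CSV, replacing any bytes that are not valid UTF-8
    # instead of raising UnicodeDecodeError (encoding_errors requires pandas >= 1.3).
    temp = pd.read_csv(file, compression="gzip", header=0, index_col=0,
                       encoding="utf-8", encoding_errors="replace")
    temp_list.append(temp)

# Combine the per-day frames into a single dataframe.
tweets = pd.concat(temp_list, ignore_index=True)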

Related

How to solve delimiter conflicts

I have a large .TXT file which is delimited by ";". Unfortunately some of my values contain ";" as well, which in those cases is not a delimiter but is still recognized as one by pandas. Because of this I have difficulties reading the .txt files into pandas, since some lines end up with more columns than others.
Background: I am trying to combine several .txt files into 1 dataframe and get the following error: ParserError: Error tokenizing data. C error: Expected 21 fields in line 443, saw 22.
So when checking line 443 I saw that that line indeed had 1 more instance of ";" because it was part of one of the values.
Reproduction:
Text file 1:
1;2;3;4
23123213;23123213;23123213;23123213
123;123;123;123
123;123;123;123
1;1;1;1
123;123;123;123
12;12;12;12
3;3;3;3
Text file 2:
1;2;3;4
23123213;23123213;23123213;23123213
123;123;123;123
123;123;12;3;123
1;1;1;1
123;123;123;123
12;12;12;12
3;3;3;3
Code:
import pandas as pd
import glob
import os

path = r'C:\Users\file'
all_files = glob.glob(path + "/*.txt")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, delimiter=';')
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
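No answer is recorded here, but a minimal sketch of one common mitigation is shown below: tell pandas how to treat rows with too many fields via on_bad_lines (pandas >= 1.3). Note that warning about or skipping malformed rows drops their data, so whether this is acceptable depends on the files.
import glob
import pandas as pd

all_files = glob.glob(r'C:\Users\file' + "/*.txt")
li = []
for filename in all_files:
    # 'warn' reports each malformed line and skips it; use 'skip' to drop silently.
    df = pd.read_csv(filename, index_col=None, header=0, delimiter=';',
                     on_bad_lines='warn')
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)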

using tfrecord but getting file too large

I am trying to create a TFRecord file from a folder of numpy arrays; the folder contains about 2000 numpy files of 50 MB each.
def convert(image_paths, out_path):
    # Args:
    #   image_paths  List of file-paths for the images.
    #   out_path     File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths.
        for i, path in enumerate(image_paths):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images - 1)
            # Load the image array from the .npy file.
            img = np.load(path)
            # Convert the image to raw bytes.
            img_bytes = img.tostring()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes)
            }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
I think it converts about 200 files, and then I get this:
Converting: tf.recordtrain
- Progress: 3.6%Traceback (most recent call last):
File "tf_record.py", line 71, in <module>
out_path=path_tfrecords_train)
File "tf_record.py", line 54, in convert
writer.write(serialized)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/tf_record.py", line 236, in write
self._writer.WriteRecord(record, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: tf.recordtrain; File too large
Any suggestions to fix this would be helpful. Thanks in advance.
I'm not sure what the size limits on TFRecord files are, but the more common approach, assuming you have enough disk space, is to spread your dataset over several TFRecord files, e.g. store every 20 numpy files in a different TFRecord file, as sketched below.
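A minimal sketch of that sharding idea, reusing the convert() function from the question; the shard size of 20 files and the output file naming are assumptions.
import os

def convert_sharded(image_paths, out_dir, files_per_shard=20):
    # Split the list of .npy paths into groups and write each group
    # to its own TFRecord file via the convert() function above.
    for shard, start in enumerate(range(0, len(image_paths), files_per_shard)):
        shard_paths = image_paths[start:start + files_per_shard]
        out_path = os.path.join(out_dir, "train-{:03d}.tfrecord".format(shard))
        convert(image_paths=shard_paths, out_path=out_path)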

Convert Pandas DataFrame to bytes-like object

Hi, I am trying to convert my dataframe to bytes and store it in a variable.
my_df:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
my code:
import io
towrite = io.BytesIO()
df.to_excel(towrite) # write to BytesIO buffer
towrite.seek(0) # reset pointer
I am getting AttributeError: '_io.BytesIO' object has no attribute 'write_cells'
Full Traceback:
AttributeError Traceback (most recent call last)
<ipython-input-25-be6ee9d9ede6> in <module>()
1 towrite = io.BytesIO()
----> 2 df.to_excel(towrite) # write to BytesIO buffer
3 towrite.seek(0) # reset pointer
4 encoded = base64.b64encode(towrite.read()) #
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
1422 formatter.write(excel_writer, sheet_name=sheet_name, startrow=startrow,
1423 startcol=startcol, freeze_panes=freeze_panes,
-> 1424 engine=engine)
1425
1426 def to_stata(self, fname, convert_dates=None, write_index=True,
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\excel.py in write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine)
624
625 formatted_cells = self.get_formatted_cells()
--> 626 writer.write_cells(formatted_cells, sheet_name,
627 startrow=startrow, startcol=startcol,
628 freeze_panes=freeze_panes)
AttributeError: '_io.BytesIO' object has no attribute 'write_cells'
I solved the issue by upgrading pandas to a newer version.
import io
towrite = io.BytesIO()
df.to_excel(towrite) # write to BytesIO buffer
towrite.seek(0)
print(towrite)
b''
print(type(towrite))
_io.BytesIO
If you want to see the bytes-like object, use getvalue():
print(towrite.getvalue())
b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00!\x00<\xb
Pickle
Pickle is a reproducible format for a Pandas dataframe, but it's only for internal use among trusted users; it's not for sharing with untrusted users, for security reasons.
import pickle
# Export:
my_bytes = pickle.dumps(df, protocol=4)
# Import:
df_restored = pickle.loads(my_bytes)
This was tested with Pandas 1.1.2. Unfortunately it failed for a very large dataframe, but what worked then was pickling and parallel-compressing each column individually, followed by pickling this list. Alternatively, you can pickle chunks of the large dataframe, as sketched below.
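A rough sketch of that chunking idea; the chunk size of 100,000 rows is an assumption and should be tuned to the dataframe.
import pickle
import pandas as pd

chunk_size = 100_000
# Pickle the dataframe in row chunks rather than as one huge object.
chunk_bytes = [pickle.dumps(df.iloc[i:i + chunk_size], protocol=4)
               for i in range(0, len(df), chunk_size)]
# Restore by unpickling each chunk and concatenating.
df_restored = pd.concat(pickle.loads(b) for b in chunk_bytes)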
CSV
If you must use a CSV representation:
df.to_csv(index=False).encode()
Note that various datatypes are lost when using CSV.
Parquet
See this answer. Note that various datatypes are converted when using parquet.
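For completeness, a minimal sketch of getting Parquet bytes without an intermediate file (requires pyarrow or fastparquet; on recent pandas versions to_parquet returns bytes when no path is given):
import io
import pandas as pd

parquet_bytes = df.to_parquet(path=None)            # bytes of the Parquet file
df_restored = pd.read_parquet(io.BytesIO(parquet_bytes))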
Excel
Avoid it for the most part because it limits the maximum number of rows and columns.
I needed to upload the file object to S3 via boto3, which didn't accept the raw bytes object, so building on the answer from Asclepius I wrapped it in a BytesIO, e.g.:
from io import BytesIO
data = BytesIO(df.to_csv(index=False).encode('utf-8'))
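For illustration, a hedged sketch of the upload itself; the bucket and key names are placeholders, not from the original post.
import boto3
from io import BytesIO

data = BytesIO(df.to_csv(index=False).encode('utf-8'))
s3 = boto3.client('s3')
# upload_fileobj streams the file-like object to the given bucket/key.
s3.upload_fileobj(data, 'my-bucket', 'exports/df.csv')  # placeholder bucket/key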

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing an .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da

data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2, 2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)

data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply use the lines of code that come after the loading of the "info" file, assuming you have the right values. Some of these values, i.e., the dtype and the shape of each array for making chunks, could presumably be obtained by looking at the first of the data files:
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
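Alternatively, a hedged sketch of the other route: write the 'info' file by hand for the example stack above and then call from_npy_stack as usual. The chunks and dtype below mirror the three (2, 2) float64 arrays written in the example, and axis=0 is an assumption about how the files should be joined.
import os
import pickle
import numpy as np
import dask.array as da

dirname = '/home/tom/data'
# Three files, each holding one (2, 2) chunk along axis 0 -> combined shape (6, 2).
info = {'chunks': ((2, 2, 2), (2,)), 'dtype': np.dtype('float64'), 'axis': 0}
with open(os.path.join(dirname, 'info'), 'wb') as f:
    pickle.dump(info, f)

data = da.from_npy_stack(dirname)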

Pandas Group Example Errors

I am trying to replicate an example from Wes McKinney's book on Pandas; the code is below (it assumes all the names data files are under a names folder):
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
names

def get_tops(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
grouped.apply(get_tops)
I am using Pandas 0.10 and Python 2.7. The error I am seeing is this:
Traceback (most recent call last):
File "names.py", line 21, in <module>
grouped.apply(get_tops)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 321, in apply
return self._python_apply_general(f)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 324, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, self.obj, self.axis)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 585, in apply
values, mutated = splitter.fast_apply(f, group_keys)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/groupby.py", line 2127, in fast_apply
results, mutated = lib.apply_frame_axis0(sdata, f, names, starts, ends)
File "reduce.pyx", line 421, in pandas.lib.apply_frame_axis0 (pandas/lib.c:24934)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2028, in __setattr__
self[name] = value
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2043, in __setitem__
self._set_item(key, value)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2078, in _set_item
value = self._sanitize_column(key, value)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.10.0-py2.7-linux-i686.egg/pandas/core/frame.py", line 2112, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
Any ideas?
I think this was a bug introduced in 0.10, namely issue #2605,
"AssertionError when using apply after GroupBy". It's since been fixed.
You can either wait for the 0.10.1 release, which shouldn't be too long from now, or upgrade to the development version (either via git or simply by downloading the zip of master).
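For anyone reading this later, a hedged sketch of the same top-1000 grouping on a modern pandas release, where sort_index(by=...) has been removed in favour of sort_values:
def get_tops(group):
    # sort_values replaces the old sort_index(by=...) call.
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top_by_group = grouped.apply(get_tops)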