Uploading data to google colab comes as dictionary, unlike uploading to notebook. Why is this? - google-colaboratory

How can I change the dictionary to dataframe in colab?
I added two pictures. One from colab and the other from notebook.
https://i.stack.imgur.com/o9yMf.png
https://i.stack.imgur.com/DcY8T.png
Thanks!

In your notebook, you read the data with the pandas library:
data = pd.read_csv('data.csv')
That is why it was loaded as a DataFrame. The files.upload() function, on the other hand, uploads your files and returns them as a dictionary, which you still need to read into a DataFrame. To do that, simply read the data again once it has been uploaded:
data = pd.read_csv('DailyDelhiClimateTest.csv.csv')
Best of luck :)
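For reference, files.upload() returns a dictionary that maps each uploaded file name to its contents as bytes, so the DataFrame can also be built directly from that dictionary. This is a minimal sketch, assuming pandas is available. The `uploaded` dictionary below is a hand-built stand-in for the return value of files.upload() (which only exists inside Colab), and `uploaded_to_dataframes` is a hypothetical helper name:

```python
import io

import pandas as pd


def uploaded_to_dataframes(uploaded):
    """Turn a {file name: CSV bytes} dictionary into {file name: DataFrame}."""
    return {
        name: pd.read_csv(io.BytesIO(content))
        for name, content in uploaded.items()
    }


# Stand-in for `uploaded = files.upload()` in Colab (assumed shape: name -> bytes).
uploaded = {"example.csv": b"date,meantemp\n2017-01-01,15.9\n2017-01-02,18.5\n"}

frames = uploaded_to_dataframes(uploaded)
data = frames["example.csv"]
print(type(data).__name__, data.shape)
```

In Colab, the uploaded file is also saved to the current working directory, which is why reading it by name with pd.read_csv works as well.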

Related

Trouble reading Blob Storage File into Azure ML Notebook

I have an Excel file uploaded to my ML workspace.
I can access the file as an azure FileDataset object. However, I don't know how to get it into a pandas DataFrame since 'FileDataset' object has no attribute 'to_dataframe'.
Azure ML notebooks seem to make a point of avoiding pandas for some reason.
Does anyone know how to get blob files into pandas dataframes from within Azure ML notebooks?
To explore and manipulate a dataset, it must first be downloaded from the blob source to a local file, which can then be loaded in a pandas DataFrame.
Here are the steps to follow for this procedure:
Download the data from Azure blob storage with the following Python code sample, which uses the Blob service client. Replace the placeholder values in the code with your own:
import time
import pandas as pd
from azure.storage.blob import BlobServiceClient
# Placeholders: replace each one with your specific value
STORAGEACCOUNTURL = "<storage_account_url>"
STORAGEACCOUNTKEY = "<storage_account_key>"
LOCALFILENAME = "<local_file_name>"
CONTAINERNAME = "<container_name>"
BLOBNAME = "<blob_name>"
# Download the blob to a local file and time the transfer
t1 = time.time()
blob_service_client_instance = BlobServiceClient(
    account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(
    CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)
t2 = time.time()
print("It takes %s seconds to download %s" % (t2 - t1, BLOBNAME))
Read the data into a pandas DataFrame from the downloaded file.
# LOCALFILENAME is the file path
# For an Excel file, use pd.read_excel(LOCALFILENAME) instead
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
For more details you can follow this link
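If the local copy is not needed, the downloaded bytes can be parsed in memory instead. This is a minimal sketch, assuming the blob is a CSV or Excel file. `blob_bytes_to_dataframe` is a hypothetical helper, and the sample bytes stand in for what `blob_client_instance.download_blob().readall()` would return. Reading Excel files additionally assumes an Excel engine such as openpyxl is installed:

```python
import io

import pandas as pd


def blob_bytes_to_dataframe(raw, blob_name):
    """Parse raw blob bytes as Excel or CSV, chosen by the blob's extension."""
    buffer = io.BytesIO(raw)
    if blob_name.lower().endswith((".xlsx", ".xls")):
        return pd.read_excel(buffer)
    return pd.read_csv(buffer)


# Stand-in for: raw = blob_client_instance.download_blob().readall()
raw = b"id,value\n1,10\n2,20\n"
df = blob_bytes_to_dataframe(raw, "sample.csv")
print(df.shape)
```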

How can I open a large parquet file with Keras?

I've tried looking for this and haven't had any meaningful results.
I have a keras model that has multiple inputs, and my data was getting too large for my pandas approach, so I preprocessed it and saved it as a parquet file. I'm not sure how to open it with keras.
I looked up tf.datasets but I still cannot figure out how to read a parquet file that I can pass to my model.
Does anyone know how to open parquet files? I can't seem to figure out how to do this in tensorflow and can't find anything related to it in keras.
You can probably keep your pandas approach, but you would have to break your data down into chunks.
If you already broke it down to create your parquet file, you should be able to use the same method to keep only a subset of your data open in pandas at a time.
If you need to extract the data from your parquet file, here's a link on how to create chunks of data for a pandas DataFrame:
How to read a CSV file subset by subset with Pandas?
Once you have a chunk of data, you can call model.fit on it, then move on to the next chunk and call model.fit again.
You can look into TensorFlow I/O, a collection of file systems and file formats that are not available in TensorFlow's built-in support. It provides functionality such as tfio.IODataset.from_parquet and tfio.IOTensor.from_parquet for working with the parquet file format.
!pip install tensorflow_io -U -q
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_io as tfio
df = pd.DataFrame({"data": tf.random.normal([20], 0, 1, tf.float32),
                   "label": np.random.randint(2, size=(20))})
df.to_parquet("df.parquet")
pd.read_parquet('/content/df.parquet')[:2]
data label
0 0.721347 1
1 -1.215225 1
ds = tfio.IODataset.from_parquet('/content/df.parquet')
ds
FYI, I think you should also consider using the feather format rather than the parquet file format. AFAIK, parquet files can be really heavy to load and can slow down your training pipelines, whereas feather is comparatively very fast.

Is the default data structure of sklearn datasets pandas or NumPy?

I'm working through an exercise in https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ and am finding unexpected behavior on my computer when I fetch a dataset. The following code returns
numpy.ndarray
on the author's Google Colab page, but returns
pandas.core.frame.DataFrame
on my local Jupyter notebook. As far as I know, my environment is using the exact same versions of libraries as the author. I can easily convert the data to a NumPy array, but since I'm using this book as a guide for novices, I'd like to know what could be causing this discrepancy.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()
type(mnist['data'])
The author's Google Colab notebook is at the following link; scroll down to the "MNIST" heading. Thanks!
https://colab.research.google.com/github/ageron/handson-ml2/blob/master/03_classification.ipynb#scrollTo=LjZxzwOs2Q2P.
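Just to close off this question, the comment by Ben Reiniger, namely to add as_frame=False, is correct. Since scikit-learn 0.24, fetch_openml defaults to as_frame='auto', which returns a pandas DataFrame for datasets that are not stored in a sparse format, such as MNIST. Passing as_frame=False restores the NumPy array. For example: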
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
The OP has already made this change to the Colab code in the link.

How can I read and manipulate large csv files in Google Colaboratory while not using all the RAM?

I am trying to import and manipulate compressed .csv files (that are each about 500MB in compressed form) in Google Colaboratory. There are 7 files. Using pandas.read_csv(), I "use all the available RAM" just after 2 files are imported and I have to restart my runtime.
I have searched forever on here looking for answers and have tried all the ones I came across, but none work. I have the files in my Google Drive and have mounted it.
How can I read all of the files and manipulate them without using all the RAM? I have 12.72GB of RAM and 358.27GB of Disk.
Buying more RAM isn't an option.
To solve my problem, I created 7 cells (one for each data file). Within each cell I read the file, manipulated it, saved what I needed, then deleted everything:
import pandas as pd
import gc
df = pd.read_csv('Google drive path', compression = 'gzip')
filtered_df = df.query('my query condition here')
filtered_df.to_csv('new Google drive path', compression = 'gzip')
del df
del filtered_df
gc.collect()
After all 7 files, each about 500MB, for a total row-by-column size of 7,000,000 by 100, my RAM has stayed under 1MB.
Just using del didn't free up enough RAM. I had to call gc.collect() afterwards in each cell.
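An alternative that avoids loading a whole file at once is to let pandas read it in chunks and filter each chunk before writing it out. This is a minimal sketch, not the method used above. `filter_csv_in_chunks`, the demo file, its `value` column, and the query condition are all hypothetical:

```python
import gzip
import os
import tempfile

import pandas as pd


def filter_csv_in_chunks(src_path, dst_path, condition, chunksize=100_000):
    """Filter a gzip-compressed CSV chunk by chunk; return the number of rows kept."""
    rows_kept = 0
    with gzip.open(dst_path, "wt", newline="") as out:
        reader = pd.read_csv(src_path, compression="gzip", chunksize=chunksize)
        for i, chunk in enumerate(reader):
            filtered = chunk.query(condition)
            # Write the header only once, with the first chunk.
            filtered.to_csv(out, header=(i == 0), index=False)
            rows_kept += len(filtered)
    return rows_kept


# Small demo file standing in for one of the large compressed CSVs.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "input.csv.gz")
dst = os.path.join(work_dir, "filtered.csv.gz")
pd.DataFrame({"value": range(10)}).to_csv(src, index=False, compression="gzip")

kept = filter_csv_in_chunks(src, dst, "value >= 5", chunksize=3)
result = pd.read_csv(dst, compression="gzip")
print(kept, len(result))
```

With this pattern, only chunksize rows are in memory at any time, so the chunk size can be tuned to the RAM that is available.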

Use tf.TextLineReader to read to a np.array in TensorFlow

I need to read a file in my train module into a np.array (I want to use the array as label_keys in a DNNClassifier).
I tried tf.read_file and tf.TextLineReader() but I can't get them to just output the rows to a np.array.
Is it possible?
(Why not just read the file with open? I'm training in GCS and want to get the file from storage :)
To access a file from GCS using TensorFlow, you can use the Python tf.gfile.GFile API, which acts like a regular Python file object, but allows you to use TensorFlow's filesystem connectors:
with tf.gfile.GFile("gs://...") as f:
    file_contents = f.read()
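To get from the file contents to a np.array, the lines can be split and wrapped in an array. This is a minimal sketch. `read_lines_to_array` and its `opener` parameter are hypothetical, and the demo uses the built-in open on a local file so that it runs without TensorFlow. In TensorFlow 2.x the same file API lives at tf.io.gfile.GFile, which could be passed as the opener for a gs:// path:

```python
import os
import tempfile

import numpy as np


def read_lines_to_array(path, opener=open):
    """Read a text file and return its non-empty lines as a NumPy array."""
    with opener(path, "r") as f:
        contents = f.read()
    return np.array([line.strip() for line in contents.splitlines() if line.strip()])


# Local demo file standing in for the file stored in GCS.
demo_path = os.path.join(tempfile.mkdtemp(), "labels.txt")
with open(demo_path, "w") as f:
    f.write("cat\ndog\n\nbird\n")

label_keys = read_lines_to_array(demo_path)
print(label_keys)

# Assumed usage on GCS with TensorFlow 2.x:
# label_keys = read_lines_to_array("gs://...", opener=tf.io.gfile.GFile)
```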