How to split a column in many different columns? - pandas

I have a dataset and in one of it columns I have many values that I want to convert to new columns:
"{'availabilities': {'bikes': 4, 'stands': 28, 'mechanicalBikes': 4, 'electricalBikes': 0, 'electricalInternalBatteryBikes': 0, 'electricalRemovableBatteryBikes': 0}, 'capacity': 32}"
I tried to use str.split() and received the error because of the patterns.
bikes_table_ready[['availabilities',
'bikes',
'stands',
'mechanicalBikes',
'electricalBikes',
'electricalInternalBatteryBikes',
'electricalRemovableBatteryBikes',
'capacity']]= bikes_table_ready.totalStands.str.extract('{.}', expand=True)
ValueError: pattern contains no capture groups
Which patterns should I use to have it done?

IIUC, use ast.literal_eval with pandas.json_normalize.
With a dataframe df with two columns (id) and the column to be splitted (col), it gives this :
import ast
​
df["col"] = df["col"].apply(lambda x: ast.literal_eval(x.strip('"')))
​
out = df.join(pd.json_normalize(df.pop("col").str["availabilities"]))
# Output :
print(out.to_string())
id bikes stands mechanicalBikes electricalBikes electricalInternalBatteryBikes electricalRemovableBatteryBikes
0 id001 4 28 4 0 0 0

Welcome to Stack Overflow! Please provide a minimal reproducible example demonstrating the problem. To learn more about this community and how we can help you, please start with the tour and read How to Ask and its linked resources.
That being said, it seems that the data you are trying to use the method str.split() is not actually a string. Check this to find more about data types. It seems you are trying to retrieve the information from a Python List "[xxx] Or Dictionary "dicName{"Key":"value}". If that's the case, try checking this link which talks about how to use Python Lists or this which talks about dictionaries.

Related

how to use for loop to split dataframe (list in list) and assign new name

This is the data I am working with:
df = [[1,2,3], [2,3,4], [5,6,7]]
I want to make sure I get a new name for each df[1] and df[2] etc so that I can use it later in my codes.
I tried using the following code:
for i in range(len(df)):
df_"{}".format(i) = df[i]
Obviously I am getting error for trying to declare dataframe like that.
How best can I achieve this? of splitting list in list into seperate dataframes using for loop?
Note: I am newbie to python hence if I missed anything obvious please help me point that out.
Use:
dfps = [[1,2,3], [2,3,4], [5,6,7]]
dic_of_dfs = {}
for i, row in enumerate(dfps):
dic_of_dfs.update({f"df_{i}":pd.DataFrame(row)})
dic_of_dfs["df_0"]
output:

Not able to drop multiple columns from a .csv file in Pandas

I'm reading a csv file that has 7 columns
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['Wheel','Date','1ex','2ex','3ex','4ex','5ex'])
The problem is that the model I want to train with it, is complaining about the first 2 columns being Strings, so I want to drop them.
I first tried not to read the from the beginning with :
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['1ex','2ex','3ex','4ex','5ex'])
but it only shifted the values of two columns..so I decided to drop them.
The problem is that I'm only able to drop the first column 'Date' with
train_df.drop(columns=['Date'], inplace=True)
, train_df is a portion of df uses for testing. How do I go to also drop 'Wheel' column?
I tried
train_df.drop(labels=[["Date","Wheel"]], inplace=True)
but i get KeyError: "[('Date', 'Wheel')] not found in axis"
so I tried
train_df.drop(columns=[["Date","Wheel"]], index=1, inplace=True)
but I still get the same error.
I'm so new to Python I'm out of resources to solve this.
As always many thanks.
Try:
train_df.drop(columns=["Date","Wheel"], index=1, inplace=True)
See the examples in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

Python KeyError when using pandas

I'm following a tutorial on NLP but have encountered a key error error when trying to group my raw data into good and bad reviews. Here is the tutorial link: https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e
#reviews.csv
I am so angry about the service
Nothing was wrong, all good
The bedroom was dirty
The food was great
#nlp.py
import pandas as pd
#read data
reviews_df = pd.read_csv("reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] +
reviews_df["Positive_Review"]
reviews_df.columns
I'm seeing the following error:
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Negative_Review'
Why is this happening?
You're getting this error because you did not understand how to structure your data.
When you do df['reviews']=df['Positive_reviews']+df['Negative_reviews'] you're actually summing the values of Positive reviews to Negative reviews(which does not exist currently) into the 'reviews' column (chich also does not exist).
Your csv is nothing more than a plaintext file with one text in each row. Also, since you're working with text, remember to enclose every string in quotation marks("), otherwise your commas will create fakecolumns.
With your approach, it seems that you'll still tag all your reviews manually (usually, if you're working with machine learning, you'll do this outside code and load it to your machine learning file).
In order for your code to work, you want to do the following:
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])
## Fill with placeholder values
df['Positive_review']=0
df['Negative_review']=1
df.head()
Result:
text Positive_review Negative_review
0 I am so angry about the service 0 1
1 Nothing was wrong, all good 0 1
2 The bedroom was dirty 0 1
3 The food was great 0 1
However, I would recommend you to have a single column (is_review_positive) and have it to true or false. You can easily encode it later on.

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals # neither print(vals) doesn't work
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
print(vals)
Relevant documentation here.
I think you need option_context and set to some large number, e.g. 999. Advatage of solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
#temporaly display 999 rows
with pd.option_context('display.max_rows', 999):
print (df.col1.value_counts())

Pandas HDF5 Select with Where on non natural-named columns

in my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-natural named columns (nb: because of a good reason, with negative numbers being "system" ids etc), which normally doesn't give an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
however, my select statement does fall over it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my negative value from "a-1" to a "a_1" but that means reloading all of the data in my system. Which is rather much! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
a-6
0 1
1 2
2 3
3 NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
There is a very way, but would to build this into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master)
def select(self):
"""
generate the selection
"""
if self.condition is not None:
return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
elif self.coordinates is not None:
return self.table.table.readCoordinates(self.coordinates)
return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
condvars={ 'x' : getattr(self.table.table.cols,'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by subsituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.