Pandas, importing JSON-like file using read_csv - pandas

I would like to import data from .txt to dataframe. I can not import it using classical pd.read_csv, while using different types of sep it throws me errors. Data I want to import Cell_Phones_&_Accessories.txt.gz is in a format.
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
....

You can use jen for separator, then split by first : and pivot:
df = pd.read_csv('Cell_Phones_&_Accessories.txt', sep='¥', names=['data'], engine='python')
df1 = df.pop('data').str.split(':', n=1, expand=True)
df1.columns = ['a','b']
df1 = df1.assign(c=(df1['a'] == 'product/productId').cumsum())
df1 = df1.pivot('c','a','b')
Python solution with defaultdict and DataFrame constructor for improve performance:
from collections import defaultdict
data = defaultdict(list)
with open("Cell_Phones_&_Accessories.txt") as f:
for line in f.readlines():
if len(line) > 1:
key, value = line.strip().split(':', 1)
data[key].append(value)
df = pd.DataFrame(data)

Related

How to split a column in many different columns?

I have a dataset and in one of it columns I have many values that I want to convert to new columns:
"{'availabilities': {'bikes': 4, 'stands': 28, 'mechanicalBikes': 4, 'electricalBikes': 0, 'electricalInternalBatteryBikes': 0, 'electricalRemovableBatteryBikes': 0}, 'capacity': 32}"
I tried to use str.split() and received the error because of the patterns.
bikes_table_ready[['availabilities',
'bikes',
'stands',
'mechanicalBikes',
'electricalBikes',
'electricalInternalBatteryBikes',
'electricalRemovableBatteryBikes',
'capacity']]= bikes_table_ready.totalStands.str.extract('{.}', expand=True)
ValueError: pattern contains no capture groups
Which patterns should I use to have it done?
IIUC, use ast.literal_eval with pandas.json_normalize.
With a dataframe df with two columns (id) and the column to be splitted (col), it gives this :
import ast
​
df["col"] = df["col"].apply(lambda x: ast.literal_eval(x.strip('"')))
​
out = df.join(pd.json_normalize(df.pop("col").str["availabilities"]))
# Output :
print(out.to_string())
id bikes stands mechanicalBikes electricalBikes electricalInternalBatteryBikes electricalRemovableBatteryBikes
0 id001 4 28 4 0 0 0
Welcome to Stack Overflow! Please provide a minimal reproducible example demonstrating the problem. To learn more about this community and how we can help you, please start with the tour and read How to Ask and its linked resources.
That being said, it seems that the data you are trying to use the method str.split() is not actually a string. Check this to find more about data types. It seems you are trying to retrieve the information from a Python List "[xxx] Or Dictionary "dicName{"Key":"value}". If that's the case, try checking this link which talks about how to use Python Lists or this which talks about dictionaries.

Quick Crosswalk For State Abbreviations & State Names

For the millionth time I had a dataset today that listed full state names. But, I needed it to list state postal code abbreviations. Here is a code snip I wrote that mapped the changes for me using data from a generic website.
1) Anyone know of or think of a better solution?
2a) Anyone know of a better web reference? Using USPS sites (such as the ones below) will not seem to work with pd.read_html()
2b) I also had a hard time isolating the correct table from pd.read_html() and the wiki page at: https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations
import pandas as pd
# Make Generic Data For Demonstration Purpose
data = {'StName':['Wisconsin','Minnesota','Minnesota',
'Wisconsin','Florida','New York']}
df = pd.DataFrame(data)
# Get State Crosswalk From Generic Website
crosswalk = 'http://app02.clerk.org/menu/ccis/Help/CCIS%20Codes/state_codes.html'
states = pd.read_html(crosswalk)[0]
# Demo Crosswalking State Name to State Abbreviation
df['StAbbr'] = df['StName'].map(dict(zip(states['Description'],
states['Code'])))
# Demo Reverse Crosswalking Back to State Name
df['StNameAgain'] = df['StName'].map(dict(zip(states['Code'],
states['Description'])))

How to pass part-of-speech in WordNetLemmatizer?

I am preprocessing text data. However, I am facing issue with lemmatizing.
Below is the sample text:
'An 18-year-old boy was referred to prosecutors Thursday for allegedly
stealing about ¥15 million ($134,300) worth of cryptocurrency last
year by hacking a digital currency storage website, police said.',
'The case is the first in Japan in which criminal charges have been
pursued against a hacker over cryptocurrency losses, the police
said.', '\n', 'The boy, from the city of Utsunomiya, Tochigi
Prefecture, whose name is being withheld because he is a minor,
allegedly stole the money after hacking Monappy, a website where users
can keep the virtual currency monacoin, between Aug. 14 and Sept. 1
last year.', 'He used software called Tor that makes it difficult to
identify who is accessing the system, but the police identified him by
analyzing communication records left on the website’s server.', 'The
police said the boy has admitted to the allegations, quoting him as
saying, “I felt like I’d found a trick no one knows and did it as if I
were playing a video game.”', 'He took advantage of a weakness in a
feature of the website that enables a user to transfer the currency to
another user, knowing that the system would malfunction if transfers
were repeated over a short period of time.', 'He repeatedly submitted
currency transfer requests to himself, overwhelming the system and
allowing him to register more money in his account.', 'About 7,700
users were affected and the operator will compensate them.', 'The boy
later put the stolen monacoins in an account set up by a different
cryptocurrency operator, received payouts in a different
cryptocurrency and bought items such as a smartphone, the police
said.', 'According to the operator of Monappy, the stolen monacoins
were kept using a system with an always-on internet connection, and
those kept offline were not stolen.'
My code is:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
df = pd.read_csv('All Articles.csv')
df['Articles'] = df['Articles'].str.lower()
stemming = PorterStemmer()
stops = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
def identify_tokens(row):
Articles = row['Articles']
tokens = nltk.word_tokenize(Articles)
token_words = [w for w in tokens if w.isalpha()]
return token_words
df['words'] = df.apply(identify_tokens, axis=1)
def stem_list(row):
my_list = row['words']
stemmed_list = [stemming.stem(word) for word in my_list]
return (stemmed_list)
df['stemmed_words'] = df.apply(stem_list, axis=1)
def lemma_list(row):
my_list = row['stemmed_words']
lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
def remove_stops(row):
my_list = row['lemma_words']
meaningful_words = [w for w in my_list if not w in stops]
return (meaningful_words)
df['stem_meaningful'] = df.apply(remove_stops, axis=1)
def rejoin_words(row):
my_list = row['stem_meaningful']
joined_words = (" ".join(my_list))
return joined_words
df['processed'] = df.apply(rejoin_words, axis=1)
As it is clear from the code that I am using pandas. However here I have given sample text.
My problem area is :
def lemma_list(row):
my_list = row['stemmed_words']
lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
Though the code is running without any error lemma function is not working expectedly.
Thanks in Advance.
In your code above you are trying to lemmatize words that have been stemmed. When the lemmatizer runs into a word that it doesn't recognize, it'll simply return that word. For instance stemming offline produces offlin and when you run that through the lemmatizer it just gives back the same word, offlin.
Your code should be modified to lemmatize words, like this...
def lemma_list(row):
my_list = row['words'] # Note: line that is changed
lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
return (lemma_list)
df['lemma_words'] = df.apply(lemma_list, axis=1)
print('Words: ', df.ix[0,'words'])
print('Stems: ', df.ix[0,'stemmed_words'])
print('Lemmas: ', df.ix[0,'lemma_words'])
This produces...
Words: ['and', 'those', 'kept', 'offline', 'were', 'not', 'stolen']
Stems: ['and', 'those', 'kept', 'offlin', 'were', 'not', 'stolen']
Lemmas: ['and', 'those', 'keep', 'offline', 'be', 'not', 'steal']
Which is is correct.

Python KeyError when using pandas

I'm following a tutorial on NLP but have encountered a key error error when trying to group my raw data into good and bad reviews. Here is the tutorial link: https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e
#reviews.csv
I am so angry about the service
Nothing was wrong, all good
The bedroom was dirty
The food was great
#nlp.py
import pandas as pd
#read data
reviews_df = pd.read_csv("reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] +
reviews_df["Positive_Review"]
reviews_df.columns
I'm seeing the following error:
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Negative_Review'
Why is this happening?
You're getting this error because you did not understand how to structure your data.
When you do df['reviews']=df['Positive_reviews']+df['Negative_reviews'] you're actually summing the values of Positive reviews to Negative reviews(which does not exist currently) into the 'reviews' column (chich also does not exist).
Your csv is nothing more than a plaintext file with one text in each row. Also, since you're working with text, remember to enclose every string in quotation marks("), otherwise your commas will create fakecolumns.
With your approach, it seems that you'll still tag all your reviews manually (usually, if you're working with machine learning, you'll do this outside code and load it to your machine learning file).
In order for your code to work, you want to do the following:
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])
## Fill with placeholder values
df['Positive_review']=0
df['Negative_review']=1
df.head()
Result:
text Positive_review Negative_review
0 I am so angry about the service 0 1
1 Nothing was wrong, all good 0 1
2 The bedroom was dirty 0 1
3 The food was great 0 1
However, I would recommend you to have a single column (is_review_positive) and have it to true or false. You can easily encode it later on.

Is there a way to speed up this webscraping iteration? Pandas

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd
stock =['adma','aapl','fb'] # list has about 700 stocks which I extracted from a pickled dataframe that was storing the info.
#The site I'm visiting is below with the name of the stock added to the end of the end of the link
##http://finviz.com/quote.ashx?t=adma
##http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of that site, evident by [-2] in the code below
df2 = pd.DataFrame()
for i in stock:
df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header =0)[-2].set_index('SEC Form 4')
df['Stock'] = i.upper() # creating a column which has the name of the stock, so I can differentiate between stocks
df2 = df2.append(df)
It feels like I'm doing a few seconds per iteration and I have around 700 to go through at the moment. It's not terribly slow, but I was just curious if there is a more efficient method. Thanks.
Your current code is blocking, you don't proceed with retrieving the information from the next url until you are done with the current. Instead, you can switch to, for example, Scrapy which is based on twisted and working asynchronously processing multiple pages at the same time.