Not able to extract any column from a pandas dataframe [duplicate]

I have successfully read a CSV file using pandas. When I try to print a particular column from the DataFrame, I get a KeyError. I am sharing the code and the error below.
import pandas as pd
reviews_new = pd.read_csv("D:\\aviva.csv")
reviews_new['review']
reviews_new['review']
Traceback (most recent call last):
File "<ipython-input-43-ed485b439a1c>", line 1, in <module>
reviews_new['review']
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'review'
Can someone help me with this?

I think the first step is to investigate what the real column names are; converting them to a list makes stray whitespace or similar issues easier to spot:
print (reviews_new.columns.tolist())
I think there can be two problems:
1. whitespace in the column names (and maybe in the data too)
One solution is to strip the whitespace from the column names:
reviews_new.columns = reviews_new.columns.str.strip()
Or add the parameter skipinitialspace to read_csv:
reviews_new = pd.read_csv("D:\\aviva.csv", skipinitialspace=True)
2. a separator different from the default ,
The solution is to add the parameter sep:
#sep is ;
reviews_new = pd.read_csv("D:\\aviva.csv", sep=';')
#sep is whitespace
reviews_new = pd.read_csv("D:\\aviva.csv", sep='\s+')
reviews_new = pd.read_csv("D:\\aviva.csv", delim_whitespace=True)
EDIT:
Your column names contain whitespace, so you need solution 1:
print (reviews_new.columns.tolist())
['Name', ' Date', ' review']
Note the leading spaces in ' Date' and ' review'.

import pandas as pd
df=pd.read_csv("file.txt", skipinitialspace=True)
df.head()
df['review']

import hashlib

# Note: Python's built-in hash() is randomized per process for strings,
# so the first variant is not stable across runs; the md5 variant is.
dfObj['Hash Key'] = (dfObj['DEAL_ID'].map(str) + dfObj['COST_CODE'].map(str) + dfObj['TRADE_ID'].map(str)).apply(hash)
# for index, row in dfObj.iterrows():
#     dfObj.loc[index, 'Hash Key'] = hashlib.md5(str(row[['COST_CODE', 'TRADE_ID']].values).encode()).hexdigest()
print(dfObj['Hash Key'])

Related

Simple Pandas DataFrame read_csv then GroupBy with Count / KeyError

I'm just trying to get a count of rows for the values in a given column, for example:
CSV Data:
'Occupation','data'
'Carpenter','data1'
'Carpenter','data2'
'Carpenter','data3'
'Painter','data1'
'Painter','data2'
'Programmer','data1'
'Programmer','data2'
'Programmer','data3'
'Programmer','data4'
Program:
filename = "./data/TestGroup.csv"
df = pd.read_csv(filename)
print(df.head())
print("Computing stats by HandRank... ")
df_stats = df[['data']].groupby(['Occupation']).agg(['count'])
# also tried: df_stats = df[['Occupation']].groupby(['Occupation']).agg(['count'])
print(df_stats.head())
How can I get the count into a variable? Do .groupby and .agg return another DataFrame?
Output/Error:
'Occupation' 'data'
0 'Carpenter' 'data1'
1 'Carpenter' 'data2'
2 'Carpenter' 'data3'
3 'Painter' 'data1'
4 'Painter' 'data2'
Computing stats by HandRank...
Traceback (most recent call last):
File "C:\Apps\PokerHandGenerator_Copy_not_Source\Server\TestPandasGroupBy.py", line 17, in <module>
df_stats = df.groupby(['Occupation']).agg(['count'])
File "C:\Apps\ProcessData\venv\lib\site-packages\pandas\core\frame.py", line 6714, in groupby
return DataFrameGroupBy(
File "C:\Apps\ProcessData\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 560, in __init__
grouper, exclusions, obj = get_grouper(
File "C:\Apps\ProcessData\venv\lib\site-packages\pandas\core\groupby\grouper.py", line 811, in get_grouper
raise KeyError(gpr)
KeyError: 'Occupation'
The df.head() shows it is using "Occupation" as my column name.
Pandas sees the first column name as 'Occupation' (quotes included), not Occupation.
Use this:
df_stats = df.groupby("'Occupation'").agg(['count'])
instead of this:
df_stats = df[['data']].groupby(['Occupation']).agg(['count'])
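As a hedged alternative (my addition, not part of the original answer): you can tell read_csv that single quotes are the quoting character, so they are stripped from both the header and the values, and then group on the clean name. This also shows one way to get a count into a variable, since the grouped count returns a pandas object you can index.
import pandas as pd

# quotechar="'" strips the single quotes from headers and values alike
df = pd.read_csv("./data/TestGroup.csv", quotechar="'")
counts = df.groupby('Occupation')['data'].count()  # a Series: one count per occupation
carpenter_count = counts['Carpenter']  # -> 3
print(counts)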

How to convert Pandas dataframe to PyArrow table with a union type in the schema?

I have a Pandas dataframe with a column that contains a list of dict/structs. One of the keys (thing in the example below) can have a value that is either an int or a string. Is there a way to define a PyArrow type that will allow this dataframe to be converted into a PyArrow table, for eventual output to a Parquet file?
I tried using pa.union for this, but I seem to be doing something not supported/implemented.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={"id": [1, 2], "dict": [{"thing": 1}, {"thing": "two"}]})
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("dict", pa.struct([
        ("thing", pa.union([
            pa.field("int64", pa.int64()),
            pa.field("string", pa.string()),
        ], "sparse"))
    ]))
])
t = pa.Table.from_pandas(df, schema=schema)
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in dataframe_to_arrays
arrays = [convert_column(c, f)
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in <listcomp>
arrays = [convert_column(c, f)
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 574, in convert_column
raise e
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 568, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('sparse_union', 'Conversion failed for column dict with type object')
The help text for pa.union doesn't give an example of how to use it.
>>> help(pa.union)
Help on built-in function union in module pyarrow.lib:

union(...)
    union(children_fields, mode, type_codes=None)

    Create UnionType from children fields.

    A union is defined by an ordered sequence of types; each slot in the union
    can have a value chosen from these types.

    Parameters
    ----------
    fields : sequence of Field values
        Each field must have a UTF8-encoded name, and these field names are
        part of the type metadata.
    mode : str
        Either 'dense' or 'sparse'.
    type_codes : list of integers, default None

    Returns
    -------
    type : DataType
It looks like it's not implemented yet in pyarrow 2.0.0:
import pandas as pd
import pyarrow as pa

union = pa.union([
    pa.field("int64", pa.int64()),
    pa.field("string", pa.string()),
], 'sparse')
pa.array([1, 'two'], union)
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
<ipython-input-72-f7ec6792b124> in <module>
10 ], 'sparse')
11
---> 12 pa.array([1, 'two'], union)
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowNotImplementedError: sparse_union
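Until union conversion is implemented, a possible workaround (my own sketch, not from the answers above, and it loses the int-versus-string distinction) is to normalize the mixed-type field to strings so the struct has a single supported type:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={"id": [1, 2], "dict": [{"thing": 1}, {"thing": "two"}]})
# Cast the mixed-type value to str so every "thing" has one type.
df["dict"] = df["dict"].apply(lambda d: {**d, "thing": str(d["thing"])})
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("dict", pa.struct([("thing", pa.string())])),
])
t = pa.Table.from_pandas(df, schema=schema)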
PyArrow has a built-in method, .from_pandas():
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'int': [1, 2],
    'str': ['a', 'b']
})
pa.Table.from_pandas(df)
# <pyarrow.lib.Table object at 0x7f05d1fb1b40>

TypeError: '<' not supported between instances of 'str' and 'int' Doc2Vec

Any ideas why this error is thrown?
"TypeError: '<' not supported between … 'str' and 'int'" when a doc-tag is not present for most_similar()
I have a list of .txt documents stored in my data folder and want to compare one doc to another through my Flask app on localhost.
Traceback (most recent call last):
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 2463, in __call__
return self.wsgi_app(environ, start_response)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 2449, in wsgi_app
response = self.handle_exception(e)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1866, in handle_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\ibrahimm\Desktop\doc2vec-compare-doc-demo\app.py", line 56, in api_compare_2
vec1 = d2v_model.docvecs.most_similar(data['doc1'])
File "C:\Users\ibrahimm\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 1715, in most_similar
elif doc in self.doctags or doc < self.count:
TypeError: '<' not supported between instances of 'str' and 'int'
app.py
@app.route('/api/compare_2', methods=['POST'])
def api_compare_2():
    data = request.get_json()
    if not 'doc1' in data or not 'doc2' in data:
        return 'ERROR'
    vec1 = d2v_model.docvecs.most_similar(data['doc1'])
    vec2 = d2v_model.docvecs.most_similar(data['doc2'])
    vec1 = gensim.matutils.full2sparse(vec1)
    vec2 = gensim.matutils.full2sparse(vec2)
    print (data)
    print (vec2)
    print (vec1)
    return jsonify(sim=gensim.matutils.cossim(vec1, vec2))

@app.route('/api/compare_all', methods=['POST'])
def api_compare_all():
    data = request.get_json()
    if not 'doc' in data:
        return 'ERROR'
    vec = d2v_model.docvecs.most_similar(data['doc'])
    res = d2v_model.docvecs.most_similar([vec], topn=5)
    return jsonify(list=res)
model.py
def load_model():
    try:
        return gensim.models.doc2vec.Doc2Vec.load("doc2vec.model2")
    except:
        print ('Model not found!')
        return None

def train_model():
    # path to the input corpus files
    data = "data"

    # tagging the text files
    class DocIterator(object):
        def __init__(self, doc_list, labels_list):
            self.labels_list = labels_list
            self.doc_list = doc_list

        def __iter__(self):
            for idx, doc in enumerate(self.doc_list):
                yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

    docLabels = [f for f in listdir(data) if f.endswith('.txt')]
    print(docLabels)
    data = []
    for doc in docLabels:
        data.append(open(r'C:\Users\ibrahimm\Desktop\doc2vec-compare-doc-demo\data\\' + doc,
                         encoding='cp437').read())
    tokenizer = RegexpTokenizer(r'\w+')
    stopword_set = set(stopwords.words('english'))

    # This function does all cleaning of data using the two objects above
    def nlp_clean(data):
        new_data = []
        for d in data:
            new_str = d.lower()
            dlist = tokenizer.tokenize(new_str)
            dlist = list(set(dlist).difference(stopword_set))
            new_data.append(dlist)
        return new_data

    data = nlp_clean(data)
    it = DocIterator(data, docLabels)

    # train the doc2vec model with a fixed learning rate
    model = gensim.models.Doc2Vec(size=300, window=15, min_count=4, workers=10, alpha=0.025, min_alpha=0.025, iter=20)
    model.build_vocab(it)
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.save("doc2vec.model2")
If you try to look up a string doc-tag that's not in the model, you unfortunately get this confusing error instead of a clearer one. (See gensim's open issue: https://github.com/RaRe-Technologies/gensim/issues/1737#issuecomment-346995119)
Whatever is in data['doc1'] isn't a tag in the model.
You may be able to pre-check, before attempting a most_similar() operation, by looking at whether data['doc1'] in model.docvecs is True.
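A minimal sketch of that pre-check (my wording of it, reusing d2v_model and data from the question's app.py):
# Guard against unknown doc-tags before calling most_similar(),
# which otherwise raises the confusing TypeError.
if data['doc1'] not in d2v_model.docvecs:
    return 'ERROR: unknown doc-tag {}'.format(data['doc1'])
vec1 = d2v_model.docvecs.most_similar(data['doc1'])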
TypeError: '<' not supported between instances of 'str' and 'int'
[35182] Failed to execute script docker-compose
This error was the result of copying and pasting code containing the wrong quotation marks (curly quotes); change them to straight quotes ''.
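For illustration (my own example, not from the answer): Python only accepts straight quotes as string delimiters, so curly quotes pasted from a rich-text source fail to parse.
s = 'ok'         # straight quotes parse fine
# s = ‘broken’   # curly quotes raise a SyntaxError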

Getting sum by grouping other column

I have a dataframe with the following columns:
Occupation, Genre, Rating
I have taken the sum of all ratings as totalRating. Now I want to create a new column wa_rating, which takes sum(Rating > 3)/totalRating for a particular Occupation, Genre combination. My dataframe is joinedDF3, so I am writing the query below:
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).withColumn(wa_rating, sum(Rating>3)/totalRating).collect()
but it shows this error:
AttributeError: 'GroupedData' object has no attribute 'withColumn'
So it is clear from the error that we cannot use withColumn with groupby.
My question is: how do I do this?
Below is my updated code.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField, StructType, IntegerType, StringType)
from pyspark.sql import Row
from pyspark.sql.functions import sum
import pyspark.sql.functions as F
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("Movielens Analysis").getOrCreate()

def refineMovieDF(row):
    genre = []
    movieData = row[0].split("|")
    for i in range(len(movieData) - 5):
        if int(movieData[i + 5]) == 1:
            genre.append((int(movieData[0]), i))
    return genre

ratingSchema = StructType(fields=[StructField("UserId", IntegerType(), True), StructField("MovieId", IntegerType(), True), StructField("Rating", IntegerType(), True), StructField("TimeStamp", IntegerType(), True)])
ratingsDF = spark.read.load("ml-100k/u.data", format="csv", sep="\t", inferSchema=True, header=False, schema=ratingSchema)

genreSchema = StructType(fields=[StructField("Genre", StringType(), True), StructField("GenreId", IntegerType(), True)])
genreDF = spark.read.load("ml-100k/u.genre", format="csv", sep="|", inferSchema=True, header=False, schema=genreSchema)

userSchema = StructType(fields=[StructField("UserId", IntegerType(), True), StructField("Age", IntegerType(), True), StructField("Gender", StringType(), True), StructField("Occupation", StringType(), True), StructField("ZipCode", IntegerType(), True)])
usersDF = spark.read.load("ml-100k/u.user", format="csv", sep="|", inferSchema=True, header=False, schema=userSchema)

movieSchema = StructType(fields=[StructField("MovieRow", StringType(), True)])
movieDF = spark.read.load("ml-100k/u.item", format="csv", inferSchema=True, header=False, schema=movieSchema)

movieRefinedRDD = movieDF.rdd.flatMap(refineMovieDF)
movieSchema = StructType(fields=[StructField("MovieId", IntegerType(), True), StructField("GenreId", IntegerType(), True)])
movieRefinedDf = spark.createDataFrame(movieRefinedRDD, movieSchema)

joinedDF1 = ratingsDF.join(usersDF, ratingsDF.UserId == usersDF.UserId).select(usersDF["Occupation"], ratingsDF["Rating"], ratingsDF["MovieId"])
joinedDF3 = joinedDF1.join(joinedDF2, joinedDF1.MovieId == joinedDF2.MovieId).select(joinedDF1["Occupation"], joinedDF1["Rating"], joinedDF2["Genre"])

totalRating = joinedDF3.groupBy().sum("Rating").collect()
resultDF = joinedDF3.groupby([joinedDF3["Occupation"], joinedDF3["Genre"]]).agg((sum(joinedDF3["Rating"] > 3) / totalRating).alias(wa_rating)).collect()
print(resultDF)
Now I am getting the error below.
2019-08-06 22:24:20 INFO BlockManagerInfo:54 - Removed broadcast_11_piece0 on 10.0.2.15:58903 in memory (size: 4.3 KB, free: 413.8 MB)
Traceback (most recent call last):
File "/home/cloudera/workspace/MovielensAnalysis.py", line 59, in <module>
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).agg((sum(joinedDF3["Rating"]>3)/totalRating).alias(wa_rating)).collect()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 116, in _
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o129.divide.: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[572536]]
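The last error message points at the cause: collect() returns a list of Row objects, so totalRating is a Python list, and Spark cannot divide a column by a list. Note also that alias() expects a string, so wa_rating should be quoted. A hedged sketch of a fix (my own, assuming the intent is the fraction of ratings above 3 per Occupation/Genre group):
from pyspark.sql import functions as F

# collect() returns [Row(sum(Rating)=...)]; pull out the scalar first.
totalRating = joinedDF3.groupBy().sum("Rating").collect()[0][0]

resultDF = (joinedDF3
    .groupBy("Occupation", "Genre")
    # Cast the boolean to int so sum() counts the ratings above 3.
    .agg((F.sum((joinedDF3["Rating"] > 3).cast("int")) / totalRating).alias("wa_rating"))
    .collect())
print(resultDF)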

Spacy phrasematcher does not get matcher name

I am new to PhraseMatcher and want to extract some keywords from my emails.
Everything works well except that I can't get the name of the added matcher.
This is my code below:
def main():
    patterns_months = 'phraseMatcher/months.txt'
    text_loc = 'phraseMatcher/text.txt'
    nlp = spacy.blank('en')
    nlp.vocab.lex_attr_getters = {}
    phrases_months = read_gazetter(patterns_months)
    txts = read_text(text_loc, n=n)
    months = [nlp(text) for text in phrases_months]
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add('MONTHS', None, *months)
    print(nlp.vocab.strings['MONTHS'])
    for txt in txts:
        doc = nlp(txt)
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start: end]
            label = nlp.vocab.strings[match_id]
            print(label, span.text, start, end)
The result:
12298211501233906429 <--- this is from print(nlp.vocab.strings['MONTHS'])
Traceback (most recent call last):
File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 71, in <module>
plac.call(main)
File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "D:\workspace\phraseMatcher\venv\lib\site-packages\plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "D:/workspace/phraseMatcher/venv/phraseMatcher.py", line 47, in main
label = nlp.vocab.strings[match_id]
File "strings.pyx", line 117, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744072093410045'."
spaCy version: 2.0.12
Platform: Windows-7-6.1.7601-SP1
Python version: 3.7.0
I can't find what I did wrong. It is simple, and I have already read this:
Using PhraseMatcher in SpaCy to find multiple match types
Help me, thanks in advance.