When I store and load a NumPy array I have no issues unless the array has only a single element. In that case I am still able to store and retrieve it, but the resulting type is not an array.
I was expecting to be able to retrieve the single element the same way I do with multiple-element arrays.
import numpy as np
# set up the lists
listW = ["The Dog","The Cat"]
list = ["The Pig"]
# convert lists to arrays
arrayW = np.array(listW)
array = np.array(list)
# Displaying the original array
print('ArrayW:', arrayW)
print('Array:', array)
print('ArrayW[0]:', arrayW[0])
print('Array[0]:', array[0])
# storage files
fileW = "C:\\Test\\testW"
file = "C:\\Test\\test"
# Saving the original array
np.savetxt(fileW, arrayW, fmt='%s', delimiter=',')
np.savetxt(file, array, fmt='%s', delimiter=',')
# Loading the array from the saved files
testW = np.loadtxt(fileW, dtype=object, delimiter=',')
test = np.loadtxt(file, dtype=object, delimiter=',')
# print out results
print("testW:", testW)
print("test:", test)
print("testW[0]:", testW[0])
print("test[0]:", test[0])
When you run this, you get the following output:
ArrayW: ['The Dog' 'The Cat']
Array: ['The Pig']
ArrayW[0]: The Dog
Array[0]: The Pig
testW: ['The Dog' 'The Cat']
test: The Pig
testW[0]: The Dog
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
The error occurs because the loaded value test is a 0-dimensional array (effectively a scalar), not a 1-dimensional array. This works if the stored array has more than one value.
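A workaround (my addition, not part of the original post) is to ask np.loadtxt for at least one dimension via its ndmin parameter, or to normalize the result with np.atleast_1d:
# ndmin=1 guarantees at least a 1-D result, so a single stored
# element still comes back as an array of length 1
test = np.loadtxt(file, dtype=object, delimiter=',', ndmin=1)
print('test[0]:', test[0])  # The Pig
# equivalently, normalize after loading
test = np.atleast_1d(np.loadtxt(file, dtype=object, delimiter=','))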
I have a pandas dataframe as follows.
thi 0.969378
text 0.969378
is 0.969378
anoth 0.699030
your 0.497120
first 0.497120
book 0.497120
third 0.445149
the 0.445149
for 0.445149
analysi 0.445149
I want to convert it to a list of [word, score] pairs as follows.
[["thi", 0.969378], ["text", 0.969378], ..., ["analysi", 0.445149]]
My code is as follows.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems
# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
print(top_words)
I tried the following two options.
list(zip(*map(top_words.get, top_words)))
I got the error as TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.9693779251346359] of <class 'float'>
list(top_words.itertuples(index=True))
I got the error as AttributeError: 'Series' object has no attribute 'itertuples'.
Please let me know a quick way of doing this in pandas.
I am happy to provide more details if needed.
Use zip on the index and the values, then map the tuples to lists:
a = list(map(list,zip(top_words.index,top_words)))
Or convert the index to a column, convert to a NumPy array and then to lists:
a = top_words.reset_index().to_numpy().tolist()
print(a)
[['thi', 0.9693780000000001], ['text', 0.9693780000000001],
['is', 0.9693780000000001], ['anoth', 0.69903],
['your', 0.49712], ['first', 0.49712], ['book', 0.49712],
['third', 0.44514899999999996], ['the', 0.44514899999999996],
['for', 0.44514899999999996], ['analysi', 0.44514899999999996]]
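As an aside (not part of the original answer): if actual tuples are acceptable, the Series' items view gives (index, value) pairs directly:
# each item is an (index, value) tuple taken from the Series
a = list(top_words.items())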
Given two files
We need to find all the numbers overlapping in the data from both files which are prime.
To check for prime numbers we need to write a function called check_prime and use it.
My code :
import math
def is_prime(num):
    if num == 1:
        return False
    if num == 2:
        return True
    for i in range(2, int(math.sqrt(num)) + 1):
        if num % i == 0:
            return False
    return True
one = []
theFile = open("One.txt", "r")
array = []
for val in theFile:
    array.append(val)
print(array)
theFile = open("Two.txt", "r")
array1 = []
for val in theFile:
    array1.append(val)
print(array1)
for i in array:
    one.append(i)
print(one)
You are almost there but here are the missing bits in your code:
1) Reading from the files
To avoid writing the same code twice to open both files, and to handle more than two files, we can loop through the file names instead of opening each one separately.
So instead of:
theFile = open("One.txt", "r")
#[...]
theFile = open("Two.txt", "r")
We could use:
file_names = ['One.txt', 'Two.txt']
for i in file_names:
    theFile = open(i, "r")
2) Extracting the numbers from the files
Then you extract the values from the text files. The list of numbers in each file gets imported as a list containing a single string with all the numbers in it.
So there are 2 things we need to do:
1) extract the string from the list
2) read each number in the string, which are separated by commas.
If you do:
for val in theFile:
    array.append(val)
You will only append one list containing one string to your array.
In your code, you create two lists, array and array1, but then only loop through array, which puts only the data from array into your one list, never using array1 at all. Nothing to worry about; I also get confused sometimes between array[1] and array1 when I name several lists ending in 1, 2, 3.
So instead we could do:
for val in theFile:
    array = array + val.split(",")
We use + because we want all the number-strings in one single list and not one list containing several lists (you can try replacing this with array.append(val.split(",")) and you'll see you get a list containing lists). What we want is all the number-strings from all files in one single list, so it's better to concatenate the elements into one single list.
Now that you have your array list that contains all string-numbers from your text files, you need to transform them into integers so you can run your excellent is_prime function.
So we create a second list that I've called array2 where we will store the string-numbers as integers and not as strings.
The final output that you want is a list of the unique prime numbers in both text files, so we check that the number is not already in array2 before appending it.
array2 = []
for nbrs in array:
    if int(nbrs) not in array2:
        array2.append(int(nbrs))
Almost there! You've already done the rest of the work from there on:
You need to pass all the unique numbers in array2 to your is_prime function to check whether they are prime or not.
We store the result of the is_prime function (True or False) into the list is_nbr_prime.
is_nbr_prime = []
for i in array2:
    is_nbr_prime.append(is_prime(i))
Now, because you want to return the number themselves, we need to find the indexes of the prime numbers to extract them from array2, which are the indexes of the True values in is_nbr_prime:
idx = [i for i, val in enumerate(is_nbr_prime) if val]  # indexes of the values that are True in the is_nbr_prime list
unique_prime_nbrs = [array2[i] for i in idx]  # use those indexes on array2, the list of unique numbers, to take out only the prime numbers
That's it, you have your unique prime numbers in the list unique_prime_nbrs.
If we put all the steps together into two functions, the final code is:
import math

def is_prime(num):
    if num == 1:
        return False
    if num == 2:
        return True
    for i in range(2, int(math.sqrt(num)) + 1):
        if num % i == 0:
            return False
    return True

def check_prime(file_names):
    array = []
    array2 = []
    for i in file_names:
        theFile = open(i, "r")
        for val in theFile:
            array = array + val.split(",")
    for nbrs in array:
        if int(nbrs) not in array2:
            array2.append(int(nbrs))
    is_nbr_prime = []
    for i in array2:
        is_nbr_prime.append(is_prime(i))
    idx = [i for i, val in enumerate(is_nbr_prime) if val]
    unique_prime_nbrs = [array2[i] for i in idx]
    return unique_prime_nbrs
To call the function, we need to pass a list of file names, for instance:
file_names = ['One.txt', 'Two.txt']
unique_prime_nbrs = check_prime(file_names)
print(unique_prime_nbrs)
[5, 7, 13, 17, 19, 23]
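As an aside (not part of the original answer), the prime-filtering steps can be collapsed into a single list comprehension over array2:
# keep only the numbers for which is_prime returns True
unique_prime_nbrs = [n for n in array2 if is_prime(n)]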
There is a bunch of stuff you need to do:
When reading the text from the input files, convert it to integers before storing anywhere.
Instead of making array a list, make it a set. This will enable testing membership in much shorter time.
Before storing a value from the first file in array, check whether it is prime, using the is_prime function you wrote.
When reading the integers from the second file, before adding the values to array1, test if they are already in array. No need to check for prime-ness, because array would already contain only primes.
Finally, before outputting the values from array1 you would need to convert them back to strings, and use the join string method to join them with separating commas.
So, get to it.
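A minimal sketch of that set-based approach (my reading of the steps above, assuming both files contain comma-separated integers and reusing the is_prime function):
# primes from the first file, stored in a set for fast membership tests
with open("One.txt") as f:
    primes_one = {int(s) for line in f for s in line.split(",") if is_prime(int(s))}
# values from the second file that overlap with those primes
with open("Two.txt") as f:
    overlap = [int(s) for line in f for s in line.split(",") if int(s) in primes_one]
# join back into a comma-separated string for output
print(",".join(str(n) for n in overlap))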
What should the dictionary form_data be?
Desired output from the Python code data = parse.urlencode(form_data).encode():
"entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4"
I tried various dictionary structures, including ones with None, [] and a dictionary within a dictionary, but I am unable to get this output.
form_data = {'entry.330812148_sentinel':None,
'entry.330812148':'Test1',
'entry.330812148':'Test2',
'entry.330812148':'Test3',
'entry.330812148':'Test4'}
from urllib import request, parse
data = parse.urlencode(form_data).encode()
print("Printing Parsed Form Data........")
"entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4"
You can use parse_qs from urllib.parse to get back the Python data structure:
import urllib.parse
>>> s = 'entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4'
>>> d1 = urllib.parse.parse_qs(s)
>>> d1
{'entry.330812148': ['Test1', 'Test2', 'Test3', 'Test4']}
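Note that parse_qs goes in the opposite direction (query string to dict). For the original goal, a plain dict cannot hold duplicate keys (each 'Test' value overwrites the previous one), but urlencode can expand a list value into repeated keys when given doseq=True. A minimal sketch:
from urllib import parse
form_data = {'entry.330812148_sentinel': '',
             'entry.330812148': ['Test1', 'Test2', 'Test3', 'Test4']}
# doseq=True emits one key=value pair per list element
data = parse.urlencode(form_data, doseq=True).encode()
print(data.decode())
# entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4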
I am converting a fixed-width file to a delimited file ('|' delimiter) using the pandas read_fwf method. My input file ("infile.txt") is around 16 GB and 9.9 million records; while creating the dataframe it occupies almost 3 times that memory (around 48 GB) before it creates the output file. Can someone help me improve the logic below and throw some light on where this extra memory comes from? (I know seq_id, fname and loadDatetime will occupy some space, but it should only be a couple of GBs.)
Note:
I am processing multiple files (of similar size) in a loop, one after the other, so I have to clear the memory before the next file takes over.
'''infile.txt'''
1234567890AAAAAAAAAA
1234567890BBBBBBBBBB
1234567890CCCCCCCCCC
'''test_layout.csv'''
FIELD_NAME,START_POS,END_POS
FIELD1,0,10
FIELD2,10,20
'''test.py'''
import datetime
import pandas as pd
import csv
from collections import OrderedDict
import gc
seq_id = 1
fname= 'infile.txt'
loadDatetime = '04/10/2018'
in_layout = open("test_layout.csv","rt")
reader = csv.DictReader(in_layout)
boundries, col_names = [[],[]]
for row in reader:
    boundries.append(tuple([int(str(row['START_POS']).strip()), int(str(row['END_POS']).strip())]))
    col_names.append(str(row['FIELD_NAME']).strip())
dataf = pd.read_fwf(fname, quoting=3, colspecs = boundries, dtype = object, names = col_names)
len_df = len(dataf)
'''Used pair of key, value tuples and OrderedDict to preserve the order of the columns'''
mod_dataf = pd.DataFrame(OrderedDict((('seq_id',[seq_id]*len_df),('fname',[fname]*len_df))), dtype=object)
ldt_ser = pd.Series([loadDatetime]*len_df,name='loadDatetime', dtype=object)
dataf = pd.concat([mod_dataf, dataf],axis=1)
alldfs = [mod_dataf]
del alldfs
gc.collect()
mod_dataf = pd.DataFrame()
dataf = pd.concat([dataf,ldt_ser],axis=1)
dataf.to_csv("outfile.txt", sep='|', quoting=3, escapechar='\\' , index=False, header=False,encoding='utf-8')
''' Release Memory used by DataFrames '''
alldfs = [dataf]
del ldt_ser
del alldfs
gc.collect()
dataf = pd.DataFrame()
I used the garbage collector, del on the dataframes, and re-initialisation to clear the memory used, but the total memory is still not released from the dataframes.
Inspired by https://stackoverflow.com/a/49144260/2799214
'''OUTPUT'''
1|infile.txt|1234567890|AAAAAAAAAA|04/10/2018
1|infile.txt|1234567890|BBBBBBBBBB|04/10/2018
1|infile.txt|1234567890|CCCCCCCCCC|04/10/2018
I had the same problem as you using https://stackoverflow.com/a/49144260/2799214
I found a solution using gc.collect() by splitting my code into different methods within a class. For example:
class A:
    def __init__(self):
        # your code

    def first_part_of_my_code(self):
        # your code
        # I want to clear my dataframe
        del my_dataframe
        gc.collect()
        my_dataframe = pd.DataFrame()  # not sure whether this line really helps
        return my_new_light_dataframe

    def second_part_of_my_code(self):
        # my code
        # same principle
So when the program calls the methods, the garbage collector clears the memory once the program leaves the method.
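Another option worth sketching (my suggestion, not part of the original answer): read_fwf accepts a chunksize argument, so the file can be streamed through memory in slices instead of loaded whole. The names fname, boundries, col_names, seq_id and loadDatetime are reused from the question; the chunk size is a hypothetical value to tune:
import pandas as pd

chunksize = 100000  # hypothetical; tune to the memory available
with open("outfile.txt", "w", encoding="utf-8") as out:
    reader = pd.read_fwf(fname, quoting=3, colspecs=boundries,
                         dtype=object, names=col_names, chunksize=chunksize)
    for chunk in reader:
        # constant columns broadcast across every row of the chunk
        chunk.insert(0, 'fname', fname)
        chunk.insert(0, 'seq_id', seq_id)
        chunk['loadDatetime'] = loadDatetime
        chunk.to_csv(out, sep='|', quoting=3, escapechar='\\',
                     index=False, header=False)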
How do I get data by querying a radius from a ball tree? For example:
from sklearn.neighbors import BallTree
import pandas as pd
bt = BallTree(df[['lat','lng']], metric="haversine")
for idx, row in df.iterrows():
    res = df[bt.query_radius(row[['lat','lng']], r=1)]
I want to get the rows in df that are within radius r=1, but it throws a type error:
TypeError: unhashable type: 'numpy.ndarray'
Following the first answer, I got index out of range when iterating over the rows:
5183
(5219, 25)
5205
(5219, 25)
5205
(5219, 25)
5221
(5219, 25)
Traceback (most recent call last):
File "/Users/Chu/Documents/dssg2018/sa4.py", line 45, in <module>
df.loc[idx,word]=len(df.iloc[indices[idx]][df[word]==1])/\
IndexError: index 5221 is out of bounds for axis 0 with size 5219
And the code is
import numpy as np

bag_of_words = ['beautiful','love','fun','sunrise','sunset','waterfall','relax']
for idx, row in df.iterrows():
    for word in bag_of_words:
        if word in row['caption']:
            df.loc[idx, word] = 1
        else:
            df.loc[idx, word] = 0
bt = BallTree(df[['lat','lng']], metric="haversine")
indices = bt.query_radius(df[['lat','lng']],r=(float(10)/40000)*360)
for idx, row in df.iterrows():
    for word in bag_of_words:
        if word in row['caption']:
            print(idx)
            print(df.shape)
            df.loc[idx, word] = len(df.iloc[indices[idx]][df[word]==1]) / \
                np.max([1, len(df.iloc[indices[idx]][df[word]!=1])])
The error is not in the BallTree; the indices it returns are not being used properly for indexing.
Do it this way:
for idx, row in df.iterrows():
    indices = bt.query_radius(row[['lat','lng']].values.reshape(1,-1), r=1)
    res = df.iloc[[x for b in indices for x in b]]
    # Do what you want to do with res
This will also do (since we are sending only a single point each time):
res = df.iloc[indices[0]]
Explanation:
I'm using scikit-learn 0.20, so the code you wrote above:
df[bt.query_radius(row[['lat','lng']],r=1)]
did not work for me. I needed to make it a 2-d array by using reshape().
Now bt.query_radius() returns an array of arrays of indices within the specified radius r, as mentioned in the documentation:
ind : array of objects, shape = X.shape[:-1]
Each element is a numpy integer array listing the indices of neighbors of the corresponding point. Note that unlike the results of a k-neighbors query, the returned neighbors are not sorted by distance by default.
So we needed to iterate two arrays to reach the actual indices of the data.
Now once we got the indices, in a pandas Dataframe, iloc is the way to access data with indices.
Update:
You don't need to query the bt separately for each individual point. You can send all the points at once and get back a 2-d array whose element at each index contains the indices of the points within the radius of the point at that index.
indices = bt.query_radius(df[['lat','lng']], r=1)
for idx, row in df.iterrows():
    nearest_points_index = indices[idx]
    res = df.iloc[nearest_points_index]
    # Do what you want to do with res
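One more caveat (my addition, not part of the original answer): query_radius returns positional indices, while iterrows yields index labels. If the DataFrame's index is not a clean 0..n-1 range, mixing the two produces exactly the "index 5221 is out of bounds for axis 0 with size 5219" error shown in the question, so reset the index before building the tree:
# make labels and positions coincide so indices[idx] and df.iloc agree
df = df.reset_index(drop=True)
bt = BallTree(df[['lat','lng']], metric="haversine")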