File operations are slow: is there a faster look-up method in Python? - numpy

I am storing the values of the form given below into a file:
143 800 'Ask'
213 457 'Comment'
424 800 'Hi'
The first column contains unique elements here.
However, looking up values by the first column is quite inefficient when I store the data in a file. Is there a more efficient way in Python to get a faster look-up?
I am aware of dictionaries in Python for accomplishing this, but I am looking for some other method. Since the data consists of trillions of records, I cannot keep them in a dictionary in RAM, so I am searching for some other approach.
Also, with each program execution the rows are going to be inserted again in the case of databases. How do I overcome that? An example of what I am getting confused about with databases is given below:
143 800 'Ask'
213 457 'Comment'
424 800 'Hi'
143 800 'Ask'
213 457 'Comment'
424 800 'Hi'

Here's a full code example using sqlite3, showing how to initialise the database, put data into it, and get a single row of data out.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE Widget (id INTEGER PRIMARY KEY,
                                     serial_number INTEGER,
                                     description TEXT);""")
my_data = [[143, 800, 'Ask'],
           [213, 457, 'Comment'],
           [424, 800, 'Hi']]
for row in my_data:
    conn.execute("INSERT INTO Widget (id, serial_number, description) VALUES (?,?,?);", row)
conn.commit()  # save changes
res = conn.execute("SELECT * FROM Widget WHERE id=143")
row = res.fetchone()
print(row)  # prints (143, 800, 'Ask')
Note the use of the special filename :memory: to open a temporary database.
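To address the duplicate-rows concern from the question, one possible variation (a sketch of mine, not part of the answer above) is to use an on-disk database file and let the primary key enforce uniqueness across program runs, e.g. with INSERT OR REPLACE:
import sqlite3

# 'widgets.db' is an illustrative filename; the table matches the one above
conn = sqlite3.connect('widgets.db')
conn.execute("""CREATE TABLE IF NOT EXISTS Widget (id INTEGER PRIMARY KEY,
                                                   serial_number INTEGER,
                                                   description TEXT);""")
my_data = [(143, 800, 'Ask'), (213, 457, 'Comment'), (424, 800, 'Hi')]
# rerunning the program does not create duplicate rows: an existing id is
# overwritten instead of being inserted a second time
conn.executemany("INSERT OR REPLACE INTO Widget (id, serial_number, description) VALUES (?,?,?);", my_data)
conn.commit()
print(conn.execute("SELECT * FROM Widget WHERE id=143").fetchone())  # (143, 800, 'Ask')
conn.close()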

What you're asking for is probably called a "Database table" and an "Index". The classic approach is to have a supplementary file (index) which maps the keys of the data tuples in the table to absolute positions of the tuples in the file.
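As a rough illustration of that idea (the file names and exact layout here are my own, not from the question): record each key's byte offset while writing the data file, persist that mapping, and later seek() straight to the row instead of scanning the whole file.
import json

# build the data file and an index that maps key -> byte offset
# ('data.txt' and 'index.json' are illustrative names)
records = [(143, 800, 'Ask'), (213, 457, 'Comment'), (424, 800, 'Hi')]
index = {}
with open('data.txt', 'wb') as data:
    for key, num, text in records:
        index[key] = data.tell()                    # byte position of this row
        data.write(f"{key} {num} {text!r}\n".encode('utf-8'))
with open('index.json', 'w') as idx:
    json.dump(index, idx)

# later: look up a single key without reading the whole data file
with open('index.json') as idx:
    index = json.load(idx)                          # JSON keys come back as strings
with open('data.txt', 'rb') as data:
    data.seek(index['143'])
    print(data.readline().decode('utf-8').rstrip())  # 143 800 'Ask'
With trillions of records the index mapping itself no longer fits in RAM either, which is why the database answers in this thread scale better: there the index lives on disk as well (typically as a B-tree over the primary key).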

I don't understand: do you want to be able to search faster in the file itself, or in the file's contents once loaded into Python? In the latter case, use a dictionary with the unique elements as keys.
values = {143: [800, 'Ask'], 213: [457, 'Comment'], 424: [800, 'Hi']}

If you need to look things up in a persistent store, use a database. One example is sqlite, which is built-in.

Also with each program execution the rows are going to be inserted
If you want to keep storing the data in a file the way you do now, then the simple solution to prevent duplicate entries from appearing on the next execution would be to truncate the file first. You can do this by opening it with the w flag:
f = open('filename', 'w')
# ...
f.close()
However it sounds as if you just want to store some data while the program is executed, i.e. you want to keep data around without making it persistent. If that’s the case, then I am wondering why you actually store the contents in a file.
The more obvious way, which is also pythonic (although it’s not special to Python), would be to keep it in a dictionary during the program execution. A dictionary is mutable, so you can change its content all the time: You can add new entries, or even update entries if you later get more information on them.
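For example, reusing the values shown in the question (the updated string below is made up purely for illustration):
values = {143: [800, 'Ask'], 213: [457, 'Comment']}
values[424] = [800, 'Hi']        # add a new entry later in the run
values[213][1] = 'Reply'         # update an existing entry in place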
I knew about this form of storing in a dictionary, but at times I don't have a value for values[143][1], i.e. the string 'None' is stored in its place.
That’s not a problem at all. You can easily store an entry with 143 as the key and None as its value, or even an array of None values:
>>> values[143] = [ None, None ]
That way, the dictionary will still remember that you entered the key, so a check if the key is in the dictionary will return true:
>>> 143 in values
True
Is there any way other than dictionaries in Python for accomplishing the same? I was aware of dictionaries... I am just searching for some other way.
No, there usually is only one way to do something right in Python, as also told by the Zen of Python: “There should be one-- and preferably only one --obvious way to do it.”
As such, no, there is probably no appropriate way to do this without dictionaries. But then again, why are you searching for some other way? It does not sound to me as if you have a good reason to do so, and if you do, you have to make sure you explain why certain ways are undesirable for you.

Related

Can anyone explain what (1, 88, 2) means when getting error: "ValueError: Must pass 2-d input. shape=(1, 88, 2)"

I've been messing with dataframes and lists, trying to understand how they work, and I was wondering if someone could explain something for me about a list I can't seem to make into a dataframe because it's not a 2-d input...
So I am downloading the companies listed on a stock exchange. The stock exchange has about 500 companies. Each company can be in one or more index.
bovespa = pd.read_csv(r'D:\Libraries\Downloads\IbovParts.csv', sep=';')
This makes a dataframe from a file, which is a list of all the listed companies on the Brazilian B3 index, with 4 columns: the company name, type of stock, the code and which indexes the stock is part of, for example:
From this dataframe, I want to create a set of smaller dataframes, each of which will contain all the companies in that particular index.
I'm not sure it's the best way, but I found some similar code that creates a dictionary, where the index name is the key and the value is a list of all the stocks in that particular index.
First I manually made a list of the indexes:
list_of_indexes = ['AGFS', 'BDRX', 'GPTW', 'IBOV', 'IBRA', 'IBXL', 'IBXX', 'ICO2', 'ICON', 'IDIV', 'IFIL', 'IFIX', 'IFNC', 'IGCT', 'IGCX', 'IGNM', 'ISEE', 'ITAG', 'IVBX', 'MLCX', 'SMLL', 'UTIL']
Then this is the code that creates a dictionary of keys (index name) and values (empty lists) then fills the lists:
indexes = {key: [] for key in list_of_indexes}
for k in indexes:
    mask = bovespa['InIndexes'].str.contains(k)
    list = bovespa.loc[mask, ['Empresa', 'Code']]
    indexes[k].append(list)
This seems to work fine. Checking the printout it does what I want it to do.
Now, I want to choose one of the indexes (for example 'IBOV') and create a new dataframe which contains ONLY the codes of the companies in IBOV. I can then use this list of codes in the yf library to download the financial data for the companies of 'IBOV'.
To do this I tried this code, hoping to get a dataframe with an index, the company name and the company code:
IBOV_codes_df = pd.DataFrame(indexes.get('IBOV'))
and got this error:
ValueError: Must pass 2-d input. shape=(1, 88, 2)
The 'type' of the data I'm using (indexes.get('IBOV')) is a list:
type(indexes.get('IBOV'))
returns list, but pd.DataFrame can't use it. Also, I can't call any of the individual elements in the list. This is what the list looks like (in Jupyter):
indexes.get('IBOV')
At first I thought it was a 'normal' list with 88 rows and 2 columns, then I noticed the second square bracket AFTER the columns, and len(list) told me this list had only one line. I'm still fuzzy on lists and dataframes etc...
Anyway, this error seems to be quite common, and I found a solution here on stackoverflow:
pd.DataFrame(IBOV_codes[0])
Unfortunately, the post on stackoverflow just told the original poster to "do this" with no explanation and it worked. It also worked for me, and created a dataframe that is identical in appearance to the list (but without the brackets, obviously.)
Logically, as there is only one line in the list, [0] is the only element I can use, so it makes sense. My first question is... why? What the heck's going on? How can Python make a dataframe from a list with only one long, confusing string(?) element? I know it's pretty smart, but seriously, how? Also, if there is only one line, why does Python throw the error shape=(1, 88, 2)? How is that possible? What does shape=(1, 88, 2) mean or look like? I thought the shape would be (1, 1): one row and one column. Very confusing.
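A small experiment seems to reproduce what is going on (the two rows below are made up; only the column names come from my data): the dictionary value is a one-element list whose single item is an 88-row, 2-column dataframe, so pandas sees a 3-d structure.
import pandas as pd

# a made-up stand-in for the 'IBOV' slice (88 rows in the real data, 2 here)
ibov_slice = pd.DataFrame({'Empresa': ['Company A', 'Company B'],
                           'Code': ['AAAA3', 'BBBB4']})

wrapped = [ibov_slice]            # this is what indexes['IBOV'] ends up holding
print(len(wrapped))               # 1 -> the "only one line" observation
# pd.DataFrame(wrapped) fails with "Must pass 2-d input. shape=(1, n, 2)"
# pd.DataFrame(wrapped[0]) works, because wrapped[0] is already a 2-d dataframe

# storing the dataframe directly avoids the extra list level in the first place:
indexes = {'IBOV': ibov_slice}    # instead of indexes['IBOV'].append(...)
IBOV_codes_df = pd.DataFrame(indexes['IBOV'])   # no ValueError
print(IBOV_codes_df.shape)        # (2, 2) here, (88, 2) with the real data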
My second question is about indexing...
In the original dataframe made from the csv, the list of ALL companies, the index (I assume) is the list of numbers: 0, 1, 2 ... 513.
When I start slicing, and create the final dataframe, using pd.DataFrame(IBOV_codes[0]), the index column is 1, 12,17,24,34... 492, 496, 497, 506, 511. Each company has the same 'index' it had when read from the csv.
The numbers are still sequential, but the index is missing loads of numbers. Are these indexes still integers? Or have they become strings/objects? What would be best practice? To reindex to 0, 1, 2, 3, 4 etc.?
If anyone can clear things up, "Thanks!"

SHOW KEYS in Aerospike?

I'm new to Aerospike and am probably missing something fundamental, but I'm trying to see an enumeration of the Keys in a Set (I'm purposefully avoiding the word "list" because it's a datatype).
For example,
To see all the Namespaces, the docs say to use SHOW NAMESPACES
To see all the Sets, we can use SHOW SETS
If I want to see all the unique Keys in a Set ... what command can I use?
It seems like one can use client.scan() ... but that seems like a super heavy way to get just the key (since it fetches all the bin data as well).
Any recommendations are appreciated! As of right now, I'm thinking of inserting (deleting) into (from) a meta-record.
Thank you #pgupta for pointing me in the right direction.
This actually has two parts:
In order to retrieve original keys from the server, one must -- during put() calls -- set policy to save the key value server-side (otherwise, it seems only a digest/hash is stored?).
Here's an example in Python:
aerospike_client.put(key, {'bin': 'value'}, policy={'key': aerospike.POLICY_KEY_SEND})
Then (modified from Aerospike's own documentation), you perform a scan and set the policy to not return the bin data. From this, you can extract the keys:
Example:
keys = []
scan = client.scan('namespace', 'set')
scan_opts = {
    'concurrent': True,
    'nobins': True,
    'priority': aerospike.SCAN_PRIORITY_MEDIUM
}
for x in scan.results(policy=scan_opts):
    keys.append(x[0][2])
The need to iterate over the result still seems a little clunky to me; I still think that using a 'master-key' record to store a list of all the other keys will be more performant in my case -- this way, I can simply make one get() call to the Aerospike server to retrieve the list.
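For reference, a rough sketch of that master-key idea (namespace, set, record and bin names here are placeholders, and exact exception/behaviour details can differ between client versions):
import aerospike
from aerospike import exception as aero_ex

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

MASTER = ('test', 'demo', 'all_keys')     # one well-known record per set

def register_key(user_key):
    """Remember user_key in the master record's 'keys' list bin."""
    try:
        _, _, bins = client.get(MASTER)
        keys = bins.get('keys', [])
    except aero_ex.RecordNotFound:
        keys = []
    if user_key not in keys:
        keys.append(user_key)
        client.put(MASTER, {'keys': keys})

def all_keys():
    """One get() call instead of a scan over the whole set."""
    try:
        _, _, bins = client.get(MASTER)
        return bins.get('keys', [])
    except aero_ex.RecordNotFound:
        return []
Note that the read-modify-write in register_key() is not atomic, so concurrent writers could lose updates; a server-side list append (if your client version supports the list operations) would be safer, and the list is still bounded by the maximum record size.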
You can choose not to bring the data back by setting includeBinData in ScanPolicy to false.

Reading Fortran binary file in Python

I'm having trouble reading an unformatted F77 binary file in Python.
I've tried the scipy.io.FortranFile method and the numpy.fromfile method, both to no avail. I have also read the file in IDL, which works, so I have a benchmark for what the data should look like. I'm hoping that someone can point out a silly mistake on my part -- there's nothing better than having an idiot moment and then washing your hands of it...
The data, bcube1, has dimensions 101x101x101x3 and is of r*8 type. There are 3090903 entries in total. They are written using the following statement (not my code, copied from the source).
open (unit=21, file=bendnm, status='new'
. ,form='unformatted')
write (21) bcube1
close (unit=21)
I can successfully read it in IDL using the following (also not my code, copied from colleague):
bcube=dblarr(101,101,101,3)
openr,lun,'bcube.0000000',/get_lun,/f77_unformatted,/swap_if_little_endian
readu,lun,bcube
free_lun,lun
The returned data (bcube) is double precision, with dimensions 101x101x101x3, so the header information for the file is aware of its dimensions (not flattened).
Now I try to get the same effect using Python, but no luck. I've tried the following methods.
In [30]: f = scipy.io.FortranFile('bcube.0000000', header_dtype='uint32')
In [31]: b = f.read_record(dtype='float64')
which returns the error Size obtained (3092150529) is not a multiple of the dtypes given (8). Changing the dtype changes the size obtained but it remains indivisible by 8.
Alternatively, using fromfile results in no errors but returns one more value than is in the array (a footer perhaps?), and the individual array values are wildly wrong (they should all be of order unity).
In [38]: f = np.fromfile('bcube.0000000')
In [39]: f.shape
Out[39]: (3090904,)
In [42]: f
Out[42]: array([ -3.09179121e-030, 4.97284231e-020, -1.06514594e+299, ...,
8.97359707e-029, 6.79921640e-316, -1.79102266e-037])
I've tried using byteswap to see if this makes the floating point values more reasonable but it does not.
It seems to me that the np.fromfile method is very close to working but there must be something wrong with the way it's reading the header information. Can anyone suggest how I can figure out what should be in the header file that allows IDL to know about the array dimensions and datatype? Is there a way to pass header information to fromfile so that it knows how to treat the leading entry?
I played around with it a bit, and I think I have an idea.
How Fortran stores unformatted data is not standardized, so you have to play around with it, but you need three pieces of information:
1. The format of the data. You suggest that is 64-bit reals, or 'f8' in Python.
2. The type of the header. That is an unsigned integer, but you need the length in bytes. If unsure, try 4. The header usually stores the length of the record in bytes, and is repeated at the end. Then again, it is not standardized, so no guarantees.
3. The endianness, little or big. Technically this applies to both the header and the values, but I assume they're the same. Python defaults to little endian, so if that were the correct setting for your data, I think you would have already solved it.
When you open the file with scipy.io.FortranFile, you need to give the data type of the header. So if the data is stored big_endian, and you have a 4-byte unsigned integer header, you need this:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '>u4')
When you read the data, you need the data type of the values. Again, assuming big_endian, you want type >f8:
vals = ff.read_reals('>f8')
Look here for a description of the syntax of the data type.
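Putting that together for the file described in the question -- this is a sketch under the assumption that the file is big-endian with 4-byte record markers (which the IDL /swap_if_little_endian flag hints at), not a verified answer:
import numpy as np
from scipy.io import FortranFile

# assumed: big-endian 4-byte record header and big-endian float64 payload
ff = FortranFile('bcube.0000000', 'r', header_dtype='>u4')
flat = ff.read_reals('>f8')          # should yield 3090903 values
ff.close()

# Fortran writes arrays in column-major order, so reshape with order='F'
bcube = flat.reshape((101, 101, 101, 3), order='F')
print(bcube.shape)                   # (101, 101, 101, 3)
If the reshaped values still look wrong, the other combinations of header size and endianness from the list above are the first things to vary.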
If you have control over the program that writes the data, I strongly suggest you write them into data streams, which can be more easily read by Python.
Fortran has record demarcations which are poorly documented, even in binary files.
So every write to an unformatted file:
integer*4 Test1
real*4 Matrix(3,3)
open(78, form='unformatted')
write(78) Test1
write(78) Matrix
close(78)
Should ultimately be padded by np.int32 values. (I've seen references saying these tell you the record length, but I haven't verified that personally.)
The above could be read in Python via numpy as:
input_file = open(file_location,'rb')
datum = np.dtype([('P1', np.int32), ('Test1', np.int32), ('P2', np.int32),
                  ('P3', np.int32), ('MatrixT', (np.float32, (3, 3))),
                  ('P4', np.int32)])
data = np.fromfile(input_file,datum)
Which should fully populate the data array with the individual data sets of the format above. Do note that numpy expects data to be packed in C format (row major) while Fortran format data is column major. For square matrix shapes like that above, this means getting the data out of the matrix requires a transpose as well, before using. For non square matrices, you will need to reshape and transpose:
Matrix = np.transpose(data[0]['MatrixT'])
Transposing your 4-D data structure is going to need to be done carefully. You might look into SciPy for automated ways to do so; the SciPy package seems to have Fortran-related utilities which I have not fully explored.

How to load multiple values for same key in property files?

See this property file:
S=1
M=2
[...]
IA=i
S=g
First, the value 1 will be assigned to S and then g will be assigned to S in the last line.
I want to keep multiple values for the same key S, how can I do this?
I think you need to specify whether this is a Java properties file or some other file, as I've rarely seen it possible to define the same key twice with different values without some kind of section break (i.e. .ini files).
The only other way I can think of for reading this type of file would be to pull it in as a nested dict, using the alphabet as the index (a-z, aa-az, etc.) and storing the key-value pairs you've seen. For example, you'd have "a" = {S='1'}, "b" = {M='2'} [..] "z" = {S='g'}, and then you could query for the "letterKey" in dict "alpha" where key.value("innerKey") = 'S', which would give you both 'a' and 'z' due to 'S' = '1' and 'g'. This may not be any easier than simply rewriting some of the existing code, though.
Since a dictionary can't have multiple keys with the same "index" (i.e. 'S' can't appear twice), if you could do as a commenter suggested and store the two values in an array under 'S', referencing them by position as S.value([0]) and S.value([1]), you'd have a much better program overall.
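If the file is being read from Python rather than Java (an assumption on my part -- the question doesn't say), the "store the values in an array" idea can be sketched with a dict of lists; the file name below is illustrative:
from collections import defaultdict

# collect every value seen for a key instead of overwriting the previous one
values = defaultdict(list)
with open('app.properties') as fh:
    for line in fh:
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue                         # skip blanks and comment lines
        key, _, value = line.partition('=')
        values[key.strip()].append(value.strip())

print(values['S'])   # ['1', 'g'] for the example file in the question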

Compare 2 datasets with dbunit?

Currently I need to create tests for my application. I used "dbunit" to achieve that and now need to compare 2 datasets:
1) The records from the database I get with QueryDataSet
2) The expected results are written in the appropriate FlatXML in a file which I read in as a dataset as well
Basically 2 datasets can be compared this way.
Now the problem is columns with a Timestamp. They will never match the expected dataset. I would really like to ignore them when comparing, but it doesn't work the way I want it to.
It does work when I compare each table on its own, adding a column filter and ignoreColumns. However, this approach is very cumbersome, as many tables are used in the comparison, and it forces one to add so much code that it eventually gets bloated.
The same applies to fields which have null values.
Another possible solution would be to compare only the very first column of all tables - not by its column name, but only by its column index. But I can't find anything for that.
Maybe I am missing something, or maybe it just doesn't work any other way than comparing each table on its own?
For the sake of completeness, some additional information must be posted. Actually, my previously posted solution will not work at all, as the process of reading data from the database got me trapped.
The process using "QueryDataSet" did read the data from the database and save it as a dataset, but the data couldn't be accessed from this dataset anymore (although I could see the data in debug mode)!
Instead the whole operation failed with an UnsupportedOperationException at org.dbunit.database.ForwardOnlyResultSetTable.getRowCount(ForwardOnlyResultSetTable.java:73)
Example code to produce failure:
QueryDataSet qds = new QueryDataSet(connection);
qds.addTable("specificTable");
qds.getTable("specificTable").getRowCount();
Even if you try it this way it fails:
IDataSet tmpDataset = connection.createDataSet(tablenames);
tmpDataset.getTable("specificTable").getRowCount();
In order to make extraction work you need to add this line (the second one):
IDataSet tmpDataset = connection.createDataSet(tablenames);
IDataSet actualDataset = new CachedDataSet(tmpDataset);
Great that this was documented nowhere...
But that is not all: now you'd certainly think that one could add this line after doing a "QueryDataSet" as well... but no! This still doesn't work! It will still throw the same Exception! It doesn't make any sense to me and I wasted so much time with it...
It should be noted that extracting data from a dataset which was read in from an xml file does work without any problem. This annoyance just happens when trying to get a dataset directly from the database.
If you have done the above you can then continue as below which compares only the columns you got in the expected xml file:
// put in here some code to read in the dataset from the xml file...
// and name it "expectedDataset"
// then get the tablenames from it...
String[] tablenames = expectedDataset.getTableNames();
// read dataset from database table using the same tables as from the xml
IDataSet tmpDataset = connection.createDataSet(tablenames);
IDataSet actualDataset = new CachedDataSet(tmpDataset);
for (int i = 0; i < tablenames.length; i++)
{
    ITable expectedTable = expectedDataset.getTable(tablenames[i]);
    ITable actualTable = actualDataset.getTable(tablenames[i]);
    ITable filteredActualTable = DefaultColumnFilter.includedColumnsTable(actualTable, expectedTable.getTableMetaData().getColumns());
    Assertion.assertEquals(expectedTable, filteredActualTable);
}
You can also use this format:
// Assert actual database table match expected table
String[] columnsToIgnore = {"CONTACT_TITLE","POSTAL_CODE"};
Assertion.assertEqualsIgnoreCols(expectedTable, actualTable, columnsToIgnore);