Is there a significant difference between data['column_name'] vs data.column_name [duplicate] - pandas

For example, I'm studying an example like this:
train['Datetime'] = pd.to_datetime(train.Datetime,format='%d-%m-%Y %H:%M')
If I run train['Datetime'].head() and train.Datetime.head(), the results are identical. So why use one over the other? Or why use both?

I have used both. I think the most important consideration is about how sustainable and flexible you want your code to be. For quick checks and "imperative programming" (like Jupyter Notebooks), you could use the minimal shorthand:
train.Datetime.head()
However, you will soon find that when you need to pass in column names held in variables (from a UI or some other source), or want to debug code efficiently, the full notation:
train['Datetime'].head()
has two main benefits, and it is a good habit to adopt early on when programming.
First, in Integrated Development Environments (IDEs) used for editing code, the string 'Datetime' will be highlighted to remind you that it is a "hard dependency" in your code, whereas Datetime (no quotes, just a .) will not be highlighted.
This may not sound like a big deal, but when you are looking at hundreds of lines of code (or more), seeing where you have "hardcoded" a column name is important.
The other main advantage of [] notation is that you can pass in string variables to the notation.
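import pandas as pd
import numpy as np
# make some data
n = 100
df = pd.DataFrame({
    'Fruit': np.random.choice(['Apple', 'Orange', 'Grape'], n),
    'Animal': np.random.choice(['Cat', 'Dog', 'Fish'], n),
    'x1': np.random.randn(n)})
# some name from a user interface. It could be "Fruit" or "Animal"
group = "Animal"
# the numeric column to summarise (this could also come from the UI)
value = "x1"
# use those string variables in an expression (in this case, as a group by);
# selecting the numeric column avoids taking the mean of a text column,
# which raises an error in recent pandas versions
df.groupby(group)[value].agg(['count', 'mean', 'std'])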
Here, even on Stack Overflow, you can see that the df.groupby() call contains no hardcoded column names (shown in red text). This separation of user input from the code that acts on it is subtle, but extremely important.
Good luck!

There will be an issue when the column name contains blank spaces; in that case bracket indexing is a must.
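To make the cases above concrete, here is a small sketch of where dot access breaks down. The DataFrame and its column names ("unit price", "count", "total") are made up for illustration:

```python
import warnings
import pandas as pd

# Hypothetical data: one label with a space, one that clashes with a method
df = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

# Bracket access works for any label, including ones containing spaces;
# df.unit price would simply be a syntax error.
prices = df["unit price"]

# "count" is also a DataFrame method, so attribute access returns the
# method, not the column.
count_attr = df.count
count_col = df["count"]

# Assigning through an attribute does not create a column; pandas warns
# and stores a plain Python attribute instead.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.total = df["unit price"] * df["count"]
has_total_column = "total" in df.columns

# Bracket assignment is what actually adds the column.
df["total"] = df["unit price"] * df["count"]
```

Bracket notation behaves the same in all three situations, which is one more reason to prefer it outside of quick interactive checks.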


Migrating to Qt6: Is there a method to finding long-form names?

I am migrating a codebase from PyQt5 to PyQt6. I read the Stack Overflow question another user asked:
Migrating to Qt6/PyQt6: what are all the deprecated short-form names in Qt5?
My question is simply a variation of this, i.e., in my case I've spent several hours trying to find the longer form for the following:
def flags(self, index: QModelIndex) -> Qt.ItemFlags:
    return Qt.ItemIsDropEnabled | Qt.ItemIsEnabled | Qt.ItemIsEditable | Qt.ItemIsSelectable | Qt.ItemIsDragEnabled
Error received:
AttributeError: type object 'Qt' has no attribute 'ItemFlags'.
Previously I've been able to figure out the long-form equivalent required in Qt6, but in this case I can't figure it out. (When this one is solved, I will probably have to find the equivalents for the return values in the code example above: Qt.ItemIsDropEnabled, etc.)
I would have been happy to post this as a comment under the other question, but Stack Overflow says I need 50 reputation points to comment.
The "flags" type (plural form) refers to a combination of enum values, whereas the enum (singular form) refers to each single value (see the documentation).
As written in the comment on the answer you linked, the PyQt maintainer seems to be very resistant to any "shorter" solution, and he chose to always use the enum namespace for consistency with Python, even for flags, and even though some Qt enum/flag names are inconsistent for backward compatibility reasons. This can be very confusing, not to mention that it obviously means even longer lines of code (and we all know how we already struggle with the length of Qt object names).
To clarify, here's a typical and confusing case (in Qt terms, and valid with PyQt5):
Qt.AlignCenter is a Qt::AlignmentFlag, but it's actually an enum (even if it's named "flag"); see the result for PyQt5:
>>> test = Qt.AlignCenter
>>> print(type(test))
<class 'PyQt5.QtCore.Qt.AlignmentFlag'>
>>> print(test)
132
Qt.AlignHCenter|Qt.AlignVCenter results in a Qt::Alignment, which is a real flag, even though its value is equal to Qt.AlignCenter:
>>> test = Qt.AlignHCenter|Qt.AlignVCenter
>>> print(type(test))
<class 'PyQt5.QtCore.Alignment'>
>>> print(test)
<PyQt5.QtCore.Alignment object at 0xb30ff41c>
>>> print(int(test))
132
In any case, the point remains: if you want to use a value, you must use the enum namespace, which is Qt.ItemFlag without the trailing "s".
def flags(self, index: QModelIndex) -> Qt.ItemFlag:
    return (
        Qt.ItemFlag.ItemIsDropEnabled
        | Qt.ItemFlag.ItemIsEnabled
        | Qt.ItemFlag.ItemIsEditable
        | Qt.ItemFlag.ItemIsSelectable
        | Qt.ItemFlag.ItemIsDragEnabled
    )
Remember to always refer to the official C++ API documentation of classes and objects in order to understand the different types and get their proper names: neither the PyQt nor the "Qt for Python" (aka PySide) docs are good for that.
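The enum-versus-flags distinction can be sketched with the standard library alone. As far as I understand, PyQt6 exposes Qt flags as Python enum types, so the stand-in below should behave similarly, but the ItemFlag class here is a hypothetical analogy and not the real Qt.ItemFlag:

```python
from enum import Flag, auto


class ItemFlag(Flag):
    """Hypothetical stand-in for Qt.ItemFlag, for illustration only."""
    ItemIsSelectable = auto()
    ItemIsEditable = auto()
    ItemIsEnabled = auto()


# A single value must be reached through the enum's namespace.
single = ItemFlag.ItemIsEnabled

# Combining values with | yields a value of the same enum type.
combined = ItemFlag.ItemIsEnabled | ItemFlag.ItemIsEditable

# Membership tests show which single values a combination contains.
is_editable = ItemFlag.ItemIsEditable in combined
is_selectable = ItemFlag.ItemIsSelectable in combined
```

The point of the analogy is only that every member lives under the enum's own name, which is why the short form Qt.ItemIsEnabled is no longer available.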

Text file import and how to manipulate select rows and columns into different arrays and do calculations

I have limited experience with IPython and I work in a research lab. We run an experiment that outputs a .txt file once it has finished taking results. Up until now, because of the format of the .txt file, we have first had to copy and paste it into an Excel sheet and then do a bunch of annoying copying and pasting to get the different rows into the order we want.
The only columns we need are the "well" column and the "Abs" column. I need a way to associate each Abs number with its well identity. The end goal is to write a script that does some very simple math, such as averaging wells H01 and H02 so that I can subtract that number from the rest of the wells.
That might have been confusing, so let me know if you have any questions or ideas on where to start beyond reading in the file (which is all I am able to do right now). Your help would be greatly appreciated!
Text file produced by machine
Code that I have attempted (not very much)
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
%matplotlib inline
data = np.loadtxt('python test file.txt', skiprows=3, usecols=(1, 7), dtype=str, unpack=True)
print(data)
This is what is printed...
[["b'A01'" "b'B01'" "b'C01'" "b'D01'" "b'E01'" "b'F01'" "b'G01'" "b'H01'"
"b'H02'" "b'G02'" "b'F02'" "b'E02'" "b'D02'" "b'C02'" "b'B02'" "b'A02'"
"b'A03'" "b'B03'" "b'C03'" "b'D03'" "b'D04'" "b'C04'" "b'B04'" "b'A04'"]
["b'6'" "b'6'" "b'5.3501'" "b'6'" "b'6'" "b'6'" "b'3.59128'" "b'0.177349'"
"b'0.174828'" "b'3.42995'" "b'6'" "b'5.37723'" "b'5.39004'" "b'5.54484'"
"b'6'" "b'6'" "b'5.35271'" "b'3.78453'" "b'5.41057'" "b'6'" "b'6'"
"b'5.3402'" "b'3.04992'" "b'6'"]]
I'm not sure if I've managed to completely understand your question. First, when you are reading the file, you are specifying that each element is a string (dtype=str). However, you have one column of strings and one column of floats. Then, you are also using the argument unpack=True which means that numpy will output more than one array, but you are collecting only data. If this is not clear, don't worry, it will become clearer with the example below.
Another thing is that Python 3 strings are Unicode by default, whereas np.loadtxt here read the file contents as bytes; converting those bytes to str is why you are seeing the b in front of the values that you are reading.
To import the columns you want, you can do the following:
well, absorption = np.loadtxt('file.txt',
                              skiprows=3,
                              usecols=(1, 7),
                              dtype={
                                  'names': ('Well', 'Abs'),
                                  'formats': ('U3', 'f4')},
                              unpack=True)
In the command above, pay attention to how you specify the formats. For example, f4 means a 4-byte (32-bit) float; it sets the storage size, not the number of decimals. If you changed it to f2, you would get a 2-byte float with much lower precision, while f8 gives a standard double. Be careful with what you need. Likewise, U3 means a Unicode string of at most 3 characters, so you may need something longer if your well names are longer.
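A quick way to check what these format codes mean is to ask NumPy directly. This is only a small sketch; the well name "H012" is made up to show what happens when a string is longer than the declared width:

```python
import numpy as np

# The digit in a float code is the size in bytes.
float_sizes = {code: np.dtype(code).itemsize for code in ("f2", "f4", "f8")}

# U3 holds at most 3 characters; longer strings are cut off silently.
truncated = np.array(["H012"], dtype="U3")[0]

print(float_sizes)
print(truncated)
```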
Now, you wanted to work on your data. First, you need to find what data lines correspond to H01 and H02. You can do the following:
blanks_indexes = np.where(np.logical_or(well=='H01', well=='H02'))
And if you want to calculate the average:
abs_blanks = np.mean(absorption[blanks_indexes])
And simply:
corrected_abs = absorption - abs_blanks
I think this can get you started. Depending on your background, you may want to look at pandas, which is a great tool for working with tabular data like yours, but you have to learn it, so it is up to you to decide whether to invest the time.
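Putting the steps together, here is a minimal end-to-end sketch. The real instrument file is not shown in the question, so the data below is a made-up two-column stand-in read from memory, and f8 is used instead of f4 simply to keep double precision:

```python
import io
import numpy as np

# Hypothetical stand-in for the instrument output: well name, absorbance
sample = io.StringIO(
    "A01 6.0\n"
    "H01 0.2\n"
    "H02 0.4\n"
    "B01 3.3\n"
)

# A structured dtype plus unpack=True returns one array per named field.
well, absorption = np.loadtxt(
    sample,
    dtype={"names": ("Well", "Abs"), "formats": ("U3", "f8")},
    unpack=True,
)

# Boolean mask selecting the blank wells
is_blank = np.isin(well, ["H01", "H02"])
blank_mean = absorption[is_blank].mean()

# Subtract the blank average from every well and key the result by well name
corrected = absorption - blank_mean
by_well = dict(zip(well, corrected))
```

With the real file you would keep the skiprows and usecols arguments from the answer above.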

Accessing a .fits file and plotting its columns

I'm trying to access a .fits file and plot two columns (out of many!).
I used pyfits to access the file, and
plt.plotfile('3XMM_DR5.fits', delimiter=' ', cols=(0, 1), names=('x-axis','y-axis'))
but that's not working. Are there any alternatives? And is there any way to open the file using Python in order to access the data table?
According to the docs from matplotlib for plotfile:
Note: plotfile is intended as a convenience for quickly plotting data from flat files; it is not intended as an alternative interface to general plotting with pyplot or matplotlib.
This isn't very clear. I think by "flat files" it just means CSV data or something similar. This function isn't used very much in my experience, and it certainly doesn't know anything about FITS files, which are seldom used outside astronomy. You mentioned in your post that you did something with PyFITS, but that isn't demonstrated anywhere in your question.
PyFITS, incidentally, has been deprecated for several years now, and its functionality is integrated into Astropy.
You can open a table from a FITS file with astropy.table.Table.read:
import matplotlib.pyplot as plt
from astropy.table import Table
table = Table.read('3XMM_DR5.fits')
then access the columns with square bracket notation like:
plt.plot(table['whatever the x axis column is named'], table['y axis column name'])

Why does the official documentation prefer concatenate over hstack/vstack in NumPy?

I see that the latest documentation for hstack/vstack notes that "you should prefer np.concatenate or np.stack".
But I think they are more readable than concatenate(a, 0) or concatenate(a, 1).
All 3 'stack' functions use concatenate (as do np.append and column_stack). It's instructive to look at their code, for example with np.source(np.hstack).
What they all do is massage the dimensions of the input arrays, making sure they are 1d or 2d etc., and then call concatenate with the appropriate axis. So in the long run it's a good idea to know how to use concatenate without the 'crutch' of the others.
But people will continue to use hstack and vstack where convenient. dstack and column_stack are less common. np.append is frequently misused and should be banished.
I think this 'preferred' note was added when np.stack was added. np.stack also uses concatenate, but in a somewhat more sophisticated way. It inserts a new axis (with expand_dims). I view it as a generalization of np.array. When given a list of matching arrays, np.array joins them on a new initial axis. np.stack does the same thing as a default, but lets us specify a different 'new' axis for concatenation.
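The relationships described above can be checked on two small 1d arrays. This is only a sketch of the equivalences for this particular input shape, not a copy of NumPy's actual source:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# vstack: promote each 1d array to a 2d row, then join along axis 0
v_stack = np.vstack([a, b])
v_concat = np.concatenate([np.atleast_2d(a), np.atleast_2d(b)], axis=0)

# hstack on 1d inputs is a plain concatenate along axis 0
h_stack = np.hstack([a, b])
h_concat = np.concatenate([a, b], axis=0)

# stack: insert a new axis into each array, then join along that axis
s_stack = np.stack([a, b], axis=1)
s_concat = np.concatenate(
    [np.expand_dims(a, 1), np.expand_dims(b, 1)], axis=1
)

# With the default axis=0, stack matches np.array on a list of arrays
default_stack = np.stack([a, b])
as_array = np.array([a, b])
```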
I should qualify my answer. It is not official. Rather I'm making an educated guess based on knowledge of the code.

can a variable have multiple values

In algebra if I make the statement x + y = 3, the variables I used will hold the values either 2 and 1 or 1 and 2. I know that assignment in programming is not the same thing, but I got to wondering. If I wanted to represent the value of, say, a quantumly weird particle, I would want my variable to have two values at the same time and to have it resolve into one or the other later. Or maybe I'm just dreaming?
Is it possible to say something like i = 3 or 2;?
This is one of the features planned for Perl 6 (junctions), with syntax that should look like my $a = 1|2|3;
If ever implemented, it would work intuitively, like $a==1 being true at the same time as $a==2. Also, for example, $a+1 would give you a value of 2|3|4.
This feature is actually available in Perl 5 as well through the Perl6::Junction and Quantum::Superpositions modules, but without the syntactic sugar (it is exposed through the 'functions' all and any).
At least for comparison (b < any(1,2,3)) it was also available in Microsoft Cω experimental language, however it was not documented anywhere (I just tried it when I was looking at Cω and it just worked).
You can't do this with native types, but there's nothing stopping you from creating a variable object (presuming you are using an OO language) which has a range of values or even a probability density function rather than a single actual value.
You will also need to define all the mathematical operators between your variables, and between your variables and native scalars. The same goes for the equality and assignment operators.
numpy arrays do something similar for vectors and matrices.
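As a rough sketch of such a variable object in Python: the AnyOf class below is hypothetical and only implements addition and equality, loosely imitating the "any" junction mentioned in the earlier answer:

```python
import operator
from itertools import product


class AnyOf:
    """Hypothetical value that is 'any of' several values at once."""

    def __init__(self, *values):
        self.values = frozenset(values)

    def _others(self, other):
        # Treat a plain scalar as a one-element set of possibilities
        return other.values if isinstance(other, AnyOf) else (other,)

    def __add__(self, other):
        # Apply the operation to every pairing of possible values
        pairs = product(self.values, self._others(other))
        return AnyOf(*(operator.add(x, y) for x, y in pairs))

    __radd__ = __add__  # addition is commutative here

    def __eq__(self, other):
        # True if at least one pairing of possible values is equal
        return any(x == y for x, y in product(self.values, self._others(other)))

    def __repr__(self):
        return "AnyOf(" + ", ".join(map(repr, sorted(self.values))) + ")"


a = AnyOf(1, 2, 3)
shifted = a + 1

x = AnyOf(1, 2)
y = AnyOf(1, 2)
total = x + y
```

Here a == 1 and a == 2 are both true, and x + y == 3 holds because at least one pairing of the possible values sums to 3.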
That's also the kind of thing you can do in Prolog. You define rules that constrain your variables and then let Prolog resolve them ...
It takes some time to get used to it, but it is wonderful for certain problems once you know how to use it ...
Damian Conway's Quantum::Superpositions might do what you want:
https://metacpan.org/pod/Quantum::Superpositions
You might need your crack-pipe however.
What you're asking seems to be how to implement a Fuzzy Logic system. These have been around for some time and you can undoubtedly pick up a library for the common programming languages quite easily.
You could use a struct and handle the operations manually. Otherwise, no: a variable only has one value at a time.
A variable is nothing more than an address into memory. That means a variable describes exactly one place in memory (its length depending on the type). So as long as we have no "quantum memory" (and we don't have it, and it doesn't look like we will have it in the near future), the answer is no.
If you want to model this behaviour in a program, the way to do it would be to use an array (with length equal to the maximum number of simultaneous values). This comes with increased runtime, since the computations must be done on each of the values (e.g. x+y must be computed for every combination of the two values: x1+y1, x2+y2, x1+y2 and x2+y1).
In Perl, you can.
If you use Scalar::Util, you can have a variable take two values: one if it's used in string context, and another if it's used in numeric context.