Code completion and IntelliSense in VS Code is absolutely god-awful for me. In every language. I have extensions installed and updated, but it's always absolute trash.
import pandas as pd
data_all = pd.read_csv(DATA_FILE, header=None)
data_all. (press tab)
No suggestions.
Do you really not know it's a Pandas DataFrame object? It's literally the line above.
I have this issue in Python, in Ruby/Rails, pretty much every language I try; the completion is absolute garbage. Do I have an extension that is breaking other extensions? Is VS Code just this bad? Why is it so inexplicably useless?
Installed Currently:
abusaidm.html-snippets#0.2.1
alefragnani.numbered-bookmarks#8.0.2
bmewburn.vscode-intelephense-client#1.6.3
bung87.rails#0.16.11
bung87.vscode-gemfile#0.4.0
castwide.solargraph#0.21.1
CoenraadS.bracket-pair-colorizer#1.0.61
donjayamanne.python-extension-pack#1.6.0
ecmel.vscode-html-css#1.10.2
felixfbecker.php-debug#1.14.9
felixfbecker.php-intellisense#2.3.14
felixfbecker.php-pack#1.0.2
formulahendry.auto-close-tag#0.5.10
golang.go#0.23.2
groksrc.ruby#0.1.0
k--kato.intellij-idea-keybindings#1.4.0
KevinRose.vsc-python-indent#1.12.0
Leopotam.csharpfixformat#0.0.84
magicstack.MagicPython#1.1.0
miguel-savignano.ruby-symbols#0.1.8
ms-dotnettools.csharp#1.23.9
ms-mssql.mssql#1.10.1
ms-python.python#2021.2.636928669
ms-python.vscode-pylance#2021.3.1
ms-toolsai.jupyter#2021.3.619093157
ms-vscode.cpptools#1.2.2
rebornix.ruby#0.28.1
sianglim.slim#0.1.2
VisualStudioExptTeam.vscodeintellicode#1.2.11
wingrunr21.vscode-ruby#0.28.0
Zignd.html-css-class-completion#1.20.0
If you check the IntelliSense for the read_csv() method (by hovering your mouse over it), you will see that it returns a DataFrame object:
(function)
read_csv(reader: IO, sep: str = ...,
#Okay... very long definition but scroll to the end...
float_precision: str | None = ...) -> DataFrame
But if you use IntelliSense to check the variable data_all:
import pandas as pd
data_all = pd.read_csv(DATA_FILE, header=None)
It is listed with Python's default type, Any. That's why the type checker (Pylance, in this case) isn't generating autocomplete suggestions.
So you simply need to tell the type checker explicitly that it is, in fact, a DataFrame object, as shown below.
import pandas as pd
from pandas import DataFrame
DATA_FILE = "myfile"
data_all: DataFrame = pd.read_csv(DATA_FILE, header=None)
# Now all autocomplete options on data_all are available!
It might seem strange that the type checker cannot infer the type in this example, until you realize that the read_csv() method is overloaded with many definitions, and some of them are typed as returning Any. So the type checker assumes the worst-case scenario and treats the result as Any unless told otherwise.
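An alternative that avoids the annotation is typing.cast(); this is just a minimal sketch, equivalent in effect to the annotation above, since cast() is purely a hint for the type checker and does nothing at runtime:
from typing import cast
import pandas as pd
from pandas import DataFrame

DATA_FILE = "myfile"
# cast() returns its argument unchanged; it only tells the type checker what to assume
data_all = cast(DataFrame, pd.read_csv(DATA_FILE, header=None))
data_all.head()  # autocomplete on data_all now resolves DataFrame members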
Related
I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary & end results were previously stored in an SQL database using sqlalchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes,FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
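For context, a hedged sketch of how such a UDF might be wired up against a binary content column; File, FileDelta and the "content" column name are taken from the question, while files_df is an assumed input DataFrame, not part of the original code:
from pyspark.sql import functions as F

fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
# files_df is assumed to have a binary "content" column (e.g. read from a binary file source)
parsed = files_df.withColumn("parsed", fileparser(F.col("content")))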
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict : vars(Spectrum(inputDict)),SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple Pandas DataFrames as fields.
I'm getting errors as soon as the Executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make these steps work?
I read in [1] that we could switch to the Pandas on Spark API, but to me that seems to be something to do within the package methods themselves. Is that maybe the solution: to rewrite the entire package & parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when outputting the results (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
import pandas as pd
from pyspark.sql import functions as F

def getSpectrumDict(inputDict):
    # Build the Spectrum, then convert any pandas objects among its fields
    # into plain Python structures that Spark's serializer can handle.
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        if isinstance(value, pd.Series):
            result[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
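A hedged usage sketch of the fixed UDF; the input column name "parsed" and the output column name "spectrum" are assumptions, not part of the original code:
# apply the UDF to the column holding the parsed metadata dictionaries
spectra = parsed.withColumn("spectrum", spectrumparser(F.col("parsed")))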
I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF).
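One quick way to see which entries are not plain strings, sketched under the assumption that excerpts is the Series from the question:
# boolean mask of entries that are not str; indexing with ~mask keeps the offenders
non_strings = excerpts[~excerpts.map(lambda v: isinstance(v, str))]
print(non_strings.head())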
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts (one row per document, one column per token), not the raw text. The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the TfidfTransformer documentation for sklearn).
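As a side note, scikit-learn's TfidfVectorizer combines both steps, so the same result can be sketched in one go (assuming excerpts is the same Series of raw strings):
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(excerpts)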
I found this useful pandas describe function, which I can call on a DataFrame like df.describe().
My code is now cluttered with meta information, and I wanted to introduce a debug version with debug meta information and one that just trains my NN, switched via a simple boolean flag. But I found that some commands, like pandas describe, will not produce any output when wrapped in an if statement.
The only workaround I found so far is wrapping it in a print call. It results in an ugly but correct output.
Why is that, or what am I doing wrong?
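For reference, the print workaround mentioned above looks something like this (a sketch; df and the debug flag are assumed names, not taken from the question's code):
import pandas as pd

df = pd.DataFrame({"loss": [0.9, 0.5, 0.3]})
debug = True
if debug:
    print(df.describe())  # works inside the if block, but prints plain text instead of the rich table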
You can use display and HTML to get what you want.
from IPython.display import display, HTML
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2], B=[3, 4]))
if True:
    display(HTML(df.to_html()))
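Applied to the describe() case from the question, the same idea works without going through to_html(), since display() renders the DataFrame that describe() returns; a sketch assuming a DataFrame df and a boolean debug flag:
from IPython.display import display

if debug:
    display(df.describe())  # rich table output even inside an if block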
How do I get the current time with GHCJS? Should I try to access Date or use the Haskell base libraries? Is there a utility function somewhere in the GHCJS base libraries?
The Data.Time.Clock module seems to work well:
import Data.Time.Clock (getCurrentTime)
import Data.Time.Format -- Show instance
main = do
  now <- getCurrentTime
  print now
The solution I currently have is quite ugly, but it works for me, so maybe it can save somebody some time:
{-# LANGUAGE JavaScriptFFI #-}
import GHCJS.Types( JSVal )
import GHCJS.Prim( fromJSString )
foreign import javascript unsafe "Date.now()+''" dateNow :: IO JSVal

-- this happens in IO: read the stringified timestamp back into an Integer
asInteger :: IO Integer
asInteger = fmap (read . fromJSString) dateNow
The ugliness comes from not finding a JSInteger type in GHCJS, which would be needed to receive the result of Date.now() directly, since it is a long integer. So I produce a string in JavaScript by concatenating a string to the result of Date.now(). At that point I could take a JSString as the result, but that is not an instance of Read, so read would not work on it. Instead I take a JSVal and convert it to a String using fromJSString.
Eventually there might be a JSInteger in GHCJS, or JSString might become an instance of Read, so if you are reading this from the future, try something more elegant!
I'm migrating from MATLAB to IPython, and before taking the leap I'm going through my minimal workflow to make sure every operation I perform daily in MATLAB for data crunching is available in IPython.
I'm currently stuck on the very basic task of saving and loading numpy arrays via a one-line command, such as MATLAB's:
>>> save('myresults.mat','a','b','c')
>>> load('myresults.mat')
In particular, what I like about MATLAB's load command is that not only does it read the data file, it also loads the variables into the workspace; nothing else is needed to start working with them. Note that this is not the case with, for instance, numpy.load(), which requires another line to assign the loaded values to workspace variables. [ See: IPython: how to automagically load npz file and assign values to variables? ]
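For concreteness, the numpy two-step pattern being referred to looks roughly like this (a sketch; the array names a, b, c and the filename are just examples):
import numpy as np

a, b, c = np.arange(3), np.zeros(2), np.eye(2)   # example arrays
np.savez('myresults.npz', a=a, b=b, c=c)         # saving is a one-liner
data = np.load('myresults.npz')                  # but loading returns an NpzFile...
a, b, c = data['a'], data['b'], data['c']        # ...and an extra line is needed to get workspace variables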
Based on the answers and comments to that question, I came up with this dirty-bad-engineering-ugly-coding-but-working solution. I know it's not pretty, and I would like to know if you can come up with the correct version of this [1].
I put this into iocustom.py:
def load(filename):
    ip = get_ipython()
    ip.ex("import numpy as np")
    ip.ex("locals().update(np.load('" + filename + "'))")
so that I can run, from the ipython session:
from iocustom import load
load('myresults.npz')
and the variables are dumped to the workspace.
I find it hard to believe there's no built-in equivalent to this, and it's even harder to believe that this 3-line function is the optimal solution. I would be very grateful if you could suggest a more correct way of doing this.
Please keep in mind that:
I'm looking for a solution which would also work inside a script and a function.
I know there's "pickle" but I refuse to use more than one line of code for something as mundane as a simple 'save' and/or 'load' command.
I know there's "savemat" and "loadmat" available from scipy, but I would like to migrate completely, i.e., do not work with mat files but with numpy arrays.
Thanks in advance for all your help.
[1] BTW: how do people working with ipython save and load a set of numpy arrays easily? After hours of googling I cannot seem to find a simple and straightforward solution for this daily task.
If I save this as load_on_run.py:
import argparse
import numpy as np
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-l', '--list', help='list variables', action='store_true')
    parser.add_argument('filename')
    __args = parser.parse_args()
    data = np.load(__args.filename)
    locals().update(data)
    del parser, data, argparse, np
    if __args.list:
        print([k for k in locals() if not k.startswith('__')])
    del __args
And then in ipython I can invoke it with %run:
In [384]: %run load_on_run testarrays.npz -l
['array2', 'array3', 'array4', 'array1']
In [385]: array3
Out[385]: array([-10, -9, -8, -7, -6, -5, -4, -3, -2, -1])
It neatly loads the arrays from the file into the ipython workspace.
I'm taking advantage of the fact that magic %run runs a script, leaving all functions and variables defined by it in the main namespace. I haven't looked into how it does this.
The script just takes a few arguments, loads the file (so far only .npz), and uses the locals().update trick to put its variables into the local namespace. Then I clear out the unnecessary variables and modules, leaving only the newly loaded ones.
I could probably define an alias for %run load_on_run.
I can also imagine a script along these lines that lets you load variables with an import: from <script> import *.
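A minimal sketch of that import-based idea, assuming the data lives in myresults.npz and the helper module is named results.py (both names hypothetical); after this, from results import * pulls the arrays into the current namespace:
# results.py (hypothetical helper module)
import numpy as np

_data = np.load('myresults.npz')
globals().update(_data)          # bind each stored array under its saved name
__all__ = list(_data.keys())     # so "from results import *" exports exactly those names
del np, _data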
You could assign the values in the npz file to global variables:
import numpy as np
def spill(filename):
    f = np.load(filename)
    for key, val in f.items():   # .items() works on NpzFile in both Python 2 and 3
        globals()[key] = val
    f.close()
This solution works in Python 2 and Python 3, and in any flavor of interactive shell, not just IPython. Using spill is fine for interactive use, but not for scripts, since:
It gives the file the ability to rebind arbitrary names to arbitrary values. That can lead to surprising, hard-to-debug behavior, or even be a security risk.
Dynamically created variable names are hard to program with. As the Zen of Python (import this) says, "Namespaces are one honking great idea -- let's do more of those!" For a script it is better to keep the values in the NpzFile, f, and access them by indexing, such as f['x'].
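For completeness, the script-friendly alternative mentioned above is simply explicit indexing (filename and key are examples):
import numpy as np

f = np.load('myresults.npz')
x = f['x']   # explicit lookup instead of injecting names into globals()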