I have a csv file with this structure:
id;matrix
1;[[1.2 1.3] [1.2 1.3] [1.2 1.3]]
I'm trying to read the matrix field as a numpy.ndarray, using pandas.read_csv to read the file and df.to_numpy() to convert the matrix, but the resulting array has shape (1, 0). I was expecting the shape to be (3, 2), as in:
matrix = [[1.2 1.3]
          [1.2 1.3]
          [1.2 1.3]]
I also tried numpy.asmatrix, but the result is the same as with df.to_numpy().
A solution with pandas
Provided the format of the matrix column is consistent with that shown in the example, replace the spaces with commas, use literal_eval to turn each string into a list of lists, and then apply np.array.
import pandas as pd
from ast import literal_eval
import numpy as np
# read the data
df = pd.read_csv('file.csv', sep=';')
# replace the spaces
df['matrix'] = df['matrix'].str.replace(' ', ',')
# apply literal_eval
df['matrix'] = df['matrix'].apply(literal_eval)
# apply numpy array
df['matrix'] = df['matrix'].apply(np.array)
print(type(df.iloc[0, 1]))
>>> <class 'numpy.ndarray'>
Each row of the matrix column will be an ndarray
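To check the shape the question was expecting, or to collect the whole column into one 3-D array, here is a small sketch (assuming every row parses to the same 3x2 shape):
print(df.iloc[0, 1].shape)
>>> (3, 2)
# stack all rows into a single (n_rows, 3, 2) ndarray
stacked = np.stack(df['matrix'].to_list())
print(stacked.shape)
>>> (1, 3, 2)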
The two apply calls can be combined into:
df['matrix'] = df['matrix'].apply(lambda x: np.array(literal_eval(x)))
Or this hot mess:
df['matrix'] = df['matrix'].str.replace(' ', ',').apply(lambda x: np.array(literal_eval(x)))
I personally prefer one transformation per line for code clarity.
Related
I want to convert multidimensional climate data into a pandas data frame. The shape of my numpy array is temperature.shape -> (365, 100, 200) -> ["time", "longitude", "latitude"]. I would then like to have the following columns in my pandas dataframe: columns=["time", "lon", "lat", "temp"].
I tried this code:
df = pd.DataFrame(temperature, columns=['time', 'lat', 'lon', 'temp'])
I got this error:
ValueError: Must pass 2-d input
How can I solve it? I could not find any hint in suggested topics. Thanks.
Pandas expects a 2D array whose rows and columns correspond to those of the final data frame.
It looks like you're trying to unravel the (365,100,200) array into 365*100*200 = 7,300,000 individual records. This can be done by flattening the array, provided you have the values of each independent quantity along each axis.
For example, here is how to unravel a (3,4,5)-shaped 3D array M (standing in for temperature) whose X, Y and Z dimensions (standing in for time, longitude and latitude) are given by the lists x_index, y_index and z_index:
import numpy as np
import pandas as pd

nx = 3
ny = 4
nz = 5

# construct an nx by ny by nz array from the function f(x,y,z) = (x+y)*z
M = np.zeros((nx, ny, nz))
for i in range(nx):
    for j in range(ny):
        for k in range(nz):
            M[i, j, k] = (i + j) * k

# lists giving the values of x/y/z along each axis
x_index = list(range(nx))
y_index = list(range(ny))
z_index = list(range(nz))

# make (3,4,5) arrays of each independent variable; indexing='ij' keeps
# the axis order consistent with M's shape (the default 'xy' would swap
# the first two axes and misalign the flattened columns)
X, Y, Z = np.meshgrid(x_index, y_index, z_index, indexing='ij')

# flatten the data and the independent variables to make 3*4*5=60 individual records
pd.DataFrame({"M=(X+Y)*Z": M.flatten(), "X": X.flatten(), "Y": Y.flatten(), "Z": Z.flatten()})
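Applied to the question's data, the same pattern would look something like the following sketch, where times, lons and lats are assumed (hypothetical) 1-D coordinate arrays matching the three axes of temperature:
# times, lons, lats: assumed 1-D coordinate arrays for each axis
T, LON, LAT = np.meshgrid(times, lons, lats, indexing='ij')
df = pd.DataFrame({"time": T.flatten(),
                   "lon": LON.flatten(),
                   "lat": LAT.flatten(),
                   "temp": temperature.flatten()})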
I am a beginner in Python and I am stuck with data which is an array of 32763 numbers, separated by commas. Please find the data here.
I want to convert this into two columns: column 1 from (0:16382) and the 2nd column from the rest (up to 32763). In the end I want to plot column 1 as the x axis and column 2 as the y axis. I tried the following code, but I am not able to extract the columns:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
df = pd.DataFrame(data.flatten())
print(df)
and then I want to write the data to some file, say data1, in the format shown in the attached picture.
It is hard to answer without seeing the format of your data, but you can try:
data = np.genfromtxt('oscilloscope.txt', delimiter=',')
print(data.shape)  # here we check we got something useful
# split data into x, y halves; the two columns must have equal length,
# so with an odd number of values (32763) the last one is dropped
half = len(data) // 2  # 16381
x = data[:half]
y = data[half:2 * half]
# now you can create a dataframe and print it to a file
df = pd.DataFrame({'x': x, 'y': y})
df.to_csv('data1.csv', index=False)
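For the plot the question asks for, a minimal sketch (assuming the df built above):
import matplotlib.pyplot as plt
plt.plot(df['x'], df['y'])
plt.xlabel('column 1')
plt.ylabel('column 2')
plt.show()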
Try this.
# input: a dataframe df and a chunk_size of your choice; output: a list of chunks
def split_dataframe(df, chunk_size=16382):
    chunks = []
    # ceiling division avoids an empty trailing chunk when len(df) divides evenly
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
    return chunks
Or use np.array_split, sketched below.
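A minimal sketch of that route, assuming df holds the 32763 values read above:
import numpy as np
# two roughly equal halves; with an odd length the first half gets the extra row
first, second = np.array_split(df, 2)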
I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
s = s.set_index('Angle') #Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]
print(result)
You may use numpy (note that np.interp assumes the x-coordinates, here the index, are increasing):
import numpy as np
np.interp(3.4, s.index, s.rff)
# 59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
>>> 59.6
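If you'd rather stay in pandas, one alternative is to insert the target angle into the index and interpolate on the index values; a sketch, assuming s has Angle as its index (via s.set_index('Angle'), as in the question's first snippet):
s2 = s.reindex(s.index.union([3.4])).interpolate(method='index')
print(s2.loc[3.4, 'rff'])
>>> 59.6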
I used this method to clean the currency column of my data of "£" symbols and "," separators, and to convert the non-numeric placeholder values (here 'POA') to NaN.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
### Reading the excel file with dtype
df = pd.read_excel("Housing Market B16+5Miles.xlsx", dtype={"Price" : str})
df.loc[df['Price'] == 'POA','Price'] = np.nan
House_Price = df["Price"].str.replace(",","").str.replace("£","").astype("float")
del df['Price']
df["Price"] = House_Price
df
df.describe()
When describing the dataframe, the "Price" column is shown entirely as decimals with an e-value at the end (scientific notation). Why did this happen, and will it affect my analysis moving forward?
Your pandas settings might be set to display large numbers in scientific notation; it is only a display format, so the underlying values are unchanged and your analysis is unaffected. You can change the display using pd.set_option('display.float_format', lambda x: '%.3f' % x).
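A minimal sketch of the effect with made-up prices (the exact default rendering depends on your pandas version and settings):
import pandas as pd
df = pd.DataFrame({"Price": [250000.0, 1250000.0, 89950.0]})
print(df.describe())  # may show values like 1.250000e+06
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df.describe())  # now shows plain decimals like 1250000.000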
I have an RDD of DenseVector objects and I want to:
Select one of these vectors (one row)
Multiply this vector with all the other vector rows, in order to compute a (cosine) similarity
Basically I am trying to perform a dot product between a vector and a matrix, starting from an RDD. For reference, the RDD contains TF-IDF values built with Spark ML, which furnishes a dataframe of SparseVectors, and have been mapped to DenseVectors in order to do the multiplication. The dataframe and corresponding RDD are called tfidf_df and tfidf_rdd respectively.
What I do, which works, is the following (full script with sample data):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, Tokenizer, CountVectorizer
from pyspark.mllib.linalg import DenseVector
import numpy as np
sc = SparkContext()
sqlc = SQLContext(sc)
spark_session = SparkSession(sc)
sentenceData = spark_session.createDataFrame([
    (0, "I go to school school is good"),
    (1, "I like school"),
    (2, "I also like cinema")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="tokens")
tokens_df = tokenizer.transform(sentenceData)
# TF feats
count_vectorizer = CountVectorizer(inputCol="tokens",
                                   outputCol="tf_features")
model = count_vectorizer.fit(tokens_df)
tf_df = model.transform(tokens_df)
print(model.vocabulary)
print(tf_df.rdd.take(5))
idf = IDF(inputCol="tf_features",
          outputCol="tf_idf_features")
model = idf.fit(tf_df)
tfidf_df = model.transform(tf_df)
# Transform into RDD of dense vectors
tfidf_rdd = tfidf_df.select("tf_idf_features") \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray()))
print(tfidf_rdd.take(3))
# Select the test vector
test_label = 1
vec = tfidf_df.filter(tfidf_df.label == test_label) \
    .select('tf_idf_features') \
    .rdd \
    .map(lambda row: DenseVector(row.tf_idf_features.toArray())).collect()[0]
rddB = tfidf_rdd.map(lambda row: np.dot(row / np.linalg.norm(row),
                                        vec / np.linalg.norm(vec))) \
    .zipWithIndex()
# print('*** multiplication', rddB.take(20))
# Sort the similarities
sorted_rddB = rddB.sortByKey(False)
print(sorted_rddB.take(20))
The test vector has been selected as the one whose label is 1. The end result with similarities is (from the last print statement) [(1.0000000000000002, 1), (0.27105728525552131, 0), (0.1991208898963957, 2)] where indexing has been used to trace back to the original dataset.
This works fine but looks a bit clunky. I'm looking for best practices to multiply a selected row of a dataframe (a vector) with all the dataframe's vectors. I am open to any suggestions about the workflow, specifically performance-related ones.
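Not a definitive best practice, but one small cleanup within the same workflow: normalize the test vector once outside the closure, and use DenseVector's own norm and dot methods instead of round-tripping through np.dot. A sketch:
# normalize the selected vector once, outside the map
unit_vec = vec / vec.norm(2)
rddB = tfidf_rdd.map(lambda row: float((row / row.norm(2)).dot(unit_vec))) \
    .zipWithIndex()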