Concatenate values of a string column and a long column in a pandas DataFrame

I have a pandas DataFrame which doesn't have a meaningful index yet (just an artificial 1, 2, 3, ... index).
Columns 'store' and 'style' are strings; columns 'color' and 'size' are long ints.
None of them is unique by itself, but their concatenation is unique.
I want to concatenate them to produce an index, but neither
df2['store']+df2['style']+str(df2['color'])+str(df2['size'])
nor
df2['store']+df2['style']+df2['color'].to_string()+df2['size'].to_string()
works. I think each call takes the whole column, forces it into a single string, and concatenates that, which results in weird symbols. Merges don't work correctly either.
What's the correct way to concatenate a string column and a long column?

This should be:
df2['store'] + df2['style'] + df2['color'].astype(str) + df2['size'].astype(str)
Explanation: str(df2['size']) makes a single string representation of the full column (comparable to what you see if you print the series), while .astype(str) converts each value of the series to a string.
to_string gives the same result as str() (but takes optional parameters to control the output).
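A minimal sketch with made-up sample data, showing the element-wise concatenation and its use as an index:
import pandas as pd

# Hypothetical data mirroring the question's columns
df2 = pd.DataFrame({
    'store': ['A', 'A', 'B'],
    'style': ['x', 'y', 'x'],
    'color': [101, 102, 101],
    'size': [40, 40, 42],
})

# .astype(str) converts values element-wise, so + concatenates per row
df2.index = df2['store'] + df2['style'] + df2['color'].astype(str) + df2['size'].astype(str)
print(df2.index)  # Index(['Ax10140', 'Ay10240', 'Bx10142'], dtype='object')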

Related

Is there an equivalent of an f-string in Google Sheets?

I am making a portfolio tracker in Google Sheets and wanted to know if there is a way to link the "TICKER" column with the code in the "PRICE" column that is used to pull JSON data from CoinGecko. I was wondering if there is something like an f-string in Python, where you can insert a variable into the string itself, so that every time the TICKER column is updated, the coin id is updated within the API request string. Essentially, string interpolation.
For example:
TICKER PRICE
BTC =importJSON("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids={BTC}","0.current_price")
You could use CONCATENATE for this:
https://support.google.com/docs/answer/3094123?hl=en
CONCATENATE function
Appends strings to one another.
Sample Usage
CONCATENATE("Welcome", " ", "to", " ", "Sheets!")
CONCATENATE(A1,A2,A3)
CONCATENATE(A2:B7)
Syntax
CONCATENATE(string1, [string2, ...])
string1 - The initial string.
string2 ... - [ OPTIONAL ] - Additional strings to append in sequence.
Notes
When a range with both width and height greater than 1 is specified, cell values are appended across rows rather than down columns. That is, CONCATENATE(A2:B7) is equivalent to CONCATENATE(A2,B2,A3,B3, ... , A7,B7).
See Also
SPLIT: Divides text around a specified character or string, and puts each fragment into a separate cell in the row.
JOIN: Concatenates the elements of one or more one-dimensional arrays using a specified delimiter.
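Applied to the question, a sketch (assuming the ticker/coin id sits in cell A2 and importJSON is the custom function from the question; note that CoinGecko ids are lowercase names such as bitcoin rather than tickers like BTC):
=importJSON(CONCATENATE("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids=", A2), "0.current_price")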

Python: Remove exponential notation in strings

I have been trying to remove the exponential notation from a string for the longest time, to no avail.
The column contains strings with letters in them as well as long numbers of more than 24 digits. I tried converting the column to string with .astype(str), but it just renders the value as "1.234123E+23". An example of the table is
A
345223423dd234324
1.234123E+23
How do I get the table to show the full string of digits in pandas?
b = "1.234123E+23"
str(int(float(b)))
The output is '123412299999999992791040' (note that the round-trip through float loses precision).
No idea how to do it in pandas with a mixed data type in the column, though.
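A sketch of one way to apply the same idea per row in pandas (the helper name is made up, and it inherits the float precision loss shown above; if exact digits matter, read the source with dtype=str so the numbers never become floats in the first place):
import pandas as pd

# Hypothetical column mixing plain strings and exponential-notation numbers
df = pd.DataFrame({'A': ['345223423dd234324', '1.234123E+23']})

def expand_exponential(s):
    # Rewrite entries that parse as floats; leave everything else untouched
    try:
        return str(int(float(s)))
    except ValueError:
        return s

df['A'] = df['A'].astype(str).map(expand_exponential)
print(df['A'].tolist())
# ['345223423dd234324', '123412299999999992791040']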

Using PySpark to convert a column from string to timestamp

I have a pyspark DataFrame with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below; times are captured as HHmm with "A" or "P" representing am or pm. The data also has errors where some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and subsequently convert the data (e.g., 0800, 1930, etc.) into timestamps for analysis purposes. I have tried to do this for the "Violation_Time" column, creating a new column "timestamp" to store the result (see code below). However, I can't seem to get it to work. Any form of help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn(
    'timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm'),  # parse to a timestamp; the date part defaults to 1970-01-01
        ' '
    ).getItem(1)  # split the 'yyyy-MM-dd HH:mm:ss' string on the space and keep the time part (index 1)
)
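A sketch of an alternative, under the assumption (consistent with the sample) that the digits are already 24-hour times and the trailing A/P is redundant: strip the letter with a regex, then parse with HHmm. Invalid entries such as 2540 fail to parse and come back as null, which makes the bad rows easy to filter out:
from pyspark.sql import functions as F

for col_name in ['Violation_Time', 'Time_First_Observed']:
    sparkdf3 = sparkdf3.withColumn(
        col_name + '_ts',
        # drop the trailing A/P, then parse the remaining HHmm digits
        F.to_timestamp(F.regexp_replace(F.col(col_name), '[AP]$', ''), 'HHmm')
    )
sparkdf3.select('Violation_Time', 'Violation_Time_ts').show()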

How to slice strings in a dataframe by finding another string's location, when that location differs in each row?

I have a dataframe column df_weather[1], like the following (the strings are very long):
S.No  Values
452   "temperature":83.0,
514   current "temperature":81.0,
653   new modified "temperature":89.0,
I want to extract the values after the string "temperature":, i.e. 83.0, 81.0 and 89.0.
For this I found the location of the string "temperature" in each row and stored all three positions in a list.
Second, I tried to slice the strings by taking each member of the list:
list_weather = df_weather[1].str.find('"temperature":').tolist()
list_temp = []
for member in list_weather:
    list_temp.append(df_weather[1].str.slice(member+14, member+18))
but the last line applies each offset (member+14, where 14 is the length of '"temperature":') to the full column, so it slices the whole dataframe three times and returns three sets. In each set only the row whose offset was used is correct; the other two are not.
i.e. the first set in list_temp is 83.0, ratu, temp.

Delete rows based on char in the index string

I have the following dataframe:
df = pd.DataFrame(np.random.randn(4, 1), index=['mark13', 'luisgimenez', 'miguel72', 'luis34'],columns=['probability'])
probability
mark13 -1.054687
luisgimenez 0.081224
miguel72 -0.893619
luis34 -1.576941
I would like to remove the rows where the last character in the index string is not a number.
The desired output would look something like this :
(dropping the row where the index does not finish with a number)
probability
mark13 -1.054687
miguel72 -0.893619
luis34 -1.576941
I am sure the direction I need is boolean indexing, but I do not know how I could reference the last character in the index name.
# Use isdigit to check the last char of your index, as a mask array to filter rows.
df[[e[-1].isdigit() for e in df.index]]
Out[496]:
probability
mark13 -0.111338
miguel72 0.548725
luis34 0.682949
You can use the str accessor to check if the last character is a number:
df[df.index.str[-1].str.isdigit()]
Out:
probability
mark13 -0.350466
miguel72 1.220434
luis34 -0.962123
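For reproducibility (the sample outputs above differ because each answer re-drew values from np.random.randn), a seeded end-to-end version:
import numpy as np
import pandas as pd

np.random.seed(0)  # fix the random draw so the output repeats
df = pd.DataFrame(np.random.randn(4, 1),
                  index=['mark13', 'luisgimenez', 'miguel72', 'luis34'],
                  columns=['probability'])
print(df[df.index.str[-1].str.isdigit()])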