show all unique value from column - pandas

I try to show all value from a dataframe column with this code combinaison_df['coalesce'].unique()
but I got this result :
array([1.12440191, 0.54054054, 0.67510549, ..., 1.011378 , 1.39026812,
1.99637024])
I need to see all the values. Do you have an idea to do?
Thanks

If I have this situation but don't want to change it for every print() statement I found this solution with a context manager quite handy:
with pd.option_context('display.max_rows', None):
print(combinaison_df['coalesce'].unique())

Related

splitting columns with str.split() not changing the outcome

Will I have to use the str.split() for an exercise. I have a column called title and it looks like this:
and i need to split it into two columns Name and Season, the following code does not through an error but it doesn't seem to be doing anything as well when i'm testing it with df.head()
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
Any help as to why?
The code you have in your question is correct, and should be working. The issue could be coming from the execution order of your code though, if you're using Jupyter Notebook or some method that allows for unclear ordering of code execution.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing those lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
print(df.columns)
Alternatively you could add an assertion step in your code to ensure it's working as well:
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"

How to how to create a Dataframe based on a condition

I want to create a new DataFrame from another for rows that meet a condition such as:
uk_cities_df['location'] = cities_df['El Tarter'].where(cities_df['AD'] == 'GB')
uk_cities_df[:5]
but the resulting uk_cities_df is returning NaN.
The csv file that I am needing to extract from has no headers so it used the first row values for such. I need to only include rows in uk_cities_df include the ISO code "GB" so "El Tarter" denotes the values for location and "AD" for iso code.
Could you please provide a visual of what uk_cities_df and cities_df look like ?
From what I can gather, I think you might be looking for the .loc operator,
you could try for example :
uk_cities_df['location'] = cities_df.loc[cities_df['AD'] == 'GB']['location']
Also, I did not really get what role 'El Tarter' plays here, maybe you could give more details ?

Nested Set Analysis in QlikSense

My User_ID has several images and with this piece of code I receive the largest IMAGE_NR .
max({$<USER_ID = {'8638087'}> } IMAGE_NR)
Every image number is linked to an IMAGE_ID. How do I get this?
In words:
Give me IMAGE_ID where IMAGE_NR = largest IMAGE_NR of my USER_ID
I tried following that doesn't work:
sum({$<max({<USER_ID={'8638087'}> } IMAGE_NR)>} IMAGE_ID)
Thank you for any thoughts!
Here is a step-by-step solution.
You mention that this already brings back the correct IMAGE_NR:
max({$<USER_ID = {'8638087'}> } IMAGE_NR)
Now you want the IMAGE_ID. The logic is:
=only({CORRECT_IMAGE_NR_SET_ANALYSIS} IMAGE_ID)
I generally prefer to avoid search mode (double quotes inside element list) and instead use a higher evaluation level on the top calculation, so I would recommend:
$(=
'only({<IMAGE_NR={'
& max({$<USER_ID = {'8638087'}> } IMAGE_NR)
& '}>} IMAGE_ID)'
)
This would also provide you a nice preview formula in the formula editor, something like:
only({<IMAGE_NR={210391287}>} IMAGE_ID)
I didn't test it, but i believe it's something like :
only({<IMAGE_NR={'$(=max(IMAGE_NR))'}, USER_ID={'8638087'}>} IMAGE_ID)
If the dimension of the table is USER_ID then this will return the IMAGE_ID of the maximum IMAGE_NR per USER_ID. Should work / return results for any dimension, but then will have to be read as maximum IMAGE_NR per whatever that dimesion is. The minus on IMAGE_NR is to get the largest/last sorted value
firstsortedvalue(IMAGE_ID,-IMAGE_NR)
If you're using it in a text box with no dimension then you can also add the set analysis
firstsortedvalue({<USER_ID={'8638087'}>} IMAGE_ID,-IMAGE_NR)

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, Spark-hive 2.10(1.6.1), scala-reflect 2.11.2. I'm trying to filter a dataframe created through hive context...
df = hiveCtx.createDataFrame(someRDDRow,
someDF.schema());
One of the column that I'm trying to filter has multiple single quotes in it. My filter query will be something similar to
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my java class where this filter occurs, I tried to replace the String variable for e.g commentValueToFilterOut, which contains the value "That's Dany's Reply" with
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when apply the filter to the dataframe I'm getting the below error...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters in it, that may cause issues (like singe quote)), to the new column without any special characters.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Trim out the special characters from the condition and apply the filter.
commentToFilter = "That's Dany's Reply'"
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now, the filter has been applied, you can drop the new column that you created for the sole purpose of filtering and restore it to the original dataframe.
df = df.drop("comment_new");
If you dont wnat to create a new column in the dataframe, you can also replace the special character with some "never-happen" string literal in the same column, for e.g
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to apply against
comment_new commentToFilter = "That's Dany's Reply'"
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Once filtering is done restore the actual value by reverse-applying the string litteral
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"^^^^", "'"));
Though It's not answer the actual issue, but someone having the same issue, can try this out as a workaround.
The actual solution could be, use sqlContext (instead of hiveContext) and / or Dataset (instead of dataframe) and / or upgrade to spark hive 2.12.
experts to debate & answer
PS: Thanks to KP, my lead

How can I use "Expression.Not" with text field?

How can I use "Expression.Not" with text field?
I need to select all records from NHQuestionCount except "ktest"
for example this code return runtime error
NHQuestionCount[] stats = NHQuestionCount.FindAll(Order.Asc("NameFull"), Expression.Not(Expression.Eq("NameFull", "ktest")));
I can't comment on the rest of your code, but your use of Expression is exactly right.