python pandas describe groupby output question

I use describe with groupby on a dataframe such as:
df_stats = df[["x","y"]].groupby([df["key1"], df["key2"]]).describe()
This produces the standard set of stats for "x" and "y" in df by key1/key2 value combinations. These combinations are, in general, not known a priori.
If I print df_stats it will list the various combinations of key1/key2 values as row values, but I
cannot find out where these combination values are being stored (they should be in the df_stats dataframe, no?).
The objective is to let other dataframe data "look up" values in the df_stats rows so they can be included in calculations. Appreciate any insights in advance; I'm pulling my hair out.
thx - j
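For reference, a minimal sketch of where those key combinations end up, assuming the intent is a groupby(...).describe() call as above: the key1/key2 pairs are stored in the result's row MultiIndex, which can be inspected directly, used with .loc, or flattened with reset_index() so other dataframes can merge against it (the data below is made up):

import pandas as pd

# Toy frame with two key columns and two value columns (made-up data).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": ["x", "y", "x", "y"],
    "x":    [1.0, 2.0, 3.0, 4.0],
    "y":    [5.0, 6.0, 7.0, 8.0],
})

# Stats for "x" and "y" per key1/key2 combination.
df_stats = df.groupby(["key1", "key2"])[["x", "y"]].describe()

# The key combinations live in the row MultiIndex, not in a regular column.
print(df_stats.index.tolist())        # [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]

# Look up the stats for one combination.
print(df_stats.loc[("a", "x")])

# Or turn the keys back into ordinary columns so another dataframe can merge on them.
flat = df_stats.reset_index()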

Related

SQL fill null values with another column

My problem is that I have a dataframe with null values, and these null values should be filled from another column of the same dataframe. I would like to know how to take that other column and use its information to fill in the missing data. I'm using Deepnote.
link:
https://deepnote.com
For example:
Column A | Column B
Cell 1   | Cell 2
NULL     | Cell 4
My desired output:
Column A
Cell 1
Cell 4
I think it should be done with subqueries and some WHERE clause, any ideas?
Thanks for the question and welcome to Stack Overflow.
It is not 100% clear which direction you need your solution to go, so I am offering two alternatives which I think should get you going.
Pandas way
You seem to be working with Pandas dataframes. The usual way to work with Pandas dataframes is to use Pandas' built-in functions. In this case, there is literally a function for filling null values, called fillna. We can use it to fill values from another column like this:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})
# copying the original dataframe to a clean one
df_clean = df_raw.copy()
# applying fillna to fill null values from another column
df_clean['Column A'] = df_clean['Column A'].fillna(df_clean['Column B'])
This will make Column A of your df_clean look like you need:
Column A
Cell 1
Cell 4
Dataframe SQL way
You mentioned "queries" and "where" in your question, which suggests you might be working with some combination of the Python and SQL worlds. Enter DuckDB, which supports exactly this; in Deepnote we call these Dataframe SQLs.
You can query e.g. CSV files directly from these Dataframe SQL blocks, but you can also use a previously defined Dataframe.
select * from df_raw
In order to fill the null values as you are requesting, we can use standard SQL and a function called coalesce, as Paul correctly pointed out.
select coalesce("Column A", "Column B") as "Column A" from df_raw
This will also create what you need in the SQL world. In Deepnote, specifically, this will also give you a dataframe.
Column A
Cell 1
Cell 4
Feel free to check out my project in Deepnote with these examples, and go ahead and duplicate it if you want to iterate on the code a bit. There are also plenty more alternatives: if you're in a real SQL database and want to update existing columns, you would use an UPDATE statement. And if you are in pure Python, this is of course also possible with a loop or lambda functions.
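For completeness, the loop/lambda route could look roughly like this in plain pandas (a sketch that does the same thing as fillna, just row by row, reusing df_raw and the pandas import from the snippet above):

# Row-by-row alternative to fillna: take Column B wherever Column A is null.
df_clean = df_raw.copy()
df_clean['Column A'] = df_clean.apply(
    lambda row: row['Column B'] if pd.isna(row['Column A']) else row['Column A'],
    axis=1,
)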

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that is throwing a DtypeWarning: Columns ... (I understand this warning) but am struggling to prevent it. I don't want to set low_memory to False, as I would like to specify the correct dtypes.
For every column, the majority of rows are float values and the last 3 rows are strings (metadata, basically: information about each column). I understand that I can set the dtype per column when reading in the CSV, however I do not know how to make rows 1:n float32, for example, and the last 3 rows strings. I would like to avoid reading in two separate CSVs. The resulting dtype of all columns after reading in the dataframe is 'object'. Below is a reproducible example. The dtype warning is not thrown when reading it in, I am guessing because of the size of the dataframe; however, the result is exactly the same as the problem I am facing. I would like to make the first 3 rows float32 and the last 3 rows strings so that they have the correct dtype. Thank you!
Reproducible example:
import pandas as pd

df = pd.DataFrame([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3],
                   ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3']],
                  index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
                  columns=['column1', 'column2', 'column3'])
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)
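One way to end up with the correct dtypes without reading the file twice is to read everything in as object and then split the frame; this is a sketch assuming the metadata rows are always the last three:

import pandas as pd

df1 = pd.read_csv('test.csv', index_col=0)   # everything comes in as object

data = df1.iloc[:-3].astype('float32')       # numeric rows, cast to float32
meta = df1.iloc[-3:].astype(str)             # metadata rows, kept as strings

print(data.dtypes)   # float32 for every column
print(meta.dtypes)   # object (strings) for every column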

counting each value in dataframe

So I want to create a plot or graph. I have time series data.
My dataframe looks like this:
df.head()
I need to count values in df['status'] (there are 4 different values) and df['group_name'] (2 different values) for each day.
So I want to have a date index and a count of how many times each value from df['status'] appears, as well as from df['group_name']. It should return a Series.
I used spam.groupby('date')['column'].value_counts().unstack().fillna(0).astype(int) and it is working as it should. Thank you all for the help.
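A small runnable sketch of that approach; the column names date, status, and group_name come from the question, while the data itself is made up:

import pandas as pd

# Made-up time series data with a status and a group_name per row.
spam = pd.DataFrame({
    'date':       ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'status':     ['open', 'closed', 'open', 'open'],
    'group_name': ['A', 'A', 'B', 'B'],
})

# Count how often each status appears per day; unstack turns the statuses into columns.
status_counts = (
    spam.groupby('date')['status']
        .value_counts()
        .unstack()
        .fillna(0)
        .astype(int)
)
print(status_counts)

# The same pattern works for group_name.
group_counts = spam.groupby('date')['group_name'].value_counts().unstack().fillna(0).astype(int)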

Making Many Empty Columns in PySpark

I have a list of many dataframes, each with a subset schema of a master schema. In order to union these dataframes, I need to construct a common schema among all of them. My thought is that I need to create empty columns for all the missing columns in each of the dataframes. I have, on average, about 80 missing features and 100s of dataframes.
This is somewhat of a duplicate of, or inspired by, Concatenate two PySpark dataframes.
I am currently implementing things this way:
from pyspark.sql.functions import lit
for df in dfs:  # list of dataframes
    for feature in missing_features:  # list of strings
        df = df.withColumn(feature, lit(None).cast("string"))
This seems to be taking a significant amount of time. Is there a faster way to concat these dataframes with null in place of missing features?
You might be able to cut time a little by replacing your code with:
cols = ["*"] + [lit(None).cast("string").alias(f) for f in missing_features]
dfs_new = [df.select(cols) for df in dfs]
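If the end goal is to union them, something along these lines could follow (a sketch; unionByName matches columns by name rather than position, which is safer here since each dataframe's own columns come first in the select above):

from functools import reduce

# Union all of the aligned dataframes into one, matching columns by name.
combined = reduce(lambda left, right: left.unionByName(right), dfs_new)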

Working with dataframe / matrix to create an input for sklearn & Tensorflow

I am working with pandas / Python / numpy / Datalab / BigQuery to generate an input table for machine learning processing. The data is genomic, and right now I am working with a small subset of 174 rows and 12,430 columns.
The column names are extracted from BigQuery (df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard', use_cache=True)).
In the same way, the row names are extracted: samples_rows = bq.Query('SELECT sample_id FROM `speedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3` GROUP BY sample_id')
What would be the easiest way to create a dataframe / matrix with the named rows and columns that were extracted?
I explored the dataframes in pandas and could not find a way to pass the names as parameters.
For an empty array, I was able to find the following (numpy), with no names:
a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns  # numpy arrays have no named columns, so this attribute does not exist
I know R very well (if there is no other way, I hope that I can use it with Datalab).
Any ideas?
Many thanks!
If you have your column names and row names stored in lists, then you can just use .loc to select the exact rows and columns you desire. Just make sure that the row names are in the index. You might need to do df.set_index('sample_id') to put the correct row names in the index.
Assuming the rows and columns are in variables row_names and col_names, do this.
df.loc[row_names, col_names]
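A small sketch of both steps with made-up names; row_names, col_names, and some_df below are placeholders standing in for the lists and dataframe pulled from BigQuery:

import numpy as np
import pandas as pd

row_names = ['sample_1', 'sample_2']      # placeholder sample ids
col_names = ['feature_a', 'feature_b']    # placeholder feature names

# An empty (NaN-filled) frame with named rows and columns, built directly.
df = pd.DataFrame(np.full((len(row_names), len(col_names)), np.nan),
                  index=row_names, columns=col_names)

# If the values already live in a dataframe that has a sample_id column,
# move it into the index first and then select rows/columns by name:
# df_named = some_df.set_index('sample_id').loc[row_names, col_names]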