Selecting columns using the below code but getting only the top row - pandas

Selecting columns using pandas but only getting the top row. I am using the following code:
# column names: "Subject", "Original Product Version", "Software Version", "Software Version #", "Software Release", "Component", "Sub-Component"
data1 = data.loc[:, 'Subject':'Sub-Component']
I was expecting this to select all the columns, but it does not; I am only getting the column names/headers as the result.

If you want to select specific columns, put the column names into a list and pass that:
col_list = ['Subject', 'Sub-Component']
data1 = data.loc[:, col_list]
If you want to select a range of columns by position, use iloc instead (the end index is exclusive):
data1 = data.iloc[:, 3:8]
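For what it's worth, the label slice in the question should normally return every column between the two labels, endpoints included. A minimal sketch (the row values here are invented; only the column names come from the question):
import pandas as pd

# One made-up row using the column names from the question
data = pd.DataFrame(
    [["math", "1.0", "2.1", 3, "R1", "core", "ui"]],
    columns=["Subject", "Original Product Version", "Software Version",
             "Software Version #", "Software Release", "Component",
             "Sub-Component"])

# Label-based slice: both endpoints are included, so this returns all seven columns
data1 = data.loc[:, "Subject":"Sub-Component"]

# Position-based slice: the end index is exclusive, so 0:7 also covers all seven
data2 = data.iloc[:, 0:7]
If the slice still comes back wrong on the real data, checking the column labels for stray whitespace is a common first step.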

Related

Automatically inferring the column list for pivot in Spark SQL

In Spark SQL, is there any way to automatically infer the distinct column values in the pivot operator?
data_list = [("maaruti",1000,"hyderabad"),
("tata",2000,"mumbai"),
("hyundai",1500,"delhi"),
("mahindra",1200,"chennai"),
("maaruti",1200,"mumbai"),
("tata",1000,"delhi"),
("hyundai",2000,"chennai"),
("mahindra",1500,"hyderabad"),
("tata",1100,"delhi"),
("mahindra",1200,"chennai")
]
df = spark.createDataFrame(data_list).toDF("company", "sales", "city")
df.show()
DataFrame approach
df.groupby("company","city").sum("sales").groupby("company").pivot("city").sum('sum(sales)').show()
Here the distinct values in the city column are automatically inferred.
Spark SQL approach
df.createOrReplaceTempView("tab")
spark.sql("""
select * from tab pivot(sum(sales) as assum
for city in ('delhi','mumbai','hyderabad','chennai'))
""").show()
The above snippet gives the desired output; however, the column list needs to be specified manually for the distinct city column values. Is there any way to do this automatically?
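One possible workaround (a sketch, assuming the df and the tab temp view created above) is to collect the distinct city values first and splice them into the pivot SQL:
# Collect the distinct city values from the DataFrame
cities = [row["city"] for row in df.select("city").distinct().collect()]

# Build the IN (...) list dynamically and interpolate it into the pivot query
in_list = ", ".join("'{}'".format(c) for c in cities)
spark.sql("""
select * from tab pivot(sum(sales) as assum
for city in ({}))
""".format(in_list)).show()
Interpolating values into SQL like this is only reasonable when the values are trusted, as they are here.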

How to use a Google Sheets pivot query to output strings

I have a (much larger) table like this sample:
I am trying to output a table that looks like this:
The closest I can get with a pivot query returns numerical results in the value fields, rather than the desired text strings:
=query(Data, "Select D,count(D) group by D Pivot B")
I resorted to a series of formulas to build my row and column headers, and then fill in the data field (see Version 3 in the sample sheet). But I couldn't figure out how to fill in the data with a single formula, as opposed to copying and pasting in the data field, which is not desirable with a dynamic number of row and column headers based on the original data.
Is there a way to wrap my data field formula (in cell B44 of the sample) in an arrayformula that will fill the data field, with a dynamic number of columns and rows?
Or, even more advanced: is there a single formula that will deliver my desired results table?
This should work. It's a bit difficult to explain, but I could demonstrate the various parts if you opened up your sheet to be editable:
=ARRAYFORMULA(TRANSPOSE(QUERY(TRIM(SPLIT(TRANSPOSE(QUERY(QUERY({CHAR(10)&A2:A11,B2:B11&"|"&D2:D11&"|"},"select MAX(Col1) group by Col1 pivot Col2"),,9^9)),"|",0,0)),"select Col1,MAX(Col3) where Col1<>'' group by Col1 pivot Col2 order by Col1 desc label Col1'Project'")))

Pandas pivot table: get only the value counts and not the other columns

I am trying to make a pivot table from a data set with many columns.
When making a pivot table with the code below, I get all the columns, which I don't want.
I only want the counts and no other columns. Can I achieve this?
table1 = pd.pivot_table(dfCALCNoExcecption,index=['AD Platform','Agent Program'],columns=None,aggfunc='count')
The output of the above code in Excel looks like the below (I have not pasted the whole thing, as there are around 50 columns):
The desired output I am trying to get:
You can group your data by the columns 'AD Platform' and 'Agent Program'. After that, you can sum the values of the column that holds the quantity of machines. Here is my code:
df.groupby(['AD Platform', 'Agent Program'])['AD Hostname'].sum()
This is not complete, but part of it can be achieved with groupby. I am not sure how to rename the third column to "Count":
dfAgentTable3 = dfCALCNoExcecption.groupby(['AD Platform', 'Agent Program'])['AD Hostname'].count().sort_index(ascending=True)
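For the renaming, one option (a sketch reusing the names above) is to let reset_index turn the grouped Series back into a DataFrame and name the count column in the same step:
# Flatten the group index and label the counts column "Count"
dfAgentTable3 = (dfCALCNoExcecption
    .groupby(['AD Platform', 'Agent Program'])['AD Hostname']
    .count()
    .reset_index(name='Count')
    .sort_values(['AD Platform', 'Agent Program']))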

Set pandas dataframe index to a column name when the column name is not unique

I have two tables of stock tickers.
I create a SQL join query to combine the two tables:
query_combined = session\
    .query(Table1, Table2)\
    .join(Table2, Table1.ticker == Table2.ticker)
I then feed the SQL to pandas to load into a frame:
df_combined = pandas\
    .read_sql(query_combined.statement,
              query_combined.session.bind,
              index_col='ticker')
However, since there are two "ticker" columns in the joined tables, setting index_col='ticker' results in a tuple of ('ticker', 'ticker') for the index column. I just want to specify one of the "ticker" columns as the dataframe index, but am unsure how.
I am new to pandas and am sure this is very simple, but in my hour of Googling, I haven't found the answer. Many thanks in advance for pointing me in the right direction.
Consider with_labels to qualify ambiguous columns with underscores <table>_<column>:
df_combined = (pandas
    .read_sql(query_combined.with_labels().statement,
              query_combined.session.bind,
              index_col='Table1_ticker'))
To shorten the table names, alias the tables before the join (aliased comes from sqlalchemy.orm):
from sqlalchemy.orm import aliased

t1 = aliased(Table1, name='t1')
t2 = aliased(Table2, name='t2')
query_combined = (session
    .query(t1, t2)
    .join(t2, t1.ticker == t2.ticker))
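With the aliases in place, the labeled index column should come out as t1_ticker (an assumption based on the alias name above; check the generated statement if unsure):
df_combined = (pandas
    .read_sql(query_combined.with_labels().statement,
              query_combined.session.bind,
              index_col='t1_ticker'))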

How can I aggregate jsonb columns in Postgres using another column type

I have the following data in a Postgres table, where data is a jsonb column. I would like to get the result as:
[
  {field_type: "Design", briefings_count: 1, meetings_count: 13},
  {field_type: "Engineering", briefings_count: 1, meetings_count: 13},
  {field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
Explanation
Use the jsonb_each_text function to extract the data from the jsonb column named data. Then aggregate the rows with GROUP BY to get one row for each distinct field_type. Each aggregation also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression so that the two counts land in two separate columns. On top of that, apply the coalesce function to return 0 instead of NULL when some information is missing; in your example that would be briefings for Data Science.
At the outer level of the statement, now that we have the results as a table of fields, we need to build a jsonb object per row and aggregate them all into one row. For that we use jsonb_build_object, passing it pairs consisting of a field name and a value. That leaves us with 3 rows of data, each holding a separate jsonb value. Since we want only one row (an aggregated json) in the output, we apply jsonb_agg on top. This produces the result you're looking for.
Code
select
  jsonb_agg(
    jsonb_build_object('field_type', field_type,
                       'briefings_count', briefings_count,
                       'meetings_count', meetings_count)
  ) as agg_data
from (
  select
    j.k as field_type,
    coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end), 0) as briefings_count,
    coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end), 0) as meetings_count
  from tbl t,
       jsonb_each_text(data) j(k, v)
  group by j.k
) t
You can aggregate the column like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array instead, for example jsonb_agg(expression); here that would be jsonb_agg(data).