Getting summaries by subgroups of your data {plyr}

highest profit margin (as point data)

highestMargin <- max(companiesData$margin)

highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]

highestMargin <- subset(companiesData, margin==max(margin))

the highest margin (as dataframe)
- NULL as the second argument (the factors to split by), so the data frame is not split into groups

highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin))

the highest profit margin for each company? (add new column(s))

myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin))

highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))

highestProfitMargins <- ddply(companiesData, 'company', transform, bestMargin = max(margin))

just the (entire) rows that have the highest profit margins.

highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),])

plyr::ddply(): Input a data frame and get a data frame back.

ddply(mydata, c('column name of a factor to group by', 'column name of the second factor to group by'), summarize OR transform, newcolumn = myfunction(column name(s) I want the function to act upon))

the first argument
- the name of the original data frame
the second argument
- the name of the column or columns you want to split your data by.
  - No c() wrapper is needed for just one factor if you're using quotation marks
  - NULL
    - no splitting: the function is applied to the entire data set (e.g. to get the highest margin overall)
the third argument
- tells ddply() what kind of result to return
  - summarize
    - returns one row per group with the grouping column(s) and the new column(s); it doesn't give any information from other columns in the original data frame
  - transform
    - returns your existing data frame with the new column(s) added
the fourth argument
- names the new column and then
- lists the function you want ddply() to use.

highest profit margin

highestMargin <- max(companiesData$margin)

highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]

highestMargin <- subset(companiesData, margin==max(margin))

the highest profit margin for each company?
- applying a function by groups
  - what R calls factors.
- "split-apply-combine" (Hadley Wickham {plyr})
  - Split up your data set by one or more factors,
  - apply some function,
  - then combine the results back into a data set.
- plyr package created by Hadley Wickham
  - group of "ply" functions
    - ddply(): Input a data frame and get a data frame back.
    - alply(): input an array and get back a list
    - ldply(): input a list and get back a data frame
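
The three split-apply-combine steps can be sketched in base R before handing the work to plyr. This is only an illustration: the companiesData values below are made up, and split(), lapply() and do.call(rbind, ...) stand in for what ddply() does in a single call.

```r
# Hypothetical sample data; the values are invented for illustration only.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Split: one data frame per company.
pieces <- split(companiesData, companiesData$company)

# Apply: compute the highest margin within each piece.
applied <- lapply(pieces, function(piece) {
  data.frame(company = piece$company[1], bestMargin = max(piece$margin))
})

# Combine: stack the per-company results back into one data frame.
bestByCompany <- do.call(rbind, applied)
rownames(bestByCompany) <- NULL
print(bestByCompany)
```

ddply() wraps the same three steps, which is why it only needs the data frame, the column(s) to split by, and the function to apply.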

install & load

install.packages("plyr")
library("plyr")

format of plyr::ddply() 1: names in quotes

ddply(mydata, c('column name of a factor to group by', 'column name of the second factor to group by'), summarize OR transform, newcolumn = myfunction(column name(s) I want the function to act upon))

the first argument
- the name of the original data frame
the second argument
- the name of the column or columns you want to split your data by.
  - No c() wrapper is needed for just one factor if you're using quotation marks
  - NULL
    - no splitting: the function is applied to the entire data set (e.g. to get the highest margin overall)
the third argument
- tells ddply() what kind of result to return
  - summarize
    - returns one row per group with the grouping column(s) and the new column(s); it doesn't give any information from other columns in the original data frame
  - transform
    - returns your existing data frame with the new column(s) added
the fourth argument
- names the new column and then
- lists the function you want ddply() to use.

highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))

highestProfitMargins <- ddply(companiesData, 'company', transform, bestMargin = max(margin))
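
To see how the two calls differ, it can help to compare the shape of what each returns. The sketch below assumes plyr is installed and uses a made-up companiesData; the names summarized and transformed are only for illustration.

```r
library(plyr)

# Hypothetical sample data; the values are invented for illustration only.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# summarize: one row per company, holding only company and bestMargin.
summarized <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))

# transform: every original row is kept and bestMargin is added as a new column.
transformed <- ddply(companiesData, 'company', transform, bestMargin = max(margin))

print(dim(summarized))
print(dim(transformed))
```

summarize is the better fit when you only want the per-group answer; transform is handy when you want to compare each row against its group's value.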

apply more than one function at a time

myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin))

just the (entire) rows that have the highest profit margins.

highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),])

function(x)
- an anonymous (unnamed, ad hoc) function is coming next
  - extracting a subset of x
    - x refers to the data frame that was passed into the anonymous function
  - the anonymous function is being passed into a ddply() statement that's splitting the data frame by company, so x is one company's rows at a time and what's returned is the matching row(s) for each company.
- [x$margin==max(x$margin),]
  - I want to match every row where x$margin equals the maximum of x$margin.
  - The comma after x$margin==max(x$margin)
    - tells R to return every column of those matching rows, since no columns were specified.
  - As an alternative, we could seek to return only one or several of the columns instead of all of them.
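
A minimal sketch of that alternative, assuming plyr is installed and using a made-up companiesData: naming columns after the comma limits what the anonymous function returns. The name topRows and the choice of the fy and margin columns are only for illustration.

```r
library(plyr)

# Hypothetical sample data; the values are invented for illustration only.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Rows go before the comma, columns after it: keep only fy and margin
# from the row(s) where each company's margin is at its maximum.
topRows <- ddply(companiesData, 'company', function(x) {
  x[x$margin == max(x$margin), c("fy", "margin")]
})

print(topRows)
```

Even though company is not among the selected columns, ddply() labels each returned piece with the grouping column, so the result still shows which company each row belongs to.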

companiesData[companiesData$margin==max(companiesData$margin),]

- alone, without ddply(), gives the row(s) with the highest overall margin, not the highest margin for each company.

the highest margin in the entire data set
- NULL as the second argument (the factors to split by), so the data frame is not split into groups

highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin))

format of plyr::ddply() 2: dot before the column names

myresult <- ddply(mydata, .(column name of factor I'm splitting by, column name second factor I'm splitting by), summarize OR transform, newcolumn = myfunction(column name I want the function to act upon))

To get the highest profit margins for each company
- splitting the data frame by only one factor
  - company.
    - Even if you've only got one factor,
      - it needs to be in parentheses after that dot if you're using the dot

highestProfitMargins <- ddply(companiesData, .(company), summarize, bestMargin = max(margin))
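
As a quick check that the two formats are interchangeable, the sketch below runs the same summary both ways and compares the results. It assumes plyr is installed and uses a made-up companiesData; the names quoted, dotted and byCompanyYear are only for illustration.

```r
library(plyr)

# Hypothetical sample data; the values are invented for illustration only.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Format 1: column name in quotes.
quoted <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))

# Format 2: unquoted column name inside .( ).
dotted <- ddply(companiesData, .(company), summarize, bestMargin = max(margin))

# Both calls should describe the same groups and the same values.
print(all.equal(quoted, dotted))

# Splitting by two factors with the dot format: list both names inside .( ).
byCompanyYear <- ddply(companiesData, .(company, fy), summarize, bestMargin = max(margin))
print(byCompanyYear)
```

Which format to use is mostly a matter of taste: the dot format saves typing quotation marks, while quoted names are easier to build up programmatically.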

Reference:

4 data wrangling tasks in R for advanced beginners | Computerworld
