Getting summaries by subgroups of your data {plyr}

  • highest profit margin (as a single value, or as the matching row(s))
highestMargin <- max(companiesData$margin)

highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]

highestMargin <- subset(companiesData, margin==max(margin))
  • the highest margin (as a data frame)
    • NULL as the second argument for factors to split by
highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin))
  • the highest profit margin for each company? (add new column(s))
myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin))

highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))

highestProfitMargins <- ddply(companiesData, 'company', transform, bestMargin = max(margin))
  • just the (entire) rows that have the highest profit margins.
highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),])

  • plyr::ddply(): Input a data frame and get a data frame back.

ddply(mydata, c('column name of a factor to group by', 'column name of the second factor to group by'), summarize OR transform, newcolumn = myfunction(column name(s) I want the function to act upon))

  • the first argument
    • the name of the original data frame
  • the second argument
    • the name of the column or columns you want to split your data by.
      • c() is not needed for just one factor if its name is in quotation marks
      • NULL
        • no splitting: the function is applied to the entire data set (e.g. to get the highest margin overall)
  • the third argument
    • tells ddply() what kind of data frame to return
      • summarize
        • returns one row per group with the grouping column(s) and the new column(s) only; no other columns from the original data frame
      • transform
        • returns your existing data frame with the new column(s) added
  • the fourth argument
    • names the new column and then
    • gives the function you want ddply() to apply.

  • highest profit margin
highestMargin <- max(companiesData$margin)

highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]

highestMargin <- subset(companiesData, margin==max(margin))
  • the highest profit margin for each company?

    • applying a function by groups
      • the grouping variables are what R calls factors.
    • "split-apply-combine" (Hadley Wickham {plyr})
      • Split up your data set by one or more factors,
      • apply some function,
      • then combine the results back into a data set.
    • plyr package created by Hadley Wickham
      • group of "ply" functions
        • ddply(): Input a data frame and get a data frame back.
        • alply(): input an array and get back a list
        • ldply(): input a list and get back a data frame
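
The three split-apply-combine steps can be sketched in base R, without plyr, to show what ddply() does for you. This is only an illustration: the contents of companiesData below are made up (the notes never list the real values), and pieces, applied and combined are names chosen for this sketch.

```r
# Hypothetical sample data; the values are invented for illustration.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Split: one data frame per company.
pieces <- split(companiesData, companiesData$company)

# Apply: compute the highest margin within each piece.
applied <- lapply(pieces, function(piece) {
  data.frame(company = piece$company[1], bestMargin = max(piece$margin))
})

# Combine: stack the per-company results back into one data frame.
combined <- do.call(rbind, applied)
rownames(combined) <- NULL
print(combined)
```

ddply(companiesData, 'company', summarize, bestMargin = max(margin)) expresses the same idea in a single call.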
  • install & load

    install.packages("plyr")
    library("plyr")
    
  • format of plyr::ddply() 1: names in quotes

highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))

highestProfitMargins <- ddply(companiesData, 'company', transform, bestMargin = max(margin))
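
To see the difference between summarize and transform, it helps to compare the shape of the two results. A minimal sketch, assuming plyr is installed and using a made-up companiesData (the fy column and all values are hypothetical):

```r
library(plyr)

# Hypothetical sample data; the values are invented for illustration.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# summarize: one row per company, only 'company' and the new column.
summarized <- ddply(companiesData, "company", summarize,
                    bestMargin = max(margin))

# transform: every original row and column, plus the new column.
transformed <- ddply(companiesData, "company", transform,
                     bestMargin = max(margin))

print(dim(summarized))
print(dim(transformed))
```

With this sample, summarized has 2 rows (one per company) and transformed keeps all 6 original rows.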
  • apply more than one function at a time
myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin))
  • just the (entire) rows that have the highest profit margins.
highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),])
  • function(x)

    • an anonymous (unnamed, ad hoc) function is coming next

      • extracting a subset of x
        • x refers to the data frame that was passed into the anonymous function
      • because the anonymous function is passed into a ddply() statement that splits the data frame by company, what's returned is the matching row(s) for each company.
    • [x$margin==max(x$margin),]

      • matches every row where x$margin equals the maximum of x$margin.
      • The comma after x$margin==max(x$margin)
        • tells R to return every column of those matching rows, since no columns were specified.
      • As an alternative, we could return only one or several of the columns instead of all of them.
companiesData[companiesData$margin==max(companiesData$margin),]
  • alone, without ddply(), this returns the row(s) with the highest overall margin, not the highest margin for each company.
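
Returning only some columns means naming them after the comma inside the anonymous function. A minimal sketch, assuming plyr is installed; the companiesData values and the fy column are hypothetical:

```r
library(plyr)

# Hypothetical sample data; the values are invented for illustration.
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("Apple", "Apple", "Apple", "Google", "Google", "Google"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# For each company, keep the row(s) with the highest margin,
# but return only the fy and margin columns of those rows.
bestRows <- ddply(companiesData, "company", function(x) {
  x[x$margin == max(x$margin), c("fy", "margin")]
})

print(bestRows)
```

Because ddply() is splitting by company, it labels each returned row with the company it came from, so the result still shows which company each best year belongs to.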
  • the highest margin in the entire data set
    • NULL as the second argument for factors to split by
highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin))
  • format of plyr::ddply() 2: dot before the column names

myresult <- ddply(mydata, .(column name of factor I'm splitting by, column name second factor I'm splitting by), summarize OR transform, newcolumn = myfunction(column name I want the function to act upon))

  • To get the highest profit margins for each company
    • splitting the data frame by only one factor
      • company.
        • Even if you've only got one factor,
          • it needs to be in parentheses after that dot if you're using the dot
highestProfitMargins <- ddply(companiesData, .(company), summarize, bestMargin = max(margin))
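
The dot format also takes more than one factor, separated by commas and without quotes. A minimal sketch, assuming plyr is installed; the sample data (two made-up rows per company and fiscal year) is hypothetical:

```r
library(plyr)

# Hypothetical sample data: two rows per company and fiscal year.
companiesData <- data.frame(
  fy      = c(2011, 2011, 2012, 2012, 2011, 2011, 2012, 2012),
  company = c("Apple", "Apple", "Apple", "Apple",
              "Google", "Google", "Google", "Google"),
  margin  = c(23.9, 22.1, 26.7, 25.0, 25.7, 27.3, 21.4, 20.8)
)

# Dot format: unquoted column names inside .( ).
byDot <- ddply(companiesData, .(company, fy), summarize,
               bestMargin = max(margin))

# Quoted format: the same grouping written with c() and quotes.
byQuotes <- ddply(companiesData, c("company", "fy"), summarize,
                  bestMargin = max(margin))

print(byDot)
print(byQuotes)
```

Both calls split by the same company and fy combinations, so which format to use is a matter of taste.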
