Outlier Management

Evans Policy Analysis and Research Group (EPAR)

About
About this app

This is an interactive visualization companion to the Decisions Matter - Controlling Outliers blog post (link tbd). It illustrates the tradeoffs among commonly used outlier control methods as applied to agricultural household survey data. We provide three options for identifying outliers and three options for modifying or removing them, along with box plots and summary tables to compare the results. The first tab shows side-by-side comparisons of each identification technique under the same replacement strategy. The second tab shows the impact of a selected technique and replacement strategy on subgroups (defined by the gender of either the household head or the plot manager).

The three outlier detection techniques are percentile (any observation below the lower cutoff percentile or above the upper cutoff percentile is an outlier), MAD (any observation whose absolute difference from the median exceeds the median absolute deviation multiplied by a given factor is an outlier), and transformation (first apply a log-like transformation, then classify any value whose z-score exceeds a cutoff as an outlier). For the transformation we use the Yeo-Johnson technique, which behaves like the log for large values and is approximately linear at small values.
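As a rough illustration of the three detection rules, here is a minimal Python sketch using only the standard library. It is not the app's implementation. The function names are hypothetical, the thresholds are unweighted, the Yeo-Johnson parameter is fixed at lambda = 0 (the case that matches the log-like behavior described above), and the z-score test is assumed to be two-sided.

```python
import math
import statistics


def percentile_bounds(values, lower_pct=1.0, upper_pct=99.0):
    """Return (low, high) cutoffs at the given percentiles (linear interpolation)."""
    ordered = sorted(values)

    def pct(p):
        pos = (len(ordered) - 1) * p / 100.0
        lo, hi = math.floor(pos), math.ceil(pos)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return pct(lower_pct), pct(upper_pct)


def mad_bounds(values, factor=3.5):
    """Return median -/+ factor * median absolute deviation (unweighted)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med - factor * mad, med + factor * mad


def flag_outside(values, low, high):
    """Flag every value strictly outside the closed interval [low, high]."""
    return [v < low or v > high for v in values]


def yeo_johnson(x, lmbda=0.0):
    """Yeo-Johnson transform of one value; lmbda=0 is an assumption of this sketch."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)  # log(x + 1): roughly x when x is small
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -(((1 - x) ** (2 - lmbda)) - 1) / (2 - lmbda)


def zscore_flags(values, cutoff=3.5, lmbda=0.0):
    """Flag values whose transformed z-score exceeds the cutoff in absolute value."""
    transformed = [yeo_johnson(v, lmbda) for v in values]
    mean = statistics.fmean(transformed)
    sd = statistics.stdev(transformed)
    return [abs((t - mean) / sd) > cutoff for t in transformed]
```

With percentile cutoffs of 0 and 100 the bounds are the sample minimum and maximum, so `flag_outside` flags nothing. Note that in very small samples a z-score cutoff of 3.5 cannot be reached, so the transformation rule is only meaningful on survey-sized data.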

The replacement options are to replace at the threshold (“Tails”), replace at the median (“Median”), or remove the observation from the sample (“Trim”).
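The three options can be sketched as a single function. This is an illustrative sketch only: `control_outliers` is a hypothetical name, `low` and `high` stand for thresholds produced by whichever detection technique is selected, and the median is assumed to be the unweighted median of the original values.

```python
import statistics


def control_outliers(values, low, high, method="tails"):
    """Apply one replacement option to values outside [low, high].

    "tails"  replaces an outlier with the threshold it crossed,
    "median" replaces it with the median of the original values,
    "trim"   drops the observation from the sample.
    """
    if method not in {"tails", "median", "trim"}:
        raise ValueError(f"unknown method: {method}")
    med = statistics.median(values)
    result = []
    for v in values:
        if low <= v <= high:
            result.append(v)  # not an outlier: keep unchanged
        elif method == "tails":
            result.append(low if v < low else high)
        elif method == "median":
            result.append(med)
        # method == "trim": the observation is skipped entirely
    return result
```

For example, with values `[1, 2, 3, 4, 100]` and thresholds 1 and 10, "tails" turns 100 into 10, "median" turns it into 3, and "trim" returns only the first four values.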

This app is also designed to illustrate the effect of outlier control on the components of a calculated value. Each of the available indicators is calculated as a ratio of two values in the surveys. Outlier control can be applied to any combination of the numerator, denominator, and final value. Experiment with different methods to see how the final estimates change.
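To see why the choice of component matters, consider a small sketch in which thresholds may be applied to the numerator, the denominator, the final ratio, or any combination of them. The function names and the example thresholds are hypothetical, replacement at the tails is used throughout, and denominators are assumed to be non-zero.

```python
def clip_to_bounds(values, low, high):
    """Replace values outside [low, high] with the nearest threshold."""
    return [min(max(v, low), high) for v in values]


def ratio_indicator(numerators, denominators,
                    num_bounds=None, den_bounds=None, ratio_bounds=None):
    """Compute numerator / denominator with optional control at each stage.

    Each *_bounds argument is either None (no control) or a (low, high) pair.
    """
    if num_bounds is not None:
        numerators = clip_to_bounds(numerators, *num_bounds)
    if den_bounds is not None:
        denominators = clip_to_bounds(denominators, *den_bounds)
    ratios = [n / d for n, d in zip(numerators, denominators)]
    if ratio_bounds is not None:
        ratios = clip_to_bounds(ratios, *ratio_bounds)
    return ratios


# Hypothetical harvest quantities and plot areas for three plots.
harvest = [100, 200, 5000]
area = [1, 2, 0.5]
raw = ratio_indicator(harvest, area)
numerator_only = ratio_indicator(harvest, area, num_bounds=(0, 1000))
ratio_only = ratio_indicator(harvest, area, ratio_bounds=(0, 500))
```

Here the third plot combines a large numerator with a small denominator, so capping only the numerator still leaves a ratio far above the others, while capping the final value bounds it directly.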

How do I use this tool?
  • This app contains processed data from the LSMS-ISA surveys in Ethiopia, Nigeria, Malawi, Tanzania, and Uganda. Select a country and survey round to begin.
  • Select a variable from the dropdown.
  • Set the lower and upper percentile thresholds (defaults are 1st and 99th); set lower to 0 or upper to 100 to disable outlier detection for that tail.
  • Set the coefficient for the median absolute deviation (default is 3.5).
  • Set the z-score cutoff for the Yeo-Johnson transformation (default is 3.5; in normally distributed data, 2.6 corresponds approximately to the 0.5th and 99.5th percentiles, and about 2.33 to the 1st and 99th percentiles).
  • Select the variables to which outlier control will be applied.
  • Determine what to do with the outliers: replace at the tails, replace at the median, or trim (all three detection methods will use the same replacement/trimming option).
  • The first checkbox determines whether sample weights are used when calculating summary statistics (default: on); disable to view unweighted sample summary statistics (this will also disable the checkbox below).
  • The second checkbox determines whether sample weights are used when calculating thresholds (default: on); disable to use the sample median instead of the weighted median for the MAD method, for example.
  • Click the “Go” button to run the analysis.
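The difference between weighted and unweighted thresholds in the MAD method can be sketched as follows. This is a minimal illustration, not the app's code: the function names are hypothetical, and the weighted median is defined here as the smallest value at which the cumulative weight reaches half of the total, which is one common convention among several.

```python
def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must be non-empty")


def weighted_mad_bounds(values, weights, factor=3.5):
    """MAD thresholds where both the median and the MAD use sample weights."""
    med = weighted_median(values, weights)
    deviations = [abs(v - med) for v in values]
    mad = weighted_median(deviations, weights)
    return med - factor * mad, med + factor * mad


values = [1, 2, 3, 10]
with_weights = weighted_median(values, [1, 1, 1, 5])     # heavy weight on 10
without_weights = weighted_median(values, [1, 1, 1, 1])  # equal weights
```

Passing equal weights reproduces an unweighted calculation under this convention (it returns the lower of the two middle values rather than their average), so the two checkboxes can be thought of as switching between the survey weights and a vector of ones.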
Output
  • The program will produce three box plots, one each for the numerator, the denominator, and the final calculated indicator. On the first tab, the box plots will show the three outlier detection methods compared to the raw data. On the second tab, the plots will show the raw data compared to the selected outlier control method, disaggregated by subgroup.
  • Tables below each figure will present the raw and processed mean, median, minimum non-zero value (zeroes are not changed), maximum, and number of values removed or altered on the left tail (Lower N) and right tail (Upper N). Weights will be applied to the mean and median if the corresponding checkbox is active.
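The columns of the summary table can be sketched in a few lines. This is an unweighted illustration with a hypothetical function name. It assumes that zeroes are excluded from the Lower N count, which is one reading of the note that zeroes are not changed.

```python
import statistics


def summary_row(values, low, high):
    """Summary statistics plus counts of values beyond each threshold."""
    nonzero = [v for v in values if v != 0]
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min_nonzero": min(nonzero) if nonzero else None,
        "max": max(values),
        # Assumption: zeroes are never treated as lower-tail outliers.
        "lower_n": sum(1 for v in nonzero if v < low),
        "upper_n": sum(1 for v in values if v > high),
    }
```

Calling this once on the raw values and once on the values returned by the chosen replacement option gives the raw and processed rows for comparison.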
Citation

University of Washington, Evans Policy Analysis and Research Group (EPAR) (2025). . URL: URL

How can I contact you for help?

We welcome feedback and questions. Please email us at uw.eparx@gmail.com.