How to select the best dataset for your project - Transcript
Welcome to the online course on oceanographic satellite data products, produced by NOAA’s CoastWatch Program. In this module we will discuss how to select the best dataset for your application. The information presented here builds on the concepts discussed in several of the course’s earlier modules, including Satellite 101, Sea Surface Temperature, Ocean Color, and Wind, Salinity, and Sea Surface Height. You might want to review those modules before viewing this one. My name is Cara Wilson, and I’m the node manager of the West Coast Node of NOAA’s CoastWatch Program. The materials I will be presenting in this video were produced through a collaboration among many members of NOAA’s CoastWatch Program, including Dale Robinson, Melanie Abecassis, Ron Vogel, Shelly Tomlinson, me, and the late Dave Foley.
How do you select the best dataset for your project? I will start by saying that there is no perfect dataset that will work for everybody. Each project has a unique set of requirements for ocean satellite data, and each dataset has characteristics that may be pros or cons for your project. The trick to finding the datasets best suited for your project is to strike a balance between the needs of your project and the properties of the available datasets. On the right, I’ve listed characteristics of datasets and some questions for you to answer in order to pick the correct dataset for each project. For temporal coverage, I want to know if the satellite was flying during the dates of my study. For geographical coverage, I need to know if the dataset has data in my area of interest. For spatial resolution: are the pixels small enough to resolve the features I’m interested in? And if I’m looking at a large area, is the spatial resolution so fine that the dataset will be overly cumbersome to download and process? For temporal resolution: how often does the satellite fly over my area of interest? For latency and quality: how fast do I need the data after they’ve been collected, and at what quality? For missing data: how much missing data can I tolerate, and what are my options to fill it in? I will explore each of these issues in more detail during this presentation.
So first let's talk about temporal coverage. This is the first thing to consider to determine if data is available for your time period of interest. This image shows the timespan over which data has been collected by several ocean color sensors. If you needed chlorophyll data from the years 2005 through 2015, as pictured for the study time period shown in yellow, then neither SeaWiFS nor VIIRS alone could provide all the data you required. However, data from MODIS-Aqua is available for the entire time period.
Now, suppose that you need chlorophyll data from the years 2000 through 2015, as pictured for the study time period shown again in yellow. No single sensor covers the entire time period. You could piece together the data you need by using data from SeaWiFS and MODIS-Aqua. However, you would have to reconcile the differences between the measurements from the two sensors during their overlap period from 2002 to 2010. Alternatively, you could use one of the blended datasets, where data from many ocean color missions have been merged for you. One example is the European Space Agency’s Climate Change Initiative Ocean Colour dataset, which merges data from the SeaWiFS, MERIS, MODIS-Aqua, and VIIRS missions to create a blended dataset extending from 1997 to the present.
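As a quick illustration of this temporal-coverage check, here is a minimal Python sketch that tests which missions span a study period. The mission date ranges below are approximate placeholders for illustration only, not authoritative values; consult each mission’s documentation for exact dates.

```python
from datetime import date

# Approximate operational spans of some ocean color missions
# (illustrative assumptions, not authoritative dates).
missions = {
    "SeaWiFS":    (date(1997, 9, 1), date(2010, 12, 11)),
    "MODIS-Aqua": (date(2002, 7, 4), date.today()),
    "VIIRS":      (date(2012, 1, 2), date.today()),
}

def covers(span, start, end):
    """True if a mission's span fully contains the study period."""
    return span[0] <= start and span[1] >= end

study_start, study_end = date(2000, 1, 1), date(2015, 12, 31)

full = [m for m, s in missions.items() if covers(s, study_start, study_end)]
partial = [m for m, s in missions.items()
           if m not in full and s[1] >= study_start and s[0] <= study_end]

print("Full coverage:", full)        # no single mission covers 2000-2015
print("Partial coverage:", partial)  # candidates to piece together or blend
```

For the 2000–2015 example, the full-coverage list comes back empty, which is exactly the situation where piecing sensors together or choosing a blended product becomes necessary.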
Now let's consider spatial coverage. Spatial coverage, the area of the Earth’s surface over which data is collected, is not the same for all datasets. Many satellite datasets have global coverage, but some have only regional coverage, so you need to check whether the dataset covers your area of interest. Non-global coverage may occur because the sensor has a limited footprint, or because the global files have been split into regional sectors to reduce file size. The top map shows the daily coverage provided by the NOAA GeoPolar dataset, which merges data from many polar-orbiting and geostationary satellites. If your study site were off the coast of Washington State, this dataset would provide the spatial coverage you need. The bottom map shows the coverage of a sensor on the GOES-16 geostationary satellite. The circle shows the footprint of the GOES-16 sensor, which runs from 52°N to 52°S latitude and covers South American and eastern US waters, with some coverage in the Pacific. This dataset would not provide the spatial coverage you need for a study off the coast of Washington State.
Spatial resolution, the linear dimension on the ground represented by each pixel, is not the same among satellite datasets. The slide shows SST maps of the Central California coast from two different sensors. On the left is a map from an older GOES-West sensor, whose pixels are about 4 km on a side. The map looks blocky compared to the VIIRS map on the right, whose pixels are 750 m on a side. For work on the scale of the whole Central California coast, the 4 km GOES data might have high enough resolution for your needs, and using a lower-resolution dataset will make downloading and processing the data much faster. However, if you are interested in the feature visible in the circle on the 750 m resolution image, then the 4 km resolution data would not meet your needs.
What if your project examined a much smaller area? Let’s zoom in on a smaller area of interest within San Francisco Bay. Once again, coverage for the GOES-West sensor is on the left and for the VIIRS sensor is on the right. The GOES-West data give, at most, 4 pixels going across the bay and about 20 pixels within the entire bay. The GOES coverage probably does not represent the temperature of the bay or show temperature features well. On the right, coverage from the VIIRS sensor has about 28 times more pixels for the entire bay than does GOES-West. The higher spatial resolution better represents the temperatures within the bay and allows the detection of surface temperature features. So finer spatial resolution provides more detail, but the amount of data that needs to be downloaded is larger.
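That roughly 28-fold difference in pixel counts follows directly from the ratio of pixel areas at the two resolutions. A back-of-the-envelope sketch in Python (the bay width used here is an illustrative assumption, not a measured value):

```python
# Pixel sizes from the two sensors discussed above.
goes_res_m = 4000   # older GOES-West SST pixels, ~4 km on a side
viirs_res_m = 750   # VIIRS pixels, 750 m on a side

# Pixels needed to span a given ground distance (hypothetical bay width):
bay_width_m = 16_000
print(bay_width_m / goes_res_m)   # 4.0 pixels across
print(bay_width_m / viirs_res_m)  # ~21.3 pixels across

# Over the same 2-D area, the pixel count scales with the square of the
# linear resolution ratio:
area_ratio = (goes_res_m / viirs_res_m) ** 2
print(round(area_ratio, 1))  # ~28.4
```

The same squaring effect explains why halving the pixel size quadruples the download and processing volume for a fixed study area.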
Temporal resolution helps you address the question of how often you need a measurement for your project. If you are examining trends over the last 30 years, a monthly measurement is probably sufficient. However, if you are studying events in a dynamic coastal region, you may require weekly, daily, or even hourly measurements. How often a dataset has a new measurement at the same location depends on the swath width of the sensors, whether the satellite is polar-orbiting or geostationary, and whether data from several sensors are blended together in the dataset. On the left is the coverage for the ASCAT polar-orbiting sensor for 3 successive days. ASCAT is an active microwave sensor in a polar orbit with a swath width of about 500 km. For the location west of Mexico indicated by the white dot, you would get a wind measurement every 3 days. If you require data more often, you would need to select a different dataset. On the right is the daily coverage for the Cross-Calibrated Multi-Platform, or CCMP, dataset. This dataset blends data from many sources, including active and passive sensors, to make a dataset with measurements as often as every 6 hours. The images on the right show daily wind maps from the CCMP dataset that provide a measurement each day of the 3-day period at the location indicated by the white dot.
Typically, there is a tradeoff between getting data very soon after they have been collected and getting the highest quality data. For near real-time (NRT) data, the data provider makes the data available as quickly as possible, often within a day or even an hour of acquisition. To do this, the strictest quality control cannot be applied. The data in NRT datasets are not bad; more likely, some pixels with questionable data may not have been removed from the dataset. In contrast, science quality data are released after a delay to allow strict quality controls to be applied. The slide shows how cloud cover may impact datasets when science quality and near real-time quality control are applied to chlorophyll data from the same VIIRS source data. The science quality version is on the left and the near real-time version is on the right. For the NRT image on the right, the inset zooms in on an area where clouds block the ocean color signal, creating gaps in the data. However, clouds can also have more subtle impacts on the ocean color signal, which can degrade the quality of pixels near those blocked by clouds. The less rigorous quality control applied to near real-time data can fail to identify these degraded pixels. Now look at the image on the left, which was made from science quality data. The inset zooms in on the same cloudy area, and you can see that the data gaps are larger. The science quality data were delayed by about 2 weeks to allow time for more rigorous quality control. The additional degraded pixels were identified and removed from the dataset during this quality control process, resulting in the larger data gaps.
The next few slides help illustrate how both science quality and near real-time data make important contributions to ocean management and research. If you are developing a model to predict the distribution of harmful algal blooms, you might develop a habitat model that uses environmental data like chlorophyll concentration to define conditions under which harmful algal blooms had formed in the past. You would want the best quality data available to create the model. In other words, you would use the science quality data. Since this would be a retrospective analysis, the two-week delay for science quality data would not impact your model development.
On the other hand, when using the model to predict the distribution of harmful algal blooms a few days into the future, you couldn’t use science quality data that are two weeks old. Doing so would reduce your forecasting ability. Instead, you would use the most up-to-date data available, the near real-time data. You would need to accept the potential for increased error in order to obtain the data quickly for prediction purposes.
How much missing data can your project tolerate? If you are constructing a daily time series for a specific, small area, then frequent missing data could make your time series unusable. However, if you are generating mean monthly values for a larger area, then data gaps may not be a concern. There can be many reasons for missing data, but for this example let’s consider two. The slide shows SST maps for the same day using microwave imagery on the left and infrared imagery on the right. For a project set in a cloudy region that needs to minimize missing SST data, you might consider using microwave SST data, since microwaves are not blocked by clouds. If your study is at site location 1 on the maps, this strategy seems to be effective: the microwave map shows no data gaps, whereas the gaps from clouds can be seen in the infrared map. In contrast, site location 2 falls within the coastal region where the ocean microwave data are missing due to contamination from the land microwave signal. If your study is at site location 2, then you would need to use a dataset containing infrared data.
Let’s look more closely at strategies for overcoming data gaps, using a case study on reducing gaps due to clouds in SST imagery. At the top left is a 1-day SST image from a polar-orbiting infrared satellite. There are gaps in the SST coverage that we want to minimize. One strategy to reduce data gaps is to average values over several days. That strategy is illustrated across the top of the slide, where 3-day, weekly, and monthly average composites are displayed. For example, in a 3-day composite each pixel contains a value that is an average of the day of interest, the day before, and the day after. As more days are added to the averages, data gaps are reduced. However, the data are also smoothed over more days. Notice how the finer structure visible within the circle in the 3-day image is less distinct in the weekly and monthly images. A second strategy is to use SST data from infrared sensors on geostationary satellites, illustrated on the lower-left part of the slide. These sensors look at the same location on Earth and capture an image every 30-60 minutes. The idea is that clouds will move around during the course of a day, increasing the chance of capturing data for a pixel to include in a daily composite image. Using daily geostationary imagery will often reduce data gaps, but the spatial resolution can be lower: sensors on polar-orbiting infrared satellites can have sub-kilometer spatial resolution, whereas resolutions from infrared sensors on geostationary satellites typically range from 2-4 kilometers. A third strategy is to use microwave imagery, which can see through clouds. Daily microwave images will often reduce data gaps, but at a spatial resolution of about 25 km. In addition, microwave imagery has data masked out within approximately 50 km of shore.
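The 3-day compositing strategy can be sketched with NumPy’s NaN-aware averaging. The tiny arrays below are synthetic stand-ins for real satellite grids, not actual SST data:

```python
import numpy as np

# Three daily SST grids (degrees C); NaN marks cloud-masked pixels.
day_before = np.array([[15.0, np.nan], [14.5, 16.0]])
day_of     = np.array([[np.nan, 17.0], [14.0, np.nan]])
day_after  = np.array([[15.5, 16.5], [np.nan, 16.2]])

stack = np.stack([day_before, day_of, day_after])

# nanmean averages only the valid observations at each pixel; a pixel
# remains NaN only if it was cloud-masked on all three days.
composite = np.nanmean(stack, axis=0)
print(composite)
```

Notice the tradeoff the transcript describes: every pixel in this composite is filled, but each value is now a blend of up to three days, so fast-moving features get smoothed out.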
This might seem like a lot to consider, but it’s important to realize that these three strategies are only available for SST data. Most other measurements are made only from polar-orbiting satellites, and in only one part of the electromagnetic (EMR) spectrum, so often the only strategy available is to temporally composite the data.
Another strategy might be to select a blended data product for your project. Blended products merge sensor data from multiple sources in order to increase spatial coverage. In addition, an interpolation step is often included to fill in any remaining data gaps. Two examples of blended products are included on the slide. On the left is the Multi-Scale Ultra High Resolution, or MUR, SST dataset, which merges data from polar-orbiting infrared and microwave imagery. On the right is the NOAA GeoPolar Blended SST dataset, which is produced and distributed by NOAA and merges data from polar-orbiting and geostationary sensors. The two datasets use different interpolation methods. This strategy might seem like the perfect solution, but one consideration is that with blended products you generally don’t know which pixels come from observations and which are interpolated. The circled region on the upper three maps is an area where there are gaps in the source data used for the blended products. Now look closely at that area on the maps made from the two blended datasets toward the bottom of the slide. The results look different in that area. Which of the two datasets more accurately represents the environment? The answer will likely vary with project location and over time. If knowing the level of accuracy is important for your project, then some sort of comparison to non-interpolated datasets would be an important step to consider.
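One simple form such a comparison could take is to compute error statistics between the blended product and a non-interpolated dataset, restricted to pixels where real observations exist. This is a minimal sketch with made-up numbers, not real SST values or an official validation procedure:

```python
import numpy as np

# Blended (gap-filled) product and a non-interpolated reference;
# NaN in the reference marks pixels with no actual observation.
blended  = np.array([[15.2, 15.8, 16.1],
                     [14.9, 15.5, 16.0]])
observed = np.array([[15.0, np.nan, 16.3],
                     [np.nan, 15.4, np.nan]])

# Compare only where observations exist, so interpolated pixels in the
# blended product are never scored against other interpolated values.
valid = ~np.isnan(observed)
diff = blended[valid] - observed[valid]
rmse = float(np.sqrt(np.mean(diff ** 2)))
print(round(rmse, 3))  # ~0.173
```

Repeating this kind of check for your own region and time period is one way to decide which blended product better represents your study area.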
This concludes this presentation on choosing the best dataset for your project. Please visit the data catalogs at NOAA CoastWatch Central and any of the CoastWatch Nodes listed on this slide. The catalogs have information to help you make decisions about the datasets to use. In addition, please feel free to contact us with questions about datasets.