I need to write a program to find the average GPS point from a population of points.
In practice the following happens:
- Each month a person records a GPS point of the same static asset.
- Because of the nature of GPS, these points differ slightly each month.
- Sometimes the person makes a mistake and records the wrong asset at a completely different location.
- Each GPS point has a certainty weight (HDOP) that indicates how accurate the current GPS data is. Points with better (lower) HDOP values are preferred over those with worse ones.
How do I determine the following:
- Deal with data that has 2 values per observation rather than a single value such as age (e.g. finding the average age in a population of people).
- Determine the outliers. In the example below these would be [-28.252, 25.018] and [-28.632, 25.219]
- After excluding the outliers, find the average GPS point. In this example it might be [-28.389, 25.245].
- It would be a bonus if the method could work in the "weight" provided by the HDOP value for each point.
One of the problems with multivariate data is deciding on, and then interpreting, a suitable metric for calculating distances, hence clever but somewhat hard-to-explain concepts such as Mahalanobis distance. But in this case surely the choice is obvious – Euclidean distance. I'd suggest a simple heuristic algorithm something like:
- Calculate the (unweighted) centroid of the data points, i.e. the (unweighted) means of the 2 coordinates
- Calculate the Euclidean distance of all the readings from the centroid
- Exclude any readings that are further than a certain distance from the centroid (to be determined based on your experience and knowledge of the technology, or failing that a bit of trial and error or cross-validation: 100m, 1km, 10km?)
- Calculate the weighted average of both coordinates of the remaining points, weighting by the inverse of the HDOP score (or some monotonic function of it – I had a quick look at the Wikipedia page linked in the question and think you may not need such a function, but I'd need to study it further to be sure)
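The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions of mine: readings are `(lat, lon, hdop)` tuples in decimal degrees, HDOP is always positive, distances are approximated by projecting onto a flat plane around the centroid (reasonable when the points are within a few tens of km of each other), and the function names and the `max_dist_m` cut-off are made up for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an approximation


def to_local_metres(lat, lon, lat0, lon0):
    """Flat-plane approximation: offset of (lat, lon) from (lat0, lon0) in metres."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def average_gps_point(readings, max_dist_m):
    """readings: list of (lat, lon, hdop) tuples, hdop > 0 (assumed).

    Returns ((lat, lon), kept, rejected).
    """
    # Step 1: unweighted centroid of all readings.
    n = len(readings)
    lat0 = sum(r[0] for r in readings) / n
    lon0 = sum(r[1] for r in readings) / n

    # Steps 2 and 3: Euclidean distance from the centroid, then a hard cut-off.
    kept, rejected = [], []
    for r in readings:
        x, y = to_local_metres(r[0], r[1], lat0, lon0)
        if math.hypot(x, y) <= max_dist_m:
            kept.append(r)
        else:
            rejected.append(r)
    if not kept:
        raise ValueError("every reading was rejected; max_dist_m is too small")

    # Step 4: average of the remaining points, weighted by 1 / HDOP.
    total_w = sum(1.0 / r[2] for r in kept)
    lat = sum(r[0] / r[2] for r in kept) / total_w
    lon = sum(r[1] / r[2] for r in kept) / total_w
    return (lat, lon), kept, rejected
```

Note that the unweighted centroid in step 1 is itself pulled towards the outliers, so the cut-off has to be generous enough that the good readings still fall inside it. For example, `average_gps_point(readings, max_dist_m=10_000)` would use a 10km cut-off.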
There are clearly several ways to make this more sophisticated, such as down-weighting outliers or using M-estimators rather than simply excluding them, but I'm not sure whether such sophistication is really necessary here.
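For what it's worth, here is a rough sketch of what the down-weighting variant might look like: an iteratively reweighted mean with Huber-type weights, which is one common kind of M-estimator of location. Everything specific here is my assumption rather than something from the question: the `(lat, lon, hdop)` tuple layout, the function names, the tuning distance `c_m`, starting from the coordinate-wise median, and combining the robust weight with 1 / HDOP by simple multiplication.

```python
import math
import statistics

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an approximation


def distance_m(lat1, lon1, lat2, lon2):
    """Approximate flat-plane distance in metres between two nearby points."""
    mid_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mid_lat) * EARTH_RADIUS_M
    y = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    return math.hypot(x, y)


def huber_location(readings, c_m=100.0, iterations=50, tol_m=0.01):
    """readings: list of (lat, lon, hdop) tuples, hdop > 0 (assumed).

    Readings within c_m metres of the current estimate get full weight;
    further ones are down-weighted by c_m / distance instead of dropped.
    """
    # Start from the coordinate-wise median, which is already fairly robust.
    lat = statistics.median(r[0] for r in readings)
    lon = statistics.median(r[1] for r in readings)

    for _ in range(iterations):
        weights = []
        for r_lat, r_lon, hdop in readings:
            d = distance_m(lat, lon, r_lat, r_lon)
            robust = 1.0 if d <= c_m else c_m / d  # Huber-type weight
            weights.append(robust / hdop)          # combined with 1 / HDOP

        total = sum(weights)
        new_lat = sum(w * r[0] for w, r in zip(weights, readings)) / total
        new_lon = sum(w * r[1] for w, r in zip(weights, readings)) / total

        shift = distance_m(lat, lon, new_lat, new_lon)
        lat, lon = new_lat, new_lon
        if shift < tol_m:  # stop once the estimate has settled
            break
    return lat, lon
```

Unlike the hard cut-off, a gross error still contributes a little here, but its pull on the estimate is bounded by `c_m` no matter how far away it is.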