Posts Tagged ‘Statistics’

Have you ever wanted to do a quick sanity check on a long list of numbers? It might be a budget, worldwide sales by country or product, or a marketing forecast. There is a cute little trick that can tell you if the numbers might be manufactured instead of real: Benford’s Law.

Benford’s Law, which is not really a “law of nature” but the result of more than 125 years of observation, states that the first digit of many real-life sets of numerical data is more likely to be a “1” than any other digit, and the probability gets successively smaller for “2” through “9”. Intuitively, one might expect the probability to be evenly spread: about 11% for each possible first digit 1 through 9. Zero doesn’t count as a first digit in this case. The law works even when the numbers in a set vary vastly in size, that is, in how many digits they have. In fact, the more orders of magnitude covered by the data, the more accurately Benford’s Law seems to apply.

In other words, a list that spans numbers as small as 100,000 and as large as billions is likely to follow the law closely. For example, this chart shows how closely the populations of the 237 countries in the world (red bars) match Benford’s Law (the black dots).

The American astronomer Simon Newcomb published a paper in 1881 based on the fact that in his logarithm tables the earlier pages were much more worn than the other pages, implying that he was looking up numbers starting with 1 and 2 more often than others. If you have no idea what I’m even talking about, check this out. He postulated the formula in Benford’s law for first digits of 1 and 2. In 1938, physicist Frank Benford tested the theory on twenty different sets of numbers and was thus credited with the law. His data sets included the surface areas of 335 rivers, the sizes of 3,259 US populations, 1,800 molecular weights, and 308 numbers contained in an issue of Reader’s Digest.

Benford’s Law is not a law, and will not apply to sets of numbers that are restricted in value, like the phone numbers in Philadelphia (since almost all will start with 2, 4, or 6). A set of numbers that does not match Benford’s Law is not necessarily wrong, but might be worth a second look. If someone is manufacturing numbers, the numbers are unlikely to match Benford’s Law.

Why does this law work? It has to do with the distribution of numbers in a logarithm scale, and explains why the wear on Simon Newcomb’s logarithm tables led to his initial discovery of the relationship.
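For anyone who wants to try the check on their own list, here is a minimal Python sketch. It assumes the usual statement of the law, that the share of numbers with first digit d is log10(1 + 1/d); that formula is not spelled out in this post. The function names are made up for illustration, and the powers of 2 are just a convenient test list that is known to follow the law closely, not data from the post.

```python
import math
from collections import Counter
from decimal import Decimal

def benford_expected():
    """Expected share of each first digit 1-9, assuming P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """First nonzero digit of a nonzero number, ignoring its sign."""
    # Decimal does not store leading zeros, so the first stored digit is the one we want.
    return Decimal(str(abs(value))).as_tuple().digits[0]

def first_digit_shares(values):
    """Observed share of each first digit 1-9 in a list of numbers (zeros are skipped)."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Illustrative test list only: the first 200 powers of 2.
powers = [2 ** n for n in range(1, 201)]
expected = benford_expected()
observed = first_digit_shares(powers)
for d in range(1, 10):
    print(d, f"expected {expected[d]:.3f}", f"observed {observed[d]:.3f}")
```

To check a real list, pass it to first_digit_shares and compare the result with benford_expected. A large gap is a reason for a second look, not proof of anything.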

Some relationships do not obey Benford’s Law, including distributions created from square roots or reciprocals. It also does not apply to sequentially assigned numbers like check numbers. Numbers that are the result of mathematical combinations, like quantity times price, are generally expected to follow it.

At various times, evidence based on Benford’s Law has been admitted in criminal cases at US local, state and federal levels. It has been used as evidence of fraud in the 2009 Iranian elections, although experts tend to discount Benford’s Law as an indicator of election fraud.

Mark Nigrini, a well-known South African author of Forensic Analytics, has shown that Benford’s Law could be used in forensic accounting and auditing, which is how this post started.

The last word:

As I was talking about this post, my wife said that this law should also apply to the number of children in a family. In her genealogical research, it appeared to her that there are a lot of families with just a few children and, especially in the past, families with large numbers of children, more than 9. I could not find any overall statistics to support or deny this claim; most government statistics talk about 1, 2, and “3 or more” children. However, I did find one family tree that had the statistics I wanted covering 344 families with up to 15 children in a family.

Comments solicited.

Keep your sense of humor.

Walt.

I like statistics. When properly used, they can tell you what has actually happened in the past. Statistics can provide valuable information to help you run your company or for the government to run the country. Statistics can tell you how closely two sets of data are related, their correlation. You might notice, for example, that since you introduced pastel colored widgets, your sales to teenage girls have significantly increased. You might jump to the conclusion that teenage girls prefer pastel colored widgets, and you might be right. On the other hand, the increase in sales to teenage girls could be due to your increased marketing of widgets in women-only high schools and colleges.

When statistics tell you that two quantities vary together, most people will believe that they are related in some way. You should always beware of jumping to conclusions. Correlation does not equal causation. Here are three very high correlation examples from Tyler Vigen’s book “Spurious Correlations.” I suspect there really is no relationship between the two quantities in each case.

Even if there is an actual cause and effect relationship, it may not be in the direction you think.

Your company collects more and more data about its operation, products and customers. Additionally, thousands of data sets are available from public and private sources about behavior, health, poverty rates, driving accidents and just about anything you can think of. Given enough processor power, you can search for correlations among these data sets. Sometimes these “strange” correlations can prove valuable. A dozen years ago, an almost random check found that auto accidents involving personal injury or death, compared across the counties of one state, had a very high correlation with the number of people over 55 who were taking a specific medicine. The resulting investigation by the pharmaceutical company that manufactured the drug led to increased warnings to doctors and patients about a previously unsuspected age-dependent side effect.

When someone brings you one of these correlations, pay attention, but apply reason. Correlation is not causality.
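To see how easily a high correlation turns up, here is a minimal Python sketch that computes the Pearson correlation coefficient by hand. The two lists are invented purely for illustration (they are not from Tyler Vigen or any real source), and their names are hypothetical. All they have in common is that both grow steadily over ten years.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Invented yearly figures: two unrelated quantities that both happen to rise.
widget_sales = [103, 109, 127, 133, 151, 157, 175, 181, 199, 205]
library_fines = [41, 44, 52, 53, 61, 66, 69, 77, 80, 86]

r = pearson(widget_sales, library_fines)
print(f"correlation = {r:.3f}")
```

The coefficient comes out very close to 1, yet nothing in the calculation says that widgets drive library fines or the other way around. Any two series that trend in the same direction will score high.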

The last word:

President Obama and many other politicians on the left want to make it illegal for law abiding citizens to own a gun. In their view, only the government should have any weapons. They want to eliminate the Second Amendment to the US Constitution. The primary reason the first session of the US Congress included that amendment in the Bill of Rights was the recent experience with their prior government. The British Government severely limited gun possession in towns and cities; they could not police the rest of the colonies. They feared, rightly it turned out, that the colonists could use those weapons against the British government. The US Founding Fathers wanted to make sure that a future government could not take away citizens’ rights without the citizens having a last resort to deal with a run-amok government.

President Obama will tell you that eliminating all legal guns is the solution to these tragic mass-shooting events. But we know that is a false argument. Almost every one of the mass shooting events in the past two decades has been in a “gun-free zone.” We have been steadily increasing the number of these zones, so they now include virtually every school, sporting event, shopping area, government facility, and even most portions of our military bases. We actually put signs up to show potential terrorists where they will have five to thirty minutes of unbothered time to kill as many unarmed victims as they can.

Consider the recent Oregon tragedy. Chris Mintz is a student at Umpqua Community College. As a decorated Army veteran, he tried to stop the gunman before he entered the classroom where the gunman killed nine students. Mr. Mintz was shot seven times for his bravery. If Mr. Mintz had a weapon with him, the results could have been vastly different.

Oregon state law actually requires that colleges allow guns on campus in some circumstances. At a minimum, a college must allow a visitor with a carry permit to bring a gun on campus, but not necessarily a student. Until police arrived, the gunman was the only person with a weapon on the campus.

Gun control laws do not keep guns out of the hands of criminals and terrorists; they only keep them out of the hands of law-abiding citizens. Chicago, with restrictive gun control laws, had over 400 murders in 2014. That is the equivalent of an Umpqua Community College event every 8 days.

We are painting a target on the back of our children.

Comments solicited.

Keep your sense of humor.

Walt.

I like facts.  I especially like facts that are backed up by measurable and reproducible numbers.  The way people talk about numbers sometimes annoys me in three areas: accuracy, precision and presentation.

Accuracy is the degree of closeness of measurements of a quantity to that quantity’s actual value.

Precision is defined as the degree to which repeated measurements under unchanged conditions show the same results.

Presentation is how the numbers are presented for a specific purpose.

A given measurement can be accurate but not precise, precise but not accurate, neither, or both.  Consider your car’s odometer.  It will be very precise, probably able to reproduce the same distance measurement between two points along the same path within a few feet.  But it may not be very accurate, due to tire pressure, which lane you were driving in, or other factors.
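One rough way to put numbers on the odometer example is sketched below in Python. It treats the average error as a stand-in for accuracy and the standard deviation of repeated readings as a stand-in for precision. The readings, the 10.00 mile route length and the function name are all invented for illustration.

```python
import statistics

def accuracy_and_precision(readings, true_value):
    """Return (bias, spread): average error versus the true value, and the
    standard deviation of the repeated readings."""
    bias = statistics.mean(readings) - true_value
    spread = statistics.stdev(readings)
    return bias, spread

# Invented odometer readings, in miles, over a route assumed to be exactly 10.00 miles.
odometer = [10.31, 10.30, 10.31, 10.30, 10.31]
bias, spread = accuracy_and_precision(odometer, 10.00)
print(f"bias = {bias:.3f} miles, spread = {spread:.4f} miles")
```

With these made-up numbers the spread is tiny, so the odometer is precise, but every reading is about three tenths of a mile too long, so it is not very accurate.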

These two terms are often used interchangeably in normal conversation or even in scientific papers.  Often the concept is expressed either in terms of significant digits or a range.  For example, the sun has a diameter of about 865,374 miles (more than three times the distance from Earth to the moon).  It is not exactly 865,374 miles across – that is just the best estimate.  But for most people, knowing that the sun is about 900,000 miles across is good enough for daily conversation.  That is a one significant digit answer.  870,000 miles would be a two significant digit answer.

Most people make an unconscious correlation between the number of significant digits and accuracy.  If I told you there were 217 people at our meeting, you would believe that I actually counted them.  If I told you there were about 200 people at our meeting, you would believe I did not actually count them.  You might even think that I was deliberately overestimating for some reason, such as to show we had a lot of support for some position or action.  In fact what may have happened is that I estimated 200 people in the meeting, and Joe asked if I had counted the 17 people in the balcony.  When I said “no” Joe added them together and published the result.

When combining numbers in any way, the significance of the answer can be no higher than the least significance of any of the individual numbers.  Taking a one significant digit number (200) and adding to it a two significant digit number (17) should yield a one significant digit number.  The reported number should have been 200, not 217.
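The rounding in the sun and meeting examples can be done mechanically. The short Python sketch below shows one way to do it; the helper name round_sig is made up for illustration and is not a standard library function.

```python
import math

def round_sig(value, digits):
    """Round a number to the given count of significant digits."""
    if value == 0:
        return 0
    # Position of the leading digit: 5 for 865,374 and 2 for 217.
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - magnitude)

print(round_sig(865374, 1))    # sun's diameter to one significant digit
print(round_sig(865374, 2))    # sun's diameter to two significant digits
print(round_sig(200 + 17, 1))  # the meeting head count, reported honestly
```

The three results are 900000, 870000 and 200, matching the figures in the text.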

A classic example of “precision enhancement” is women’s gymnastics at the Olympics and other significant events.  The score is formed by a number of judges.  Each judge provides two numbers: the degree of difficulty (D) and execution (E).  D starts at 0.0 and increases based on the skills successfully completed.  E starts at 10.0 and decreases based on errors in performance.  Long gone are the perfect tens of Nadia Comaneci.  World-class performers are typically in the 15.5 to 15.9 range.  Anything above 16.0 is an exceptional score.  Each judge provides a pair of numbers with three significant digits.  None of these values is very accurate or precise.  They are not real measurements; just subjective judgments based on a complex set of tables of difficulty and execution values.  Different judges will give different scores for the same performance, and the same judge will give different scores for virtually identical performances at different times.  When the performances in the all-around final at the 2012 London Olympics were over, U.S. gymnast Alexandra Raisman and Russian Aliya Mustafina had the same score:  59.566.  The judges went to a tiebreaker based on individual events, which awarded the bronze medal to the Russian.  My problem is not with the tiebreaker, but with basing anything on a number with five significant digits derived from a series of inaccurate numbers with only three significant digits at most.

Ever watch those CSI-like shows where they zoom in on a license plate or face from a grainy ATM camera?  That is a similar kind of “precision enhancement.”  If you start with a low resolution picture, you may be able to clean it up a little, but those extra pixels with the details just do not exist.

When it comes to presentation, most people cannot make much sense out of a huge table of numbers.  To make the hidden significance clearer, most people use statistics and resulting graphs and charts.  Ideally, these results actually give important clues to what the numbers are really saying.  However, “statistics” can be something very different.  Mark Twain is usually credited with originating the phrase “liars, damn liars and statisticians.”

The actual Twain quote is “There are three kinds of lies: lies, damned lies, and statistics” from his 1906 “Chapters from My Autobiography,” published in the North American Review. Twain himself attributed it to British Prime Minister Benjamin Disraeli.  However, the phrase was never used in any of Disraeli’s surviving writings and the earliest known use of the phrase was years after Disraeli’s death.  There are a few uses of the phrase or something very close to it in the period 1885-1891 in both England and the United States.  Samuel Clemens (Mark Twain’s real name) was born in 1835, had started writing for newspapers by the mid 1860s, and published his first book, The Innocents Abroad, or The New Pilgrims’ Progress, in 1869.  He could certainly have heard the phrase, or even originated it.  After all, he had been in England in 1872.

As a group, politicians are very good at taking facts and turning them into “statistics” that support their particular viewpoint.  I quoted “statistics” because statistics is a science: the study of the collection, organization, analysis, interpretation and presentation of data.  When done correctly, statistics are extremely valuable at summarizing a lot of numbers into something that is easy to grasp and understand.  They can show how two or more sets of facts are related, and can indicate the degree of correlation between two sets of data.  This is often not the “statistics” used.

Statistics were a prime driver in the war against cigarettes after they showed a clear correlation between smoking and lung cancer.  Note that the statistics did not indicate whether or not there was any cause and effect; or, if there was, which was the cause and which was the effect.  In the case of cigarettes and cancer, it was fairly obvious that getting cancer did not cause one to smoke.  But there could have been an unstudied third factor that caused both.  There probably is not a third factor in this particular case, but at least think about the possibility when some politician or advertisement uses “statistics” to “prove” cause and effect.

Another favorite trick is to hide exactly what is being measured. A pharmaceutical company can measure its drug against a placebo (in effect, doing nothing) and claim that it is 85% effective.  Their competitor’s drug may also be 85% effective, but that was not what they were measuring.  Or they may deliberately select and compare different classes of people in the study to skew the results in their favor.  In general, if others cannot get hold of the raw data and perform an independent statistical analysis, then beware.  Governments almost always hide the raw data.

Even if you never let facts get in the way of a good story, it is a good idea to know what the numbers really say.   If nothing else, it makes it harder to be blindsided by someone who really knows.

In reality, most people and organizations try to use statistics responsibly and usually succeed.  But when you get two entirely different sets of “statistics” about the same question, then at least one of them is cooking the numbers to suit their own purpose.  My advice: know who is putting out the information and determine what they gain by what they are showing.  Beware of surveys of “public opinion” unless they are by reputable pollsters and have provided their selection and measurement details.  Lots of these “public opinion” polls are based on self-selected respondents.  I.e., they are the people who called into a specific radio show or were customers at a specific chain of stores.  Not necessarily an unbiased group.

All valid survey results should include a “plus or minus x percent” statement indicating the calculated error range for the result.  When the plus or minus value is larger than the difference between the compared values, just ignore the poll.  “52% of the people surveyed prefer my brand, so you should also.  Survey plus or minus 5%.”  Five percent is bigger than the 4% difference between the 52% who prefer my brand and the 48% who do not.
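That rule of thumb is simple enough to write down as a few lines of Python. This is only a sketch of the comparison described above, not a formal statistical significance test, and the function name is made up for illustration.

```python
def lead_exceeds_margin(share_a, share_b, margin):
    """Rule of thumb from the text: a lead counts only when the gap between
    the two percentages is larger than the plus-or-minus margin."""
    return abs(share_a - share_b) > margin

# The brand survey from the text: 52% versus 48%, plus or minus 5%.
print(lead_exceeds_margin(52, 48, 5))

# A hypothetical survey with a much wider gap, same margin.
print(lead_exceeds_margin(58, 42, 5))
```

The first call prints False (a 4 point gap inside a 5 point margin, so ignore the poll) and the second prints True.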

The last word:

Samuel Clemens took his most famous pen name from his work on Mississippi Riverboats.  Since the river constantly changed it was important to know how deep the water was right here right now.  Depth ranging by sonar was a little in the future, so the method was to tie a heavy weight to the end of a rope and throw the weight overboard while holding on to the other end and taking up any slack.  Simply measuring the length of the rope in the water gave you a good approximation of the water depth. Mississippi River sailors would usually tie a knot in the rope every width of their outstretched arms, about six feet.  This unit was a “fathom” and riverboats needed twelve feet of water to safely navigate.  When they threw the sounding line in they would announce the depth of the water by the number of knots, or marks, in the water.  Even sailors read the Bible, and it often uses “twain” for the number two.  So a sailor would shout out “by the mark twain” meaning it was safe, at the moment.

The weight at the end of the rope was often made of lead and was always called a “lead” even if it was just a big rock.  Potentially, the origin of the phrase “get the lead out?”

Comments solicited.

Keep your sense of humor.

Walt.
