Both language (word frequencies) and population (city sizes) drop off at both ends of the straight line on a log-log plot.
Benford's law fits population and language better than Zipf's, capturing the most common words more accurately: it sits below the straight log-log line at the front end, where Zipf's does not. But it seems sensitive to change.
Yule-Simon is best in the sense that it is an algebraic function that is easily solvable and improves on Zipf's, dropping off at the high end of a log-log plot as population and language data do. It is based on evolution, I believe by considering new species being added. When modified "with memory" (no longer purely algebraic, probably a differential equation), it was made to work really well. It might apply really well to social/computer networks where nodes are added; words have connections to each other like a network.
Double Pareto Log-Normal (DPLN) seems to attract more interest, and may even be applicable to a lot of physics. It combines "geometric Brownian motion" (GBM, a differential equation with a feed source and random changes) with Yule-Simon. The GBM is a "pure form" of Gibrat's law for cities. Gibrat's law says cities start with a log-normal distribution, which I believe causes the tail end to drop off, since Yule drops off the other end. Pareto is straight on a log-log plot and has a "killing constant" that might drop off the tail. I do not know why it is called double Pareto unless it is because it is like using two Pareto curves, one for the top and one for the bottom.
The differential equations seem to be needed because they allow "feedback": the current state is used to calculate future states. For example, words, species, and cities compete with each other for popularity in a limited "space". People feed words by employing them, environments feed (employ) species, and cities employ (feed) people. But once the feeding gets massive, there is a drawback: the more a word is used, the less information it can convey, due to how Shannon entropy per word is calculated; city density starts decreasing the efficiency benefits; environments run out of food. On the tail end, rare words carry a lot of information, but few people know them; fewer members of a species means fewer mating opportunities for gains in natural selection (Darwin realized this); fewer people means fewer job options. There is a wide middle ground with an exponential: it is higher on the tail end as the "population" benefit starts to kick in, and decreases at the high end as the efficiency exponential starts being blocked by the energy (species), time (language), or spatial (cities) limits.
This is possibly my favorite of the articles:
I checked Benford's law, 2.2·log10(1 + 1/r), against Mandelbrot's modified Zipf law, ~1/(r + 2.7), for English. After rank 21 the difference is less than 5%. It is higher for ranks 1 to 21, matching the first few English words better. Both are too high for large r. Benford also predicts country populations better.
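A minimal sketch of that comparison (the 2.2 scale factor and the 2.7 shift are the values quoted above; base-10 log is assumed, as in Benford's law):

```python
import math

def benford(r):
    # Benford-style rank law, scaled by 2.2 as in the text
    return 2.2 * math.log10(1 + 1 / r)

def mandelbrot(r):
    # Mandelbrot's modified Zipf law, ~1/(r + 2.7)
    return 1 / (r + 2.7)

# relative difference between the two, at a few ranks
for r in (1, 5, 21, 50, 100):
    rel = abs(benford(r) - mandelbrot(r)) / mandelbrot(r)
    print(f"rank {r:3d}: benford={benford(r):.4f} "
          f"mandelbrot={mandelbrot(r):.4f} diff={rel:.1%}")
```

Running this reproduces the pattern described: a large gap at the first few ranks, shrinking below 5% once r passes the low twenties.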
Concerning the relationship between Zipf's and Benford's laws:
The Pareto distribution is a similar function applied to wealth, P(X > x) = (Xmin/x)^a with a greater than 1, and has been used as a measure of wealth inequality.
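A quick sketch of sampling from that tail function by inverse transform (the function name and parameter values are mine, for illustration only):

```python
import random

def pareto_sample(xmin, a, n, seed=0):
    # P(X > x) = (xmin / x)**a for x >= xmin, shape a > 1.
    # Inverse transform: X = xmin / U**(1/a), U uniform on (0, 1].
    rng = random.Random(seed)
    return [xmin / (1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]

samples = pareto_sample(xmin=1.0, a=1.5, n=1000)
```

Every sample is at least xmin, and the smaller the shape a, the heavier the tail (i.e., the more unequal the "wealth").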
But it appears the wide-ranging real-world observations of these power-like laws are largely the result of "preferential attachment". In short, "success breeds success": the rich get richer. Words that are common become more common because they are common; the same goes for cities and species. Darwin wrote about how species become distinct because a larger population to breed with gives more options for the best selecting the best. Cities become more efficient in terms of providing potential employment. Companies gain efficiency as they get larger, allowing them to get larger still. The kind of ranking that results from this is the Yule-Simon distribution. On a log-log plot, it gives the most common words lower frequencies than a straight line would predict, which is what words actually do. Its formula is
freq = x*x!*R!/(x + R)!

where x! is the gamma function of x+1, x is a real value greater than 0, and R = rank - 1, so (x+R)! is the gamma function of (x+1+R). The gamma function is the continuous version of (N-1)!. I would call x the "amplifier" in the positive feedback. For x = 1 it reduces to R!/(1+R)! = 1/(R+1) = 1/rank, which is Zipf's law.
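A small sketch of evaluating that formula as written above (my reading of it, not a standard library function; log-gamma is used to avoid overflow at large ranks):

```python
from math import lgamma, exp

def yule_simon_freq(rank, x):
    # freq = x * Gamma(x+1) * Gamma(R+1) / Gamma(x+R+1), with R = rank - 1
    R = rank - 1
    return x * exp(lgamma(x + 1) + lgamma(R + 1) - lgamma(x + R + 1))

# x = 1 reduces to Zipf's law, 1/rank
print([round(yule_simon_freq(r, 1.0), 4) for r in (1, 2, 5, 10)])
# → [1.0, 0.5, 0.2, 0.1]
```

Values of x above 1 bend the low ranks down relative to 1/rank, which is the behavior the text attributes to real word data.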
But it is inadequate for the tail end, since it stays straight where the data needs to drop off. One of the following papers used the formula expressed as P(r) = 1/r^a where a = 1 + 1/(1-p), p being a constant probability of a new word being added during a time step. In that version they modified it to have a downward-concave shape, and it worked really well.
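The role of p can be seen in a toy simulation of Simon's process (my sketch, not from the papers): with probability p a brand-new word appears; otherwise an earlier token is copied, so a word's chance of reuse is proportional to its current frequency.

```python
import random

def simon_model(n_tokens, p, seed=0):
    # Simon's preferential-attachment text model.
    rng = random.Random(seed)
    tokens = [0]          # start with one word, id 0
    next_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < p:
            tokens.append(next_id)   # new word enters
            next_id += 1
        else:
            tokens.append(rng.choice(tokens))  # copy an earlier token
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return sorted(counts.values(), reverse=True)

freqs = simon_model(20000, p=0.1, seed=42)
```

Plotting rank against freqs on log-log axes gives an approximately straight line, which is exactly the point: the plain process produces the straight power law, not the droop at the ends that real data shows.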
It has been shown to model language excellently, and city populations as well; of the papers below, one finds Yule-Simon works better in language, another that it works better in cities.
But there is a dropping off of the log-log straight line at both ends in most data that the straight Yule-Simon law does not handle. Successful cities do not merely add new nearby cities, as Yule would have it: the bigger city's relative population drops off from the straight line, which is another way of saying that overpopulation may start losing the efficiency of its attraction. On the tail end there are other disadvantages. Commonly-used words are used more often because they are common, but since they convey less information by being common, the effect is limited, which prevents a straight log-log curve. On the other end, rare words are rarer than expected because not enough people know them to be able to use them regularly. Similarly, cities would follow a strict log-log curve due to statistics alone, but inefficiencies are created for different reasons in the most and least populated regions. Animals either start eating each other's food source or are not able to find a mate. Wealth, on the other hand, may not be subject to an "overpopulation" effect.
So the DPLN may be the ultimate:
For cities, if not a wide range of physics, it seems better to combine Yule with Geometric Brownian Motion (GBM: random variation of a random variable, with a fuel source for new entrants), which is supposed to be Gibrat's log-normal law for cities in its pure form.
"A random variable X is said to follow GBM if its behavior over time is governed by the following differential equation dX = (µdt + σdB)X, where dB is the increment of a standard Brownian motion (a.k.a. white noise). For a GBM the proportional increment of X in time dt comprises a systematic component µdt, which is a steady contribution to X, and a random component σdB, which fluctuates over time. Thus the GBM can be seen to be a stochastic version of simple exponential growth."
GBM feeds in new populations or words, and where they settle has a random fluctuation. Maybe this somehow causes the tail to drop off, as Yule causes the high end to drop off.
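A minimal Euler-Maruyama discretisation of the GBM equation quoted above (step size and parameter values are arbitrary choices of mine):

```python
import math
import random

def gbm_path(x0, mu, sigma, dt, n_steps, seed=0):
    # dX = (mu*dt + sigma*dB) * X, with dB ~ Normal(0, dt)
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        dB = rng.gauss(0.0, math.sqrt(dt))
        x += (mu * dt + sigma * dB) * x
        path.append(x)
    return path

# with sigma = 0 the noise vanishes and this is plain compound growth
deterministic = gbm_path(1.0, mu=0.05, sigma=0.0, dt=0.01, n_steps=100)
```

Setting sigma = 0 recovers simple exponential growth, which is the "stochastic version of simple exponential growth" point in the quote.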
Here's the best complete explanation of city sizes.
"The double Pareto lognormal seems more appropriate since it comprises a lognormal body and power law tails. Reed suggests a GBM model, similar to the one that models personal incomes, for obtaining the settlement size distribution. Individual human settlements grow in many different ways. At the macro level a GBM process can be used to model the size growth by assuming a steady systematic growing rate and a random component. The steady growing rate reflects the average growth rate over all settlements and times, and the random component reflects the variability of the growth rate. The time when a city is founded varies from settlement to settlement. If we assume in the time interval (t, t + dt) any existing settlement can form a new satellite settlement with probability λdt, the creation of settlements is a Yule process, which was first proposed as a model for the creation of new biological species. Under the Yule process, the expected number of settlements is e^λt after t time since the first settlement. That is, the number of settlements is growing at rate λ. Therefore, the existing time for all settlements is exponentially distributed. It is straightforward to conclude that under GBM and Yule processes, the overall settlement size distribution will be a double Pareto distribution. If we further assume a lognormal initial settlement size, the result will converge to the double Pareto lognormal distribution."
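The mechanism in that quote can be sketched end to end: settlement ages are exponentially distributed (from the Yule founding process at rate λ), initial sizes are lognormal, and each settlement grows by GBM over its age, using the closed-form GBM solution log S = log S0 + (µ − σ²/2)T + σ√T·Z. All parameter values here are illustrative guesses of mine, not from Reed:

```python
import math
import random

def settlement_sizes(n, mu, sigma, lam, seed=0):
    # Each settlement: exponential age T (rate lam), lognormal
    # initial size S0, then GBM growth over T.
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        T = rng.expovariate(lam)        # settlement age (Yule process)
        log_s0 = rng.gauss(0.0, 0.5)    # lognormal initial size
        z = rng.gauss(0.0, 1.0)         # GBM noise
        log_s = log_s0 + (mu - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z
        sizes.append(math.exp(log_s))
    return sizes

sizes = settlement_sizes(10000, mu=0.02, sigma=0.2, lam=0.05, seed=1)
```

A histogram of log(sizes) shows the claimed shape: a lognormal-looking body with straighter, power-law-like tails on both sides, i.e., the double Pareto lognormal.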
Reed 2004: the invention of the DPLN; applicable to physics.