About EPJ
The European Physical Journal (EPJ) is a series of peer-reviewed journals covering the whole spectrum of physics and related interdisciplinary subjects. EPJ is committed to high scientific quality in publishing and is indexed in all main citation databases.
Latest news
EPJ Data Science Highlight - Twitter’s tampered samples: Limitations of big data sampling in social media
- Published on 16 January 2019

Social networks are widely used as sources of data in computational social science studies, and so it is of particular importance to determine whether these datasets are bias-free. In EPJ Data Science, Jürgen Pfeffer, Katja Mayer and Fred Morstatter demonstrate how Twitter’s sampling mechanism is prone to manipulation that could influence how researchers, journalists, marketeers and policy analysts interpret their data.
(Guest post by Jürgen Pfeffer, Katja Mayer and Fred Morstatter, originally published in the SpringerOpen blog)
Despite the many scandals surrounding social media companies and their practices of data sharing, they are still central platforms of opinion formation and public discourse. Therefore, social media data is widely analyzed in academic and applied social research. Twitter has become the de facto core data supplier for computational social science as the company provides access to its data for researchers via several interfaces. One of these – the “Sample API” – is promoted by Twitter as follows:

Twitter’s Sample API provides 1% of all Tweets worldwide for free, in real time – a great data source for researchers, journalists, consultants and government analysts to study human behavior. Twitter promises “random” samples of their data. The randomness of a sample – each element has an equal probability of being chosen – is of high importance for social scientific methodological integrity, as a randomly selected sample is regarded as a valid representation of the total population. Even though Twitter shares (parts of) its data with potentially everybody (unlike other social media companies), the company does not reveal details about its data sampling mechanisms.
We set up experiments to test the sampling procedure of the Sample API by injecting Tweets into the feed in such a way that they appear in the sample with high certainty. In other words, while a Tweet should have a 1% chance of being part of Twitter’s 1% sample, it is easily possible to increase that chance to 80%. Consequently, finding 100 Tweets in the 1% sample related to a certain topic might not result from a random sample of 10,000 Tweets but from a manipulated sample based on just 125 Tweets.
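The arithmetic behind these two figures can be checked with a short sketch. The function name `implied_population` is ours, chosen for illustration only; the calculation simply divides the number of Tweets observed in the sample by the probability that any single Tweet is included.

```python
def implied_population(sample_count: int, inclusion_probability: float) -> float:
    """Expected number of source Tweets behind `sample_count` sampled Tweets.

    Hypothetical helper for illustration: if each Tweet lands in the sample
    with probability `inclusion_probability`, then on average
    sample_count / inclusion_probability Tweets were needed to produce
    `sample_count` sampled ones.
    """
    if not 0 < inclusion_probability <= 1:
        raise ValueError("inclusion_probability must be in (0, 1]")
    return sample_count / inclusion_probability


# A truly random 1% sample: 100 sampled Tweets imply about 10,000 source Tweets.
random_case = implied_population(100, 0.01)

# A manipulated stream where injected Tweets are sampled with 80% probability:
# the same 100 sampled Tweets can come from only about 125 source Tweets.
manipulated_case = implied_population(100, 0.80)

print(round(random_case), round(manipulated_case))
```

The same count in the sample can therefore correspond to volumes of underlying activity that differ by a factor of 80.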

This figure illustrates the effect of a Tweet injection experiment during the November 2016 US presidential election campaigns using the hashtag #trump. The gray area represents Tweets in the 1% sample from 328 million users, red represents the injected Tweets, and the black line illustrates the 1% Sample API Tweets. One hundred accounts were enough to manipulate the data stream for a globally important topic.
We also developed methods to identify over-represented user accounts in Twitter’s sample data and show that intentional tampering is not the only way Twitter’s data can become skewed. For instance, automated bots can accidentally be over-represented in the data samples or be entirely invisible. We also show evidence that Twitter seems to allow corporate users to send many more Tweets than regular users, which automatically inflates their position in the data.
Our study lists potential solutions both for the architectural flaws and for regaining scientific integrity. The latter could be achieved by making sampling methods transparent and by cooperating more closely with social media researchers to create open interfaces and better ways to assess the data at hand. At a time when decision making is based increasingly on the analysis of social data, industry too should do everything it can to enhance public trust in the methodologies at hand.
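To give a sense of what "over-represented" can mean statistically, here is a minimal sketch. It is not the method used in the study. It assumes, hypothetically, that we know how many Tweets each account sent in total and how many of them appeared in the sample, and it asks how likely that many or more would appear if every Tweet really had a 1% chance of inclusion. The function names, the account names and the threshold `alpha` are all illustrative assumptions.

```python
from math import exp, lgamma, log


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so that large binomial coefficients do not overflow.
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


def flag_overrepresented(accounts, p=0.01, alpha=1e-6):
    """List accounts whose sampled count is implausibly high under rate p.

    `accounts` maps an account name to (tweets_sent, tweets_in_sample).
    Both numbers are assumed to be known; the sample alone does not
    provide the total number of Tweets sent.
    """
    flagged = []
    for name, (sent, sampled) in accounts.items():
        if binomial_upper_tail(sent, sampled, p) < alpha:
            flagged.append(name)
    return flagged


# Made-up example data: one account close to the expected 1% rate,
# one with 80 of its 100 Tweets in the sample.
accounts = {"regular_user": (1000, 12), "suspicious_user": (100, 80)}
print(flag_overrepresented(accounts))
```

An account flagged this way is not necessarily tampering with the sample; as noted above, bots can end up over-represented by accident.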
Even though some big data evangelists state that sampling is “an artefact of a period of information scarcity”, reality makes sampling a central necessity in times of information abundance. Researchers have to trust Twitter to supply them with methodologically sound samples while dealing with all kinds of other problems, such as bias and ethical issues.
Open calls for papers
- EPJ ST Special Issue: Heat Transfer in Nanofluids: Dynamics and Recent Developments
- EPJ ST Special Issue: Modeling and simulation of heat/mass transport, nucleation and growth kinetics in phase transformations
- EPJB: Topical Issue on Recent Advances in the Theory of Disordered Systems
- EPJ ST Special Issue: Memristor-based systems: Nonlinearity, dynamics and application
- EPJ ST Special Issue: Nonextensive Statistical Mechanics, Superstatistics and Beyond: Theory and Applications in Astrophysical and Other Complex Systems
- EPJ Quantum Technology: Special issue on Quantum Magnetometers
- EPJ ST Special Issue: Diffusion dynamics and information spreading in multilayer networks
- EPJ D Topical issue: Dynamics of Systems on the Nanoscale
- EPJE Topical Issue: Dielectric Spectroscopy Applied to Soft Matter
- EPJA Topical Issue: First joint gravitational wave and electromagnetic observations: Implications for nuclear and particle physics
- EPJ B Special Issue: Non-Linear and Complex Dynamics in Semiconductors and Related Materials
- EPJ E Topical Issue: Branching Dynamics at Mesoscopic Scale
- EPJ E Call for papers: Thermal non-equilibrium phenomena in soft matter
- EPJ AM Call for papers: Themed Issue on Terahertz metamaterials
- EPJ D Topical Issue: Quantum Correlations
- EPJ B Special Issue: Complex Systems Science meets Matter and Materials
- EPJ B Special Issue on Multiscale Materials Modeling
- EPJ Data Science: Thematic series on human mobility
- EPJ Quantum Technology: Thematic Series on Space Applications of Quantum Technology
- EPJ Techniques and Instrumentation: Thematic Series on Novel Plasma Diagnostics