FindFiles.net - statistical data & scientific collaborations
The Wikipedia/Dmoz corpus
In 2010 we searched all hosts linked in Wikipedia and the
Open Directory Project (DMOZ), all editions. The datafile
contains all 252 million files found on 7.7 million hosts, in the
format
[805M] Wikipedia/Dmoz corpus: file-size, Mime-Type, hosting-domain
The hosts are anonymised and numbered consecutively; the respective top-level domain has been retained. The file size is in bytes.
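As a rough illustration of working with the datafile, the following Python sketch tallies the number of files and the total bytes per MIME type. The field delimiter is not stated here, so the sketch assumes one record per line with comma-separated fields in the listed order (file size, MIME type, hosting domain). The function name `summarise_by_mime` and the sample host labels such as `17.org` are hypothetical.

```python
import csv
import io
from collections import defaultdict

def summarise_by_mime(lines):
    """Return {mime_type: (file_count, total_bytes)} for an iterable of text lines.

    Assumed record layout (not confirmed by the corpus description):
    file size in bytes, MIME type, anonymised hosting domain, comma-separated.
    """
    totals = defaultdict(lambda: [0, 0])
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        size_text, mime, _host = (field.strip() for field in row)
        if not size_text.isdigit():
            continue  # skip records without a numeric size
        totals[mime][0] += 1
        totals[mime][1] += int(size_text)
    return {mime: tuple(pair) for mime, pair in totals.items()}

# Hypothetical sample records; real host labels may look different.
sample = io.StringIO(
    "1024,image/jpeg,17.org\n"
    "2048,image/jpeg,42.com\n"
    "500000,audio/mpeg,17.org\n"
)
print(summarise_by_mime(sample))
```

For the full datafile, passing an open file handle instead of the in-memory sample keeps memory use low, since records are processed one line at a time.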
We welcome scientific collaborations regarding network, web sciences
and/or complex system analysis of the FindFiles.net data corpus.
Please contact Claudius Gros.
Neuropsychological constraints to human data production on a global scale
Analyzing the file size distribution of the Wikipedia/Dmoz corpus
for several distinct data types we find indications that
the neuropsychological capacity of the human brain
to process and record information may constitute the dominant
limiting factor for the overall growth of globally stored
information, with real-world economic constraints having
only a negligible influence. This supposition draws support
from the observation that the file size distributions follow
a power law for data without a time component, like images,
and a log-normal distribution for multimedia files, for
which time is a defining qualia.
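To make the two candidate distributions concrete, the sketch below estimates a power-law exponent and the log-normal parameters from a list of file sizes, using standard maximum-likelihood formulas. It is not the fitting procedure of the article, which is not described here. The function names, the cutoff `x_min` and the synthetic test data are assumptions for illustration, and the sketch does not by itself decide which distribution fits a given data type better.

```python
import math
import random

def fit_power_law(sizes, x_min):
    """Maximum-likelihood exponent alpha for p(x) ~ x**(-alpha), x >= x_min."""
    tail = [x for x in sizes if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

def fit_log_normal(sizes):
    """Maximum-likelihood mu and sigma of ln(x) for a log-normal distribution."""
    logs = [math.log(x) for x in sizes]
    mu = sum(logs) / len(logs)
    variance = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(variance)

# Synthetic data only; none of these numbers come from the corpus.
rng = random.Random(0)
# Inverse-transform sampling of a power law with exponent 2.5 above 1000 bytes.
power_law_sizes = [1000.0 * (1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(20000)]
log_normal_sizes = [rng.lognormvariate(15.0, 1.0) for _ in range(20000)]

alpha = fit_power_law(power_law_sizes, x_min=1000.0)
mu, sigma = fit_log_normal(log_normal_sizes)
print(f"alpha = {alpha:.2f}, mu = {mu:.2f}, sigma = {sigma:.2f}")
```

On a log-log plot a power law appears as a straight line, whereas a log-normal distribution appears as a parabola, which is one simple visual check to run alongside such estimates.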
The full article is available on the
arXiv server
and will be published in the
European Physical Journal B.
Our results have been discussed by the MIT Technology Review and
other web science portals.
The illustrations may be freely used when adding a reference to
FindFiles.net and to the
scientific publication: "European Physical Journal B, in press".
Distribution of public files on the Internet
We find that most public data files on the Internet are hosted on small sites.
The indegree is the number of links pointing to a domain.
Domains with a large indegree are normally important hubs.
The number of hosts having a given indegree k scales
approximately like k^(-2.2) in the FindFiles.net corpus,
in agreement with other investigations.
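The indegree statistics described above can be sketched in Python as follows. The link list format (pairs of source and target host), the choice to count each distinct linking host once and to ignore self-links, and all function names are assumptions made for illustration. The least-squares fit on log-log axes is a simple stand-in for whatever estimation method was actually used, and it is known to be less reliable than maximum-likelihood fitting on noisy data.

```python
import math
from collections import Counter

def indegree_per_host(links):
    """Count, for each target host, the distinct source hosts linking to it."""
    unique_links = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _src, dst in unique_links)

def hosts_per_indegree(indegrees):
    """Map each indegree k to the number of hosts having exactly that indegree."""
    return Counter(indegrees.values())

def loglog_slope(histogram):
    """Least-squares slope of log N(k) against log k."""
    points = [(math.log(k), math.log(n)) for k, n in histogram.items() if k > 0 and n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator

# Hypothetical toy link list (source host, target host); the repeat is ignored.
links = [("a", "c"), ("b", "c"), ("a", "b"), ("a", "c")]
print(hosts_per_indegree(indegree_per_host(links)))

# Idealised histogram following N(k) = 1e6 * k**-2.2, to check the fit.
ideal = {k: 1e6 * k ** -2.2 for k in (1, 2, 4, 8, 16)}
print(round(loglog_slope(ideal), 2))
```

A slope near -2.2 on such a plot corresponds to the scaling quoted above: doubling the indegree k reduces the number of hosts by a factor of roughly 2^(-2.2), or about 0.22.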
Domains hosting large numbers of files have in
general a small indegree.
The results are being published in the
European Physical Journal B.