FindFiles.net - web sciences and database

FindFiles.net - statistical data & scientific collaborations



The Wikipedia/Dmoz corpus

In 2010 we searched all hosts linked from any edition of Wikipedia and of the Open Directory Project (DMOZ). The datafile lists all 252 million files found on 7.7 million hosts, in the format

[805M]  Wikipedia/Dmoz corpus: file-size, Mime-Type, hosting-domain

The hosts are anonymised and numbered consecutively; the respective top-level domain has been retained. File sizes are given in bytes.
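A minimal sketch of how the datafile might be read is given below. It assumes one record per line with three whitespace-separated fields in the order listed above, and anonymised host labels of the form `<number>.<tld>`. Both the delimiter and the label format are assumptions, as are all function names and the sample records; they should be checked against the actual file before use.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class FileRecord(NamedTuple):
    size_bytes: int   # file size in bytes
    mime_type: str    # MIME type, e.g. "image/jpeg"
    host: str         # anonymised host label, assumed form "<number>.<tld>"


def parse_line(line: str) -> FileRecord:
    """Parse one record; assumes three whitespace-separated fields."""
    size, mime_type, host = line.split()
    return FileRecord(int(size), mime_type.lower(), host)


def bytes_per_tld(lines: Iterable[str]) -> Counter:
    """Sum the file sizes per top-level domain, skipping blank lines."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = parse_line(line)
        tld = record.host.rsplit(".", 1)[-1]  # text after the last dot
        totals[tld] += record.size_bytes
    return totals


# Hypothetical sample records, not taken from the corpus.
sample = [
    "20480 image/jpeg 17.org",
    "1048576 video/mpeg 42.de",
    "512 text/html 17.org",
]
totals = bytes_per_tld(sample)
```

Because `bytes_per_tld` consumes its input line by line, an open file handle can be passed directly, so the 805 MB file need not be loaded into memory at once.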
We welcome scientific collaborations regarding network, web sciences and/or complex system analysis of the FindFiles.net data corpus.
Please contact Claudius Gros.

Neuropsychological constraints to human data production on a global scale

Analyzing the file size distribution of the Wikipedia/Dmoz corpus for several distinct data types, we find indications that the neuropsychological capacity of the human brain to process and record information may constitute the dominant limiting factor for the overall growth of globally stored information, with real-world economic constraints having only a negligible influence. This supposition draws support from the observation that the file size distributions follow a power law for data without a time component, like images, and a log-normal distribution for multimedia files, for which time is a defining quale.
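To illustrate the distinction between the two distribution types, the following sketch fits both a power law and a log-normal distribution to a list of file sizes by maximum likelihood and reports which of the two has the higher log-likelihood. This is a rough illustration, not the procedure used in the article: the function names are hypothetical, the lower cut-off of the power law is simply taken to be the smallest observed size, and the two test samples are synthetic.

```python
import math
import random
from statistics import fmean, pstdev


def fit_power_law(sizes):
    """MLE for p(x) ~ x^(-alpha) on x >= x_min, with x_min = min(sizes).

    Returns (alpha, log-likelihood).
    """
    x_min = min(sizes)
    logs = [math.log(x / x_min) for x in sizes]
    n = len(logs)
    alpha = 1.0 + n / sum(logs)
    loglik = n * math.log((alpha - 1.0) / x_min) - alpha * sum(logs)
    return alpha, loglik


def fit_log_normal(sizes):
    """MLE for a log-normal density; returns (mu, sigma, log-likelihood)."""
    logs = [math.log(x) for x in sizes]
    mu, sigma = fmean(logs), pstdev(logs)
    loglik = sum(
        -lx - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        - (lx - mu) ** 2 / (2.0 * sigma ** 2)
        for lx in logs
    )
    return mu, sigma, loglik


def preferred_model(sizes):
    """Name of the model with the higher log-likelihood (crude comparison)."""
    _, ll_power = fit_power_law(sizes)
    _, _, ll_lognormal = fit_log_normal(sizes)
    return "power law" if ll_power > ll_lognormal else "log-normal"


# Synthetic samples for illustration only, not corpus data.
rng = random.Random(0)
image_like = [1000.0 * rng.paretovariate(1.5) for _ in range(5000)]  # alpha = 2.5
video_like = [rng.lognormvariate(14.0, 1.5) for _ in range(5000)]

alpha_hat, _ = fit_power_law(image_like)
```

A plain likelihood comparison of this kind ignores the choice of cut-off and the differing number of parameters; a careful analysis would treat both explicitly.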
The full article is available on the arXiv server and will be published in the European Physical Journal B. Our results have been discussed by the MIT Technology Review and other web science portals.
   
The illustrations may be freely used, provided that a reference to FindFiles.net and to the scientific publication is added: "European Physical Journal B, in press".

Distribution of public files on the Internet

We find that most public data files on the Internet are hosted on small sites.
The indegree is the number of links pointing to a domain. Domains with a large indegree are normally important hubs.
The number of hosts having a given indegree k scales approximately like k^(-2.2) in the FindFiles.net corpus, in agreement with other investigations.
Domains hosting large numbers of files have in general a small indegree.
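The indegree statistics described above can be sketched as follows, assuming a list of (source domain, target domain) link pairs is available. Such a link list is not part of the published datafile, so the input format, the function names, and the decision to ignore self-links are all assumptions. The exponent is estimated here by a simple least-squares fit on a log-log scale, which is a crude estimator and not necessarily the one used for the published result.

```python
import math
from collections import Counter


def indegrees(links):
    """Number of links pointing to each domain (self-links ignored, by assumption)."""
    return Counter(target for source, target in links if source != target)


def degree_histogram(indegree_by_host):
    """N(k): the number of hosts having indegree k."""
    return Counter(indegree_by_host.values())


def log_log_slope(histogram):
    """Least-squares slope of log N(k) versus log k."""
    points = [
        (math.log(k), math.log(n))
        for k, n in histogram.items()
        if k > 0 and n > 0
    ]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical links between anonymised hosts, for illustration only.
links = [("1.org", "3.de"), ("2.com", "3.de"), ("3.de", "1.org"), ("1.org", "1.org")]
by_host = indegrees(links)
histogram = degree_histogram(by_host)

# An idealised histogram following N(k) = 1e6 * k^(-2.2) exactly.
ideal = {k: 1e6 * k ** -2.2 for k in range(1, 50)}
slope = log_log_slope(ideal)
```

On real data the tail of N(k) is sparse and noisy, so logarithmic binning or a maximum-likelihood estimate would usually be preferable to a direct fit of the raw histogram.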
   
The results are being published in the European Physical Journal B. The illustrations may be freely used, provided that a reference to FindFiles.net and to the scientific publication is added: "European Physical Journal B, in press".
© 2012 FindFiles.net | The file search engine with a free antivirus scan & file converter