Wikipedia talk:Database download
This is the talk page for discussing improvements to the Database download page.
Archives: 1, 2, 3. Auto-archiving period: 12 months.
- Please note that questions about the database download are more likely to be answered on the xmldatadumps-l or wikitech-l mailing lists than on this talk page.
This page has archives. Sections older than 365 days may be automatically archived by Lowercase sigmabot III when more than 4 sections are present.
How to use multistream?
The "How to use multistream?" section of the page says:
" For multistream, you can get an index file, pages-articles-multistream-index.txt.bz2. The first field of this index is the number of bytes to seek into the compressed archive pages-articles-multistream.xml.bz2, the second is the article ID, the third the article title.
Cut a small part out of the archive with dd using the byte offset as found in the index. You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID.
See https://docs.python.org/3/library/bz2.html#bz2.BZ2Decompressor for info about such multistream files and about how to decompress them with python; see also https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream/README.txt and related files for an old working toy.
"
I have the index and the multistream, and I can make a live USB flash drive with https://trisquel.info/en/wiki/how-create-liveusb
lsblk                                                  # list block devices to identify the USB stick (the sdX below)
umount /dev/sdX*                                       # unmount any partitions that were auto-mounted
sudo dd if=/path/to/image.iso of=/dev/sdX bs=8M;sync   # write the image to the stick and flush it to disk
but I do not know how to use dd well enough to "Cut a small part out of the archive with dd using the byte offset as found in the index." and then "You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID."
Is there any video or more information on Wikipedia about how to do this, so I can look at Wikipedia pages, or at least the text off-line?
Thank you for your time.
Other Cody (talk) 22:46, 4 December 2023 (UTC)
- https://trisquel.info/en/forum/how-do-you-cut-wikipedia-database-dump-dd has someone called Magic Banana who has information about how to do this. Maybe others as well. Other Cody (talk) 15:44, 26 January 2024 (UTC)
A tool for a similar multistream compressed file was written for xz compression and lives at https://github.com/kamathln/zeex . This will give a preliminary idea and could be adapted for bz2 as well. kamathln (talk) 12:21, 22 January 2025 (UTC)
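For anyone finding this later, here is a rough sketch of the cut-and-decompress step described in the quote above, assuming GNU dd (which supports the skip_bytes and count_bytes flags) and the standard dump file names. The two offsets below are made-up example values; the real ones are the offset on the article's own line in pages-articles-multistream-index.txt.bz2 and the next larger offset that appears further down the index.
# 1. Find the article's line in the index; each line is offset:pageid:title.
bzcat pages-articles-multistream-index.txt.bz2 | grep ':Some article title$'
# 2. Cut one compressed stream out of the archive. START is the offset from the
#    article's index line, END is the next larger offset in the index (example values only).
START=597
END=661323
dd if=pages-articles-multistream.xml.bz2 bs=1M iflag=skip_bytes,count_bytes skip="$START" count=$((END - START)) | bzcat > chunk.xml
# 3. chunk.xml is plain XML for the batch of pages in that stream; search it for the article.
grep -n '<title>' chunk.xml
Each stream in the multistream dump packs a batch of pages together (typically 100), so the piece cut between two offsets is a complete bzip2 stream that bzcat, bzip2 -d, or Python's bz2.BZ2Decompressor can decompress on its own.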
How many "multiple" is "These files expand to multiple terabytes of text." - 4TB Drives are...
...cheap as chips.
In early 2025, a 4 TB disk drive is $70 USD while an SSD is just $200, and 24 TB disks are under $500...
It's clear that the "current version only" dump expands to just 0.086 TB. Can anyone further clarify whether "multiple", a few lines below that, means expanding to 2 TB or 200 TB? Jonathon Barton (talk) 06:17, 16 February 2025 (UTC)
Semi-protected edit request on 20 March 2025
This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.
within "SQL Schema" section, change the link pointing to tables.sql to either tables-generated.sql or tables.json, I'd go with the former as it's more compact and readable.
the original tables.sql is empty as of Aug. 2024 and will be removed, see https://phabricator.wikimedia.org/T191231
old: https://phabricator.wikimedia.org/diffusion/MW/browse/master/maintenance/tables.sql
to either: https://phabricator.wikimedia.org/source/mediawiki/browse/master/sql/mysql/tables-generated.sql (preferred)
or: https://phabricator.wikimedia.org/source/mediawiki/browse/master/sql/tables.json
Quoting from the issue, "we may want to switch to YAML later", but that has not happened yet. YAML would be the most readable format. KlausSchwab (talk) 14:22, 20 March 2025 (UTC)
Done -- John of Reading (talk) 17:43, 23 March 2025 (UTC)
Semi-protected edit request on 27 April 2025
This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.
The compressed size of 19 GB is not the same as the one mentioned on https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia; perhaps one of the pages got stale. 2601:600:8480:2D10:5D63:3732:312A:9F99 (talk) 20:12, 27 April 2025 (UTC)
- Yes, the figures at Wikipedia:Size of Wikipedia#Size of the English Wikipedia database are stale. Each figure is labelled "As of <date>" so there shouldn't be too much confusion. -- John of Reading (talk) 06:20, 28 April 2025 (UTC)