• Show this post
    For the last few years (2022, 2023), the July and August XML dumps were significantly delayed. Should we expect this again for 2024, or will the releases, masters, artists, and labels dumps all be available soon?

    Thanks very much.

  • Show this post
    They just published the releases dump.

  • Show this post
    Saw that. Great. Hopefully we’ll get the other dumps over the next day or two…

  • Show this post
    Still waiting for artists and labels. They usually come right after masters...

  • Show this post
    All good. THANK YOU.

  • Show this post
    The August dumps were never provided. Can someone please confirm that September's will be available next week?

  • Show this post
    Can anyone from Discogs respond please?

  • Staff 754

    blindborges edited 9 months ago
    Hi lincoln17,

    It looks like the job that creates the XML dump for August indeed failed. I have fixed the issue and restarted it just yesterday. The downside is that these jobs can take 7-12 days to finish. The upside is that we are currently putting some focus into efficiency gains for these very old jobs, so that they finish much faster and with greater fault tolerance.

    Sorry for the inconvenience!

  • Show this post
    Thanks for the response. Much appreciated.

  • Show this post
    Any update on this?

  • Staff 754

    Show this post
    lincoln17
    Any update on this?


    Unfortunately, the cronjob failed again. These failures have been caused by our recent migration from our old data center to AWS and the restart patterns of servers on AWS. The job that currently creates these files was designed to have a constant, stable connection to our database over a very long time (many days) and is therefore not very fault-tolerant in our new cloud environment. Based on the failure history, I don't have high confidence that the old cronjob will be able to succeed (though it might), so I'll be putting my focus on the rewrite of this cronjob.

    Again, sorry for the inconvenience here.

  • Show this post
    Thank you blindborges for the detailed information, and your efforts.

  • Show this post
    vansteve
    Thank you blindborges for the detailed information, and your efforts.


    Hear, hear!

  • Show this post
    blindborges Tips from a fellow developer if it still continues to fail: instead of creating one big file (e.g. the releases file), create multiple jobs that can be executed in parallel, with 1 job per 1,000,000 release ids (e.g. the first job covers release ids 0-999999). When all jobs have been executed successfully, you just concatenate the output, zip it, and publish.

  • Show this post
    bjorn.stadig
    blindborges Tips from a fellow developer if it still continues to fail: instead of creating one big file (e.g. the releases file), create multiple jobs that can be executed in parallel, with 1 job per 1,000,000 release ids (e.g. the first job covers release ids 0-999999). When all jobs have been executed successfully, you just concatenate the output, zip it, and publish.


    Dividing the thing into handy slices sounds like an interesting idea to me. It could even be multiple jobs that are executed one after the other (rather than in parallel), and whenever one fails due to a server restart on the storage side, only that single slice would need to be repeated. The job wouldn't always have to start from the beginning and would certainly reach the end at some point.
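    To make the idea concrete, here is a minimal, purely illustrative sketch of the slicing-plus-retry approach discussed above (the dump_slice body is a placeholder, not the actual export logic, and the id limit is an assumption):

    import gzip
    import shutil

    SLICE = 1_000_000        # release ids per slice
    MAX_ID = 33_000_000      # assumed upper bound; in reality this would come from the database
    MAX_RETRIES = 3

    def dump_slice(start_id, end_id, path):
        """Placeholder for the real export: write all releases with
        start_id <= id < end_id to `path` as an XML fragment."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"<!-- releases {start_id}..{end_id - 1} would go here -->\n")

    def run_all():
        parts = []
        for start in range(0, MAX_ID, SLICE):
            path = f"releases_{start}_{start + SLICE - 1}.xml"
            for _ in range(MAX_RETRIES):
                try:
                    dump_slice(start, start + SLICE, path)  # only this slice is retried on failure
                    break
                except ConnectionError:
                    continue
            parts.append(path)

        # once every slice has finished, concatenate and gzip the result
        with gzip.open("discogs_releases.xml.gz", "wb") as out:
            for path in parts:
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, out)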

  • Show this post
    Any updates here? Will we just get an October dump at this point?

  • Staff 754

    Show this post
    lincoln17
    Will we just get an October dump at this point?


    Probably not, unfortunately. The usual job for this kicked off 8 hours ago, but based on previous attempts since our AWS migration, I don't have great confidence it'll finish successfully.

    As for updates, I'm in the process of setting up infrastructure for this rewrite. The majority of new app code has been written and tested with some sample data. Once all infrastructure is ready to go, I'll run some tests to generate a full dump (this will not be public because it's a test to work out any kinks and major inconsistencies). From there, I'll make any changes necessary to match the output of the old XML dump job as best as possible, then make it live. I'm hoping all of this can be done within the next couple weeks, but that depends on how the tests go, other work I have, etc.

    w/r/t parallelization: yeah, that's part of the new solution. Some other improvements include making it async (the old solution is synchronous), checkpointing (if it fails, it can pick up where it left off; the old solution has to completely start over if it fails), and some code simplification and modernization (again, this old job is ~15 years old, was built for a much smaller database, and the code hasn't been significantly updated).
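    (For anyone curious what the checkpointing part could look like, below is a rough, hypothetical sketch only; the fetch_batch/write_batch helpers are invented for illustration and are not the actual Discogs implementation.)

    import json
    import os

    CHECKPOINT = "releases_dump.checkpoint"
    BATCH = 10_000

    def load_checkpoint():
        # resume from the last successfully written id, or start from 0
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["last_id"]
        return 0

    def save_checkpoint(last_id):
        with open(CHECKPOINT, "w") as f:
            json.dump({"last_id": last_id}, f)

    def run_dump(fetch_batch, write_batch):
        last_id = load_checkpoint()
        while True:
            rows = fetch_batch(after_id=last_id, limit=BATCH)  # e.g. a database query
            if not rows:
                break
            write_batch(rows)                                  # append the XML for this batch
            last_id = rows[-1]["id"]
            save_checkpoint(last_id)                           # a crash only loses one batch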

  • Show this post
    Any updates on status?

  • Show this post
    Thanks for the efforts blindborges. Any chance once the XML data dumps are back working there would also be JSON dumps available?

  • Staff 754

    Show this post
    lincoln17
    Any updates on status?


    Finalizing some data transfers and getting started on running this in production (though as a veiled test, so not public).

    franco334578
    Thanks for the efforts blindborges. Any chance once the XML data dumps are back working there would also be JSON dumps available?


    Likely not for the first iteration, but perhaps later on, especially if there's a demand for it. Would anyone else prefer a JSON dump instead of XML?

  • Show this post
    I'm also very much looking forward to having the monthly data published again, so thank you in advance for your work, and for not keeping us in the dark about it!

    And while I dread having to modify the processing part, I think having the data in a more modern format like JSON would be very useful in the long run.

    Also, if I may offer a suggestion: would it be possible to split the data into smaller parts, based on the release ids?
    For example:
    dump1: releases 1 to 1 million
    dump2: releases 1 million to 2 million
    and so on.
    This would be based on the release id, and not the number of releases.
    So each dump would not necessarily contain 1 million releases, but a certain range of releases by id.
    I expect this would also be easier on the resources, could be run in parallel, and at least personally it would help a lot: usually when new data is available, the first thing I need is the new stuff, so I could start with processing the last dump.

  • Show this post
    I am in favour of the JSON format and of splitting the file into several parts of a few million entries each. It should then be possible to load such a part directly into memory using a function like json_decode. A single file with more than 30 million elements and a size of more than 100 GB would probably have to be parsed differently anyway.

  • Show this post
    I find it very interesting that very fundamental discussions about data dumps are now starting. In fact, a rewrite like this is an opportunity to consider how such a thing would be solved using today's technology - without any legacy issues, at least on the server side.

    One aspect is that JSON tends to be more compact than XML in terms of pure data volume. With the data volumes involved here, that is already an argument.

    However, I think the question of what users actually do with the data is somewhat more decisive. I think that even if the monolith is broken down into smaller parts, we will still have files so large that we will need a streaming parser to read, filter and convert them. I know that such parsers exist for XML, because I am using one in C++ to get the job done. I assume that they also exist in a similar form for JSON, and hopefully in all common programming languages.

    It would be different if really “very small” fragments (that fit into RAM) were generated. In this case, a client application could indeed implement a type of paging and load and unload parts in a targeted and dynamic manner. However, I have no idea whether there are really many applications out there that would work this way.

    At this point, I can only speak for my own use case: to maintain the digital image of my vinyl collection (naming of music files, tagging...), I have the task of extracting exactly those releases from the huge releases file that are in my collection. So I have a C++ streaming XML parser that goes through the entire releases file sequentially and extracts only those entries into a filtered output file that are found in the collection export. The result is a reduced releases dump that can be used very well by downstream scripts.
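    For anyone who wants to do the same kind of filtering without C++, a rough Python equivalent with a streaming parser could look like this (file names and the id set are placeholders, not my actual tool):

    import gzip
    import xml.etree.ElementTree as ET

    # release ids taken from a collection export (placeholder values)
    wanted = {362, 1681343, 76181}

    with gzip.open("discogs_releases.xml.gz", "rb") as src, \
            open("filtered_releases.xml", "w", encoding="utf-8") as dst:
        dst.write("<releases>\n")
        context = ET.iterparse(src, events=("start", "end"))
        _, root = next(context)              # grab the root so processed children can be cleared
        for event, elem in context:
            if event == "end" and elem.tag == "release":
                if int(elem.get("id", "0")) in wanted:
                    dst.write(ET.tostring(elem, encoding="unicode"))
                root.clear()                 # keep memory flat while streaming a huge file
        dst.write("</releases>\n")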

    I have no idea what other people do with the data. For me, it's just this kind of private use for my own collection, and filesystem based paging wouldn't bring me any added value for that. It may be different for other applications.

    A change to the schema would mean a rewrite for the existing applications. Certainly not everyone would be enthusiastic about this. If the new solution brings real and concrete advantages for the applications, then acceptance will certainly be greater.

    But if we see that almost every application needs a local preprocessor to filter or reformat the data anyway, then I think it would be better to stick with a single large file (to simplify the process / initial acquisition) and leave it up to the applications to decide how to break the data down into fragments or reformat it.

    Thanks for reading and discussing! :-)

  • Show this post
    I agree with radum.

    If the release id currently exceeds 30 million on Discogs, then you could take 30+ snapshots of a million ids each and compress each one into its own gz file of releases. It would be good for a given snapshot to indicate in its name which range of release ids it covers. Then you would immediately know which snapshot to load to extract the data you are interested in.

    This is a really good idea :)

  • Show this post
    We already have a streaming XML parser, so XML will do.

    However, it would be easier for new implementations to have JSON. Even better, it would be more practical to have JSONL (files with one JSON object per line). More than partitioning, it makes it easy to parallelize processing.
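    As a rough illustration of what that buys you (the file name and fields here are hypothetical, not an official format):

    import json
    from multiprocessing import Pool

    def handle_line(line):
        # each line is one self-contained JSON document, e.g. one release
        release = json.loads(line)
        return release.get("id")

    if __name__ == "__main__":
        with open("releases.jsonl", encoding="utf-8") as f, Pool() as pool:
            # lines are independent, so they can be farmed out to worker processes in chunks
            for release_id in pool.imap_unordered(handle_line, f, chunksize=1000):
                pass  # aggregate or store the results here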

  • Show this post
    turb_gondolin
    (...) Even better, it would be more practical to have JSONL (files with one JSON object per line). More than partitioning, it makes it easy to parallelize processing.


    A very nice summary & evaluation for those who are not familiar with JSONL (like myself, until a few minutes ago ;-)):
    https://nicholaszhan.com/line-em-up-a-guide-to-json-lines-7c43215b3b82

    The parallelization thing sounds quite appealing to me.

  • Show this post
    The database in JSON format and in smaller files makes a lot of sense. Any news on progress?
    Looking forward to it coming back <3

  • Show this post
    The more I think about it, the more JSONL seems to make sense as a format. It is JSON, but at the same time it is absolutely non-monolithic, and it can be partitioned as required and further processed in sections of any size (parallelized parsing). I have the impression that this format could be the answer to many of the requests we have read here.

    Of course, the newlines that appear in the data must be escaped correctly, but that should not be a hurdle.

    But even if it remains XML... that's fine with me too.

  • Show this post
    As long as we’re listing some ideas for a possible rewrite, it would also be nice to offer a change-only extract (in XML or JSON) that would export only those releases which have been modified since the last extract. I imagine many of us are processing the ENTIRE extract but only acting on releases that are new or have been modified.

    Just a thought…

  • Show this post
    lincoln17
    As long as we’re listing some ideas for a possible rewrite, it would also be nice to offer a change-only extract (in XML or JSON) that would export only those releases which have been modified since the last extract. I imagine many of us are processing the ENTIRE extract but only acting on releases that are new or have been modified.

    Just a thought…


    It would be great, but it adds some complexity, starting with the fact that there is currently no update date in the dumps (and this applies to artists + labels + masters + releases).

    It would also require designing how to deal with deletes/removals/merges: either explicitly NOT deal with them (take a full dump from time to time if you want them) or have 4 additional files just for that.

  • Show this post
    If separate change-only files are problematic, an alternative that would make life easier would be to simply add a lastModifiedDate to each entity (release, master, artist, label). If the entity is deleted or removed, then the body of the message would be blank. But at present, I'm keeping a copy of each entity for the sole purpose of comparing it to the latest extract, and processing only those where the diff fails. That's a lot of work and local storage I'd rather avoid in the future -- lastModifiedDate would solve that for me.

    Is there an ETA on an October extract?

  • Show this post
    Just want to add that a separate dump of only the changed data would be amazing to have.
    Info about modifications in the main data not so much, as that would still mean parsing and checking the data.

    lincoln17
    But at present, I'm keeping a copy of each entity for the sole purpose of comparing it to the latest extract, and processing only those where the diff fails.

    I also do something similar, but keeping only a checksum of the data and comparing that.
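    Roughly, the checksum approach looks like this (a simplified sketch; where and how the old checksums are stored is up to the application):

    import hashlib

    def checksum(entity_xml: str) -> str:
        # digest of the raw XML fragment; any change to the fragment changes the digest
        return hashlib.sha256(entity_xml.encode("utf-8")).hexdigest()

    def changed_entities(dump, previous):
        """dump: iterable of (entity_id, entity_xml); previous: dict of checksums from the last run."""
        for entity_id, entity_xml in dump:
            digest = checksum(entity_xml)
            if previous.get(entity_id) != digest:
                yield entity_id, entity_xml   # new or modified since the last extract
            previous[entity_id] = digest      # remember for the next run

    The obvious downside, as discussed further down in this thread, is that any purely cosmetic change in how the XML is generated also changes the digest and forces a reprocess.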

  • Show this post
    radum
    Just want to add that a separate dump of only the changed data would be amazing to have.
    Info about modifications in the main data not so much, as that would still mean parsing and checking the data.

    100%

  • Show this post
    rambul
    I agree with radum.

    If the release id currently exceeds 30 million on Discogs, then you could take 30+ snapshots of a million ids each and compress each one into its own gz file of releases. It would be good for a given snapshot to indicate in its name which range of release ids it covers. Then you would immediately know which snapshot to load to extract the data you are interested in.

    This is a really good idea :)


    Fun fact: for about two years now, it seems that ids for new entries are incremented by 3, not by 1. A dump slice as you propose would therefore theoretically have at most 333,334 entries, not 1 million. In practice I am seeing that the ranges are closer to 250,000 or so due to merges.

  • Show this post
    While we're in request mode, it would be really nice if information about merges were made available. This information is available in the web interface for humans, but not in the data dump. It would be very helpful if we could track some kind of history and see which releases were merged.

  • Staff 754

    Show this post
    Hi! Does anyone have a tool they use to load the XML files into a database (or really anything more manageable than a 75GB text file)?

    I need to compare this new data dump with the ones available at https://www.discogs.sie.com/data to minimize the differences, but I also would just like to make sure whichever tools you all use will work with the new data dump and create as few hiccups as possible. Again, there may be some minor differences mostly due to the different library used here (like using empty self-closing elements <foo /> instead of either <foo></foo> or just omitting the elements completely), but I want to minimize those differences as much as possible.

    Thanks

  • Show this post
    blindborges
    Hi! Does anyone have a tool they use to load the XML files into a database (or really anything more manageable than a 75GB text file)? (...) Again, there may be some minor differences mostly due to the different library used here (like using empty self-closing elements <foo /> instead of either <foo></foo> or just omitting the elements completely) (...)


    I would expect that XML syntax details like self-closing elements probably do not matter. Existing applications will hopefully use XML parsers that handle these details at the appropriate level.

    Would it help to have a tool that compares two XML streams semantically (not syntactically) according to certain (possibly pragmatic & hard-coded) rules and outputs the deltas to the console? The condition would be that the order of the top-level elements matches (e.g. releases still sorted in ascending order by ID).

    Of course, such a low-level approach also requires that you have two dump files (old and new dump) that were generated from exactly the same data - otherwise there will be tons of irrelevant differences in the output (unless you can configure the tool so that it only compares a representative part of the entries, namely those that were not changed in the backend between the dumps - this knowledge would have to be provided from outside).

    Perhaps such a stream-based tool already exists somewhere. I haven't found one on a quick search. Maybe someone has another tool anyway that is much better suited to the task, e.g. with a database approach.

    But if we are at a loss here, I can have a look at my parser application (C++, console, based on Qt). Maybe I could teach it to perform such a comparison. Not sure, however. I would really have to take a closer look at the current capabilities of the code.

    But, another thought: if you need quick initial feedback, you could also make the new "prototype" dump (or part of it, e.g. a release dump with the first 5000 releases, a small file) available today. Then some of us could have a look at how the applications cope with it and whether there are any obvious deviations. Is there really anyone who absolutely needs a total comparison?
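    If it helps, the core of such a semantic comparison could be sketched like this (deliberately simplistic canonicalisation; a real tool would need the pragmatic rules mentioned above, and sorting children ignores order-sensitive data such as tracklists):

    import xml.etree.ElementTree as ET

    def canonical(elem):
        """Collapse an element into a comparable value that ignores pure syntax
        differences such as <foo/> vs <foo></foo> vs an omitted empty element."""
        children = sorted(c for c in (canonical(child) for child in elem) if c is not None)
        text = (elem.text or "").strip()
        if not text and not children and not elem.attrib:
            return None                                  # treat empty elements as absent
        return (elem.tag, tuple(sorted(elem.attrib.items())), text, tuple(children))

    def diff_dumps(old_releases, new_releases):
        """Both iterables must yield <release> elements in the same (ascending id) order."""
        for old, new in zip(old_releases, new_releases):
            if canonical(old) != canonical(new):
                yield old.get("id")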

  • Show this post
    I have an in-house tool that parses and imports the data.
    It's not public but I'd be happy to do some tests on a new data dump format.

    It can work on a full or partial file, reading the compressed file directly.

    Any of the following is possible:

    1) Only count the number of items in the XML.

    2) Do an indexing-only run: for each release it only extracts the release id and index position in the XML file.

    3) Extract the XML of one release from the file.
    This could be used to do a comparison with a certain release from a previous data dump.

    4) Run a validation only job: only checks the data for consistency, does not import anything.
    This is not only about the XML format, but also about the actual data contained.
    I do not handle all the data in the XML, but for the ones I do, I have strict rules to check (like release id is mandatory, country name max N chars, and so on).

    5) Complete processing and import in the database.

    By the way, while a slightly different data format should not be a problem for any parser, omitting elements completely may be, depending on the specific implementation.
    However, that seems like a very good idea in order to reduce data size.

  • Show this post
    Any update on the timing of the next XML extract? I’m all for discussing how to improve things going forward, but it’s now been over three months since we’ve had a proper extract.

  • Show this post
    lincoln17
    Any update on the timing of the next XML extract? I’m all for discussing how to improve things going forward, but it’s now been over three months since we’ve had a proper extract.


    I don't want to argue away your question about a completion date ;-)

    Just one thing: if I see it correctly, the discussions about improvements are not currently influencing the development process. On the contrary: the idea of now comparing the old and new dumps shows that the primary aim is now to simply restore the old functionality.

    It also seems to me that blindborges is doing everything possible to rewrite this ancient cron job. This is certainly quite a challenging undertaking, considering how complex the data structure is and how many years the software has grown. I am extremely grateful that there is someone to take care of it at all. After the rewrite, and if everything works, I'm sure there will be a much better code base for improvements.

  • Staff 754

    Show this post
    vansteve
    On the contrary: the idea of now comparing the old and new dumps shows that the primary aim is now to simply restore the old functionality.


    As of right now, yes, that's correct. Rebuilding this is part of a monolith decomposition effort mostly, and it's a bit hotter since the old data dump job is no longer working. Getting something that matches the old job in functionality but actually works is the first step. It'll have some minor improvements like speed, resiliency, and checkpointing, but overall it'll be the same functionality.

    Where this project goes in the future is now an open conversation, though, and modernizing it gives us much more leverage than what we had with the old implementation. I can't promise anything (and if I could, I couldn't promise a timeline) since I don't necessarily dictate what we work on and when for the most part, but there's a much greater chance for improvement now that this thing is on its own legs and using a modern tech stack.

    vansteve
    It also seems to me that blindborges is doing everything possible to rewrite this ancient cron job. This is certainly quite a challenging undertaking, considering how complex the data structure is and how many years the software has grown.


    Yeah, this is my primary focus right now, though I have had to jump to a few other hot items while working on this. I've spent today and yesterday comparing the old dump to the new one by searching through the .xml files with ripgrep for specific releases/masters/artists/labels, and things overall look good, so I'm hoping the dump will be totally ready for its next scheduled execution on Nov 1st.

  • Show this post
    While I am not exactly happy about the current non-availability, I am glad that actual thought is being invested here. Please keep up the good work!

  • Show this post
    Labels dump just published. Fingers crossed for complete data this month!

  • Staff 754

    Show this post
    Hey all,

    The full data dump for November has been published. Please let me know if you find anything missing or incorrect. Also, you may notice that some of the gzip files are smaller than the July data dump. This is likely a combination of a different level of gzip compression and skipping some invalid data that was included before.

    Cheers!

  • Show this post
    Thank you! Having the data the same day is really useful!

    There may be a problem with the Artists data.
    Statistically, there are around 40,000 new entries each month.
    However, this dump contains only around 13,000 new items compared to the July data.
    There should be almost 10 times that.
    This is just based on how the data usually grows; maybe there simply weren't that many new artists added to the database in the last months.

    The rest of the data is OK from this point of view; the number of new items matches the expected increase.

  • Staff 754

    Show this post
    Ah, I think I see the problem there with the artists. Thanks for the report!

  • Show this post
    Releases dump works great. At least in the parts that I extract from it, there are no deviations. It is great to see how fast the job is now running. Thanks for all your efforts!

  • Show this post
    Thanks for this dump!

    blindborges
    Ah, I think I see the problem there with the artists. Thanks for the report!

    Is there something wrong with it (missing artists), or can we use it safely?

  • Show this post
    blindborges: Many thanks again for returning the monthly data, with improvements (same day availability, smaller size).

    Really great work, and much appreciated!

    Regarding:

    blindborges
    using empty self-closing elements <foo /> instead of either <foo></foo> or just omitting the elements completely


    I noticed that in some cases empty elements were removed, while in others not.

    For example, for release id 367, the artists node has empty elements for main and extra artists, but those nodes are missing for track artists.
    Perhaps it would help keep the file size smaller if the empty main and extra artist nodes were removed as well.

    Anyway, it would also help if everything was consistent (either all blank elements removed, or all kept).

  • Show this post
    Unfortunately, I'm finding changes in how the releases XML is extracted.

    Example - release 7 in the <extraartists> section includes 4 artists -- Kevin Hodge, Koibito & Boku, Kory And The Gang, and Chris Gray. In the latest (November) extract, Koibito & Boku have no corresponding id -- that is, it looks like this: "<id></id>". In the previous (July) extract, the same artist had "<id>0</id>".

    While both <id> representations highlight the same thing - that this is not a valid artist -- the fact that the xml has changed (despite the release itself not being updated in over two years) means that those of us comparing the XML strings or checksums of the XML will now potentially reprocess this release.

    I'm seeing similar changes in artists and labels as well.

  • Show this post
    I also see incorrect data with the "join" element. See the release/artists section of release id 1681343. There's only one artist, yet you have "<join>,</join>". Why the comma?

    The previous extract has simply (and correctly) "<join/>".

    Why did so many things like this change?

  • Show this post
    lincoln17
    Why did so many things like this change?


    Well, the software that generates the dumps has been rewritten from scratch. You always get these things when you do a complete rewrite of something.

    If you tear a house down to its foundations and have the same house rebuilt by a different architect, there will also be certain differences. A doorway will be a little lower, or a washbasin will be 10 cm further to the left. If there were no differences at all in the first run, then it would be an outright miracle.

    From a software developer's point of view: the thing with "<join>,</join>" looks semantically correct to me. Here is a guess: the comma might be a default value that is present - and exported - in all cases. That would be systematic. The element has no meaning if only 1 artist is listed. So it shouldn't matter which text it contains. If I am right, replacing it with "<join/>" would be a pure optimization to save a few bytes.

    The reason I am writing all this is because I am wondering how important it is for us to really reproduce the old output exactly. Perhaps it is enough if we only report genuine errors?

  • Show this post
    As mentioned previously, the absence of a lastModifiedDate in the XML for a given release (which I said above would be enormously helpful) implies that those coding against the XML extract are doing their own form of comparisons — full XML string diffs, checksums, etc. So changes like this — from "<join/>" to "<join></join>" or "<join>,</join>" (the latter of which is absolutely NOT a default and appears inconsistently throughout the new extract) — require many of us to potentially re-process every release unnecessarily.

    Absent this one-time headache, the bigger concern is the inconsistencies.
    - Why have both <join/> and <join></join> (as an example)?
    - Better still, why have an empty element at all?
    - If part of the rewrite is to conserve space — an admirable goal — why continue to include large <images> elements (for example) and children that contain absolutely zero information? Who gains by a dozen duplicate elements that read <image type="secondary" uri="" uri150="" width="600" height="600">
    - What are we to make of <artist> elements with no corresponding id, and if and when this occurs, why change from <id>0</id> to <id></id>?

    It's not necessary to mimic the old output exactly, but I would have hoped that the four months between when I first reported the missing extract (as I have done at the beginning of every summer for the last few years) and now would have given developers time to clean up the extract once and for all and make it consistent for those programming against it.

    Like everyone, I am enormously appreciative that this is being worked on, and I join in the chorus of thanks to blindborges for doing the heavy lifting. But I'm just surprised that what appeared to be foundational system issues — migration to AWS, changes to cron jobs, recovery logic, etc. — ended up affecting the underlying XML payload in the aforementioned ways.

  • Show this post
    complex systems exhibit complex behaviors

  • Staff 754

    Show this post
    radum
    For example, for release id 367, the artists node has empty elements for main and extra artists, but those nodes are missing for track artists.
    Perhaps it would help keep the file size smaller if the empty main and extra artist nodes were removed as well.


    I agree about removing the empty nodes, but can you confirm that the main and extra artists nodes are empty for Arovane - Atol Scrap? I see both of those populated with data, and looking at the release page, there are both main artists and extra artists.

    lincoln17
    There's only one artist, yet you have "<join>,</join>". Why the comma?


    Thanks, I'll work on a fix for this.

    vansteve
    Here is a guess: the comma might be a default value that is present - and exported - in all cases.


    Bingo. The underlying data model (that was created decades ago) includes a comma as the default joiner, and it's ignored/stripped if only one artist is present. I'll need to follow that same procedure for this new data dump.

    lincoln17
    why continue to include large <images> elements (for example) and children that contain absolutely zero information? Who gains by a dozen duplicate elements that read <image type="secondary" uri="" uri150="" width="600" height="600">


    I'm so glad someone said this! I find this data quite useless, and I'm wondering if anyone would mind if that image data was removed. Unfortunately, we can't provide URLs to images in these data dumps due to potential legal issues, and the URL seems to be the crux of the valuable information here, so having the remainder of the "metadata" here seems a bit pointless.

    ---

    Overall, thanks for the reports, and I'll work to tighten up that consistency a bit where it makes sense.

  • Show this post
    blindborges

    Two things:
    #1 - How should we interpret an <artist> element with a name but no corresponding <id>? I’m not sure how or why this happens.
    #2 - I would imagine that eliminating the <images> elements would save CONSIDERABLE space in the overall extract. Since there’s no worthwhile information in these elements, might you consider eliminating these for the next extract?

    Thanks again.

  • Show this post
    radum expected an additional 120000 artists in the artist file, and you confirmed it.
    Which artists are missing, and will you create a new artist file shortly?

  • Show this post
    blindborges
    I agree about removing the empty nodes, but can you confirm that the main and extra artists nodes are empty for Arovane - Atol Scrap? I see both of those populated with data, and looking at the release page, there are both main artists and extra artists.


    The data is OK, the main and extra artists are present.
    What I meant is that there are some empty nodes in the main and extra artists, which are absent in the track artists:

    Main artist (empty nodes "anv", "role", "tracks"):


    <artists>
    <artist>
    <id>515</id>
    <name>Arovane</name>
    <anv/>
    <join>,</join>
    <role/>
    <tracks/>
    </artist>
    </artists>


    Extra artist (empty nodes "join", "tracks"):


    <extraartists>
    <artist>
    <id>166726</id>
    <name>Rashad Becker</name>
    <anv>Rashad</anv>
    <join/>
    <role>Mastered By</role>
    <tracks/>
    </artist>


    Track extra artist (no empty nodes):


    <extraartists>
    <artist>
    <id>515</id>
    <name>Arovane</name>
    <role>Mixed By [Amx]</role>
    </artist>
    </extraartists>


    blindborges
    I find this data quite useless, and I'm wondering if anyone would mind if that image data was removed.


    Yes, anything that can help make the file size smaller.

    lincoln17
    #1 - How should we interpret an <artist> element with a name but no corresponding <id>? I’m not sure how or why this happens.

    I noticed that too; it seems there are some errors in the actual data, so it's not related to the export.

    For example:
    https://www.discogs.sie.com/release/76181-DJ-S-Walk-With-Me

    Notice no links on the artist name.

  • lincoln17 edited 6 months ago
    blindborges

    There's a bug with the artists file. The <members> element (where the parent is a music group) contains multiple <name> elements, each with an id attribute. Unfortunately, the id attribute for each name is the id of the parent and NOT the id of the specific <name> being referenced.

    Example is artist id 380484 -- the <members> element in the extract is as follows:

    <members><name id="380484">Mark Stevens</name><name id="380484">Bernard Wright</name><name id="380484">Lenny White</name><name id="380484">Marcus Miller</name><name id="380484">Dinky Bingham</name></members>

    As you can see, 380484 is repeated for each of the underlying members.

    Any chance you can rerun this extract with a fix?

  • Staff 754

    Show this post
    Sorry for the late response—I've been put on fixing some site stability issues for the last week or so.

    I've got fixes for a handful of these XML dump issues which will show up in the December dump. And despite what I mentioned before, I'm actually not certain where some of the discrepancy in artist count is coming from. I do see that the new job is not including certain cases of invalid artists where the old job did, but I'm not certain this accounts for such a large discrepancy. I'll continue to look into this as time permits.

    I will also be removing image data from the December dump as discussed. Hopefully this will reduce the size of the files a bit. If we don't see any complaints, then we'll continue to exclude image data for the sake of file size and speed.

    Thanks

  • Show this post
    lincoln17
    blindborges
    There's a bug with the artists file. The <members> element (where the parent is a music group) contains multiple <name> elements, each with an id attribute. Unfortunately, the id attribute for each name is the id of the parent and NOT the id of the specific <name> being referenced.

    Example is artist id 380484 -- the <members> element in the extract is as follows:

    <members><name id="380484">Mark Stevens</name><name id="380484">Bernard Wright</name><name id="380484">Lenny White</name><name id="380484">Marcus Miller</name><name id="380484">Dinky Bingham</name></members>

    As you can see, 380484 is repeated for each of the underlying members.

    Any chance you can rerun this extract with a fix?


    Also, radum expected an additional 120000 artists in the artist file, and blindborges confirmed it.
    Which artists are missing, and will you create a new artist file shortly?

  • Show this post
    You were probably expecting this, but let me be "that guy".

    I actually disagree that the image metadata is "quite useless" and I think it should be retained (obviously the empty URI elements could be removed).

    The knowledge that there exists an image of a given size can be useful in certain use cases. For example, if I'm trying to retrieve artwork for a set of albums I already know about from the XML dump, it means I don't need to make a call to the API to get the images when they might not exist.

  • Show this post
    blindborges
    I will also be removing image data from the December dump as discussed.


    elstensoftware
    You were probably expecting this, but let me be "that guy".

    I actually disagree that the image metadata is "quite useless" and I think it should be retained (obviously the empty URI elements could be removed).


    Indeed. Please retain the image data.

  • Show this post
    blindborges the current XML exports do not include 'series'; there are only 'label' and 'company'.
    For example Various - Now That's What I Call Christmas (25149922) has a series. That should be in the dump too.

    Are series included in the updated exports?

  • Staff 754

    Show this post
    jweijde
    Are series included in the updated exports?


    I'm not seeing Series in past exports, but I've added them to future exports, so expect that data on Dec 01 and going forward.

    I haven't had more time to look at the artists count discrepancy, but the ones I found were all invalid (artists with no releases or marked as invalid). If anyone finds examples of artists that were in the previous export but not the new one (and are still valid artists), please let me know.

    The artist group ID bug has also been fixed.

  • Show this post
    blindborges
    I haven't had more time to look at the artists count discrepancy, but the ones I found were all invalid (artists with no releases or marked as invalid).


    Artists that are marked invalid can still have releases listed under them.

  • Show this post
    The December 1st dump has dropped (that was fast!). What changes/fixes are there from the November one?

  • Show this post
    blindborges
    I will also be removing image data from the December dump as discussed. Hopefully this will reduce the size of the files a bit. If we don't see any complaints, then we'll continue to exclude image data for the sake of file size and speed.

    Good news, and finally I say! On a side note, I find the video data to be equally useless. It's just third party garbage wasting space.

  • Show this post
    blindborges

    Found one artist missing in the December file: 14917751 - Hitmixers (2)

  • Show this post
    (please disregard)

  • Show this post
    blindborges

    A massive amount of artists is missing in the January 2025 file!
    The artist file ends at artist id 14894779, while the releases file references artist ids of at least 15534456. Artists in between (e.g. 4569348) are also missing.

  • Show this post
    blindborges
    I will also be removing image data from the December dump as discussed


    Relvet
    Good news, and finally I say!


    And why is that good news? How can we now see whether an item has images?

  • Show this post
    jweijde
    And why is that good news? How can we now see whether an item has images?

    I'm all in favor of inserting a small flag in the dump, like has_image = "true". But I'm not sorry that the previously useless image elements are gone, as they took up way too much space in relation to their usefulness.

  • Show this post
    bjorn.stadig
    blindborges

    A massive amount of artists is missing in the January 2025 file!
    The artist file ends at artist id 14894779, while the releases file references artist ids of at least 15534456. Artists in between (e.g. 4569348) are also missing.


    By my count there are 2,961 fewer items in the artists dump compared to December.

  • Show this post
    For the January data, there are thousands of releases with missing artist or label id.
    This can also be seen in the release page display.
    Please see this thread for further details: https://www.discogs.sie.com/forum/thread/1104747

  • ijabz edited 4 months ago
    Hi, late to the party, but I have just been trying to load the January 2025 data into a Postgres database using this tool, which previously worked (last used with the July 2024 data). So I see there has been a rewrite, and I have noticed various issues that have broken the code; please see https://github.com/philipmat/discogs-xml2db/issues for details.

    The main issue seems to be that previously an id of zero was stored when an element didn't have an id, which was wrong but allowed the code to work; now the id is removed entirely. My question is: why do some elements (i.e. some artists) not have an id anyway?

    e.g

    https://www.discogs.sie.com/release/7-Moonchildren-Moonchildren-EP?redirected=true

    We see

    Other [Spirits Lifted By] – Koibito & Boku

    and Koibito & Boku is not a link, and this is the type of thing that causes issues. But why is it not a link to https://www.discogs.sie.com/artist/246837-Boku-Koibito anyway?

  • Show this post
    ijabz
    Other [Spirits Lifted By] – Koibito & Boku

    and Koibito & Boku is not a link, and this is the type of thing that causes issues. But why is it not a link to https://www.discogs.sie.com/artist/246837-Boku-Koibito anyway?


    "Other" is an unlinked credit role.
    How does this appear in the dump file?

    Here's hoping that Discogs will actually fix the issues with the XML dump update soon. Let it not turn into another piece of functionality on this site that got upgraded into a broken state.
    blindborges any news on improvements and fixes maybe ?

  • ijabz edited 4 months ago
    blindborges does your tool extract the metadata from a database in the first place?

    I just wonder because my tool parses the XML provided and then puts it into a relational database. If you are extracting from a relational database in the first place, it could be simpler for Discogs and for us if you just provided CSV files of each database table plus the database schema; this is essentially what MusicBrainz does.
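    For what it's worth, loading such per-table CSVs would then be a one-liner per table, e.g. with Postgres COPY (a sketch only; the table and file names are hypothetical, and the table must already exist, created from the published schema):

    import psycopg2

    # assumes a `release` table matching the CSV columns has already been created
    conn = psycopg2.connect("dbname=discogs")
    with conn, conn.cursor() as cur, open("release.csv", encoding="utf-8") as f:
        cur.copy_expert("COPY release FROM STDIN WITH (FORMAT csv, HEADER true)", f)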

  • Show this post
    For release images we extract data to create the following table; can we please put back information that shows whether a release has an image, what type it is, and its dimensions?

    Table "discogs.release_image"
    Column | Type | Collation | Nullable | Default
    ------------+---------+-----------+----------+---------
    release_id | integer | | not null |
    type | text | | |
    width | integer | | |
    height | integer | | |

  • Staff 754

    Show this post
    Hello,

    Sorry for the delay. I will be looking into these reported issues over the next couple days and will post back when I have more info for each of them.

    Thanks for the reports!

  • Show this post
    radum
    For the January data, there are thousands of releases with missing artist or label id.
    This can also be seen in the release page display.
    Please see this thread for further details: https://www.discogs.sie.com/forum/thread/1104747


    This seems to be a problem with the database, and not the export, and individual releases can be fixed by doing a blank edit.

  • Show this post
    radum
    This seems to be a problem with the database, and not the export, and individual releases can be fixed by doing a blank edit.


    Well, that's been tried over and over here with no effect https://www.discogs.sie.com/release/6511626
    The label remains unlinked, so it's the same issue as with those credits. Apparently the export can't deal with entries that are unlinked. These entries do have an id, so that id should be in the export. They were in the old version, weren't they?

  • Staff 754

    Show this post
    radum
    By my count there are 2,961 fewer items in the artists dump compared to December.


    I've just confirmed that there indeed are fewer artists and labels in the dumps, and this could also be the cause of the missing artists in the releases dump. This is a problem with an upstream data store (technically unrelated to the XML project but it affects it) that's actively being worked on. Once that problem is fixed, we should see the missing artist/labels in the future data dumps.

    ijabz
    it could be simpler for Discogs and for us if you just provided CSV files of each database table plus the database schema


    The idea of a JSON data dump was thrown around. I can also put in the idea of a CSV dump for each table instead and see what the team says. However, due to the size of the files and the length of time it takes to generate them, we might have to settle on a single format.

  • Show this post
    blindborges
    The idea of a JSON data dump was thrown around.


    I would still like to suggest JSONL (JSON Lines), as mentioned & linked earlier in this thread... In fact, JSONL might combine the advantages of JSON (robust & parser-friendly structure) and CSV (stream nature, or at least stream compatibility) in a single solution.

  • Show this post
    blindborges

    There is still a massive amount of artists missing in the February 2025 file.
    The February file ends with artist 14894779 (same as the January file), with only 1,268 new artists.
    And the highest artist id is at least 15534456 (found when checking the first major artist id for a release in the releases file). Artists in between (e.g. 4569348) are also still missing.

  • Staff 754

    Show this post
    bjorn.stadig
    There is still a massive amount of artists missing in the February 2025 file.


    Indeed

    blindborges
    This is a problem with an upstream data store (technically unrelated to the XML project but it affects it) that's actively being worked on. Once that problem is fixed, we should see the missing artist/labels in the future data dumps.

  • ijabz edited 4 months ago
    The thing with JSON is that I assume it's just another representation of artist, master, release, label. I don't find it any easier to parse than XML, and then I still have the problem of decoding and converting the data into relational tables, e.g. release, release_artist, release_company, release_track, track, etcetera.

    Instead, if you simply output each table as a CSV file, that would be a much quicker process for you to produce, much quicker for us to import, and there would be much less chance of errors, because there is no conversion into an intermediate format (XML, JSON).

  • Show this post
    ijabz
    Instead, if you simply output each table as a CSV file, that would be a much quicker process for you to produce, much quicker for us to import, and there would be much less chance of errors, because there is no conversion into an intermediate format (XML, JSON).


    I only recently started working with the dumps and I would agree with this take. I know that for many DB applications there are ways to bulk import CSVs.
    There could be some way I'm unaware of to infer the DB schema as well, but if we had CSVs for each table there wouldn't be any guesswork.

    On that note, I'm curious how close my personal Postgres tables are to the prod ones lol (and to any other discogs data dogs here):
    https://textbin.net/excuurbmwt

    I pasted my CREATE TABLE queries there for the main entity tables + the relational ones.

  • jacobmgreer edited 3 months ago
    yapercaper
    I'm curious how close my personal Postgres tables are to the prod ones lol (and to any other discogs data dogs here)


    Very nice! I've created similar-looking tables for key-valuing the datasets from the XML: XML -> XML parsed as CSV -> CSV converted to parquet -> loaded into DuckDB. I've been jokingly referring to this conversion process as disco-ducking (disco(g)-duck(db)).
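    For reference, the CSV-to-parquet/DuckDB step of a pipeline like that can be as small as this (file and table names are placeholders):

    import duckdb

    con = duckdb.connect("discogs.duckdb")
    # load the CSV produced from the XML and persist it as a table
    con.execute("CREATE TABLE release AS SELECT * FROM read_csv_auto('release.csv')")
    # optionally also write it out as parquet for other tools
    con.execute("COPY release TO 'release.parquet' (FORMAT PARQUET)")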

    Maybe a virtual 'discogs data dogs' workgroup could be assembled for looking at how people are using the xml files? With perhaps the intention of sharing best practices and making some recommendations?

  • Show this post
    blindborges

    I will also be removing image data from the December dump as discussed. Hopefully this will reduce the size of the files a bit. If we don't see any complaints, then we'll continue to exclude image data for the sake of file size and speed.


    elstensoftware


    I actually disagree that the image metadata is "quite useless" and I think it should be retained (obviously the empty URI elements could be removed).

    The knowledge that there exists an image of a given size can be useful in certain use cases. For example, if I'm trying to retrieve artwork for a set of albums I already know about from the XML dump, it means I don't need to make a call to the API to get the images when they might not exist.


    I actually quite agree with elstensoftware on this one. We might not need the full image elements, but knowing how many images are on a release is extremely helpful. Can we get a cover art count added instead of removing all the image metadata?

  • Relvet edited 2 months ago
    blindborges
    Changes made to labels 2-3 months ago are not propagating to the data dumps.
    A few examples:
    Titan's Halo Records

    This is also true for artists.
    Alice M. Moyle

    Maybe it's time to add two timestamps to the dumps so we can see when a record was created and when it was last changed?

  • bjorn.stadig edited about 1 month ago
    Diognes_The_Fox
    Is there any news on the upstream data store issue? Still missing a massive amount of data for Artists and Labels...

  • Show this post
    blindborges
    image data was removed.


    This was a very bad move. How is anyone going to detect that an item has images now? Also, some people may have used the dimensions to determine whether an image is worth downloading or not. For example, 200px isn't really worth it, 600px is.
    If the URI is no longer provided, just ditch that attribute and not everything!

  • Staff 754

    Show this post
    bjorn.stadig
    blindborges Diognes_The_Fox
    Is there any news on the upstream data store issue? Still missing a massive amount of data for Artists and Labels...


    No update other than that it's actively being worked on. As you can imagine from the time it has taken, it's not a very simple or easy problem to solve, but solid progress is being made.

  • Show this post
    blindborges thanks for your and your colleagues' work on this.

  • Show this post
    blindborges

    No update other than that it's actively being worked on. As you can imagine from the time it has taken, it's not a very simple or easy problem to solve, but solid progress is being made.


    Thumbs up! You are doing a great job, including giving us updates here. Thanks!

  • Show this post
    Hope this will be fixed soon and that the image data gets included again too.
