Total posts in this thread: 203
Cyclops
Senior Cruncher
Joined: Jun 13, 2022
Post Count: 295
Hardware Recovery Updates (Updated March 31, 2023)

This post collects all updates on the hardware recovery process that has been under way since the beginning of March. Please keep all discussion of the hardware recovery process in this thread. Thank you for your support, patience and understanding.

WCG team

March 31, 2023 update

Data transfer update & maintenance check
Earlier in the week, we ran into HDD failures while transferring the data from recovery storage to the new storage system. The issue has since been resolved and we have resumed transferring data, which we expect to finish by 5pm today. At 4pm, we will conduct brief maintenance on the website and forums to transfer the DB2 filesystems to our new storage system; access to the website will be restricted for up to 30 minutes. If all goes well, this could be the final step towards the full storage system upgrade.

We evaluated whether to start downloading processed WUs without sending new WUs out. We determined that the risk of complications, whether from our scheduler and BOINC working with incomplete information or from other unforeseen issues, is too high. We have extended the deadlines for workunits that have been processed and await upload to WCG.

While we wait for the data transfer to finish, we are working on resolving other long-standing issues such as device recognition.

March 27, 2023 update

Data transfer to new storage system
We started transferring all data from the recovery storage unit to our new storage system on Friday. Based on the current rate of transfer, we expect to have all data transferred and verified later this week. We will then download all processed WUs, after which we can resume sending work units to volunteers. We plan to start with MCM and OPN/OPNG, followed by ARP and then the new SCC work units.

In the meantime, we have confirmed that our daily database backups for BOINC and for the website/forums are working. The databases have been recovered and transferred to the new, faster storage already. Incremental backup to tape archive has been implemented on the new storage.

March 20, 2023 update

New hardware
While we prepare the new and improved hardware to host our databases and parallel filesystems, we have been using a temporary system provided to us by the data center. All data is confirmed intact and there has been no data loss as we continue to recover. The recovery system is a stand-in for the storage server that failed, selected for hardware compatibility to recover the data. We will not continue with the recovery system indefinitely; it will be discontinued only once the new storage system has been fully installed and synced with the recovery system for a smooth handoff.

BOINC database is UP
The BOINC database is now up and running, joining the website/forums database which has been up since last week. However, upload/download of workunits is paused until we restore the parallel filesystem that supports the workunit management stack, to the state it was in at the time of the hardware failure. Deadlines have been extended and valid results computed during this pause will be credited when we resume.

Website crashes
During the hardware recovery process, the website has been crashing intermittently. Looking into the cause, we identified bugs that appear only in situations such as the BOINC database being offline or other resources being unavailable while we recover the system. The website will now remain available to users in these cases, or restart automatically after a crash.

In the meantime, we have posted research updates from the ARP and MCM teams. We are planning on sharing more updates soon.

Initial Hardware Recovery Update
----------------------------------------
[Edit 2 times, last edit by Cyclops at Mar 31, 2023 7:00:44 PM]
[Mar 20, 2023 9:25:00 PM]
Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 308
Re: March 20, 2023 Hardware Recovery Update

Thanks for the update, looking forward to getting back to work :-)
[Mar 20, 2023 11:06:52 PM]
Freewill
Cruncher
United States
Joined: Mar 28, 2006
Post Count: 32
Re: March 20, 2023 Hardware Recovery Update

When will upload/download of workunits be restarted?
[Mar 20, 2023 11:08:32 PM]
Jake1402
Senior Cruncher
USA
Joined: Dec 30, 2005
Post Count: 178
Re: March 20, 2023 Hardware Recovery Update

"When will upload/download of workunits be restarted?"

From Cyclops above:

The BOINC database is now up and running, joining the website/forums database which has been up since last week. However, upload/download of workunits is paused until we restore the parallel filesystem that supports the workunit management stack, to the state it was in at the time of the hardware failure. Deadlines have been extended and valid results computed during this pause will be credited when we resume.

Based on past performance, their estimates aren't very accurate.
----------------------------------------
Join the Chicago-IL-USA team!
2 AMD FX 8320/AMD R9 270X/Win 10
2 AMD FX 8320/AMD RX 560/Linux Mint 20.3
Intel Pentium G240/AMD R5 240/Win 10

[Mar 21, 2023 12:22:18 AM]
Kirel2
Advanced Cruncher
United States
Joined: Sep 24, 2014
Post Count: 99
Re: March 20, 2023 Hardware Recovery Update

I presume they will let us know when uploads/downloads are working again, so we just have to be patient.
[Mar 21, 2023 12:23:27 AM]
xuejc1988
Cruncher
Joined: Oct 28, 2008
Post Count: 1
Re: March 20, 2023 Hardware Recovery Update

With a new operator, various engineering problems are incredibly frequent. It's a shame.
[Mar 21, 2023 12:51:35 AM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 1878
Re: March 20, 2023 Hardware Recovery Update

"Deadlines have been extended"
Yes, they have, but the extended deadlines will expire soon.

I already have extended deadlines that expire on March 24, 26, and 27. So, unless you are sure that BOINC uploads and reporting will be available before those dates, you'd better extend the deadlines again.
[Mar 21, 2023 1:17:26 AM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2955
Re: March 20, 2023 Hardware Recovery Update

Sounds like slow but steady progress. At times, it's best not to rush these things, as rushing tends to lead to steps being skipped or messed up, forcing rework. Keep up the 'good fight'.

It always amuses me when someone says "new and improved", as it can't be both; it's GOT to be one or the other. If it's 'new', it can't have been in existence before, and likewise, if it's 'improved', then it can't be new (i.e., something's got to have existed before). BUT, I know what you mean :-D
[Mar 21, 2023 1:18:57 AM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 1878
Re: March 20, 2023 Hardware Recovery Update

"When will upload/download of workunits be restarted?"
Soon :-D
How Much Time Is Soon
[Mar 21, 2023 2:01:36 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 742
Re: March 20, 2023 Hardware Recovery Update

Thank you for the update!
[Mar 21, 2023 2:29:37 AM]